Commit 7b6fd1a

[deploy] Merge pull request #208 from microsoft/dev
Dev
2 parents a9d7748 + 5939cc3 commit 7b6fd1a

42 files changed: +2145 -2099 lines changed

README.md

Lines changed: 30 additions & 21 deletions
@@ -1,33 +1,42 @@
-<h1>
-<img src="./public/favicon.ico" alt="Data Formulator icon" width="28"> <b>Data Formulator: Vibe with data, in control</b>
+<h1 align="center">
+<img src="./public/favicon.ico" alt="Data Formulator icon" width="28">&nbsp;
+Data Formulator: AI-powered Data Visualization
 </h1>
 
-<div>
-
-[![arxiv](https://img.shields.io/badge/Paper-arXiv:2408.16119-b31b1b.svg)](https://arxiv.org/abs/2408.16119)&ensp;
-[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)&ensp;
-[![YouTube](https://img.shields.io/badge/YouTube-white?logo=youtube&logoColor=%23FF0000)](https://www.youtube.com/watch?v=GfTE2FLyMrs)&ensp;
-[![build](https://github.com/microsoft/data-formulator/actions/workflows/python-build.yml/badge.svg)](https://github.com/microsoft/data-formulator/actions/workflows/python-build.yml)
-[![Discord](https://img.shields.io/badge/discord-chat-green?logo=discord)](https://discord.gg/mYCZMQKYZb)
 
-</div>
+<p align="center">
+🪄 Explore data with visualizations, powered by AI agents.
+</p>
 
-🪄 Turn data into insights with AI Agents, with the exploration paths you choose. Try Data Formulator now!
+<p align="center">
+<a href="https://data-formulator.ai"><img src="https://img.shields.io/badge/🚀_Try_Online_Demo-data--formulator.ai-F59E0B?style=for-the-badge" alt="Try Online Demo"></a>
+&nbsp;
+<a href="#get-started"><img src="https://img.shields.io/badge/💻_Install_Locally-pip_install-3776AB?style=for-the-badge" alt="Install Locally"></a>
+</p>
 
-- 🤖 New in v0.5: agent model + interative control [(video)](https://www.youtube.com/watch?v=GfTE2FLyMrs)
-- 🔥🔥🔥 Try our online demo at [https://data-formulator.ai](https://data-formulator.ai)
-- Any questions, thoughts? Discuss in the Discord channel! [![Discord](https://img.shields.io/badge/discord-chat-green?logo=discord)](https://discord.gg/mYCZMQKYZb)
+<p align="center">
+<a href="https://arxiv.org/abs/2408.16119"><img src="https://img.shields.io/badge/Paper-arXiv:2408.16119-b31b1b.svg" alt="arXiv"></a>&ensp;
+<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"></a>&ensp;
+<a href="https://www.youtube.com/watch?v=GfTE2FLyMrs"><img src="https://img.shields.io/badge/YouTube-white?logo=youtube&logoColor=%23FF0000" alt="YouTube"></a>&ensp;
+<a href="https://github.com/microsoft/data-formulator/actions/workflows/python-build.yml"><img src="https://github.com/microsoft/data-formulator/actions/workflows/python-build.yml/badge.svg" alt="build"></a>&ensp;
+<a href="https://discord.gg/mYCZMQKYZb"><img src="https://img.shields.io/badge/discord-chat-green?logo=discord" alt="Discord"></a>
+</p>
 
 <!-- [![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/microsoft/data-formulator?quickstart=1) -->
+<!--
+https://github.com/user-attachments/assets/8ca57b68-4d7a-42cb-bcce-43f8b1681ce2 -->
 
-https://github.com/user-attachments/assets/8ca57b68-4d7a-42cb-bcce-43f8b1681ce2
-
-<!-- <kbd>
-<a target="_blank" rel="noopener noreferrer" href="https://codespaces.new/microsoft/data-formulator?quickstart=1" title="open Data Formulator in GitHub Codespaces"><img src="public/data-formulator-screenshot-v0.5.png"></a>
-</kbd> -->
+<kbd>
+<img src="public/data-formulator-screenshot-v0.5.png">
+</kbd>
 
 
 ## News 🔥🔥🔥
+[12-08-2025] **Data Formulator 0.5.1** — Connect more, visualize more, move faster
+- 🔌 **Community data loaders**: Google BigQuery, MySQL, Postgres, MongoDB
+- 📊 **New chart types**: US Map & Pie Chart (more to be added soon)
+- ✏️ **Editable reports**: Refine generated reports with [Chartifact](https://github.com/microsoft/chartifact) in markdown style. [demo](https://github.com/microsoft/data-formulator/pull/200#issue-3635408217)
+- **Snappier UI**: Noticeably faster interactions across the board
 
 [11-07-2025] Data Formulator 0.5: Vibe with your data, in control

@@ -109,9 +118,9 @@ Here are milestones that lead to the current design:
 
 ## Overview
 
-**Data Formulator** is an application from Microsoft Research that uses AI agents to make it easier to turn data into insights.
+**Data Formulator** is a Microsoft Research prototype for data exploration with visualizations powered by AI agents.
 
-Data Formulator is an AI-powered tool for analysts to iteratively explore and visualize data. Started with data in any format (screenshot, text, csv, or database), users can work with AI agents with a novel blended interface that combines *user interface interactions (UI)* and *natural language (NL) inputs* to communicate their intents, control branching exploration directions, and create reports to share their insights.
+Data Formulator enables analysts to iteratively explore and visualize data. Started with data in any format (screenshot, text, csv, or database), users can work with AI agents with a novel blended interface that combines *user interface interactions (UI)* and *natural language (NL) inputs* to communicate their intents, control branching exploration directions, and create reports to share their insights.
 
 ## Get Started

package.json

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@
     "redux-persist": "^6.0.0",
     "typescript": "^4.9.5",
     "validator": "^13.15.20",
-    "vega": "^5.32.0",
+    "vega": "^6.2.0",
     "vega-embed": "^6.21.0",
     "vega-lite": "^5.5.0",
     "vm-browserify": "^1.1.2"

py-src/data_formulator/agents/agent_interactive_explore.py

Lines changed: 11 additions & 23 deletions
@@ -67,54 +67,42 @@
 * when the exploration context is provided, make your suggestion based on the context as well as the original dataset; otherwise leverage the original dataset to suggest questions.
 
 Guidelines for question suggestions:
-1. Suggest a list of question_groups of interesting analytical questions that are not obvious that can uncover nontrivial insights, including both breadth and depth questions.
-
+1. Suggest a list of question_groups of interesting analytical questions that are not obvious that can uncover nontrivial insights.
 2. Use a diverse language style to display the questions (can be questions, statements etc)
 3. If there are multiple datasets in a thread, consider relationships between them
 4. CONCISENESS: the questions should be concise and to the point
 5. QUESTION GROUP GENERATION:
 - different questions groups should cover different aspects of the data analysis for user to choose from.
-- each question_group should include both 'breadth_questions' and 'depth_questions':
-- breadth_questions: a group of questions that are all relatively simple that helps the user understand the data in a broad sense.
-- depth_questions: a sequence of questions that build on top of each other to answer a specific aspect of the user's goal.
-- you have a budget of generating 4 questions in total (or as directed by the user).
-- allocate 2-3 questions to 'breadth_questions' and 2-3 questions to 'depth_questions' based on the user's goal and the data.
-- each question group should slightly lean towards 'breadth' or 'depth' exploration, but not too much.
-- the more focused area can have more questions than the other area.
+- each question_group is a sequence of 'questions' that builds on top of each other to answer the user's goal.
 - each question group should have a difficulty level (easy / medium / hard),
 - simple questions should be short -- single sentence exploratory questions
 - medium questions can be 1-2 sentences exploratory questions
 - hard questions should introduce some new analysis concept but still make it concise
 - if suitable, include a group of questions that are related to statistical analysis: forecasting, regression, or clustering.
 6. QUESTIONS WITHIN A QUESTION GROUP:
-- all questions should be a new question based on the thread of exploration the user provided, do not repeat questions that have already been explored in the thread
+- raise new questions that are related to the user's goal, do not repeat questions that have already been explored in the context provided to you.
 - if the user provides a start question, suggested questions should be related to the start question.
-- when suggesting 'breadth_questions' in a question_group, they should be a group of questions:
-- they are related to the user's goal, they should each explore a different aspect of the user's goal in parallel.
-- questions should consider different fields, metrics and statistical methods.
-- each question within the group should be distinct from each other that they will lead to different insights and visualizations
-- when suggesting 'depth_questions' in a question_group, they should be a sequence of questions:
-- start of the question should provide an overview of the data in the direction going to be explored, and it will be refined in the subsequent questions.
-- they progressively dive deeper into the data, building on top of the previous question.
-- each question should be related to the previous question, introducing refined analysis (e.g., updated computation, filtering, different grouping, etc.)
+- the questions should progressively dive deeper into the data, building on top of the previous question.
+- start of the question should provide an overview of the data in the direction going to be explored.
+- followup questions should refine the previous question, introducing refined analysis to deep dive into the data (e.g., updated computation, filtering, different grouping, etc.)
+- don't jump too far from the previous question so that readers can understand the flow of the questions.
 - every question should be answerable with a visualization.
 7. FORMATTING:
-- include "breadth_questions" and "depth_questions" in the question group:
-- each question group should have 2-3 questions (or as directed by the user).
+- include "questions" in the question group:
+- each question group should have 2-4 questions (or as directed by the user).
 - For each question group, include a 'goal' that summarizes the goal of the question group.
 - The goal should all be a short single sentence (<12 words).
 - Meaning of the 'goal' should be clear that the user won't misunderstand the actual question descibed in 'text'.
 - It should capture the key computation and exploration direction of the question (do not omit any information that may lead to ambiguity), but also keep it concise.
 - include the **bold** keywords for the attributes / metrics that are important to the question, especially when the goal mentions fields / metrics in the original dataset (don't have to be exact match)
 - include 'difficulty' to indicate the difficulty of the question, it should be one of 'easy', 'medium', 'hard'
-- a 'focus' field to indicate whether the overall question group leans more on 'breadth' or 'depth' exploration.
 
 Output should be a list of json objects in the following format, each line should be a json object representing a question group, starting with 'data: ':
 
 Format:
 
-data: {"breadth_questions": [...], "depth_questions": [...], "goal": ..., "difficulty": ..., "focus": "..."}
-data: {"breadth_questions": [...], "depth_questions": [...], "goal": ..., "difficulty": ..., "focus": "..."}
+data: {"questions": [...], "goal": ..., "difficulty": ...}
+data: {"questions": [...], "goal": ..., "difficulty": ...}
 ... // more question groups
 '''
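Note on the new output format: each question group now carries a single `questions` list plus `goal` and `difficulty`, one JSON object per `data: ` line. A minimal sketch of how a consumer might parse that stream follows. The function name `parse_question_groups` and the sample payload are hypothetical and not taken from the repository; the only assumption drawn from the prompt is the `data: ` prefix and the three field names.

```python
import json

DATA_PREFIX = "data: "

def parse_question_groups(raw_output: str) -> list[dict]:
    """Collect question groups from lines that start with 'data: '."""
    groups = []
    for line in raw_output.splitlines():
        line = line.strip()
        if not line.startswith(DATA_PREFIX):
            continue  # skip blank lines and anything that is not a data line
        try:
            group = json.loads(line[len(DATA_PREFIX):])
        except json.JSONDecodeError:
            continue  # tolerate one malformed line instead of failing the batch
        # Keep only objects that carry the three fields named in the prompt.
        if isinstance(group, dict) and {"questions", "goal", "difficulty"} <= group.keys():
            groups.append(group)
    return groups

# Hypothetical model output: one valid group, one noise line, one broken line.
sample = (
    'data: {"questions": ["q1", "q2"], "goal": "g", "difficulty": "easy"}\n'
    'noise\n'
    'data: {not json}\n'
)
groups = parse_question_groups(sample)
```

Skipping malformed lines is a design choice for streamed output; a stricter consumer could raise instead.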

py-src/data_formulator/agents/agent_query_completion.py

Lines changed: 4 additions & 0 deletions
@@ -54,6 +54,10 @@ def __init__(self, client):
 
     def run(self, data_source_metadata, query):
 
+        # For MongoDB, treat it as a SQL-like data source for query generation
+        if data_source_metadata['data_loader_type'] == "mongodb":
+            data_source_metadata['data_loader_type'] = "SQL"
+
         user_query = f"[DATA SOURCE]\n\n{json.dumps(data_source_metadata, indent=2)}\n\n[USER INPUTS]\n\n{query}\n\n"
 
         logger.info(user_query)
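Note on the added lines: they overwrite `data_loader_type` in the dictionary the caller passed in, so the caller's metadata reads `"SQL"` after `run` returns. If that side effect is not wanted, a non-mutating variant might look like the sketch below. `normalize_loader_type` is a hypothetical helper, not part of this commit, and it uses `.get` where the commit indexes the key directly.

```python
import json

def normalize_loader_type(data_source_metadata: dict) -> dict:
    """Return metadata for prompt building, reporting 'mongodb' as 'SQL'."""
    if data_source_metadata.get("data_loader_type") == "mongodb":
        # Shallow copy, so the caller's dict still says "mongodb" afterwards.
        return {**data_source_metadata, "data_loader_type": "SQL"}
    return data_source_metadata

# Hypothetical metadata, for illustration only.
metadata = {"data_loader_type": "mongodb", "tables": ["orders"]}
prompt_metadata = normalize_loader_type(metadata)
user_query = f"[DATA SOURCE]\n\n{json.dumps(prompt_metadata, indent=2)}\n\n"
```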

py-src/data_formulator/data_loader/__init__.py

Lines changed: 6 additions & 3 deletions
@@ -5,15 +5,18 @@
 from data_formulator.data_loader.s3_data_loader import S3DataLoader
 from data_formulator.data_loader.azure_blob_data_loader import AzureBlobDataLoader
 from data_formulator.data_loader.postgresql_data_loader import PostgreSQLDataLoader
+from data_formulator.data_loader.mongodb_data_loader import MongoDBDataLoader
+from data_formulator.data_loader.bigquery_data_loader import BigQueryDataLoader
 
 DATA_LOADERS = {
     "mysql": MySQLDataLoader,
     "mssql": MSSQLDataLoader,
     "kusto": KustoDataLoader,
     "s3": S3DataLoader,
     "azure_blob": AzureBlobDataLoader,
-    "postgresql": PostgreSQLDataLoader
+    "postgresql": PostgreSQLDataLoader,
+    "mongodb": MongoDBDataLoader,
+    "bigquery": BigQueryDataLoader
 }
 
-__all__ = ["ExternalDataLoader", "MySQLDataLoader", "MSSQLDataLoader", "KustoDataLoader", "S3DataLoader", "AzureBlobDataLoader","PostgreSQLDataLoader","DATA_LOADERS"]
-
+__all__ = ["ExternalDataLoader", "MySQLDataLoader", "MSSQLDataLoader", "KustoDataLoader", "S3DataLoader", "AzureBlobDataLoader","PostgreSQLDataLoader", "MongoDBDataLoader", "BigQueryDataLoader", "DATA_LOADERS"]
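Note on the registry: `DATA_LOADERS` maps a string key to a loader class, so supporting MongoDB and BigQuery only requires two new entries. The sketch below shows how such a registry is typically consumed. The stub classes and `create_loader` are hypothetical stand-ins so the example runs on its own; the real loaders also take a DuckDB connection, which is omitted here.

```python
class ExternalDataLoader:
    """Stand-in base class for this sketch only."""
    def __init__(self, params: dict):
        self.params = params

class MongoDBDataLoader(ExternalDataLoader):
    pass

class BigQueryDataLoader(ExternalDataLoader):
    pass

# Same shape as the registry in the diff, reduced to the two new keys.
DATA_LOADERS = {
    "mongodb": MongoDBDataLoader,
    "bigquery": BigQueryDataLoader,
}

def create_loader(loader_type: str, params: dict) -> ExternalDataLoader:
    """Look up the loader class by key and instantiate it."""
    try:
        loader_cls = DATA_LOADERS[loader_type]
    except KeyError:
        supported = ", ".join(sorted(DATA_LOADERS))
        raise ValueError(
            f"Unknown data loader '{loader_type}'. Supported: {supported}"
        ) from None
    return loader_cls(params)

loader = create_loader("mongodb", {"host": "localhost"})
```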

py-src/data_formulator/data_loader/azure_blob_data_loader.py

Lines changed: 14 additions & 1 deletion
@@ -7,6 +7,13 @@
 from typing import Dict, Any, List
 from data_formulator.security import validate_sql_query
 
+try:
+    from azure.storage.blob import BlobServiceClient, ContainerClient
+    from azure.identity import DefaultAzureCredential, AzureCliCredential, ManagedIdentityCredential, EnvironmentCredential, ChainedTokenCredential
+    AZURE_BLOB_AVAILABLE = True
+except ImportError:
+    AZURE_BLOB_AVAILABLE = False
+
 class AzureBlobDataLoader(ExternalDataLoader):
 
     @staticmethod
@@ -59,6 +66,12 @@ def auth_instructions() -> str:
         """
 
     def __init__(self, params: Dict[str, Any], duck_db_conn: duckdb.DuckDBPyConnection):
+        if not AZURE_BLOB_AVAILABLE:
+            raise ImportError(
+                "Azure storage libraries are required for Azure Blob connections. "
+                "Install with: pip install azure-storage-blob azure-identity"
+            )
+
         self.params = params
         self.duck_db_conn = duck_db_conn
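Note on the import guard: the two hunks above make the Azure SDK an optional dependency. The module can be imported without it, and the error surfaces only when a loader is constructed. The pattern in isolation, as a minimal sketch: `hypothetical_optional_sdk` and `OptionalLoader` are placeholder names, and the package is assumed not to be installed.

```python
try:
    import hypothetical_optional_sdk  # placeholder name, assumed to be missing
    SDK_AVAILABLE = True
except ImportError:
    SDK_AVAILABLE = False

class OptionalLoader:
    """Importable without the SDK; fails only when it is actually used."""

    def __init__(self, params: dict):
        if not SDK_AVAILABLE:
            # Defer the failure to construction time and say how to fix it.
            raise ImportError(
                "hypothetical_optional_sdk is required for this loader. "
                "Install with: pip install hypothetical-optional-sdk"
            )
        self.params = params
```

This keeps the registry import in `data_loader/__init__.py` from failing for users who never touch the Azure loader.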

@@ -368,7 +381,7 @@ def view_query_sample(self, query: str) -> List[Dict[str, Any]]:
         if not result:
             raise ValueError(error_message)
 
-        return self.duck_db_conn.execute(query).df().head(10).to_dict(orient="records")
+        return json.loads(self.duck_db_conn.execute(query).df().head(10).to_json(orient="records"))
 
     def ingest_data_from_query(self, query: str, name_as: str):
         # Execute the query and get results as a DataFrame
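Note on the `to_dict` to `to_json` change: the commit does not state its motivation. One plausible reason is that `to_dict(orient="records")` keeps values such as `NaN` that strict JSON cannot represent, while a round trip through `to_json` yields JSON-safe values. A small sketch with a made-up DataFrame, assuming pandas is installed:

```python
import json
import pandas as pd

# Hypothetical sample with one missing value.
df = pd.DataFrame({"city": ["Oslo", "Lima"], "temp": [3.5, float("nan")]})

# to_dict keeps the float NaN, which strict JSON serializers reject.
records_dict = df.head(10).to_dict(orient="records")

# The round trip through to_json turns NaN into null, i.e. None after json.loads.
records_json = json.loads(df.head(10).to_json(orient="records"))

# Serializes cleanly even with allow_nan=False.
payload = json.dumps(records_json, allow_nan=False)
```

Calling `json.dumps(records_dict, allow_nan=False)` on the first result raises `ValueError`, which is the kind of failure the round trip avoids.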
