Skip to content

OpenDCAI/DataFlow

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

1,159 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

DataFlow

Generate, Clean, and Prepare LLM Data, All-in-One

issue resolution issue resolution

PyPI version PyPI - Python Version PyPI - Downloads Downloads

Colab Docker Documents Arxiv Ask DeepWiki

Discord Online wechat

OpenDCAI%2FDataFlow | Trendshift

Visual, low-code pipelines with flexible orchestration across domains and use cases.๐Ÿ’ช

Turn raw data into high-quality LLM training datasets.๐Ÿ”ง

๐ŸŽ‰ Get smarter LLMs cheaply โ€” give us a star โญ on GitHub for the latest update.

Beginner-friendly learning resources (continuously updated): [๐ŸŽฌ Video Tutorials] [๐Ÿ“š Written Tutorials]

็ฎ€ไฝ“ไธญๆ–‡ | English

๐Ÿ“ฐ 0. News

  • [2026-02-02] ๐Ÿ–ฅ๏ธ DataFlow WebUI is now available! Launch the visual pipeline builder with a single command: dataflow webui. Build and run DataFlow pipelines through an intuitive web interface. ๐Ÿ‘‰ WebUI Docs

  • [2026-01-20] ๐ŸŒŸ Awesome Works Using DataFlow is now live! A new section showcasing open-source projects and research built on DataFlow. Contributions are welcome! ๐Ÿ‘‰ Awesome Works

  • [2025-12-19] ๐ŸŽ‰ Our DataFlow technical report is now available! Read and cite our work on arXiv: https://arxiv.org/abs/2512.16676

  • [2025-11-20] ๐Ÿค– Introducing New Data Agents for DataFlow! Try them out and follow the tutorial on Bilibili: https://space.bilibili.com/3546929239689711/lists/6761342?type=season

  • [2025-06-28] ๐ŸŽ‰ DataFlow is officially released! Our data-centric AI system is now public. Stay tuned for future updates.

๐Ÿ” 1. What is DataFlow๏ผŸ

DataFlow is a data preparation and training system designed to generate, refine, evaluate, and filter high-quality data for AI from noisy sources (PDF, plain-text, low-quality QA), thereby improving the performance of large language models (LLMs) in specific domains through targeted training (Pre-training, Supervised Fine-tuning, RL training) or RAG system, in domains such as healthcare, finance, legal, and academic research.

Through an operator-based design, DataFlow turns the entire data cleaning workflow into a reproducible, reusable, and shareable pipeline, providing core infrastructure for the Data-Centric AI community. Additionally, we develop an intelligent DataFlow-agent capable of dynamically assembling new pipelines by recombining existing or creating new operators on demand.

df_overview_final_300

๐Ÿ” 2. Key Features

โœ…2.1 Ready-to-Use Data Synthesis and Cleaning Pipelines

  • High-Quality Training Data Generation
    • Text, Math, and Code data generation (see DataFlow-Instruct-10K for results)
    • Data generation via tools like AgenticRAG and Text2SQL
  • Structured Data Extraction
    • Large-scale PDF โ†’ QA conversion
    • Large-scale book PDF โ†’ Visual-QA conversion
  • Scientific Data Workflow Management
    • Text2SQL workflow management (Accepted by ICDE 2026)
    • Math data workflows (Accepted by KDD 2026)

โš™๏ธ2.2 Flexible Custom Pipeline Orchestration

  • 10+ core operators define interaction patterns and design principles
  • 100+ pipeline-specific operators available for reuse or reference
  • Full support for creating custom operators โ€” plug-and-play, easily packaged and distributed via GitHub or PyPI

๐Ÿง 2.3 Reproducible, Reusable, and Shareable Data-Centric AI System

  • Data governance algorithms are encapsulated as operator pipelines, enabling reproducibility and fair comparison of different data governance strategies (โค๏ธresearch-friendly)
  • Easily reuse swap underlying large models to analyze the relationship between model performance and data quality quickly
  • Built on Python and Git ecosystems for easy distribution, management, and traceability of high-quality, user-defined data governance operators and pipelines (โค๏ธenterprise-friendly)

๐Ÿ› ๏ธ 3. DataFlow Suite

The DataFlow Suite provides the essential infrastructure to automate and scale LLM data preparation with DataFlow main repository. It comprises four tightly integrated layers:

  • DataFlow-WebUI โ€“ An intuitive, visual interface for constructing and managing complex data pipelines through a drag-and-drop operator workflow.

  • DataFlow-Agent โ€“ An AI-powered assistant that dynamically composes, executes, and optimizes operators and pipelines based on high-level user intent.

  • DataFlow-Ecosystem โ€“ A modular distribution layer that standardizes operator registration. It enables domain-specific modules (e.g., DataFlow-MM, DataFlow-AI4S) to contribute extensible libraries under a unified abstraction.

  • RayOrch โ€“ A high-performance orchestration layer built on Ray, providing distributed compute scheduling and resource management for massive-scale data tasks.

Together, these components form a unified, extensible environment that transforms raw data into model-ready intelligence.

โœ… 4. Why use DataFlow?

Data generation and cleaning are crucial for high-quality models, but for both enterprises and individuals, these tasks are often time-consuming, labor-intensive, and costly. DataFlow provides a one-stop solution to tackle these challenges efficiently. Compared with systems like Nemo-Curator and Data-Juicer, DataFlow offers:

  • Enhanced Support for Data Synthesis Modules โ€“ Seamlessly integrates text, code, and math data generation pipeline for high-quality training datasets.
  • PyTorch-like Programming Management โ€“ Clear Pipeline โ†’ Operator โ†’ Prompt hierarchical structure for workflow control.
  • Principled and Multi-Category Operator Classification โ€“ Operators are systematically organized into multiple functional categories such as generation, evaluation, filtering, and refinement, forming a scientifically grounded, multi-dimensional taxonomy that reflects different stages of data preparation and enables precise operator selection and composition.
  • User-Friendly Design for Easy Debugging and Onboarding โ€“ Simplified workflow patterns that reduce the learning curve and accelerate experimentation.

๐Ÿ”ง 5. How do operators work๏ผŸ

DataFlow operators are designed with simplicity and clarity in mind.

Operators take structured inputs (JSON, JSONL, CSV) and produce high-quality outputs after intelligent processing. Each operator encapsulates a specific data processing task, providing a clean and consistent API that is easy to understand and integrate. The PyTorch-like design makes them intuitive and ready to use, allowing you to quickly build, combine, and customize pipelines without dealing with complex boilerplate code.

For more details, refer to the Operator Documentation. Below is a minimal example demonstrating how to invoke the PromptedGenerator operator:

dataflow_operator

Example input data (json/jsonl-style):

// input.json
[
  {"problem": "What is 17 + 25?"},
  {"problem": "If x = 3, compute 2x^2 + 1."}
]

Operator invocation code:

from dataflow.operators.core_text import PromptedGenerator
from dataflow.utils.storage import FileStorage
from dataflow.serving import APILLMServing_request

# set input file to global storage class
storage = FileStorage(first_entry_file_name="./input.json",)

# configure LLM serving (e.g., OpenAI API)
# api key needs to be set via `export DF_API_KEY=sk-xxx`
llm_serving = APILLMServing_request(
    api_url="https://api.openai.com/v1/chat/completions",
)

prompted_generator = PromptedGenerator(
    llm_serving=llm_serving,  # pre-configured LLM backend
    system_prompt="Please solve this math problem."
)

prompted_generator.run(
    storage=self.storage.step(),  # data management (details omitted)
    input_key="problem",          # read from this column
    output_key="solution"         # write to this column
)

After running, the operator will append the generated results into output_key. For example, the output data (json/jsonl-style) becomes:

// dataflow_step1.json
[
    {"problem":"What is 17 + 25?","solution":"42"},
    {"problem":"If x = 3, compute 2x^2 + 1.","solution":"19"}
]

๐Ÿ› ๏ธ 6. Pipelines (Click to expand)

๐Ÿ”ง 6.1 Ready-to-Use PipeLines

Current Pipelines in Dataflow are as follows:

โš™๏ธ 6.2 Flexible Operator PipeLines

In this framework, operators are categorized into Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, etc., supporting data processing and evaluation functionalities. Please refer to the documentation for details.

๐Ÿค– 6.3 Agent Guided Pipelines

โšก 7. Quick Start

๐Ÿ› ๏ธ 7.1 Environment Setup and Installation

DataFlow supports Python>=3.10 environments, tested passed on Windows, Linux, and MacOS with Python 3.10, 3.11, and 3.12.

Please use the following commands for environment setup and installation๐Ÿ‘‡

We recommend use uv to install DataFlow for speed up.

pip install uv
uv pip install open-dataflow

If you want to use your own GPU for local inference, please use:

pip install uv
uv pip install open-dataflow[vllm]

After installation, you can use the following command to check if dataflow has been installed correctly:

dataflow -v

If installed correctly, you should see:

open-dataflow codebase version: 1.0.0
        Checking for updates...
        Local version:  1.0.0
        PyPI newest version:  1.0.0
You are using the latest version: 1.0.0.

๐Ÿณ 7.2 Docker Installation (Alternative)

We also provide a Dockerfile for easy deployment and a pre-built Docker image for immediate use.

Option 1: Use Pre-built Docker Image

You can directly pull and use our pre-built Docker image:

# Pull the pre-built image
docker pull molyheci/dataflow:cu124

# Run the container with GPU support
docker run --gpus all -it molyheci/dataflow:cu124

# Inside the container, verify installation
dataflow -v
Option 2: Build from Dockerfile

Alternatively, you can build the Docker image from the provided Dockerfile:

# Clone the repository (HTTPS)
git clone https://github.com/OpenDCAI/DataFlow.git
# Or use SSH
# git clone git@github.com:OpenDCAI/DataFlow.git

cd DataFlow

# Build the Docker image
docker build -t dataflow:custom .

# Run the container
docker run --gpus all -it dataflow:custom

# Inside the container, verify installation
dataflow -v

Note: The Docker image includes CUDA 12.4.1 support and comes with vLLM pre-installed for GPU acceleration. Make sure you have NVIDIA Container Toolkit installed to use GPU features.

๐Ÿš€ 7.3 Quick Start with Google Colab

You can start your first DataFlow translation project directly on Google Colab. By following the provided guidelines, you can seamlessly scale from a simple translation example to more complex DataFlow pipelines.

๐Ÿ‘‰ Start DataFlow with Google Colab

๐Ÿ“– 7.3 Reference Project Documentation

For detailed usage instructions and getting started guide, please visit our DataFlow Documentation.

Documents

๐Ÿ–ฅ๏ธ 7.4 DataFlow-WebUI

DataFlow provides a Web-based UI (WebUI) for visual pipeline construction and execution.

To start DataFlow-WebUI, simply run following command after install the DataFlow main repository:

dataflow webui

This will automatically download and launch the latest DataFlow-WebUI and open it in your browser (http://localhost:8000/ if it does not open automatically).

๐Ÿ“š 7.4.1 WebUI Documentation

๐Ÿ› ๏ธ 7.4.2 Development Repository

๐Ÿงช 8. Experimental Results

8.1 DataFlow-Instruct-10k

DataFlow-Instruct-10K is a unified multi-domain instruction dataset generated by the DataFlow framework. It is constructed through several automated data preparation pipelines spanning mathematical reasoning, code, and general text instructions. Each pipeline follows a generateโ€“evaluateโ€“filterโ€“refine workflow to synthesize and curate high-quality instructionโ€“response pairs. The resulting dataset contains approximately 10K samples and provides high-quality supervision for instruction tuning, enabling base models to approach the performance of fully trained instruct models with significantly fewer training examples.

For Detailed Experiments setting, please visit our DataFlow Technical Report.

Model Math-Avg Code-Avg Knowledge-Avg
Qwen2-7B Series
Base 20.1 66.3 76.2
+ Infinity-Instruct-10K 29.0 67.8 76.2
+ Infinity-Instruct-1M 27.9 68.2 76.2
+ DataFlow-Instruct-10K 32.4 66.2 76.1
Qwen2.5-7B Series
Base 37.1 76.5 76.0
+ Infinity-Instruct-10K 22.6 77.6 75.8
+ Infinity-Instruct-1M 33.3 78.0 75.8
+ DataFlow-Instruct-10K 46.7 78.6 76.2

๐Ÿ› ๏ธ 8.2 Other Pipeline Results (Click to expand)

8.2.1 Text Pipeline

8.2.1.1 Pre-training data filter pipeline

From the SlimPajama-627B corpus, we extract a 100B-token subset and apply multiple DataFlow text-pretraining filters. We train a Qwen2.5-0.5B model from scratch for 30B tokens using the Megatron-DeepSpeed framework, the results are as follows:

Methods ARC-C ARC-E MMLU HellaSwag WinoGrande Gaokao-MathQA Avg
Random-30B 25.26 43.94 27.03 37.02 50.99 27.35 35.26
Qurating-30B 25.00 43.14 27.50 37.03 50.67 26.78 35.02
FineWeb-Edu-30B 26.45 45.41 27.41 38.06 50.43 25.64 35.57
DataFlow-30B 25.51 45.58 27.42 37.58 50.67 27.35 35.69
8.2.1.2 SFT data filter and synthesis pipeline

To study small-scale SFT data quality, we fine-tune the Qwen2.5-7B base model using LLaMA-Factory on WizardLM and Alpaca datasets. For each dataset, we compared a randomly sampled set of 5K instances against a set of 5K instances filtered by DataFlow's SFT pipeline. Additionally, we synthesize a 15k-size dataset, DataFlow-SFT-15K, using DataFlowโ€™s Condor Generator and Condor Refiner pipeline, followed by DataFlowโ€™s SFT filtering pipeline (excluding the Instagram filter). Benchmarks include comprehensive Math, Code, and Knowledge evaluation suites.

8.2.2 Math Benchmarks

Methods math gsm8k aime24 minerva olympiad Avg
Alpaca (random) 54.9 77.2 13.3 14.0 27.0 37.3
Alpaca (filtered) 60.3 80.0 13.3 14.7 30.7 39.8
WizardLM (random) 61.1 84.2 6.7 18.0 29.3 39.9
WizardLM (filtered) 69.7 88.8 10.0 19.9 35.4 44.8
DataFlow-SFT-15K (random) 72.6 89.6 13.3 37.9 32.9 49.3
DataFlow-SFT-15K (filtered) 73.3 90.2 13.3 36.0 35.9 49.7

8.2.3 Code Benchmarks

Methods HumanEval MBPP Avg
Alpaca (random) 71.3 75.9 73.6
Alpaca (filtered) 73.8 75.7 74.8
WizardLM (random) 75.6 82.0 78.8
WizardLM (filtered) 77.4 80.4 78.9
DataFlow-SFT-15K (random) 79.9 75.9 77.9
DataFlow-SFT-15K (filtered) 82.9 74.9 78.9

8.2.4 Knowledge Benchmarks

Methods MMLU C-EVAL Avg
Alpaca (random) 71.8 80.0 75.9
Alpaca (filtered) 71.8 80.0 75.9
WizardLM (random) 71.8 79.2 75.5
WizardLM (filtered) 71.9 79.6 75.8
DataFlow-SFT-15K (random) 72.1 80.0 76.1
DataFlow-SFT-15K (filtered) 72.2 80.4 76.3

8.2.5 Conversation Synthesis Pipeline

We synthesize DataFlow-Chat-15K using DataFlow's conversation-generation pipeline and fine-tune Qwen2.5-7B-Base on it. Baselines include ShareGPT-15K, UltraChat-15K, and their full (non-truncated) versions. We evaluate on domain-specific tasks (TopDial, Light) and general benchmarks (MMLU, AlpacaEval, Arena-Hard).

8.2.5.1 Conversation Benchmarks
Model TopDial Light Avg
Qwen2.5-7B 7.71 7.79 7.75
+ ShareGPT-15K 7.75 6.72 7.24
+ UltraChat-15K 7.72 6.83 7.28
+ DataFlow-Chat-15K 7.98 8.10 8.04
8.2.5.2 General Benchmarks
Model MMLU AlpacaEval Arena-Hard Avg
Qwen2.5-7B 71.45 7.05 0.60 26.36
+ ShareGPT-15K 73.09 3.70 1.30 26.03
+ UltraChat-15K 72.97 3.97 0.80 25.91
+ DataFlow-Chat-15K 73.41 10.11 1.10 28.21

8.2.6 Reasoning Pipeline

We adopt the NuminaMath dataset as a high-quality seed dataset. We compare three training sources: (1) a random 10K subset from Open-R1, (2) a random 10K subset from Synthetic-1, and (3) our 10K synthesized DataFlow-Reasoning-10K dataset constructed using DataFlow.

Setting Model gsm8k math amc23 olympiad gaokao24_mix minerva AIME24@32 AIME25@32 Avg
Baseline Qwen2.5-32B-Instruct 95.8 73.5 70.0 38.5 42.9 26.5 16.8 11.6 46.95
1 Epoch + SYNTHETIC-1-10k 92.9 71.8 52.5 38.4 23.1 24.3 35.6 34.0 46.6
1 Epoch + Open-R1-10k 91.5 72.3 65.0 38.4 20.9 24.6 43.0 33.5 48.7
1 Epoch + DataFlow-Reasoning-10K 93.9 72.3 72.5 38.7 38.5 26.5 35.9 34.5 51.6
2 Epochs + SYNTHETIC-1-10k 94.5 78.4 75.0 45.0 24.2 28.3 48.4 37.9 54.0
2 Epochs + Open-R1-10k 93.9 77.2 80.0 44.1 20.9 25.4 51.0 40.7 54.2
2 Epochs + DataFlow-Reasoning-10K 94.4 76.6 75.0 45.2 42.9 25.7 45.4 40.0 55.7

8.2.7 Code Pipeline

We randomly sample 20k instances from the Ling-Coder-SFT corpus and process them through the DataFlow Code Pipeline. This yields three curated code instruction datasets of different scales, DataFlow-Code-1K, DataFlow-Code-5K, and DataFlow-Code-10K, each designed to provide high-quality, pipeline-refined supervision signals for code generation tasks. We compare our synthesized datasets against Code-Alpaca-1k and Self-OSS-Instruct-SC2-Exec-Filter-1k.

8.2.7.1 Trained on Qwen2.5-7B-Instruct
Training Data BigCodeBench LiveCodeBench (v6) CruxEval (I) CruxEval (O) HumanEval+ Avg
Qwen2.5-7B-Instruct 35.3 23.4 44.8 43.9 72.6 44.0
+ Code Alpaca-1K 33.3 18.7 45.6 46.4 66.5 42.1
+ Self-OSS 31.9 21.4 46.9 45.9 70.1 43.2
+ DataFlow-Code-1K 35.5 25.7 48.0 45.1 72.6 45.4
+ DataFlow-Code-5K 36.2 26.4 48.6 45.0 73.2 45.9
+ DataFlow-Code-10K 36.8 26.0 48.8 45.4 73.8 46.2
8.2.7.2 Trained on Qwen2.5-14B-Instruct
Training Data BigCodeBench LiveCodeBench (v6) CruxEval (I) CruxEval (O) HumanEval+ Avg
Qwen2.5-14B-Instruct 37.5 33.4 48.0 48.5 74.4 48.4
+ Code Alpaca-1K 37.0 28.2 50.2 49.6 71.3 47.3
+ Self-OSS 36.9 22.3 52.6 50.1 68.3 46.0
+ DataFlow-Code-1K 41.4 33.7 51.0 50.9 77.3 50.9
+ DataFlow-Code-5K 41.1 33.2 52.5 50.6 76.2 50.7
+ DataFlow-Code-10K 41.9 33.2 52.9 51.0 76.2 51.0

๐Ÿ“„ 9. Publications

Our team has published the following papers that form core components of the DataFlow system:

Paper Title DataFlow Component Venue Year
Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL Text2SQL Data Augmentation ICDE 2026
Let's Verify Math Questions Step by Step Math question quality evaluation KDD 2026
MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification Multimodal reasoning verification framework for data processing and evaluation ACL 2025
Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration Multi-actor collaborative data selection mechanism for enhanced data filtering and processing ACL 2025

Contributing Institutions: PKU HKUST CAS Shanghai AI Lab Baichuan Ant Group

๐Ÿ† 10. Awards & Achievements

We are honored to have received first-place awards in two major international AI competitions, recognizing the excellence and robustness of DataFlow and its reasoning capabilities:

Competition Track Award Organizer Date
ICML 2025 Challenges on Automated Math Reasoning and Extensions Track 2:Physics Reasoning with Diagrams and Expressions ๐Ÿฅ‡First Place Winner ICML AI for Math Workshop & AWS Codabench July 18, 2025
2025 Language and Intelligence Challenge (LIC) Track 2:Beijing Academy of Artificial Intelligence ๐Ÿฅ‡First Prize Beijing Academy of Artificial Intelligence (BAAI) & Baidu August 10, 2025
ICML 2025 Certificate
ICML 2025 Automated Math Reasoning Challenge โ€” First Place Winner
LIC 2025 Certificate
BAAI Language & Intelligence Challenge 2025 โ€” First Prize

๐ŸŒŸ 11. Awesome Work Using DataFlow & DataFlow Ecosystem

This section highlights projects, research works, and applications built on top of DataFlow or deeply integrated with the DataFlow ecosystem.

๐Ÿ“Œ Curated list of featured projects: [Awesome Work Using DataFlow]

We warmly welcome the community to contribute new entries via Pull Requests. ๐Ÿ™Œ Detailed Guidance can help you creating a Dataflow extension repository from DataFlow-CLI.

๐Ÿ’ 12. Acknowledgements

We sincerely thank MinerU for their outstanding work, whose powerful PDF/document text extraction capabilities provided essential support for our data loading process. We also thank LLaMA-Factory for offering an efficient and user-friendly framework for large model fine-tuning, which greatly facilitated rapid iteration in our training and experimentation workflows. Our gratitude extends to all contributors in the open-source communityโ€”their efforts collectively drive the development of DataFlow. We thank Zhongguancun Academy for their API and GPU support.

๐Ÿค 13. Community & Support

Join the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!

โ€ข ๐Ÿ“ฎ GitHub Issues: Report bugs or suggest features

โ€ข ๐Ÿ”ง GitHub Pull Requests: Contribute code improvements

โ€ข ๐Ÿ’ฌ Join our community groups to connect with us and other contributors!

๐Ÿ“œ 14. Citation

If you use DataFlow in your research, feel free to give us a cite.

@article{liang2025dataflow,
  title={DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI},
  author={Liang, Hao and Ma, Xiaochen and Liu, Zhou and Wong, Zhen Hao and Zhao, Zhengyang and Meng, Zimo and He, Runming and Shen, Chengyu and Cai, Qifeng and Han, Zhaoyang and others},
  journal={arXiv preprint arXiv:2512.16676},
  year={2025}
}
Connect with the PKU-DCAI Research Team on Xiaohongshu: 26133106768