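🚀 DataFlow v1.0.6 Changelog

🔑 Major Feature Updates

Prompt Registration System:
Introduces a unified Prompt Registry that lets each operator bind multiple prompt templates. This structured one-to-many registration mechanism makes prompts easier to reuse and extend across task scenarios. Thanks to @SunnyHaze.
New Code Processing Pipeline:
Adds a complete code processing pipeline and related operators that support analysis, filtering, and quality cleaning of code datasets, for code intelligence and data cleaning tasks. Thanks to @beccabai.
Reasoning Pipeline Competition Result:
DataFlow's Reasoning Pipeline won first place in the BAAI LIC Reasoning Competition, validating the system's robustness and novelty in logical reasoning and dataflow scheduling. Thanks to @miaode74 and the math reasoning team: @scuuy, @wongzhenhao, @HeRunming, and @haolpku.
Automated PDF2Model:
Adds a PDF-to-Model module that automatically converts input PDFs or datasets into structured QA data and trains a downstream model with LlamaFactory, giving an end-to-end automated path from documents to model training data. Thanks to @YalinFeng01 and @ZhaoyangHan04.
Automatic Benchmark Evaluation Module:
Adds the DataFlow Eval module, which supports automatic evaluation on text benchmarks (e.g., string match and semantic match) inside a pipeline. Thanks to @YalinFeng01.
Text2SQL Pipeline with Unified Database Management:
Reworks the Text2SQL pipeline and adds a DB Manager that uniformly supports multiple database types such as MySQL and SQLite. Prompt template management and operator reusability are also improved. Thanks to @TechNomad-ds.
JSON Schema Structured Output:
LLMServing and LiteLLMServing now support JSON Schema output and can directly generate structured responses, improving compatibility with multimodal tasks. Thanks to @wongzhenhao.
Book Structured QA Extraction Pipeline:
Adds a BookQA extraction pipeline and related operators that automatically extract structured QA data from books and long texts. Thanks to @HeRunming.
Science Operator Extension:
Adds Science operators for processing scientific research and multimodal datasets. Thanks to @haolpku.
Colored Logger:
Upgrades the logging system to colored output, improving the debugging and monitoring experience. Thanks to @MOLYHECI.
Official Tutorial Videos Released:
Releases a new Bilibili tutorial series that systematically explains DataFlow's core concepts, workflows, and hands-on cases.
🔗 Watch the tutorials >>
Thanks to @Qmeiyi.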

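🧩 Important Improvements

Added prompt registration and automatic validation (@SunnyHaze)
Added structured output support for VLLM Serving (@wongzhenhao)
Strengthened pipeline compile-time checks (@SunnyHaze)
Improved PDF2Model and automatic benchmark evaluation (@YalinFeng01)
Released the official tutorial series (@Qmeiyi)
Agent Refactor Preview:
The DataFlow Agent module is being fully refactored and will migrate to a LangGraph architecture for more efficient multi-agent management and task orchestration. Stay tuned.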

🚀 DataFlow v1.0.6 Key Feature Updates

Prompt Registration System
Introduced a unified Prompt Registry that supports one-to-many prompt bindings per operator, allowing flexible task adaptation and consistent structure. Thanks to @SunnyHaze.
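To make the one-to-many binding concrete, here is a minimal sketch of how such a registry could work. It is not DataFlow's actual API: the class `PromptRegistry`, the methods `register` and `prompts_for`, and the operator and prompt names are all hypothetical, chosen only to illustrate the idea of one operator owning several prompt templates.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class PromptRegistry:
    """Hypothetical registry: one operator name maps to many prompt classes."""

    def __init__(self) -> None:
        self._prompts: Dict[str, List[type]] = defaultdict(list)

    def register(self, operator_name: str) -> Callable[[type], type]:
        """Return a class decorator that binds a prompt class to an operator."""

        def decorator(prompt_cls: type) -> type:
            # Avoid duplicate entries if a module is imported twice.
            if prompt_cls not in self._prompts[operator_name]:
                self._prompts[operator_name].append(prompt_cls)
            return prompt_cls

        return decorator

    def prompts_for(self, operator_name: str) -> List[type]:
        """List the prompt classes bound to an operator, in registration order."""
        return list(self._prompts.get(operator_name, []))


registry = PromptRegistry()


# Two illustrative prompt templates bound to the same (hypothetical) operator.
@registry.register("QuestionGenerator")
class MathQuestionPrompt:
    def build(self, text: str) -> str:
        return f"Write a math question about: {text}"


@registry.register("QuestionGenerator")
class CodeQuestionPrompt:
    def build(self, text: str) -> str:
        return f"Write a coding question about: {text}"


names = [cls.__name__ for cls in registry.prompts_for("QuestionGenerator")]
print(names)
```

Keeping the binding in a registry rather than hard-coding a prompt inside each operator means a new task scenario only needs a new prompt class, not a new operator.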
New Code Processing Pipeline
Added a comprehensive code pipeline and related operators for analyzing, filtering, and processing code datasets. Thanks to @beccabai.
Reasoning Pipeline Achievements
The Reasoning pipeline achieved 1st place in the BAAI LIC Reasoning Competition, validating DataFlow’s reasoning robustness and system scalability. Thanks to @miaode74, @scuuy, @wongzhenhao, @HeRunming, and @haolpku.
Automatic PDF2Model Functionality
Added an automated PDF-to-Model module that converts PDF documents or datasets into structured QA pairs, enabling downstream model training with LlamaFactory. Thanks to @YalinFeng01 and @ZhaoyangHan04.
Automatic Benchmark Evaluation
Introduced the DataFlow Eval module for automatic text benchmark evaluation (e.g., string match and semantic match). Thanks to @YalinFeng01.
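As an illustration of what a string-match metric computes, the sketch below scores predictions against references after light normalization. It is an assumption-laden example, not the DataFlow Eval implementation: the function names `normalize` and `string_match_accuracy` and the normalization rules (lowercasing, stripping punctuation, collapsing whitespace) are hypothetical. Semantic match, which typically relies on embeddings or a judge model, is not covered here.

```python
import string
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def string_match_accuracy(
    predictions: Sequence[str], references: Sequence[str]
) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)


# Two of the three predictions match their reference once normalized.
score = string_match_accuracy(["Paris.", "42", "blue"], ["paris", "42", "red"])
print(f"{score:.3f}")
```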
Text2SQL Pipeline with Unified DB Manager
Refactored the Text2SQL pipeline with a new DB Manager supporting MySQL, SQLite, and more. Enhanced prompt modularity and operator reuse. Thanks to @TechNomad-ds.
JSON Schema Structured Output
LLMServing and LiteLLMServing now support JSON Schema structured outputs, allowing models to produce well-formed structured results. Thanks to @wongzhenhao.
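The sketch below shows what a JSON Schema for a QA pair might look like and how a raw model response could be checked against it. Everything here is illustrative: `ANSWER_SCHEMA` and `conforms` are hypothetical names, the checker handles only top-level `required` and `properties` types rather than the full JSON Schema specification, and it does not show how DataFlow's serving classes accept a schema.

```python
import json

# Hypothetical schema for a structured QA response.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "question": {"type": "string"},
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["question", "answer"],
}

# Mapping from JSON Schema type names to Python types (subset only).
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def conforms(raw: str, schema: dict) -> bool:
    """Check a raw response against the top-level keys and types of a schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    if any(key not in data for key in schema.get("required", [])):
        return False
    for key, spec in schema.get("properties", {}).items():
        if key not in data:
            continue
        value = data[key]
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        if isinstance(value, bool) and spec["type"] != "boolean":
            return False
        if not isinstance(value, _JSON_TYPES[spec["type"]]):
            return False
    return True


print(conforms('{"question": "2+2?", "answer": "4"}', ANSWER_SCHEMA))
print(conforms('{"question": "2+2?"}', ANSWER_SCHEMA))
```

For real workloads a full validator such as the `jsonschema` package is a better fit than this hand-rolled subset.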
Structured QA Extraction from Books
Added a BookQA Extraction Pipeline to automatically extract structured QA pairs from book-style documents. Thanks to @HeRunming.
Science Operators Added
Introduced Science operators for scientific and multimodal data processing. Thanks to @haolpku.
Colorful and Informative Logger
Enhanced logging with a colorful output format for better readability and debugging. Thanks to @MOLYHECI.
New Tutorial Series
Released a Bilibili tutorial series introducing key DataFlow concepts and practical demos.
🎥 Watch here. Thanks to @Qmeiyi.

🧩 Notable Improvements

Added prompt registration and validation – @SunnyHaze
Added structured output support for VLLM Serving – @wongzhenhao
Enhanced pipeline compilation checks – @SunnyHaze
Improved PDF2Model and benchmark evaluation – @YalinFeng01
Added official tutorial series – @Qmeiyi
Agent Refactor Announcement
The DataFlow Agent is undergoing a major refactor and will soon migrate to a LangGraph-based architecture, supporting advanced multi-agent orchestration.

What's Changed

[webui] debug for WebUI, revise func name 'type' to 'serving_type' by @SunnyHaze in #186
[Debug] Fix API bug with adding a button to write env value DF_API_KEY by @HeRunming in #187
[WebUI] Add pdf knowledge base clean WebUI by @HeRunming in #189
unify _api_chat usage by @MOLYHECI in #190
Add a retry mechanism for failed requests in llm_serving && fix the case where a failed llm_serving call leaves None in the cleaned results, raising TypeError: argument of t… by @xyxhchb in #188
[debug] fix dir not exist by @HeRunming in #192
[webui] debug import error by @SunnyHaze in #193
add adp and update gradio by @Qmeiyi in #194
Fix the execution classifier operator in the text2sql pipeline by @TechNomad-ds in #197
Dataflow agent Console bug fix by @DeepThinkingZhouLiu in #199
move non-key params to init function by @ZhaoyangHan04 in #200
[Update] unified i/o keys with input/output_* format by @wongzhenhao in #201
migrate from DataFlow421 by @yuwenkai2003 in #202
[Pipeline] add Automatic Speech Recognition module and corresponding pipeline. by @gty1829 in #207
[issue temp] update issue template; and add sglang & mineru to dataflow env by @SunnyHaze in #208
Bug Fix by @DeepThinkingZhouLiu in #210
[Compiled Pipeline] Add naive logic of Compiled pipeline for pre-check of key logic & Serving management. by @MOLYHECI in #191
[compile] Added gradient color transition from step=0 to step=n in th… by @SunnyHaze in #213
[Agent] significant reduce debug time when writing pipeline with dataflow agent by calling pipeline.complie() by @DeepThinkingZhouLiu in #214
[compile] Report all KeyErrors in a single, consolidated compilation … by @SunnyHaze in #215
[Agent] fix prompt_template issue for Dataflow agent when autorun by @DeepThinkingZhouLiu in #216
add encoding check during write storage by @ZhaoyangHan04 in #217
rewrite LALMServing by @gty1829 in #219
add get_desc functions for new ops by @scuuy in #224
fix bug in diy prompt by @scuuy in #225
Add core_text and chemistry smiles extraction pipeline by @haolpku in #226
[refactor] Rename operators and revise op structure at 2025-08-21 by @SunnyHaze in #227
update text2sql pipeline, reconstruct prompt template by @TechNomad-ds in #230
[debug] fix #231, Prompted Generator issue after #227 by @haolpku in #232
Add material pipeline and pairwise prompted generator by @haolpku in #234
fix serving name and fix import LocalModelLLMServing bug by @haolpku in #235
add atomic operation by @Fengzhongzhihan in #236
fix chunk logic when length of tokens greater than model max token size by @CheinTian in #239
fix chemical pipeline output schema bug by @ZhaoyangHan04 in #240
update response format for chemistry pipelines by @haolpku in #243
[PDF2model/text2model] dataflow PDF2model/text2model function added to dataflow cli by @dataflow-fyl in #242
Dataflow agent SH by @DeepThinkingZhouLiu in #241
fix api serving by @haolpku in #245
Update the Text2SQL pipeline, refactored the database manager to support better database extensibility; manage prompts through prompt template classes to improve operator reusability. by @TechNomad-ds in #244
[README] Add a documentation link for the pipeline in the README file. fix #221 by @miaode74 in #250
add bench eval pipeline (string match and semantic match) by @scuuy in #238
rename kbc ops and prompt class by @ZhaoyangHan04 in #249
[Refactor] AgenticRAG pipeline & Doc2QA pipeline & KCenterGreedy by @wongzhenhao in #246
[refactor] moving example file to right path by @wongzhenhao in #253
divide general_text operators into core_text, general_text, text_pt, text_sft by @MOLYHECI in #248
fix issue #254 and some bugs by @zzy1127 in #255
[requirements] rm kenlm, rm redundant requirements file by @SunnyHaze in #256
adapt the pdf2model and text2model files to use the new operator names by @dataflow-fyl in #252
[debug] fix import bug for webui by @MOLYHECI in #258
modified Pdf2model pipeline by @YalinFeng01 in #260
[prompt] add prompt template & prompt registry to dataflow by @SunnyHaze in #259
[Update] update for prompt_template registration by @wongzhenhao in #263
Add code processing pipelines and operators by @beccabai in #261
fix: fix CodeQualityScoreFilter rename problem by @beccabai in #270
[refactor] RARE pipeline by @Rise-1210 in #262
[prompt] remove sensitive word for dev & debug, fix #273. by @SunnyHaze in #274
[KBC, db_pool] fix registry bug in KBC pipeline; add myscale db_pool to support connect pool for myscale #271 by @leaderwolfpipi in #275
change the prompt format to register version for reasoning operators by @scuuy in #276
Unified text2sql and fix bugs by @TechNomad-ds in #277
fix bug for text2sql by @TechNomad-ds in #278
Unified text2sql by @yaodongwen in #247
support vectorsql pipeline by @yaodongwen in #281
fix bug for text2sql pipeline by @TechNomad-ds in #283
[Update] Structural output feature added by using json_schema by @wongzhenhao in #282
[text2sql] refactor to create a unified text2sql pipeline by @SunnyHaze in #284
EvalPipeline by @YalinFeng01 in #285
fix some bugs and upgrade text pipeline by @zzy1127 in #288
[Update] Merging and renaming text2qa op by @wongzhenhao in #267
EvalPipeline by @YalinFeng01 in #286
make extra prompt function private by @ZhaoyangHan04 in #290
[overview] add function check for prompts and auto check by @SunnyHaze in #289
[feature] vllm serving support structural output by @wongzhenhao in #291
fix the name of input/output key by @scuuy in #292
relative path in gpu pipeline by @ZhaoyangHan04 in #298
Mathfusion pipeline release & Embedding Generator release by @wongzhenhao in #296
update prompt template and prompt restrict for text2sql pipeline by @TechNomad-ds in #295
general text prompt rewrite by @MOLYHECI in #293
[test] add key prefix auto-checking for all operator.run() by @SunnyHaze in #294
[feature] colorful logger by @MOLYHECI in #299
remove redundant code by @scuuy in #301
[Update] LiteLLMServing now consistent with other serving by @wongzhenhao in #304
Add Unified Prompt by @Fengzhongzhihan in #305
[operators] update func call operators & rewrite prompts by @MOLYHECI in #306
remove mathverifyjudger (useless op), fix bug in tokeninfoevaluator by @scuuy in #309
Apivlmserving abstract method added by @wongzhenhao in #311
LiteLLMServing supports json_schema structural output for compatible LLM by @wongzhenhao in #307

New Contributors

@xyxhchb made their first contribution in #188
@yuwenkai2003 made their first contribution in #202
@gty1829 made their first contribution in #207
@Fengzhongzhihan made their first contribution in #236
@CheinTian made their first contribution in #239
@dataflow-fyl made their first contribution in #242
@miaode74 made their first contribution in #250
@YalinFeng01 made their first contribution in #260
@beccabai made their first contribution in #261
@Rise-1210 made their first contribution in #262
@yaodongwen made their first contribution in #247

Full Changelog: v1.0.5...v1.0.6
