🚀 DataFlow v1.0.6 Key Feature Updates
- Prompt Registration System: Introduced a unified Prompt Registry that lets each operator bind multiple prompt templates (a one-to-many structured registration), so prompts can be reused and extended across task scenarios. Thanks to @SunnyHaze.
- New Code Processing Pipeline: Added a complete code pipeline and related operators for analyzing, filtering, and quality-cleaning code datasets. Thanks to @beccabai.
- Reasoning Pipeline Achievements: The Reasoning pipeline took 1st place in the BAAI LIC Reasoning Competition, validating DataFlow’s reasoning robustness and system scalability. Thanks to @miaode74 and the math reasoning team: @scuuy, @wongzhenhao, @HeRunming, and @haolpku.
- Automatic PDF2Model Functionality: Added an automated PDF-to-Model module that converts input PDFs or datasets into structured QA data and trains downstream models with LlamaFactory, covering the path from document to model end to end. Thanks to @YalinFeng01 and @ZhaoyangHan04.
- Automatic Benchmark Evaluation: Introduced the DataFlow Eval module for automatic evaluation of text benchmarks (e.g., string match and semantic match) inside a pipeline. Thanks to @YalinFeng01.
- Text2SQL Pipeline with Unified DB Manager: Refactored the Text2SQL pipeline around a new DB Manager that uniformly supports MySQL, SQLite, and other database types, with improved prompt template management and operator reuse. Thanks to @TechNomad-ds.
- JSON Schema Structured Output: LLMServing and LiteLLMServing now support JSON Schema output, so models can directly produce well-formed structured responses, which also improves compatibility with multimodal tasks. Thanks to @wongzhenhao.
- Structured QA Extraction from Books: Added a BookQA extraction pipeline and related operators that automatically extract structured QA pairs from books and other long texts. Thanks to @HeRunming.
- Science Operators Added: Introduced Science operators for processing scientific and multimodal datasets. Thanks to @haolpku.
- Colorful Logger: The logger now produces colored output for easier debugging and monitoring. Thanks to @MOLYHECI.
- New Tutorial Series: Released a Bilibili tutorial series covering DataFlow's core concepts, workflows, and hands-on examples. Thanks to @Qmeiyi.
🧩 Notable Improvements
- Added prompt registration and validation – @SunnyHaze
- Added structured output support for VLLM Serving – @wongzhenhao
- Enhanced pipeline compilation checks – @SunnyHaze
- Improved PDF2Model and benchmark evaluation – @YalinFeng01
- Added official tutorial series – @Qmeiyi
- Agent Refactor Announcement: The DataFlow Agent is undergoing a major refactor and will soon migrate to a LangGraph-based architecture, supporting advanced multi-agent orchestration.
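As a rough illustration of JSON Schema structured output, the Python sketch below wraps a schema in the OpenAI-style `response_format` request field and then checks a reply against it. The parameter names DataFlow's serving classes actually take are not given in these notes, so `build_response_format`, `parse_structured`, and `QA_SCHEMA` are hypothetical helpers. The check covers only a small subset of JSON Schema.

```python
import json
from typing import Any, Dict

# Example schema for a structured QA answer (illustrative only).
QA_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "question": {"type": "string"},
        "answer": {"type": "string"},
    },
    "required": ["question", "answer"],
    "additionalProperties": False,
}


def build_response_format(name: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a JSON Schema in the OpenAI-style `response_format` field."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "schema": schema, "strict": True},
    }


def parse_structured(raw: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Parse a model reply and check required keys, extra keys, and string types.

    This covers only a tiny subset of JSON Schema; a full validator
    would be needed for nested or non-string fields.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    properties = schema.get("properties", {})
    if schema.get("additionalProperties") is False:
        extra = [key for key in data if key not in properties]
        if extra:
            raise ValueError(f"unexpected keys: {extra}")
    for key, spec in properties.items():
        if key in data and spec.get("type") == "string" and not isinstance(data[key], str):
            raise ValueError(f"key '{key}' must be a string")
    return data


response_format = build_response_format("qa_pair", QA_SCHEMA)
reply = '{"question": "What is 2 + 2?", "answer": "4"}'  # stand-in for a model reply
record = parse_structured(reply, QA_SCHEMA)
print(record["answer"])
```

Validating on the client side is still worthwhile when the backend enforces the schema, because not every model or provider supports strict decoding.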
What's Changed
- [webui] debug for WebUI, revise func name 'type' to 'serving_type' by @SunnyHaze in #186
- [Debug] Fix API bug with adding a button to write env value DF_API_KEY by @HeRunming in #187
- [WebUI] Add pdf knowledge base clean WebUI by @HeRunming in #189
- unify _api_chat usage by @MOLYHECI in #190
- Add a retry mechanism for failed requests to llm_serving && fix the case where a failed llm_serving call leaves None in the cleaned results, raising TypeError: argument of t… by @xyxhchb in #188
- [debug] fix dir not exist by @HeRunming in #192
- [webui] debug import error by @SunnyHaze in #193
- add adp and update gradio by @Qmeiyi in #194
- Fix the execution classifier operator in the text2sql pipeline by @TechNomad-ds in #197
- Dataflow agent Console bug fix by @DeepThinkingZhouLiu in #199
- move non-key params to init function by @ZhaoyangHan04 in #200
- [Update] unified i/o keys with input/output_* format by @wongzhenhao in #201
- migrate from DataFlow421 by @yuwenkai2003 in #202
- [Pipeline] add Automatic Speech Recognition module and corresponding pipeline. by @gty1829 in #207
- [issue temp] update issue template; and add sglang & mineru to dataflow env by @SunnyHaze in #208
- Bug Fix by @DeepThinkingZhouLiu in #210
- [Compiled Pipeline] Add naive logic of Compiled pipeline for pre-check of key logic & Serving management. by @MOLYHECI in #191
- [compile] Added gradient color transition from step=0 to step=n in th… by @SunnyHaze in #213
- [Agent] significant reduce debug time when writing pipeline with dataflow agent by calling pipeline.complie() by @DeepThinkingZhouLiu in #214
- [compile] Report all KeyErrors in a single, consolidated compilation … by @SunnyHaze in #215
- [Agent] fix prompt_template issue for Dataflow agent when autorun by @DeepThinkingZhouLiu in #216
- add encoding check during write storage by @ZhaoyangHan04 in #217
- rewrite LALMServing by @gty1829 in #219
- add get_desc functions for new ops by @scuuy in #224
- fix bug in diy prompt by @scuuy in #225
- Add core_text and chemistry smiles extraction pipeline by @haolpku in #226
- [refactor] Rename operators and revise op structure at 2025-08-21 by @SunnyHaze in #227
- update text2sql pipeline, reconstruct prompt template by @TechNomad-ds in #230
- [debug] fix #231, Prompted Generator issue after #227 by @haolpku in #232
- Add material pipeline and pairwise prompted generator by @haolpku in #234
- fix serving name and fix import LocalModelLLMServing bug by @haolpku in #235
- add atomic operation by @Fengzhongzhihan in #236
- fix chunk logic when length of tokens greater than model max token size by @CheinTian in #239
- fix chemical pipeline output schema bug by @ZhaoyangHan04 in #240
- update response format for chemistry pipelines by @haolpku in #243
- [PDF2model/text2model] dataflow PDF2model/text2model function added to dataflow cli by @dataflow-fyl in #242
- Dataflow agent SH by @DeepThinkingZhouLiu in #241
- fix api serving by @haolpku in #245
- Update the Text2SQL pipeline, refactored the database manager to support better database extensibility; manage prompts through prompt template classes to improve operator reusability. by @TechNomad-ds in #244
- [README] Add a documentation link for the pipeline in the README file. fix #221 by @miaode74 in #250
- add bench eval pipeline (string match and semantic match) by @scuuy in #238
- rename kbc ops and prompt class by @ZhaoyangHan04 in #249
- [Refactor] AgenticRAG pipeline & Doc2QA pipeline & KCenterGreedy by @wongzhenhao in #246
- [refactor] moving example file to right path by @wongzhenhao in #253
- divide general_text operators into core_text, general_text, text_pt, text_sft by @MOLYHECI in #248
- fix issue #254 and some bugs by @zzy1127 in #255
- [requirements] rm kenlm, rm redundant requirements file by @SunnyHaze in #256
- adapt the pdf2model and text2model files to use the new operator names by @dataflow-fyl in #252
- [debug] fix import bug for webui by @MOLYHECI in #258
- modified Pdf2model pipeline by @YalinFeng01 in #260
- [prompt] add prompt template & prompt registry to dataflow by @SunnyHaze in #259
- [Update] update for prompt_template registration by @wongzhenhao in #263
- Add code processing pipelines and operators by @beccabai in #261
- fix: fix CodeQualityScoreFilter rename problem by @beccabai in #270
- [refactor] RARE pipeline by @Rise-1210 in #262
- [prompt] remove sensitive word for dev & debug, fix #273. by @SunnyHaze in #274
- [KBC, db_pool] fix registry bug in KBC pipeline; add myscale db_pool to support connect pool for myscale #271 by @leaderwolfpipi in #275
- change the prompt format to register version for reasoning operators by @scuuy in #276
- Unified text2sql and fix bugs by @TechNomad-ds in #277
- fix bug for text2sql by @TechNomad-ds in #278
- Unified text2sql by @yaodongwen in #247
- support vectorsql pipeline by @yaodongwen in #281
- fix bug for text2sql pipeline by @TechNomad-ds in #283
- [Update] Structural output feature added by using json_schema by @wongzhenhao in #282
- [text2sql] refactor to create a unified text2sql pipeline by @SunnyHaze in #284
- EvalPipeline by @YalinFeng01 in #285
- fix some bugs and upgrade text pipeline by @zzy1127 in #288
- [Update] Merging and renaming text2qa op by @wongzhenhao in #267
- EvalPipeline by @YalinFeng01 in #286
- make extra prompt function private by @ZhaoyangHan04 in #290
- [overview] add function check for prompts and auto check by @SunnyHaze in #289
- [feature] vllm serving support structural output by @wongzhenhao in #291
- fix the name of input/output key by @scuuy in #292
- relative path in gpu pipeline by @ZhaoyangHan04 in #298
- Mathfusion pipeline release & Embedding Generator release by @wongzhenhao in #296
- update prompt template and prompt restrict for text2sql pipeline by @TechNomad-ds in #295
- general text prompt rewrite by @MOLYHECI in #293
- [test] add key prefix auto-checking for all operator.run() by @SunnyHaze in #294
- [feature] colorful logger by @MOLYHECI in #299
- remove redundant code by @scuuy in #301
- [Update] LiteLLMServing now consistent with other serving by @wongzhenhao in #304
- Add Unified Prompt by @Fengzhongzhihan in #305
- [operators] update func call operators & rewrite prompts by @MOLYHECI in #306
- remove mathverifyjudger (useless op), fix bug in tokeninfoevaluator by @scuuy in #309
- Apivlmserving abstract method added by @wongzhenhao in #311
- LiteLLMServing supports json_schema structural output for compatible LLM by @wongzhenhao in #307
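Several entries above (#191, #215) describe a compile step that pre-checks pipeline keys and reports all key errors at once. The Python sketch below shows one way such a check could work, assuming each step declares the keys it reads and writes. `Step`, `PipelineKeyError`, and `compile_check` are hypothetical names, not DataFlow's implementation.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Step:
    """Illustrative pipeline step with declared input and output keys."""
    name: str
    input_keys: List[str]
    output_keys: List[str]


class PipelineKeyError(Exception):
    """Raised once, carrying every missing-key problem found."""

    def __init__(self, problems: List[str]) -> None:
        super().__init__("\n".join(problems))
        self.problems = problems


def compile_check(steps: List[Step], initial_keys: Iterable[str]) -> None:
    """Walk the steps in order without running them and collect all missing keys."""
    available = set(initial_keys)
    problems: List[str] = []
    for index, step in enumerate(steps):
        for key in step.input_keys:
            if key not in available:
                problems.append(f"step {index} ({step.name}): missing input key '{key}'")
        # Outputs become available to later steps even if this step had problems,
        # so one mistake does not hide independent mistakes further down.
        available.update(step.output_keys)
    if problems:
        raise PipelineKeyError(problems)


steps = [
    Step("generate", ["raw_text"], ["question"]),
    Step("filter", ["question", "score"], ["kept"]),
    Step("format", ["answer"], ["output"]),
]

try:
    compile_check(steps, initial_keys={"raw_text"})
except PipelineKeyError as error:
    print(error)
```

Collecting problems instead of raising on the first one lets a user fix every key mismatch in a single pass, before any model calls are made.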
New Contributors
- @xyxhchb made their first contribution in #188
- @yuwenkai2003 made their first contribution in #202
- @gty1829 made their first contribution in #207
- @Fengzhongzhihan made their first contribution in #236
- @CheinTian made their first contribution in #239
- @dataflow-fyl made their first contribution in #242
- @miaode74 made their first contribution in #250
- @YalinFeng01 made their first contribution in #260
- @beccabai made their first contribution in #261
- @Rise-1210 made their first contribution in #262
- @yaodongwen made their first contribution in #247
Full Changelog: v1.0.5...v1.0.6