Dataflow v1.0.6 Release Note

@haolpku released this 15 Oct 15:26 · 42 commits to main since this release

🚀 DataFlow v1.0.6 Changelog

🔑 Major Feature Updates

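  • Prompt Registration System
    Introduces a unified Prompt Registry that lets each operator bind multiple prompt templates. This one-to-many structured registration makes prompts easier to reuse and extend across task scenarios. Thanks to @SunnyHaze.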

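  • New Code Processing Pipeline
    Adds a complete code processing pipeline and related operators for analyzing, filtering, and quality-cleaning code datasets, supporting code intelligence and data cleaning tasks. Thanks to @beccabai.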

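  • Reasoning Pipeline Validated by a Competition Win
    DataFlow's Reasoning Pipeline took first place in the BAAI LIC Reasoning Competition, validating the system's robustness and innovation in logical reasoning and dataflow scheduling. Thanks to @miaode74 and the math reasoning team: @scuuy, @wongzhenhao, @HeRunming, and @haolpku.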

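  • Automated PDF2Model
    Adds a PDF-to-Model module that automatically converts input PDFs or datasets into structured QA data and trains a downstream model with LlamaFactory, giving an end-to-end automated path from documents to model training data. Thanks to @YalinFeng01 and @ZhaoyangHan04.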

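  • Automatic Benchmark Evaluation Module
    Adds the DataFlow Eval module, which automatically evaluates text benchmarks (for example, string match and semantic match) inside a pipeline. Thanks to @YalinFeng01.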

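  • Text2SQL Pipeline with Unified Database Management
    Reworks the Text2SQL Pipeline around a new DB Manager that provides unified support for MySQL, SQLite, and other database types, and improves prompt template management and operator reuse. Thanks to @TechNomad-ds.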

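  • JSON Schema Structured Output
    LLMServing and LiteLLMServing now support JSON Schema output, so they can generate structured responses directly, improving compatibility with multimodal tasks. Thanks to @wongzhenhao.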

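  • Structured QA Extraction Pipeline for Books
    Adds a BookQA extraction pipeline and related operators that automatically extract structured QA data from books and long texts. Thanks to @HeRunming.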

  • Science Operator Expansion
    Adds Science operators for processing scientific research and multimodal datasets. Thanks to @haolpku.

  • Colored Logger
    Upgrades the logging system to colored output for easier debugging and monitoring. Thanks to @MOLYHECI.

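  • Official Tutorial Videos Released
    Publishes a new Bilibili tutorial series that covers DataFlow's core concepts, workflows, and hands-on examples.
    🔗 Watch the tutorials >>
    Thanks to @Qmeiyi.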
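
The one-to-many binding described for the Prompt Registry can be pictured with a small decorator-based registry. This is only an illustrative sketch and not DataFlow's actual API: `register_prompt`, `prompts_for`, the operator name `QuestionGenerator`, and both prompt classes are hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical registry: operator name -> list of prompt template classes.
_PROMPTS: Dict[str, List[type]] = defaultdict(list)


def register_prompt(operator_name: str) -> Callable[[type], type]:
    """Class decorator that binds a prompt template to an operator name."""
    def decorator(prompt_cls: type) -> type:
        _PROMPTS[operator_name].append(prompt_cls)
        return prompt_cls
    return decorator


def prompts_for(operator_name: str) -> List[type]:
    """Return every prompt template registered for the operator."""
    return list(_PROMPTS.get(operator_name, []))


# Two templates bound to the same (hypothetical) operator: one-to-many.
@register_prompt("QuestionGenerator")
class MathQuestionPrompt:
    def build(self, text: str) -> str:
        return f"Write one math question based on: {text}"


@register_prompt("QuestionGenerator")
class CodeQuestionPrompt:
    def build(self, text: str) -> str:
        return f"Write one coding question based on: {text}"
```

With a layout like this, an operator could look up `prompts_for("QuestionGenerator")` and pick the template that fits the task at hand, which is the reuse pattern the release note describes.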


🧩 Important Improvements

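  • Added prompt registration and automatic validation (@SunnyHaze)
  • Added structured output support for VLLM Serving (@wongzhenhao)
  • Enhanced pipeline compile-time checks (@SunnyHaze)
  • Improved PDF2Model and automatic benchmark evaluation (@YalinFeng01)
  • Released the official tutorial series (@Qmeiyi)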
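  • Agent Refactor Preview
    The DataFlow Agent module is being fully refactored and will migrate to a LangGraph architecture for more efficient multi-agent management and task orchestration. Stay tuned.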
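
The release note does not say what the pipeline compile-time checks verify. One plausible kind of check, sketched here purely as an assumption, is confirming before any data is processed that every key a step reads is produced by an earlier step or present in the input. `Step` and `check_pipeline` are hypothetical names, not DataFlow classes.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Step:
    """Hypothetical pipeline step: the keys it reads and the keys it writes."""
    name: str
    input_keys: List[str]
    output_keys: List[str]


def check_pipeline(steps: List[Step], initial_keys: Set[str]) -> List[str]:
    """Return one error message per input key that no earlier step provides."""
    available = set(initial_keys)
    errors = []
    for step in steps:
        for key in step.input_keys:
            if key not in available:
                errors.append(f"{step.name}: missing input key '{key}'")
        # Keys written by this step become available to later steps.
        available.update(step.output_keys)
    return errors
```

Running a check like this up front surfaces wiring mistakes immediately instead of partway through a long run.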

🚀 DataFlow v1.0.6 Key Feature Updates

  • Prompt Registration System
    Introduced a unified Prompt Registry that supports one-to-many prompt bindings per operator, allowing flexible task adaptation and consistent structure. Thanks to @SunnyHaze.

  • New Code Processing Pipeline
    Added a comprehensive code pipeline and related operators for analyzing, filtering, and processing code datasets. Thanks to @beccabai.

  • Reasoning Pipeline Achievements
    The Reasoning pipeline achieved 1st place in the BAAI LIC Reasoning Competition, validating DataFlow’s reasoning robustness and system scalability. Thanks to @miaode74, @scuuy, @wongzhenhao, @HeRunming, and @haolpku.

  • Automatic PDF2Model Functionality
    Added an automated PDF-to-Model module that converts PDF documents or datasets into structured QA pairs, enabling downstream model training with LlamaFactory. Thanks to @YalinFeng01 and @ZhaoyangHan04.

  • Automatic Benchmark Evaluation
    Introduced the DataFlow Eval module for automatic text benchmark evaluation (e.g., string match and semantic match). Thanks to @YalinFeng01.

  • Text2SQL Pipeline with Unified DB Manager
    Refactored the Text2SQL pipeline with a new DB Manager supporting MySQL, SQLite, and more. Enhanced prompt modularity and operator reuse. Thanks to @TechNomad-ds.

  • JSON Schema Structured Output
    LLMServing and LiteLLMServing now support JSON Schema structured outputs, allowing models to produce well-formed structured results. Thanks to @wongzhenhao.

  • Structured QA Extraction from Books
    Added a BookQA Extraction Pipeline to automatically extract structured QA pairs from book-style documents. Thanks to @HeRunming.

  • Science Operators Added
    Introduced Science operators for scientific and multimodal data processing. Thanks to @haolpku.

  • Colorful and Informative Logger
    Enhanced logging with a colorful output format for better readability and debugging. Thanks to @MOLYHECI.

  • New Tutorial Series
    Released a Bilibili tutorial series introducing key DataFlow concepts and practical demos.
    🎥 Watch here — Thanks to @Qmeiyi.
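
To illustrate what JSON Schema structured output buys downstream code, the sketch below checks a model response against a schema before using it. It is a standard-library-only illustration that handles just the `type`, `required`, and `properties` keywords; a full validator library would normally be used instead. `QA_SCHEMA`, `validate`, and `parse_response` are hypothetical names, and nothing here reflects how LLMServing or LiteLLMServing are called.

```python
import json
from typing import Any, Dict, List

# Maps JSON Schema type names to Python types (a small subset only).
_TYPES = {"string": str, "integer": int, "number": (int, float),
          "boolean": bool, "array": list, "object": dict}


def validate(value: Any, schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Check `value` against the "type", "required" and "properties" keywords."""
    errors: List[str] = []
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(value, _TYPES[expected])
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("integer", "number") and isinstance(value, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}"]
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: required key missing")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub, f"{path}.{key}"))
    return errors


# Hypothetical schema for one QA pair.
QA_SCHEMA = {
    "type": "object",
    "required": ["question", "answer"],
    "properties": {"question": {"type": "string"}, "answer": {"type": "string"}},
}


def parse_response(raw: str) -> Dict[str, Any]:
    """Parse a model response and raise ValueError if it violates QA_SCHEMA."""
    data = json.loads(raw)
    errors = validate(data, QA_SCHEMA)
    if errors:
        raise ValueError("; ".join(errors))
    return data
```

When the serving layer enforces the schema at generation time, a check like this should rarely fail, but it remains a cheap guard at the boundary between model output and pipeline code.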


🧩 Notable Improvements

  • Added prompt registration and validation – @SunnyHaze
  • Added structured output support for VLLM Serving – @wongzhenhao
  • Enhanced pipeline compilation checks – @SunnyHaze
  • Improved PDF2Model and benchmark evaluation – @YalinFeng01
  • Added official tutorial series – @Qmeiyi
  • Agent Refactor Announcement
    The DataFlow Agent is undergoing a major refactor and will soon migrate to a LangGraph-based architecture, supporting advanced multi-agent orchestration.
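
The string-match style of benchmark evaluation mentioned above can be sketched as normalized exact-match accuracy. This is a generic illustration, not the DataFlow Eval implementation: `normalize` and `exact_match_accuracy` are hypothetical names, and the normalization rules (lowercasing, dropping ASCII punctuation, collapsing whitespace) are assumptions.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

Semantic match would replace the equality test with a similarity score, for instance from an embedding model, compared against a threshold.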

Full Changelog: v1.0.5...v1.0.6