Commit bc650c2

merge main branch
2 parents b9679cb + 4148016 commit bc650c2

101 files changed

+2791
-1730
lines changed

Original file line number | Diff line number | Diff line change
@@ -1,9 +1,13 @@
11
name: Deploy Sphinx documentation to Pages
22

33
on:
4-
release:
5-
types: [published]
6-
workflow_dispatch:
4+
pull_request:
5+
types: [opened, synchronize]
6+
paths:
7+
- 'docs/sphinx_doc/**/*'
8+
push:
9+
branches:
10+
- main
711

812
jobs:
913
pages:
@@ -19,14 +23,18 @@ jobs:
1923
run: |
2024
python -m pip install --upgrade pip
2125
pip install -v -e .[dev]
22-
- id: deployment
23-
uses: sphinx-notes/pages@v3
26+
- id: build
27+
name: Build Documentation
28+
run: |
29+
cd docs/sphinx_doc
30+
bash build_doc.sh
31+
- name: Upload Documentation
32+
uses: actions/upload-artifact@v3
2433
with:
25-
documentation_path: ./docs/sphinx_doc/source
26-
python_version: ${{ matrix.python-version }}
27-
publish: false
28-
requirements_path: ./environments/dev_requires.txt
34+
name: SphinxDoc
35+
path: 'docs/sphinx_doc/build/html'
2936
- uses: peaceiris/actions-gh-pages@v3
37+
if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
3038
with:
3139
github_token: ${{ secrets.GITHUB_TOKEN }}
32-
publish_dir: ${{ steps.deployment.outputs.artifact }}
40+
publish_dir: 'docs/sphinx_doc/build/html'
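
The gating added to the `peaceiris/actions-gh-pages@v3` step can be paraphrased in plain Python. This is only an illustrative sketch of the `if:` expression shown in the diff; `should_deploy` is a hypothetical helper name, not part of the workflow or of GitHub Actions.

```python
def should_deploy(event_name: str, ref: str) -> bool:
    """Mirror of the workflow condition:
    github.event_name == 'push' && github.ref == 'refs/heads/main'
    """
    return event_name == "push" and ref == "refs/heads/main"


# Pull requests build the docs and upload the artifact, but never publish.
print(should_deploy("pull_request", "refs/pull/1/merge"))  # False
# Only a push to main reaches the publishing step.
print(should_deploy("push", "refs/heads/main"))  # True
```

With this condition, the build and upload steps run for both triggers, while publishing is reserved for pushes to `main`.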

.github/workflows/unit-test.yml

+1
@@ -27,6 +27,7 @@ jobs:
2727
df -h
2828
- name: Install dependencies
2929
run: |
30+
sudo apt-get install ffmpeg
3031
python -m pip install --upgrade pip
3132
pip install -v -e .[all]
3233
- name: Increase swapfile
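
The workflow now installs `ffmpeg` before the Python dependencies. The diff does not say which tests need it; as a hedged sketch, a test suite could check for the binary up front and skip or warn when it is missing. `ffmpeg_available` is a hypothetical helper, not a Data-Juicer function.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found")
    else:
        print("ffmpeg missing; try `sudo apt-get install ffmpeg`")
```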

README.md

+27 -14
@@ -32,11 +32,9 @@ We welcome you to join us in promoting LLM data development and research!
3232

3333
We provide a [Playground](http://8.130.100.170/) with a managed JupyterLab. [Try Data-Juicer](http://8.130.100.170/) straight away in your browser!
3434

35-
If you find Data-Juicer useful for your research or development, please kindly
36-
cite our [work](#references). Welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw), [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11), or WeChat group (scan the QR code below with WeChat) for discussion.
37-
38-
<img src="https://img.alicdn.com/imgextra/i3/O1CN01QbwHJa1EV5uZwmU9c_!!6000000000356-2-tps-400-400.png" width = "100" height = "100" alt="QR Code for WeChat group" align=center />
39-
35+
If you find Data-Juicer useful for your research or development, please kindly cite our [work](#references).
36+
We welcome issues and PRs, and invite you to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw)
37+
or [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) for discussion!
4038

4139
----
4240

@@ -76,6 +74,7 @@ Table of Contents
7674
- [Data Analysis](#data-analysis)
7775
- [Data Visualization](#data-visualization)
7876
- [Build Up Config Files](#build-up-config-files)
77+
- [Sandbox](#sandbox)
7978
- [Preprocess Raw Data (Optional)](#preprocess-raw-data-optional)
8079
- [For Docker Users](#for-docker-users)
8180
- [Data Recipes](#data-recipes)
@@ -92,25 +91,25 @@ Table of Contents
9291
- **Systematic & Reusable**:
9392
Empowering users with a systematic library of 80+ core [OPs](docs/Operators.md), 20+ reusable [config recipes](configs), and 20+ feature-rich
9493
dedicated [toolkits](#documentation), designed to
95-
function independently of specific LLM datasets and processing pipelines.
94+
function independently of specific multimodal LLM datasets and processing pipelines.
9695

97-
- **Data-in-the-loop**: Allowing detailed data analyses with an automated
98-
report generation feature for a deeper understanding of your dataset. Coupled with multi-dimension automatic evaluation capabilities, it supports a timely feedback loop at multiple stages in the LLM development process.
96+
- **Data-in-the-loop & Sandbox**: Supporting one-stop data-model collaborative development, enabling rapid iteration
97+
through the [sandbox laboratory](docs/Sandbox.md), and providing features such as feedback loops based on data and model,
98+
visualization, and multidimensional automatic evaluation, so that you can better understand and improve your data and models.
9999
![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)
100100

101+
- **Enhanced Efficiency**: Providing efficient, parallel data processing pipelines (Aliyun-PAI, Ray, Slurm, CUDA, OP Fusion)
102+
requiring less memory and CPU usage, optimized for maximum productivity.
103+
![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)
104+
101105
- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data
102106
processing recipes](configs/data_juicer_recipes/README.md) for pre-training, fine-tuning, English, Chinese, and other scenarios. Validated on
103107
reference LLaMA and LLaVA models.
104108
![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png)
105109

106-
- **Enhanced Efficiency**: Providing a speedy data processing pipeline
107-
requiring less memory and CPU usage, optimized for maximum productivity.
108-
![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)
109-
110-
111110
- **Flexible & Extensible**: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to [implement your own OPs](docs/DeveloperGuide.md#build-your-own-ops) for customizable data processing.
112111

113-
- **User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documentation), [easy start guides](#quick-start) and [demo configs](configs/README.md), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).
112+
- **User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documents), [easy start guides](#quick-start) and [demo configs](configs/README.md), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).
114113

115114

116115

@@ -248,6 +247,8 @@ dj-process --config configs/demo/process.yaml
248247
- **Note:** Operators that rely on third-party models or resources not stored locally may be slow on their first run, because they first need to download the corresponding resources into a directory.
249248
The default download cache directory is `~/.cache/data_juicer`. To change the cache location, set the shell environment variable `DATA_JUICER_CACHE_HOME` to another directory; `DATA_JUICER_MODELS_CACHE` and `DATA_JUICER_ASSETS_CACHE` can be changed in the same way:
250249

250+
- **Note:** When using operators with third-party models, you must declare the corresponding `mem_required` in the configuration file (see the settings in `config_all.yaml` for reference). At runtime, Data-Juicer controls the number of processes based on available memory and the memory requirements of the operator models, to achieve better data processing efficiency. When running in a CUDA environment, an incorrectly declared `mem_required` for an operator may lead to a CUDA Out of Memory error.
251+
251252
```shell
252253
# cache home
253254
export DATA_JUICER_CACHE_HOME="/path/to/another/directory"
@@ -320,6 +321,18 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
320321

321322
![Basic config example of format and definition](https://img.alicdn.com/imgextra/i1/O1CN01uXgjgj1khWKOigYww_!!6000000004715-0-tps-1745-871.jpg "Basic config file example")
322323

324+
### Sandbox
325+
326+
The data sandbox laboratory (DJ-Sandbox) provides users with the best practices for continuously producing data recipes. It features low overhead, portability, and guidance.
327+
328+
- In the sandbox, users can quickly experiment, iterate, and refine data recipes based on small-scale datasets and models, before scaling up to produce high-quality data to serve large-scale models.
329+
- In addition to the basic data optimization and recipe refinement features offered by Data-Juicer, users can seamlessly use configurable components such as data probe and analysis, model training and evaluation, and data and model feedback-based recipe refinement to form a complete one-stop data-model research and development pipeline.
330+
331+
By default, the sandbox is run with the following command. For more information and details, please refer to the [sandbox documentation](docs/Sandbox.md).
332+
```shell
333+
python tools/sandbox_starter.py --config configs/demo/sandbox/sandbox.yaml
334+
```
335+
323336
### Preprocess Raw Data (Optional)
324337
- Our formatters support some common input dataset formats for now:
325338
- Multi-sample in one file: jsonl/json, parquet, csv/tsv, etc.
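
The `mem_required` note in this README diff says that Data-Juicer limits the number of processes according to available memory and each operator model's memory requirement. The sketch below illustrates that idea only; `estimate_num_proc` and its arguments are hypothetical and do not reproduce Data-Juicer's actual implementation.

```python
import os
from typing import Optional


def estimate_num_proc(available_mem_gb: float,
                      mem_required_gb: float,
                      max_proc: Optional[int] = None) -> int:
    """Cap the worker count so that workers * mem_required fits in memory."""
    if mem_required_gb <= 0:
        raise ValueError("mem_required_gb must be positive")
    if max_proc is None:
        max_proc = os.cpu_count() or 1
    fits_in_memory = int(available_mem_gb // mem_required_gb)
    # Always keep at least one process, even if memory looks too small.
    return max(1, min(max_proc, fits_in_memory))


# 40 GB free and 16 GB per model copy: only two workers fit.
print(estimate_num_proc(40, 16, max_proc=8))  # 2
```

Under this model, declaring too small a value (say 4 GB for a model that really needs 16 GB) would allow 8 workers where only 2 fit, which is consistent with the out-of-memory risk the note describes.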

README_ZH.md

+22 -9
@@ -28,10 +28,7 @@ Data-Juicer (including [DJ-SORA](docs/DJ_SORA_ZH.md)) is being actively updated and maintained
2828

2929
If Data-Juicer is helpful for your research or development, please cite our [work](#参考文献).
3030

31-
You are welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp), [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11), or WeChat group (scan the QR code below to join) for discussion.
32-
33-
<img src="https://img.alicdn.com/imgextra/i3/O1CN01QbwHJa1EV5uZwmU9c_!!6000000000356-2-tps-400-400.png" width = "100" height = "100" alt="QR Code for WeChat group" align=center />
34-
31+
Issues and PRs are welcome, and you are invited to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp) or [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) for discussion!
3532

3633
----
3734

@@ -70,6 +67,7 @@ Data-Juicer (including [DJ-SORA](docs/DJ_SORA_ZH.md)) is being actively updated and maintained
7067
- [Data Analysis](#数据分析)
7168
- [Data Visualization](#数据可视化)
7269
- [Build Up Config Files](#构建配置文件)
70+
- [Sandbox](#沙盒实验室)
7371
- [Preprocess Raw Data (Optional)](#预处理原始数据可选)
7472
- [For Docker Users](#对于-docker-用户)
7573
- [Data Recipes](#数据处理菜谱)
@@ -83,15 +81,15 @@ Data-Juicer (including [DJ-SORA](docs/DJ_SORA_ZH.md)) is being actively updated and maintained
8381

8482
![Overview](https://img.alicdn.com/imgextra/i4/O1CN01WYQP3Z1JHsaXaQDK6_!!6000000001004-0-tps-3640-1812.jpg)
8583

86-
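* **Systematic & Reusable**: Providing users with a systematic and reusable library of 80+ core [OPs](docs/Operators_ZH.md), 20+ [config recipes](configs/README_ZH.md), and 20+ dedicated [toolkits](#documentation), designed so that data processing is independent of specific LLM datasets and processing pipelines.
84+
* **Systematic & Reusable**: Providing users with a systematic and reusable library of 80+ core [OPs](docs/Operators_ZH.md), 20+ [config recipes](configs/README_ZH.md), and 20+ dedicated [toolkits](#documentation), designed so that multimodal data processing is independent of specific LLM datasets and processing pipelines.
8785

88-
* **Data-in-the-loop**: Supporting detailed data analyses with automated report generation for a deeper understanding of your dataset. Combined with multi-dimensional automatic evaluation, it supports timely feedback loops at multiple stages of LLM development. ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)
86+
* **Data-in-the-loop & Sandbox**: Supporting one-stop data-model collaborative development, with rapid iteration through the [sandbox laboratory](docs/Sandbox-ZH.md) and features such as data- and model-based feedback loops, visualization, and multi-dimensional automatic evaluation, so that you can better understand and improve your data and models. ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)
8987

90-
* **Comprehensive Data Processing Recipes**: Offering dozens of [pre-built data processing recipes](configs/data_juicer_recipes/README_ZH.md) for pre-training, fine-tuning, Chinese, English, and other scenarios. Validated on LLaMA, LLaVA, and other models. ![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png)
88+
* **Enhanced Efficiency**: Providing efficient, parallel data processing pipelines (Aliyun-PAI, Ray, Slurm, CUDA, OP Fusion) that reduce memory and CPU usage and improve productivity. ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)
9189

92-
* **Enhanced Efficiency**: Providing efficient data processing pipelines that reduce memory and CPU usage and improve productivity. ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)
90+
* **Comprehensive Data Processing Recipes**: Offering dozens of [pre-built data processing recipes](configs/data_juicer_recipes/README_ZH.md) for pre-training, fine-tuning, Chinese, English, and other scenarios. Validated on LLaMA, LLaVA, and other models. ![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png)
9391

94-
* **User-Friendly**: Designed to be simple and easy to use, with comprehensive [documentation](#documentation), easy [start guides](#快速上手), and [demo configs](configs/README_ZH.md); OPs can easily be added to or removed from [existing configs](configs/config_all.yaml).
92+
* **User-Friendly**: Designed to be simple and easy to use, with comprehensive [documentation](#documents), easy [start guides](#快速上手), and [demo configs](configs/README_ZH.md); OPs can easily be added to or removed from [existing configs](configs/config_all.yaml).
9593

9694
* **Flexible & Extensible**: Supporting most data formats (e.g., jsonl, parquet, csv) and allowing flexible combinations of OPs. You can [implement your own OPs](docs/DeveloperGuide_ZH.md#构建自己的算子) for customized data processing.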
9795

@@ -226,6 +224,8 @@ dj-process --config configs/demo/process.yaml
226224

227225
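* **Note**: Operators that use third-party models or resources not stored locally may be slow on their first run, because they need to download the corresponding resources into the cache directory. The default download cache directory is `~/.cache/data_juicer`. You can change the cache location by setting the shell environment variable `DATA_JUICER_CACHE_HOME`, and you can likewise set `DATA_JUICER_MODELS_CACHE` or `DATA_JUICER_ASSETS_CACHE` to change the model cache or asset cache directory, respectively:
228226

227+
* **Note**: For operators that use third-party models, you need to declare the corresponding `mem_required` in the config file (see the settings in `config_all.yaml` for reference). At runtime, Data-Juicer controls the number of processes according to the available memory and the memory required by each operator's model, to achieve better data processing efficiency. When running in a CUDA environment, an incorrectly declared `mem_required` may lead to a CUDA Out of Memory error.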
228+
229229
```shell
230230
# cache home
231231
export DATA_JUICER_CACHE_HOME="/path/to/another/directory"
@@ -296,6 +296,19 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang
296296

297297
![Basic config example of format and definition](https://img.alicdn.com/imgextra/i4/O1CN01xPtU0t1YOwsZyuqCx_!!6000000003050-0-tps-1692-879.jpg "Basic config file example")
298298

299+
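### Sandbox
300+
301+
The data sandbox laboratory (DJ-Sandbox) provides users with best practices for continuously producing data recipes. It features low overhead, portability, and guidance.
302+
- In the sandbox, users can quickly experiment with, iterate on, and refine data recipes using small-scale datasets and models, then scale up to produce high-quality data at large scale to serve large models.
303+
- In the sandbox, in addition to Data-Juicer's basic data optimization and recipe refinement features, users can conveniently use configurable components such as data probing and analysis, sandbox model training and evaluation, and recipe refinement based on data and model feedback, which together form a complete one-stop data-model research and development pipeline.
304+
305+
By default, the sandbox is run with the following command. For more information and details, please refer to the [sandbox documentation](docs/Sandbox-ZH.md).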
306+
```shell
307+
python tools/sandbox_starter.py --config configs/demo/sandbox/sandbox.yaml
308+
```
309+
310+
311+
299312
### Preprocess Raw Data (Optional)
300313

301314
* Our formatters currently support some common input dataset formats:
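
The cache note in this README diff names `~/.cache/data_juicer` as the default cache directory and `DATA_JUICER_CACHE_HOME` as the override. A minimal sketch of that lookup follows; `resolve_cache_home` is a hypothetical helper, and how Data-Juicer itself reads these variables (or derives the models and assets directories) is not shown in the diff.

```python
import os


def resolve_cache_home() -> str:
    """Return DATA_JUICER_CACHE_HOME if set, else the documented default."""
    default = os.path.join(os.path.expanduser("~"), ".cache", "data_juicer")
    return os.environ.get("DATA_JUICER_CACHE_HOME", default)


# Equivalent to `export DATA_JUICER_CACHE_HOME="/path/to/another/directory"`.
os.environ["DATA_JUICER_CACHE_HOME"] = "/path/to/another/directory"
print(resolve_cache_home())  # /path/to/another/directory
```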
