You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardexpand all lines: README.md
+27-14
Original file line number
Diff line number
Diff line change
@@ -32,11 +32,9 @@ We welcome you to join us in promoting LLM data development and research!
32
32
33
33
We provide a [Playground](http://8.130.100.170/) with a managed JupyterLab. [Try Data-Juicer](http://8.130.100.170/) straight away in your browser!
34
34
35
-
If you find Data-Juicer useful for your research or development, please kindly
36
-
cite our [work](#references). Welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw), [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11), or WeChat group (scan the QR code below with WeChat) for discussion.
37
-
38
-
<imgsrc="https://img.alicdn.com/imgextra/i3/O1CN01QbwHJa1EV5uZwmU9c_!!6000000000356-2-tps-400-400.png"width = "100"height = "100"alt="QR Code for WeChat group"align=center />
39
-
35
+
If you find Data-Juicer useful for your research or development, please kindly cite our [work](#references).
36
+
Welcome any issues/PRs and to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw)
37
+
or [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) for discussion!
40
38
41
39
----
42
40
@@ -76,6 +74,7 @@ Table of Contents
76
74
-[Data Analysis](#data-analysis)
77
75
-[Data Visualization](#data-visualization)
78
76
-[Build Up Config Files](#build-up-config-files)
77
+
-[Sandbox](#sandbox)
79
78
-[Preprocess Raw Data (Optional)](#preprocess-raw-data-optional)
80
79
-[For Docker Users](#for-docker-users)
81
80
-[Data Recipes](#data-recipes)
@@ -92,25 +91,25 @@ Table of Contents
92
91
-**Systematic & Reusable**:
93
92
Empowering users with a systematic library of 80+ core [OPs](docs/Operators.md), 20+ reusable [config recipes](configs), and 20+ feature-rich
94
93
dedicated [toolkits](#documentation), designed to
95
-
function independently of specific LLM datasets and processing pipelines.
94
+
function independently of specific multimodal LLM datasets and processing pipelines.
96
95
97
-
-**Data-in-the-loop**: Allowing detailed data analyses with an automated
98
-
report generation feature for a deeper understanding of your dataset. Coupled with multi-dimension automatic evaluation capabilities, it supports a timely feedback loop at multiple stages in the LLM development process.
-**Flexible & Extensible**: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to [implement your own OPs](docs/DeveloperGuide.md#build-your-own-ops) for customizable data processing.
112
111
113
-
-**User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documentation), [easy start guides](#quick-start) and [demo configs](configs/README.md), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).
112
+
-**User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documents), [easy start guides](#quick-start) and [demo configs](configs/README.md), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).
- **Note:** For some operators that involve third-party models or resources which are not stored locally on your computer, it might be slow for the first running because these ops need to download corresponding resources into a directory first.
249
248
The default download cache directory is `~/.cache/data_juicer`. Change the cache location by setting the shell environment variable, `DATA_JUICER_CACHE_HOME` to another directory, and you can also change `DATA_JUICER_MODELS_CACHE` or `DATA_JUICER_ASSETS_CACHE`in the same way:
250
249
250
+
- **Note:** When using operators with third-party models, it's necessary to declare the corresponding `mem_required` in the configuration file (you can refer to the settings in the `config_all.yaml` file). During runtime, Data-Juicer will control the number of processes based on memory availability and the memory requirements of the operator models to achieve better data processing efficiency. When running with CUDA environment, if the mem_required for an operator is not declared correctly, it could potentially lead to a CUDA Out of Memory issue.

322
323
324
+
### Sandbox
325
+
326
+
The data sandbox laboratory (DJ-Sandbox) provides users with the best practices for continuously producing data recipes. It features low overhead, portability, and guidance.
327
+
328
+
- In the sandbox, users can quickly experiment, iterate, and refine data recipes based on small-scale datasets and models, before scaling up to produce high-quality data to serve large-scale models.
329
+
- In addition to the basic data optimization and recipe refinement features offered by Data-Juicer, users can seamlessly use configurable components such as data probe and analysis, model training and evaluation, and data and model feedback-based recipe refinement to form a complete one-stop data-model research and development pipeline.
330
+
331
+
The sandbox is run using the following commands by default, and for more information and details, please refer to the [sandbox documentation](docs/Sandbox.md).
***注意**:对于使用了第三方模型的算子,在填写config文件时需要去声明其对应的`mem_required`(可以参考`config_all.yaml`文件中的设置)。Data-Juicer在运行过程中会根据内存情况和算子模型所需的memory大小来控制对应的进程数,以达成更好的数据处理的性能效率。而在使用CUDA环境运行时,如果不正确的声明算子的`mem_required`情况,则有可能导致CUDA Out of Memory。
0 commit comments