Commit 3a73afe

merge main
2 parents 432eb07 + 79567da commit 3a73afe

30 files changed, +1288 -28 lines changed

README.md

+8-5
@@ -36,12 +36,13 @@ Data-Juicer is being actively updated and maintained. We will periodically enhan
 
 
 ## News
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-03-13] We propose a new data synthesis method, *MindGym*, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., *16%* gain on [MathVision](https://mathllm.github.io/mathvision/#leaderboard) using only *400 samples*). See more details in [MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https://arxiv.org/abs/2503.09499).
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-28] DJ has been integrated in [Ray's official Ecosystem](https://docs.ray.io/en/latest/ray-overview/ray-libraries.html) and [Example Gallery](https://docs.ray.io/en/latest/data/examples/data_juicer_distributed_data_processing.html). Besides, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by [Apache Arrow](https://github.com/apache/arrow/pull/45084).
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-27] Our work on contrastive data synthesis, [ImgDiff](https://arxiv.org/pdf/2408.04594), has been accepted by *CVPR 2025*!
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-05] We propose a new data selection method, *DaaR*, which is theoretically informed, via treating diversity as a reward, achieves better overall performance across 7 benchmarks when post-training SOTA LLMs. See more details in [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DaaR_arXiv_preview.pdf).
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-05] We propose a new data selection method, *DaaR*, which is theoretically informed, via treating diversity as a reward, achieves better overall performance across 7 benchmarks when post-training SOTA LLMs. See more details in [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380).
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-11] We release our 2.0 paper, [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](https://arxiv.org/abs/2501.14755). It now can process 70B data samples within 2.1h, using 6400 CPU cores on 50 Ray nodes from Alibaba Cloud cluster, and deduplicate 5TB data within 2.8h using 1280 CPU cores on 8 Ray nodes.
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-03] We support post-tuning scenarios better, via 20+ related new [OPs](https://github.com/modelscope/data-juicer/releases/tag/v1.0.2), and via unified [dataset format](https://github.com/modelscope/data-juicer/releases/tag/v1.0.3) compatible to LLaMA-Factory and ModelScope-Swift.
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-12-17] We propose *HumanVBench*, which comprises 17 human-centric tasks with synthetic data, benchmarking video-MLLMs' capabilities from views of inner emotion and outer manifestations. See more details in our [paper](https://arxiv.org/abs/2412.17574), and try to [evaluate](https://github.com/modelscope/data-juicer/tree/HumanVBench) your models with it.
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-12-17] We propose *HumanVBench*, which comprises 16 human-centric tasks with synthetic data, benchmarking 22 video-MLLMs' capabilities from views of inner emotion and outer manifestations. See more details in our [paper](https://arxiv.org/abs/2412.17574), and try to [evaluate](https://github.com/modelscope/data-juicer/tree/HumanVBench) your models with it.
 
 <details>
 <summary> History News:
@@ -511,7 +512,7 @@ If you find Data-Juicer useful for your research or development, please kindly c
 ```
 
 <details>
-<summary> More related papers from the Data-Juicer Team:
+<summary> More data-related papers from the Data-Juicer Team:
 </summary>>
 
 - [Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784)
@@ -522,8 +523,10 @@ If you find Data-Juicer useful for your research or development, please kindly c
 
 - [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583)
 
-- [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DaaR_arXiv_preview.pdf)
-
+- [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380)
+
+- [MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https://arxiv.org/abs/2503.09499)
+
 - [BiMix: A Bivariate Data Mixing Law for Language Model Pretraining](https://arxiv.org/abs/2405.14908)
 
 </details>

README_ZH.md

+7-4
@@ -32,12 +32,13 @@ Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多
 ----
 
 ## 新消息
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-03-13] 我们提出了一种新的数据合成方法 *MindGym*,该方法鼓励 LLM 自我生成具有挑战性的认知问题,实现优于 SOTA 基线的数据效率、跨模态泛化和 SFT 效果(例如,仅使用 *400 个样本* 即可在 [MathVision](https://mathllm.github.io/mathvision/#leaderboard) 上获得 *16%* 的增益)。有关更多详细信息,请参阅[MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https://arxiv.org/abs/2503.09499)
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-28] DJ 已被集成到 [Ray官方 Ecosystem](https://docs.ray.io/en/latest/ray-overview/ray-libraries.html)[Example Gallery](https://docs.ray.io/en/latest/data/examples/data_juicer_distributed_data_processing.html)。此外,我们在 DJ2.0 中的流式 JSON 加载补丁已被 [Apache Arrow 官方集成](https://github.com/apache/arrow/pull/45084)
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-27] 我们的对比数据合成工作, [ImgDiff](https://arxiv.org/pdf/2408.04594), 已被 *CVPR 2025* 接收!
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-05] 我们提出了一种新的数据选择方法 *DaaR*,该方法基于理论指导,将数据多样性建模为奖励信号,在 7 个基准测试中,微调 SOTA LLMs 取得了更好的整体表现。有关更多详细信息,请参阅 [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DaaR_arXiv_preview.pdf)
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-05] 我们提出了一种新的数据选择方法 *DaaR*,该方法基于理论指导,将数据多样性建模为奖励信号,在 7 个基准测试中,微调 SOTA LLMs 取得了更好的整体表现。有关更多详细信息,请参阅 [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380)
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-11] 我们发布了 2.0 版论文 [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](https://arxiv.org/abs/2501.14755)。DJ现在可以使用阿里云集群中 50 个 Ray 节点上的 6400 个 CPU 核心在 2.1 小时内处理 70B 数据样本,并使用 8 个 Ray 节点上的 1280 个 CPU 核心在 2.8 小时内对 5TB 数据进行重复数据删除。
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-03] 我们通过 20 多个相关的新 [OP](https://github.com/modelscope/data-juicer/releases/tag/v1.0.2) 以及与 LLaMA-Factory 和 ModelScope-Swift 兼容的统一 [数据集格式](https://github.com/modelscope/data-juicer/releases/tag/v1.0.3) 更好地支持Post-Tuning场景。
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-12-17] 我们提出了 *HumanVBench*,它包含 17 个以人为中心的任务,使用合成数据,从内在情感和外在表现的角度对视频 MLLM 的能力进行基准测试。请参阅我们的 [论文](https://arxiv.org/abs/2412.17574) 中的更多详细信息,并尝试使用它 [评估](https://github.com/modelscope/data-juicer/tree/HumanVBench) 您的模型。
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-12-17] 我们提出了 *HumanVBench*,它包含 16 个以人为中心的任务,使用合成数据,从内在情感和外在表现的角度对22个视频 MLLM 的能力进行基准测试。请参阅我们的 [论文](https://arxiv.org/abs/2412.17574) 中的更多详细信息,并尝试使用它 [评估](https://github.com/modelscope/data-juicer/tree/HumanVBench) 您的模型。
 
 <details>
 <summary> History News:
@@ -492,7 +493,7 @@ Data-Juicer 感谢社区[贡献者](https://github.com/modelscope/data-juicer/gr
 }
 ```
 <details>
-<summary>更多Data-Juicer团队相关论文:
+<summary>更多Data-Juicer团队关于数据的论文:
 </summary>>
 
 - [Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784)
@@ -503,7 +504,9 @@ Data-Juicer 感谢社区[贡献者](https://github.com/modelscope/data-juicer/gr
 
 - [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583)
 
-- [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DaaR_arXiv_preview.pdf)
+- [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380)
+
+- [MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https://arxiv.org/abs/2503.09499)
 
 - [BiMix: A Bivariate Data Mixing Law for Language Model Pretraining](https://arxiv.org/abs/2405.14908)
 

data_juicer/__init__.py

+1-1
@@ -1,4 +1,4 @@
-__version__ = '1.2.1'
+__version__ = '1.2.2'
 
 import os
 import subprocess

data_juicer/utils/asset_utils.py

+2
@@ -48,6 +48,8 @@ def load_words_asset(words_dir: str, words_type: str):
         logger.info(f'Specified {words_dir} does not contain '
                     f'any {words_type} files in json format, now '
                     'download the one cached by data_juicer team')
+        if words_type not in ASSET_LINKS:
+            raise ValueError(f'{words_type} is not in remote server.')
         response = requests.get(ASSET_LINKS[words_type])
         words_dict = response.json()
         # cache the asset file locally
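
The guard added to `load_words_asset` can be sketched in isolation. This is a minimal sketch under assumptions: `ASSET_LINKS` here is a stand-in dictionary with a placeholder URL, and `resolve_asset_url` is a hypothetical helper that does not exist in data_juicer. It only illustrates how checking membership first replaces an opaque `KeyError` with a descriptive `ValueError` before any network request is attempted.

```python
# Stand-in for the real mapping; the URL is a placeholder, not the real link.
ASSET_LINKS = {
    'stopwords': 'https://example.com/stopwords.json',
}


def resolve_asset_url(words_type: str) -> str:
    """Hypothetical helper: return the download URL or raise ValueError."""
    if words_type not in ASSET_LINKS:
        # Fail early with a clear message instead of a bare KeyError.
        raise ValueError(f'{words_type} is not in remote server.')
    return ASSET_LINKS[words_type]
```

The new `test_basic_func` case in this commit exercises the same path by expecting `ValueError` for `'non_existing_asset'`.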

data_juicer/utils/compress.py

+3-3
@@ -23,8 +23,8 @@ class FileLock(HF_FileLock):
     def _release(self):
         super()._release()
         try:
-            # logger.debug(f'Remove {self._lock_file}')
-            os.remove(self._lock_file)
+            # logger.debug(f'Remove {self.lock_file}')
+            os.remove(self.lock_file)
         # The file is already deleted and that's what we want.
         except OSError:
             pass
@@ -497,4 +497,4 @@ def decompress(ds, fingerprints=None, num_proc=1):
 
 
 def cleanup_compressed_cache_files(ds):
-    CacheCompressManager().cleanup_cache_files(ds)
+    CacheCompressManager(cache_utils.CACHE_COMPRESS).cleanup_cache_files(ds)

data_juicer/utils/constant.py

+3-2
@@ -172,14 +172,15 @@ def get_access_log(cls, dj_cfg=None, dataset=None):
             elif 'jsonl' in dj_cfg.dataset_path:
                 tmp_f_name = dj_cfg.dataset_path. \
                     replace('.jsonl', '.tmp.jsonl')
-                with open(dj_cfg.dataset_path, 'r') as orig_file:
+                with open(dj_cfg.dataset_path, 'r',
+                          encoding='utf-8') as orig_file:
                     first_line = orig_file.readline()
 
             assert tmp_f_name is not None and first_line is not None, \
                 'error when loading the first line, when ' \
                 f'dj_cfg.dataset_path={dj_cfg.dataset_path}'
 
-            with open(tmp_f_name, 'w') as tmp_file:
+            with open(tmp_f_name, 'w', encoding='utf-8') as tmp_file:
                 tmp_file.write(first_line)
 
             tmp_dj_cfg.dataset_path = tmp_f_name
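
Without an explicit `encoding`, Python's `open()` falls back to the locale's preferred encoding, which is not guaranteed to be UTF-8 on every platform. The sketch below shows the same read-first-line, write-to-temp-file pattern with the encoding pinned on both sides. `copy_first_line` and `demo` are hypothetical names used only for illustration, not functions from data_juicer.

```python
import os
import tempfile


def copy_first_line(src_path: str, dst_path: str) -> str:
    """Copy only the first line of src_path into dst_path, both as UTF-8."""
    with open(src_path, 'r', encoding='utf-8') as orig_file:
        first_line = orig_file.readline()
    with open(dst_path, 'w', encoding='utf-8') as tmp_file:
        tmp_file.write(first_line)
    return first_line


def demo() -> str:
    """Round-trip a non-ASCII JSONL line through copy_first_line."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        src = os.path.join(tmp_dir, 'data.jsonl')
        dst = os.path.join(tmp_dir, 'data.tmp.jsonl')
        with open(src, 'w', encoding='utf-8') as f:
            f.write('{"text": "数据"}\n{"text": "second"}\n')
        copy_first_line(src, dst)
        with open(dst, 'r', encoding='utf-8') as f:
            return f.read()
```

Pinning the encoding on both the read and the write keeps the temporary file byte-compatible with the original dataset regardless of the host locale.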

data_juicer/utils/mm_utils.py

+3-3
@@ -160,9 +160,9 @@ def iou(box1, box2):
     ix_max = min(x1_max, x2_max)
     iy_min = max(y1_min, y2_min)
     iy_max = min(y1_max, y2_max)
-    intersection = max(0, (ix_max - ix_min) * (iy_max - iy_min))
+    intersection = max(0, max(0, ix_max - ix_min) * max(0, iy_max - iy_min))
     union = area1 + area2 - intersection
-    return 1.0 * intersection / union
+    return 1.0 * intersection / union if union != 0 else 0.0
 
 
 def calculate_resized_dimensions(
@@ -207,7 +207,7 @@ def calculate_resized_dimensions(
 
     # Determine final dimensions based on original orientation
     resized_dimensions = ((new_short_edge,
-                           new_long_edge) if width <= height else
+                           new_long_edge) if width >= height else
                           (new_long_edge, new_short_edge))
 
     # Ensure final dimensions are divisible by the specified value
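
The `iou` change in this file fixes a sign problem: when two boxes are separated on both axes, `ix_max - ix_min` and `iy_max - iy_min` are both negative, so their product is positive and the old `max(0, ...)` let it through as a spurious overlap. The following standalone sketch clamps each side length separately and guards against a zero union. It assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples, which the diff context suggests but does not show.

```python
def iou(box1, box2):
    """Intersection over union for (x_min, y_min, x_max, y_max) boxes."""
    x1_min, y1_min, x1_max, y1_max = box1
    x2_min, y2_min, x2_max, y2_max = box2
    area1 = (x1_max - x1_min) * (y1_max - y1_min)
    area2 = (x2_max - x2_min) * (y2_max - y2_min)
    ix_min = max(x1_min, x2_min)
    ix_max = min(x1_max, x2_max)
    iy_min = max(y1_min, y2_min)
    iy_max = min(y1_max, y2_max)
    # Clamp each side to zero so two negative lengths cannot multiply
    # into a positive area.
    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    union = area1 + area2 - intersection
    # Degenerate (zero-area) boxes would otherwise divide by zero.
    return intersection / union if union != 0 else 0.0
```

For example, with unit boxes `(0, 0, 1, 1)` and `(2, 2, 3, 3)` the old expression computes `(-1) * (-1) = 1` as the intersection and reports an IoU of 1.0, whereas the clamped version returns 0.0.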

data_juicer/utils/registry.py

+1-4
@@ -17,8 +17,6 @@
 # https://github.com/modelscope/modelscope/blob/master/modelscope/utils/registry.py
 # --------------------------------------------------------
 
-from loguru import logger
-
 
 class Registry(object):
     """This class is used to register some modules to registry by a repo
@@ -53,8 +51,7 @@ def modules(self):
 
     def list(self):
         """Logging the list of module in current registry."""
-        for m in self._modules.keys():
-            logger.info(f'{self._name}\t{m}')
+        return list(self._modules.keys())
 
     def get(self, module_key):
         """
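
With this change `Registry.list()` returns the registered keys instead of logging them, so callers can log, filter, or assert on the result themselves. The sketch below is a stripped-down stand-in for the class: `MiniRegistry` and its `register` method are hypothetical simplifications, not the real data_juicer API.

```python
class MiniRegistry:
    """Hypothetical, minimal stand-in for the Registry class."""

    def __init__(self, name: str):
        self._name = name
        self._modules = {}

    def register(self, key, module):
        # Simplified registration; the real class may validate or decorate.
        self._modules[key] = module

    def list(self):
        # Inside a method body, `list` still resolves to the builtin,
        # because class-level names are not part of the method's scope.
        return list(self._modules.keys())
```

A test can then check membership directly, for example `assert 'op_a' in registry.list()`, without capturing log output.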

tests/run.py

+6-3
@@ -12,6 +12,10 @@
 import unittest
 import coverage
 
+# start the coverage immediately
+cov = coverage.Coverage(include='data_juicer/**')
+cov.start()
+
 from loguru import logger
 
 from data_juicer.utils.unittest_utils import set_clear_model_flag, get_partial_test_cases
@@ -91,12 +95,11 @@ def gather_test_cases(test_dir, pattern, tag, mode='partial'):
 
 
 def main():
-    cov = coverage.Coverage(include='data_juicer/**')
-    cov.start()
-
+    global cov
     runner = unittest.TextTestRunner()
     test_suite = gather_test_cases(os.path.abspath(args.test_dir),
                                    args.pattern, args.tag, args.mode)
+    logger.info(f'There are {len(test_suite._tests)} test cases to run.')
     res = runner.run(test_suite)
 
     cov.stop()

tests/utils/test_asset_utils.py

+57
@@ -0,0 +1,57 @@
+import os
+import json
+import unittest
+
+from data_juicer.utils.asset_utils import load_words_asset
+
+from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
+
+class LoadWordsAssetTest(DataJuicerTestCaseBase):
+
+    def setUp(self) -> None:
+        self.temp_output_path = 'tmp/test_asset_utils/'
+
+    def tearDown(self):
+        if os.path.exists(self.temp_output_path):
+            os.system(f'rm -rf {self.temp_output_path}')
+
+    def test_basic_func(self):
+        # download assets from the remote server
+        words_dict = load_words_asset(self.temp_output_path, 'stopwords')
+        self.assertTrue(len(words_dict) > 0)
+        self.assertTrue(os.path.exists(os.path.join(self.temp_output_path, 'stopwords.json')))
+
+        words_dict = load_words_asset(self.temp_output_path, 'flagged_words')
+        self.assertTrue(len(words_dict) > 0)
+        self.assertTrue(os.path.exists(os.path.join(self.temp_output_path, 'flagged_words.json')))
+
+        # non-existing asset
+        with self.assertRaises(ValueError):
+            load_words_asset(self.temp_output_path, 'non_existing_asset')
+
+    def test_load_from_existing_file(self):
+        os.makedirs(self.temp_output_path, exist_ok=True)
+        temp_asset = os.path.join(self.temp_output_path, 'temp_asset.json')
+        with open(temp_asset, 'w') as fout:
+            json.dump({'test_key': ['test_val']}, fout)
+
+        words_list = load_words_asset(self.temp_output_path, 'temp_asset')
+        self.assertEqual(len(words_list), 1)
+        self.assertEqual(len(words_list['test_key']), 1)
+
+    def test_load_from_serial_files(self):
+        os.makedirs(self.temp_output_path, exist_ok=True)
+        temp_asset = os.path.join(self.temp_output_path, 'temp_asset_v1.json')
+        with open(temp_asset, 'w') as fout:
+            json.dump({'test_key': ['test_val_1']}, fout)
+        temp_asset = os.path.join(self.temp_output_path, 'temp_asset_v2.json')
+        with open(temp_asset, 'w') as fout:
+            json.dump({'test_key': ['test_val_2']}, fout)
+
+        words_list = load_words_asset(self.temp_output_path, 'temp_asset')
+        self.assertEqual(len(words_list), 1)
+        self.assertEqual(len(words_list['test_key']), 2)
+
+
+if __name__ == '__main__':
+    unittest.main()
+12
@@ -0,0 +1,12 @@
+import unittest
+
+from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
+
+class AutoInstallMappingTest(DataJuicerTestCaseBase):
+
+    def test_placeholder(self):
+        pass
+
+
+if __name__ == '__main__':
+    unittest.main()
+24
@@ -0,0 +1,24 @@
+import unittest
+
+from data_juicer.utils.auto_install_utils import _is_module_installed, _is_package_installed
+
+from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
+
+class IsXXXInstalledFuncsTest(DataJuicerTestCaseBase):
+
+    def test_is_module_installed(self):
+        self.assertTrue(_is_module_installed('datasets'))
+        self.assertTrue(_is_module_installed('simhash'))
+
+        self.assertFalse(_is_module_installed('non_existent_module'))
+
+    def test_is_package_installed(self):
+        self.assertTrue(_is_package_installed('datasets'))
+        self.assertTrue(_is_package_installed('ram@git+https://github.com/xinyu1205/recognize-anything.git'))
+        self.assertTrue(_is_package_installed('scenedetect[opencv]'))
+
+        self.assertFalse(_is_package_installed('non_existent_package'))
+
+
+if __name__ == '__main__':
+    unittest.main()
+23
@@ -0,0 +1,23 @@
+import unittest
+
+from data_juicer.utils.availability_utils import _is_package_available
+from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
+
+class AvailabilityUtilsTest(DataJuicerTestCaseBase):
+
+    def test_is_package_available(self):
+        exist = _is_package_available('fsspec')
+        self.assertTrue(exist)
+        exist, version = _is_package_available('fsspec', return_version=True)
+        self.assertTrue(exist)
+        self.assertEqual(version, '2023.5.0')
+
+        exist = _is_package_available('non_existing_package')
+        self.assertFalse(exist)
+        exist, version = _is_package_available('non_existing_package', return_version=True)
+        self.assertFalse(exist)
+        self.assertEqual(version, 'N/A')
+
+
+if __name__ == '__main__':
+    unittest.main()
