
Commit 3a73afe

Committed Mar 14, 2025
merge main
2 parents 432eb07 + 79567da commit 3a73afe

30 files changed: +1288 −28 lines

‎README.md

+8 −5

````diff
@@ -36,12 +36,13 @@ Data-Juicer is being actively updated and maintained. We will periodically enhan
 
 
 ## News
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-03-13] We propose a new data synthesis method, *MindGym*, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., *16%* gain on [MathVision](https://mathllm.github.io/mathvision/#leaderboard) using only *400 samples*). See more details in [MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https://arxiv.org/abs/2503.09499).
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-28] DJ has been integrated in [Ray's official Ecosystem](https://docs.ray.io/en/latest/ray-overview/ray-libraries.html) and [Example Gallery](https://docs.ray.io/en/latest/data/examples/data_juicer_distributed_data_processing.html). Besides, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by [Apache Arrow](https://github.com/apache/arrow/pull/45084).
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-27] Our work on contrastive data synthesis, [ImgDiff](https://arxiv.org/pdf/2408.04594), has been accepted by *CVPR 2025*!
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-05] We propose a new data selection method, *DaaR*, which is theoretically informed, via treating diversity as a reward, achieves better overall performance across 7 benchmarks when post-training SOTA LLMs. See more details in [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DaaR_arXiv_preview.pdf).
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-05] We propose a new data selection method, *DaaR*, which is theoretically informed, via treating diversity as a reward, achieves better overall performance across 7 benchmarks when post-training SOTA LLMs. See more details in [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380).
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-11] We release our 2.0 paper, [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](https://arxiv.org/abs/2501.14755). It now can process 70B data samples within 2.1h, using 6400 CPU cores on 50 Ray nodes from Alibaba Cloud cluster, and deduplicate 5TB data within 2.8h using 1280 CPU cores on 8 Ray nodes.
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-03] We support post-tuning scenarios better, via 20+ related new [OPs](https://github.com/modelscope/data-juicer/releases/tag/v1.0.2), and via unified [dataset format](https://github.com/modelscope/data-juicer/releases/tag/v1.0.3) compatible to LLaMA-Factory and ModelScope-Swift.
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-12-17] We propose *HumanVBench*, which comprises 17 human-centric tasks with synthetic data, benchmarking video-MLLMs' capabilities from views of inner emotion and outer manifestations. See more details in our [paper](https://arxiv.org/abs/2412.17574), and try to [evaluate](https://github.com/modelscope/data-juicer/tree/HumanVBench) your models with it.
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-12-17] We propose *HumanVBench*, which comprises 16 human-centric tasks with synthetic data, benchmarking 22 video-MLLMs' capabilities from views of inner emotion and outer manifestations. See more details in our [paper](https://arxiv.org/abs/2412.17574), and try to [evaluate](https://github.com/modelscope/data-juicer/tree/HumanVBench) your models with it.
 
 <details>
 <summary> History News:
@@ -511,7 +512,7 @@ If you find Data-Juicer useful for your research or development, please kindly c
 ```
 
 <details>
-<summary> More related papers from the Data-Juicer Team:
+<summary> More data-related papers from the Data-Juicer Team:
 </summary>>
 
 - [Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784)
@@ -522,8 +523,10 @@ If you find Data-Juicer useful for your research or development, please kindly c
 
 - [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583)
 
-- [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DaaR_arXiv_preview.pdf)
-
+- [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380)
+
+- [MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https://arxiv.org/abs/2503.09499)
+
 - [BiMix: A Bivariate Data Mixing Law for Language Model Pretraining](https://arxiv.org/abs/2405.14908)
 
 </details>
````

‎README_ZH.md

+7 −4

````diff
@@ -32,12 +32,13 @@ Data-Juicer is being actively updated and maintained. We will periodically enhance and add more
 ----
 
 ## News
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-03-13] We propose a new data synthesis method, *MindGym*, which encourages LLMs to self-generate challenging cognitive questions, achieving data efficiency, cross-modality generalization, and SFT effects superior to SOTA baselines (e.g., a *16%* gain on [MathVision](https://mathllm.github.io/mathvision/#leaderboard) using only *400 samples*). See [MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https://arxiv.org/abs/2503.09499) for more details.
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-28] DJ has been integrated into [Ray's official Ecosystem](https://docs.ray.io/en/latest/ray-overview/ray-libraries.html) and [Example Gallery](https://docs.ray.io/en/latest/data/examples/data_juicer_distributed_data_processing.html). In addition, our streaming JSON loading patch in DJ2.0 has been [officially integrated by Apache Arrow](https://github.com/apache/arrow/pull/45084).
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-27] Our contrastive data synthesis work, [ImgDiff](https://arxiv.org/pdf/2408.04594), has been accepted by *CVPR 2025*!
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-05] We propose a new data selection method, *DaaR*, which is theoretically guided, models data diversity as a reward signal, and achieves better overall performance across 7 benchmarks when fine-tuning SOTA LLMs. See [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DaaR_arXiv_preview.pdf) for more details.
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-05] We propose a new data selection method, *DaaR*, which is theoretically guided, models data diversity as a reward signal, and achieves better overall performance across 7 benchmarks when fine-tuning SOTA LLMs. See [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380) for more details.
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-11] We release our 2.0 paper, [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](https://arxiv.org/abs/2501.14755). DJ can now process 70B data samples within 2.1 hours using 6400 CPU cores on 50 Ray nodes from an Alibaba Cloud cluster, and deduplicate 5TB of data within 2.8 hours using 1280 CPU cores on 8 Ray nodes.
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-03] We better support post-tuning scenarios via 20+ related new [OPs](https://github.com/modelscope/data-juicer/releases/tag/v1.0.2) and a unified [dataset format](https://github.com/modelscope/data-juicer/releases/tag/v1.0.3) compatible with LLaMA-Factory and ModelScope-Swift.
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-12-17] We propose *HumanVBench*, which comprises 17 human-centric tasks with synthetic data, benchmarking video-MLLMs' capabilities from the views of inner emotion and outer manifestations. See our [paper](https://arxiv.org/abs/2412.17574) for more details, and try to [evaluate](https://github.com/modelscope/data-juicer/tree/HumanVBench) your models with it.
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-12-17] We propose *HumanVBench*, which comprises 16 human-centric tasks with synthetic data, benchmarking the capabilities of 22 video-MLLMs from the views of inner emotion and outer manifestations. See our [paper](https://arxiv.org/abs/2412.17574) for more details, and try to [evaluate](https://github.com/modelscope/data-juicer/tree/HumanVBench) your models with it.
 
 <details>
 <summary> History News:
@@ -492,7 +493,7 @@ Data-Juicer thanks the community [contributors](https://github.com/modelscope/data-juicer/gr
 }
 ```
 <details>
-<summary>More related papers from the Data-Juicer Team:
+<summary>More data-related papers from the Data-Juicer Team:
 </summary>>
 
 - [Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784)
@@ -503,7 +504,9 @@ Data-Juicer thanks the community [contributors](https://github.com/modelscope/data-juicer/gr
 
 - [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583)
 
-- [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DaaR_arXiv_preview.pdf)
+- [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380)
+
+- [MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https://arxiv.org/abs/2503.09499)
 
 - [BiMix: A Bivariate Data Mixing Law for Language Model Pretraining](https://arxiv.org/abs/2405.14908)
 
````

‎data_juicer/__init__.py

+1 −1

```diff
@@ -1,4 +1,4 @@
-__version__ = '1.2.1'
+__version__ = '1.2.2'
 
 import os
 import subprocess
```

‎data_juicer/utils/asset_utils.py

+2

```diff
@@ -48,6 +48,8 @@ def load_words_asset(words_dir: str, words_type: str):
         logger.info(f'Specified {words_dir} does not contain '
                     f'any {words_type} files in json format, now '
                     'download the one cached by data_juicer team')
+        if words_type not in ASSET_LINKS:
+            raise ValueError(f'{words_type} is not in remote server.')
         response = requests.get(ASSET_LINKS[words_type])
         words_dict = response.json()
         # cache the asset file locally
```
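
With this guard, requesting a words asset that has no remote link now fails with an explicit `ValueError` instead of an unguarded lookup into `ASSET_LINKS`. A minimal usage sketch (the local directory path is just an illustrative placeholder):

```python
from data_juicer.utils.asset_utils import load_words_asset

# 'stopwords' and 'flagged_words' have assets cached by the Data-Juicer team
# and can be downloaded on demand; unknown types should now raise a clear error.
words_dict = load_words_asset('./assets', 'stopwords')
print(len(words_dict) > 0)

try:
    load_words_asset('./assets', 'non_existing_asset')
except ValueError as err:
    print(err)
```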

‎data_juicer/utils/compress.py

+3 −3

```diff
@@ -23,8 +23,8 @@ class FileLock(HF_FileLock):
     def _release(self):
         super()._release()
         try:
-            # logger.debug(f'Remove {self._lock_file}')
-            os.remove(self._lock_file)
+            # logger.debug(f'Remove {self.lock_file}')
+            os.remove(self.lock_file)
             # The file is already deleted and that's what we want.
         except OSError:
             pass
@@ -497,4 +497,4 @@ def decompress(ds, fingerprints=None, num_proc=1):
 
 
 def cleanup_compressed_cache_files(ds):
-    CacheCompressManager().cleanup_cache_files(ds)
+    CacheCompressManager(cache_utils.CACHE_COMPRESS).cleanup_cache_files(ds)
```

‎data_juicer/utils/constant.py

+3 −2

```diff
@@ -172,14 +172,15 @@ def get_access_log(cls, dj_cfg=None, dataset=None):
             elif 'jsonl' in dj_cfg.dataset_path:
                 tmp_f_name = dj_cfg.dataset_path. \
                     replace('.jsonl', '.tmp.jsonl')
-                with open(dj_cfg.dataset_path, 'r') as orig_file:
+                with open(dj_cfg.dataset_path, 'r',
+                          encoding='utf-8') as orig_file:
                     first_line = orig_file.readline()
 
             assert tmp_f_name is not None and first_line is not None, \
                 'error when loading the first line, when ' \
                 f'dj_cfg.dataset_path={dj_cfg.dataset_path}'
 
-            with open(tmp_f_name, 'w') as tmp_file:
+            with open(tmp_f_name, 'w', encoding='utf-8') as tmp_file:
                 tmp_file.write(first_line)
 
             tmp_dj_cfg.dataset_path = tmp_f_name
```
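
The explicit `encoding='utf-8'` matters because the default encoding of `open()` is platform dependent (it follows the locale on Windows), so a JSONL sample containing non-ASCII text could previously fail to round-trip. A small, self-contained illustration with a placeholder file name:

```python
# write and re-read one JSONL line that contains non-ASCII characters;
# pinning the encoding makes this behave the same on every platform
with open('sample.tmp.jsonl', 'w', encoding='utf-8') as tmp_file:
    tmp_file.write('{"text": "héllo, 世界"}\n')

with open('sample.tmp.jsonl', 'r', encoding='utf-8') as orig_file:
    first_line = orig_file.readline()

print(first_line.strip())
```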

‎data_juicer/utils/mm_utils.py

+3 −3

```diff
@@ -160,9 +160,9 @@ def iou(box1, box2):
     ix_max = min(x1_max, x2_max)
     iy_min = max(y1_min, y2_min)
     iy_max = min(y1_max, y2_max)
-    intersection = max(0, (ix_max - ix_min) * (iy_max - iy_min))
+    intersection = max(0, max(0, ix_max - ix_min) * max(0, iy_max - iy_min))
     union = area1 + area2 - intersection
-    return 1.0 * intersection / union
+    return 1.0 * intersection / union if union != 0 else 0.0
 
 
 def calculate_resized_dimensions(
@@ -207,7 +207,7 @@ def calculate_resized_dimensions(
 
     # Determine final dimensions based on original orientation
     resized_dimensions = ((new_short_edge,
-                           new_long_edge) if width <= height else
+                           new_long_edge) if width >= height else
                           (new_long_edge, new_short_edge))
 
     # Ensure final dimensions are divisible by the specified value
```
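
The two `iou` changes guard against degenerate inputs: without clamping each edge separately, two disjoint boxes produce two negative differences whose product is positive, and a pair of zero-area boxes divides by zero. A self-contained sketch of the corrected logic (the `(x_min, y_min, x_max, y_max)` box layout is assumed here for illustration, not taken from the source):

```python
def iou(box1, box2):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    x1_min, y1_min, x1_max, y1_max = box1
    x2_min, y2_min, x2_max, y2_max = box2
    area1 = max(0, x1_max - x1_min) * max(0, y1_max - y1_min)
    area2 = max(0, x2_max - x2_min) * max(0, y2_max - y2_min)
    ix_min, ix_max = max(x1_min, x2_min), min(x1_max, x2_max)
    iy_min, iy_max = max(y1_min, y2_min), min(y1_max, y2_max)
    # clamp each edge so disjoint boxes yield 0 instead of a positive product
    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    union = area1 + area2 - intersection
    return 1.0 * intersection / union if union != 0 else 0.0

# disjoint boxes: the old expression computed (-1) * (-1) = 1 as the intersection
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0
# partial overlap: intersection 1, union 4 + 4 - 1 = 7
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # ≈ 0.1429
```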

‎data_juicer/utils/registry.py

+1 −4

```diff
@@ -17,8 +17,6 @@
 # https://github.com/modelscope/modelscope/blob/master/modelscope/utils/registry.py
 # --------------------------------------------------------
 
-from loguru import logger
-
 
 class Registry(object):
     """This class is used to register some modules to registry by a repo
@@ -53,8 +51,7 @@ def modules(self):
 
     def list(self):
         """Logging the list of module in current registry."""
-        for m in self._modules.keys():
-            logger.info(f'{self._name}\t{m}')
+        return list(self._modules.keys())
 
     def get(self, module_key):
         """
```

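Since `Registry.list()` now returns the registered names instead of logging them, a caller that wants the old behavior logs the result itself. A hedged sketch of the new call pattern (the `OPERATORS` registry below is created purely for illustration):

```python
from loguru import logger

from data_juicer.utils.registry import Registry

OPERATORS = Registry('operators')  # illustrative registry, not the real one

@OPERATORS.register_module('demo_op')
class DemoOp:
    pass

# previously OPERATORS.list() logged every entry as a side effect;
# now it returns the names and the caller decides how to report them
for name in OPERATORS.list():
    logger.info(f'{OPERATORS.name}\t{name}')
```
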
‎tests/run.py

+6 −3

```diff
@@ -12,6 +12,10 @@
 import unittest
 import coverage
 
+# start the coverage immediately
+cov = coverage.Coverage(include='data_juicer/**')
+cov.start()
+
 from loguru import logger
 
 from data_juicer.utils.unittest_utils import set_clear_model_flag, get_partial_test_cases
@@ -91,12 +95,11 @@ def gather_test_cases(test_dir, pattern, tag, mode='partial'):
 
 
 def main():
-    cov = coverage.Coverage(include='data_juicer/**')
-    cov.start()
-
+    global cov
     runner = unittest.TextTestRunner()
     test_suite = gather_test_cases(os.path.abspath(args.test_dir),
                                    args.pattern, args.tag, args.mode)
+    logger.info(f'There are {len(test_suite._tests)} test cases to run.')
     res = runner.run(test_suite)
 
     cov.stop()
```
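
Starting coverage at import time means the `import data_juicer ...` statements further down in `tests/run.py` are executed while tracing is already active, so module-level code (registries, lazy loaders, version detection) is counted. This follows the usual coverage.py pattern, sketched here in isolation:

```python
import coverage

# begin measuring before importing the package under test; statements that
# run at import time would otherwise never be recorded
cov = coverage.Coverage(include='data_juicer/**')
cov.start()

from data_juicer.utils.registry import Registry  # noqa: E402, traced import

cov.stop()
cov.save()
cov.report()
```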

‎tests/utils/test_asset_utils.py

+57 (new file)

```python
import os
import json
import unittest

from data_juicer.utils.asset_utils import load_words_asset

from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class LoadWordsAssetTest(DataJuicerTestCaseBase):

    def setUp(self) -> None:
        self.temp_output_path = 'tmp/test_asset_utils/'

    def tearDown(self):
        if os.path.exists(self.temp_output_path):
            os.system(f'rm -rf {self.temp_output_path}')

    def test_basic_func(self):
        # download assets from the remote server
        words_dict = load_words_asset(self.temp_output_path, 'stopwords')
        self.assertTrue(len(words_dict) > 0)
        self.assertTrue(os.path.exists(os.path.join(self.temp_output_path, 'stopwords.json')))

        words_dict = load_words_asset(self.temp_output_path, 'flagged_words')
        self.assertTrue(len(words_dict) > 0)
        self.assertTrue(os.path.exists(os.path.join(self.temp_output_path, 'flagged_words.json')))

        # non-existing asset
        with self.assertRaises(ValueError):
            load_words_asset(self.temp_output_path, 'non_existing_asset')

    def test_load_from_existing_file(self):
        os.makedirs(self.temp_output_path, exist_ok=True)
        temp_asset = os.path.join(self.temp_output_path, 'temp_asset.json')
        with open(temp_asset, 'w') as fout:
            json.dump({'test_key': ['test_val']}, fout)

        words_list = load_words_asset(self.temp_output_path, 'temp_asset')
        self.assertEqual(len(words_list), 1)
        self.assertEqual(len(words_list['test_key']), 1)

    def test_load_from_serial_files(self):
        os.makedirs(self.temp_output_path, exist_ok=True)
        temp_asset = os.path.join(self.temp_output_path, 'temp_asset_v1.json')
        with open(temp_asset, 'w') as fout:
            json.dump({'test_key': ['test_val_1']}, fout)
        temp_asset = os.path.join(self.temp_output_path, 'temp_asset_v2.json')
        with open(temp_asset, 'w') as fout:
            json.dump({'test_key': ['test_val_2']}, fout)

        words_list = load_words_asset(self.temp_output_path, 'temp_asset')
        self.assertEqual(len(words_list), 1)
        self.assertEqual(len(words_list['test_key']), 2)


if __name__ == '__main__':
    unittest.main()
```

+12 (new test file)

```python
import unittest

from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class AutoInstallMappingTest(DataJuicerTestCaseBase):

    def test_placeholder(self):
        pass


if __name__ == '__main__':
    unittest.main()
```

+24 (new test file)

```python
import unittest

from data_juicer.utils.auto_install_utils import _is_module_installed, _is_package_installed

from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class IsXXXInstalledFuncsTest(DataJuicerTestCaseBase):

    def test_is_module_installed(self):
        self.assertTrue(_is_module_installed('datasets'))
        self.assertTrue(_is_module_installed('simhash'))

        self.assertFalse(_is_module_installed('non_existent_module'))

    def test_is_package_installed(self):
        self.assertTrue(_is_package_installed('datasets'))
        self.assertTrue(_is_package_installed('ram@git+https://github.com/xinyu1205/recognize-anything.git'))
        self.assertTrue(_is_package_installed('scenedetect[opencv]'))

        self.assertFalse(_is_package_installed('non_existent_package'))


if __name__ == '__main__':
    unittest.main()
```

+23 (new test file)

```python
import unittest

from data_juicer.utils.availability_utils import _is_package_available
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class AvailabilityUtilsTest(DataJuicerTestCaseBase):

    def test_is_package_available(self):
        exist = _is_package_available('fsspec')
        self.assertTrue(exist)
        exist, version = _is_package_available('fsspec', return_version=True)
        self.assertTrue(exist)
        self.assertEqual(version, '2023.5.0')

        exist = _is_package_available('non_existing_package')
        self.assertFalse(exist)
        exist, version = _is_package_available('non_existing_package', return_version=True)
        self.assertFalse(exist)
        self.assertEqual(version, 'N/A')


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_cache_utils.py

+36 (new file)

```python
import unittest

import datasets

from data_juicer.utils.cache_utils import DatasetCacheControl, dataset_cache_control

from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class DatasetCacheControlTest(DataJuicerTestCaseBase):

    def test_basic_func(self):
        self.assertTrue(datasets.is_caching_enabled())
        with DatasetCacheControl(on=False):
            self.assertFalse(datasets.is_caching_enabled())
        self.assertTrue(datasets.is_caching_enabled())

        with DatasetCacheControl(on=False):
            self.assertFalse(datasets.is_caching_enabled())
            with DatasetCacheControl(on=True):
                self.assertTrue(datasets.is_caching_enabled())
            self.assertFalse(datasets.is_caching_enabled())
        self.assertTrue(datasets.is_caching_enabled())

    def test_decorator(self):

        @dataset_cache_control(on=False)
        def check():
            return datasets.is_caching_enabled()

        self.assertTrue(datasets.is_caching_enabled())
        self.assertFalse(check())
        self.assertTrue(datasets.is_caching_enabled())


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_ckpt_utils.py

+81 (new file)

```python
import os
import unittest
import json

from data_juicer.core.data import NestedDataset
from data_juicer.utils.ckpt_utils import CheckpointManager
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class CkptUtilsTest(DataJuicerTestCaseBase):

    def setUp(self) -> None:
        self.temp_output_path = 'tmp/test_ckpt_utils/'

    def tearDown(self):
        if os.path.exists(self.temp_output_path):
            os.system(f'rm -rf {self.temp_output_path}')

    def test_basic_func(self):
        ckpt_path = os.path.join(self.temp_output_path, 'ckpt_1')
        manager = CheckpointManager(ckpt_path, original_process_list=[
            {'test_op_1': {'test_key': 'test_value_1'}},
            {'test_op_2': {'test_key': 'test_value_2'}},
        ])
        self.assertEqual(manager.get_left_process_list(), [
            {'test_op_1': {'test_key': 'test_value_1'}},
            {'test_op_2': {'test_key': 'test_value_2'}},
        ])
        self.assertFalse(manager.ckpt_available)

        self.assertFalse(manager.check_ckpt())
        os.makedirs(ckpt_path, exist_ok=True)
        os.makedirs(os.path.join(ckpt_path, 'latest'), exist_ok=True)
        with open(os.path.join(ckpt_path, 'ckpt_op.json'), 'w') as fout:
            json.dump([
                {'test_op_1': {'test_key': 'test_value_1'}},
            ], fout)
        self.assertTrue(manager.check_ops_to_skip())

        manager = CheckpointManager(ckpt_path, original_process_list=[
            {'test_op_1': {'test_key': 'test_value_1'}},
            {'test_op_2': {'test_key': 'test_value_2'}},
        ])
        with open(os.path.join(ckpt_path, 'ckpt_op.json'), 'w') as fout:
            json.dump([
                {'test_op_1': {'test_key': 'test_value_1'}},
                {'test_op_2': {'test_key': 'test_value_2'}},
            ], fout)
        self.assertFalse(manager.check_ops_to_skip())

    def test_different_ops(self):
        ckpt_path = os.path.join(self.temp_output_path, 'ckpt_2')
        os.makedirs(ckpt_path, exist_ok=True)
        os.makedirs(os.path.join(ckpt_path, 'latest'), exist_ok=True)
        with open(os.path.join(ckpt_path, 'ckpt_op.json'), 'w') as fout:
            json.dump([
                {'test_op_2': {'test_key': 'test_value_2'}},
            ], fout)
        manager = CheckpointManager(ckpt_path, original_process_list=[
            {'test_op_1': {'test_key': 'test_value_1'}},
            {'test_op_2': {'test_key': 'test_value_2'}},
        ])
        self.assertFalse(manager.ckpt_available)

    def test_save_and_load_ckpt(self):
        ckpt_path = os.path.join(self.temp_output_path, 'ckpt_3')
        test_data = {
            'text': ['text1', 'text2', 'text3'],
        }
        dataset = NestedDataset.from_dict(test_data)
        manager = CheckpointManager(ckpt_path, original_process_list=[])
        self.assertFalse(os.path.exists(os.path.join(manager.ckpt_ds_dir, 'dataset_info.json')))
        manager.record({'test_op_1': {'test_key': 'test_value_1'}})
        manager.save_ckpt(dataset)
        self.assertTrue(os.path.exists(os.path.join(manager.ckpt_ds_dir, 'dataset_info.json')))
        self.assertTrue(os.path.exists(manager.ckpt_op_record))
        loaded_ckpt = manager.load_ckpt()
        self.assertDatasetEqual(dataset, loaded_ckpt)


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_common_utils.py

+57 (new file)

```python
import unittest
import sys

from data_juicer.utils.common_utils import (
    stats_to_number, dict_to_hash, nested_access, is_string_list,
    avg_split_string_list_under_limit, is_float
)

from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class CommonUtilsTest(DataJuicerTestCaseBase):

    def test_stats_to_number(self):
        self.assertEqual(stats_to_number('1.0'), 1.0)
        self.assertEqual(stats_to_number([1.0, 2.0, 3.0]), 2.0)

        self.assertEqual(stats_to_number([]), -sys.maxsize)
        self.assertEqual(stats_to_number(None), -sys.maxsize)
        self.assertEqual(stats_to_number([], reverse=False), sys.maxsize)
        self.assertEqual(stats_to_number(None, reverse=False), sys.maxsize)

    def test_dict_to_hash(self):
        self.assertEqual(len(dict_to_hash({'a': 1, 'b': 2})), 64)
        self.assertEqual(len(dict_to_hash({'a': 1, 'b': 2}, hash_length=32)), 32)

    def test_nested_access(self):
        self.assertEqual(nested_access({'a': {'b': 1}}, 'a.b'), 1)
        self.assertEqual(nested_access({'a': [{'b': 1}]}, 'a.0.b', digit_allowed=True), 1)
        self.assertEqual(nested_access({'a': [{'b': 1}]}, 'a.0.b', digit_allowed=False), None)

    def test_is_string_list(self):
        self.assertTrue(is_string_list(['a', 'b', 'c']))
        self.assertFalse(is_string_list([1, 2, 3]))
        self.assertFalse(is_string_list(['a', 2, 'c']))

    def test_is_float(self):
        self.assertTrue(is_float('1.0'))
        self.assertTrue(is_float(1.0))
        self.assertTrue(is_float('1e-4'))
        self.assertFalse(is_float('a'))

    def test_avg_split_string_list_under_limit(self):
        test_data = [
            (['a', 'b', 'c'], [1, 2, 3], None, [['a', 'b', 'c']]),
            (['a', 'b', 'c'], [1, 2, 3], 3, [['a', 'b'], ['c']]),
            (['a', 'b', 'c'], [1, 2, 3], 2, [['a'], ['b'], ['c']]),
            (['a', 'b', 'c', 'd', 'e'], [1, 2, 3, 1, 1], 3, [['a', 'b'], ['c'], ['d', 'e']]),
            (['a', 'b', 'c'], [1, 2], 3, [['a', 'b', 'c']]),
            (['a', 'b', 'c'], [1, 2, 3], 100, [['a', 'b', 'c']]),
        ]

        for str_list, token_nums, max_token_num, expected_result in test_data:
            self.assertEqual(avg_split_string_list_under_limit(str_list, token_nums, max_token_num), expected_result)


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_compress.py

+216 (new file)

```python
import os
import json
import unittest

from datasets import config, load_dataset

from data_juicer.core import NestedDataset
from data_juicer.utils.compress import compress, decompress, cleanup_compressed_cache_files, CompressionOff
from data_juicer.utils import cache_utils
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class CacheCompressTest(DataJuicerTestCaseBase):

    def setUp(self) -> None:
        self.temp_output_path = 'tmp/test_compress/'
        self.test_data_path = self.temp_output_path + 'test.json'
        os.makedirs(self.temp_output_path, exist_ok=True)
        with open(self.test_data_path, 'w') as fout:
            json.dump([{'test_key_1': 'test_val_1'}], fout)
        self.ori_cache_dir = config.HF_DATASETS_CACHE
        config.HF_DATASETS_CACHE = self.temp_output_path

    def tearDown(self):
        if os.path.exists(self.temp_output_path):
            os.system(f'rm -rf {self.temp_output_path}')
        config.HF_DATASETS_CACHE = self.ori_cache_dir

    def test_basic_func(self):
        cache_utils.CACHE_COMPRESS = 'zstd'
        ds = load_dataset('json', data_files=self.test_data_path, split='train')
        prev_ds = ds.map(lambda s: {'test_key_2': 'test_val_2', **s})
        curr_ds = prev_ds.map(lambda s: {'test_key_3': 'test_val_3', **s})
        for fn in ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        for fn in prev_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        for fn in curr_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))

        # won't compress original dataset
        compress(ds, prev_ds)
        # cache files of the original dataset always exist
        for fn in ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
            self.assertFalse(os.path.exists(fn['filename'] + f'.{cache_utils.CACHE_COMPRESS}'))
        # cache files of the previous dataset are deleted
        for fn in prev_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        # cache files of the current dataset are kept
        for fn in curr_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))

        # compress previous dataset
        compress(prev_ds, curr_ds)
        # cache files of the original dataset always exist
        for fn in ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        # cache files of the previous dataset are deleted
        for fn in prev_ds.cache_files:
            self.assertFalse(os.path.exists(fn['filename']))
            self.assertTrue(os.path.exists(fn['filename'] + f'.{cache_utils.CACHE_COMPRESS}'))
        # cache files of the current dataset are kept
        for fn in curr_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))

        # decompress the previous dataset
        decompress(prev_ds)
        for fn in prev_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
            self.assertTrue(os.path.exists(fn['filename'] + f'.{cache_utils.CACHE_COMPRESS}'))

        # clean up the compressed cache files of the previous dataset
        cleanup_compressed_cache_files(prev_ds)
        for fn in prev_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
            self.assertFalse(os.path.exists(fn['filename'] + f'.{cache_utils.CACHE_COMPRESS}'))

    def test_dif_compress_method(self):
        cache_utils.CACHE_COMPRESS = 'gzip'
        ds = load_dataset('json', data_files=self.test_data_path, split='train')
        prev_ds = ds.map(lambda s: {'test_key_2': 'test_val_2', **s})
        curr_ds = prev_ds.map(lambda s: {'test_key_3': 'test_val_3', **s})
        for fn in ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        for fn in prev_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        for fn in curr_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))

        # won't compress original dataset
        compress(ds, prev_ds)
        # cache files of the original dataset always exist
        for fn in ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
            self.assertFalse(os.path.exists(fn['filename'] + f'.{cache_utils.CACHE_COMPRESS}'))
        # cache files of the previous dataset are deleted
        for fn in prev_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        # cache files of the current dataset are kept
        for fn in curr_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))

        # compress previous dataset
        compress(prev_ds, curr_ds)
        # cache files of the original dataset always exist
        for fn in ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        # cache files of the previous dataset are deleted
        for fn in prev_ds.cache_files:
            self.assertFalse(os.path.exists(fn['filename']))
            self.assertTrue(os.path.exists(fn['filename'] + f'.{cache_utils.CACHE_COMPRESS}'))
        # cache files of the current dataset are kept
        for fn in curr_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))

        # decompress the previous dataset
        decompress(prev_ds)
        for fn in prev_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
            self.assertTrue(os.path.exists(fn['filename'] + f'.{cache_utils.CACHE_COMPRESS}'))

        # clean up the compressed cache files of the previous dataset
        cleanup_compressed_cache_files(prev_ds)
        for fn in prev_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
            self.assertFalse(os.path.exists(fn['filename'] + f'.{cache_utils.CACHE_COMPRESS}'))

    def test_multiprocessing(self):
        cache_utils.CACHE_COMPRESS = 'zstd'
        ds = load_dataset('json', data_files=self.test_data_path, split='train')
        prev_ds = ds.map(lambda s: {'test_key_2': 'test_val_2', **s})
        curr_ds = prev_ds.map(lambda s: {'test_key_3': 'test_val_3', **s})
        for fn in ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        for fn in prev_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        for fn in curr_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        compress(prev_ds, curr_ds, num_proc=2)
        # cache files of the original dataset always exist
        for fn in ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        # cache files of the previous dataset are deleted
        for fn in prev_ds.cache_files:
            self.assertFalse(os.path.exists(fn['filename']))
            self.assertTrue(os.path.exists(fn['filename'] + f'.{cache_utils.CACHE_COMPRESS}'))
        # cache files of the current dataset are kept
        for fn in curr_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))

        # decompress the previous dataset
        decompress(prev_ds, num_proc=2)
        for fn in prev_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
            self.assertTrue(os.path.exists(fn['filename'] + f'.{cache_utils.CACHE_COMPRESS}'))

        # clean up the compressed cache files of the previous dataset
        cleanup_compressed_cache_files(prev_ds)
        for fn in prev_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
            self.assertFalse(os.path.exists(fn['filename'] + f'.{cache_utils.CACHE_COMPRESS}'))

    def test_compression_off(self):
        cache_utils.CACHE_COMPRESS = 'lz4'
        ds = load_dataset('json', data_files=self.test_data_path, split='train')
        prev_ds = ds.map(lambda s: {'test_key_2': 'test_val_2', **s})
        curr_ds = prev_ds.map(lambda s: {'test_key_3': 'test_val_3', **s})
        for fn in ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        for fn in prev_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        for fn in curr_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))

        # disable cache compression
        with CompressionOff():
            compress(prev_ds, curr_ds)
        # cache files of the original dataset always exist
        for fn in ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        # cache files of the previous dataset are deleted
        for fn in prev_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
            self.assertFalse(os.path.exists(fn['filename'] + f'.{cache_utils.CACHE_COMPRESS}'))
        # cache files of the current dataset are kept
        for fn in curr_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))

        # re-enable cache compression
        compress(prev_ds, curr_ds)
        # cache files of the original dataset always exist
        for fn in ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))
        # cache files of the previous dataset are deleted
        for fn in prev_ds.cache_files:
            self.assertFalse(os.path.exists(fn['filename']))
            self.assertTrue(os.path.exists(fn['filename'] + f'.{cache_utils.CACHE_COMPRESS}'))
        # cache files of the current dataset are kept
        for fn in curr_ds.cache_files:
            self.assertTrue(os.path.exists(fn['filename']))

    def test_dataset_without_cache(self):
        prev_ds = NestedDataset.from_list([{'test_key': 'test_val'}])
        curr_ds = prev_ds.map(lambda s: {'test_key_2': 'test_val_2', **s})
        # dataset from list does not have cache files
        self.assertTrue(len(prev_ds.cache_files) == 0)
        self.assertTrue(len(curr_ds.cache_files) == 0)
        compress(prev_ds, curr_ds)
        decompress(prev_ds)
        cleanup_compressed_cache_files(prev_ds)
        self.assertTrue(len(prev_ds.cache_files) == 0)
        self.assertTrue(len(curr_ds.cache_files) == 0)


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_constant.py

+45 (new file)

```python
import os
import unittest

from data_juicer.core import NestedDataset
from data_juicer.config import init_configs
from data_juicer.utils.constant import StatsKeys
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

test_yaml_path = os.path.join(os.path.dirname(os.path.realpath(__file__)),
                              '..',
                              'config',
                              'demo_4_test.yaml')

class StatsKeysTest(DataJuicerTestCaseBase):

    def setUp(self) -> None:
        super().setUp()
        StatsKeys._accessed_by = {}

    def tearDown(cls) -> None:
        super().tearDown()
        StatsKeys._accessed_by = {}

    def test_basic_func(self):
        cfg = init_configs(args=f'--config {test_yaml_path}'.split())
        res = StatsKeys.get_access_log(cfg)
        self.assertEqual(len(res), 1)  # only 1 filter
        self.assertIn('language_id_score_filter', res)
        self.assertEqual(res['language_id_score_filter'], {'lang', 'lang_score'})

        # obtain again
        res_2 = StatsKeys.get_access_log(cfg)
        self.assertEqual(res, res_2)

    def test_basic_func_with_dataset(self):
        dataset = NestedDataset.from_list([{'text': 'hello world'}])
        cfg = init_configs(args=f'--config {test_yaml_path}'.split())
        res = StatsKeys.get_access_log(cfg, dataset)
        self.assertEqual(len(res), 1)  # only 1 filter
        self.assertIn('language_id_score_filter', res)
        self.assertEqual(res['language_id_score_filter'], {'lang', 'lang_score'})


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_file_utils.py

+80 (new file)

```python
import os
import unittest
import regex as re

from data_juicer.utils.file_utils import (
    find_files_with_suffix, is_absolute_path,
    add_suffix_to_filename, create_directory_if_not_exists, transfer_filename,
    copy_data
)
from data_juicer.utils.mm_utils import Fields

from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class FileUtilsTest(DataJuicerTestCaseBase):

    def setUp(self) -> None:
        self.temp_output_path = 'tmp/test_file_utils/'
        os.makedirs(self.temp_output_path)

    def tearDown(self):
        if os.path.exists(self.temp_output_path):
            os.system(f'rm -rf {self.temp_output_path}')

    def test_find_files_with_suffix(self):
        # prepare test files
        fn_list = ['test1.txt', 'test2.txt', 'test3.md']
        for fn in fn_list:
            with open(os.path.join(self.temp_output_path, fn), 'w') as f:
                f.write(fn)

        self.assertEqual(find_files_with_suffix(os.path.join(self.temp_output_path, 'test1.txt')),
                         {'.txt': [os.path.join(self.temp_output_path, 'test1.txt')]})
        self.assertEqual(find_files_with_suffix(self.temp_output_path),
                         {'.txt': [os.path.join(self.temp_output_path, 'test1.txt'), os.path.join(self.temp_output_path, 'test2.txt')],
                          '.md': [os.path.join(self.temp_output_path, 'test3.md')]})
        self.assertEqual(find_files_with_suffix(self.temp_output_path, 'txt'),
                         {'.txt': [os.path.join(self.temp_output_path, 'test1.txt'), os.path.join(self.temp_output_path, 'test2.txt')]})

    def test_is_absolute_path(self):
        self.assertFalse(is_absolute_path(self.temp_output_path))
        self.assertTrue(is_absolute_path(os.path.abspath(self.temp_output_path)))

    def test_add_suffix_to_filename(self):
        self.assertEqual(add_suffix_to_filename('test.txt', '_suffix'), 'test_suffix.txt')
        self.assertEqual(add_suffix_to_filename('test.txt', ''), 'test.txt')
        self.assertEqual(add_suffix_to_filename('test', '_suffix'), 'test_suffix')
        self.assertEqual(add_suffix_to_filename('.git', '_suffix'), '.git_suffix')

    def test_create_directory_if_not_exists(self):
        self.assertTrue(os.path.exists(self.temp_output_path))
        create_directory_if_not_exists(self.temp_output_path)
        self.assertTrue(os.path.exists(self.temp_output_path))
        os.rmdir(self.temp_output_path)
        self.assertFalse(os.path.exists(self.temp_output_path))
        create_directory_if_not_exists(self.temp_output_path)
        self.assertTrue(os.path.exists(self.temp_output_path))

    def test_transfer_filename(self):
        self.assertTrue(
            re.match(
                os.path.join(self.temp_output_path, Fields.multimodal_data_output_dir, 'op1', 'abc__dj_hash_#(.*?)#.jpg'),
                transfer_filename(os.path.join(self.temp_output_path, 'abc.jpg'), 'op1')))

    def test_copy_data(self):
        tgt_fn = 'test.txt'
        ori_dir = os.path.join(self.temp_output_path, 'test1')
        tgt_dir = os.path.join(self.temp_output_path, 'test2')

        self.assertFalse(copy_data(ori_dir, tgt_dir, tgt_fn))

        os.makedirs(ori_dir, exist_ok=True)
        with open(os.path.join(ori_dir, tgt_fn), 'w') as f:
            f.write('test')

        self.assertTrue(copy_data(ori_dir, tgt_dir, tgt_fn))
        self.assertTrue(os.path.exists(os.path.join(tgt_dir, tgt_fn)))


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_fingerprint_utils.py

+21 (new file)

```python
import unittest

from data_juicer.core import NestedDataset
from data_juicer.utils.fingerprint_utils import generate_fingerprint
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class FingerprintUtilsTest(DataJuicerTestCaseBase):

    def test_generate_fingerprint(self):
        dataset = NestedDataset.from_list([{'text_key': 'test_val'}])
        fingerprint = generate_fingerprint(dataset)
        self.assertLessEqual(len(fingerprint), 64)

        # with func args
        new_fingerprint = generate_fingerprint(dataset, lambda x: x['text_key'])
        self.assertLessEqual(len(new_fingerprint), 64)
        self.assertNotEqual(new_fingerprint, fingerprint)


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_lazy_loader.py

+21 (new file)

```python
import unittest

from types import ModuleType

from data_juicer.utils.lazy_loader import LazyLoader
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class LazyLoaderTest(DataJuicerTestCaseBase):

    def test_basic_func(self):
        torch = LazyLoader('torch', 'torch')
        # it's a LazyLoader at the beginning
        self.assertIsInstance(torch, LazyLoader)
        # invoke it or check the dir to install and activate it
        self.assertIsInstance(dir(torch), list)
        # it's a real module now
        self.assertIsInstance(torch, ModuleType)


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_logger_utils.py

+113 (new file)

```python
import os
import unittest
import jsonlines
import regex as re
from loguru import logger

import data_juicer.utils.logger_utils
from data_juicer.utils.logger_utils import setup_logger, get_log_file_path, make_log_summarization

from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

@unittest.skip('This case could break the logger.')
class LoggerUtilsTest(DataJuicerTestCaseBase):

    def setUp(self) -> None:
        self.temp_output_path = 'tmp/test_logger_utils/'
        data_juicer.utils.logger_utils.LOGGER_SETUP = False

    def tearDown(self):
        if os.path.exists(self.temp_output_path):
            os.system(f'rm -rf {self.temp_output_path}')

    def get_log_messages(self, content):
        lines = content.strip().split('\n')
        messages = []
        for line in lines:
            line = line.strip()
            if line:
                if ' - ' in line:
                    messages.append(' - '.join(line.strip().split(' - ')[1:]))
                else:
                    messages.append(line)
        return messages

    def test_logger_utils(self):
        setup_logger(self.temp_output_path)
        logger.info('info test')
        logger.warning('warning test')
        logger.error('error test')
        logger.debug('debug test')
        print('extra normal info')
        self.assertTrue(os.path.exists(os.path.join(self.temp_output_path, 'log.txt')))
        self.assertTrue(os.path.exists(os.path.join(self.temp_output_path, 'log_ERROR.txt')))
        self.assertTrue(os.path.exists(os.path.join(self.temp_output_path, 'log_WARNING.txt')))
        self.assertTrue(os.path.exists(os.path.join(self.temp_output_path, 'log_DEBUG.txt')))
        with open(os.path.join(self.temp_output_path, 'log.txt'), 'r') as f:
            content = f.read()
            messages = self.get_log_messages(content)
            self.assertEqual(len(messages), 5)
            self.assertEqual(messages, ['info test', 'warning test', 'error test', 'debug test', 'extra normal info'])

        with jsonlines.open(os.path.join(self.temp_output_path, 'log_ERROR.txt'), 'r') as reader:
            messages = [line for line in reader]
            self.assertEqual(len(messages), 1)
            self.assertEqual(messages[0]['record']['message'], 'error test')
        with jsonlines.open(os.path.join(self.temp_output_path, 'log_WARNING.txt'), 'r') as reader:
            messages = [line for line in reader]
            self.assertEqual(len(messages), 1)
            self.assertEqual(messages[0]['record']['message'], 'warning test')
        with jsonlines.open(os.path.join(self.temp_output_path, 'log_DEBUG.txt'), 'r') as reader:
            messages = [line for line in reader]
            self.assertEqual(len(messages), 1)
            self.assertEqual(messages[0]['record']['message'], 'debug test')

        self.assertEqual(get_log_file_path(), os.path.abspath(os.path.join(self.temp_output_path, 'log.txt')))

        # setup again
        setup_logger(os.path.join(self.temp_output_path, 'second_setup'))
        logger.info('info test')
        self.assertTrue(os.path.exists(os.path.join(self.temp_output_path, 'log.txt')))
        self.assertFalse(os.path.exists(os.path.join(self.temp_output_path, 'second_setup', 'log.txt')))

    def test_make_log_summarization(self):
        setup_logger(self.temp_output_path)
        logger.info('normal log 1')
        logger.error(f'An error occurred in fake_op_1 when processing sample '
                     f'"fake_sample_1" -- {ModuleNotFoundError}: err msg 1')
        logger.info('normal log 2')
        logger.warning('warning message')
        logger.info('normal log 3')
        logger.error(f'An error occurred in fake_op_2 when processing sample '
                     f'"fake_sample_1" -- {ValueError}: err msg 1')
        logger.info('normal log 4')
        logger.error(f'An error occurred in fake_op_3 when processing sample '
                     f'"fake_sample_3" -- {ModuleNotFoundError}: err msg 3')
        logger.info('normal log 5')

        make_log_summarization()
        with open(os.path.join(self.temp_output_path, 'log.txt')) as f:
            content = f.read()
        # find start words
        self.assertIn('Processing finished with:', content)
        # find number of warnings and errors
        warn_num = re.findall(r'Warnings: (\d+)', content)
        self.assertEqual(len(warn_num), 1)
        self.assertEqual(int(warn_num[0]), 1)
        err_num = re.findall(r'Errors: (\d+)', content)
        self.assertEqual(len(err_num), 1)
        self.assertEqual(int(err_num[0]), 3)
        # find table head content
        self.assertIn('OP/Method', content)
        self.assertIn('Error Type', content)
        self.assertIn('Error Message', content)
        self.assertIn('Error Count', content)
        # find end words
        log_fn = re.findall(r'Error/Warning details can be found in the log file \[(.+)\] and its related log files\.', content)
        self.assertEqual(len(log_fn), 1)
        self.assertEqual(log_fn[0], os.path.abspath(os.path.join(self.temp_output_path, 'log.txt')))
        self.assertTrue(os.path.exists(log_fn[0]))


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_mm_utils.py

+352 (new file; large diffs are not rendered by default)

‎tests/utils/test_model_utils.py

+20 (new file)

```python
import unittest

from data_juicer.utils.model_utils import get_backup_model_link
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

# other funcs are called by ops already
class ModelUtilsTest(DataJuicerTestCaseBase):

    def test_get_backup_model_link(self):
        test_data = [
            ('lid.176.bin', 'https://dl.fbaipublicfiles.com/fasttext/supervised-models/'),  # exact match
            ('zh.sp.model', 'https://huggingface.co/edugp/kenlm/resolve/main/wikipedia/'),  # pattern match
            ('invalid_model_name', None),  # invalid model name
        ]
        for model_name, expected_link in test_data:
            self.assertEqual(get_backup_model_link(model_name), expected_link)


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_process_utils.py

+30 (new file)

```python
import unittest
import torch
import multiprocess as mp

from data_juicer.utils.process_utils import setup_mp, get_min_cuda_memory, calculate_np
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class ProcessUtilsTest(DataJuicerTestCaseBase):

    def test_setup_mp(self):
        all_methods = mp.get_all_start_methods()
        setup_mp()
        self.assertIn(mp.get_start_method(), all_methods)

        setup_mp('spawn')
        self.assertEqual(mp.get_start_method(), 'spawn')

        setup_mp(['spawn', 'forkserver', 'fork'])
        self.assertEqual(mp.get_start_method(), 'spawn')

    def test_get_min_cuda_memory(self):
        if torch.cuda.is_available():
            self.assertIsInstance(get_min_cuda_memory(), int)
        else:
            with self.assertRaises(AssertionError):
                get_min_cuda_memory()


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_registry.py

+33 (new file)

```python
import unittest

from data_juicer.utils.registry import Registry
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class RegistryTest(DataJuicerTestCaseBase):

    def test_basic_func(self):
        registry = Registry('test')

        class A:
            pass
        registry.register_module('module_a', A)

        @registry.register_module('module_b')
        class B:
            pass

        self.assertEqual(registry.name, 'test')
        self.assertEqual(registry.modules, {'module_a': A, 'module_b': B})
        self.assertEqual(registry.list(), ['module_a', 'module_b'])
        self.assertEqual(registry.get('module_a'), A)
        self.assertEqual(registry.get('module_b'), B)

        with self.assertRaises(KeyError):
            registry.register_module('module_b', B)

        with self.assertRaises(TypeError):
            registry.register_module(1, A)


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_resource_utils.py

+27 (new file)

```python
import unittest
import torch

from data_juicer.utils.resource_utils import query_cuda_info, query_mem_info, get_cpu_count, get_cpu_utilization
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class RegistryTest(DataJuicerTestCaseBase):

    def test_query_cuda_info(self):
        if torch.cuda.is_available():
            self.assertIsNotNone(query_cuda_info('memory.used'))
        else:
            self.assertIsNone(query_cuda_info('memory.used'))

    def test_query_mem_info(self):
        self.assertIsInstance(query_mem_info('total'), float)
        self.assertIsNone(query_mem_info('invalid key'))

    def test_get_cpu_count(self):
        self.assertIsInstance(get_cpu_count(), int)

    def test_get_cpu_utilization(self):
        self.assertIsInstance(get_cpu_utilization(), float)


if __name__ == '__main__':
    unittest.main()
```

‎tests/utils/test_unittest_utils.py

+1 −1

```diff
@@ -2,7 +2,7 @@
 
 from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
 
-class TextTokenDistCollectorTest(DataJuicerTestCaseBase):
+class UnittestUtilsTest(DataJuicerTestCaseBase):
 
     def test_placeholder(self):
         # placeholder for test
```

‎tools/quality_classifier/README.md

+2 −1

```diff
@@ -1,11 +1,12 @@
+English | [中文](./README_ZH.md)
 
 # Data Scoring Capabilities
 
 Data-Juicer provides a set of data scoring capabilities to help you evaluate your datasets.
 
 - All [Filter Operators](../../docs/Operators.md) include a sub-process called `compute_stats`, which calculates statistical measurements based on a pre-defined functional goal of the operator. These measurements are typically derived using simple rules, auxiliary models, or advanced algorithms, such as for perplexity, length, modality matching scores, etc. The `Analyzer` will automatically aggregate these statistics and report them in the resulting dataset.
 
-- Prompt-based LLM scoring operators are also available, such as `llm_api_difficulty_score_filter` and `llm_api_quality_score_filter`. These operators come with default prompts for general use cases but also offer flexibility for users to customize their own models or specific requirements.
+- Prompt-based LLM scoring operators are also available, such as `llm_difficulty_score_filter` and `llm_quality_score_filter`. These operators come with default prompts for general use cases but also offer flexibility for users to customize their own models or specific requirements.
 
 - Additionally, we provide a toolkit to reproduce the GPT-3 quality classifier, as described in the following section.
 
```

‎tools/quality_classifier/README_ZH.md

+3 −1

```diff
@@ -1,10 +1,12 @@
+中文 | [English](./README.md)
+
 # Data Scoring Capabilities
 
 Data-Juicer provides a set of data scoring capabilities to help you evaluate your datasets.
 
 - All [Filter OPs](../../docs/Operators.md) include a sub-process named `compute_stats`, which computes statistical measurements according to the operator's pre-defined functional goal. These measurements are typically derived using simple rules, auxiliary models, or advanced algorithms, such as perplexity, length, modality matching scores, etc. The `Analyzer` will automatically aggregate these statistics and report them in the resulting dataset.
 
-- Prompt-based LLM scoring operators are also provided, such as `llm_api_difficulty_score_filter` and `llm_api_quality_score_filter`. These operators come with default prompts for general use cases, but also give users the flexibility to customize specific models or specific requirements.
+- Prompt-based LLM scoring operators are also provided, such as `llm_difficulty_score_filter` and `llm_quality_score_filter`. These operators come with default prompts for general use cases, but also give users the flexibility to customize specific models or specific requirements.
 
 - Additionally, we provide a toolkit to reproduce the GPT-3 quality classifier, as described in the next section.
 
```