modelscope
diff --git a/README.md b/README.md
Lines changed: 8 additions & 5 deletions
diff --git a/README_ZH.md b/README_ZH.md
Lines changed: 7 additions & 4 deletions
diff --git a/data_juicer/__init__.py b/data_juicer/__init__.py
Lines changed: 1 addition & 1 deletion
diff --git a/data_juicer/utils/asset_utils.py b/data_juicer/utils/asset_utils.py
Lines changed: 2 additions & 0 deletions
diff --git a/data_juicer/utils/compress.py b/data_juicer/utils/compress.py
Lines changed: 3 additions & 3 deletions
diff --git a/data_juicer/utils/constant.py b/data_juicer/utils/constant.py
Lines changed: 3 additions & 2 deletions
diff --git a/data_juicer/utils/mm_utils.py b/data_juicer/utils/mm_utils.py
Lines changed: 3 additions & 3 deletions
diff --git a/data_juicer/utils/registry.py b/data_juicer/utils/registry.py
Lines changed: 1 addition & 4 deletions
diff --git a/tests/run.py b/tests/run.py
Lines changed: 6 additions & 3 deletions
diff --git a/tests/utils/test_asset_utils.py b/tests/utils/test_asset_utils.py
Lines changed: 57 additions & 0 deletions
diff --git a/tests/utils/test_auto_install_mapping.py b/tests/utils/test_auto_install_mapping.py
Lines changed: 12 additions & 0 deletions
diff --git a/tests/utils/test_auto_install_utils.py b/tests/utils/test_auto_install_utils.py
Lines changed: 24 additions & 0 deletions
diff --git a/tests/utils/test_availablility_utils.py b/tests/utils/test_availablility_utils.py
Lines changed: 23 additions & 0 deletions
@@ -36,12 +36,13 @@ Data-Juicer is being actively updated and maintained. We will periodically enhan
 
 
 ## News
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-03-13] We propose a new data synthesis method, *MindGym*, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., *16%* gain on [MathVision](https://mathllm.github.io/mathvision/#leaderboard) using only *400 samples*). See more details in  [MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https://arxiv.org/abs/2503.09499).
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-28] DJ has been integrated in [Ray's official Ecosystem](https://docs.ray.io/en/latest/ray-overview/ray-libraries.html) and [Example Gallery](https://docs.ray.io/en/latest/data/examples/data_juicer_distributed_data_processing.html). Besides, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by [Apache Arrow](https://github.com/apache/arrow/pull/45084). 
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-27] Our work on contrastive data synthesis, [ImgDiff](https://arxiv.org/pdf/2408.04594), has been accepted by *CVPR 2025*!
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-05] We propose a new data selection method, *DaaR*, which is theoretically informed, via treating diversity as a reward, achieves better overall performance across 7 benchmarks when post-training SOTA LLMs. See more details in [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DaaR_arXiv_preview.pdf).
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-05] We propose a new data selection method, *DaaR*, which is theoretically informed, via treating diversity as a reward, achieves better overall performance across 7 benchmarks when post-training SOTA LLMs. See more details in [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380).
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-11] We release our 2.0 paper, [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](https://arxiv.org/abs/2501.14755). It now can process 70B data samples within 2.1h, using 6400 CPU cores on 50 Ray nodes from Alibaba Cloud cluster, and deduplicate 5TB data within 2.8h using 1280 CPU cores on 8 Ray nodes.
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-03] We support post-tuning scenarios better, via 20+ related new [OPs](https://github.com/modelscope/data-juicer/releases/tag/v1.0.2), and via unified [dataset format](https://github.com/modelscope/data-juicer/releases/tag/v1.0.3) compatible to LLaMA-Factory and ModelScope-Swift.
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-12-17] We propose *HumanVBench*, which comprises 17 human-centric tasks with synthetic data, benchmarking video-MLLMs' capabilities from views of inner emotion and outer manifestations. See more details in our [paper](https://arxiv.org/abs/2412.17574), and try to [evaluate](https://github.com/modelscope/data-juicer/tree/HumanVBench) your models with it.
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-12-17] We propose *HumanVBench*, which comprises 16 human-centric tasks with synthetic data, benchmarking 22 video-MLLMs' capabilities from views of inner emotion and outer manifestations. See more details in our [paper](https://arxiv.org/abs/2412.17574), and try to [evaluate](https://github.com/modelscope/data-juicer/tree/HumanVBench) your models with it.
 
 <details>
 <summary> History News:
@@ -511,7 +512,7 @@ If you find Data-Juicer useful for your research or development, please kindly c
 ```
 
 <details>
-<summary> More related papers from the Data-Juicer Team:
+<summary> More data-related papers from the Data-Juicer Team:
 </summary>>
 
 - [Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784)
@@ -522,8 +523,10 @@ If you find Data-Juicer useful for your research or development, please kindly c
 
 - [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583)
 
-- [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DaaR_arXiv_preview.pdf)
-  
+- [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380)
+
+- [MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https://arxiv.org/abs/2503.09499)
+
 - [BiMix: A Bivariate Data Mixing Law for Language Model Pretraining](https://arxiv.org/abs/2405.14908)
 
 </details>
 
@@ -32,12 +32,13 @@ Data-Juicer正在积极更新和维护中，我们将定期强化和新增更多
 ----
 
 ## 新消息
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-03-13] 我们提出了一种新的数据合成方法 *MindGym*，该方法鼓励 LLM 自我生成具有挑战性的认知问题，实现优于 SOTA 基线的数据效率、跨模态泛化和 SFT 效果（例如，仅使用 *400 个样本* 即可在 [MathVision](https://mathllm.github.io/mathvision/#leaderboard) 上获得 *16%* 的增益）。有关更多详细信息，请参阅[MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https://arxiv.org/abs/2503.09499)。
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-28] DJ 已被集成到 [Ray官方 Ecosystem](https://docs.ray.io/en/latest/ray-overview/ray-libraries.html) 和 [Example Gallery](https://docs.ray.io/en/latest/data/examples/data_juicer_distributed_data_processing.html)。此外，我们在 DJ2.0 中的流式 JSON 加载补丁已被 [Apache Arrow 官方集成](https://github.com/apache/arrow/pull/45084)。
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-27] 我们的对比数据合成工作， [ImgDiff](https://arxiv.org/pdf/2408.04594)， 已被 *CVPR 2025* 接收！
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-05] 我们提出了一种新的数据选择方法 *DaaR*，该方法基于理论指导，将数据多样性建模为奖励信号，在 7 个基准测试中，微调 SOTA LLMs 取得了更好的整体表现。有关更多详细信息，请参阅 [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DaaR_arXiv_preview.pdf) 。
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-02-05] 我们提出了一种新的数据选择方法 *DaaR*，该方法基于理论指导，将数据多样性建模为奖励信号，在 7 个基准测试中，微调 SOTA LLMs 取得了更好的整体表现。有关更多详细信息，请参阅 [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380) 。
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-11] 我们发布了 2.0 版论文 [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](https://arxiv.org/abs/2501.14755)。DJ现在可以使用阿里云集群中 50 个 Ray 节点上的 6400 个 CPU 核心在 2.1 小时内处理 70B 数据样本，并使用 8 个 Ray 节点上的 1280 个 CPU 核心在 2.8 小时内对 5TB 数据进行重复数据删除。
 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-03] 我们通过 20 多个相关的新 [OP](https://github.com/modelscope/data-juicer/releases/tag/v1.0.2) 以及与 LLaMA-Factory 和 ModelScope-Swift 兼容的统一 [数据集格式](https://github.com/modelscope/data-juicer/releases/tag/v1.0.3) 更好地支持Post-Tuning场景。
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-12-17] 我们提出了 *HumanVBench*，它包含 17 个以人为中心的任务，使用合成数据，从内在情感和外在表现的角度对视频 MLLM 的能力进行基准测试。请参阅我们的 [论文](https://arxiv.org/abs/2412.17574) 中的更多详细信息，并尝试使用它 [评估](https://github.com/modelscope/data-juicer/tree/HumanVBench) 您的模型。
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-12-17] 我们提出了 *HumanVBench*，它包含 16 个以人为中心的任务，使用合成数据，从内在情感和外在表现的角度对22个视频 MLLM 的能力进行基准测试。请参阅我们的 [论文](https://arxiv.org/abs/2412.17574) 中的更多详细信息，并尝试使用它 [评估](https://github.com/modelscope/data-juicer/tree/HumanVBench) 您的模型。
 
 <details>
 <summary> History News:
@@ -492,7 +493,7 @@ Data-Juicer 感谢社区[贡献者](https://github.com/modelscope/data-juicer/gr
 }
 ```
 <details>
-<summary>更多Data-Juicer团队相关论文:
+<summary>更多Data-Juicer团队关于数据的论文:
 </summary>>
 
 - [Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784)
@@ -503,7 +504,9 @@ Data-Juicer 感谢社区[贡献者](https://github.com/modelscope/data-juicer/gr
 
 - [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583)
 
-- [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DaaR_arXiv_preview.pdf)
+- [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https://www.arxiv.org/abs/2502.04380)
+
+- [MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https://arxiv.org/abs/2503.09499)
 
 - [BiMix: A Bivariate Data Mixing Law for Language Model Pretraining](https://arxiv.org/abs/2405.14908)
 
 
@@ -1,4 +1,4 @@
-__version__ = '1.2.1'
+__version__ = '1.2.2'
 
 import os
 import subprocess
 
@@ -48,6 +48,8 @@ def load_words_asset(words_dir: str, words_type: str):
         logger.info(f'Specified {words_dir} does not contain '
                     f'any {words_type} files in json format, now '
                     'download the one cached by data_juicer team')
+        if words_type not in ASSET_LINKS:
+            raise ValueError(f'{words_type} is not in remote server.')
         response = requests.get(ASSET_LINKS[words_type])
         words_dict = response.json()
         # cache the asset file locally
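
Note on the hunk above: the added check turns an unknown `words_type` into an explicit `ValueError` before any download is attempted. A minimal sketch of that guard follows, assuming `ASSET_LINKS` is a plain dict keyed by asset type. The helper name `resolve_asset_url` and the URLs are hypothetical placeholders and are not part of the patch.

```python
# Hypothetical stand-in for the real ASSET_LINKS mapping; URLs are placeholders.
ASSET_LINKS = {
    'stopwords': 'https://example.com/stopwords.json',
    'flagged_words': 'https://example.com/flagged_words.json',
}


def resolve_asset_url(words_type: str) -> str:
    """Return the download URL for words_type, or raise ValueError."""
    # Same guard as in the patch: fail early with a descriptive message
    # instead of letting the dict lookup raise a bare KeyError.
    if words_type not in ASSET_LINKS:
        raise ValueError(f'{words_type} is not in remote server.')
    return ASSET_LINKS[words_type]
```

With a plain dict, the unguarded lookup would raise `KeyError`, which is why the new test expects `ValueError` for `'non_existing_asset'`.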
 
@@ -23,8 +23,8 @@ class FileLock(HF_FileLock):
     def _release(self):
         super()._release()
         try:
-            # logger.debug(f'Remove {self._lock_file}')
-            os.remove(self._lock_file)
+            # logger.debug(f'Remove {self.lock_file}')
+            os.remove(self.lock_file)
         # The file is already deleted and that's what we want.
         except OSError:
             pass
@@ -497,4 +497,4 @@ def decompress(ds, fingerprints=None, num_proc=1):
 
 
 def cleanup_compressed_cache_files(ds):
-    CacheCompressManager().cleanup_cache_files(ds)
+    CacheCompressManager(cache_utils.CACHE_COMPRESS).cleanup_cache_files(ds)
@@ -172,14 +172,15 @@ def get_access_log(cls, dj_cfg=None, dataset=None):
                 elif 'jsonl' in dj_cfg.dataset_path:
                     tmp_f_name = dj_cfg.dataset_path. \
                         replace('.jsonl', '.tmp.jsonl')
-                    with open(dj_cfg.dataset_path, 'r') as orig_file:
+                    with open(dj_cfg.dataset_path, 'r',
+                              encoding='utf-8') as orig_file:
                         first_line = orig_file.readline()
 
                 assert tmp_f_name is not None and first_line is not None, \
                     'error when loading the first line, when ' \
                     f'dj_cfg.dataset_path={dj_cfg.dataset_path}'
 
-                with open(tmp_f_name, 'w') as tmp_file:
+                with open(tmp_f_name, 'w', encoding='utf-8') as tmp_file:
                     tmp_file.write(first_line)
 
                 tmp_dj_cfg.dataset_path = tmp_f_name
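
Note on the hunk above: when `encoding` is omitted, `open()` falls back to a locale-dependent default, so a UTF-8 `.jsonl` file containing non-ASCII text may fail to round-trip on some platforms. The sketch below mirrors the read-first-line-then-write pattern with an explicit encoding. The function names `copy_first_line` and `demo` are hypothetical, and the sample data is made up for illustration.

```python
import json
import os
import tempfile


def copy_first_line(src_path: str) -> str:
    """Copy the first line of a .jsonl file into a sibling .tmp.jsonl file."""
    tmp_path = src_path.replace('.jsonl', '.tmp.jsonl')
    # Explicit UTF-8 on both sides keeps the behavior platform-independent.
    with open(src_path, 'r', encoding='utf-8') as orig_file:
        first_line = orig_file.readline()
    with open(tmp_path, 'w', encoding='utf-8') as tmp_file:
        tmp_file.write(first_line)
    return tmp_path


def demo() -> dict:
    """Write a two-line sample file and return its parsed first record."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        src = os.path.join(tmp_dir, 'demo.jsonl')
        with open(src, 'w', encoding='utf-8') as f:
            f.write(json.dumps({'text': '数据'}, ensure_ascii=False) + '\n')
            f.write(json.dumps({'text': 'second'}) + '\n')
        tmp_path = copy_first_line(src)
        with open(tmp_path, 'r', encoding='utf-8') as f:
            return json.loads(f.readline())
```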
 
@@ -160,9 +160,9 @@ def iou(box1, box2):
     ix_max = min(x1_max, x2_max)
     iy_min = max(y1_min, y2_min)
     iy_max = min(y1_max, y2_max)
-    intersection = max(0, (ix_max - ix_min) * (iy_max - iy_min))
+    intersection = max(0, max(0, ix_max - ix_min) * max(0, iy_max - iy_min))
     union = area1 + area2 - intersection
-    return 1.0 * intersection / union
+    return 1.0 * intersection / union if union != 0 else 0.0
 
 
 def calculate_resized_dimensions(
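
Note on the `iou` hunk above: in the old expression, two disjoint boxes that are separated on both axes produce two negative edge lengths whose product is positive, so the outer `max(0, ...)` does not catch it. The sketch below reproduces both variants side by side. It assumes boxes are `(x_min, y_min, x_max, y_max)` tuples, which is a guess at the layout and not confirmed by the diff, and the names `iou_before` and `iou_after` are hypothetical.

```python
def _areas_and_overlap_edges(box1, box2):
    """Return both box areas and the signed overlap width and height."""
    # Assumed layout: (x_min, y_min, x_max, y_max).
    x1_min, y1_min, x1_max, y1_max = box1
    x2_min, y2_min, x2_max, y2_max = box2
    area1 = (x1_max - x1_min) * (y1_max - y1_min)
    area2 = (x2_max - x2_min) * (y2_max - y2_min)
    overlap_w = min(x1_max, x2_max) - max(x1_min, x2_min)
    overlap_h = min(y1_max, y2_max) - max(y1_min, y2_min)
    return area1, area2, overlap_w, overlap_h


def iou_before(box1, box2):
    """Old behavior: clamp only the product of the two edge lengths."""
    area1, area2, w, h = _areas_and_overlap_edges(box1, box2)
    intersection = max(0, w * h)
    union = area1 + area2 - intersection
    return 1.0 * intersection / union


def iou_after(box1, box2):
    """Patched behavior: clamp each edge, and guard a zero union."""
    area1, area2, w, h = _areas_and_overlap_edges(box1, box2)
    intersection = max(0, max(0, w) * max(0, h))
    union = area1 + area2 - intersection
    return 1.0 * intersection / union if union != 0 else 0.0
```

For the disjoint unit boxes `(0, 0, 1, 1)` and `(2, 2, 3, 3)`, the old variant reports an IoU of 1.0 while the patched one reports 0.0. The `union != 0` guard additionally avoids a `ZeroDivisionError` when both boxes have zero area.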
@@ -207,7 +207,7 @@ def calculate_resized_dimensions(
 
     # Determine final dimensions based on original orientation
     resized_dimensions = ((new_short_edge,
-                           new_long_edge) if width <= height else
+                           new_long_edge) if width >= height else
                           (new_long_edge, new_short_edge))
 
     # Ensure final dimensions are divisible by the specified value
 
@@ -17,8 +17,6 @@
 #  https://github.com/modelscope/modelscope/blob/master/modelscope/utils/registry.py
 # --------------------------------------------------------
 
-from loguru import logger
-
 
 class Registry(object):
     """This class is used to register some modules to registry by a repo
@@ -53,8 +51,7 @@ def modules(self):
 
     def list(self):
         """Logging the list of module in current registry."""
-        for m in self._modules.keys():
-            logger.info(f'{self._name}\t{m}')
+        return list(self._modules.keys())
 
     def get(self, module_key):
         """
 
@@ -12,6 +12,10 @@
 import unittest
 import coverage
 
+# start the coverage immediately
+cov = coverage.Coverage(include='data_juicer/**')
+cov.start()
+
 from loguru import logger
 
 from data_juicer.utils.unittest_utils import set_clear_model_flag, get_partial_test_cases
@@ -91,12 +95,11 @@ def gather_test_cases(test_dir, pattern, tag, mode='partial'):
 
 
 def main():
-    cov = coverage.Coverage(include='data_juicer/**')
-    cov.start()
-
+    global cov
     runner = unittest.TextTestRunner()
     test_suite = gather_test_cases(os.path.abspath(args.test_dir),
                                    args.pattern, args.tag, args.mode)
+    logger.info(f'There are {len(test_suite._tests)} test cases to run.')
     res = runner.run(test_suite)
 
     cov.stop()
 
@@ -0,0 +1,57 @@
+import os
+import json
+import unittest
+
+from data_juicer.utils.asset_utils import load_words_asset
+
+from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
+
+class LoadWordsAssetTest(DataJuicerTestCaseBase):
+
+    def setUp(self) -> None:
+        self.temp_output_path = 'tmp/test_asset_utils/'
+
+    def tearDown(self):
+        if os.path.exists(self.temp_output_path):
+            os.system(f'rm -rf {self.temp_output_path}')
+
+    def test_basic_func(self):
+        # download assets from the remote server
+        words_dict = load_words_asset(self.temp_output_path, 'stopwords')
+        self.assertTrue(len(words_dict) > 0)
+        self.assertTrue(os.path.exists(os.path.join(self.temp_output_path, 'stopwords.json')))
+
+        words_dict = load_words_asset(self.temp_output_path, 'flagged_words')
+        self.assertTrue(len(words_dict) > 0)
+        self.assertTrue(os.path.exists(os.path.join(self.temp_output_path, 'flagged_words.json')))
+
+        # non-existing asset
+        with self.assertRaises(ValueError):
+            load_words_asset(self.temp_output_path, 'non_existing_asset')
+
+    def test_load_from_existing_file(self):
+        os.makedirs(self.temp_output_path, exist_ok=True)
+        temp_asset = os.path.join(self.temp_output_path, 'temp_asset.json')
+        with open(temp_asset, 'w') as fout:
+            json.dump({'test_key': ['test_val']}, fout)
+
+        words_list = load_words_asset(self.temp_output_path, 'temp_asset')
+        self.assertEqual(len(words_list), 1)
+        self.assertEqual(len(words_list['test_key']), 1)
+
+    def test_load_from_serial_files(self):
+        os.makedirs(self.temp_output_path, exist_ok=True)
+        temp_asset = os.path.join(self.temp_output_path, 'temp_asset_v1.json')
+        with open(temp_asset, 'w') as fout:
+            json.dump({'test_key': ['test_val_1']}, fout)
+        temp_asset = os.path.join(self.temp_output_path, 'temp_asset_v2.json')
+        with open(temp_asset, 'w') as fout:
+            json.dump({'test_key': ['test_val_2']}, fout)
+
+        words_list = load_words_asset(self.temp_output_path, 'temp_asset')
+        self.assertEqual(len(words_list), 1)
+        self.assertEqual(len(words_list['test_key']), 2)
+
+
+if __name__ == '__main__':
+    unittest.main()
@@ -0,0 +1,12 @@
+import unittest
+
+from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
+
+class AutoInstallMappingTest(DataJuicerTestCaseBase):
+
+    def test_placeholder(self):
+        pass
+
+
+if __name__ == '__main__':
+    unittest.main()
@@ -0,0 +1,24 @@
+import unittest
+
+from data_juicer.utils.auto_install_utils import _is_module_installed, _is_package_installed
+
+from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
+
+class IsXXXInstalledFuncsTest(DataJuicerTestCaseBase):
+
+    def test_is_module_installed(self):
+        self.assertTrue(_is_module_installed('datasets'))
+        self.assertTrue(_is_module_installed('simhash'))
+
+        self.assertFalse(_is_module_installed('non_existent_module'))
+
+    def test_is_package_installed(self):
+        self.assertTrue(_is_package_installed('datasets'))
+        self.assertTrue(_is_package_installed('ram@git+https://github.com/xinyu1205/recognize-anything.git'))
+        self.assertTrue(_is_package_installed('scenedetect[opencv]'))
+
+        self.assertFalse(_is_package_installed('non_existent_package'))
+
+
+if __name__ == '__main__':
+    unittest.main()
@@ -0,0 +1,23 @@
+import unittest
+
+from data_juicer.utils.availability_utils import _is_package_available
+from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
+
+class AvailabilityUtilsTest(DataJuicerTestCaseBase):
+
+    def test_is_package_available(self):
+        exist = _is_package_available('fsspec')
+        self.assertTrue(exist)
+        exist, version = _is_package_available('fsspec', return_version=True)
+        self.assertTrue(exist)
+        self.assertEqual(version, '2023.5.0')
+
+        exist = _is_package_available('non_existing_package')
+        self.assertFalse(exist)
+        exist, version = _is_package_available('non_existing_package', return_version=True)
+        self.assertFalse(exist)
+        self.assertEqual(version, 'N/A')
+
+
+if __name__ == '__main__':
+    unittest.main()