Fix typo in op description (#620)

BeachWang · web-flow · commit 8d094109fc50 · 2025-03-14T17:19:01.000+08:00
* done

* llm_api_difficulty_score_filter

* refine order

* score_threshold -> min/max_score

* for vllm

* enable vllm

* rename op

* fix op num

* rm redundant op

* op doc

* fix typo
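
For readers unfamiliar with the `score_threshold -> min/max_score` change noted above, the following is a minimal sketch of range-based score filtering. It is not the operator's actual implementation, which this diff does not show. The names `keep_sample` and `filter_by_score`, the `"score"` field, and the `max_score` default of 1.0 are all hypothetical and chosen only for illustration; only `min_score: 0.5` appears in the config touched by this commit.

```python
from typing import Dict, List


def keep_sample(score: float, min_score: float = 0.5, max_score: float = 1.0) -> bool:
    """Return True when the score lies in the closed range [min_score, max_score].

    The max_score default is an assumption; the diff only shows min_score: 0.5.
    """
    return min_score <= score <= max_score


def filter_by_score(
    samples: List[Dict], min_score: float = 0.5, max_score: float = 1.0
) -> List[Dict]:
    """Keep only samples whose (hypothetical) "score" field falls in the range."""
    return [s for s in samples if keep_sample(s["score"], min_score, max_score)]


# Illustrative data only; in the real operator the scores are estimated by an LLM.
samples = [
    {"text": "easy question", "score": 0.2},
    {"text": "moderate question", "score": 0.5},
    {"text": "hard question", "score": 0.9},
]
kept = filter_by_score(samples, min_score=0.5)
```

With a single `score_threshold`, only a lower bound can be expressed; a min/max pair also allows dropping samples whose score is above a chosen ceiling.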
diff --git a/configs/config_all.yaml b/configs/config_all.yaml
@@ -652,7 +652,7 @@ process:
   - language_id_score_filter:                               # filter text in specific language with language scores larger than a specific max value
       lang: en                                                # keep text in what language
       min_score: 0.8                                          # the min language scores to filter text
-  - llm_difficulty_score_filter:                            # filter to keep sample with high difficulty score estimated by LLM in API.
+  - llm_difficulty_score_filter:                            # filter to keep sample with high difficulty score estimated by LLM.
       api_or_hf_model: 'gpt-4o'                               # API or huggingface model name.
       min_score: 0.5                                          # The lowest difficulty score threshold to keep the sample.
       api_endpoint: null                                      # URL endpoint for the API.
@@ -666,7 +666,7 @@ process:
       enable_vllm: false                                      # If true, use VLLM for loading hugging face or local llm. Otherwise, use API for reference.
       model_params: {}                                        # Parameters for initializing the API model.
       sampling_params: {}                                     # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
-  - llm_quality_score_filter:                               # filter to keep sample with high quality score estimated by LLM in API.
+  - llm_quality_score_filter:                               # filter to keep sample with high quality score estimated by LLM.
       api_or_hf_model: 'gpt-4o'                               # API or huggingface model name.
       min_score: 0.5                                          # The lowest quality score threshold to keep the sample.
       api_endpoint: null                                      # URL endpoint for the API.
diff --git a/data_juicer/ops/filter/llm_difficulty_score_filter.py b/data_juicer/ops/filter/llm_difficulty_score_filter.py
@@ -20,7 +20,7 @@
 @OPERATORS.register_module(OP_NAME)
 class LLMDifficultyScoreFilter(Filter):
     """
-    Filter to keep sample with high difficulty score estimated by LLM in API.
+    Filter to keep sample with high difficulty score estimated by LLM.
     """
 
     # avoid leading whitespace
diff --git a/data_juicer/ops/filter/llm_quality_score_filter.py b/data_juicer/ops/filter/llm_quality_score_filter.py
@@ -20,7 +20,7 @@
 @OPERATORS.register_module(OP_NAME)
 class LLMQualityScoreFilter(Filter):
     """
-    Filter to keep sample with high quality score estimated by LLM in API.
+    Filter to keep sample with high quality score estimated by LLM.
     """
 
     # avoid leading whitespace
diff --git a/docs/Operators.md b/docs/Operators.md
@@ -107,8 +107,8 @@ All the specific operators are listed below, each featured with several capabili
 | image_text_similarity_filter | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Filter to keep samples those similarities between image and text within a specific range. 过滤器将图像和文本之间的相似性保持在特定范围内。 | [code](../data_juicer/ops/filter/image_text_similarity_filter.py) | [tests](../tests/ops/filter/test_image_text_similarity_filter.py) |
 | image_watermark_filter | 🏞Image 🚀GPU 🧩HF 🟢Stable | Filter to keep samples whose images have no watermark with high probability. 过滤器以保持其图像没有水印的样本具有高概率。 | [code](../data_juicer/ops/filter/image_watermark_filter.py) | [tests](../tests/ops/filter/test_image_watermark_filter.py) |
 | language_id_score_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples in a specific language with confidence score larger than a specific min value. 过滤器以保留置信度得分大于特定最小值的特定语言的样本。 | [code](../data_juicer/ops/filter/language_id_score_filter.py) | [tests](../tests/ops/filter/test_language_id_score_filter.py) |
-| llm_difficulty_score_filter | 💻CPU 🌊vLLM 🔗API 🟡Beta | Filter to keep sample with high difficulty score estimated by LLM in API. 在API中过滤以保持LLM估计的高难度分数的样本。 | [code](../data_juicer/ops/filter/llm_difficulty_score_filter.py) | [tests](../tests/ops/filter/test_llm_difficulty_score_filter.py) |
-| llm_quality_score_filter | 💻CPU 🌊vLLM 🔗API 🟡Beta | Filter to keep sample with high quality score estimated by LLM in API. 在API中过滤以保持LLM估计的高质量分数的样品。 | [code](../data_juicer/ops/filter/llm_quality_score_filter.py) | [tests](../tests/ops/filter/test_llm_quality_score_filter.py) |
+| llm_difficulty_score_filter | 💻CPU 🌊vLLM 🔗API 🟡Beta | Filter to keep sample with high difficulty score estimated by LLM. 过滤器以保持LLM估计的高难度分数的样本。 | [code](../data_juicer/ops/filter/llm_difficulty_score_filter.py) | [tests](../tests/ops/filter/test_llm_difficulty_score_filter.py) |
+| llm_quality_score_filter | 💻CPU 🌊vLLM 🔗API 🟡Beta | Filter to keep sample with high quality score estimated by LLM. 过滤器以保持LLM估计的高质量分数的样本。 | [code](../data_juicer/ops/filter/llm_quality_score_filter.py) | [tests](../tests/ops/filter/test_llm_quality_score_filter.py) |
 | maximum_line_length_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with maximum line length within a specific range. 过滤器将最大行长度的样本保持在特定范围内。 | [code](../data_juicer/ops/filter/maximum_line_length_filter.py) | [tests](../tests/ops/filter/test_maximum_line_length_filter.py) |
 | perplexity_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with perplexity score less than a specific max value. 过滤以保留困惑度分数小于特定最大值的样本。 | [code](../data_juicer/ops/filter/perplexity_filter.py) | [tests](../tests/ops/filter/test_perplexity_filter.py) |
 | phrase_grounding_recall_filter | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Filter to keep samples whose locating recalls of phrases extracted from text in the images are within a specified range. 过滤器，用于保留从图像中的文本中提取的短语的定位回忆在指定范围内的样本。 | [code](../data_juicer/ops/filter/phrase_grounding_recall_filter.py) | [tests](../tests/ops/filter/test_phrase_grounding_recall_filter.py) |