Commit 8d09410

Fix typo in op description (#620)
* done
* llm_api_difficulty_score_filter
* refine order
* score_threshold -> min/max_score
* for vllm
* enable vllm
* rename op
* fix op num
* rm redundant op
* op doc
* fix typo
1 parent 79567da commit 8d09410

4 files changed, +6 -6 lines changed
configs/config_all.yaml

+2 -2
@@ -652,7 +652,7 @@ process:
   - language_id_score_filter: # filter text in specific language with language scores larger than a specific max value
       lang: en # keep text in what language
       min_score: 0.8 # the min language scores to filter text
-  - llm_difficulty_score_filter: # filter to keep sample with high difficulty score estimated by LLM in API.
+  - llm_difficulty_score_filter: # filter to keep sample with high difficulty score estimated by LLM.
       api_or_hf_model: 'gpt-4o' # API or huggingface model name.
       min_score: 0.5 # The lowest difficulty score threshold to keep the sample.
       api_endpoint: null # URL endpoint for the API.
@@ -666,7 +666,7 @@ process:
       enable_vllm: false # If true, use VLLM for loading hugging face or local llm. Otherwise, use API for reference.
       model_params: {} # Parameters for initializing the API model.
       sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
-  - llm_quality_score_filter: # filter to keep sample with high quality score estimated by LLM in API.
+  - llm_quality_score_filter: # filter to keep sample with high quality score estimated by LLM.
       api_or_hf_model: 'gpt-4o' # API or huggingface model name.
       min_score: 0.5 # The lowest quality score threshold to keep the sample.
       api_endpoint: null # URL endpoint for the API.
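
Note on the `min_score` setting shown in this hunk: the commit message mentions the rename `score_threshold -> min/max_score`, which suggests the filter keeps samples whose LLM-estimated score falls within a range. The following is a minimal sketch of that idea only, not Data-Juicer's implementation. The function names `keep_sample` and `filter_samples`, the `max_score` default of 1.0, and the `score` field on each sample are all assumptions made for illustration.

```python
from typing import Dict, List


def keep_sample(score: float, min_score: float = 0.5, max_score: float = 1.0) -> bool:
    """Return True when the score lies in the closed range [min_score, max_score].

    Hypothetical helper: the 1.0 default for max_score is an assumption.
    """
    return min_score <= score <= max_score


def filter_samples(samples: List[Dict], min_score: float = 0.5,
                   max_score: float = 1.0) -> List[Dict]:
    """Keep only samples whose precomputed 'score' field passes keep_sample.

    In a real pipeline the score would come from an LLM; here it is
    assumed to be already attached to each sample.
    """
    return [s for s in samples if keep_sample(s["score"], min_score, max_score)]


# Toy data with made-up scores, for illustration only.
samples = [
    {"text": "easy question", "score": 0.2},
    {"text": "hard question", "score": 0.9},
]
kept = filter_samples(samples)
```

With the defaults above, only the sample scored 0.9 is kept, because 0.2 falls below `min_score`.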

data_juicer/ops/filter/llm_difficulty_score_filter.py

+1 -1
@@ -20,7 +20,7 @@
 @OPERATORS.register_module(OP_NAME)
 class LLMDifficultyScoreFilter(Filter):
     """
-    Filter to keep sample with high difficulty score estimated by LLM in API.
+    Filter to keep sample with high difficulty score estimated by LLM.
     """

     # avoid leading whitespace

data_juicer/ops/filter/llm_quality_score_filter.py

+1 -1
@@ -20,7 +20,7 @@
 @OPERATORS.register_module(OP_NAME)
 class LLMQualityScoreFilter(Filter):
     """
-    Filter to keep sample with high quality score estimated by LLM in API.
+    Filter to keep sample with high quality score estimated by LLM.
     """

     # avoid leading whitespace

docs/Operators.md

+2 -2
@@ -107,8 +107,8 @@ All the specific operators are listed below, each featured with several capabili
 | image_text_similarity_filter | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Filter to keep samples those similarities between image and text within a specific range. 过滤器将图像和文本之间的相似性保持在特定范围内。 | [code](../data_juicer/ops/filter/image_text_similarity_filter.py) | [tests](../tests/ops/filter/test_image_text_similarity_filter.py) |
 | image_watermark_filter | 🏞Image 🚀GPU 🧩HF 🟢Stable | Filter to keep samples whose images have no watermark with high probability. 过滤器以保持其图像没有水印的样本具有高概率。 | [code](../data_juicer/ops/filter/image_watermark_filter.py) | [tests](../tests/ops/filter/test_image_watermark_filter.py) |
 | language_id_score_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples in a specific language with confidence score larger than a specific min value. 过滤器以保留置信度得分大于特定最小值的特定语言的样本。 | [code](../data_juicer/ops/filter/language_id_score_filter.py) | [tests](../tests/ops/filter/test_language_id_score_filter.py) |
-| llm_difficulty_score_filter | 💻CPU 🌊vLLM 🔗API 🟡Beta | Filter to keep sample with high difficulty score estimated by LLM in API. 在API中过滤以保持LLM估计的高难度分数的样本| [code](../data_juicer/ops/filter/llm_difficulty_score_filter.py) | [tests](../tests/ops/filter/test_llm_difficulty_score_filter.py) |
-| llm_quality_score_filter | 💻CPU 🌊vLLM 🔗API 🟡Beta | Filter to keep sample with high quality score estimated by LLM in API. 在API中过滤以保持LLM估计的高质量分数的样品| [code](../data_juicer/ops/filter/llm_quality_score_filter.py) | [tests](../tests/ops/filter/test_llm_quality_score_filter.py) |
+| llm_difficulty_score_filter | 💻CPU 🌊vLLM 🔗API 🟡Beta | Filter to keep sample with high difficulty score estimated by LLM. 过滤器以保持LLM估计的高难度分数的样本| [code](../data_juicer/ops/filter/llm_difficulty_score_filter.py) | [tests](../tests/ops/filter/test_llm_difficulty_score_filter.py) |
+| llm_quality_score_filter | 💻CPU 🌊vLLM 🔗API 🟡Beta | Filter to keep sample with high quality score estimated by LLM. 过滤器以保持LLM估计的高质量分数的样本| [code](../data_juicer/ops/filter/llm_quality_score_filter.py) | [tests](../tests/ops/filter/test_llm_quality_score_filter.py) |
 | maximum_line_length_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with maximum line length within a specific range. 过滤器将最大行长度的样本保持在特定范围内。 | [code](../data_juicer/ops/filter/maximum_line_length_filter.py) | [tests](../tests/ops/filter/test_maximum_line_length_filter.py) |
 | perplexity_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with perplexity score less than a specific max value. 过滤以保留困惑度分数小于特定最大值的样本。 | [code](../data_juicer/ops/filter/perplexity_filter.py) | [tests](../tests/ops/filter/test_perplexity_filter.py) |
 | phrase_grounding_recall_filter | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Filter to keep samples whose locating recalls of phrases extracted from text in the images are within a specified range. 过滤器,用于保留从图像中的文本中提取的短语的定位回忆在指定范围内的样本。 | [code](../data_juicer/ops/filter/phrase_grounding_recall_filter.py) | [tests](../tests/ops/filter/test_phrase_grounding_recall_filter.py) |
