2 changes: 2 additions & 0 deletions .github/workflows/accuracy_test.yaml
Expand Up @@ -57,6 +57,8 @@ jobs:
model_name: Qwen3-VL-30B-A3B-Instruct
- runner: a2-2
model_name: DeepSeek-V2-Lite
- runner: a2-2
model_name: ERNIE-4.5-21B-A3B-PT
fail-fast: false
# test will be triggered when tag 'accuracy-test' & 'ready-for-test'
if: >-
Expand Down
14 changes: 14 additions & 0 deletions tests/e2e/models/configs/ERNIE-4.5-21B-A3B-PT.yaml
@@ -0,0 +1,14 @@
model_name: "PaddlePaddle/ERNIE-4.5-21B-A3B-PT"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,flexible-extract"
value: 0.72
Comment on lines +5 to +6
Contributor
critical

The composite metric name "exact_match,flexible-extract" will likely cause the test to fail with a KeyError. The test script (tests/e2e/models/test_lm_eval_correctness.py, lines 152-155) attempts to find the standard error by creating a key "exact_match_stderr,flexible-extract", which is probably not present in the evaluation results. The script doesn't seem to correctly parse compound metric names for stderr retrieval. This might require a fix in the test script itself, or splitting this into separate metric definitions if the framework supports it.
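If the framework goes the route of fixing the test script, the stderr lookup needs to be aware of the comma: lm_eval metric keys have the form `<metric>,<filter>`, so the `_stderr` suffix belongs on the metric part, not appended to the whole string. A minimal sketch of such a comma-aware, defensive lookup (the helper name `get_stderr` and the fallback behavior are assumptions, not the actual code in test_lm_eval_correctness.py):

```python
def get_stderr(results: dict, metric_name: str, default: float = 0.0) -> float:
    """Look up the stderr for an lm_eval metric, handling compound names.

    A name like "exact_match,flexible-extract" splits into the metric
    ("exact_match") and the filter ("flexible-extract"); the stderr key
    is "exact_match_stderr,flexible-extract". Falls back to `default`
    when the evaluation results carry no stderr entry (hypothetical
    fallback, assumed here for robustness).
    """
    metric, _, filt = metric_name.partition(",")
    stderr_key = f"{metric}_stderr,{filt}" if filt else f"{metric}_stderr"
    return results.get(stderr_key, default)
```

With this shape, a missing stderr entry degrades to the default instead of raising a KeyError, so configs like this one would not crash the run.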

- name: "ceval-valid"
metrics:
- name: "acc,none"
value: 0.85
Comment on lines +9 to +10
Contributor
critical

Similar to the gsm8k task, the metric name "acc,none" will likely cause a KeyError during the test run. The test script (tests/e2e/models/test_lm_eval_correctness.py, lines 152-155) will try to access a key "acc_stderr,none" for the standard error, which is unlikely to exist in the results from lm_eval. This points to a potential bug in how the test script handles stderr for metrics with commas in their names.

num_fewshot: 5
tensor_parallel_size: 2
batch_size: 16
gpu_memory_utilization: 0.6
1 change: 1 addition & 0 deletions tests/e2e/models/configs/accuracy.txt
Expand Up @@ -6,3 +6,4 @@ Qwen2-7B.yaml
Qwen2-VL-7B-Instruct.yaml
Qwen2-Audio-7B-Instruct.yaml
Qwen3-VL-30B-A3B-Instruct.yaml
ERNIE-4.5-21B-A3B-PT.yaml