[Test]Add accuracy test for model ERNIE-4.5-21B-A3B-PT #3658
base: main
Signed-off-by: MrZ20 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request adds a new accuracy test for the ERNIE-4.5-21B-A3B-PT model. The changes include a new YAML configuration file for the test and an update to the list of test configurations. While the changes are straightforward, the new configuration file uses composite metric names (e.g., "exact_match,flexible-extract") that are likely to expose a bug in the test runner script (test_lm_eval_correctness.py), causing the new test to fail with a KeyError when retrieving standard error values. I've added critical comments on the specific metric definitions in the YAML file explaining the issue.
- name: "exact_match,flexible-extract"
  value: 0.72
The composite metric name "exact_match,flexible-extract" will likely cause the test to fail with a KeyError. The test script (tests/e2e/models/test_lm_eval_correctness.py, lines 152-155) attempts to find the standard error by creating a key "exact_match_stderr,flexible-extract", which is probably not present in the evaluation results. The script doesn't seem to correctly parse compound metric names for stderr retrieval. This might require a fix in the test script itself, or splitting this into separate metric definitions if the framework supports it.
- name: "acc,none"
  value: 0.85
Similar to the gsm8k task, the metric name "acc,none" will likely cause a KeyError during the test run. The test script (tests/e2e/models/test_lm_eval_correctness.py, lines 152-155) will try to access a key "acc_stderr,none" for the standard error, which is unlikely to exist in the results from lm_eval. This points to a potential bug in how the test script handles stderr for metrics with commas in their names.
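The key derivation described in these two comments can be sketched as below. This is a minimal illustration, not the code from test_lm_eval_correctness.py: the helper names `stderr_key` and `lookup_stderr` and the sample results dict are hypothetical, and the sketch assumes the results for one task are a flat dict keyed by "metric,filter" strings. Using `dict.get` instead of direct indexing means a missing stderr entry yields `None` rather than a KeyError.

```python
from typing import Optional


def stderr_key(metric_name: str) -> str:
    """Build the stderr key for a "metric,filter" name.

    "acc,none" becomes "acc_stderr,none"; a name without a comma
    simply gets the "_stderr" suffix.
    """
    metric, sep, filt = metric_name.partition(",")
    return f"{metric}_stderr{sep}{filt}"


def lookup_stderr(task_results: dict, metric_name: str) -> Optional[float]:
    """Return the stderr for a metric, or None if the key is absent or not numeric."""
    value = task_results.get(stderr_key(metric_name))
    return float(value) if isinstance(value, (int, float)) else None


# Hypothetical results for one task; the stderr value is made up for illustration.
sample = {"acc,none": 0.85, "acc_stderr,none": 0.01}

print(stderr_key("exact_match,flexible-extract"))
print(lookup_stderr(sample, "acc,none"))
print(lookup_stderr(sample, "exact_match,flexible-extract"))
```

Printing the derived key next to the keys actually present in a real results dict is a quick way to confirm whether the KeyError raised in the review can occur for a given task.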
@MengqingCao Review required
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
Add accuracy test for model ERNIE-4.5-21B-A3B-PT
Does this PR introduce any user-facing change?
How was this patch tested?