
Conversation

@MrZ20 commented Oct 23, 2025

What this PR does / why we need it?

Add accuracy test for model ERNIE-4.5-21B-A3B-PT

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: MrZ20 <[email protected]>
@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds a new accuracy test for the ERNIE-4.5-21B-A3B-PT model. The changes include a new YAML configuration file for the test and an update to the list of test configurations. While the changes are straightforward, the new configuration file uses composite metric names (e.g., "exact_match,flexible-extract") that are likely to expose a bug in the test runner script (test_lm_eval_correctness.py), causing the new test to fail with a KeyError when retrieving standard error values. I've added critical comments on the specific metric definitions in the YAML file explaining the issue.

Comment on lines +5 to +6
- name: "exact_match,flexible-extract"
value: 0.72

critical

The composite metric name "exact_match,flexible-extract" will likely cause the test to fail with a KeyError. The test script (tests/e2e/models/test_lm_eval_correctness.py, lines 152-155) attempts to find the standard error by creating a key "exact_match_stderr,flexible-extract", which is probably not present in the evaluation results. The script doesn't seem to correctly parse compound metric names for stderr retrieval. This might require a fix in the test script itself, or splitting this into separate metric definitions if the framework supports it.
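The key construction described above can be sketched as follows; `stderr_key` is a hypothetical helper (not the actual code in `test_lm_eval_correctness.py`) illustrating how a compound lm_eval metric name could be split so that the `_stderr` suffix is attached to the base metric rather than appended to the whole name:

```python
def stderr_key(metric_name: str) -> str:
    # Split a compound lm_eval metric name such as
    # "exact_match,flexible-extract" into its base metric and filter,
    # then rebuild the stderr key as "exact_match_stderr,flexible-extract".
    base, _, flt = metric_name.partition(",")
    return f"{base}_stderr,{flt}" if flt else f"{base}_stderr"
```

For example, `stderr_key("exact_match,flexible-extract")` yields `"exact_match_stderr,flexible-extract"`, matching the `<metric>_stderr,<filter>` naming pattern that lm_eval uses in its results dict.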

Comment on lines +9 to +10
- name: "acc,none"
value: 0.85

critical

Similar to the gsm8k task, the metric name "acc,none" will likely cause a KeyError during the test run. The test script (tests/e2e/models/test_lm_eval_correctness.py, lines 152-155) will try to access a key "acc_stderr,none" for the standard error, which is unlikely to exist in the results from lm_eval. This points to a potential bug in how the test script handles stderr for metrics with commas in their names.
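One defensive option, sketched below under the assumption that `metrics` is the per-task results dict returned by lm_eval, is to look the stderr key up with a fallback default instead of indexing directly, so a missing key degrades gracefully rather than raising `KeyError`; `get_stderr` is a hypothetical helper, not part of the existing test script:

```python
def get_stderr(metrics: dict, base: str, flt: str, default: float = 0.0) -> float:
    # Look up "<base>_stderr,<filter>" in the results dict; return
    # `default` instead of raising KeyError when the key is absent.
    return metrics.get(f"{base}_stderr,{flt}", default)

metrics = {"acc,none": 0.85}  # stderr key deliberately absent
print(get_stderr(metrics, "acc", "none"))  # prints 0.0
```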

@MrZ20 (Author) commented Oct 23, 2025

@MengqingCao Review required

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.
