
Conversation

@yiz-liu (Collaborator) commented Oct 23, 2025

What this PR does / why we need it?

This PR refactors the Ascend attention implementation to align with vLLM's core interfaces, simplifying the code and improving maintainability.

Key Changes:

  • Align with vLLM's Attention Interface: The forward method signature in AscendAttentionBackendImpl now matches the base AttentionImpl in vLLM, removing the custom trace_flag.

  • Enable Opaque Attention Operator: By adding opaque_attention_op to AscendPlatform, we allow vLLM to wrap our attention kernel in its standard vllm.unified_attention_with_output operator. This avoids the need for a custom call path (see the sketch after this list).

  • Remove Obsolete Code:

    • The custom op vllm.unified_ascend_attention_with_output has been deleted as it is now redundant.
    • The trace_flag and its associated logic were removed, reducing code complexity.
    • An outdated quantization branch within the attention implementation was cleaned up.
  • Improve Readability: Renamed output variables (output vs. intermediate_output) and added comments to clarify the in-place nature of the attention output.
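
For orientation, here is a minimal sketch of the two alignment points above. The exact parameter list of the base `AttentionImpl.forward` and the precise form of the `opaque_attention_op` hook are assumptions inferred from the PR description, not copied from the vLLM or vllm-ascend source.

```python
# Hedged sketch only: class bodies, parameter names, and types are assumptions
# based on the PR description, not verbatim vllm-ascend code.
from typing import Optional

import torch


class AscendAttentionBackendImpl:  # in practice, subclasses vLLM's AttentionImpl
    def forward(
        self,
        layer,                      # owning attention layer (scales, kv-cache refs, ...)
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: torch.Tensor,
        attn_metadata,              # Ascend-specific attention metadata
        output: Optional[torch.Tensor] = None,
        output_scale: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # No custom trace_flag: because the signature matches the base
        # AttentionImpl, vLLM can wrap this kernel in its standard
        # vllm.unified_attention_with_output operator.
        # The attention result is written in place into `output` and returned.
        ...
        return output


class AscendPlatform:  # in practice, subclasses vLLM's Platform
    @classmethod
    def opaque_attention_op(cls) -> bool:
        # Opt in to vLLM's opaque attention wrapping, so the deleted custom op
        # (vllm.unified_ascend_attention_with_output) is no longer needed.
        return True
```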

Does this PR introduce any user-facing change?

None.

How was this patch tested?

No extra tests needed.

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request refactors the Ascend attention implementation by removing the unified_ascend_attention_with_output custom operator and its associated graph-splitting logic, aligning the attention mechanism more closely with the upstream vLLM framework. This is a positive change for code simplification and maintainability. However, there is one critical issue: support for INT8 KV cache quantization appears to have been removed without a replacement, which could be a significant regression for quantized models.

@github-actions (bot) commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Removes the `unified_ascend_attention_with_output` custom operator and its associated graph-splitting logic. The attention implementation is now invoked through vLLM's standard call path.

Also removes unnecessary code from the attention implementation.

Signed-off-by: Yizhou Liu <[email protected]>
Removes the `trace_flag` parameter from all calls to the attention backend's `forward` method within the unit tests.

The parameter is no longer used, and this change simplifies the test code by removing the redundant argument and the now-obsolete test case that specifically validated its behavior.

Signed-off-by: Yizhou Liu <[email protected]>
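
As a hypothetical before/after for the test change described in this commit (the object names and argument list other than `trace_flag` are placeholders, not the actual test fixtures):

```python
from unittest.mock import MagicMock

impl = MagicMock()          # stand-in for the AscendAttentionBackendImpl under test
layer = query = key = value = kv_cache = attn_metadata = output = MagicMock()

# Before this commit: every test call carried the custom flag.
impl.forward(layer, query, key, value, kv_cache, attn_metadata,
             output=output, trace_flag=True)

# After this commit: the call matches the base AttentionImpl signature.
impl.forward(layer, query, key, value, kv_cache, attn_metadata,
             output=output)
```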
@yiz-liu (Collaborator, Author) commented Oct 24, 2025

Due to a CI infra update, we need to close this PR and create a new one.

@yiz-liu closed this on Oct 24, 2025

Labels

module:core, module:tests, ready (read for review), ready-for-test (start test by label for PR)
