
Conversation

@jianzs (Collaborator) commented Oct 23, 2025

What this PR does / why we need it?

This optimization existed before but appears to have been removed at some point. This pull request restores it, eliminating the stream synchronization overhead caused by the MLA chunked-context path. As a result, DS R1 prefill performance increased from 4.15 to 4.20.
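
For context, a minimal sketch of the restored optimization under a simplified structure: only `chunk_seq_lens` and `chunk_seq_lens_npu` are names taken from this PR's diff; the rest is illustrative. The idea is to copy the per-chunk sequence lengths to the device once, when the chunked-context metadata is built, so the attention loop never has to call `.to(device)` and synchronize the stream per chunk.

```python
from dataclasses import dataclass

import torch


@dataclass
class ChunkedContextMetadata:
    # Illustrative container; only the two field names mirror the PR's diff.
    chunk_seq_lens: torch.Tensor      # CPU tensor, kept for host-side bookkeeping
    chunk_seq_lens_npu: torch.Tensor  # same values, already resident on the device


def build_chunked_context(chunk_seq_lens: torch.Tensor,
                          device: torch.device) -> ChunkedContextMetadata:
    # Single up-front host-to-device copy. non_blocking=True can overlap the
    # copy with other work (fully async when the host tensor is pinned),
    # instead of stalling the stream once per chunk inside the attention loop.
    return ChunkedContextMetadata(
        chunk_seq_lens=chunk_seq_lens,
        chunk_seq_lens_npu=chunk_seq_lens.to(device, non_blocking=True),
    )
```

With the device copy prepared once, the per-chunk loop can index `chunk_seq_lens_npu[i]` directly instead of transferring `chunk_seq_lens[i]` on every iteration.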

Does this PR introduce any user-facing change?

No.

How was this patch tested?

CI pass.

@github-actions commented:

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request aims to optimize performance by removing a stream synchronization in the ring MLA implementation. The approach of pre-transferring the chunk_seq_lens tensor to the device is correct. However, I've identified a few areas for further improvement to fully eliminate synchronization overhead. I've also found a critical logic bug where only the last prefill request is processed by the ring attention, which needs to be addressed. My detailed feedback is in the comments below.

seq_len2 = prefill_metadata.chunked_context.chunk_seq_lens[i]
seq_len2_npu = prefill_metadata.chunked_context.chunk_seq_lens_npu[i]
seq_len = torch.stack([seq_len1, seq_len2])
@gemini-code-assist bot (Contributor) commented:

Severity: high

The seq_len variable is created here but never used. Building it calls torch.stack on seq_len2, a CPU tensor, and seq_len1, a device tensor, which incurs an unnecessary host-to-device transfer and stream synchronization, exactly the overhead this PR aims to eliminate. Removing the line avoids that synchronization.
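
For clarity, a sketch of the loop body after the suggested removal, assuming the surrounding per-chunk loop from the quoted diff (`seq_len1` and the downstream kernel call sit outside this excerpt):

```python
# Inside the per-chunk prefill loop (index i); names taken from the diff.
seq_len2 = prefill_metadata.chunked_context.chunk_seq_lens[i]          # host copy
seq_len2_npu = prefill_metadata.chunked_context.chunk_seq_lens_npu[i]  # device copy
# The unused `seq_len = torch.stack([seq_len1, seq_len2])` is simply dropped:
# nothing downstream reads it, so no host/device mixing and no sync remain.
```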

Signed-off-by: Jade Zheng <[email protected]>
cache_k_pe,
prefill_metadata.block_table,
seq_len2.to(q_nope.device),
seq_len2_npu,
A collaborator commented:

Oops, the removal of this optimization looks like a bug introduced by the MLA refactoring. Thanks for the fix. LGTM.

@jianzs added the ready (ready for review) and ready-for-test (start test by label for PR) labels on Oct 24, 2025

Labels

module:tests, ready (ready for review), ready-for-test (start test by label for PR)
