[Feature] Remove stream synchronization during ring_mla #3672
base: main
Conversation
Signed-off-by: Jade Zheng <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to optimize performance by removing a stream synchronization in the ring MLA implementation. The approach of pre-transferring the chunk_seq_lens tensor to the device is correct. However, I've identified a few areas for further improvement to fully eliminate synchronization overhead. I've also found a critical logic bug where only the last prefill request is processed by the ring attention, which needs to be addressed. My detailed feedback is in the comments below.
seq_len2 = prefill_metadata.chunked_context.chunk_seq_lens[i]
seq_len2_npu = prefill_metadata.chunked_context.chunk_seq_lens_npu[
    i]
seq_len = torch.stack([seq_len1, seq_len2])
The seq_len variable is created here but never used. Its creation involves torch.stack with seq_len2, which is a CPU tensor, and seq_len1, a device tensor. This causes an unnecessary host-to-device transfer and stream synchronization, which this PR aims to eliminate. Removing this line will improve performance by avoiding this synchronization.
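For illustration, a minimal sketch of the block with that line dropped (variable names are taken from the diff above; the enclosing loop and metadata layout are assumed, not the actual vLLM Ascend code):

# Sketch only: keep the host copy for CPU-side bookkeeping and the
# pre-transferred device copy for the kernel; the unused torch.stack of a
# CPU tensor with a device tensor is gone, so no host-to-device copy is
# triggered here.
seq_len2 = prefill_metadata.chunked_context.chunk_seq_lens[i]
seq_len2_npu = prefill_metadata.chunked_context.chunk_seq_lens_npu[i]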
Signed-off-by: Jade Zheng <[email protected]>
  cache_k_pe,
  prefill_metadata.block_table,
- seq_len2.to(q_nope.device),
+ seq_len2_npu,
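For context, a hedged sketch of the one-time pre-transfer the new argument relies on; the class and field names mirror the diff, but the builder itself is hypothetical rather than the actual vLLM Ascend implementation:

import torch

class ChunkedContextMetadata:
    # Hypothetical container mirroring the fields referenced in the diff above.
    def __init__(self, chunk_seq_lens: torch.Tensor, device: torch.device):
        # Host copy, kept for CPU-side bookkeeping.
        self.chunk_seq_lens = chunk_seq_lens
        # Transferred to the device once at metadata-build time, so the
        # per-layer attention code can index it directly instead of calling
        # .to(device) on the hot path, which adds a host-to-device copy
        # (and the synchronization this PR removes) on every step.
        self.chunk_seq_lens_npu = chunk_seq_lens.to(device, non_blocking=True)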
Oops, the removal of this optimization might have been a regression introduced by the MLA refactoring. Thanks for the fix. LGTM.
Signed-off-by: Jade Zheng <[email protected]>
What this PR does / why we need it?
This optimization existed before but appears to have been removed at some point. This pull request restores it to eliminate the stream synchronization overhead caused by the MLA chunked context. As a result, DS R1 prefill performance increased from 4.15 to 4.20.
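Roughly, the restored pattern looks like the following standalone sketch (device handling is simplified to plain torch so it runs anywhere; on Ascend the device would come from torch_npu):

import torch

device = torch.device("cpu")  # stand-in for the Ascend NPU device

chunk_seq_lens = torch.tensor([[128, 64], [256, 32]])  # built on the host
chunk_seq_lens_npu = chunk_seq_lens.to(device)         # transferred once, up front

for i in range(chunk_seq_lens.size(0)):
    # Before: chunk_seq_lens[i].to(device) inside the loop, i.e. one
    # host-to-device copy plus a stream synchronization per chunk.
    # After: index the pre-transferred tensor; the host stays out of the loop.
    seq_len_npu = chunk_seq_lens_npu[i]
    # ... seq_len_npu would be handed to the ring MLA kernel here ...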
Does this PR introduce any user-facing change?
No.
How was this patch tested?
CI pass.