[Feature] Remove stream synchronization during ring_mla #3672
base: main
Conversation
Signed-off-by: Jade Zheng <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to optimize performance by removing a stream synchronization in the ring MLA implementation. The approach of pre-transferring the chunk_seq_lens tensor to the device is correct. However, I've identified a few areas for further improvement to fully eliminate synchronization overhead. I've also found a critical logic bug where only the last prefill request is processed by the ring attention, which needs to be addressed. My detailed feedback is in the comments below.
seq_len2 = prefill_metadata.chunked_context.chunk_seq_lens[i]
seq_len2_npu = prefill_metadata.chunked_context.chunk_seq_lens_npu[
    i]
seq_len = torch.stack([seq_len1, seq_len2])
The seq_len variable is created here but never used. Its creation involves torch.stack with seq_len2, which is a CPU tensor, and seq_len1, a device tensor. This causes an unnecessary host-to-device transfer and stream synchronization, which this PR aims to eliminate. Removing this line will improve performance by avoiding this synchronization.
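For illustration, a minimal sketch of the block with that line dropped (variable names are taken from the diff above; the enclosing loop and metadata layout are assumed, not the actual vLLM Ascend code):

# Sketch only: keep the host copy for CPU-side bookkeeping and the
# pre-transferred device copy for the kernel; the unused torch.stack of a
# CPU tensor with a device tensor is gone, so no host-to-device copy is
# triggered here.
seq_len2 = prefill_metadata.chunked_context.chunk_seq_lens[i]
seq_len2_npu = prefill_metadata.chunked_context.chunk_seq_lens_npu[i]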
Signed-off-by: Jade Zheng <[email protected]>
  cache_k_pe,
  prefill_metadata.block_table,
- seq_len2.to(q_nope.device),
+ seq_len2_npu,
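For context, a hedged sketch of the one-time pre-transfer the new argument relies on; the class and field names mirror the diff, but the builder itself is hypothetical rather than the actual vLLM Ascend implementation:

import torch

class ChunkedContextMetadata:
    # Hypothetical container mirroring the fields referenced in the diff above.
    def __init__(self, chunk_seq_lens: torch.Tensor, device: torch.device):
        # Host copy, kept for CPU-side bookkeeping.
        self.chunk_seq_lens = chunk_seq_lens
        # Transferred to the device once at metadata-build time, so the
        # per-layer attention code can index it directly instead of calling
        # .to(device) on the hot path, which adds a host-to-device copy
        # (and the synchronization this PR removes) on every step.
        self.chunk_seq_lens_npu = chunk_seq_lens.to(device, non_blocking=True)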
Oops, the removal of this optimization might have been a regression introduced by the MLA refactoring. Thanks for the fix. LGTM.
Signed-off-by: Jade Zheng <[email protected]>
What this PR does / why we need it?
This optimization existed before but appears to have been removed at some point. This pull request restores it to eliminate the stream synchronization overhead caused by the MLA chunked context. As a result, DS R1 prefill performance increased from 4.15 to 4.20.
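Roughly, the restored pattern looks like the following standalone sketch (device handling is simplified to plain torch so it runs anywhere; on Ascend the device would come from torch_npu):

import torch

device = torch.device("cpu")  # stand-in for the Ascend NPU device

chunk_seq_lens = torch.tensor([[128, 64], [256, 32]])  # built on the host
chunk_seq_lens_npu = chunk_seq_lens.to(device)         # transferred once, up front

for i in range(chunk_seq_lens.size(0)):
    # Before: chunk_seq_lens[i].to(device) inside the loop, i.e. one
    # host-to-device copy plus a stream synchronization per chunk.
    # After: index the pre-transferred tensor; the host stays out of the loop.
    seq_len_npu = chunk_seq_lens_npu[i]
    # ... seq_len_npu would be handed to the ring MLA kernel here ...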
Does this PR introduce any user-facing change?
No.
How was this patch tested?
CI pass.