[Main][Perf] Add fused matmul/reduce-scatter kernel for performance optimization. #3669

ZYang6263 · 2025-10-23T08:19:44Z

What this PR does / why we need it?

This PR improves performance by introducing a fused kernel that combines the matmul and reduce-scatter operations. It supports both unquantized (e.g., BFloat16) and W8A8 quantized models.
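For readers unfamiliar with the pattern, the unfused computation that such a kernel replaces can be sketched in plain Python. This is only a single-process reference for the semantics, not the PR's code: the helper names (`matmul`, `reduce_scatter`, `matmul_reduce_scatter_reference`) are hypothetical, ranks are simulated as list entries, and weights are assumed to be laid out as (K x N) shards split along K.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def reduce_scatter(partials):
    """Sum the per-rank partial results elementwise, then hand rank r the r-th row chunk."""
    tp_size = len(partials)
    rows = len(partials[0])
    assert rows % tp_size == 0, "row count must be divisible by the number of ranks"
    # zip(*partials) groups the same row across ranks; zip(*row_group) groups the same column.
    summed = [[sum(vals) for vals in zip(*row_group)] for row_group in zip(*partials)]
    chunk = rows // tp_size
    return [summed[r * chunk:(r + 1) * chunk] for r in range(tp_size)]


def matmul_reduce_scatter_reference(x_shards, w_shards):
    """Unfused reference: each simulated rank multiplies its shard, then results are reduce-scattered."""
    partials = [matmul(x, w) for x, w in zip(x_shards, w_shards)]
    return reduce_scatter(partials)


# Two simulated ranks; the K dimension (4) is split into two shards of size 2.
x_shards = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
w_shards = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
per_rank_output = matmul_reduce_scatter_reference(x_shards, w_shards)
print(per_rank_output)  # [[[12, 13]], [[18, 19]]]
```

Each simulated rank ends up with `rows // tp_size` rows of the summed product, which matches the `x.shape[0] // self.layer.tp_size` leading dimension seen in the reviewed diff. A fused kernel would presumably produce the same result while overlapping computation and communication.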

Does this PR introduce any user-facing change?

How was this patch tested?

vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

gemini-code-assist

Code Review

This pull request introduces a fused matmul/reduce-scatter kernel to optimize performance, which is a great initiative. The implementation correctly identifies the specific CANN version and quantization methods to apply the optimization. However, I've found a critical correctness issue where the bias term is ignored in the unquantized path, which could lead to incorrect model outputs. Additionally, there are several opportunities to improve code quality by removing dead code, placing imports correctly, and refactoring duplicated logic. Addressing these points will make the new optimized code paths more robust and maintainable.
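The bias concern can be illustrated with a small, self-contained sketch. It is not taken from the PR: the helper names and the toy numbers are hypothetical, and ranks are simulated as list entries. It assumes a row-parallel layer in which each rank holds a partial product, so the bias has to be applied exactly once across the reduction.

```python
def reduce_rows(partials):
    """Elementwise sum of the per-rank partial outputs."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def add_bias(matrix, bias):
    """Add a bias vector to every row of a matrix."""
    return [[v + b for v, b in zip(row, bias)] for row in matrix]


# Toy partial outputs from two simulated ranks, plus a bias vector.
partials = [[[1.0, 2.0]], [[10.0, 20.0]]]
bias = [0.5, -0.5]

# Reference: reduce first, then add the bias once.
expected = add_bias(reduce_rows(partials), bias)

# Bias dropped entirely (the situation the review warns about).
no_bias = reduce_rows(partials)

# Bias added on rank 0 only, before the reduction: equivalent to the reference.
rank0_only = reduce_rows(
    [add_bias(p, bias) if rank == 0 else p for rank, p in enumerate(partials)]
)

# Bias added on every rank: counted tp_size times after the reduction.
every_rank = reduce_rows([add_bias(p, bias) for p in partials])

print(expected, no_bias, rank0_only, every_rank)
```

Only the rank-0-only variant matches the reference here; dropping the bias or adding it on every rank changes the output.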

vllm_ascend/ops/linear_op.py

gemini-code-assist · 2025-10-23T08:21:49Z


+            output_parallel = torch.empty(x.shape[0] // self.layer.tp_size,
+                                          self.layer.weight.shape[0],
+                                          dtype=self.layer.params_dtype,
+                                          device=x.device)


The output_parallel tensor initialized here is never used, as the result from npu_mm_reduce_scatter_base is assigned directly to the output variable. This is dead code and should be removed. A similar redundant initialization exists for the W8A8 path on lines 432-436.
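A minimal sketch of the pattern described in this comment, using plain Python lists instead of tensors. The function names are hypothetical, and `fused_op` is only a stand-in for a kernel that returns a freshly allocated result.

```python
def fused_op(x):
    """Hypothetical stand-in for a kernel that allocates and returns its own output."""
    return [v * 2 for v in x]


def forward_with_dead_buffer(x):
    # Allocated up front but never read or written afterwards: dead code.
    output_parallel = [0] * len(x)
    output = fused_op(x)
    return output


def forward_clean(x):
    # Same behavior without the unused allocation.
    return fused_op(x)


print(forward_with_dead_buffer([1, 2]) == forward_clean([1, 2]))
```

Both functions return the same value, so removing the unused buffer does not change behavior; it only avoids an unnecessary allocation.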


github-actions · 2025-10-23T08:24:46Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

Co-authored-by: ZhaoJiangJiang <[email protected]>
Co-authored-by: rjg-lyh <[email protected]>
Co-authored-by: raintBN-91 <[email protected]>
Signed-off-by: ZYang6263 <[email protected]>

Changes to be committed:
modified: vllm_ascend/ops/linear_op.py
modified: vllm_ascend/ops/register_custom_ops.py

Signed-off-by: ZYang6263 <[email protected]>

gemini-code-assist bot reviewed Oct 23, 2025


ZYang6263 force-pushed the RS_new branch from 7e48cbb to 8a304b6 Compare October 23, 2025 08:28

github-actions bot added the module:ops label Oct 23, 2025

ZYang6263 force-pushed the RS_new branch 5 times, most recently from fe6ecca to f4cdbbc Compare October 23, 2025 11:24

weijinqian0 approved these changes Oct 23, 2025


ZYang6263 force-pushed the RS_new branch from f4cdbbc to 9e68503 Compare October 23, 2025 11:59

rjg-lyh added the ready (read for review) and ready-for-test (start test by label for PR) labels Oct 23, 2025

ZYang6263 closed this Oct 24, 2025

ZYang6263 reopened this Oct 24, 2025

ZYang6263 closed this Oct 24, 2025


3 participants
