[feature]Pooling Features and PCP Adaptation #4143
Conversation
Signed-off-by: fjw <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the Mooncake engine to support different parallelism dimensions, namely Prefill Context Parallelism (PCP), Decode Context Parallelism (DCP), and Tensor Parallelism (TP). This is achieved by replacing the generic worker_id with specific ranks for each parallelism type (pcp_rank, dcp_rank, tp_rank) throughout the configuration and keying mechanisms.
While the refactoring is a good step towards more flexible distributed execution, I've identified a critical issue in how the block_size is calculated when context parallelism is enabled. The logic incorrectly compounds scaling factors from both PCP and DCP, which could lead to significant errors. This logic is also duplicated in three different places, increasing maintenance overhead and risk. My review includes detailed comments on this issue with suggestions for a fix.
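To make the keying change concrete, here is a minimal hypothetical sketch of rank-aware key composition; the function name and key format are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical sketch: composing a KV-cache block key from the parallelism
# ranks instead of a generic worker_id. Names are assumptions, not the PR API.
def make_block_key(block_hash: str, pcp_rank: int, dcp_rank: int,
                   tp_rank: int) -> str:
    # Each parallelism dimension owns a different shard of the KV block,
    # so every rank must appear in the key to keep shards from colliding.
    return f"{block_hash}_pcp{pcp_rank}_dcp{dcp_rank}_tp{tp_rank}"
```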
```python
if self.pcp_size > 1:
    self.block_size *= self.pcp_size

if self.dcp_size > 1:
    self.block_size *= self.dcp_size
```
The logic for adjusting self.block_size appears to be incorrect. When both prefill context parallelism (pcp_size > 1) and decode context parallelism (dcp_size > 1) are enabled, self.block_size is multiplied by both factors (i.e., self.block_size *= self.pcp_size * self.dcp_size).
This is likely not the intended behavior. PCP and DCP are typically applied at different stages (prefill and decode, respectively) and should not have their scaling factors compounded. This could lead to incorrect memory calculations, buffer overflows, or other critical runtime errors.
The suggested change ensures that these conditions are handled exclusively and raises an error for the ambiguous case where both are enabled.
Furthermore, this same logic is duplicated in vllm_ascend/distributed/mooncake/mooncake_store_connector_v1.py. This code should be centralized into a single utility function to avoid inconsistencies and improve maintainability.
Suggested change:

```diff
-if self.pcp_size > 1:
-    self.block_size *= self.pcp_size
-if self.dcp_size > 1:
-    self.block_size *= self.dcp_size
+if self.pcp_size > 1 and self.dcp_size > 1:
+    raise ValueError("Both PCP and DCP enabled is not supported.")
+elif self.pcp_size > 1:
+    self.block_size *= self.pcp_size
+elif self.dcp_size > 1:
+    self.block_size *= self.dcp_size
```
```python
if self.pcp_size > 1:
    self._block_size *= self.pcp_size

if self.dcp_size > 1:
    self._block_size *= self.dcp_size
```
This block of code for calculating _block_size has the same potential bug as noted in mooncake_engine.py. It incorrectly compounds the scaling factors for pcp_size and dcp_size, which can lead to critical errors.
This logic is duplicated across multiple files and classes. It should be refactored into a shared utility function to ensure correctness and maintainability. Please see the detailed comment on vllm_ascend/distributed/mooncake/mooncake_engine.py (lines 71-75).
Suggested change:

```diff
-if self.pcp_size > 1:
-    self._block_size *= self.pcp_size
-if self.dcp_size > 1:
-    self._block_size *= self.dcp_size
+if self.pcp_size > 1 and self.dcp_size > 1:
+    raise ValueError("Both PCP and DCP enabled is not supported.")
+elif self.pcp_size > 1:
+    self._block_size *= self.pcp_size
+elif self.dcp_size > 1:
+    self._block_size *= self.dcp_size
```
```python
if self.pcp_size > 1:
    self._block_size *= self.pcp_size

if self.dcp_size > 1:
    self._block_size *= self.dcp_size
```
This is the third instance of the duplicated and potentially incorrect logic for _block_size calculation. As mentioned in other comments, compounding pcp_size and dcp_size is likely a bug and can cause severe issues.
This logic must be corrected and centralized to prevent future bugs and improve code quality. Please refer to the comment on vllm_ascend/distributed/mooncake/mooncake_engine.py (lines 71-75) for a detailed explanation and suggested fix.
Suggested change:

```diff
-if self.pcp_size > 1:
-    self._block_size *= self.pcp_size
-if self.dcp_size > 1:
-    self._block_size *= self.dcp_size
+if self.pcp_size > 1 and self.dcp_size > 1:
+    raise ValueError("Both PCP and DCP enabled is not supported.")
+elif self.pcp_size > 1:
+    self._block_size *= self.pcp_size
+elif self.dcp_size > 1:
+    self._block_size *= self.dcp_size
```
Does an increase in block size potentially lead to worse performance?
| """ Initialize the current prefill context model parallel rank """ | ||
| pcp_rank: int | ||
| """ Initialize the current decode context model parallel rank """ | ||
| dcp_rank: int |
dcp_rank might be redundant with tp_rank as their logic is similar. We can probably use tp_rank directly and remove dcp_rank.
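If that consolidation is adopted, the metadata could carry tp_rank alone. A minimal sketch, assuming a hypothetical dataclass name:

```python
from dataclasses import dataclass


@dataclass
class TransferMetadata:  # hypothetical name, for illustration only
    pcp_rank: int  # current prefill context model parallel rank
    tp_rank: int   # reused for the decode-context rank if dcp_rank is dropped
```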
```python
# Third Party
import torch
from vllm.config import VllmConfig
from vllm.distributed import (get_decode_context_model_parallel_rank,
```
These get_xxx methods come from a private repository and do not exist on the vLLM main branch. The imports need to be intercepted so the code does not break against upstream vLLM.
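One way to intercept them is a guarded import with a no-op fallback; a sketch under the assumption that rank 0 is a safe default when the helper is absent:

```python
try:
    from vllm.distributed import get_decode_context_model_parallel_rank
except ImportError:
    # Fallback for upstream vLLM builds without the private helper:
    # behave as a single decode-context rank.
    def get_decode_context_model_parallel_rank() -> int:
        return 0
```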
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?