
Conversation

Contributor

@DreamerLeader DreamerLeader commented Nov 12, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the Mooncake engine to support different parallelism dimensions, namely Prefill Context Parallelism (PCP), Decode Context Parallelism (DCP), and Tensor Parallelism (TP). This is achieved by replacing the generic worker_id with specific ranks for each parallelism type (pcp_rank, dcp_rank, tp_rank) throughout the configuration and keying mechanisms.
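
For illustration only, a minimal sketch of what rank-based keying could look like; the function name and key layout here are hypothetical and not taken from this PR:

def make_block_key(block_hash: str, pcp_rank: int, dcp_rank: int,
                   tp_rank: int) -> str:
    # Hypothetical key layout: encoding each parallelism rank keeps KV blocks
    # written by different shards from colliding in the shared store, the role
    # a single worker_id played before this refactor.
    return f"{block_hash}_pcp{pcp_rank}_dcp{dcp_rank}_tp{tp_rank}"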

While the refactoring is a good step towards more flexible distributed execution, I've identified a critical issue in how the block_size is calculated when context parallelism is enabled. The logic incorrectly compounds scaling factors from both PCP and DCP, which could lead to significant errors. This logic is also duplicated in three different places, increasing maintenance overhead and risk. My review includes detailed comments on this issue with suggestions for a fix.

Comment on lines 71 to 75

    if self.pcp_size > 1:
        self.block_size *= self.pcp_size

    if self.dcp_size > 1:
        self.block_size *= self.dcp_size
Contributor


critical

The logic for adjusting self.block_size appears to be incorrect. When both prefill context parallelism (pcp_size > 1) and decode context parallelism (dcp_size > 1) are enabled, self.block_size is multiplied by both factors (i.e., self.block_size *= self.pcp_size * self.dcp_size).

This is likely not the intended behavior. PCP and DCP are typically applied at different stages (prefill and decode, respectively) and should not have their scaling factors compounded. This could lead to incorrect memory calculations, buffer overflows, or other critical runtime errors.

The suggested change ensures that these conditions are handled exclusively and raises an error for the ambiguous case where both are enabled.

Furthermore, this same logic is duplicated in vllm_ascend/distributed/mooncake/mooncake_store_connector_v1.py. This code should be centralized into a single utility function to avoid inconsistencies and improve maintainability; a sketch of such a helper follows the suggested change below.

Suggested change
-if self.pcp_size > 1:
-    self.block_size *= self.pcp_size
-if self.dcp_size > 1:
-    self.block_size *= self.dcp_size
+if self.pcp_size > 1 and self.dcp_size > 1:
+    raise ValueError("Both PCP and DCP enabled is not supported.")
+elif self.pcp_size > 1:
+    self.block_size *= self.pcp_size
+elif self.dcp_size > 1:
+    self.block_size *= self.dcp_size
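
For illustration, a minimal sketch of the shared helper suggested above; the function name and its placement are hypothetical, not part of this PR:

def scale_block_size(block_size: int, pcp_size: int, dcp_size: int) -> int:
    # Hypothetical shared helper that mooncake_engine.py and
    # mooncake_store_connector_v1.py could both call instead of repeating
    # the scaling logic inline.
    if pcp_size > 1 and dcp_size > 1:
        raise ValueError("Both PCP and DCP enabled is not supported.")
    if pcp_size > 1:
        return block_size * pcp_size
    if dcp_size > 1:
        return block_size * dcp_size
    return block_size

Each call site would then reduce to something like self.block_size = scale_block_size(self.block_size, self.pcp_size, self.dcp_size).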

Comment on lines 39 to 43

    if self.pcp_size > 1:
        self._block_size *= self.pcp_size

    if self.dcp_size > 1:
        self._block_size *= self.dcp_size
Contributor


critical

This block of code for calculating _block_size has the same potential bug as noted in mooncake_engine.py. It incorrectly compounds the scaling factors for pcp_size and dcp_size, which can lead to critical errors.

This logic is duplicated across multiple files and classes. It should be refactored into a shared utility function to ensure correctness and maintainability. Please see the detailed comment on vllm_ascend/distributed/mooncake/mooncake_engine.py (lines 71-75).

Suggested change
-if self.pcp_size > 1:
-    self._block_size *= self.pcp_size
-if self.dcp_size > 1:
-    self._block_size *= self.dcp_size
+if self.pcp_size > 1 and self.dcp_size > 1:
+    raise ValueError("Both PCP and DCP enabled is not supported.")
+elif self.pcp_size > 1:
+    self._block_size *= self.pcp_size
+elif self.dcp_size > 1:
+    self._block_size *= self.dcp_size

Comment on lines 186 to 190

    if self.pcp_size > 1:
        self._block_size *= self.pcp_size

    if self.dcp_size > 1:
        self._block_size *= self.dcp_size
Contributor


critical

This is the third instance of the duplicated and potentially incorrect logic for _block_size calculation. As mentioned in other comments, compounding pcp_size and dcp_size is likely a bug and can cause severe issues.

This logic must be corrected and centralized to prevent future bugs and improve code quality. Please refer to the comment on vllm_ascend/distributed/mooncake/mooncake_engine.py (lines 71-75) for a detailed explanation and suggested fix.

Suggested change
-if self.pcp_size > 1:
-    self._block_size *= self.pcp_size
-if self.dcp_size > 1:
-    self._block_size *= self.dcp_size
+if self.pcp_size > 1 and self.dcp_size > 1:
+    raise ValueError("Both PCP and DCP enabled is not supported.")
+elif self.pcp_size > 1:
+    self._block_size *= self.pcp_size
+elif self.dcp_size > 1:
+    self._block_size *= self.dcp_size

@DreamerLeader changed the title from "Pooling Features and PCP Adaptation" to "[feature]Pooling Features and PCP Adaptation" on Nov 12, 2025
@weijinqian0
Collaborator

Does an increase in block size potentially lead to worse performance?

""" Initialize the current prefill context model parallel rank """
pcp_rank: int
""" Initialize the current decode context model parallel rank """
dcp_rank: int
Contributor


dcp_rank might be redundant with tp_rank as their logic is similar. We can probably use tp_rank directly and remove dcp_rank.

# Third Party
import torch
from vllm.config import VllmConfig
from vllm.distributed import (get_decode_context_model_parallel_rank,
Contributor


These get_xxx methods come from a private repository and are not present on the main branch; calls to them need to be guarded so the code does not break on mainline vLLM.
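
A possible way to handle this (a sketch only; the fallback to rank 0 below is an assumption, not behavior established in this PR) is to guard the import:

# Hypothetical guard: fall back when the context-parallel helper is not
# available in the installed vLLM (it currently exists only in a private branch).
try:
    from vllm.distributed import get_decode_context_model_parallel_rank
except ImportError:
    def get_decode_context_model_parallel_rank() -> int:
        # Without DCP support, behave as a single decode-context shard.
        return 0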
