[feature]Pooling Features and PCP Adaptation #4143
Conversation
Signed-off-by: fjw <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the Mooncake engine to support different parallelism dimensions, namely Prefill Context Parallelism (PCP), Decode Context Parallelism (DCP), and Tensor Parallelism (TP). This is achieved by replacing the generic worker_id with specific ranks for each parallelism type (pcp_rank, dcp_rank, tp_rank) throughout the configuration and keying mechanisms.
While the refactoring is a good step towards more flexible distributed execution, I've identified a critical issue in how the block_size is calculated when context parallelism is enabled. The logic incorrectly compounds scaling factors from both PCP and DCP, which could lead to significant errors. This logic is also duplicated in three different places, increasing maintenance overhead and risk. My review includes detailed comments on this issue with suggestions for a fix.
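To make the keying change concrete, here is a minimal hypothetical sketch of rank-aware key composition; the function name and key format are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical sketch: composing a KV-cache block key from the parallelism
# ranks instead of a generic worker_id. Names are assumptions, not the PR API.
def make_block_key(block_hash: str, pcp_rank: int, dcp_rank: int,
                   tp_rank: int) -> str:
    # Each parallelism dimension owns a different shard of the KV block,
    # so every rank must appear in the key to keep shards from colliding.
    return f"{block_hash}_pcp{pcp_rank}_dcp{dcp_rank}_tp{tp_rank}"
```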
```python
if self.pcp_size > 1:
    self.block_size *= self.pcp_size

if self.dcp_size > 1:
    self.block_size *= self.dcp_size
```
The logic for adjusting self.block_size appears to be incorrect. When both prefill context parallelism (pcp_size > 1) and decode context parallelism (dcp_size > 1) are enabled, self.block_size is multiplied by both factors (i.e., self.block_size *= self.pcp_size * self.dcp_size).
This is likely not the intended behavior. PCP and DCP are typically applied at different stages (prefill and decode, respectively) and should not have their scaling factors compounded. This could lead to incorrect memory calculations, buffer overflows, or other critical runtime errors.
The suggested change ensures that these conditions are handled exclusively and raises an error for the ambiguous case where both are enabled.
Furthermore, this same logic is duplicated in vllm_ascend/distributed/mooncake/mooncake_store_connector_v1.py. This code should be centralized into a single utility function to avoid inconsistencies and improve maintainability.
Suggested change:

```diff
-if self.pcp_size > 1:
-    self.block_size *= self.pcp_size
-if self.dcp_size > 1:
-    self.block_size *= self.dcp_size
+if self.pcp_size > 1 and self.dcp_size > 1:
+    raise ValueError("Both PCP and DCP enabled is not supported.")
+elif self.pcp_size > 1:
+    self.block_size *= self.pcp_size
+elif self.dcp_size > 1:
+    self.block_size *= self.dcp_size
```
```python
if self.pcp_size > 1:
    self._block_size *= self.pcp_size

if self.dcp_size > 1:
    self._block_size *= self.dcp_size
```
This block of code for calculating _block_size has the same potential bug as noted in mooncake_engine.py. It incorrectly compounds the scaling factors for pcp_size and dcp_size, which can lead to critical errors.
This logic is duplicated across multiple files and classes. It should be refactored into a shared utility function to ensure correctness and maintainability. Please see the detailed comment on vllm_ascend/distributed/mooncake/mooncake_engine.py (lines 71-75).
Suggested change:

```diff
-if self.pcp_size > 1:
-    self._block_size *= self.pcp_size
-if self.dcp_size > 1:
-    self._block_size *= self.dcp_size
+if self.pcp_size > 1 and self.dcp_size > 1:
+    raise ValueError("Both PCP and DCP enabled is not supported.")
+elif self.pcp_size > 1:
+    self._block_size *= self.pcp_size
+elif self.dcp_size > 1:
+    self._block_size *= self.dcp_size
```
```python
if self.pcp_size > 1:
    self._block_size *= self.pcp_size

if self.dcp_size > 1:
    self._block_size *= self.dcp_size
```
This is the third instance of the duplicated and potentially incorrect logic for _block_size calculation. As mentioned in other comments, compounding pcp_size and dcp_size is likely a bug and can cause severe issues.
This logic must be corrected and centralized to prevent future bugs and improve code quality. Please refer to the comment on vllm_ascend/distributed/mooncake/mooncake_engine.py (lines 71-75) for a detailed explanation and suggested fix.
Suggested change:

```diff
-if self.pcp_size > 1:
-    self._block_size *= self.pcp_size
-if self.dcp_size > 1:
-    self._block_size *= self.dcp_size
+if self.pcp_size > 1 and self.dcp_size > 1:
+    raise ValueError("Both PCP and DCP enabled is not supported.")
+elif self.pcp_size > 1:
+    self._block_size *= self.pcp_size
+elif self.dcp_size > 1:
+    self._block_size *= self.dcp_size
```
Does an increase in block size potentially lead to worse performance?
| """ Initialize the current prefill context model parallel rank """ | ||
| pcp_rank: int | ||
| """ Initialize the current decode context model parallel rank """ | ||
| dcp_rank: int |
dcp_rank might be redundant with tp_rank as their logic is similar. We can probably use tp_rank directly and remove dcp_rank.
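If that consolidation is adopted, the metadata could carry tp_rank alone. A minimal sketch, assuming a hypothetical dataclass name:

```python
from dataclasses import dataclass


@dataclass
class TransferMetadata:  # hypothetical name, for illustration only
    pcp_rank: int  # current prefill context model parallel rank
    tp_rank: int   # reused for the decode-context rank if dcp_rank is dropped
```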
```python
# Third Party
import torch
from vllm.config import VllmConfig
from vllm.distributed import (get_decode_context_model_parallel_rank,
```
These get_xxx methods come from a private repository and do not exist on the vLLM main branch. The imports need to be intercepted so the code does not break against upstream vLLM.
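One way to intercept them is a guarded import with a no-op fallback; a sketch under the assumption that rank 0 is a safe default when the helper is absent:

```python
try:
    from vllm.distributed import get_decode_context_model_parallel_rank
except ImportError:
    # Fallback for upstream vLLM builds without the private helper:
    # behave as a single decode-context rank.
    def get_decode_context_model_parallel_rank() -> int:
        return 0
```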
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?