[long_seq_Feat] support chunk prefill #4158
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the chunked prefill implementation for MLA attention to better support context parallelism (PCP/DCP). The changes simplify the logic in _compute_prefill_context by introducing a new helper function _reorg_kvcache and unifying the handling of different parallelism configurations. While the refactoring in vllm_ascend/attention/mla_v1.py is a significant improvement in code clarity and maintainability, I've identified a critical bug in vllm_ascend/worker/model_runner_v1.py related to the calculation of context length for speculative decoding when context parallelism is enabled. This could lead to incorrect attention computations.
Signed-off-by: LookAround <[email protected]>
Signed-off-by: Delphine-Nic <[email protected]>
I'll enable the full test once lint passes.
What this PR does / why we need it?
1. Qwen GQA attention_v1 optimization.
2. DeepSeek MLA refactor: all-gather q -> all-gather kv.
3. Model runner refactor for chunked prefill; removed unused code.
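The motivation for switching from all-gathering q to all-gathering kv in MLA can be made concrete with a back-of-the-envelope communication estimate. The sketch below is illustrative only: the head counts and latent dimension are assumed DeepSeek-style MLA sizes, not values taken from this PR, and the ring all-gather cost model is a simplification.

```python
def allgather_bytes(tokens_per_rank: int, per_token_elems: int,
                    world: int, elem_bytes: int = 2) -> int:
    """Rough ring all-gather cost: each rank's shard is sent to the
    other world - 1 ranks (fp16, 2 bytes per element)."""
    return tokens_per_rank * per_token_elems * (world - 1) * elem_bytes

# Assumed MLA-style sizes (hypothetical, for illustration only):
num_heads, head_dim = 128, 192   # q is stored per head, uncompressed
kv_latent = 512 + 64             # compressed kv latent + rope dims

q_cost = allgather_bytes(1024, num_heads * head_dim, world=8)
kv_cost = allgather_bytes(1024, kv_latent, world=8)
print(q_cost > kv_cost)  # gathering the compressed kv moves far less data
```

Because MLA caches a compressed latent per token while q is materialized per head, gathering kv instead of q shrinks the collective's payload by more than an order of magnitude under these assumed sizes.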
Does this PR introduce any user-facing change?
How was this patch tested?