[Bugfix] Fix for Spec model TP + Chunked Prefill #10232
base: main
Conversation
Force-pushed from f1ff8aa to 6863d1f.
Hi, based on our DM discussions, my understanding is that the main issue is that num_lookahead_slots is > 0 even when all the sequences are prompts (prefill only). I added some logs in this PR (https://github.com/vllm-project/vllm/pull/10186/files); the output when I run with and without chunked prefill enabled is the following. Without chunked prefill
With chunked prefill
In the run without chunked prefill, num_lookahead_slots is set to 0 when the batch is entirely prefill, but that is not the case in the chunked-prefill run. I wonder if we should fix __schedule_chunked_prefill to set num_lookahead_slots to 0 when the batch is entirely prefill, and add an assertion for that in spec_decode_worker?
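A minimal sketch of the rule proposed above, assuming the scheduler can compare the number of scheduled groups against the number of prefill groups. The function name and signature here are illustrative, not vLLM's actual API:

```python
def effective_lookahead_slots(num_scheduled_groups: int,
                              num_prefill_groups: int,
                              num_lookahead_slots: int) -> int:
    """Return 0 when every scheduled group is a prefill, so the
    spec decode worker takes its no_spec path; otherwise pass
    the configured value through unchanged."""
    all_prefills = num_scheduled_groups == num_prefill_groups
    return 0 if all_prefills else num_lookahead_slots
```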
I like that; I think this would be more in line with the expected semantics (no speculation on prefill-only batches). Thanks for looking into it!
Force-pushed from 8ce5735 to d5f6392.
This reverts commit 6863d1f.
Force-pushed from d5f6392 to 10f69a4.
As discussed over DM, moving this up to the scheduler level is a cleaner fix, so I moved the check there. @NickLucche @sroy745 PTAL; if this logic looks good, I'll mark this ready!
Thanks for the PR! Added a couple of comments about tests. Logic LGTM
Thanks
# If the batch is all prompts, set num_lookahead_slots to 0.
# This allows us to go through the `no_spec` path in
# `spec_decode_worker.py`.
all_prefills = (len(scheduled_seq_groups) == num_prefill_groups)
I am wondering if we want to add a test in test_chunked_prefill_scheduler.py to cover this case?
@@ -409,6 +409,14 @@ def execute_model(
    execute_model_req)
num_lookahead_slots = execute_model_req.num_lookahead_slots

all_prompt = (all(
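The truncated hunk above introduces an all-prompt check in the worker. A hedged sketch of that idea, with `SeqMeta` standing in for vLLM's sequence group metadata and the function name being illustrative:

```python
from dataclasses import dataclass

@dataclass
class SeqMeta:
    # Minimal stand-in for vLLM's SequenceGroupMetadata.
    is_prompt: bool

def should_skip_speculation(seq_group_metadata_list, num_lookahead_slots):
    # Skip the speculative path when every sequence in the batch is
    # still a prompt, or when no lookahead slots were scheduled.
    all_prompt = all(m.is_prompt for m in seq_group_metadata_list)
    return all_prompt or num_lookahead_slots == 0
```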
I am wondering if we can add an e2e test for the case with speculative-draft-tensor-parallel-size > 1 and chunked-prefill enabled.
Let me take a look and see
    num_batched_tokens=budget.num_batched_tokens,
    blocks_to_swap_in=swapped_in.blocks_to_swap_in,
    blocks_to_swap_out=running_scheduled.blocks_to_swap_out,
    blocks_to_copy=running_scheduled.blocks_to_copy +
    swapped_in.blocks_to_copy,
    ignored_seq_groups=prefills.ignored_seq_groups +
    swapped_in.infeasible_seq_groups,
-   num_lookahead_slots=running_scheduled.num_lookahead_slots,
+   num_lookahead_slots=num_lookahead_slots,
@varun-sundar-rabindranath could you also review this part to see if this will break multi-step scheduling with chunked prefill?
Thanks for the tag. I believe it will affect performance.
Multi-step + chunked prefill allows having lookahead slots even when all the sequences are prefills: the sequences are processed as prefills in step 1 and as decodes in steps 2 to n.
Setting lookahead_slots to 0 will force single-stepping for the all-prefills case. I can get some profiles.
@andoorve, is there a way to make this update only when spec decode is enabled? I believe that would be safer.
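One way to express the gating suggested here, as an illustrative helper rather than the PR's actual code:

```python
def scheduler_lookahead_slots(configured_slots: int,
                              all_prefills: bool,
                              spec_decode_enabled: bool) -> int:
    # Only zero out lookahead slots for all-prefill batches when
    # speculative decoding is active; with multi-step scheduling
    # (and no spec decode) the configured slots are kept so prefill
    # steps can roll into decode steps.
    if spec_decode_enabled and all_prefills:
        return 0
    return configured_slots
```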
Hi @varun-sundar-rabindranath, I think that should be possible; thanks for the feedback! Let me see how we can do that.
Force-pushed from 5893379 to 0b300d2.
Fixes the issue I raised here: #9291. Chunked prefill + spec decoding + TP on the spec model fails for me with
KeyError: 'num_seq_groups'
when I use the following command. This fix makes it so the proposer only runs once on the non-driver processes when
no_spec
is on, to match the driver. One thing that is still confusing: I would expect this issue to show up without chunked prefill as well, and I'm unsure why it doesn't in that case. It would be good to get an opinion from someone more familiar with the spec decode path.
FIX #10276
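A toy reproduction of the failure mode described above, under the assumption that the error comes from a non-driver rank expecting more broadcast payloads than the driver sent (all names here are illustrative, not vLLM internals):

```python
# The driver takes the no_spec path and runs the model once, so it
# broadcasts exactly one metadata payload for the step.
payloads = [{"num_seq_groups": 4}]

def worker_recv(step: int) -> dict:
    # A non-driver rank that loops more times than the driver ran
    # reads past the broadcast payloads and gets an empty dict, so
    # indexing 'num_seq_groups' raises the KeyError seen above.
    return payloads[step] if step < len(payloads) else {}
```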