[Bugfix] Fix for Spec model TP + Chunked Prefill #10232

Draft · wants to merge 9 commits into base: main
Conversation

@andoorve (Collaborator) commented on Nov 11, 2024

Fixes the issue I raised here: #9291. Chunked prefill + spec decoding + TP on the spec model fails for me with `KeyError: 'num_seq_groups'` when I use the following command:

vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --max-num-seqs 32  --block-size 32  --speculative-model meta-llama/Llama-3.1-8B-Instruct  --num-speculative-tokens 8 --gpu-memory-utilization  0.98 --use-v2-block-manager --distributed-executor-backend ray --enable-chunked-prefill --max-num-batched-tokens 4096 --max-model-len 32768

This fix makes the proposer run only once on the non-driver processes when no_spec is on, to match the driver.

One thing that is still confusing: I would expect this issue to show up without chunked prefill as well, and I'm not sure why it doesn't in that case. It would be good to get an opinion from someone more familiar with the spec decode path.

FIX #10276
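
For illustration, a minimal, self-contained sketch of the desync this fixes; this models the control flow only, and the function names here are hypothetical, not vLLM's actual code:

```python
# Illustrative sketch only (not vLLM's actual control flow): why driver
# and non-driver ranks must run the proposer the same number of times.
def driver_steps(no_spec: bool, num_speculative_tokens: int) -> int:
    # The driver takes the no_spec path and runs the proposer once.
    return 1 if no_spec else num_speculative_tokens

def non_driver_steps_buggy(no_spec: bool, num_speculative_tokens: int) -> int:
    # Before the fix: non-driver ranks ignored no_spec and always looped,
    # so later broadcasts desynced and the metadata dict on non-driver
    # ranks lacked 'num_seq_groups' -> KeyError.
    return num_speculative_tokens

def non_driver_steps_fixed(no_spec: bool, num_speculative_tokens: int) -> int:
    # After the fix: run once when no_spec is on, matching the driver.
    return 1 if no_spec else num_speculative_tokens

assert driver_steps(True, 8) != non_driver_steps_buggy(True, 8)  # desync
assert driver_steps(True, 8) == non_driver_steps_fixed(True, 8)  # in sync
```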


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@andoorve (Collaborator, Author) commented:

@NickLucche @sroy745

@sroy745 (Collaborator) commented on Nov 12, 2024

Hi, thanks for the fix.

Based on our DM discussions, my understanding is that the main issue is that even when all the sequences are prompts (prefill only), we have num_lookahead_slots > 0. I added some logs in this PR (https://github.com/vllm-project/vllm/pull/10186/files); the output when running with and without chunked prefill enabled is the following.

Without chunked prefill

num_lookahead_slots in _schedule_default 0
prefills in _schedule_default_prefill 1
decodes in _schedule_default_prefill 0

With chunked prefill

num_lookahead_slots in _schedule_chunked_prefill 4
prefills in _schedule_chunked_prefill 1
decodes in _schedule_chunked_prefill 0

In the run without chunked prefill, num_lookahead_slots is set to 0 when the batch is entirely prefills, but that is not the case in the chunked-prefill run. I wonder if we should fix _schedule_chunked_prefill to set num_lookahead_slots to 0 if it is a complete prefill batch, and add an assertion in spec_decode_worker for that?
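
A minimal, self-contained sketch of the suggested assertion (SeqGroupMeta is a simplified stand-in, not vLLM's actual SequenceGroupMetadata):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SeqGroupMeta:
    # Simplified stand-in for vLLM's SequenceGroupMetadata.
    is_prompt: bool

def assert_no_lookahead_on_prefill_only(seq_groups: List[SeqGroupMeta],
                                        num_lookahead_slots: int) -> None:
    # A batch in which every scheduled sequence group is still a prompt
    # (prefill-only) should never have lookahead slots allocated.
    if all(sg.is_prompt for sg in seq_groups):
        assert num_lookahead_slots == 0, (
            "prefill-only batch must have num_lookahead_slots == 0")

# Passes: a prefill-only batch with zero lookahead slots.
assert_no_lookahead_on_prefill_only([SeqGroupMeta(is_prompt=True)], 0)
```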

@NickLucche (Contributor) commented:

> I wonder if we should fix _schedule_chunked_prefill to set num_lookahead_slots to 0 if it is a complete prefill batch and add an assertion in spec_decode_worker for that

I like that; I think it would be more in line with the expected semantics (no speculation on prefill-only batches).

Thanks for looking into it!!

@andoorve (Collaborator, Author) commented:

As discussed over DM, moving this up to the scheduler level is a cleaner fix, so I moved the check there. @NickLucche @sroy745 PTAL and see if this logic looks good; then I'll mark this ready!

@andoorve self-assigned this on Nov 13, 2024
@sroy745 (Collaborator) left a review:

Thanks for the PR! I added a couple of comments about tests. Logic LGTM.

vllm/core/scheduler.py (outdated):
# If all prompts, then we set num_lookahead_slots to 0.
# This allows us to go through the `no_spec` path in
# `spec_decode_worker.py`.
all_prefills = (len(scheduled_seq_groups) == num_prefill_groups)
@sroy745 (Collaborator):
I am wondering if we want to add a test in test_chunked_prefill_scheduler.py to cover this case?
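
As a self-contained sketch, the core property such a test could pin down looks like this (a pure function modeling the check above, not the actual Scheduler test harness):

```python
def effective_lookahead_slots(num_scheduled_groups: int,
                              num_prefill_groups: int,
                              num_lookahead_slots: int) -> int:
    # Mirrors the all_prefills check above: a batch made up entirely of
    # prefill groups gets zero lookahead slots, which routes execution
    # through the no_spec path in spec_decode_worker.
    all_prefills = (num_scheduled_groups == num_prefill_groups)
    return 0 if all_prefills else num_lookahead_slots

assert effective_lookahead_slots(2, 2, 4) == 0  # all-prefill batch
assert effective_lookahead_slots(3, 1, 4) == 4  # mixed batch keeps slots
```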

@@ -409,6 +409,14 @@ def execute_model(
execute_model_req)
num_lookahead_slots = execute_model_req.num_lookahead_slots

all_prompt = (all(
@sroy745 (Collaborator):

I am wondering if we can add an e2e test for the case with speculative-draft-tensor-parallel-size > 1 and chunked-prefill enabled.
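
For reference, such an e2e test would exercise a configuration along these lines (model names are placeholders; the flags mirror the command in the PR description):

```
vllm serve <target-model> \
    --tensor-parallel-size 8 \
    --speculative-model <draft-model> \
    --speculative-draft-tensor-parallel-size 2 \
    --num-speculative-tokens 8 \
    --enable-chunked-prefill
```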

@andoorve (Collaborator, Author):

Let me take a look and see.

  num_batched_tokens=budget.num_batched_tokens,
  blocks_to_swap_in=swapped_in.blocks_to_swap_in,
  blocks_to_swap_out=running_scheduled.blocks_to_swap_out,
  blocks_to_copy=running_scheduled.blocks_to_copy +
  swapped_in.blocks_to_copy,
  ignored_seq_groups=prefills.ignored_seq_groups +
  swapped_in.infeasible_seq_groups,
- num_lookahead_slots=running_scheduled.num_lookahead_slots,
+ num_lookahead_slots=num_lookahead_slots,
Collaborator:

@varun-sundar-rabindranath could you also review this part to see if this will break multi-step scheduling with chunked prefill?

@varun-sundar-rabindranath (Contributor):

Thanks for the tag. I believe it will affect performance. Multi-step + chunked-prefill allows for lookahead slots even when all the sequences are prefills: the sequences are processed as prefills in step 1 and as decodes in steps 2 through n. Setting the lookahead slots to 0 will force single-stepping for the all-prefills case. I can get some profiles.

@andoorve is there a way to make this update only if spec decode is enabled? I believe that would be safer.
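
A minimal sketch of that safer variant (a pure function modeling the decision, not vLLM's actual scheduler code):

```python
def pick_lookahead_slots(all_prefills: bool,
                         spec_decode_enabled: bool,
                         num_lookahead_slots: int) -> int:
    # Zero out lookahead slots for an all-prefill batch only when spec
    # decode is enabled, so multi-step scheduling with chunked prefill
    # keeps its lookahead slots for the all-prefills case.
    if spec_decode_enabled and all_prefills:
        return 0  # forces the no_spec path in spec_decode_worker
    return num_lookahead_slots

assert pick_lookahead_slots(True, True, 4) == 0   # spec decode: no_spec path
assert pick_lookahead_slots(True, False, 4) == 4  # multi-step unaffected
```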

@andoorve (Collaborator, Author):

Hi @varun-sundar-rabindranath, I think that should be possible, thanks for the feedback! Let me see how we can do that.

@mergify bot added the documentation (Improvements or additions to documentation) label on Nov 13, 2024
@andoorve added the bug (Something isn't working) label on Nov 15, 2024
Labels: bug (Something isn't working), documentation (Improvements or additions to documentation)

Successfully merging this pull request may close these issues:

[Bug]: Speculative Decoding + TP on Spec Worker + Chunked Prefill does not work.

5 participants