[Performance]: Results from the vLLM Blog article "How Speculative Decoding Boosts vLLM Performance by up to 2.8x" are unreproducible #10318

Open
1 task done
yeonjoon-jung01 opened this issue Nov 14, 2024 · 0 comments
Labels
performance Performance-related issues

Comments


yeonjoon-jung01 commented Nov 14, 2024

Proposal to improve performance

No response

Report of performance regression

We attempted to reproduce the results from the vLLM Blog article "How Speculative Decoding Boosts vLLM Performance by up to 2.8x", but we were unable to achieve the performance levels reported there. The experimental setup we used is detailed below (a configuration sketch matching these settings follows the list):

vLLM version: 0.6.3
Device: 4 x H100 PCIe
Target Model: meta-llama/Meta-Llama-3-70B-Instruct with TP=4
Draft Model: turboderp/Qwama-0.5B-Instruct with TP=1
Dataset: anon8231489123/ShareGPT_Vicuna_unfiltered
QPS: 1
Num Speculative Tokens: 4
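
For concreteness, the sketch below shows an engine configuration consistent with the settings above, using the standard vLLM 0.6.x speculative-decoding arguments. This is illustrative only, not necessarily the exact invocation used here or in the blog; the same fields map to CLI flags (`--speculative-model`, `--num-speculative-tokens`, `--speculative-draft-tensor-parallel-size`) when launching the OpenAI-compatible server instead.

```python
# Illustrative vLLM 0.6.x configuration matching the setup above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",       # target model, TP=4
    tensor_parallel_size=4,
    speculative_model="turboderp/Qwama-0.5B-Instruct",  # draft model, TP=1
    speculative_draft_tensor_parallel_size=1,
    num_speculative_tokens=4,
)

# Placeholder prompt; the actual experiments sample prompts from ShareGPT.
outputs = llm.generate(
    ["Example prompt"],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```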

Since the article did not specify the maximum batch size, we experimented with various batch sizes to align with the conditions implied in the blog. The results are summarized in the table below:

| NUM_REQUESTS | BATCH_SIZE | latency | output_token_throughput | total_token_throughput | sequence_throughput | p99_ttft_ms | mean_tpot_ms | mean_e2e_ms |
|---|---|---|---|---|---|---|---|---|
| 32 | 1 | 379.36 | 16.39 | 42.82 | 0.08 | 342058.66 | 59.24 | 192514.28 |
| 32 | 1 | 269.17 | 23.1 | 60.35 | 0.12 | 231910.81 | 48.47 | 132276.49 |
| 128 | 4 | 466.57 | 55.38 | 133.21 | 0.27 | 301887.07 | 68.16 | 158126.15 |
| 128 | 4 | 448.78 | 57.58 | 138.49 | 0.28 | 289373.59 | 71.17 | 151403.83 |
| 128 | 8 | 306.15 | 84.4 | 203.01 | 0.41 | 142675.01 | 85.46 | 80552.87 |
| 128 | 8 | 301.58 | 85.68 | 206.09 | 0.42 | 149036.84 | 94.52 | 83421.69 |
| 128 | 16 | 209.41 | 123.39 | 296.79 | 0.6 | 44303.73 | 103.11 | 37825.59 |
| 128 | 16 | 198.52 | 130.16 | 313.07 | 0.63 | 38716.87 | 107.46 | 33761.1 |
| 512 | 256 | 554.79 | 182.35 | 398.75 | 0.9 | 1598.33 | 127.45 | 25150.41 |
| 512 | 256 | 541.01 | 187 | 408.91 | 0.92 | 1659.16 | 117.04 | 20934.63 |
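
For context on how to read these columns: requests are sampled from ShareGPT and submitted at roughly 1 QPS, and mean_e2e_ms is the per-request wall-clock time from submission to the final generated token. The simplified client below illustrates that workload shape against an OpenAI-compatible vLLM server; it is a sketch only (placeholder prompts, Poisson arrivals assumed), not the upstream `benchmarks/benchmark_serving.py` script.

```python
# Simplified sketch of the serving workload reported above: requests arrive at
# ~1 QPS (Poisson inter-arrival times assumed) and E2E latency is measured per
# request. Illustration only; not the upstream benchmark script.
import asyncio
import random
import time

from openai import AsyncOpenAI  # vLLM exposes an OpenAI-compatible API server

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(prompt: str) -> float:
    start = time.perf_counter()
    stream = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for _ in stream:  # drain the stream; the last chunk ends the request
        pass
    return (time.perf_counter() - start) * 1000.0  # E2E latency in ms

async def main(num_requests: int = 32, qps: float = 1.0) -> None:
    tasks = []
    for i in range(num_requests):
        tasks.append(asyncio.create_task(one_request(f"placeholder prompt {i}")))
        await asyncio.sleep(random.expovariate(qps))  # ~qps requests per second
    latencies = await asyncio.gather(*tasks)
    print(f"mean E2E latency: {sum(latencies) / len(latencies):.1f} ms")

asyncio.run(main())
```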

While our results generally show that speculative decoding improves performance on the ShareGPT dataset, the maximum speedup we observed was 1.4x, at a batch size of 1. End-to-end (E2E) latency also differed significantly from the blog's figures: even at a maximum batch size of 256, our mean E2E latency exceeded 20 seconds, whereas the article reported an average of only 1–2 seconds.
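
Reading the first row of each pair in the table as the baseline and the second as the speculative-decoding run (the reading assumed here), the 1.4x figure follows directly from the batch-size-1 rows:

```python
# Batch-size-1 rows from the table above (assumed pairing: baseline first,
# speculative decoding second).
baseline_latency = 379.36
spec_latency = 269.17
print(f"speedup: {baseline_latency / spec_latency:.2f}x")  # ~1.41x
```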

Therefore, we kindly request that the vLLM team provide more detailed information regarding the experimental setup described in the blog, including specifics on the maximum batch size, input and output lengths of the dataset, number of requests, and CPU configuration. This information would greatly enhance the reliability of vLLM performance results and offer valuable insights to other users.

Thank you for your assistance.

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

vLLM version: 0.6.3
PyTorch version: 2.4.0+cu121
OS: Ubuntu 20.04.6 LTS (x86_64)

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.46.2 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
yeonjoon-jung01 added the performance (Performance-related issues) label Nov 14, 2024
yeonjoon-jung01 changed the title from "[Performance]: Results in "vLLM Blog" article about speculative decoding are unreproducible" to "[Performance]: Results from the vLLM Blog article "How Speculative Decoding Boosts vLLM Performance by up to 2.8x" are unreproducible" Nov 15, 2024