
[Performance]: Results from the vLLM Blog article "How Speculative Decoding Boosts vLLM Performance by up to 2.8x" are unreproducible #10318

Closed as not planned
@yeonjoon-jung01

Description

Proposal to improve performance

No response

Report of performance regression

We attempted to reproduce the results from the vLLM Blog article "How Speculative Decoding Boosts vLLM Performance by up to 2.8x". However, we were unable to achieve the performance levels reported in the article. The experimental setup we used is detailed below:

- vLLM version: 0.6.3
- Device: 4 x H100 PCIe
- Target model: meta-llama/Meta-Llama-3-70B-Instruct with TP=4
- Draft model: turboderp/Qwama-0.5B-Instruct with TP=1
- Dataset: anon8231489123/ShareGPT_Vicuna_unfiltered
- QPS: 1
- Num speculative tokens: 4
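
For reference, below is a minimal sketch of the engine configuration implied by the setup above, written against the vLLM 0.6.x offline `LLM` API. It is illustrative only: our actual numbers were collected by benchmarking a served endpoint at QPS 1, and `max_num_seqs` is our own way of capping the effective batch size, not a value specified in the blog.

```python
# Minimal sketch of the speculative-decoding setup above (vLLM 0.6.x offline API).
# Illustrative only: our measurements were taken against a served endpoint at QPS 1.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",        # target model, TP=4
    tensor_parallel_size=4,
    speculative_model="turboderp/Qwama-0.5B-Instruct",   # draft model, TP=1
    speculative_draft_tensor_parallel_size=1,
    num_speculative_tokens=4,
    max_num_seqs=16,            # effective batch-size cap; we swept 1/4/8/16/256
    use_v2_block_manager=True,  # spec decode historically required the v2 block
                                # manager; harmless if it is already the default
)

outputs = llm.generate(
    ["Summarize speculative decoding in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```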

Since the article did not specify the maximum batch size, we experimented with various batch sizes to align with the conditions implied in the blog. The results are summarized in the table below:

| NUM_REQUESTS | BATCH_SIZE | latency (s) | output_token_throughput (tok/s) | total_token_throughput (tok/s) | sequence_throughput (req/s) | p99_ttft (ms) | mean_tpot (ms) | mean_e2e (ms) |
|---|---|---|---|---|---|---|---|---|
| 32 | 1 | 379.36 | 16.39 | 42.82 | 0.08 | 342058.66 | 59.24 | 192514.28 |
| 32 | 1 | 269.17 | 23.10 | 60.35 | 0.12 | 231910.81 | 48.47 | 132276.49 |
| 128 | 4 | 466.57 | 55.38 | 133.21 | 0.27 | 301887.07 | 68.16 | 158126.15 |
| 128 | 4 | 448.78 | 57.58 | 138.49 | 0.28 | 289373.59 | 71.17 | 151403.83 |
| 128 | 8 | 306.15 | 84.40 | 203.01 | 0.41 | 142675.01 | 85.46 | 80552.87 |
| 128 | 8 | 301.58 | 85.68 | 206.09 | 0.42 | 149036.84 | 94.52 | 83421.69 |
| 128 | 16 | 209.41 | 123.39 | 296.79 | 0.60 | 44303.73 | 103.11 | 37825.59 |
| 128 | 16 | 198.52 | 130.16 | 313.07 | 0.63 | 38716.87 | 107.46 | 33761.10 |
| 512 | 256 | 554.79 | 182.35 | 398.75 | 0.90 | 1598.33 | 127.45 | 25150.41 |
| 512 | 256 | 541.01 | 187.00 | 408.91 | 0.92 | 1659.16 | 117.04 | 20934.63 |

Our results do show that speculative decoding improves performance on the ShareGPT dataset, but the maximum speedup we observed was 1.4x, at a batch size of 1, well below the reported 2.8x. The end-to-end (E2E) latency also differs substantially from the blog: even at a maximum batch size of 256, our mean E2E latency exceeds 20 seconds, whereas the article reports an average of only 1–2 seconds.
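
(For clarity on where the 1.4x comes from: assuming the first row of each batch-size pair in the table is the baseline and the second is the speculative-decoding run, the batch-size-1 latencies give 379.36 / 269.17 ≈ 1.41x, and the gain shrinks to roughly 1.02–1.05x at the larger batch sizes.)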

Therefore, we kindly request that the vLLM team provide more detailed information regarding the experimental setup described in the blog, including specifics on the maximum batch size, input and output lengths of the dataset, number of requests, and CPU configuration. This information would greatly enhance the reliability of vLLM performance results and offer valuable insights to other users.

Thank you for your assistance.

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

vLLM version: 0.6.3
PyTorch version: 2.4.0+cu121
OS: Ubuntu 20.04.6 LTS (x86_64)

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.46.2 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi

