Description
Proposal to improve performance
No response
Report of performance regression
We attempted to reproduce the results from the vLLM blog post "How Speculative Decoding Boosts vLLM Performance by up to 2.8x", but were unable to reach the performance levels reported there. The experimental setup we used is detailed below (a sketch of the corresponding launch configuration follows the list):
vLLM version: 0.6.3
Device: 4 x H100 PCIe
Target Model: meta-llama/Meta-Llama-3-70B-Instruct with TP=4
Draft Model: turboderp/Qwama-0.5B-Instruct with TP=1
Dataset: anon8231489123/ShareGPT_Vicuna_unfiltered
QPS: 1
Num Speculative Tokens: 4
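For completeness, here is a minimal sketch of how we launched the engine with the settings above, using vLLM's offline LLM API on version 0.6.3. The argument names (speculative_model, speculative_draft_tensor_parallel_size, num_speculative_tokens) reflect our own setup; the blog may have used the OpenAI-compatible server or additional flags that are not documented in the post.

```python
# Sketch of our engine configuration (vLLM 0.6.3, offline LLM API).
# This mirrors our setup, not necessarily the blog's; the blog may have served
# the model via the OpenAI-compatible server with different defaults.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,                        # TP=4 across 4 x H100 PCIe
    speculative_model="turboderp/Qwama-0.5B-Instruct",
    speculative_draft_tensor_parallel_size=1,      # draft model runs with TP=1
    num_speculative_tokens=4,
    max_num_seqs=256,                              # varied per run: 1, 4, 8, 16, 256
)

outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```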
Since the article did not specify the maximum batch size, we experimented with various batch sizes to align with the conditions implied in the blog. The results are summarized in the table below:
| NUM_REQUESTS | BATCH_SIZE | latency | output_token_throughput | total_token_throughput | sequence_throughput | p99_ttft_ms | mean_tpot_ms | mean_e2e_ms |
|---|---|---|---|---|---|---|---|---|
| 32 | 1 | 379.36 | 16.39 | 42.82 | 0.08 | 342058.66 | 59.24 | 192514.28 |
| 32 | 1 | 269.17 | 23.1 | 60.35 | 0.12 | 231910.81 | 48.47 | 132276.49 |
| 128 | 4 | 466.57 | 55.38 | 133.21 | 0.27 | 301887.07 | 68.16 | 158126.15 |
| 128 | 4 | 448.78 | 57.58 | 138.49 | 0.28 | 289373.59 | 71.17 | 151403.83 |
| 128 | 8 | 306.15 | 84.4 | 203.01 | 0.41 | 142675.01 | 85.46 | 80552.87 |
| 128 | 8 | 301.58 | 85.68 | 206.09 | 0.42 | 149036.84 | 94.52 | 83421.69 |
| 128 | 16 | 209.41 | 123.39 | 296.79 | 0.6 | 44303.73 | 103.11 | 37825.59 |
| 128 | 16 | 198.52 | 130.16 | 313.07 | 0.63 | 38716.87 | 107.46 | 33761.1 |
| 512 | 256 | 554.79 | 182.35 | 398.75 | 0.9 | 1598.33 | 127.45 | 25150.41 |
| 512 | 256 | 541.01 | 187 | 408.91 | 0.92 | 1659.16 | 117.04 | 20934.63 |
While our results do show that speculative decoding improves performance on the ShareGPT dataset, the maximum speedup we observed was 1.4x, at a batch size of 1, compared with the up-to-2.8x reported in the blog. The end-to-end (E2E) latency also differs significantly: even at a maximum batch size of 256, our average E2E latency exceeded 20 seconds, whereas the article reported an average of only 1–2 seconds.
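To make the numbers concrete, the speedups we quote are simply the ratios between paired rows of the table, under our reading that each pair shows the same NUM_REQUESTS/BATCH_SIZE configuration measured first without and then with speculative decoding:

```python
# Rough sketch of how the speedup figures quoted above follow from the table.
# Assumption: each pair of rows shares NUM_REQUESTS/BATCH_SIZE and was measured
# first without and then with speculative decoding (in that order).
pairs = {
    # batch_size: (latency_without_sd, latency_with_sd)
    1:   (379.36, 269.17),
    4:   (466.57, 448.78),
    8:   (306.15, 301.58),
    16:  (209.41, 198.52),
    256: (554.79, 541.01),
}

for batch_size, (baseline, spec_decode) in pairs.items():
    print(f"batch_size={batch_size:>3}: speedup = {baseline / spec_decode:.2f}x")
# batch_size=  1: speedup = 1.41x  <- the ~1.4x maximum mentioned above
```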
Therefore, we kindly request that the vLLM team share more detail about the experimental setup behind the blog, including the maximum batch size, the input and output lengths of the dataset, the number of requests, and the CPU configuration. This information would make the published performance results much easier to reproduce and would be valuable to other users.
Thank you for your assistance.
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
vLLM version: 0.6.3
PyTorch version: 2.4.0+cu121
OS: Ubuntu 20.04.6 LTS (x86_64)
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.46.2 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.