[Performance]: Results from the vLLM Blog article "How Speculative Decoding Boosts vLLM Performance by up to 2.8x" are unreproducible

Proposal to improve performance

No response

Report of performance regression

We attempted to reproduce the results from the vLLM Blog article "How Speculative Decoding Boosts vLLM Performance by up to 2.8x". However, we were unable to achieve the performance levels reported in the article. The experimental setup we used is detailed below:
vLLM version: 0.6.3
Device: 4 x H100 PCIe
Target Model: meta-llama/Meta-Llama-3-70B-Instruct with TP=4
Draft Model: turboderp/Qwama-0.5B-Instruct with TP=1
Dataset: anon8231489123/ShareGPT_Vicuna_unfiltered
QPS: 1
Num Speculative Tokens: 4
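For concreteness, the configuration above corresponds to engine arguments of roughly the following form. This is a minimal sketch using vLLM's offline `LLM` API rather than our actual benchmark harness (the numbers below were collected from serving-style runs at QPS 1), and the argument names reflect vLLM 0.6.3 as we understand it; adjust as needed:

```python
# Minimal sketch of the speculative-decoding setup described above
# (illustrative only; not our exact benchmark script).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",       # target model, TP=4
    tensor_parallel_size=4,
    speculative_model="turboderp/Qwama-0.5B-Instruct",  # draft model, TP=1
    speculative_draft_tensor_parallel_size=1,
    num_speculative_tokens=4,
    max_num_seqs=8,  # maximum batch size; we swept 1, 4, 8, 16, and 256
)

sampling = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Summarize speculative decoding in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```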
Since the article did not specify the maximum batch size, we experimented with various batch sizes to align with the conditions implied in the blog. The results are summarized in the table below:
| NUM_REQUESTS | BATCH_SIZE | latency | output_token_throughput | total_token_throughput | sequence_throughput | p99_ttft_ms | mean_tpot_ms | mean_e2e_ms |
|---|---|---|---|---|---|---|---|---|
| 32 | 1 | 379.36 | 16.39 | 42.82 | 0.08 | 342058.66 | 59.24 | 192514.28 |
| 32 | 1 | 269.17 | 23.1 | 60.35 | 0.12 | 231910.81 | 48.47 | 132276.49 |
| 128 | 4 | 466.57 | 55.38 | 133.21 | 0.27 | 301887.07 | 68.16 | 158126.15 |
| 128 | 4 | 448.78 | 57.58 | 138.49 | 0.28 | 289373.59 | 71.17 | 151403.83 |
| 128 | 8 | 306.15 | 84.4 | 203.01 | 0.41 | 142675.01 | 85.46 | 80552.87 |
| 128 | 8 | 301.58 | 85.68 | 206.09 | 0.42 | 149036.84 | 94.52 | 83421.69 |
| 128 | 16 | 209.41 | 123.39 | 296.79 | 0.6 | 44303.73 | 103.11 | 37825.59 |
| 128 | 16 | 198.52 | 130.16 | 313.07 | 0.63 | 38716.87 | 107.46 | 33761.1 |
| 512 | 256 | 554.79 | 182.35 | 398.75 | 0.9 | 1598.33 | 127.45 | 25150.41 |
| 512 | 256 | 541.01 | 187 | 408.91 | 0.92 | 1659.16 | 117.04 | 20934.63 |
While our results generally show improved performance with speculative decoding on the ShareGPT dataset, the maximum speedup we observed was about 1.4x, at a batch size of 1, well short of the up-to-2.8x reported in the blog. The end-to-end (E2E) latency also differs markedly: even at a maximum batch size of 256, our mean E2E latency exceeded 20 seconds, whereas the article reports an average of only 1–2 seconds.
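For clarity, the 1.4x figure is simply the ratio of the two batch-size-1 latency values in the table above. A minimal sketch of the arithmetic follows; the "without/with speculative decoding" labels reflect our reading of the row order and should be treated as an assumption:

```python
# Worked example for the 1.4x figure: ratio of the two batch-size-1 latency
# values from the table above. Which row is the baseline and which is the
# speculative-decoding run is a labeling assumption for illustration.
latency_wo_spec = 379.36  # batch size 1, first row
latency_w_spec = 269.17   # batch size 1, second row
print(f"speedup: {latency_wo_spec / latency_w_spec:.2f}x")  # speedup: 1.41x
```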
Therefore, we kindly request that the vLLM team share more details about the experimental setup used in the blog, in particular the maximum batch size, the input and output lengths of the dataset, the number of requests, and the CPU configuration. This information would make the published vLLM performance results easier to reproduce and would offer valuable insights to other users.
Thank you for your assistance.
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
```text
vLLM version: 0.6.3
PyTorch version: 2.4.0+cu121
OS: Ubuntu 20.04.6 LTS (x86_64)
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.46.2 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
```
Before submitting a new issue...
Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.