This benchmark is meant to be a lightweight layer on top of an existing vLLM/SGLang/TRT-LLM installation. For example, no installation
is required when running inside one of the following Docker images: vllm/vllm-openai:v0.11.0 (vLLM), lmsysorg/sglang:v0.5.4.post2 (SGLang), or
nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4 (TRT-LLM).
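For example, a minimal way to start the vLLM container with GPU access and this repository mounted (the mount path is a placeholder; the same pattern applies to the SGLang and TRT-LLM images):

```bash
# Launch the container interactively; --entrypoint bash overrides the image's
# default API-server entrypoint so the benchmark can be run from a shell.
docker run --gpus all -it --rm \
  --entrypoint bash \
  -v "$(pwd)":/workspace -w /workspace \
  vllm/vllm-openai:v0.11.0
```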
Next, change into the benchmark directory:

cd examples/specdec_bench

This benchmark collects relevant metrics on acceptance rate, timing, and outputs for Speculative Decoding methods. Acceptance rate refers to the number of tokens generated on every iteration; for a standard Autoregressive LLM, this number is just 1. For example, with --draft_length 3 each iteration can produce between 1 and 4 tokens, so a mean acceptance rate of 2.5 corresponds to roughly 2.5x fewer target-model forward passes than plain autoregressive decoding (ignoring the cost of the draft model).
A basic example run script is provided which benchmarks MTBench (a standard set of 160 prompts spanning 8 categories). MTBench is available here.
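As an illustration (the linked source is not reproduced here, so the URL below is an assumption and may need adjusting), the MTBench prompts are commonly fetched from the lm-sys/FastChat repository:

```bash
# Assumed location of the MTBench prompts (question.jsonl) in the FastChat repository
wget https://raw.githubusercontent.com/lm-sys/FastChat/main/fastchat/llm_judge/data/mt_bench/question.jsonl
```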
Download nvidia/gpt-oss-120b-Eagle3 to a local directory /path/to/eagle.
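For example, using the Hugging Face CLI (the local directory is a placeholder):

```bash
# Download the Eagle3 draft model weights to a local directory
huggingface-cli download nvidia/gpt-oss-120b-Eagle3 --local-dir /path/to/eagle
```

Then run the MTBench benchmark: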
python3 run.py \
--model_dir openai/gpt-oss-120b \
--tokenizer openai/gpt-oss-120b \
--draft_model_dir /path/to/eagle \
--mtbench question.jsonl \
--tp_size 1 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--num_requests 80 \
--engine TRTLLM \
--concurrency 1 \
--postprocess gptoss

A second example benchmarks the same model on synthetic prompts (via --random_isl) instead of MTBench. As before, download nvidia/gpt-oss-120b-Eagle3 to a local directory /path/to/eagle.
python3 run.py \
--model_dir openai/gpt-oss-120b \
--tokenizer openai/gpt-oss-120b \
--draft_model_dir /path/to/eagle \
--random_isl 1024 \
--tp_size 1 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--num_requests 40 \
--engine TRTLLM \
--concurrency 1

Running SPEED-Bench on Llama 3.3 70B + Eagle 3
- Install the requirements file using
  pip install -r requirements_speed.txt
- Prepare the data using the provided script:
  python3 prepare_data.py --dataset speed --config all
  The data will be saved to the data/ directory, with each config type (qualitative, throughput_1k, ...) in its own directory.
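Assuming all configs are prepared, the resulting layout looks roughly like this (directory names taken from the --dataset_path values used in the commands below):

```bash
ls data/speed/
# qualitative  throughput_1k  throughput_16k  ...
```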
GOVERNING TERMS: This dataset is governed by the NVIDIA Evaluation Dataset License Agreement.
ADDITIONAL INFORMATION: MIT for bigcode/humanevalpack, RUCAIBox/MMATH, RUCAIBox/BAMBOO and EQ-Bench. Apache 2.0 for Writing Bench and Spec-Bench. CC BY 4.0 for FBK-MT/MCIF. MIT and Apache 2.0 for tianyang/repobench_python_v1.1, JetBrains-Research/lca-project-level-code-completion and tianyang/repobench_java_v1.1.
NOTICE: For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose. The prepare_data.py script automatically fetches data from all the source datasets.
Additional details are in the HuggingFace dataset repository.

To run the qualitative config:
python3 run.py \
--model_dir meta-llama/Llama-3.3-70B-Instruct \
--tokenizer meta-llama/Llama-3.3-70B-Instruct \
--draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--dataset speed \
--dataset_path data/speed/qualitative \
--tp_size 8 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--engine TRTLLM \
--concurrency 32 \
--show_progress

To run the throughput_1k config:
python3 run.py \
--model_dir meta-llama/Llama-3.3-70B-Instruct \
--tokenizer meta-llama/Llama-3.3-70B-Instruct \
--draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--dataset speed \
--dataset_path data/speed/throughput_1k \
--tp_size 8 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--engine TRTLLM \
--concurrency 32 \
--show_progress

For longer context (>8192 tokens), please use the following configuration when using TRTLLM:

engine_args:
  max_seq_len: 131072  # Model max context length (for Llama 3.3 70B)
  enable_chunked_prefill: true

Then run the throughput_16k config, passing the long-context configuration via --runtime_params:
python3 run.py \
--model_dir meta-llama/Llama-3.3-70B-Instruct \
--tokenizer meta-llama/Llama-3.3-70B-Instruct \
--draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
--dataset speed \
--dataset_path data/speed/throughput_16k \
--tp_size 8 \
--ep_size 1 \
--draft_length 3 \
--output_length 4096 \
--engine TRTLLM \
--concurrency 32 \
--show_progress \
--runtime_params runtime_args_long_context.yaml

The goal of this benchmark is to provide an easy way to configure, run, and compare speculative decoding implementations across frameworks in an apples-to-apples manner.
This benchmark sends requests in a single-threaded fashion, so running at large concurrency (>256) may result in Python async scheduling delays that skew the metrics.
If larger concurrency is needed, it is recommended to fully deploy the model using vllm serve, python -m sglang.launch_server, or trtllm-serve (for vLLM, SGLang, or TRTLLM respectively) and
use a more robust benchmarking client like NVIDIA AI Perf.
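As a minimal sketch, a standalone vLLM deployment of the Llama 3.3 70B target model could look like the following (speculative-decoding options and tuning flags are omitted and depend on the chosen framework's documentation):

```bash
# Serve the target model with an OpenAI-compatible endpoint, then point an
# external benchmarking client (e.g., NVIDIA AI Perf) at it.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 8 \
  --port 8000
```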