blog post | HuggingFace Dataset | arXiv
A benchmark for evaluating LLMs' ability to generate GPU kernels
See blog post and arXiv paper for more details.
We structure the problem as transpilation: given operators described in PyTorch, the LLM writes CUDA kernels to replace them, at whatever level of granularity it chooses.
We construct KernelBench with 4 levels of categories:
- Level 1 🧱: Single-kernel operators (100 problems). The foundational building blocks of neural nets (Convolutions, Matrix multiplies, Layer normalization).
- Level 2 🔗: Simple fusion patterns (100 problems). A fused kernel would be faster than separate kernels (Conv + Bias + ReLU, Matmul + Scale + Sigmoid).
- Level 3 ⚛️: Full model architectures (50 problems). Optimize entire model architectures end-to-end (MobileNet, VGG, MiniGPT, Mamba).
- Level 4 🤗: Hugging Face models. Optimize whole model architectures from HuggingFace.
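To make the setup concrete, here is a minimal, illustrative sketch of a Level 1-style reference operator written in PyTorch (a plain matrix multiply). The class name, helper function, and shapes are assumptions for illustration; see the dataset files under `KernelBench/` for the actual problem format.

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    """Reference PyTorch operator for a Level 1-style task: a single matrix multiply."""
    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return torch.matmul(a, b)

def get_inputs():
    # Randomized inputs used to exercise the operator (illustrative shapes).
    return [torch.randn(1024, 1024, device="cuda"),
            torch.randn(1024, 1024, device="cuda")]
```

The LLM's job is then to emit CUDA kernels (for example, compiled inline from Python) that replace this reference while producing matching outputs.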
To evaluate model-generated kernels, we need to check that they:
- are correct ✅: check against reference torch operators `n_correctness` times on randomized inputs.
- are performant ⏱️: compare against reference torch operators over `n_trial` runs to measure the speedup between runtimes.

Check out `src/eval.py` for details on how we implement the correctness check and timing.
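As a rough illustration of these two checks, here is a simplified sketch; it is not the actual `src/eval.py` implementation, and the function names, tolerances, and timing approach (CUDA events) are assumptions for illustration.

```python
import torch

def check_correctness(ref_fn, kernel_fn, make_inputs, n_correctness=5, atol=1e-4, rtol=1e-4):
    """Compare the candidate kernel against the reference on randomized inputs."""
    for _ in range(n_correctness):
        inputs = make_inputs()
        if not torch.allclose(ref_fn(*inputs), kernel_fn(*inputs), atol=atol, rtol=rtol):
            return False
    return True

def measure_speedup(ref_fn, kernel_fn, make_inputs, n_trial=100):
    """Return reference wall-clock time divided by candidate wall-clock time."""
    inputs = make_inputs()

    def time_fn(fn):
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_trial):
            fn(*inputs)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end)  # milliseconds

    return time_fn(ref_fn) / time_fn(kernel_fn)
```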
We provide a convenient script `scripts/run_and_check.py` to evaluate a single sample source file against a reference source file: it checks correctness and computes the speedup. You can use this to evaluate a model-generated kernel.
Since we need to capture both correctness and performance, we define a metric `fast_p`: the fraction of tasks that are both correct and have a speedup greater than threshold `p`; speedup is computed as the ratio of PyTorch reference wall-clock time to generated kernel time.
Some examples to illustrate this metric, which filters based on speedup:
- `fast_1` is the fraction of tasks where LM-generated kernels are both correct and faster than the PyTorch baseline
- `fast_2` is the fraction of tasks where LM-generated kernels are both correct and at least 2x faster than the PyTorch baseline
- `fast_0` is the fraction of tasks where LM-generated kernels are correct (the same as the correctness rate)
You can increase the speedup threshold `p` to make the task more challenging.
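The snippet below is a minimal sketch of how `fast_p` could be computed from per-task results; it is not the benchmark's own analysis code, and the `TaskResult` structure and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    correct: bool   # did the generated kernel match the reference outputs?
    speedup: float  # PyTorch reference time / generated kernel time

def fast_p(results: list[TaskResult], p: float) -> float:
    """Fraction of tasks that are correct AND have speedup greater than p."""
    if not results:
        return 0.0
    wins = sum(1 for r in results if r.correct and r.speedup > p)
    return wins / len(results)

# Example: fast_0 reduces to the correctness rate, fast_1 requires beating PyTorch.
results = [TaskResult(True, 1.4), TaskResult(True, 0.8), TaskResult(False, 2.0)]
print(fast_p(results, 0.0))  # 2/3 -> correctness rate
print(fast_p(results, 1.0))  # 1/3 -> correct and faster than baseline
```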
We provide a script `scripts/greedy_analysis.py` to compute overall benchmark performance.
We organize the repo into the following structure:
KernelBench/
├── assets/
├── KernelBench/ # Benchmark dataset files
├── src/ # KernelBench logic code
│ ├── unit_tests/
│ ├── prompts/
│ ├── ....
├── scripts/ # helpful scripts to run the benchmark
├── results/ # baseline times across hardware
├── runs/ # where your runs will be stored
conda create --name kernel-bench python=3.10
conda activate kernel-bench
pip install -r requirements.txt
pip install -e .
To call LLM API providers, set the corresponding `{INFERENCE_SERVER_PROVIDER}_API_KEY` environment variable.
Running and profiling kernels requires a GPU. If you don't have a GPU available locally, you can set up Modal. After creating an account, set up your Modal token by running `modal token new`. Then use the `generate_and_eval_single_sample_modal.py` script.
It is easiest to get started with a single problem. The command below will fetch the problem, generate a sample, and evaluate it.
# for example, run level 2 problem 40 from huggingface
python3 scripts/generate_and_eval_single_sample.py dataset_src="huggingface" level=2 problem_id=40
# dataset_src could be "local" or "huggingface"
# add .verbose_logging for more visibility
# 1. Generate responses and store kernels locally to runs/{run_name} directory
python3 scripts/generate_samples.py run_name="test_hf_level_1" dataset_src="huggingface" level="1" num_workers=50 server_type="deepseek" model_name="deepseek-coder" temperature=0
# 2. Evaluate on all generated kernels in runs/{run_name} directory
python3 scripts/eval_from_generations.py run_name="test_hf_level_1" dataset_src="local" level="1" num_gpu_devices=8 timeout=300
You can check out `scripts/greedy_analysis.py` to analyze the eval results.
We provide reference baseline times on a variety of NVIDIA GPUs across generations in `results/timing`.
- Triton Variant (Ongoing)
- Easy-to-use Colab notebook example
- Push button flow on Modal / Cloud Provider
- Integrate with more frameworks, such as ThunderKittens
- Add backward pass
- Integrate with toolchains such as NCU
- NVIDIA - Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling
- METR - Measuring Automated Kernel Engineering
- Sakana AI - AI Cuda Engineer
If you are using KernelBench, we'd love to hear more about it!
MIT. Check `LICENSE.md` for more details.
@misc{ouyang2025kernelbenchllmswriteefficient,
title={KernelBench: Can LLMs Write Efficient GPU Kernels?},
author={Anne Ouyang and Simon Guo and Simran Arora and Alex L. Zhang and William Hu and Christopher Ré and Azalia Mirhoseini},
year={2025},
eprint={2502.10517},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.10517},
}