
Enable ViT torch.compile + CUDA Graph#33

Open
b-mu wants to merge 155 commits into mlperf-inf-mm-q3vl-v6.0 from reduce-vit-kernel-gaps

Conversation

b-mu commented on Jan 30, 2026

Purpose

After integrating high-performance kernels for ViT attention, we observed that kernel launch overhead remained significant. To improve performance, we add two features:

  • torch.compile(): fuses native kernels, e.g. layernorm and elementwise ops
  • CUDA graphs for the ViT: the image patch grid size varies across samples, so we support three modes (see the sketch after this list):
    • exact match: we capture a default set of frequently used grid sizes,
    • padding: we also capture graphs for a set of bucket sizes and pad the image patches to the nearest bucket,
    • eager: due to memory constraints we can only capture a limited set of grids, so the rest run in eager mode. This applies especially to very large grids, where launch overhead is less noticeable and the benefit of a graph would be negligible.
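
For illustration, the sketch below captures the dispatch policy described above (exact match, then bucket padding, then eager fallback). The function and argument names (`select_cudagraph`, `captured_graphs`, `bucket_sizes`, `max_grid_size`) are hypothetical and do not correspond to actual vLLM identifiers:

```python
# Hypothetical sketch of the grid-dispatch policy; not vLLM's actual code.
from typing import Optional


def select_cudagraph(grid_size: int,
                     captured_graphs: dict,
                     bucket_sizes: list[int],
                     max_grid_size: int) -> tuple[Optional[object], int]:
    """Return (graph, effective_grid_size); graph is None when falling back to eager."""
    # 1. Exact match: a graph was captured for this exact grid size.
    if grid_size in captured_graphs:
        return captured_graphs[grid_size], grid_size

    # 2. Padding: round up to the nearest captured bucket; the image patches
    #    are padded to that size before replay.
    if grid_size <= max_grid_size:
        for bucket in sorted(bucket_sizes):
            if bucket >= grid_size and bucket in captured_graphs:
                return captured_graphs[bucket], bucket

    # 3. Eager: very large or uncaptured grids run without a CUDA graph,
    #    where launch overhead is a smaller fraction of total runtime.
    return None, grid_size
```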

Test Plan

  • Tested end-to-end accuracy with the configuration below
  • Compilation Configs:
    --vllm.cli=--compilation-config='{
      "compile_mm_encoder": true,
      "cudagraph_mm_encoder": true,
      "encoder_cudagraph_verbose": true,
      "encoder_cudagraph_grid_configs": "custom",
      "encoder_cudagraph_max_grid_size": 218,
      "encoder_cudagraph_padded_mode": true,
      "encoder_cudagraph_bucket_sizes": [88, 106, 140, 176, 200],
      "encoder_cudagraph_one_by_one": true
      ...
    }' 
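
For reference, here is a minimal standalone sketch (not the vLLM implementation) of what capturing a compiled ViT block into a CUDA graph for one bucket size looks like, using only public PyTorch APIs. The `hidden` and `grid` sizes and the toy `block` module are illustrative:

```python
import torch

hidden, grid = 1280, 176  # illustrative hidden size and one padded bucket size

# Stand-in for a ViT encoder block; torch.compile fuses layernorm/elementwise ops.
block = torch.nn.Sequential(
    torch.nn.LayerNorm(hidden),
    torch.nn.Linear(hidden, hidden),
    torch.nn.GELU(),
).cuda().eval()
block = torch.compile(block)

static_in = torch.zeros(grid, hidden, device="cuda")

# Warm up on a side stream so compilation and lazy init are not captured.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        block(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture one graph per bucket size; inputs/outputs live in fixed buffers.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_out = block(static_in)

# Replay: copy the (possibly padded) patch embeddings into the static input
# buffer, replay the graph, and read the result from the static output buffer.
new_patches = torch.randn(grid, hidden, device="cuda")
static_in.copy_(new_patches)
graph.replay()
result = static_out.clone()
```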

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.
