
Using max_attention_window (VSWA) reduces concurrent batch size and causes drop in throughput (gemma3 trt backend) #6503

@lkm2835

Description

System Info

GPU: NVIDIA A100, NVIDIA H100
TensorRT-LLM version: 1.0.0rc5
TensorRT-LLM commit: b3ca159

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Build: engine built following the gemma3 guide

Serve: trtllm-serve with max_attention_window = [512, 512, 512, 512, 512, 3100]
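For completeness, the serve setup was roughly as follows. This is a minimal sketch, not the exact command: the model path is a placeholder, and it assumes max_attention_window is passed through the kv_cache_config section of a YAML file given to trtllm-serve via --extra_llm_api_options.

    # extra-options.yaml (assumed file name and key layout)
    kv_cache_config:
      max_attention_window: [512, 512, 512, 512, 512, 3100]

    trtllm-serve <path/to/gemma3-engine> \
        --extra_llm_api_options extra-options.yaml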

Expected behavior

When the sequence length is shorter than the minimum attention window, the concurrent batch size should stay the same as when max_attention_window is not set.
When the sequence length is longer, the sliding-window layers cache fewer tokens, so the concurrent batch size should increase.
This works as expected in vLLM.
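To make the expectation concrete, here is a back-of-the-envelope sketch (illustrative Python only, not TensorRT-LLM code). It assumes each layer caches min(seq_len, window) tokens, that the 6-entry window list simply repeats across the model's layers, and it ignores KV-block granularity and paging overhead.

    windows = [512, 512, 512, 512, 512, 3100]

    def vswa_tokens(seq_len, windows):
        # With VSWA, each layer only needs to cache up to its window size.
        return sum(min(seq_len, w) for w in windows)

    def full_tokens(seq_len, windows):
        # Without max_attention_window, every layer caches the full sequence.
        return seq_len * len(windows)

    for seq_len in (400, 4096):
        vswa, full = vswa_tokens(seq_len, windows), full_tokens(seq_len, windows)
        print(f"seq_len={seq_len}: VSWA={vswa} vs full={full} cached tokens "
              f"per layer group ({full / vswa:.2f}x)")

    # seq_len=400 : identical KV footprint, so the concurrent batch size should not shrink.
    # seq_len=4096: VSWA needs ~4.3x fewer cached tokens, so more sequences should fit.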

Actual behavior

However, with max_attention_window set, the concurrent batch size decreases in both cases, resulting in a significant drop in throughput.

Additional notes

This behavior has been present since the referenced commit.

Metadata

Assignees

No one assigned

    Labels

    Performance (TRTLLM model inference speed, throughput, efficiency; latency, benchmarks, regressions, opts)
    bug (Something isn't working)
