Conversation
Pull request overview
Adds an optional “compact attention mask” mode that represents attention mask inputs as per-batch sequence lengths ([batch_size, 1]) rather than a full binary mask ([batch_size, total_sequence_length]), targeting simpler graphs and improved WebGPU graph-capture behavior.
Changes:
- Introduces a `compact_attention_mask` configuration option (the Python model builder emits it; the C++ config parses and uses it).
- Extends `DeviceInterface` with `UpdateCompactAttentionMask` and implements it for CPU and WebGPU.
- Updates runtime position/mask input handling to create/update the compact mask.
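The difference between the two mask layouts can be sketched with a few lines of NumPy (an illustrative sketch only; the function names are hypothetical, not GenAI APIs, and a right-padded layout is assumed for readability):

```python
import numpy as np

def full_mask(seq_lens, total_seq_len):
    """Standard binary mask: [batch_size, total_seq_len], 1 marks a real token."""
    mask = np.zeros((len(seq_lens), total_seq_len), dtype=np.int32)
    for b, n in enumerate(seq_lens):
        mask[b, :n] = 1  # right-padded layout, for illustration only
    return mask

def compact_mask(seq_lens):
    """Compact mask: [batch_size, 1], each entry is the total sequence length."""
    return np.asarray(seq_lens, dtype=np.int32).reshape(-1, 1)

full = full_mask([3, 5], total_seq_len=8)   # shape (2, 8)
compact = compact_mask([3, 5])              # shape (2, 1)
# The compact form carries the same information as each row-sum of the full form:
assert (full.sum(axis=-1, keepdims=True) == compact).all()
```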
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/webgpu/interface.cpp | Adds WebGPU implementation to upload compact mask values via ORT CopyTensors. |
| src/smartptrs.h | Adds DeviceInterface::UpdateCompactAttentionMask virtual hook. |
| src/python/py/models/builders/base.py | Switches attention_mask input shape in compact mode; adds compact GQA mask reformatting subgraph; writes config flag. |
| src/python/py/models/builder.py | Adds CLI/extra option plumbing + help text for compact_attention_mask. |
| src/models/position_inputs.h | Declares compact mask creation/update helpers. |
| src/models/position_inputs.cpp | Implements compact mask creation + update paths in DefaultPositionInputs. |
| src/cpu/interface.cpp | Implements CPU UpdateCompactAttentionMask (fills lengths). |
| src/config.h | Adds decoder.compact_attention_mask flag to config struct. |
| src/config.cpp | Parses compact_attention_mask from genai_config.json. |
(force-pushed from d0b7db2 to ac3fbfe)
Pull request overview
This PR adds end-to-end support for a “compact attention mask” mode, where the decoder attention mask is represented as [batch_size, 1] (sequence lengths) instead of a full [batch_size, total_sequence_length] binary mask, improving graph simplicity and reducing work—especially for WebGPU graph capture and continuous decoding.
Changes:
- Add a new `model.decoder.compact_attention_mask` config flag and wire it through config parsing and Python model building.
- Extend the device interface with `UpdateCompactAttentionMask`, implementing it for CPU and WebGPU.
- Update runtime position/mask input handling to initialize/update/rewind compact masks (including a WebGPU-friendly fast path for `batch_size == 1`).
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/webgpu/interface.cpp | Implements WebGPU device-side update path for compact attention mask via a single CPU→GPU copy. |
| src/cpu/interface.cpp | Implements CPU write logic for compact attention mask values. |
| src/smartptrs.h | Extends DeviceInterface with UpdateCompactAttentionMask. |
| src/models/position_inputs.h | Declares compact mask initialization/update helpers. |
| src/models/position_inputs.cpp | Adds compact mask initialization, per-step update logic, and rewind handling (incl. WebGPU fast path for batch=1). |
| src/python/py/models/builders/base.py | Adds GQA subgraph generation for compact attention mask mode and overrides attention_mask input shape. |
| src/python/py/models/builder.py | Exposes compact_attention_mask as a validated boolean extra option and documents it in CLI help. |
| src/config.h | Adds compact_attention_mask to the decoder config schema. |
| src/config.cpp | Parses compact_attention_mask from genai_config.json. |
(force-pushed from ce68bb1 to 4d1d66b)
(force-pushed from 4d1d66b to fd35276)
@kunal-vaishnavi @baijumeswani Please take a look, thanks.
```cpp
mask_data[0] = static_cast<T>(prompt_length);
} else {
  // Count non-pad tokens per batch to get the effective sequence length
  const auto* word_id = const_cast<DeviceSpan<int32_t>&>(next_tokens).CpuSpan().data();
```
This seems to duplicate work the ONNX model already performs internally: counting the tokens to get the sequence length. Why not let the ONNX model do this work with a boolean mask of size [batch_size, total_sequence_length]? If the issue is about dtypes, we can always convert the dtype internally while keeping the existing input/output (I/O) structure.
In other words, what is the benefit of doing this? We intentionally picked a boolean mask of shape [batch_size, total_sequence_length] to standardize the I/O structure across models, to support a wide range of attention ops across EPs, and to allow the produced ONNX models to match and run with other solutions such as Optimum and Transformers.js for direct comparisons.
Does graph capture not work with the existing attention mask? If so, then why do we have the following subgraph?
onnxruntime-genai/src/python/py/models/builders/base.py
Lines 4394 to 4412 in 41c4ce1
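The CPU-side counting that the quoted C++ hunk performs amounts to something like the following (a Python sketch; the function name and the right-padded token layout are assumptions for illustration):

```python
import numpy as np

def effective_seq_lens(next_tokens, pad_token_id):
    """Count non-pad tokens per batch row, mirroring the quoted C++ loop."""
    tokens = np.asarray(next_tokens, dtype=np.int32)
    return (tokens != pad_token_id).sum(axis=-1).astype(np.int32)
```

For example, with `pad_token_id=0`, `effective_seq_lens([[5, 7, 9, 0, 0], [1, 2, 3, 4, 5]], 0)` yields per-row lengths `[3, 5]`.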
> Why not let the ONNX model do this work with a boolean mask of size [batch_size, total_sequence_length]?
In the current GenAI logic, we already know total_sequence_length on the CPU, yet we still pass a boolean mask of size [batch_size, total_sequence_length] to the model and make the model derive total_sequence_length again. The purpose of the attention mask is essentially to let the model recover the total sequence length, so why can't we pass total_sequence_length to the model directly? Then the model would not need to recompute total_sequence_length/seqlens_k from a boolean mask. The benefits are:
- We don't need to allocate a big [batch_size, max_length] static tensor to support graph capture, and the model itself can be simplified.
- The interface's `UpdateCompactAttentionMask` method can be implemented easily with the CopyTensors API. For the boolean attention mask, by contrast, I would need CopyTensors with offset support to update part of the mask, and we haven't reached agreement on how to support CopyTensors with offset in ORT.

That's why I want to support the compact attention mask, which resolves both issues. Ideally, the compact attention mask would only be used when batch_size == 1, but we can't forbid users from using it when batch_size > 1, which is why I also added the else branch for that case. What do you think?
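The two update paths described above can be sketched as follows (a hypothetical Python helper, not the actual C++ implementation; `pad_token_id` and the token layout are assumptions):

```python
def update_compact_mask(mask_data, total_seq_len, next_tokens=None, pad_token_id=0):
    """Sketch of the compact-mask update logic described above.

    For batch_size == 1 the already-known total sequence length is written
    directly (the simple fast path); for batch_size > 1 the effective length
    of each row is recovered by counting its non-pad tokens.
    """
    batch_size = len(mask_data)
    if batch_size == 1:
        mask_data[0] = total_seq_len
    else:
        for b in range(batch_size):
            mask_data[b] = sum(1 for t in next_tokens[b] if t != pad_token_id)
    return mask_data
```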
```cpp
std::vector<PipelineModel> pipeline;

bool compact_attention_mask{false};  // When true, attention_mask has shape [batch_size, 1] with the total sequence length value
```
Instead of having an attribute and then adding a boolean here, couldn't we check the shape of the attention mask tensor to know whether it is compact or not? I don't think a user should need to specify this. It should be inferrable from the ONNX model.
The config flag is needed because a [batch_size, 1] attention mask shape is ambiguous at runtime — it could be a compact mask (fixed shape, value = total_seq_len) or a standard binary mask where total_sequence_length happens to be 1.
When enable_webgpu_graph=1 is passed to the builder, compact_attention_mask is enabled automatically unless the user set it explicitly. This ensures graph-capture models use the optimized compact mask by default, while still allowing opt-out via compact_attention_mask=0.
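For reference, the flag lives under the decoder section of genai_config.json that src/config.cpp parses; a model built with the option enabled would carry a fragment like this (surrounding keys elided, fragment shown for illustration):

```json
{
  "model": {
    "decoder": {
      "compact_attention_mask": true
    }
  }
}
```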
(force-pushed from 50289b6 to c9162ba)
Problem
The standard attention mask [batch_size, total_sequence_length] has the following disadvantages for WebGPU graph capture:
1. Large pre-allocation: Graph capture requires all tensor shapes to be constant across runs, so the mask must be pre-allocated as [batch_size, max_length] (e.g., 4096) even when the actual sequence is short (e.g., 100 tokens). This wastes GPU memory.
2. Full buffer copy every step: ORT WebGPU's CopyTensors supports creating a smaller tensor view on an existing buffer, but only starting from offset 0; you can't copy to an arbitrary offset within the buffer. Each decode step appends a 1 at the end of the mask (position = total_seq_len - 1). Since we can't copy starting at that offset, we must copy all previous values from the start, i.e., copy total_seq_len elements every step instead of just the 1 new element. As the sequence grows, this copy grows linearly (100 elements at step 100, 1000 at step 1000, etc.).
Additionally, the ONNX graph contains ReduceSum/ReduceMax subgraphs to derive seqlens_k/total_seq_len from the full binary mask, even though GenAI already knows these values on the CPU.
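Roughly, those subgraphs recompute on the GPU what is already known on the CPU. A NumPy sketch of the reduction (the `seqlens_k = sum - 1` convention is stated here as an assumption about the builder's GQA subgraph):

```python
import numpy as np

def lengths_from_full_mask(mask):
    """What a ReduceSum-style subgraph derives from a full binary mask."""
    seqlens_k = mask.sum(axis=-1).astype(np.int32) - 1  # per-row key lengths (assumed GQA convention)
    total_seq_len = np.int32(mask.shape[-1])            # static total length
    return seqlens_k, total_seq_len

mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]], dtype=np.int32)
seqlens_k, total = lengths_from_full_mask(mask)
# With a compact mask, both values arrive as inputs and no reduction is needed.
```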
Solution
Compact attention mask [batch_size, 1]: a single scalar per batch containing the total sequence length. This eliminates the large buffer, replaces the full-buffer copy with a 4-byte scalar copy, and simplifies the ONNX graph.
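The per-step copy cost difference can be made concrete with back-of-the-envelope arithmetic (assuming int32 mask elements and batch_size == 1; the helper names are illustrative):

```python
def full_mask_copy_bytes(total_seq_len, elem_size=4):
    """Without offset support, each step re-copies the mask from offset 0:
    total_seq_len elements, growing linearly with the sequence."""
    return total_seq_len * elem_size

def compact_mask_copy_bytes(elem_size=4):
    """The compact mask needs only a single scalar copy per step."""
    return elem_size

# At decode step 1000 (total_seq_len == 1000): 4000 bytes vs. a constant 4 bytes.
```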
Result
Graph capture works with continuous decoding. Simpler ONNX graph. No large static mask allocation needed.