[webgpu] Flash attention for generation #23808

Merged: 18 commits merged into main from attention_generate_fa on Apr 8, 2025

Conversation

@qjia7 (Contributor) commented Feb 25, 2025

This PR adds flash decoding support to improve generation speed when the total sequence length is large. Previously, once the total sequence length grew large enough, the softmax and softmax * V shaders became the bottleneck because they only used a limited number of GPU cores. With this change, flash decoding splits the present key/value along the total sequence length and then performs a reduce to obtain the final result (see the sketch after the TODO below).

On an NV RTX 2000 Ada, TPS improves from 34.4 to 41.4 at 1K tokens for phi4 with static KV cache.
On Meteor Lake, TPS improves from 16 to 19 at 1K tokens for phi4 with static KV cache.

Side effect of this PR:
It adds two extra buffers: 1) metadata (the max and exp_sum of each split), and 2) the split qkv results with shape [B, N, split_k, H], which increases memory usage.

TODO:
Ideally there should be only two shaders, which would also reduce the intermediate memory: computeQKT could be merged into the split shader, with the final softmax adjustment done in the reduce shader. However, I hit an issue where the result becomes garbage once the total sequence length exceeds a certain value. Since I can't resolve it quickly, I'm leaving it as a TODO to fix in the future.
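
For illustration, here is a minimal NumPy sketch of the split/reduce scheme described above. The shapes, split size, and function name are assumptions made for the example and do not reflect the actual WGSL shader layout in this PR; the sketch only shows how the per-split max/exp_sum metadata and the [B, N, split_k, H] partial results combine into the final output.

```python
# Minimal flash-decoding sketch (assumed shapes/names, not the shader code in this PR).
import numpy as np

def flash_decode(q, k, v, split_size=256):
    # q: [B, N, H] (single generated token), k/v: [B, N, S, H] (present key/value).
    B, N, H = q.shape
    S = k.shape[2]
    scale = 1.0 / np.sqrt(H)
    num_splits = (S + split_size - 1) // split_size

    # The two extra buffers added by this PR, conceptually:
    #   metadata: per-split max and exp_sum        -> [B, N, split_k, 2]
    #   partial softmax*V results per split        -> [B, N, split_k, H]
    meta = np.zeros((B, N, num_splits, 2), dtype=q.dtype)
    partial = np.zeros((B, N, num_splits, H), dtype=q.dtype)

    # Split pass: each split computes a local (unnormalized) softmax over its K/V slice.
    for s in range(num_splits):
        lo, hi = s * split_size, min((s + 1) * split_size, S)
        qk = np.einsum('bnh,bnth->bnt', q, k[:, :, lo:hi]) * scale  # [B, N, t]
        m = qk.max(axis=-1)                       # local max
        p = np.exp(qk - m[..., None])             # unnormalized probabilities
        l = p.sum(axis=-1)                        # local exp_sum
        meta[:, :, s, 0], meta[:, :, s, 1] = m, l
        partial[:, :, s] = np.einsum('bnt,bnth->bnh', p, v[:, :, lo:hi])

    # Reduce pass: rescale each split by exp(local_max - global_max) and combine.
    m_all, l_all = meta[..., 0], meta[..., 1]
    m_global = m_all.max(axis=-1, keepdims=True)  # [B, N, 1]
    alpha = np.exp(m_all - m_global)              # [B, N, split_k]
    out = (partial * alpha[..., None]).sum(axis=2)
    denom = (l_all * alpha).sum(axis=-1, keepdims=True)
    return out / denom
```

Numerically this matches a plain softmax(q·kᵀ·scale)·v over the full sequence; splitting only changes how the work is distributed across GPU cores, not the result.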

@github-actions bot left a comment:


You can commit the suggested changes from lintrunner.

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Feb 26, 2025
1. Only copy the new kv data for static kv cache
2. Add flash decoding for sequence_length = 1
@qjia7 qjia7 force-pushed the attention_generate_fa branch from 6f6d6d1 to f0424fd Compare March 10, 2025 14:10
@qjia7 qjia7 changed the title [WIP] Flash attention for generation [webgpu] Flash attention for generation Mar 11, 2025
@qjia7 qjia7 requested review from sushraja-msft and guschmue March 11, 2025 13:22
@qjia7 qjia7 marked this pull request as ready for review March 11, 2025 13:22
@qjia7 qjia7 marked this pull request as draft March 19, 2025 10:12
@qjia7 qjia7 marked this pull request as ready for review March 19, 2025 13:26
@guschmue (Contributor) commented:

can you merge with main?

@qjia7 (Contributor, Author) commented Mar 21, 2025

can you merge with main?

Done.

This PR is ready for review. Thanks.

@qjia7 (Contributor, Author) left a comment:


Renamed valid_new_present_shape to copy_kv_shape to make it easier to understand. Thanks for your suggestion.

@qjia7 qjia7 requested a review from sushraja-msft March 28, 2025 02:04
sushraja-msft previously approved these changes Apr 4, 2025
guschmue previously approved these changes Apr 4, 2025
@guschmue (Contributor) commented Apr 4, 2025

can you merge with main?

@qjia7 qjia7 dismissed stale reviews from guschmue and sushraja-msft via 191cf41 April 7, 2025 06:19
@qjia7 (Contributor, Author) left a comment:


can you merge with main?

Done

@qjia7 qjia7 requested review from sushraja-msft and guschmue April 7, 2025 06:29
@guschmue guschmue removed the request for review from sushraja-msft April 8, 2025 15:23
@guschmue guschmue merged commit 18f91e5 into main Apr 8, 2025
87 of 89 checks passed
@guschmue guschmue deleted the attention_generate_fa branch April 8, 2025 15:28
quic-zhaoxul pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Apr 17, 2025
Labels: ep:WebGPU (ort-web webgpu provider)
4 participants