
[webgpu] optimize SkipLayerNormalization operator #24164

Merged
merged 2 commits into from
Apr 8, 2025

Conversation

xhcao
Contributor

@xhcao xhcao commented Mar 25, 2025

Description / Motivation and Context

If batch_size and sequence_length are both 1, split hidden_size across workgroups to improve parallelism.
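For reference, the operator's semantics can be sketched in NumPy. This is a sketch based on the ONNX Runtime contrib-op description of SkipLayerNormalization (sum the input, skip, and optional bias, then layer-normalize the sum; the sum itself is the optional input_skip_bias_sum output), not the WebGPU shader code from this PR:

```python
import numpy as np

def skip_layer_norm(x, skip, gamma, beta, bias=None, eps=1e-12):
    """Reference semantics of SkipLayerNormalization (sketch)."""
    # input_skip_bias_sum: elementwise sum of input, skip, and bias.
    s = x + skip + (bias if bias is not None else 0.0)
    # LayerNorm over the last (hidden) dimension of the sum.
    mean = s.mean(axis=-1, keepdims=True)
    var = s.var(axis=-1, keepdims=True)
    out = (s - mean) / np.sqrt(var + eps) * gamma + beta
    return out, s

# Decode-stage shape from this PR: [batch=1, seq=1, hidden=3072].
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 1, 3072)).astype(np.float32)
skip = rng.standard_normal((1, 1, 3072)).astype(np.float32)
gamma = np.ones(3072, dtype=np.float32)
beta = np.zeros(3072, dtype=np.float32)
out, ssum = skip_layer_norm(x, skip, gamma, beta)
```

With gamma=1 and beta=0, each row of `out` has approximately zero mean and unit variance over the hidden dimension.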
@xhcao xhcao force-pushed the skip-norm-layer branch from fc3af9b to 32dd4cd Compare March 25, 2025 08:50
@xhcao
Contributor Author

xhcao commented Mar 25, 2025

In phi3.5, the SkipLayerNormalization operator produces two outputs, output and input_skip_bias_sum, both with shape [batch_size, sequence_length, hidden_size]. During the decoding stage, batch_size and sequence_length are always 1, so the outputs' shapes are [1, 1, 3072] and only one workgroup is dispatched, which uses GPU resources poorly.
For this situation, the PR: 1. splits the hidden dimension across additional workgroups, which adds some total work but reduces the average workload per workgroup; 2. handles output and input_skip_bias_sum in different workgroups. With both changes, 12 workgroups in total are dispatched for the shape [1, 1, 3072].
Capturing the data with the Intel GPA tool shows the kernel time dropping from ~20us to ~10us.
[Screenshots: Intel GPA captures, SkipNL-Before and SkipNL-After]
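The dispatch arithmetic described above can be sketched as follows. This is a minimal sketch of the workgroup-count logic only, not the shader itself; the split factor of 6 partitions per output is an assumption inferred from the 12-workgroup total quoted above, not taken from the PR's actual tuning:

```python
def skip_layer_norm_dispatch(batch_size, sequence_length, hidden_size,
                             has_input_skip_bias_sum=True,
                             partitions_per_output=6):
    """Sketch of the workgroup-count logic described in the PR comment.

    partitions_per_output is hypothetical: with 2 outputs handled in
    separate workgroups and 6 partitions each, the [1, 1, 3072] decode
    case yields the 12 workgroups quoted in the comment.
    """
    if batch_size == 1 and sequence_length == 1:
        # Decode path: split hidden_size so more workgroups can run.
        num_outputs = 2 if has_input_skip_bias_sum else 1
        return num_outputs * partitions_per_output
    # General path: one workgroup per (batch, sequence) row.
    return batch_size * sequence_length

print(skip_layer_norm_dispatch(1, 1, 3072))    # decode: 12 workgroups
print(skip_layer_norm_dispatch(1, 128, 3072))  # prefill: 128 workgroups
```

The trade-off is that splitting the normalization's mean/variance reduction across partitions requires extra cross-workgroup coordination, but for a [1, 1, 3072] dispatch the improved occupancy outweighs that overhead.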
@jchen10 @hujiajie PTAL, thanks

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Mar 25, 2025
@guschmue
Contributor

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

@guschmue
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@guschmue
Contributor

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models


Azure Pipelines successfully started running 2 pipeline(s).

@guschmue
Contributor

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 2 pipeline(s).


Azure Pipelines successfully started running 7 pipeline(s).

@guschmue
Contributor

guschmue commented Apr 1, 2025

lgtm.
The CI pipelines changed - can you merge with main?

@xhcao
Contributor Author

xhcao commented Apr 1, 2025

> lgtm. CI pipelines changed - can you merge with main?

Updated

@guschmue
Contributor

guschmue commented Apr 8, 2025

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 5 pipeline(s).

@guschmue guschmue merged commit 0acb048 into microsoft:main Apr 8, 2025
60 of 69 checks passed
Labels
ep:WebGPU ort-web webgpu provider
2 participants