Vectorize `[un]pack4x{I, U}8[Clamp]` on spv and msl #7664

robamler · 2025-05-03T19:19:44Z

Connections
Related to, but independent of:

Description
Emits simpler code (using vectorized ops) for unpack4xI8, unpack4xU8, pack4xI8, pack4xU8, pack4xI8Clamp, and pack4xU8Clamp on SPIR-V (if capability "Int8" is available) and Metal (with MSL 2.1+).

on SPIR-V (if capability "Int8" is available): exploit that SClamp, UClamp, and OpUConvert accept vector arguments (in which case results are computed per component), and that OpBitcast can cast between vectors and scalars, with a well-defined bit order that matches that required by the WGSL spec (see below for details).
on Metal (MSL 2.1+ only): translate (see shader playground)
- unpack4x{I, U}8(x) → [u]int4(as_type<packed_[u]char4>(x)),
- pack4x{I, U}8(x) → as_type<uint>(packed_[u]char4(x)),
- pack4xI8Clamp(x) → as_type<uint>(packed_char4(metal::clamp(x, -128, 127))), and
- pack4xU8Clamp(x) → as_type<uint>(packed_uchar4(metal::clamp(x, 0, 255))).

Regarding byte order (this was not relevant for the dot product in #7653 but is relevant now):

WGSL states:

Component e[i] of the input is mapped to bits 8 x i through 8 x i + 7 of the result.
This corresponds to little-endian byte order, which is what Apple uses for Metal.
SPIR-V states the same byte order in more complicated words:

Within this mapping, any single component of S [remark: a 32-bit int in our case] (mapping to multiple components of L [remark: a 4-vector of 8-bit ints in our case]) maps its lower-ordered bits to the lower-numbered components of L

Testing

SPIR-V part was already covered by snapshot tests.
Added a snapshot test for MSL 2.1+
It would be nice to add tests for correctness (rather than just validity) of the MSL and SPIR-V output here. I.e., run something like the test in wgpu_gpu::shader::data_builtins::create_unpack4xU8_test on both the optimized and the unoptimized generated code. I'm not sure how to do this but I'd welcome any hints.

Squash or Rebase?
Each commit should pass CI.

Checklist

Run cargo fmt.
Run taplo format.
Run cargo clippy --tests. If applicable, add:
- --target wasm32-unknown-unknown
Run cargo xtask test to run tests.
If this contains user-facing changes, add a CHANGELOG.md entry.

robamler · 2025-05-04T22:30:08Z

FWIW, the optimization of pack4xU8 speeds up our motivating use case (same as in #7595 (comment)) by between 1.1% and 1.7%, with a standard error of at most 0.06% across all benchmarks. So the improvement here is much smaller than what we found for dot4I8Packed, but it's definitely statistically significant. Curiously, the impact of the optimizations of unpack4xU8 is within the error bars, even though this function is also used in our shader, and I verified manually that the vectorized code for it gets emitted. These are very ad-hoc benchmarks, and I ran them only on SPIR-V for now.

robamler · 2025-05-20T10:18:49Z

I squashed some fixes to make this PR easier to review. Each commit should now separately pass CI.

teoxoy

Looks good overall, left a few small comments.

wgpu-hal/src/vulkan/adapter.rs

naga/src/back/msl/writer.rs

naga/src/back/spv/block.rs

robamler · 2025-05-22T22:21:22Z

Thank you so much for the review, this is great! I addressed your comments.

teoxoy · 2025-05-23T11:09:35Z

It would be nice to add tests for correctness (rather than just validity) of the MSL and SPIR-V output here. I.e., run something like the test in wgpu_gpu::shader::data_builtins::create_unpack4xU8_test on both the optimized and the unoptimized generated code. I'm not sure how to do this but I'd welcome any hints.

This is blocked on #4582.

teoxoy · 2025-05-23T11:09:53Z

Thanks for addressing the comments, let's land it!

teoxoy · 2025-05-23T11:12:28Z

Ah it would probably be nice to land this without squashing, could you move the code changes in the last commit to their respective commits?

Emits vectorized SPIR-V code for the WGSL functions `unpack4xI8`, `unpack4xU8`, `pack4xI8`, `pack4xU8`, `pack4xI8Clamp`, and `pack4xU8Clamp` if `Capability::Int8` is available. Exploits the following facts about SPIR-V ops: - `SClamp`, `UClamp`, and `OpUConvert` accept vector arguments, in which case results are computed per component; and - `OpBitcast` can cast between vectors and scalars, with a well-defined bit order that matches that required by the WGSL spec, see below. WGSL spec for `pack4xI8` [1]: > Component e[i] of the input is mapped to bits 8 x i through 8 x i + 7 > of the result. SPIR-V spec for `OpBitcast` [2]: > Within this mapping, any single component of `S` [remark: the type > with fewer but wider components] (mapping to multiple components of > `L` [remark: the type with more but narrower components]) maps its > lower-ordered bits to the lower-numbered components of `L`. [1] https://www.w3.org/TR/WGSL/#pack4xI8-builtin [2] https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpBitcast

Implements more direct conversions between 32-bit integers and 4x8-bit integer vectors using bit casting to/from `packed_[u]char4` when on MSL 2.1+ (older versions of MSL don't seem to support these bit casts). - `unpack4x{I, U}8(x)` becomes `[u]int4(as_type<packed_[u]char4>(x))`; - `pack4x{I, U}8(x)` becomes `as_type<uint>(packed_[u]char4(x))`; and - `pack4x{I, U}8Clamp(x)` becomes `as_type<uint>(packed_uchar4(metal::clamp(x, 0, 255)))`. These bit casts match the WGSL spec for these functions because Metal runs on little-endian machines.

Separates the Vulkan feature sets `VkPhysicalDeviceShaderFloat16Int8Features` and `VkPhysicalDevice16BitStorageFeatures`, which previously were used "together, or not at all". This commit should not change any behavior yet, but I'd like to run full CI tests on it for now. If the CI tests pass, I'll use this separation to enable the `shader_int8` feature separately from the rest of the features to enable optimizations of `[un]pack4x{I,U}8[Clamp]` on SPIR-V.

This allows declaring the SPIR-V capability "Int8", which allows us to generate faster code for `[un]pack4x{I, U}8[Clamp]`.

robamler · 2025-05-23T13:46:54Z

No problem, done.

teoxoy · 2025-05-23T14:06:08Z

Thank you!

robamler added a commit to robamler/wgpu that referenced this pull request May 3, 2025

Add changelog entry for gfx-rs#7664

f6e7693

robamler force-pushed the optimize-pack4x8 branch 2 times, most recently from c8e640f to ac120e9 Compare May 3, 2025 21:16

This comment was marked as resolved.

Sign in to view

robamler requested a review from a team as a code owner May 4, 2025 18:57

robamler changed the title ~~[naga] Vectorize [un]pack4x{I, U}8[Clamp] on spv and msl~~ Vectorize [un]pack4x{I, U}8[Clamp] on spv and msl May 4, 2025

robamler force-pushed the optimize-pack4x8 branch from ced786f to fcb58b3 Compare May 4, 2025 20:40

robamler added a commit to robamler/wgpu that referenced this pull request May 4, 2025

Add changelog entry for gfx-rs#7664

c22e8fa

robamler force-pushed the optimize-pack4x8 branch 2 times, most recently from a136017 to 659df18 Compare May 4, 2025 21:48

This comment was marked as resolved.

Sign in to view

robamler added a commit to robamler/wgpu that referenced this pull request May 20, 2025

Add changelog entry for gfx-rs#7664

14d5916

robamler force-pushed the optimize-pack4x8 branch from 659df18 to 5ba088c Compare May 20, 2025 09:55

robamler mentioned this pull request May 20, 2025

Potentially optimize dot4{I,U}8Packed on Metal #7653

Merged

6 tasks

cwfitzgerald assigned teoxoy May 21, 2025

teoxoy reviewed May 21, 2025

View reviewed changes

wgpu-hal/src/vulkan/adapter.rs Outdated Show resolved Hide resolved

naga/src/back/msl/writer.rs Outdated Show resolved Hide resolved

naga/src/back/spv/block.rs Outdated Show resolved Hide resolved

robamler requested a review from teoxoy May 22, 2025 22:21

teoxoy approved these changes May 23, 2025

View reviewed changes

robamler added 5 commits May 23, 2025 15:34

Add changelog entry for gfx-rs#7664

9569fa0

[wgpu-hal] Add PrivateCapabilities::shader_int8 on Vulkan

871d295

This allows declaring the SPIR-V capability "Int8", which allows us to generate faster code for `[un]pack4x{I, U}8[Clamp]`.

robamler force-pushed the optimize-pack4x8 branch from 4d633d2 to 871d295 Compare May 23, 2025 13:43

robamler requested a review from teoxoy May 23, 2025 13:56

teoxoy merged commit ff29165 into gfx-rs:trunk May 23, 2025
38 checks passed

teoxoy pushed a commit that referenced this pull request May 23, 2025

Add changelog entry for #7664

1be38fa

robamler deleted the optimize-pack4x8 branch May 23, 2025 19:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Vectorize `[un]pack4x{I, U}8[Clamp]` on spv and msl #7664

Vectorize `[un]pack4x{I, U}8[Clamp]` on spv and msl #7664

Uh oh!

robamler commented May 3, 2025 •

edited

Loading

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

robamler commented May 4, 2025

Uh oh!

robamler commented May 20, 2025

Uh oh!

teoxoy left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

robamler commented May 22, 2025 •

edited

Loading

Uh oh!

teoxoy commented May 23, 2025

Uh oh!

teoxoy commented May 23, 2025

Uh oh!

teoxoy commented May 23, 2025

Uh oh!

robamler commented May 23, 2025

Uh oh!

teoxoy commented May 23, 2025

Uh oh!

Uh oh!

Uh oh!

Vectorize [un]pack4x{I, U}8[Clamp] on spv and msl #7664

Vectorize [un]pack4x{I, U}8[Clamp] on spv and msl #7664

Uh oh!

Conversation

robamler commented May 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

robamler commented May 4, 2025

Uh oh!

robamler commented May 20, 2025

Uh oh!

teoxoy left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

robamler commented May 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

teoxoy commented May 23, 2025

Uh oh!

teoxoy commented May 23, 2025

Uh oh!

teoxoy commented May 23, 2025

Uh oh!

robamler commented May 23, 2025

Uh oh!

teoxoy commented May 23, 2025

Uh oh!

Uh oh!

Uh oh!

Vectorize `[un]pack4x{I, U}8[Clamp]` on spv and msl #7664

Vectorize `[un]pack4x{I, U}8[Clamp]` on spv and msl #7664

robamler commented May 3, 2025 •

edited

Loading

robamler commented May 22, 2025 •

edited

Loading