-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Vectorize [un]pack4x{I, U}8[Clamp]
on spv and msl
#7664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c8e640f
to
ac120e9
Compare
This comment was marked as resolved.
This comment was marked as resolved.
[un]pack4x{I, U}8[Clamp]
on spv and msl[un]pack4x{I, U}8[Clamp]
on spv and msl
a136017
to
659df18
Compare
This comment was marked as resolved.
This comment was marked as resolved.
FWIW, the optimization of |
659df18
to
5ba088c
Compare
I squashed some fixes to make this PR easier to review. Each commit should now separately pass CI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, left a few small comments.
Thank you so much for the review, this is great! I addressed your comments. |
This is blocked on #4582. |
Thanks for addressing the comments, let's land it! |
Ah it would probably be nice to land this without squashing, could you move the code changes in the last commit to their respective commits? |
Emits vectorized SPIR-V code for the WGSL functions `unpack4xI8`, `unpack4xU8`, `pack4xI8`, `pack4xU8`, `pack4xI8Clamp`, and `pack4xU8Clamp` if `Capability::Int8` is available. Exploits the following facts about SPIR-V ops: - `SClamp`, `UClamp`, and `OpUConvert` accept vector arguments, in which case results are computed per component; and - `OpBitcast` can cast between vectors and scalars, with a well-defined bit order that matches that required by the WGSL spec, see below. WGSL spec for `pack4xI8` [1]: > Component e[i] of the input is mapped to bits 8 x i through 8 x i + 7 > of the result. SPIR-V spec for `OpBitcast` [2]: > Within this mapping, any single component of `S` [remark: the type > with fewer but wider components] (mapping to multiple components of > `L` [remark: the type with more but narrower components]) maps its > lower-ordered bits to the lower-numbered components of `L`. [1] https://www.w3.org/TR/WGSL/#pack4xI8-builtin [2] https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpBitcast
Implements more direct conversions between 32-bit integers and 4x8-bit integer vectors using bit casting to/from `packed_[u]char4` when on MSL 2.1+ (older versions of MSL don't seem to support these bit casts). - `unpack4x{I, U}8(x)` becomes `[u]int4(as_type<packed_[u]char4>(x))`; - `pack4x{I, U}8(x)` becomes `as_type<uint>(packed_[u]char4(x))`; and - `pack4x{I, U}8Clamp(x)` becomes `as_type<uint>(packed_uchar4(metal::clamp(x, 0, 255)))`. These bit casts match the WGSL spec for these functions because Metal runs on little-endian machines.
Separates the Vulkan feature sets `VkPhysicalDeviceShaderFloat16Int8Features` and `VkPhysicalDevice16BitStorageFeatures`, which previously were used "together, or not at all". This commit should not change any behavior yet, but I'd like to run full CI tests on it for now. If the CI tests pass, I'll use this separation to enable the `shader_int8` feature separately from the rest of the features to enable optimizations of `[un]pack4x{I,U}8[Clamp]` on SPIR-V.
This allows declaring the SPIR-V capability "Int8", which allows us to generate faster code for `[un]pack4x{I, U}8[Clamp]`.
4d633d2
to
871d295
Compare
No problem, done. |
Thank you! |
Connections
Related to, but independent of:
dot4{I, U}8Packed
on SPIR-V and HLSL #7574dot4{I,U}8Packed
on Metal #7653Description
Emits simpler code (using vectorized ops) for
unpack4xI8
,unpack4xU8
,pack4xI8
,pack4xU8
,pack4xI8Clamp
, andpack4xU8Clamp
on SPIR-V (if capability "Int8" is available) and Metal (with MSL 2.1+).SClamp
,UClamp
, andOpUConvert
accept vector arguments (in which case results are computed per component), and thatOpBitcast
can cast between vectors and scalars, with a well-defined bit order that matches that required by the WGSL spec (see below for details).unpack4x{I, U}8(x)
→[u]int4(as_type<packed_[u]char4>(x))
,pack4x{I, U}8(x)
→as_type<uint>(packed_[u]char4(x))
,pack4xI8Clamp(x)
→as_type<uint>(packed_char4(metal::clamp(x, -128, 127)))
, andpack4xU8Clamp(x)
→as_type<uint>(packed_uchar4(metal::clamp(x, 0, 255)))
.Regarding byte order (this was not relevant for the dot product in #7653 but is relevant now):
Testing
wgpu_gpu::shader::data_builtins::create_unpack4xU8_test
on both the optimized and the unoptimized generated code. I'm not sure how to do this but I'd welcome any hints.Squash or Rebase?
Each commit should pass CI.
Checklist
cargo fmt
.taplo format
.cargo clippy --tests
. If applicable, add:--target wasm32-unknown-unknown
cargo xtask test
to run tests.CHANGELOG.md
entry.