[ET-VK] Implement aten.pixel_shuffle.default op by pytorchbot · Pull Request #19438 · pytorch/executorch

pytorchbot · 2026-05-09T04:58:11Z

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #19404 by @SS-JIA
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/531/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/531/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/527/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/531/orig
Differential Revision: D104462059
@diff-train-skip-merge

The previous commit on this stack added the fused `q8ta_pixel_shuffle` custom op and, to make pattern matching easier, added `aten.pixel_shuffle.default` to the partitioner's `ops_to_not_decompose` list. That change had a side effect: any non-quantized model that uses `aten.pixel_shuffle.default` now reaches the Vulkan backend with the op intact, but the backend had no implementation registered for it, so those models fail to lower.

This commit adds a layout- and dtype-agnostic implementation of `aten.pixel_shuffle.default` so existing models keep working. The implementation rearranges `(N, C*r*r, H, W)` -> `(N, C, H*r, W*r)`, where output element `(n, c, h_out, w_out)` reads from input element `(n, c*r*r + (h_out%r)*r + (w_out%r), h_out/r, w_out/r)` (integer division).

Two compute shaders are added because the work-assignment paradigm differs between storage types:

- `pixel_shuffle_buffer.glsl` assigns one thread per output element and uses `linear_idx_to_tensor_idx` against the output `BufferMetadata`, which makes it agnostic to the underlying `dim_order`.
- `pixel_shuffle_texture.glsl` assigns one thread per output texel and uses `TextureMetadata` plus `indexing.glslh` helpers so the same shader handles channels-, width-, and height-packed layouts.

The texture shader uses the `safe_idx` / `safe_set` if/else helpers everywhere a UBO-backed `ivec4` is indexed by a spec-constant-derived value, to avoid the Adreno 740 SPIR-V compiler crash on `ubo_struct.sizes[spec_const]` when the spec const resolves to 1 or 2. The buffer shader does not dynamically index any UBO `ivec4`.

Op registration: `register_pixel_shuffle()` in `op_registry.py` uses `ANY_STORAGE`, `FP_T`, and `supports_resize=True`, so the partitioner accepts both storage types and both fp32/fp16, across all packed layouts.

Differential Revision: [D104462059](https://our.internmc.facebook.com/intern/diff/D104462059/)
ghstack-source-id: 379519849
Pull Request resolved: #19404
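The index mapping in the description can be checked on the host with a small NumPy reference. This is only a sketch for illustration, not the shader code from this PR: the function names `pixel_shuffle_indexed` and `pixel_shuffle_reshape` are hypothetical, and the second function uses the familiar view -> permute -> view decomposition as an independent cross-check.

```python
import numpy as np


def pixel_shuffle_indexed(x, r):
    """Per-element gather using the mapping stated in the PR description."""
    n, c_in, h, w = x.shape
    assert c_in % (r * r) == 0, "input channels must be divisible by r*r"
    c = c_in // (r * r)
    out = np.empty((n, c, h * r, w * r), dtype=x.dtype)
    for ni in range(n):
        for ci in range(c):
            for h_out in range(h * r):
                for w_out in range(w * r):
                    # Source channel: c*r*r + (h_out % r)*r + (w_out % r)
                    c_src = ci * r * r + (h_out % r) * r + (w_out % r)
                    out[ni, ci, h_out, w_out] = x[ni, c_src, h_out // r, w_out // r]
    return out


def pixel_shuffle_reshape(x, r):
    """Same result via reshape -> transpose -> reshape (the decomposed form)."""
    n, c_in, h, w = x.shape
    c = c_in // (r * r)
    y = x.reshape(n, c, r, r, h, w)
    y = y.transpose(0, 1, 4, 2, 5, 3)  # (n, c, h, r, w, r)
    return y.reshape(n, c, h * r, w * r)
```

Both functions should agree for any input whose channel count is a multiple of `r*r`, which gives a quick way to sanity-check the formula independently of storage type or packed layout.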

pytorch-bot · 2026-05-09T04:58:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19438

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

Long GPU queue (g5, g6) on LF fleet

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…nels-packed int8 tensors

Pull Request resolved: #19397

A RefineNet segmentation model spends ~860 us (~17% of inference) on the textbook decomposed PyTorch PixelShuffle chain (q8ta_dequantize -> view -> permute -> view -> q8ta_quantize) repeated four times in the segmentation head. This is wasteful: it materializes three buffers and round-trips through fp32 just to perform what is fundamentally a byte permutation on an int8 tensor.

This diff introduces et_vk.q8ta_pixel_shuffle.default, a single fused kernel that operates directly on int8x4 packed buffers. Each thread writes one output int32 word (= 4 consecutive output channels at one (n, oh, ow) spatial position). Dispatch is 1D over total output int words, sized as N * div_up_4(C_out) * H_out * W_out with a 64-thread local workgroup.

The four channel lanes inside an output int come from four different input int words (input channels are spaced by r*r), so each thread issues four input loads. The (oh % r, ow % r) -> input lane mapping is constant for a given thread because all four output lanes share (oh, ow). The first byte index is computed via the layout-aware helper tensor4d_idx_to_buf_idx; subsequent lanes derive their byte index by adding stride[packed_dim] * block_numel, a layout-only constant, so only one helper call is needed per thread.

When input/output share scale and zero-point (the typical residual-path case), the requantize math is skipped and the kernel becomes a pure byte shuffle (selected via the passthrough push constant).

The op accepts the channels-packed PACKED_INT8 family (PACKED_INT8_4W4C, PACKED_INT8_4C1W, PACKED_INT8_CONV2D) on both input and output. The partitioner routes the op into whichever channels-packed layout the surrounding q8ta_conv2d_pw / q8ta_add ops produce/consume (PACKED_INT8_4W4C on RefineNet). Restricting to the channels-packed family means the inner block axis is always C and the lane within an int word is constant per thread, which removes the need for layout-block-config spec consts in the shader.

Rather than matching the decomposed view -> permute -> view chain after to_edge lowering, this diff preserves aten.pixel_shuffle.default through to_edge by adding it to the partitioner's ops_to_not_decompose list. The matcher then operates on the much simpler dq -> [clone] -> aten.pixel_shuffle.default -> [clone] -> q form. This keeps the matcher robust against edge-dialect / clone-insertion variations.

Pieces in this diff:

- Partitioner / fuser:
  - partitioner/vulkan_partitioner.py: adds aten.pixel_shuffle.default to ops_to_not_decompose so the framework preserves the op through to_edge lowering.
  - patterns/quantized_pixel_shuffle.py: detects dq -> [clone] -> aten.pixel_shuffle.default -> [clone] -> q and rewrites it to et_vk.q8ta_pixel_shuffle.default. Transparently skips clone / _clone_dim_order nodes between any pair of nodes.
- Runtime kernel:
  - runtime/graph/ops/glsl/q8ta_pixel_shuffle.glsl + .yaml
  - runtime/graph/ops/impl/Q8taPixelShuffle.cpp + .h
- Op definitions:
  - custom_ops_lib.py: register et_vk.q8ta_pixel_shuffle (Python op definition).
  - op_registry.py: inputs_storage = utils.PACKED_INT8_CHANNELS_PACKED_BUFFER.
- Tests:
  - test/custom_ops/impl/TestQ8taPixelShuffle.cpp: test op that runs q -> [fused | unfused chain] -> dq, with selectable input/output int8 layouts via str args. The op accepts the channels-packed family; the layout_from_string helper currently exercises 4W4C.
  - test/custom_ops/test_q8ta_pixel_shuffle.cpp: 16 ACCU + 8 PERF cases (4 shapes x 2 qparam settings x 2 impl_selectors x 1 layout combination, 4W4C -> 4W4C).
  - test/test_vulkan_passes.py: positive and negative pattern-matcher unit tests against the un-decomposed form.

ghstack-source-id: 379519848
@exported-using-ghexport
Differential Revision: [D104099055](https://our.internmc.facebook.com/intern/diff/D104099055/)
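The one-thread-per-output-word scheme described above can be modeled on the host in a few lines of NumPy. This is a rough sketch under stated assumptions, not the GLSL kernel: it covers only the passthrough case (no requantization), it stores words as a logical `(N, C4, H, W, 4)` array rather than any real PACKED_INT8 memory layout, the order in which a linear thread id is decoded into `(n, cb, oh, ow)` is an assumption, and `pack_channels`, `fused_pixel_shuffle_words`, and `reference_pixel_shuffle` are hypothetical names.

```python
import numpy as np


def div_up_4(x):
    return (x + 3) // 4


def pack_channels(x):
    """Zero-pad C to a multiple of 4 and group lanes: (N, C4, H, W, 4)."""
    n, c, h, w = x.shape
    c4 = div_up_4(c)
    padded = np.zeros((n, c4 * 4, h, w), dtype=np.int8)
    padded[:, :c] = x
    return padded.reshape(n, c4, 4, h, w).transpose(0, 1, 3, 4, 2)


def reference_pixel_shuffle(x, r):
    """Unfused reference: reshape -> transpose -> reshape."""
    n, c_in, h, w = x.shape
    c = c_in // (r * r)
    y = x.reshape(n, c, r, r, h, w).transpose(0, 1, 4, 2, 5, 3)
    return y.reshape(n, c, h * r, w * r)


def fused_pixel_shuffle_words(words_in, c_out, r):
    """Each simulated thread fills one output word (4 channel lanes)."""
    n, _, h, w, _ = words_in.shape
    c4_out, h_out, w_out = div_up_4(c_out), h * r, w * r
    words_out = np.zeros((n, c4_out, h_out, w_out, 4), dtype=np.int8)
    # 1D dispatch size: N * div_up_4(C_out) * H_out * W_out
    num_threads = n * c4_out * h_out * w_out
    for tid in range(num_threads):
        # Assumed decode order of the linear thread id.
        ow = tid % w_out
        oh = (tid // w_out) % h_out
        cb = (tid // (w_out * h_out)) % c4_out
        ni = tid // (w_out * h_out * c4_out)
        # Shared by all four lanes because they share (oh, ow).
        offset = (oh % r) * r + (ow % r)
        for lane in range(4):
            co = cb * 4 + lane
            if co >= c_out:
                continue  # padding lane stays zero
            ci = co * r * r + offset  # input channels are spaced by r*r
            words_out[ni, cb, oh, ow, lane] = words_in[ni, ci // 4, oh // r, ow // r, ci % 4]
    return words_out, num_threads
```

Packing the output of `reference_pixel_shuffle` with `pack_channels` should give the same words as `fused_pixel_shuffle_words`, which mirrors the fused-versus-unfused comparison the test op performs.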

pytorchbot requested a review from SS-JIA as a code owner May 9, 2026 04:58

meta-cla Bot added the CLA Signed label May 9, 2026

SS-JIA approved these changes May 9, 2026


SS-JIA merged commit 7e11a2a into gh/SS-JIA/527/orig May 9, 2026
52 of 53 checks passed

SS-JIA deleted the gh/SS-JIA/531/orig branch May 9, 2026 05:38

SS-JIA merged 2 commits into gh/SS-JIA/527/orig from gh/SS-JIA/531/orig
