Add rms_norm op + named-data upload + x86 CI (#19893)
@JulianCloudNTH has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106887028.
Summary:

Adds the `et_vk.rms_norm.default` operator to the WebGPU backend (a WGSL compute shader using a cooperative tree reduction, one workgroup per row), fixes constant upload so the op's weight loads correctly, and wires the backend into CI.

The Vulkan serializer that the WebGPU backend reuses stores every non-empty constant (e.g. the rms_norm weight) in the PTE's named-data map with `offset == UINT64_MAX` and a `named_key`, rather than inline in the VK00 blob. `WebGPUGraph::build` previously handled only inline constants, so the weight was never uploaded and the op returned all zeros. `build` now also fetches named-data constants via `NamedDataMap::get_data`, mirroring the path `VulkanBackend` already uses. `aten.add` was unaffected since it has no constant tensors.

The shader mirrors the Vulkan implementation (`backends/vulkan/runtime/graph/ops/impl/RmsNorm.cpp`, `backends/vulkan/runtime/graph/ops/glsl/rms_norm_buffer.glsl`); indexing assumes contiguous fp32 inputs. The handler fails loud (throws, mirroring Vulkan's `VK_CHECK_COND`) on invalid shape/dtype/dispatch-limit conditions, and defaults `eps` to the float32 machine epsilon.

Also adds a simple x86 Linux CI job, mirroring the Vulkan delegate: `backends/test/suite/flows/webgpu.py` plus a `WebGPUTester`, run by `oss/.github/workflows/test-backend-webgpu.yml` on SwiftShader (a software Vulkan adapter, via `wgpu-native`, minimal dependencies, no GPU). Two fixes were needed for SwiftShader's downlevel limits:

- request the adapter's full `requiredLimits` at device creation (software adapters default storage-buffer limits to 0), and
- lower the `add` op `workgroup_size` from 256 to 64 (256 exceeded SwiftShader's 128-invocation cap; the Vulkan delegate uses 64).

Differential Revision: D106887028
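For readers unfamiliar with the reduction pattern the summary names (a cooperative tree reduction, one workgroup per row), here is a minimal CPU-side sketch in Python. It is not the WGSL shader from this PR. It assumes the standard RMS norm definition, `x * weight / sqrt(mean(x^2) + eps)`, and a hypothetical workgroup size of 64 (the PR states that value only for the `add` op). The names `tree_reduce_sum`, `rms_norm_row`, and `rms_norm` are illustrative only.

```python
import math

# Assumption for illustration: the PR gives 64 only as the add op's workgroup size.
WORKGROUP_SIZE = 64
# float32 machine epsilon (2^-23), the default eps described in the summary.
FLOAT32_EPS = 2.0 ** -23


def tree_reduce_sum(shared):
    """Sum a power-of-two-length list the way a workgroup tree reduction would."""
    stride = len(shared) // 2
    while stride > 0:
        # On a GPU these iterations run in parallel, with a barrier after each step.
        for tid in range(stride):
            shared[tid] += shared[tid + stride]
        stride //= 2
    return shared[0]


def rms_norm_row(row, weight, eps=FLOAT32_EPS):
    """Normalize one contiguous row: x * weight / sqrt(mean(x^2) + eps)."""
    width = len(row)
    # Phase 1: each simulated thread accumulates a strided partial sum of squares.
    shared = [0.0] * WORKGROUP_SIZE
    for tid in range(WORKGROUP_SIZE):
        for i in range(tid, width, WORKGROUP_SIZE):
            shared[tid] += row[i] * row[i]
    # Phase 2: cooperative tree reduction over the per-thread partial sums.
    mean_sq = tree_reduce_sum(shared) / width
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Phase 3: scale every element by the shared inverse RMS and its weight.
    return [x * inv_rms * w for x, w in zip(row, weight)]


def rms_norm(rows, weight, eps=FLOAT32_EPS):
    """Process each row independently, as one simulated workgroup per row."""
    return [rms_norm_row(row, weight, eps) for row in rows]
```

In this sketch, a weight vector of all zeros yields an all-zero output, which is consistent with the symptom the summary describes when the weight was never uploaded.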