[metal] synchronize acceleration structure builds using fences#9645
[metal] synchronize acceleration structure builds using fences#9645jtbirdsell wants to merge 1 commit into
Conversation
Metal does not order acceleration structure commands encoded on the same encoder, so place_acceleration_structure_barrier now splits the encoder: it updates an MTLFence, ends the encoder, and the next acceleration structure encoder waits on the fence before encoding anything. read_acceleration_structure_compact_size does the same when the open encoder contains builds, since wgpu-core encodes the size query without a barrier after the build it depends on. Fixes gfx-rs#9215
4fd2754 to
a9778b0
Compare
|
I think that this is a definite improvement, but I am concerned about what happens if the build commands are in separate command encoders. The only guarantee I could find was rather vague:
This seems to have been broken anyway, so I'm not sure it can be trusted. |
Vecvec
left a comment
There was a problem hiding this comment.
The code looks good, I hope to test this within the next day or two.
| // wgpu-core encodes this with no barrier after the build of the | ||
| // acceleration structure being queried, so if the current encoder | ||
| // contains builds, split it; otherwise the size could be read from a | ||
| // still-building acceleration structure. | ||
| if self.state.acceleration_structure_builder_has_builds { | ||
| self.split_acceleration_structure_builder(); | ||
| } |
There was a problem hiding this comment.
Note: This is probably a bug in wgpu-core (probably also #8825). Thanks for noticing this.
|
I've run the tests on my phone (my macbook still doesn't reproduce this), and this does fix the issue. However, there appears to be a race when splitting the encoder. I think it would be ideal if this could also be resolved. |
Connections
Fixes #9215. Related to #9100 (results below). This is option 1 from the discussion in #9215.
Description
On Metal,
place_acceleration_structure_barrierwas an empty no-op, and Metal does not order acceleration structure commands encoded on the sameMTLAccelerationStructureCommandEncoder. A TLAS build could end up consuming a BLAS that was still building, which shows up as garbage intersections in the repro and would be a hang in bigger workloads.Apple's docs rule out fixing this within one encoder ("Don't update a fence and then wait for the same fence within a pass because it can create a GPU deadlock"), so the fix splits the encoder at sync points instead:
place_acceleration_structure_barrierends any open AS encoder after encodingupdateFence:, and the next AS encoder starts withwaitForFence:. It splits unconditionally rather than interpreting the barrier's usage flags. wgpu-core only emits AS barriers where ordering is required, and an encoder is only open when prior AS commands exist, so this never splits needlessly, and it stays correct if core's barrier emission changes later.read_acceleration_structure_compact_sizesplits first if the open encoder contains builds. wgpu-core encodes the size query with no barrier after the build of the structure being queried, so without this the compacted size can be read from a still-building BLAS.MTLFenceis created lazily and reused for the encoder's lifetime, wait first then update within each pass, which is the reuse pattern Apple documents. Creating a fence per split would be a use after free hazard withcommandBufferWithUnretainedReferences, which doesn't retain the fence past encoding.end_encoding/discard_encoding, so a wait can never land in a different command buffer than its update. That could deadlock if buffers are submitted out of order.Both fence methods go back to macOS 11 / iOS 14, the same availability as acceleration structures themselves, so Intel Macs are fine (unlike the Metal 26
barrier(afterQueueStages:beforeStages:)alternative).Testing
On an M3 MacBook Pro (10 core GPU, hardware RT):
cargo xtask test: same results before and after (945 passed; the 4 failures are naga SPV snapshots and Metal shader passthrough, pre-existing on trunk on my machine and unrelated). The ray_tracing group passes 42/42 including theblas_compactiontests, which exercise the compact size path.cargo xtask cts --backend metal: no regressions. One pre-existingmaxStorageBufferBindingSize:validatefailure, also present on unpatched trunk.Squash or Rebase?
Squash.
Checklist
wgpumay be affected behaviorally.CHANGELOG.mdentries for the user-facing effects of this change are present.