[metal] synchronize acceleration structure builds using fences #9645

Open
jtbirdsell wants to merge 1 commit into gfx-rs:trunk from jtbirdsell:fix/metal-as-sync

@jtbirdsell

Connections

Fixes #9215. Related to #9100 (results below). This is option 1 from the discussion in #9215.

Description

On Metal, place_acceleration_structure_barrier was a no-op, and Metal does not order acceleration structure commands encoded on the same MTLAccelerationStructureCommandEncoder. As a result, a TLAS build could end up consuming a BLAS that was still building, which shows up as garbage intersections in the repro and would be a hang in bigger workloads.

Apple's docs rule out fixing this within one encoder ("Don't update a fence and then wait for the same fence within a pass because it can create a GPU deadlock"), so the fix splits the encoder at sync points instead:

  • place_acceleration_structure_barrier ends any open AS encoder after encoding updateFence:, and the next AS encoder starts with waitForFence:. It splits unconditionally rather than interpreting the barrier's usage flags. wgpu-core only emits AS barriers where ordering is required, and an encoder is only open when prior AS commands exist, so this never splits needlessly, and it stays correct if core's barrier emission changes later.
  • read_acceleration_structure_compact_size splits first if the open encoder contains builds. wgpu-core encodes the size query with no barrier after the build of the structure being queried, so without this the compacted size can be read from a still-building BLAS.
  • One MTLFence is created lazily and reused for the encoder's lifetime. Each pass waits first and then updates, which is the reuse pattern Apple documents. Creating a fence per split would be a use-after-free hazard with commandBufferWithUnretainedReferences, which doesn't retain the fence past encoding.
  • The pending wait flag is cleared at end_encoding/discard_encoding, so a wait can never land in a different command buffer than its update. A wait in a different command buffer could deadlock if buffers are submitted out of order.
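
For reviewers who want the control flow without reading the diff, here is a minimal sketch of the split logic described in the list above, written as a toy model that only records the order of commands. Everything in it is hypothetical and for illustration only: `ToyEncoder`, `Cmd`, the `split` helper, and the field names are not the actual wgpu-hal types, and no Metal calls are made.

```rust
/// Commands recorded by the toy model, in encoding order.
#[derive(Debug, PartialEq)]
enum Cmd {
    BeginPass,
    WaitForFence,
    Build(&'static str),
    WriteCompactSize(&'static str),
    UpdateFence,
    EndPass,
}

/// Hypothetical stand-in for the encoder state (not wgpu-hal's real types).
#[derive(Default)]
struct ToyEncoder {
    log: Vec<Cmd>,
    pass_open: bool,
    pass_has_builds: bool,
    fences_created: u32, // models the single lazily created MTLFence
    pending_wait: bool,  // the next AS pass must begin with a fence wait
}

impl ToyEncoder {
    /// Open an AS pass if none is open; a pending wait goes first in the pass.
    fn ensure_pass(&mut self) {
        if self.pass_open {
            return;
        }
        self.log.push(Cmd::BeginPass);
        self.pass_open = true;
        self.pass_has_builds = false;
        if self.pending_wait {
            self.log.push(Cmd::WaitForFence);
            self.pending_wait = false;
        }
    }

    /// End the open pass after updating the fence; no-op if no pass is open.
    fn split(&mut self) {
        if !self.pass_open {
            return;
        }
        if self.fences_created == 0 {
            self.fences_created += 1; // created once, then reused
        }
        self.log.push(Cmd::UpdateFence);
        self.log.push(Cmd::EndPass);
        self.pass_open = false;
        self.pending_wait = true;
    }

    fn build(&mut self, name: &'static str) {
        self.ensure_pass();
        self.log.push(Cmd::Build(name));
        self.pass_has_builds = true;
    }

    /// Splits unconditionally, without looking at usage flags.
    fn place_acceleration_structure_barrier(&mut self) {
        self.split();
    }

    /// Splits first only when the open pass already contains builds.
    fn read_compact_size(&mut self, name: &'static str) {
        if self.pass_open && self.pass_has_builds {
            self.split();
        }
        self.ensure_pass();
        self.log.push(Cmd::WriteCompactSize(name));
    }

    /// Clears the pending wait so it cannot leak into another command buffer.
    fn end_encoding(&mut self) {
        if self.pass_open {
            self.log.push(Cmd::EndPass);
            self.pass_open = false;
        }
        self.pending_wait = false;
    }
}
```

In this model, encoding a BLAS build, a barrier, and a TLAS build produces two passes: the first ends with the fence update, and the second begins with the fence wait before the TLAS build.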

Both fence methods go back to macOS 11 / iOS 14, the same availability as acceleration structures themselves, so Intel Macs are fine (unlike the Metal 26 barrier(afterQueueStages:beforeStages:) alternative).

Testing

On an M3 MacBook Pro (10 core GPU, hardware RT):

  • https://github.com/Vecvec/macos-ray-tracing-test pointed at this branch: 20/20 runs failed before this change, 0/100 after. On this hardware the race isn't even intermittent, and only the cases that build BLAS and TLAS in the same submission fail.
  • cargo xtask test: same results before and after (945 passed; the 4 failures are naga SPV snapshots and Metal shader passthrough, pre-existing on trunk on my machine and unrelated). The ray_tracing group passes 42/42 including the blas_compaction tests, which exercise the compact size path.
  • cargo xtask cts --backend metal: no regressions. One pre-existing maxStorageBufferBindingSize:validate failure, also present on unpatched trunk.
  • Re #9100 (Ray tracing example tests failing on metal): the ray tracing example tests all pass on this M3 at current trunk, both with and without this change, so I couldn't reproduce that one. Noted on the issue.

Squash or Rebase?

Squash.

Checklist

  • I self-reviewed and fully understand this PR.
  • WebGPU implementations built with wgpu may be affected behaviorally.
  • Validation and feature gates are in place to confine behavioral changes.
  • Tests demonstrate the validation and altered logic works.
  • CHANGELOG.md entries for the user-facing effects of this change are present.
  • The PR is minimal, and doesn't make sense to land as multiple PRs.
  • Commits are logically scoped and individually reviewable.
  • The PR description has enough context to understand the motivation and solution implemented.

Metal does not order acceleration structure commands encoded on the
same encoder, so place_acceleration_structure_barrier now splits the
encoder: it updates an MTLFence, ends the encoder, and the next
acceleration structure encoder waits on the fence before encoding
anything. read_acceleration_structure_compact_size does the same when
the open encoder contains builds, since wgpu-core encodes the size
query without a barrier after the build it depends on.

Fixes gfx-rs#9215

Vecvec commented Jun 9, 2026
Contributor

I think that this is a definite improvement, but I am concerned about what happens if the build commands are in separate command encoders. The only guarantee I could find was rather vague:

"As much as possible, the perceived order in which Metal executes the commands is the same as the way you order them. Although Metal might reorder some of your commands before processing them, this usually only occurs when there’s a performance gain and no other perceivable impact."

This seems to have been broken anyway, so I'm not sure it can be trusted.

@Vecvec Vecvec left a comment

Contributor

The code looks good, I hope to test this within the next day or two.

Comment on lines +1899 to +1905
// wgpu-core encodes this with no barrier after the build of the
// acceleration structure being queried, so if the current encoder
// contains builds, split it; otherwise the size could be read from a
// still-building acceleration structure.
if self.state.acceleration_structure_builder_has_builds {
    self.split_acceleration_structure_builder();
}

Contributor

Note: This is probably a bug in wgpu-core (probably also #8825). Thanks for noticing this.

Vecvec commented Jun 23, 2026
Contributor

I've run the tests on my phone (my MacBook still doesn't reproduce this), and this does fix the issue. However, there appears to be a race when splitting the encoder. I think it would be ideal if this could also be resolved.

Development

Successfully merging this pull request may close these issues.

Metal BLAS -> TLAS builds need synchronisation