Skip to content

[hal metal] ray tracing acceleration structures #7660

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 26 commits into
base: trunk
Choose a base branch
from

Conversation

Lichtso
Copy link
Contributor

@Lichtso Lichtso commented May 2, 2025

Connections
Fixes: #7402

Description
Implements the missing ray tracing acceleration structures in the HAL metal backend.

Testing
The examples ray_scene, ray_shadows, ray_cube_compute, ray_cube_fragment and ray_traced_triangle all work.
That is if invoked via cargo run --bin wgpu-examples ray_traced_triangle, but not via cargo xtask test ray_traced_triangle, still current CI runner is too old to catch that as it does not support hardware ray tracing.

Squash or Rebase?
Squash

Checklist

  • Run cargo fmt.
  • Run taplo format.
  • Run cargo clippy --tests
  • Run cargo xtask test to run tests.
  • If this contains user-facing changes, add a CHANGELOG.md entry.

@Lichtso Lichtso requested a review from a team as a code owner May 2, 2025 22:41
@Lichtso Lichtso force-pushed the metal/ray_tracing_acceleration_structures branch 2 times, most recently from e30b663 to f3830cb Compare May 2, 2025 22:52
Copy link
Collaborator

@Vecvec Vecvec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job! Glad there didn't need to be any wgpu-core changes. Largely looks good, but I'm not extremely knowledgeable about metal. One question / comment, but haven't yet checked everything with spec.

Copy link
Collaborator

@Vecvec Vecvec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done checks against the Metal spec. It seems this requires MacOS 13.0+ not 11.0+ due to some more recent functions being used. Confusingly, the vertex buffer field suggests that the only format supported is f32x3 so I'm not sure what descriptor.set_vertex_format does.

@@ -890,6 +890,11 @@ impl super::PrivateCapabilities {
&& (device.supports_family(MTLGPUFamily::Apple7)
|| device.supports_family(MTLGPUFamily::Mac2)),
supports_shared_event: version.at_least((10, 14), (12, 0), os_is_mac),
supports_raytracing: if version.at_least((11, 0), (14, 0), os_is_mac) {
device.supports_raytracing()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think raytracing support needs supportsRaytracingFromRender due to support of ray queries in fragment shaders (Requires MacOS 12.0+).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That function is not exposed in the Rust metal crate. But I did bump the min required versions to macOS 13 and iOS 16.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the metal crate is still taking PRs (Idk what state of deprecated they are in) it would probably be a good idea to add this (and the other later ones).

@Lichtso Lichtso force-pushed the metal/ray_tracing_acceleration_structures branch from f3830cb to 234e75b Compare May 3, 2025 08:33
}

unsafe fn destroy_acceleration_structure(
&self,
_acceleration_structure: super::AccelerationStructure,
) {
unimplemented!()
// self.counters.acceleration_structures.sub(1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I there a reason not to have HalCounters::acceleration_structures?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking back at the history I couldn't find a reason, but it's possible it's buried somewhere.

for descriptor in descriptors {
let acceleration_structure_descriptor =
conv::map_acceleration_structure_descriptor(descriptor.entries);
/* The Rust metal crate does not expose metal::MTLAccelerationStructureUsage yet
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, not exposed in the Rust metal crate.

@@ -35,6 +35,7 @@ var acc_struct: acceleration_structure;

struct PushConstants {
light: vec3<f32>,
padding: f32,
Copy link
Contributor Author

@Lichtso Lichtso May 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that metal always sends at least 16 bytes for push constants, even if we only pass in 12 bytes. And then the shader validation complains that the receiver here only expects 12 bytes.

@Lichtso Lichtso force-pushed the metal/ray_tracing_acceleration_structures branch 2 times, most recently from 5f1c464 to 2a6d9b6 Compare May 3, 2025 11:13
@Lichtso
Copy link
Contributor Author

Lichtso commented May 3, 2025

Glad there didn't need to be any wgpu-core changes

Almost, had to remove the Option<> around the buffers and always pass the dummy zero buffer when computing the size of the acceleration structures and their scratch buffers because Metal does not like nil.

I can split those first four commits into a separate PR if that helps with the review.

@Vecvec
Copy link
Collaborator

Vecvec commented May 4, 2025

I just remembered that structures have minimum versions, and it seems MTLIndirectAccelerationStructureInstanceDescriptor required MacOS 14.0+ (probably should have checked that earlier...).

@Lichtso Lichtso force-pushed the metal/ray_tracing_acceleration_structures branch 2 times, most recently from 38b3de7 to 9442511 Compare May 4, 2025 11:49
@Lichtso
Copy link
Contributor Author

Lichtso commented May 4, 2025

I just remembered that structures have minimum versions, and it seems MTLIndirectAccelerationStructureInstanceDescriptor required MacOS 14.0+ (probably should have checked that earlier...).

Bumped the min required version even further up.

@Lichtso
Copy link
Contributor Author

Lichtso commented May 4, 2025

I also managed to reduce the issue with the acelleration structure not intersecting any rays to a perfect reproducer and it is wild:

See the last commit "Bug reproducer", which modifies the ray_cube_fragment example to generate two BLASes: One with 152 triangles and one with 153 triangles.

With Metal on macOS the instances of the BLAS with 152 triangles (16344 bytes acceleration_structure_size) work as expected, but the ones with 153 triangles (16472 bytes acceleration_structure_size) suddenly stop intersecting rays after roughly 1.5 seconds no matter how many frames were rendered until then. 0x4000 = 2^14 = 16384 might be some special boundary being crossed. It also keeps happening even if I stop calling build_acceleration_structure() after the inital setup. Using MTLAccelerationStructureInstanceDescriptor or MTLIndirectAccelerationStructureInstanceDescriptor is also irrelevant. Same goes for calling encoder.use_resource_at(blas.as_native(), use_info.uses, use_info.stages) or not.

This also breaks Vulkan on Linux with a SIGSEGV upon Queue::submit: https://github.com/gfx-rs/wgpu/actions/runs/14820911901/job/41607697292?pr=7660

Using an example from metal-rs without wgpu does not reproduce this bug. It seems we are either lacking some validation step or are doing something wrong with our handling of acceleration structures in general.

@Vecvec: What testing hardware do you have available? Can you maybe see why Vulkan is failing this too?

@Lichtso Lichtso force-pushed the metal/ray_tracing_acceleration_structures branch from 9442511 to 90082ad Compare May 4, 2025 12:40
@Vecvec
Copy link
Collaborator

Vecvec commented May 4, 2025

@Vecvec: What testing hardware do you have available? Can you maybe see why Vulkan is failing this too?

I've got a couple of raytracing supported machines (plus llvmpipe which I will also be testing on). I'll have a look and see if I can get any ideas of what the issue might be.

@Vecvec
Copy link
Collaborator

Vecvec commented May 4, 2025

Hits a divide by zero on Microsoft Basic Render Driver (though it doesn't seem to be related to the memory used, and only on one of my comuters). Can't get it to fail on the real gpus yet. Was able to reproduce the llvmpipe seg fault (edit: Don't think it's the same problem as the one here), will continue testing.

@Lichtso
Copy link
Contributor Author

Lichtso commented May 4, 2025

divide by zero

Might be that it tries to normalize a zero-length vector. The modified example does simply duplicate triangles so that could cause some vectors to become zero.

I narrowed the Metal issue down further and it is indeed caused by AccelerationStructureBuildSizes::acceleration_structure_size being greater or equal to 0x4000. For example if I modify device.new_acceleration_structure_with_size(descriptor.size.max(0x4000)) in Device::create_acceleration_structure() only (which is the latest point and makes sure that it is only related to the Metal backend) then all BLAS instances first work fine but disappear after 1.5 seconds. Reading the Metal docs it appears that 16384 (0x4000) is indeed used as API limit for other things like the mesh shader output buffer. So maybe there is a bug in the Metal driver, because I can not immagine that the limit for acceleration structure sizes is supposed to be so low.

Edit: Officially the limits are way higher, see https://developer.apple.com/documentation/metal/mtlaccelerationstructureusage/extendedlimits.

@Vecvec
Copy link
Collaborator

Vecvec commented May 4, 2025

Most other resources are created with an auto release pool around them, is it possible that that is fixing this issue somehow?

@Lichtso
Copy link
Contributor Author

Lichtso commented May 5, 2025

Most other resources are created with an auto release pool around them, is it possible that that is fixing this issue somehow?

Added one in Device::create_acceleration_structure() but unfortunately that was not it either. There must be some other conditions to trigger it because the metal-rs examples don't and the wgpu examples only do when called via cargo xtask test.

I would say we try to land this PR and then open an issue for it to solve that separately.

BTW, I noticed the CI runner "Test Mac aarch64" job is not failing. Probably the test runner is too old to support hardware raytracing and skips the relevant tests.

@Lichtso Lichtso requested a review from Vecvec May 5, 2025 07:50
@Vecvec
Copy link
Collaborator

Vecvec commented May 5, 2025

Added one in Device::create_acceleration_structure() but unfortunately that was not it either

That's annoying, I wonder what it could be

I would say we try to land this PR and then open an issue for it to solve that separately.

Yes, though it could be some time before it lands.

I noticed the CI runner "Test Mac aarch64" job is not failing. Probably the test runner is too old to support hardware raytracing and skips the relevant tests

I checked and it does skip.

Copy link
Collaborator

@Vecvec Vecvec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excluding the things that aren't exposed by the metal crate this looks good to me.

.flags
.contains(wgt::AccelerationStructureGeometryFlags::OPAQUE),
);
// wgt::AccelerationStructureGeometryFlags::NO_DUPLICATE_ANY_HIT_INVOCATION
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like this should set allowDuplicateIntersectionFunctionInvocation if NO_DUPLICATE_ANY_HIT_INVOCATION is not set but metal-rs doesn't support this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added it to: gfx-rs/metal-rs#361

@Lichtso Lichtso force-pushed the metal/ray_tracing_acceleration_structures branch 4 times, most recently from 28e71af to 5d1c263 Compare May 12, 2025 08:41
@Lichtso Lichtso force-pushed the metal/ray_tracing_acceleration_structures branch from 5d1c263 to d8ad785 Compare May 12, 2025 08:50
@Vecvec
Copy link
Collaborator

Vecvec commented May 13, 2025

Although setting TLAS dependencies and only stating that we are using the BLASes contained in them is good for optimization, the current implementation will not work due to it being possible to submit the encoders in a different order to the order they were recorded in. This can mean if we have recorded build 1 on encoder 1 and record build 2 on encoder 2 but encoder 2 was submitted before encoder 1 the TLAS contained in them would still have the BLASes from build 2 as its dependencies but would use the BLASes from build 1.

I think that it might be best if the program just claims to metal that we are using every BLAS that exists. Although there are some other possible options I think that they are too complex for an initial implementation. I can't think of many sensible reasons why people would be using large numbers of BLASes that aren't currently being used anyway.

@Lichtso
Copy link
Contributor Author

Lichtso commented May 13, 2025

the current implementation will not work due to it being possible to submit the encoders in a different order to the order they were recorded in

Instead of calling it inside command_encoder_build_acceleration_structures() we could call DynAccelerationStructure::set_dependencies() in Queue::submit(). There is only one queue per device in wgpu, right? So shouldn't that solve it too? Edit: Seems quite fiddly to wire it all the way through the command encoder, command buffer into the queue.

why people would be using large numbers of BLASes that aren't currently being used

Maybe you have every model in many LOD levels (to avoid high frequency noise in the distance) or do asset streaming? No idea either, hardware ray tracing is still somewhat new and I haven't seen that much code around it yet.

@Vecvec
Copy link
Collaborator

Vecvec commented May 13, 2025

Instead of calling it inside command_encoder_build_acceleration_structures() we could call DynAccelerationStructure::set_dependencies() in Queue::submit().

I think there would still be issues where you encode the build after a use of the TLAS because you can't edit encoders after encoding them.

Maybe you have every model in many LOD levels (to avoid high frequency noise in the distance) or do asset streaming?

Yes, I hadn't thought of that. How expensive is the use_resources call? If it's cheap it might still be worthwhile to still just call it on all BLASes anyway.

@Vecvec
Copy link
Collaborator

Vecvec commented May 14, 2025

Actually I've found something called a MTLResidencySet which seems like it could be used. I need to investigate it further, but it seems like you could keep one per command buffer and add all indirectly used BLASes to it. When submitted it could be committed and when the encoder was reset it would get cleared. Its very new though which is inconvenient.

@Lichtso
Copy link
Contributor Author

Lichtso commented May 14, 2025

I've found something called a MTLResidencySet

Interesting.

one per command buffer and add all indirectly used BLASes to it

That is essentially where we are right now with the dependency tracking. We add all indirectly used BLASes to the command buffer via use_resource().

Metal attaches all of a command queue’s residency sets to a command buffer from that queue when you call the command buffer’s commit() method.

@Lichtso
Copy link
Contributor Author

Lichtso commented May 14, 2025

I think I can simplify your counter example further: Imagine we build the same TLAS in two different command buffers, but we never submit (thus discard) the second. And then use that TLAS later in a render pass. Now, the actions in the second build of that TLAS should have no effect.

This might already be wrong in other aspects unrelated to this PR, like the validation layer and how it sees the dependencies.

@Vecvec
Copy link
Collaborator

Vecvec commented May 14, 2025

This might already be wrong in other aspects unrelated to this PR, like the validation layer and how it sees the dependencies.

Yep, this has been pain for me. I've previously reworked lots of the validation due to this problem. It's possible there is more, but I've been working on fixing this.

That is essentially where we are right now with the dependency tracking. We add all indirectly used BLASes to the command buffer via use_resource().

Except MTLResidencySets can be edited after a command buffer is finished but make the resources resident before the same command buffer is submitted. The documentation of use_resources implies that the resources only are guaranteed to be resident after that command buffer has hit that point.

@Lichtso
Copy link
Contributor Author

Lichtso commented May 15, 2025

About cargo run --bin wgpu-examples ray_traced_triangle working, but cargo xtask test ray_traced_triangle not:

I think I found a bug in the Metal driver. Acceleration structures don't work in headless mode. That is, if I attach a window to the test process (does not even have to have its surface linked to wgpu, nor does the window have to be presented / visible in the compositor), the tests suddenly succeed!

@Vecvec
Copy link
Collaborator

Vecvec commented May 16, 2025

Acceleration structures don't work in headless mode

That's an odd driver bug, I wonder how it's caused...

On another, completely unrelated, note, I've been looking at the possible ways to make the BLASes resident. I think there are 3 possible options based on the metal docs (which feel like they are very out of date):

  1. Associate BLASes with their TLAS in the build command

    • Use the instancedAccelerationStructures field (though still using MTLAccelerationStructureInstanceDescriptorType::indirect)

    Pros

    • Keeps most stuff the same.

    Cons

    • Unsure if this is allowed, can't find anything stating otherwise, but this quote suggests maybe not?

    Each instance in the instance descriptor buffer has an index into this array

  2. Allocate all acceleration structures from a giant heap

    • Terrible idea, should only use if all else fails
  3. Put all indirectly used BLASes into a MTLResidencySet

    • Keep this in the command buffer, add all BLASes used to it just before submit, and then submit the command buffer

    Pros

    • Should work

    Cons

    • Requires latest MacOS version. (though if whatever this bug is cannot be worked around it will probably require latest version anyway)

Fwiw I've never used Metal so I'm guessing based on docs alone and have probably missed some cool trick that all other impl.s use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Ray Tracing on Metal
2 participants