[SYCL][DOC] Propose sycl_ext_oneapi_queue_record_event extension #18158


Status: Draft · wants to merge 1 commit into base: sycl
Conversation

@JackAKirk (Contributor)

This extension aims to provide a simpler and more efficient mechanism for programmers to write utilities like https://github.com/codeplaysoftware/cutlass-sycl/blob/sycl-develop/tools/util/include/cutlass/util/GPU_Clock.hpp, as well as a more efficient way of recording a sycl::event that bypasses the scheduler completely.

@JackAKirk (Contributor, Author)

JackAKirk commented Apr 23, 2025

@aacostadiaz What do you think of this?

It would be possible to add overloads that take a property list for the returned event, but for now I've just proposed that the returned event inherits the properties (e.g. whether the event supports timings) from the queue.

@aacostadiaz (Contributor)

Hi @JackAKirk, thanks for pinging me.

I'm not sure I understand how the extension is supposed to work. I can see that it returns a single event, but what we have in GPU_Clock is designed to capture all the events between start and stop, extract the execution time for each event, and return their sum as the total execution time.

Is it possible to do the same with the extension, or will it require some extra code to capture the events for each kernel launch?

@JackAKirk (Contributor, Author)

> Hi @JackAKirk, thanks for pinging me.
>
> I'm not sure I understand how the extension is supposed to work. I can see that it returns a single event, but what we have in GPU_Clock is designed to capture all the events between start and stop, extract the execution time for each event, and return their sum as the total execution time.
>
> Is it possible to do the same with the extension, or will it require some extra code to capture the events for each kernel launch?

It would be possible if you wanted, but a big motivation for this extension is that you no longer have to capture every event. You can instead switch to the enqueue_functions extension (https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_enqueue_functions.asciidoc), whose nd_launch and submit_without_event entry points mean that, like the CUDA backend of cutlass, APIs don't return events.
The user can then call queue.ext_oneapi_record_event() to get an event whenever they want; or, if they don't need one (e.g. because they aren't profiling, or are using a single stream), they need not obtain an event at all.
This is exactly analogous to the CUDA usage:

```cpp
cudaEventCreate(&cuEvent);
cudaEventRecord(cuEvent, stream);
```

except that the above two calls are lumped together, with the queue properties in SYCL used to create the event properties.

This should also give cutlass a performance boost by not creating unnecessary events; event creation has been shown to add significant overhead for deep-learning workloads containing many short kernels, where kernel execution time can be comparable to event creation time.
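To illustrate the intended usage, here is a hypothetical sketch. The signature and name of `queue.ext_oneapi_record_event()` are only what this PR proposes and may change; `submit` and `nd_launch` come from the enqueue_functions extension linked above.

```cpp
#include <sycl/sycl.hpp>
namespace syclex = sycl::ext::oneapi::experimental;

void profiled_region(sycl::queue &q) {
  // Launch kernels without creating a per-launch event
  // (enqueue_functions extension: no event is returned).
  syclex::submit(q, [&](sycl::handler &cgh) {
    syclex::nd_launch(cgh, sycl::nd_range<1>{1024, 64},
                      [=](sycl::nd_item<1>) { /* kernel A */ });
  });
  syclex::submit(q, [&](sycl::handler &cgh) {
    syclex::nd_launch(cgh, sycl::nd_range<1>{1024, 64},
                      [=](sycl::nd_item<1>) { /* kernel B */ });
  });

  // Record a single event only at the point of interest; on an
  // in-order queue it completes once all prior work has finished.
  sycl::event ev = q.ext_oneapi_record_event(); // proposed API
  ev.wait();
}
```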

@aacostadiaz (Contributor)

Nice, thanks for the clarification. Looks like it will be quite useful for us.

@JackAKirk (Contributor, Author)

> Nice, thanks for the clarification. Looks like it will be quite useful for us.

OK, I'll add quick implementations on Level Zero and CUDA to demonstrate it working. The form of the API can change; I just made the simplest design that seemed to fit most naturally into the existing sycl::queue and sycl::event, but I'm open to suggestions based on what would be most useful and easiest to work with.

@jbrodman (Contributor)

>> Hi @JackAKirk, thanks for pinging me.
>> I'm not sure I understand how the extension is supposed to work. I can see that it returns a single event, but what we have in GPU_Clock is designed to capture all the events between start and stop, extract the execution time for each event, and return their sum as the total execution time.
>> Is it possible to do the same with the extension, or will it require some extra code to capture the events for each kernel launch?
>
> It would be possible if you wanted, but a big motivation for this extension is that you no longer have to capture every event. You can instead switch to the enqueue_functions extension (https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_enqueue_functions.asciidoc), whose nd_launch and submit_without_event entry points mean that, like the CUDA backend of cutlass, APIs don't return events. The user can then call queue.ext_oneapi_record_event() to get an event whenever they want; or, if they don't need one (e.g. because they aren't profiling, or are using a single stream), they need not obtain an event at all. This is exactly analogous to the CUDA usage:
>
> ```cpp
> cudaEventCreate(&cuEvent);
> cudaEventRecord(cuEvent, stream);
> ```
>
> except that the above two calls are lumped together, with the queue properties in SYCL used to create the event properties.
>
> This should also give cutlass a performance boost by not creating unnecessary events; event creation has been shown to add significant overhead for deep-learning workloads containing many short kernels, where kernel execution time can be comparable to event creation time.

cudaEventRecord supports repeated invocations on the same event, overwriting the state. By combining creation and recording, are we disallowing this? Would a SYCL version have more overhead than a CUDA version due to this difference?
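For reference, the CUDA reuse pattern being described looks like this. This is only a sketch: `stream` and the commented-out kernel launch are placeholders, and error checking is omitted.

```cpp
#include <cuda_runtime.h>

void timing_loop(cudaStream_t stream) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start); // created once...
  cudaEventCreate(&stop);

  for (int i = 0; i < 100; ++i) {
    cudaEventRecord(start, stream); // ...recorded repeatedly,
    // kernel<<<grid, block, 0, stream>>>(/* args */); // placeholder
    cudaEventRecord(stop, stream);  // overwriting prior state
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
  }

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
}
```

A design that fuses creation and recording into one call cannot express this create-once/record-many pattern.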

@bashbaug (Contributor)

bashbaug commented Apr 23, 2025

Could you explain how this is different than sycl_ext_oneapi_enqueue_barrier? It seems like this extension is trying to solve a similar problem, just for a very specific case (in-order queues). Thanks!

@JackAKirk (Contributor, Author)

JackAKirk commented Apr 24, 2025

> Could you explain how this is different than sycl_ext_oneapi_enqueue_barrier? It seems like this extension is trying to solve a similar problem, just for a very specific case (in-order queues). Thanks!

Yeah, from the user's perspective it is exactly the same as `event ext_oneapi_submit_barrier();`, except that it doesn't also add a corresponding method to the handler, so its implementation completely avoids the handler/scheduler and makes a direct call to UR.
Perhaps the bigger advantage is what @jbrodman points out above: if we split the calls into

```cpp
event ev = queue.ext_oneapi_create_event();
queue.ext_oneapi_record_event(ev);
```

the same event can be reused.

Essentially this is about finding ways to make the functionality of ext_oneapi_submit_barrier() a bit faster.

You could even just take @jbrodman's suggestion and reuse ext_oneapi_submit_barrier(), introducing only:

```cpp
void queue::ext_oneapi_record_event(event ev);
```

e.g.

```cpp
event ev = queue.ext_oneapi_submit_barrier();
queue.ext_oneapi_record_event(ev);
```
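Under the split design, event reuse in a timing loop could then look like the sketch below. Everything here is hypothetical: `ext_oneapi_create_event` and `ext_oneapi_record_event` are only proposed names from this discussion, and the `get_profiling_info` calls assume the queue was created with `sycl::property::queue::enable_profiling` so the recorded events inherit profiling support.

```cpp
#include <sycl/sycl.hpp>

double time_region_ms(sycl::queue &q, int iters) {
  // Create once, record repeatedly -- mirroring the
  // cudaEventCreate / cudaEventRecord split.
  sycl::event start = q.ext_oneapi_create_event(); // proposed API
  sycl::event stop  = q.ext_oneapi_create_event(); // proposed API

  double total_ms = 0.0;
  for (int i = 0; i < iters; ++i) {
    q.ext_oneapi_record_event(start); // proposed API
    q.single_task([] { /* work */ });
    q.ext_oneapi_record_event(stop);  // proposed API
    stop.wait();

    auto t0 = start.get_profiling_info<
        sycl::info::event_profiling::command_end>();
    auto t1 = stop.get_profiling_info<
        sycl::info::event_profiling::command_start>();
    total_ms += static_cast<double>(t1 - t0) * 1e-6; // ns -> ms
  }
  return total_ms;
}
```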

@JackAKirk (Contributor, Author)

> cudaEventRecord supports repeated invocations on the same event, overwriting the state. By combining creation and recording, are we disallowing this? Would a SYCL version have more overhead than a CUDA version due to this difference?

Yeah, I forgot about this. It would definitely make sense to split it up into creating/recording events, something like the suggestions in #18158 (comment).

@EwanC (Contributor)

EwanC commented Apr 25, 2025

On the point of separating event creation/recording: it's worth noting, from the SYCL-Graph perspective, that a reusable SYCL event could influence our dynamic_event API design (analogous to cudaGraphAddEventRecordNode/cudaGraphAddEventWaitNode), such that we would build on this extension instead, which I think would probably be a better approach. Reusable events were a concept we liked but didn't want to introduce just for that feature.

@gmlueck (Contributor)

gmlueck commented Apr 25, 2025

> Yeah, from the user's perspective it is exactly the same as `event ext_oneapi_submit_barrier();`, except that it doesn't also add a corresponding method to the handler, so its implementation completely avoids the handler/scheduler and makes a direct call to UR.

I think I have the same question as @bashbaug. If the semantics of this proposed extension are the same as ext_oneapi_submit_barrier, why can't we just add a short circuit in the implementation of ext_oneapi_submit_barrier? Something like:

```
if (queue is in-order)
  // do the fast thing you propose in this PR
else
  // do what we do now
```
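Concretely, such a short circuit might look something like the following inside the queue implementation. This is purely a sketch: `isInOrder()`, `recordEventDirect()`, and `submitBarrierViaScheduler()` are hypothetical names standing in for intel/llvm internals, not real functions in the codebase.

```cpp
// Hypothetical sketch of a fast path in ext_oneapi_submit_barrier().
event queue::ext_oneapi_submit_barrier() {
  if (isInOrder()) {
    // In-order queue: no dependency tracking is needed, so bypass the
    // handler/scheduler and call straight into UR to record an event.
    return recordEventDirect(); // hypothetical direct UR call
  }
  // Out-of-order queue: fall back to the existing scheduler path.
  return submitBarrierViaScheduler(); // hypothetical existing path
}
```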

6 participants