[SYCL][DOC] Propose sycl_ext_oneapi_queue_record_event extension #18158


Status: Draft · wants to merge 1 commit into base: sycl
Conversation

@JackAKirk (Contributor)

This extension aims to provide a simpler and more efficient mechanism for programmers to write utilities like https://github.com/codeplaysoftware/cutlass-sycl/blob/sycl-develop/tools/util/include/cutlass/util/GPU_Clock.hpp, as well as a more efficient way of recording a sycl::event that bypasses the scheduler completely.

@JackAKirk (Contributor, Author)

JackAKirk commented Apr 23, 2025

@aacostadiaz What do you think of this?

It would be possible to add overloads that take a property list for the returned event, but for now I've just proposed that the returned event inherits the properties (e.g. whether the event supports timings) from the queue.

@aacostadiaz (Contributor)

Hi @JackAKirk, thanks for pinging me.

I'm not sure I understand how the extension is supposed to work. I can see that it returns a single event, but what we have in GPU_Clock is designed to capture all the events between start and stop, extract the execution time for each event, and return their sum as the total execution time.

Is it possible to do the same with the extension, or will it require some extra code to capture the events for each kernel launch?

@JackAKirk (Contributor, Author)

> Hi @JackAKirk, thanks for pinging me.
>
> I'm not sure I understand how the extension is supposed to work. I can see that it returns a single event, but what we have in GPU_Clock is designed to capture all the events between start and stop, extract the execution time for each event, and return their sum as the total execution time.
>
> Is it possible to do the same with the extension, or will it require some extra code to capture the events for each kernel launch?

It would be possible if you wanted, but a big motivation for this extension is that you no longer have to capture every event. You can instead switch to the enqueue_functions extension (https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_enqueue_functions.asciidoc), whose nd_launch and submit_without_event entry points mean that, like the CUDA backend of cutlass, APIs don't return events.
The user can then call queue.ext_oneapi_record_event() to get an event whenever they want; or, if they don't need one (e.g. because they aren't profiling, or are using a single stream), they need not obtain an event at all.
This is exactly analogous to the CUDA usage:

```cpp
cudaEventCreate(&cuEvent);
cudaEventRecord(cuEvent, stream);
```

except that the above two calls are lumped together, with the queue properties in SYCL used to create the event properties.

This should also give cutlass a performance boost by not creating unnecessary events; event creation has been shown to add significant overhead for deep-learning workloads containing many short kernels, where kernel execution time can be comparable to event creation time.
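To illustrate the intended usage, here is a hypothetical sketch. The signature and name of `queue.ext_oneapi_record_event()` are only what this PR proposes and may change; `submit` and `nd_launch` come from the enqueue_functions extension linked above.

```cpp
#include <sycl/sycl.hpp>
namespace syclex = sycl::ext::oneapi::experimental;

void profiled_region(sycl::queue &q) {
  // Launch kernels without creating a per-launch event
  // (enqueue_functions extension: no event is returned).
  syclex::submit(q, [&](sycl::handler &cgh) {
    syclex::nd_launch(cgh, sycl::nd_range<1>{1024, 64},
                      [=](sycl::nd_item<1>) { /* kernel A */ });
  });
  syclex::submit(q, [&](sycl::handler &cgh) {
    syclex::nd_launch(cgh, sycl::nd_range<1>{1024, 64},
                      [=](sycl::nd_item<1>) { /* kernel B */ });
  });

  // Record a single event only at the point of interest; on an
  // in-order queue it completes once all prior work has finished.
  sycl::event ev = q.ext_oneapi_record_event(); // proposed API
  ev.wait();
}
```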

@aacostadiaz (Contributor)

Nice, thanks for the clarification. Looks like it will be quite useful for us.

@JackAKirk (Contributor, Author)

> Nice, thanks for the clarification. Looks like it will be quite useful for us.

OK, I'll add quick implementations on Level Zero and CUDA to demonstrate it working. The form of the API can change; I just made the simplest design that seemed to fit most naturally into the existing sycl::queue and sycl::event, but I'm open to suggestions based on what would be most useful and easiest to work with.

@jbrodman (Contributor)

>> Hi @JackAKirk, thanks for pinging me.
>> I'm not sure I understand how the extension is supposed to work. I can see that it returns a single event, but what we have in GPU_Clock is designed to capture all the events between start and stop, extract the execution time for each event, and return their sum as the total execution time.
>> Is it possible to do the same with the extension, or will it require some extra code to capture the events for each kernel launch?
>
> It would be possible if you wanted, but a big motivation for this extension is that you no longer have to capture every event. You can instead switch to the enqueue_functions extension (https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_enqueue_functions.asciidoc), whose nd_launch and submit_without_event entry points mean that, like the CUDA backend of cutlass, APIs don't return events. The user can then call queue.ext_oneapi_record_event() to get an event whenever they want; or, if they don't need one (e.g. because they aren't profiling, or are using a single stream), they need not obtain an event at all. This is exactly analogous to the CUDA usage:
>
> ```cpp
> cudaEventCreate(&cuEvent);
> cudaEventRecord(cuEvent, stream);
> ```
>
> except that the above two calls are lumped together, with the queue properties in SYCL used to create the event properties.
>
> This should also give cutlass a performance boost by not creating unnecessary events; event creation has been shown to add significant overhead for deep-learning workloads containing many short kernels, where kernel execution time can be comparable to event creation time.

cudaEventRecord supports repeated invocations on the same event, overwriting the state. By combining creation and recording, are we disallowing this? Would a SYCL version have more overhead than a CUDA version due to this difference?
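For reference, the CUDA reuse pattern being described looks like this. This is only a sketch: `stream` and the commented-out kernel launch are placeholders, and error checking is omitted.

```cpp
#include <cuda_runtime.h>

void timing_loop(cudaStream_t stream) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start); // created once...
  cudaEventCreate(&stop);

  for (int i = 0; i < 100; ++i) {
    cudaEventRecord(start, stream); // ...recorded repeatedly,
    // kernel<<<grid, block, 0, stream>>>(/* args */); // placeholder
    cudaEventRecord(stop, stream);  // overwriting prior state
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
  }

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
}
```

A design that fuses creation and recording into one call cannot express this create-once/record-many pattern.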

@bashbaug (Contributor)

bashbaug commented Apr 23, 2025

Could you explain how this is different than sycl_ext_oneapi_enqueue_barrier? It seems like this extension is trying to solve a similar problem, just for a very specific case (in-order queues). Thanks!

@JackAKirk (Contributor, Author)

JackAKirk commented Apr 24, 2025

> Could you explain how this is different than sycl_ext_oneapi_enqueue_barrier? It seems like this extension is trying to solve a similar problem, just for a very specific case (in-order queues). Thanks!

Yeah, from the user's perspective it is exactly the same as `event ext_oneapi_submit_barrier();`, except that it doesn't also add a corresponding method to the handler, so its implementation completely avoids the handler/scheduler and makes a direct call to UR.
Perhaps the bigger advantage is what @jbrodman points out above: if we split the calls into

```cpp
event ev = queue.ext_oneapi_create_event();
queue.ext_oneapi_record_event(ev);
```

the same event can be reused.

Essentially this is about finding ways to make the functionality of ext_oneapi_submit_barrier() a bit faster.

You could even just take @jbrodman's suggestion and reuse ext_oneapi_submit_barrier(), introducing only:

```cpp
void queue::ext_oneapi_record_event(event ev);
```

e.g.

```cpp
event ev = queue.ext_oneapi_submit_barrier();
queue.ext_oneapi_record_event(ev);
```
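Under the split design, event reuse in a timing loop could then look like the sketch below. Everything here is hypothetical: `ext_oneapi_create_event` and `ext_oneapi_record_event` are only proposed names from this discussion, and the `get_profiling_info` calls assume the queue was created with `sycl::property::queue::enable_profiling` so the recorded events inherit profiling support.

```cpp
#include <sycl/sycl.hpp>

double time_region_ms(sycl::queue &q, int iters) {
  // Create once, record repeatedly -- mirroring the
  // cudaEventCreate / cudaEventRecord split.
  sycl::event start = q.ext_oneapi_create_event(); // proposed API
  sycl::event stop  = q.ext_oneapi_create_event(); // proposed API

  double total_ms = 0.0;
  for (int i = 0; i < iters; ++i) {
    q.ext_oneapi_record_event(start); // proposed API
    q.single_task([] { /* work */ });
    q.ext_oneapi_record_event(stop);  // proposed API
    stop.wait();

    auto t0 = start.get_profiling_info<
        sycl::info::event_profiling::command_end>();
    auto t1 = stop.get_profiling_info<
        sycl::info::event_profiling::command_start>();
    total_ms += static_cast<double>(t1 - t0) * 1e-6; // ns -> ms
  }
  return total_ms;
}
```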

@JackAKirk (Contributor, Author)

> cudaEventRecord supports repeated invocations on the same event, overwriting the state. By combining creation and recording, are we disallowing this? Would a SYCL version have more overhead than a CUDA version due to this difference?

Yeah, I forgot about this. It would definitely make sense to split it up into creating/recording events, something like the suggestions in #18158 (comment).

@EwanC (Contributor)

EwanC commented Apr 25, 2025

On the point of separating event creation/recording: it's worth noting, from the SYCL-Graph perspective, that a reusable SYCL event could influence our dynamic_event API design (analogous to cudaGraphAddEventRecordNode/cudaGraphAddEventWaitNode), such that we would build on this extension instead, which I think would probably be a better approach. Reusable events were a concept we liked but didn't want to introduce just for that feature.

@gmlueck (Contributor)

gmlueck commented Apr 25, 2025

> Yeah, from the user's perspective it is exactly the same as `event ext_oneapi_submit_barrier();`, except that it doesn't also add a corresponding method to the handler, so its implementation completely avoids the handler/scheduler and makes a direct call to UR.

I think I have the same question as @bashbaug. If the semantics of this proposed extension are the same as ext_oneapi_submit_barrier, why can't we just add a short circuit in the implementation of ext_oneapi_submit_barrier? Something like:

```
if (queue is in-order)
  // do the fast thing you propose in this PR
else
  // do what we do now
```
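Concretely, such a short circuit might look something like the following inside the queue implementation. This is purely a sketch: `isInOrder()`, `recordEventDirect()`, and `submitBarrierViaScheduler()` are hypothetical names standing in for intel/llvm internals, not real functions in the codebase.

```cpp
// Hypothetical sketch of a fast path in ext_oneapi_submit_barrier().
event queue::ext_oneapi_submit_barrier() {
  if (isInOrder()) {
    // In-order queue: no dependency tracking is needed, so bypass the
    // handler/scheduler and call straight into UR to record an event.
    return recordEventDirect(); // hypothetical direct UR call
  }
  // Out-of-order queue: fall back to the existing scheduler path.
  return submitBarrierViaScheduler(); // hypothetical existing path
}
```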

6 participants