-
Notifications
You must be signed in to change notification settings - Fork 768
[SYCL][DOC] Propose sycl_ext_oneapi_queue_record_event extension #18158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: sycl
Are you sure you want to change the base?
Conversation
Signed-off-by: JackAKirk <[email protected]>
@aacostadiaz What do you think of this? It would be possible to add overloads that take a property list for the returned event, but for now I've just proposed that the returned event inherits the properties (e.g. whether the event supports timings) from the queue. |
Hi @JackAKirk , thanks for pinging me I'm not sure I understand how the extension is supposed to work. I can see it's returning a single event but what we have in GPU_clock is designed to capture all the events between start and stop, extract the execution time for each event, and return their sum as the total execution time. Is it possible to do the same with the extension, or will it require some extra code to capture the events for each kernel launch? |
So it would be possible if you wanted, but a big motivation for this is that it means you don't have to capture every event, you can instead switch to using the enqueue_functions extension https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_enqueue_functions.asciidoc which via
except the above two calls will be lumped together, making usage of the queue properties in sycl to create the event properties. This should also give cutlass a performance boost by not creating unnecessary events, which has been shown to be significant overhead for deep learning kernels that include lots of short kernels where the kernel execution time can be comparable to event creation time. |
Nice, thanks for the clarification. Looks like it will be quite useful for us. |
OK, I'll add quick implementations on level_zero and cuda to demonstrate it working. The form of the API can change, I just made the simplest design that seemed to me to fit most naturally into existing sycl::queue and sycl::event, but it is open to suggestions based on what would be most useful/easiest to deal with. |
cudaEventRecord supports repeated invocations on the same event, overwriting the state. By combining creation and recording, are we disallowing this? Would a SYCL version have more overhead than a CUDA version due to this difference? |
Could you explain how this is different than sycl_ext_oneapi_enqueue_barrier? It seems like this extension is trying to solve a similar problem, just for a very specific case (in-order queues). Thanks! |
Yeah it is exactly the same as
the same event can be reused. Essentially it is thinking about ways to make the functionality of You could even just take @jbrodman's suggestion and make use of
|
Yeah I forgot about this. It would definitely make sense to split it up into creating/recording events. Something like the suggestions in #18158 (comment) |
On the point of separating event creation/recording, it's worth noting from the SYCL-Graph perspective that a reusable SYCL event could influence our dynamic_event API design (analogous to |
I think I have the same question as @bashbaug. If the semantics of this proposed extension are the same as
|
This extension aims to provide a simpler and more efficient mechanism for programmers to write things like https://github.com/codeplaysoftware/cutlass-sycl/blob/sycl-develop/tools/util/include/cutlass/util/GPU_Clock.hpp
as well as provide a more efficient way of recording a
sycl::event
that bypasses the scheduler completely.