Drop experimental TMA exposure in cuda::barrier #6225
Conversation
This is all exposed by cuda::ptx. It was originally introduced in #379.
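For callers migrating off the removed barrier helper, here is a minimal sketch of the equivalent direct call (assumptions: the cuda::ptx::cp_async_bulk overload and cuda::device::barrier_native_handle used in the diff below; the function and variable names are illustrative, and the instruction requires SM_90):

#include <cuda/barrier>
#include <cuda/ptx>
#include <cuda/std/cstdint>

// Illustrative caller-side replacement: issue the bulk copy through cuda::ptx
// directly and let it complete on a block-scoped cuda::barrier.
__device__ void copy_chunk(void* smem_dest, const void* gmem_src, cuda::std::uint32_t size,
                           cuda::barrier<cuda::thread_scope_block>& bar)
{
  cuda::ptx::cp_async_bulk(
    cuda::ptx::space_cluster, // destination state space: shared::cluster
    cuda::ptx::space_global,  // source state space: global
    smem_dest,
    gmem_src,
    size,
    cuda::device::barrier_native_handle(bar)); // uint64_t* to the mbarrier
}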
// Forward-declare CUtensorMap for use in cp_async_bulk_tensor_* PTX wrapping
// functions. These functions take a pointer to CUtensorMap, so do not need to
// know its size. This type is defined in cuda.h (driver API) as:
//
//     typedef struct CUtensorMap_st { [ .. snip .. ] } CUtensorMap;
//
// We need to forward-declare both CUtensorMap_st (the struct) and CUtensorMap
// (the typedef):
struct CUtensorMap_st;
typedef struct CUtensorMap_st CUtensorMap;
This is a borderline breaking change. In principle, we require users to include the headers for what they use, so I think they need to ensure they include the right header to use CUtensorMap. But I am also fine with leaving the forward declaration in.
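A hypothetical caller-side sketch of that requirement (the names fetch_tile, smem_dest, and coords are made up; only CUtensorMap from cuda.h and the cuda::ptx / cuda::barrier APIs are library names):

#include <cuda.h>           // driver API: defines CUtensorMap_st / CUtensorMap
#include <cuda/barrier>     // cuda::barrier, cuda::device::barrier_native_handle
#include <cuda/ptx>         // cuda::ptx::cp_async_bulk_tensor
#include <cuda/std/cstdint>

// Illustrative 1-D tile fetch (SM_90): with the forward declaration removed,
// naming CUtensorMap here only compiles because cuda.h is included above.
__device__ void fetch_tile(void* smem_dest, const CUtensorMap* tensor_map, int x0,
                           cuda::barrier<cuda::thread_scope_block>& bar)
{
  const cuda::std::int32_t coords[1] = {x0};
  cuda::ptx::cp_async_bulk_tensor(
    cuda::ptx::space_cluster, cuda::ptx::space_global,
    smem_dest, tensor_map, coords,
    cuda::device::barrier_native_handle(bar));
}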
inline _CCCL_DEVICE void cp_async_bulk_global_to_shared(
  void* __dest, const void* __src, ::cuda::std::uint32_t __size, ::cuda::barrier<::cuda::thread_scope_block>& __bar)
{
  _CCCL_ASSERT(__size % 16 == 0, "Size must be multiple of 16.");
  _CCCL_ASSERT(::cuda::device::is_address_from(__dest, ::cuda::device::address_space::shared),
               "Destination must be shared memory address.");
  _CCCL_ASSERT(::cuda::device::is_address_from(__src, ::cuda::device::address_space::global),
               "Source must be global memory address.");

  ::cuda::ptx::cp_async_bulk(
    ::cuda::ptx::space_cluster,
    ::cuda::ptx::space_global,
    __dest,
    __src,
    __size,
    ::cuda::device::barrier_native_handle(__bar));
}
I find it a bit sad to leave some of the assertions behind here. But those functions were not used in our barrier and memcpy_async implementations anyway. I think we should consider adding some assertions to the PTX exposure in cuda::ptx. @ahendriksen, do you think we can add such assertions there?
Adding assertions in the code generator could be very hard because they are specific to each instruction. For example, I wrote the code for warp_shuffle by hand for this reason.
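As a rough sketch of what a hand-written, assert-carrying layer over the generated wrapper could look like (purely illustrative: the name cp_async_bulk_checked is made up, and it assumes _CCCL_ASSERT and cuda::device::is_address_from are reachable from wherever such a wrapper would live, as they are in the code removed above):

// Hypothetical hand-written overlay that restores the checks from the removed
// cuda::barrier helper before forwarding to the generated PTX wrapper.
inline _CCCL_DEVICE void cp_async_bulk_checked(
  void* __dest, const void* __src, ::cuda::std::uint32_t __size, ::cuda::std::uint64_t* __smem_bar)
{
  _CCCL_ASSERT(__size % 16 == 0, "Size must be multiple of 16.");
  _CCCL_ASSERT(::cuda::device::is_address_from(__dest, ::cuda::device::address_space::shared),
               "Destination must be shared memory address.");
  _CCCL_ASSERT(::cuda::device::is_address_from(__src, ::cuda::device::address_space::global),
               "Source must be global memory address.");
  ::cuda::ptx::cp_async_bulk(
    ::cuda::ptx::space_cluster, ::cuda::ptx::space_global, __dest, __src, __size, __smem_bar);
}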
@ahendriksen I would like your review as well, since I think you added this functionality.
🥳 CI Workflow Results 🟩 Finished in 2h 48m: Pass: 100%/84 | Total: 2d 00h | Max: 2h 47m | Hits: 81%/208367. See results here.
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
Switching to draft to prevent accidental merging. Waiting for approval from @ahendriksen.