Conversation

bernhardmgruber
Contributor

This is all exposed by cuda::ptx. It was originally introduced in #379
@bernhardmgruber bernhardmgruber requested a review from a team as a code owner October 14, 2025 13:38
@github-project-automation github-project-automation bot moved this to Todo in CCCL Oct 14, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Oct 14, 2025
Comment on lines -41 to -50
// Forward-declare CUtensorMap for use in cp_async_bulk_tensor_* PTX wrapping
// functions. These functions take a pointer to CUtensorMap, so do not need to
// know its size. This type is defined in cuda.h (driver API) as:
//
// typedef struct CUtensorMap_st { [ .. snip .. ] } CUtensorMap;
//
// We need to forward-declare both CUtensorMap_st (the struct) and CUtensorMap
// (the typedef):
struct CUtensorMap_st;
typedef struct CUtensorMap_st CUtensorMap;
Contributor Author

This is a borderline breaking change. In principle, we require users to include the headers for the types they use, so I think they need to include the right header to get CUtensorMap. But I am also fine with leaving the forward declaration in.

Comment on lines -73 to -89
inline _CCCL_DEVICE void cp_async_bulk_global_to_shared(
  void* __dest, const void* __src, ::cuda::std::uint32_t __size, ::cuda::barrier<::cuda::thread_scope_block>& __bar)
{
  _CCCL_ASSERT(__size % 16 == 0, "Size must be multiple of 16.");
  _CCCL_ASSERT(::cuda::device::is_address_from(__dest, ::cuda::device::address_space::shared),
               "Destination must be shared memory address.");
  _CCCL_ASSERT(::cuda::device::is_address_from(__src, ::cuda::device::address_space::global),
               "Source must be global memory address.");

  ::cuda::ptx::cp_async_bulk(
    ::cuda::ptx::space_cluster,
    ::cuda::ptx::space_global,
    __dest,
    __src,
    __size,
    ::cuda::device::barrier_native_handle(__bar));
}
Contributor Author


I find it a bit sad to leave some of the assertions behind here. But those functions were not used in our barrier and memcpy_async implementations etc. anyway. I think we should consider adding some assertions to the PTX exposure in cuda::ptx. @ahendriksen, do you think we can add such assertions there?

Contributor


Adding assertions in the code generator could be very hard because they are specific to each instruction. For example, I wrote the code for warp_shuffle by hand for this reason.

@bernhardmgruber
Contributor Author

@ahendriksen I would like your review as well, since I think you added this functionality.

Contributor

🥳 CI Workflow Results

🟩 Finished in 2h 48m: Pass: 100%/84 | Total: 2d 00h | Max: 2h 47m | Hits: 81%/208367

See results here.

@bernhardmgruber bernhardmgruber marked this pull request as draft October 14, 2025 18:36
Contributor

copy-pr-bot bot commented Oct 14, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@cccl-authenticator-app cccl-authenticator-app bot moved this from In Review to In Progress in CCCL Oct 14, 2025
@bernhardmgruber
Contributor Author

Switching to draft to prevent accidental merging. Waiting for approval from @ahendriksen.
