Replace inline PTX by cuda::ptx in cuda::barrier<thread_scope_block> #6250

bernhardmgruber · 2025-10-15T15:12:16Z

For lack of a good existing example for a SASS test, I used this simple one:

```cpp
#include <cuda/barrier>
#include <cuda/ptx>

// selects a single leader thread from the block
__device__ bool elect_one() {
  // elect_sync is important to help the optimizer generate a uniform datapath
  return cuda::ptx::elect_sync(~0) && threadIdx.x < 32;
}

__global__ void example_kernel(int* gmem1, double* gmem2) {
  constexpr int tile_size = 1024;
  __shared__ alignas(16)    int smem1[tile_size];
  __shared__ alignas(16) double smem2[tile_size];
  #pragma nv_diag_suppress static_var_with_dynamic_init
  using barrier_t = cuda::barrier<cuda::thread_scope_block>;
  __shared__  barrier_t bar;
  // setup the barrier where only the leader thread arrives
  if (elect_one()) {
    init(&bar, 1);
    // issue two TMA bulk copy operations
    cuda::device::memcpy_async_tx(smem1, gmem1, cuda::aligned_size_t<16>(tile_size * sizeof(int)   ), bar);
    cuda::device::memcpy_async_tx(smem2, gmem2, cuda::aligned_size_t<16>(tile_size * sizeof(double)), bar);
    // arrive and update the barrier's expect_tx with the **total** number of loaded bytes
    (void)cuda::device::barrier_arrive_tx(bar, 1, tile_size * (sizeof(int) + sizeof(double)));
  }
  __syncthreads(); // need to sync so the barrier is set up when the other threads arrive and wait
  // wait for the current barrier phase to complete
  bar.wait_parity(0);
  // process data in smem ...
}
```
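The comment on `barrier_arrive_tx` stresses that the leader must report the total number of bytes across both bulk copies. As a host-side sanity check of that arithmetic (a sketch only, compilable without CUDA; the names `bytes_smem1`, `bytes_smem2` and `expected_tx` are made up for illustration):

```cpp
#include <cstddef>

// Same tile size as in the example kernel.
constexpr std::size_t tile_size = 1024;

// Bytes moved by each of the two bulk copies.
constexpr std::size_t bytes_smem1 = tile_size * sizeof(int);
constexpr std::size_t bytes_smem2 = tile_size * sizeof(double);

// Transaction count passed to barrier_arrive_tx in the example.
constexpr std::size_t expected_tx = tile_size * (sizeof(int) + sizeof(double));

// The single expect_tx update has to cover both copies.
static_assert(bytes_smem1 + bytes_smem2 == expected_tx,
              "expect_tx must equal the sum of both copy sizes");

// The example wraps each size in aligned_size_t<16>, so each must be a multiple of 16.
static_assert(bytes_smem1 % 16 == 0 && bytes_smem2 % 16 == 0,
              "copy sizes must be multiples of 16 bytes");
```

Assuming a 4-byte `int` and an 8-byte `double`, this is 4096 + 8192 = 12288 bytes.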

Compiled for sm_100, the SASS differs slightly, but only because the compiler flipped a branch.

miscco

Looks technically correct, but the formatting is atrocious. Could we add an else to the conditions, since all early branches return?

miscco · 2025-10-15T17:49:31Z

libcudacxx/include/cuda/__barrier/barrier_block_scope.h

```diff
+    if (!::cuda::device::is_object_from(__barrier, ::cuda::device::address_space::cluster_shared))
+    {
+      return __barrier.arrive(__update);
+    }
+    if (!::cuda::device::is_object_from(__barrier, ::cuda::device::address_space::shared))
+    {
+      ::__trap();
+    }
```


This is strange: the first condition accepts anything but cluster_shared, so the second one seems wrong.

Anything that is shared is also cluster_shared, because the shared memory address space is part of the cluster shared memory space.

Here, we trap for any barrier that is in cluster shared memory, but not in the shared memory of the current CTA.

Could we turn that into an else if?

Could this not be a precondition instead?
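To make the three cases in this thread concrete, here is a host-side model of the two checks written as an `if` / `else if` chain. It is only a sketch: the enum, the function and the boolean parameters are hypothetical stand-ins for the two `is_object_from` queries, not code from the PR.

```cpp
// Host-side model of the address-space checks; all names here are hypothetical.
enum class arrive_path
{
  generic_arrive, // barrier is not in any shared memory: fall back to __barrier.arrive()
  trap,           // barrier is in another CTA's shared memory: rejected
  shared_arrive   // barrier is in this CTA's shared memory: continue past both checks
};

// in_cluster_shared models is_object_from(__barrier, address_space::cluster_shared);
// in_shared models is_object_from(__barrier, address_space::shared).
constexpr arrive_path select_arrive_path(bool in_cluster_shared, bool in_shared)
{
  if (!in_cluster_shared)
  {
    return arrive_path::generic_arrive;
  }
  else if (!in_shared)
  {
    return arrive_path::trap;
  }
  else
  {
    return arrive_path::shared_arrive;
  }
}

static_assert(select_arrive_path(false, false) == arrive_path::generic_arrive, "");
static_assert(select_arrive_path(true, false) == arrive_path::trap, "");
static_assert(select_arrive_path(true, true) == arrive_path::shared_arrive, "");
```

The combination `in_cluster_shared == false` with `in_shared == true` cannot occur, since shared memory is a subset of cluster shared memory. That is why the second check only ever fires for a barrier in a remote CTA's shared memory.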

github-actions · 2025-10-15T21:49:05Z

😬 CI Workflow Results

🟥 Finished in 4h 41m: Pass: 42%/84 | Total: 7h 57m | Max: 39m 16s | Hits: 99%/25847

fbusato · 2025-10-15T22:45:47Z

libcudacxx/include/cuda/__barrier/barrier_block_scope.h

```diff
-          ::__cvta_generic_to_shared(&__barrier)))
-                     : "memory");
-      }))
+      (if (::cuda::device::is_object_from(__barrier, ::cuda::device::address_space::cluster_shared)) { ::__trap(); }))
```


cuda::std::terminate()?

fbusato · 2025-10-15T22:58:27Z

libcudacxx/include/cuda/__barrier/barrier_block_scope.h

```diff
+    }
+    if (!::cuda::device::is_object_from(__barrier, ::cuda::device::address_space::shared))
+    {
+      ::__trap();
```


Could this be a _CCCL_ASSERT instead?

I agree, we should probably assert here, and check whether the documentation makes it clear that barriers ought not to live in cluster shared memory.

fbusato · 2025-10-15T23:00:41Z

libcudacxx/include/cuda/__barrier/barrier_block_scope.h

```diff
+    unsigned int __activeA = ::__match_any_sync(__mask, __update);
+    unsigned int __activeB = ::__match_any_sync(__mask, reinterpret_cast<::cuda::std::uintptr_t>(&__barrier));
+    unsigned int __active  = __activeA & __activeB;
+    int __inc              = ::__popc(__active) * __update;
```


Probably not worth moving these to their C++ versions.
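For readers unfamiliar with the aggregation in this hunk, the following host-side model mimics what the four lines compute for one lane: the lanes that agree on both the update value and the barrier address are grouped, and the group's combined increment is the group size times the update. This is an illustrative sketch only. `match_any`, `popcount` and `aggregated_increment` are hypothetical helpers that imitate `__match_any_sync` and `__popc` on the CPU, and the per-lane values are passed in as arrays of 32 entries.

```cpp
#include <cassert>
#include <cstdint>

constexpr int warp_size = 32;

// Model of __match_any_sync: bit j is set iff lane j participates in `mask`
// and holds the same value as `lane`.
inline std::uint32_t match_any(std::uint32_t mask, const std::uint64_t* values, int lane)
{
  assert(lane >= 0 && lane < warp_size);
  std::uint32_t result = 0;
  for (int j = 0; j < warp_size; ++j)
  {
    if (((mask >> j) & 1u) != 0 && values[j] == values[lane])
    {
      result |= (1u << j);
    }
  }
  return result;
}

// Model of __popc: number of set bits.
inline int popcount(std::uint32_t x)
{
  int n = 0;
  while (x != 0)
  {
    x &= x - 1; // clear the lowest set bit
    ++n;
  }
  return n;
}

// Mirrors the four lines of the hunk for a single lane.
inline int aggregated_increment(
  std::uint32_t mask, const std::uint64_t* updates, const std::uint64_t* barrier_addrs, int lane)
{
  const std::uint32_t active_a = match_any(mask, updates, lane);       // same update value
  const std::uint32_t active_b = match_any(mask, barrier_addrs, lane); // same barrier
  const std::uint32_t active   = active_a & active_b;                  // same on both
  return popcount(active) * static_cast<int>(updates[lane]);
}
```

With all 32 lanes active, an update of 1 everywhere, lanes 0 to 7 pointing at one barrier and lanes 8 to 31 at another, lane 0 computes an increment of 8 and lane 31 computes 24.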

fbusato · 2025-10-15T23:04:06Z

libcudacxx/include/cuda/__barrier/barrier_block_scope.h

```diff
+    if (!::cuda::device::is_object_from(__barrier, ::cuda::device::address_space::cluster_shared))
+    {
+      return __barrier.arrive(__update);
+    }
+    if (!::cuda::device::is_object_from(__barrier, ::cuda::device::address_space::shared))
+    {
+      ::__trap();
+    }
```


Could this not be a precondition instead?

Replace inline PTX by cuda::ptx in cuda::barrier<thread_scope_block>

b3edf5c

bernhardmgruber requested a review from a team as a code owner October 15, 2025 15:12

bernhardmgruber requested a review from pciolkosz October 15, 2025 15:12

github-project-automation bot added this to CCCL Oct 15, 2025

github-project-automation bot moved this to Todo in CCCL Oct 15, 2025

cccl-authenticator-app bot moved this from Todo to In Review in CCCL Oct 15, 2025

Fix

d15984f

miscco reviewed Oct 15, 2025

bernhardmgruber added 3 commits October 15, 2025 17:37

split arrive into functions per SM

6209604

split more branches into functions

36d0b64

refactor

81939ba

bernhardmgruber force-pushed the barrier_ptx branch from 6a63546 to 81939ba Compare October 15, 2025 16:56

miscco reviewed Oct 15, 2025

fbusato reviewed Oct 15, 2025
