Sdma ccl by wuyl1 · Pull Request #235 · ROCm/mori

wuyl1 · 2026-03-27T05:58:16Z

This PR merges intra-node collective communication (CCL) over SDMA into main. It extends the existing shmem / anvil stack with device-side SDMA PUTs, symmetric memory registration, and completion synchronization, and adds AllGather, All2All, and AllReduce (including async) with C++ implementations, Python bindings, examples, and tests.

- Update SDMA device primitives and anvil transport layer
- Add collective framework base classes (topology detector, algorithm selector)
- Update shmem API and symmetric memory registration
- Update build system (CMakeLists, setup.py)

- Oneshot AllGather with registered output buffer support
- Async AllGather (start_async/wait_async) with monotonic flag sync
- Segmented AllGather with pre-allocated metadata buffers
- Oneshot All2All with async support
- Range-based registered buffer lookup for zero-copy SDMA writes
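
The range-based registered buffer lookup can be pictured with a small host-side sketch. This is not the code in this PR: `Region`, `RegionTable`, and the string `handle` are hypothetical names used for illustration, and the sketch assumes registered regions do not overlap. The idea is that a write can go straight to the destination (zero-copy) only when the whole target range falls inside one registered region, and a sorted table makes that check a binary search.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Region:
    base: int    # start address of the registered buffer
    size: int    # length in bytes
    handle: str  # placeholder for whatever registration returns (assumption)


class RegionTable:
    """Sorted table of non-overlapping registered regions (illustrative only)."""

    def __init__(self) -> None:
        self._bases: List[int] = []       # sorted base addresses
        self._regions: List[Region] = []  # regions in the same order

    def register(self, base: int, size: int, handle: str) -> None:
        # Keep both lists sorted by base address.
        i = bisect.bisect_left(self._bases, base)
        self._bases.insert(i, base)
        self._regions.insert(i, Region(base, size, handle))

    def deregister(self, base: int) -> None:
        i = bisect.bisect_left(self._bases, base)
        if i < len(self._bases) and self._bases[i] == base:
            del self._bases[i]
            del self._regions[i]

    def lookup(self, addr: int, nbytes: int) -> Optional[Tuple[Region, int]]:
        """Return (region, offset) if [addr, addr + nbytes) fits in one region."""
        # Last region whose base is <= addr.
        i = bisect.bisect_right(self._bases, addr) - 1
        if i < 0:
            return None
        region = self._regions[i]
        if addr + nbytes <= region.base + region.size:
            return region, addr - region.base
        return None  # caller would fall back to a staging copy
```

A range lookup (rather than an exact-pointer match) lets a caller pass a view into the middle of a registered buffer and still take the zero-copy path.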

- Twoshot AllReduce: ReduceScatter + AllGather via SDMA
- Async AllReduce with split PUT/WAIT phases
- Support bf16/fp16/fp32/uint32 dtypes
- Intra-node executors
- Intra-node allreduce with CU-based reduction
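
For readers unfamiliar with the two-shot decomposition, here is a minimal pure-Python simulation of the data flow, with ranks modeled as list entries. It only illustrates the ReduceScatter + AllGather structure; it does not model SDMA transfers, flags, or the kernels in this PR, and the function names are made up for the sketch. It assumes the element count is divisible by the number of ranks.

```python
from typing import List


def reduce_scatter(inputs: List[List[int]]) -> List[List[int]]:
    """Phase 1: rank r ends up owning the fully reduced chunk r."""
    world = len(inputs)
    n = len(inputs[0])
    assert n % world == 0, "sketch assumes length divisible by world size"
    chunk = n // world
    owned = []
    for r in range(world):
        lo, hi = r * chunk, (r + 1) * chunk
        # Sum element i across every rank's input for the chunk rank r owns.
        owned.append([sum(inputs[p][i] for p in range(world))
                      for i in range(lo, hi)])
    return owned


def all_gather(owned: List[List[int]]) -> List[List[int]]:
    """Phase 2: every rank receives every reduced chunk, in rank order."""
    full = [x for chunk in owned for x in chunk]
    return [list(full) for _ in owned]


def twoshot_allreduce(inputs: List[List[int]]) -> List[List[int]]:
    return all_gather(reduce_scatter(inputs))
```

With two ranks holding `[1, 2, 3, 4]` and `[10, 20, 30, 40]`, rank 0 reduces the first half, rank 1 the second, and after the gather both ranks hold `[11, 22, 33, 44]`.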

- Pybind11 bindings for AllGather, All2All, AllReduce SDMA classes
- Python AllgatherSdma/All2allSdma/AllreduceSdma wrappers
- Support any dtype via uint32 reinterpretation
- Register/deregister output buffers and async API from Python
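
The uint32 reinterpretation works because a copy moves bytes without interpreting them, so any payload whose size is a multiple of 4 bytes can be presented as uint32 words and viewed as the original dtype on the other side. The stdlib-only sketch below shows the idea with fp16 values; the helper names are hypothetical and this is not the wrapper code from this PR. Note that this applies to copy-style collectives; a reduction still needs the real element type.

```python
import struct
from typing import List


def as_uint32_words(payload: bytes) -> List[int]:
    """View a byte payload as little-endian uint32 words (bits unchanged)."""
    if len(payload) % 4 != 0:
        raise ValueError("payload must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(payload) // 4}I", payload))


def from_uint32_words(words: List[int]) -> bytes:
    """Inverse of as_uint32_words: recover the original bytes."""
    return struct.pack(f"<{len(words)}I", *words)


# Four fp16 values occupy 8 bytes, i.e. two uint32 words.
halfs = struct.pack("<4e", 1.0, -2.0, 0.5, 3.0)
words = as_uint32_words(halfs)
restored = struct.unpack("<4e", from_uint32_words(words))
```

Here `restored` equals the original four values, since the round trip never touches the bit patterns.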

- AllGather, All2All, AllReduce C++ examples and benchmarks - SDMA bandwidth benchmark (sdma_bw_allgather) - Intra-node example/benchmarks - Python test suites: allgather, all2all, allreduce, async, overlap - RCCL comparison benchmark

wuyl1 added 5 commits March 25, 2026 17:40

Feat(ccl): Add SDMA AllReduce and intra-node collectives

f18923c

- Twoshot AllReduce: ReduceScatter + AllGather via SDMA
- Async AllReduce with split PUT/WAIT phases
- Support bf16/fp16/fp32/uint32 dtypes
- Intra-node executors
- Intra-node allreduce with CU-based reduction

wuyl1 force-pushed the sdma-wuyl branch 13 times, most recently from 5629b26 to 295614e on March 27, 2026 12:55

Squash sdma-new commits after 7aac195

3456022

wuyl1 force-pushed the sdma-wuyl branch from 295614e to 100d5af on April 1, 2026 07:25

add AR overlap UT

60b8394

wuyl1 force-pushed the sdma-wuyl branch 9 times, most recently from 4ea3662 to d20e151 on April 2, 2026 05:49

wuyl1 force-pushed the sdma-wuyl branch 10 times, most recently from 939b18e to 6b3d66f on April 2, 2026 07:43

3 phrase AR

81cd3f3

wuyl1 force-pushed the sdma-wuyl branch 12 times, most recently from 7fccd60 to 941c4a9 on April 3, 2026 09:33

4 stage overlap

a463647

wuyl1 force-pushed the sdma-wuyl branch from 941c4a9 to a463647 on April 3, 2026 09:38

wuyl1 wants to merge 9 commits into main from sdma-wuyl
