
Sdma ccl #235

Open
wuyl1 wants to merge 9 commits into main from sdma-wuyl

wuyl1 (Contributor) commented Mar 27, 2026

This PR merges intra-node collective communication (CCL) over SDMA into main. It extends the existing shmem / anvil stack with device-side SDMA PUTs, symmetric memory registration, and completion synchronization, and adds AllGather, All2All, and AllReduce (including async) with C++ implementations, Python bindings, examples, and tests.

wuyl1 added 5 commits March 25, 2026 17:40
- Update SDMA device primitives and anvil transport layer

- Add collective framework base classes (topology detector, algorithm selector)

- Update shmem API and symmetric memory registration

- Update build system (CMakeLists, setup.py)
- Oneshot AllGather with registered output buffer support

- Async AllGather (start_async/wait_async) with monotonic flag sync

- Segmented AllGather with pre-allocated metadata buffers

- Oneshot All2All with async support

- Range-based registered buffer lookup for zero-copy SDMA writes
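The PR does not spell out how "monotonic flag sync" works. One common reading is that each asynchronous operation gets an increasing sequence number, the producer publishes that number as its completion flag, and the waiter blocks until the flag is at least that value, so the flag never has to be reset between operations. The sketch below illustrates that idea on the host with threads. `MonotonicFlag`, `ToyAsyncGather`, and the `start_async`/`wait_async` signatures shown here are hypothetical and are not taken from the PR's code.

```python
import threading


class MonotonicFlag:
    """Completion flag that only ever increases; it is never reset."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def signal(self, seq):
        # Publish completion of operation `seq`; the value never moves backwards.
        with self._cond:
            self._value = max(self._value, seq)
            self._cond.notify_all()

    def wait_for(self, seq, timeout=5.0):
        # Block until the flag has reached at least `seq` (or the timeout expires).
        with self._cond:
            return self._cond.wait_for(lambda: self._value >= seq, timeout)


class ToyAsyncGather:
    """Hypothetical host-side stand-in for an async gather.

    Assumes at most one operation is in flight at a time.
    """

    def __init__(self):
        self.flag = MonotonicFlag()
        self._seq = 0
        self._results = {}

    def start_async(self, chunks):
        self._seq += 1
        seq = self._seq

        def work():
            # Store the result first, then publish the flag.
            self._results[seq] = [x for chunk in chunks for x in chunk]
            self.flag.signal(seq)

        threading.Thread(target=work).start()
        return seq

    def wait_async(self, seq):
        if not self.flag.wait_for(seq):
            raise TimeoutError(f"operation {seq} did not complete")
        return self._results.pop(seq)
```

Because the waiter compares with `>=` rather than `==`, a stale flag from an earlier operation can never satisfy a later wait, which is the property a reset-free flag relies on.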
- Twoshot AllReduce: ReduceScatter + AllGather via SDMA

- Async AllReduce with split PUT/WAIT phases

- Support bf16/fp16/fp32/uint32 dtypes

- Intra-node executors

- Intra-node allreduce with CU-based reduction
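To make the "ReduceScatter + AllGather" decomposition concrete, here is a minimal single-process sketch of a two-shot sum AllReduce over plain Python lists. It shows the data flow only: there is no SDMA, no GPU, and no real communication. The function name `twoshot_allreduce` and the requirement that the buffer length divide evenly by the number of ranks are assumptions made for illustration.

```python
from typing import List


def twoshot_allreduce(rank_buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a two-shot sum AllReduce; rank_buffers[r] is rank r's input."""
    world = len(rank_buffers)
    n = len(rank_buffers[0])
    if any(len(buf) != n for buf in rank_buffers):
        raise ValueError("all ranks must contribute buffers of equal length")
    if n % world != 0:
        raise ValueError("this sketch assumes length divisible by world size")
    chunk = n // world

    # Phase 1 (ReduceScatter): rank r ends up owning the fully reduced chunk r.
    reduced_chunks = []
    for r in range(world):
        lo, hi = r * chunk, (r + 1) * chunk
        owned = [sum(buf[i] for buf in rank_buffers) for i in range(lo, hi)]
        reduced_chunks.append(owned)

    # Phase 2 (AllGather): every rank collects all reduced chunks in rank order.
    gathered = [x for owned in reduced_chunks for x in owned]
    return [list(gathered) for _ in range(world)]
```

With two ranks holding `[1, 2, 3, 4]` and `[10, 20, 30, 40]`, both ranks finish with `[11, 22, 33, 44]`. In this scheme each rank reduces only its own `1/world` slice of the data, and the second phase is pure data movement.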
- Pybind11 bindings for AllGather, All2All, AllReduce SDMA classes

- Python AllgatherSdma/All2allSdma/AllreduceSdma wrappers

- Support any dtype via uint32 reinterpretation

- Register/deregister output buffers and async API from Python
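The "any dtype via uint32 reinterpretation" item presumably relies on the fact that pure data-movement collectives only copy bytes, so a buffer of any element type can be viewed as 32-bit words as long as its byte length is a multiple of 4. The sketch below demonstrates that idea with the standard library only. The helper names `to_uint32_words` and `from_uint32_words` are hypothetical and do not come from the PR. The trick does not carry over to reductions, where arithmetic needs the real element type.

```python
import struct


def to_uint32_words(payload: bytes) -> list:
    """View a byte buffer as native-endian 32-bit unsigned words."""
    count, remainder = divmod(len(payload), 4)
    if remainder:
        raise ValueError("byte length must be a multiple of 4")
    return list(struct.unpack(f"={count}I", payload))


def from_uint32_words(words) -> bytes:
    """Inverse of to_uint32_words: pack the words back into the same bytes."""
    return struct.pack(f"={len(words)}I", *words)


# Four fp16 values occupy 8 bytes, which is exactly two uint32 words.
fp16_payload = struct.pack("=4e", 1.0, -2.5, 0.5, 3.0)
words = to_uint32_words(fp16_payload)

# Moving the words around and reinterpreting them recovers the fp16 values.
restored = struct.unpack("=4e", from_uint32_words(words))
```

Here `restored` holds the same four half-precision values that went in, since the words are copied verbatim and never interpreted numerically.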
- AllGather, All2All, AllReduce C++ examples and benchmarks

- SDMA bandwidth benchmark (sdma_bw_allgather)

- Intra-node example/benchmarks

- Python test suites: allgather, all2all, allreduce, async, overlap

- RCCL comparison benchmark
wuyl1 force-pushed the sdma-wuyl branch 13 times, most recently from 5629b26 to 295614e (March 27, 2026 12:55)
wuyl1 force-pushed the sdma-wuyl branch 9 times, most recently from 4ea3662 to d20e151 (April 2, 2026 05:49)
wuyl1 force-pushed the sdma-wuyl branch 10 times, most recently from 939b18e to 6b3d66f (April 2, 2026 07:43)
wuyl1 force-pushed the sdma-wuyl branch 12 times, most recently from 7fccd60 to 941c4a9 (April 3, 2026 09:33)