added 5 commits
March 25, 2026 17:40
- Update SDMA device primitives and anvil transport layer
- Add collective framework base classes (topology detector, algorithm selector)
- Update shmem API and symmetric memory registration
- Update build system (CMakeLists, setup.py)
- Oneshot AllGather with registered output buffer support
- Async AllGather (start_async/wait_async) with monotonic flag sync
- Segmented AllGather with pre-allocated metadata buffers
- Oneshot All2All with async support
- Range-based registered buffer lookup for zero-copy SDMA writes
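The range-based lookup in the last item can be illustrated with a small host-side sketch. This is not the PR's implementation: `BufferRegistry`, its methods, and the integer addresses are hypothetical names chosen for illustration, and the sketch assumes registered regions never overlap. It shows one plausible way an arbitrary pointer inside a registered output buffer could be resolved to its registration handle plus an offset, so a write can target the buffer directly instead of going through a staging copy.

```python
import bisect


class BufferRegistry:
    """Hypothetical registry mapping address ranges to registration handles.

    Assumes registered regions do not overlap.
    """

    def __init__(self):
        self._bases = []    # sorted base addresses, used for binary search
        self._regions = []  # (base, size, handle), same order as _bases

    def register(self, base, size, handle):
        i = bisect.bisect_left(self._bases, base)
        self._bases.insert(i, base)
        self._regions.insert(i, (base, size, handle))

    def deregister(self, base):
        i = bisect.bisect_left(self._bases, base)
        if i < len(self._bases) and self._bases[i] == base:
            del self._bases[i]
            del self._regions[i]

    def lookup(self, addr, nbytes=1):
        """Return (handle, offset) if [addr, addr + nbytes) lies inside one
        registered region, otherwise None."""
        i = bisect.bisect_right(self._bases, addr) - 1
        if i < 0:
            return None
        base, size, handle = self._regions[i]
        if addr + nbytes <= base + size:
            return handle, addr - base
        return None


registry = BufferRegistry()
registry.register(0x1000, 0x100, "A")
registry.register(0x2000, 0x80, "B")

hit = registry.lookup(0x1010, 16)        # inside region "A"
overrun = registry.lookup(0x2070, 0x20)  # runs past the end of region "B"
miss = registry.lookup(0x500)            # below every registered region
```

A lookup that returns `None` would be the point where a caller falls back to a staged copy; whether the PR does this is not stated in the commit message.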
- Twoshot AllReduce: ReduceScatter + AllGather via SDMA
- Async AllReduce with split PUT/WAIT phases
- Support bf16/fp16/fp32/uint32 dtypes
- Intra-node executors
- Intra-node allreduce with CU-based reduction
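The two-shot decomposition named in the first item (ReduceScatter followed by AllGather) can be checked with a pure-Python simulation. This is only a sketch of the data flow, with ranks modeled as lists in one process; the function names are hypothetical, it assumes the element count is divisible by the number of ranks, and it involves no SDMA or GPU code.

```python
def reduce_scatter(inputs):
    """Phase 1: rank r ends up with the sum of chunk r across all ranks."""
    n = len(inputs)
    chunk = len(inputs[0]) // n  # assumes length is divisible by n
    return [
        [sum(inputs[src][r * chunk + i] for src in range(n)) for i in range(chunk)]
        for r in range(n)
    ]


def all_gather(shards):
    """Phase 2: every rank receives all reduced chunks in rank order."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def twoshot_allreduce(inputs):
    """AllReduce expressed as ReduceScatter followed by AllGather."""
    return all_gather(reduce_scatter(inputs))


# Two simulated ranks with four elements each.
inputs = [[1, 2, 3, 4], [10, 20, 30, 40]]
shards = reduce_scatter(inputs)
outputs = twoshot_allreduce(inputs)
```

Each rank reduces only its own chunk in the first phase, so the reduction work is split evenly across ranks before the reduced chunks are redistributed.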
- Pybind11 bindings for AllGather, All2All, AllReduce SDMA classes
- Python AllgatherSdma/All2allSdma/AllreduceSdma wrappers
- Support any dtype via uint32 reinterpretation
- Register/deregister output buffers and async API from Python
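The uint32 reinterpretation in the third item relies on the fact that a pure data-movement collective such as AllGather never inspects element values, so any payload whose byte size is a multiple of 4 can travel as uint32 words. The sketch below demonstrates the idea with the standard library only. The helper names are hypothetical and the real wrappers may do this differently; applying the same trick to a reduction would not be valid, since summing uint32 words is not the same as summing fp16 values.

```python
import struct


def as_uint32_words(payload: bytes) -> memoryview:
    """View a byte payload as native-endian uint32 words (hypothetical helper)."""
    if len(payload) % 4:
        raise ValueError("payload size must be a multiple of 4 bytes")
    return memoryview(payload).cast("I")


def allgather_words(per_rank_words):
    """Simulated AllGather: concatenate each rank's words in rank order."""
    out = []
    for words in per_rank_words:
        out.extend(words)
    return out


# Two simulated ranks, each holding four fp16 values (8 bytes = 2 uint32 words).
rank_inputs = [
    struct.pack("<4e", 1.0, 2.0, 3.0, 4.0),
    struct.pack("<4e", 5.0, 6.0, 7.0, 8.0),
]
gathered = allgather_words([as_uint32_words(p) for p in rank_inputs])

# Reinterpret the gathered words as fp16 again; the bytes are unchanged.
raw = struct.pack(f"{len(gathered)}I", *gathered)
values = list(struct.unpack("<8e", raw))
```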
- AllGather, All2All, AllReduce C++ examples and benchmarks
- SDMA bandwidth benchmark (sdma_bw_allgather)
- Intra-node example/benchmarks
- Python test suites: allgather, all2all, allreduce, async, overlap
- RCCL comparison benchmark
This PR merges intra-node collective communication (CCL) over SDMA into main. It extends the existing shmem / anvil stack with device-side SDMA PUTs, symmetric memory registration, and completion synchronization, and adds AllGather, All2All, and AllReduce (including async) with C++ implementations, Python bindings, examples, and tests.
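As a rough illustration of the completion synchronization behind the async variants, the following sketch simulates a start/wait pair with a monotonically increasing flag per rank. It is a thread-based model under stated assumptions, not the device-side code: `MonotonicFlag`, `AsyncGatherSim`, and the method signatures are hypothetical, and threads stand in for SDMA PUTs. The point it shows is that when each operation carries a sequence number and waiters test `flag >= seq`, flags never have to be reset between iterations.

```python
import threading


class MonotonicFlag:
    """Hypothetical completion flag whose value only ever increases."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def signal(self, seq):
        with self._cond:
            self._value = max(self._value, seq)
            self._cond.notify_all()

    def wait_at_least(self, seq, timeout=5.0):
        with self._cond:
            return self._cond.wait_for(lambda: self._value >= seq, timeout)


class AsyncGatherSim:
    """Simulated async AllGather; one thread per rank plays the role of a PUT."""

    def __init__(self, world_size):
        self.flags = [MonotonicFlag() for _ in range(world_size)]
        self.slots = [None] * world_size
        self.seq = 0
        self._threads = []

    def start_async(self, inputs):
        self.seq += 1
        seq = self.seq

        def put(rank):
            self.slots[rank] = inputs[rank]  # write the data first
            self.flags[rank].signal(seq)     # then publish completion

        self._threads = [
            threading.Thread(target=put, args=(r,)) for r in range(len(inputs))
        ]
        for t in self._threads:
            t.start()
        return seq

    def wait_async(self, seq):
        done = all(flag.wait_at_least(seq) for flag in self.flags)
        for t in self._threads:
            t.join()
        if not done:
            raise TimeoutError("peer did not signal completion")
        return list(self.slots)


sim = AsyncGatherSim(world_size=2)
first = sim.wait_async(sim.start_async(["a0", "a1"]))
second = sim.wait_async(sim.start_async(["b0", "b1"]))  # flags are not reset
```

Work placed between `start_async` and `wait_async` would run while the transfers are in flight, which is the overlap the async API is meant to enable.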