Add ReduceScatter SDMA implementation and AG/RS benchmarks by amd-andycha · Pull Request #200 · ROCm/mori

amd-andycha · 2026-03-18T09:54:13Z

Summary

Implement the ReduceScatterSdma C++ class and Python bindings, enabling an SDMA-based ReduceScatter collective operation
Add comprehensive benchmark scripts for standalone AG/RS latency sweep and compute-communication overlap testing (AG/RS + GEMM)
Update existing AllGather test scripts with GEMM overlap support
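For readers less familiar with the collective, here is a minimal pure-Python sketch of what a ReduceScatter computes. It is a reference model only, not the mori API: the function name `reduce_scatter_reference` is hypothetical, and a sum reduction with equal-sized chunks per PE is assumed.

```python
from typing import List


def reduce_scatter_reference(inputs: List[List[float]]) -> List[List[float]]:
    """Reference model: inputs[pe] is that PE's full input buffer.

    Returns outputs[pe], the chunk of the element-wise sum owned by each PE.
    Assumes a sum reduction and equal chunk sizes (illustrative assumptions).
    """
    num_pes = len(inputs)
    total = len(inputs[0])
    if any(len(buf) != total for buf in inputs):
        raise ValueError("all PEs must contribute buffers of equal length")
    if total % num_pes != 0:
        raise ValueError("buffer length must be divisible by the number of PEs")
    chunk = total // num_pes
    # "Reduce": element-wise sum across all PEs.
    reduced = [sum(buf[i] for buf in inputs) for i in range(total)]
    # "Scatter": PE i keeps only chunk i of the reduced buffer.
    return [reduced[pe * chunk:(pe + 1) * chunk] for pe in range(num_pes)]


# Two PEs, four elements each: PE 0 receives the first half of the sum,
# PE 1 the second half.
example = reduce_scatter_reference([[1.0, 2.0, 3.0, 4.0],
                                    [10.0, 20.0, 30.0, 40.0]])
print(example)
```

A model like this can serve as the expected value when checking a device implementation on small inputs.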

Key Results (MI355X, 8 GPUs, 128MB/PE)

Metric	SDMA	RCCL	SDMA Advantage
RS standalone BW	381 GB/s	326 GB/s	+16.9%
AG standalone BW	361 GB/s	333 GB/s	+8.5%
RS+GEMM overlap (16384³)	1.34x	1.25x	better overlap
AG+GEMM overlap (16384³)	1.41x	1.23x	better overlap
CU usage during comm	0 CU	uses CU	no CU contention
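The "SDMA Advantage" percentages for the bandwidth rows can be recomputed from the table as a relative gain over RCCL. The sketch below assumes that definition (SDMA / RCCL - 1); the helper name `relative_gain_percent` is hypothetical. Using the rounded table values, RS comes out at about +16.9% and AG at about +8.4%, so the reported +8.5% presumably comes from unrounded measurements.

```python
def relative_gain_percent(sdma_gbps: float, rccl_gbps: float) -> float:
    """Relative bandwidth gain of SDMA over RCCL, in percent."""
    return (sdma_gbps / rccl_gbps - 1.0) * 100.0


# Bandwidth figures as reported in the table (GB/s, rounded).
rs_gain = relative_gain_percent(381.0, 326.0)
ag_gain = relative_gain_percent(361.0, 333.0)

print(f"RS standalone: +{rs_gain:.1f}%")
print(f"AG standalone: +{ag_gain:.1f}%")
```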

Files Changed

New C++ Implementation:

include/mori/collective/reducescatter/reducescatter_sdma_class.hpp
src/collective/core/reducescatter_sdma_class.cpp

Modified (bindings & build):

src/pybind/mori.cpp — ReduceScatterSdmaHandle binding
src/collective/CMakeLists.txt — add new source
python/mori/ccl/collective.py — ReduceScatterSdma wrapper class
python/mori/ccl/__init__.py — export ReduceScatterSdma

New Test & Benchmark Scripts:

tests/python/ccl/test_reducescatter_overlap.py — RS SDMA test with optional GEMM overlap
tests/python/ccl/test_rccl_reducescatter.py — RS RCCL baseline test
tests/python/ccl/bench_gemm_standalone.py — GEMM standalone baseline
tests/python/ccl/bench_allgather_sweep.sh — AG latency sweep
tests/python/ccl/bench_reducescatter_sweep.sh — RS latency sweep
tests/python/ccl/bench_ag_overlap_sweep.sh — AG + GEMM overlap sweep
tests/python/ccl/bench_rs_overlap_sweep.sh — RS + GEMM overlap sweep

Modified Existing Tests:

tests/python/ccl/test_allgather_overlap.py — add GEMM overlap support
tests/python/ccl/test_rccl_allgather.py — add GEMM overlap support

Test Plan

ReduceScatter SDMA correctness verified (output matches expected on all 8 PEs)
Standalone latency sweep (10-128 MB) — SDMA outperforms RCCL at all sizes
Compute-communication overlap tested with GEMM 4096/8192/16384
SDMA ON/OFF comparison completed
All tests re-run and verified on MI355X 8-GPU node
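One simple way to run a correctness check of this kind is to have each PE fill its input with a rank-dependent constant, so the expected ReduceScatter output is known in closed form. The sketch below is an assumption about how such a check could look, not the logic of the test scripts in this PR; `find_mismatched_pes` and the fill pattern (PE r contributes r + 1) are hypothetical.

```python
import math
from typing import List


def find_mismatched_pes(outputs: List[List[float]],
                        expected_value: float,
                        rel_tol: float = 1e-6) -> List[int]:
    """Return the indices of PEs whose output differs from expected_value."""
    bad = []
    for pe, buf in enumerate(outputs):
        if not all(math.isclose(x, expected_value, rel_tol=rel_tol) for x in buf):
            bad.append(pe)
    return bad


num_pes = 8
# If PE r fills its whole input with (r + 1), every element of the summed
# result is 1 + 2 + ... + 8 = 36, whichever chunk a PE ends up owning.
expected = float(sum(range(1, num_pes + 1)))

# Stand-in for buffers copied back from the devices (illustrative only).
outputs = [[expected] * 4 for _ in range(num_pes)]
mismatched = find_mismatched_pes(outputs, expected)
print("mismatched PEs:", mismatched)
```

Reporting the failing PE indices, rather than a single pass/fail flag, makes it easier to see whether an error is tied to a particular peer.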

- Add ReduceScatterSdma C++ class (hpp/cpp) leveraging SdmaReduceScatterKernel
- Add pybind11 binding for ReduceScatterSdmaHandle
- Add Python wrapper class ReduceScatterSdma in ccl module
- Add standalone and overlap benchmark scripts for AG/RS
- Add GEMM standalone baseline script
- Update AllGather test scripts with GEMM overlap support
- Update CMakeLists.txt to include new source file

TianDi101 · 2026-03-18T11:22:48Z

@zhangfei829 @wuyl1 Can you help review this PR, thanks!

wuyl1 · 2026-03-18T12:33:20Z

LGTM, thanks for your work

zhangfei829 · 2026-03-24T03:51:18Z

nice work, @amd-andycha

amd-andycha wants to merge 1 commit into sdma-new from feat/reducescatter-sdma
