Add ReduceScatter SDMA implementation and AG/RS benchmarks #200

Open
amd-andycha wants to merge 1 commit into sdma-new from feat/reducescatter-sdma

@amd-andycha

Summary

  • Implement the ReduceScatterSdma C++ class and Python bindings, enabling an SDMA-based ReduceScatter collective operation
  • Add comprehensive benchmark scripts for standalone AG/RS latency sweep and compute-communication overlap testing (AG/RS + GEMM)
  • Update existing AllGather test scripts with GEMM overlap support
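
For readers less familiar with the two collectives benchmarked here, the following is a minimal pure-Python sketch of what ReduceScatter (sum) and AllGather compute. It simulates ranks as plain lists and is only a reference for the semantics; the function names are hypothetical and it does not use or reflect the ReduceScatterSdma API added in this PR.

```python
from typing import List


def reduce_scatter_reference(inputs: List[List[float]]) -> List[List[float]]:
    """Reference sum reduce-scatter over simulated ranks.

    inputs[r] is rank r's full input buffer. Every buffer has the same
    length, which must be divisible by the number of ranks. Rank r ends up
    with the element-wise sum of chunk r taken from every rank's buffer.
    """
    world = len(inputs)
    n = len(inputs[0])
    if any(len(buf) != n for buf in inputs) or n % world != 0:
        raise ValueError("buffers must share a length divisible by world size")
    chunk = n // world
    outputs = []
    for r in range(world):
        lo, hi = r * chunk, (r + 1) * chunk
        # Sum element i across all ranks, for each i in rank r's chunk.
        outputs.append([sum(buf[i] for buf in inputs) for i in range(lo, hi)])
    return outputs


def all_gather_reference(inputs: List[List[float]]) -> List[List[float]]:
    """Reference all-gather: every rank receives the rank-ordered concatenation."""
    gathered = [x for buf in inputs for x in buf]
    return [list(gathered) for _ in inputs]


# Two simulated ranks, four elements each.
rs_out = reduce_scatter_reference([[1, 2, 3, 4], [10, 20, 30, 40]])
ag_out = all_gather_reference(rs_out)
```

Applying AllGather to the ReduceScatter output reproduces the full element-wise sum on every rank, which is one way a correctness test can build its expected values.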

Key Results (MI355X, 8 GPUs, 128 MB/PE)

| Metric | SDMA | RCCL | SDMA Advantage |
|---|---|---|---|
| RS standalone BW | 381 GB/s | 326 GB/s | +16.9% |
| AG standalone BW | 361 GB/s | 333 GB/s | +8.5% |
| RS+GEMM overlap (16384³) | 1.34x | 1.25x | better overlap |
| AG+GEMM overlap (16384³) | 1.41x | 1.23x | better overlap |
| CU usage during comm | 0 CU | uses CU | no CU contention |
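
The "SDMA Advantage" percentages for the two bandwidth rows can be recomputed from the rounded GB/s figures above. This is a small sketch that assumes the advantage is the relative difference (SDMA / RCCL - 1); the helper name is hypothetical.

```python
def relative_advantage_pct(sdma_gbps: float, rccl_gbps: float) -> float:
    """Relative bandwidth advantage of SDMA over RCCL, in percent (assumed definition)."""
    return (sdma_gbps / rccl_gbps - 1.0) * 100.0


# Rounded bandwidths taken from the Key Results table.
rs = relative_advantage_pct(381, 326)
ag = relative_advantage_pct(361, 333)
print(f"RS standalone: +{rs:.1f}%")
print(f"AG standalone: +{ag:.1f}%")
```

From the rounded inputs the RS row comes out at about +16.9% and the AG row at about +8.4%. The reported +8.5% for AG was presumably computed from unrounded measurements.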

Files Changed

New C++ Implementation:

  • include/mori/collective/reducescatter/reducescatter_sdma_class.hpp
  • src/collective/core/reducescatter_sdma_class.cpp

Modified (bindings & build):

  • src/pybind/mori.cpp — ReduceScatterSdmaHandle binding
  • src/collective/CMakeLists.txt — add new source
  • python/mori/ccl/collective.py — ReduceScatterSdma wrapper class
  • python/mori/ccl/__init__.py — export ReduceScatterSdma

New Test & Benchmark Scripts:

  • tests/python/ccl/test_reducescatter_overlap.py — RS SDMA test with optional GEMM overlap
  • tests/python/ccl/test_rccl_reducescatter.py — RS RCCL baseline test
  • tests/python/ccl/bench_gemm_standalone.py — GEMM standalone baseline
  • tests/python/ccl/bench_allgather_sweep.sh — AG latency sweep
  • tests/python/ccl/bench_reducescatter_sweep.sh — RS latency sweep
  • tests/python/ccl/bench_ag_overlap_sweep.sh — AG + GEMM overlap sweep
  • tests/python/ccl/bench_rs_overlap_sweep.sh — RS + GEMM overlap sweep

Modified Existing Tests:

  • tests/python/ccl/test_allgather_overlap.py — add GEMM overlap support
  • tests/python/ccl/test_rccl_allgather.py — add GEMM overlap support

Test Plan

  • ReduceScatter SDMA correctness verified (output matches expected on all 8 PEs)
  • Standalone latency sweep (10-128 MB) — SDMA outperforms RCCL at all sizes
  • Compute-communication overlap tested with GEMM 4096/8192/16384
  • SDMA ON/OFF comparison completed
  • All tests re-run and verified on MI355X 8-GPU node
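
The PR does not spell out how the overlap factors (for example 1.34x) are computed. One common convention, assumed here, is sequential time divided by overlapped time. The sketch below uses that definition with placeholder timings that are not measurements from this PR; the function names are hypothetical.

```python
def overlap_speedup(t_comm_ms: float, t_gemm_ms: float, t_overlapped_ms: float) -> float:
    """Assumed metric: (comm + GEMM run back to back) / (both run concurrently)."""
    if t_overlapped_ms <= 0:
        raise ValueError("overlapped time must be positive")
    return (t_comm_ms + t_gemm_ms) / t_overlapped_ms


def ideal_overlap_speedup(t_comm_ms: float, t_gemm_ms: float) -> float:
    """Upper bound under the same definition: the shorter phase is fully hidden."""
    return (t_comm_ms + t_gemm_ms) / max(t_comm_ms, t_gemm_ms)


# Placeholder timings in milliseconds, chosen only to illustrate the arithmetic.
measured = overlap_speedup(t_comm_ms=2.0, t_gemm_ms=3.0, t_overlapped_ms=3.5)
bound = ideal_overlap_speedup(t_comm_ms=2.0, t_gemm_ms=3.0)
print(f"speedup {measured:.2f}x, ideal bound {bound:.2f}x")
```

Under this definition a factor closer to the ideal bound means communication is hidden more completely behind the GEMM, which is consistent with the "no CU contention" row in the results.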

- Add ReduceScatterSdma C++ class (hpp/cpp) leveraging SdmaReduceScatterKernel
- Add pybind11 binding for ReduceScatterSdmaHandle
- Add Python wrapper class ReduceScatterSdma in ccl module
- Add standalone and overlap benchmark scripts for AG/RS
- Add GEMM standalone baseline script
- Update AllGather test scripts with GEMM overlap support
- Update CMakeLists.txt to include new source file

@TianDi101
Collaborator

@zhangfei829 @wuyl1 Can you help review this PR, thanks!

@wuyl1
Contributor

wuyl1 commented Mar 18, 2026

LGTM, thanks for your work

@zhangfei829
Collaborator

nice work, @amd-andycha
