Add ReduceScatter SDMA implementation and AG/RS benchmarks #200
Open
amd-andycha wants to merge 1 commit into sdma-new from
- Add ReduceScatterSdma C++ class (hpp/cpp) leveraging SdmaReduceScatterKernel
- Add pybind11 binding for ReduceScatterSdmaHandle
- Add Python wrapper class ReduceScatterSdma in ccl module
- Add standalone and overlap benchmark scripts for AG/RS
- Add GEMM standalone baseline script
- Update AllGather test scripts with GEMM overlap support
- Update CMakeLists.txt to include new source file
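For reviewers less familiar with the collective, here is a minimal pure-Python sketch of what a sum ReduceScatter (and its AllGather counterpart) is expected to compute. It is only a reference for the semantics, not the SDMA implementation or the mori API: the function names `reduce_scatter_reference` and `all_gather_reference` are hypothetical, and the sketch assumes a sum reduction with equally sized chunks per rank.

```python
from typing import List


def reduce_scatter_reference(inputs: List[List[float]]) -> List[List[float]]:
    """Reference semantics for a sum ReduceScatter (hypothetical helper).

    inputs[r] is the full-length buffer contributed by rank r.
    The result's entry r is the chunk that rank r would own afterwards.
    """
    world_size = len(inputs)
    if world_size == 0:
        raise ValueError("need at least one rank")
    n = len(inputs[0])
    if any(len(buf) != n for buf in inputs):
        raise ValueError("all ranks must contribute buffers of equal length")
    if n % world_size != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")

    chunk = n // world_size
    # Element-wise sum across ranks (the "reduce" step).
    reduced = [sum(vals) for vals in zip(*inputs)]
    # Rank r keeps only its own contiguous chunk (the "scatter" step).
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def all_gather_reference(shards: List[List[float]]) -> List[List[float]]:
    """Reference semantics for AllGather: every rank ends with all shards, in rank order."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


if __name__ == "__main__":
    per_rank_inputs = [[1, 2, 3, 4], [10, 20, 30, 40]]
    shards = reduce_scatter_reference(per_rank_inputs)
    print(shards)
    print(all_gather_reference(shards))
```

A check of this kind can serve as a small-size correctness oracle when comparing the outputs of different backends, independent of how the data movement is actually performed.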
Collaborator
@zhangfei829 @wuyl1 Can you help review this PR, thanks!
Contributor
LGTM, thanks for your work
Collaborator
nice work, @amd-andycha
Summary
- Add ReduceScatterSdma C++ class and Python bindings, enabling SDMA-based ReduceScatter collective operations

Key Results (MI355X, 8 GPUs, 128 MB/PE)
Files Changed
New C++ Implementation:
- include/mori/collective/reducescatter/reducescatter_sdma_class.hpp
- src/collective/core/reducescatter_sdma_class.cpp

Modified (bindings & build):
- src/pybind/mori.cpp: ReduceScatterSdmaHandle binding
- src/collective/CMakeLists.txt: add new source
- python/mori/ccl/collective.py: ReduceScatterSdma wrapper class
- python/mori/ccl/__init__.py: export ReduceScatterSdma

New Test & Benchmark Scripts:
- tests/python/ccl/test_reducescatter_overlap.py: RS SDMA test with optional GEMM overlap
- tests/python/ccl/test_rccl_reducescatter.py: RS RCCL baseline test
- tests/python/ccl/bench_gemm_standalone.py: GEMM standalone baseline
- tests/python/ccl/bench_allgather_sweep.sh: AG latency sweep
- tests/python/ccl/bench_reducescatter_sweep.sh: RS latency sweep
- tests/python/ccl/bench_ag_overlap_sweep.sh: AG + GEMM overlap sweep
- tests/python/ccl/bench_rs_overlap_sweep.sh: RS + GEMM overlap sweep

Modified Existing Tests:
- tests/python/ccl/test_allgather_overlap.py: add GEMM overlap support
- tests/python/ccl/test_rccl_allgather.py: add GEMM overlap support

Test Plan