Merged
Collaborator
@yanboshao Issues to address:
coderfeli
reviewed
Apr 2, 2026
| stream, params, extra));
| }
|
| extern "C" void mgpuLaunchClusterKernel(hipFunction_t function,
Collaborator
why remove all of these?
coderfeli
reviewed
Apr 2, 2026
| @@ -0,0 +1,830 @@
| """Custom all-reduce kernel + Python-facing shim.
coderfeli
reviewed
Apr 2, 2026
| @@ -0,0 +1,820 @@
| """FlyDSL all-reduce kernels using signal protocol for multi-GPU communication.
coderfeli
reviewed
Apr 2, 2026
| self_sg_i64 = _unwrap_value(self_sg)
| sg_ptrs_i64 = _unwrap_value(sg_ptrs)
| in_ptrs_i64 = _unwrap_value(in_ptrs)
| out_ptr_i64 = _unwrap_value(out_ptr)
Collaborator
Why use so many `ir.*` wrap/unwrap calls? Try to use the native types in numeric.py.
coderfeli
reviewed
Apr 2, 2026
| Each warp loads data from one rank into shared memory, then warp 0
| reduces across all warps and writes the result to global memory.
| """
| from flydsl._mlir.dialects import arith, memref, scf, vector
Collaborator
use fly.memref and arith?
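The docstring quoted in this thread describes the kernel's data movement: each warp copies one rank's buffer into a shared-memory tile, then warp 0 reduces across the tiles. A minimal NumPy sketch of that pattern, with illustrative names (not the actual FlyDSL kernel), looks like:

```python
import numpy as np

def one_stage_reduce(rank_buffers):
    """Model of the shared-memory reduction: one 'warp' per rank."""
    num_ranks = len(rank_buffers)
    n = rank_buffers[0].shape[0]
    shared = np.empty((num_ranks, n), dtype=rank_buffers[0].dtype)
    # Phase 1: warp w loads rank w's data into its shared-memory row.
    for w in range(num_ranks):
        shared[w] = rank_buffers[w]
    # Phase 2: warp 0 reduces across all rows and writes to global memory.
    return shared.sum(axis=0)

bufs = [np.full(4, r, dtype=np.float32) for r in range(8)]
print(one_stage_reduce(bufs))  # each element = 0+1+...+7 = 28
```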
coderfeli
reviewed
Apr 2, 2026
| p = bfor.arguments[0]
| cond = arith.CmpIOp(arith.CmpIPredicate.ult, p,
|                     ea.constant(num_packs, type=i32)).result
| scf.ConditionOp(cond, [p, bfor.arguments[1]])
Collaborator
Can you directly use the `>` / `<` operator overloads and `arith.cond` here?
Force-pushed: 158a605 → 37cf182
Collaborator
|
@yanboshao still has conflicts with main |
Force-pushed: 37cf182 → 8907433
coderfeli
reviewed
Apr 7, 2026
|     self._call_state_cache[cache_key] = state
| except Exception:
|     pass
Collaborator
This seems to already be in main; try using main directly.
- Add stream_ptr param to run_{1stage,2stage,2stage_ptr} wrappers and launch via async deps.
- Pass torch current stream in FlyDSL custom_all_reduce to avoid per-launch stream create/sync/destroy.
- Keep AIter CustomAllreduce import compatible across package layouts.
Co-authored-by: Cursor <cursoragent@cursor.com>
Extend run_2stage_ptr ABI with inp_ptr_override to avoid per-call H2D pointer updates. Cache grid_x and add optional out reuse / validation controls to reduce host overhead. Made-with: Cursor
Keep write-mode graph replay on stable output buffers and harden write-mode pointer accesses so the large fp16 cudagraph path runs reliably while preserving the end_sync ISA alignment changes. Made-with: Cursor
Force-pushed: 8907433 → f745541
Force-pushed: f745541 → 2db536a
coderfeli
approved these changes
Apr 8, 2026
Motivation
Add an all-reduce kernel in FlyDSL, including a 1-stage kernel and a 2-stage kernel.
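As a semantic reference for the two paths (not the FlyDSL implementation itself): a 1-stage all-reduce sums every rank's full buffer in one pass, while a 2-stage all-reduce first does a reduce-scatter (each rank reduces one shard) and then an all-gather. A hedged NumPy sketch showing that both yield the same result:

```python
import numpy as np

def all_reduce_1stage(bufs):
    # Every rank reads all peers' full buffers and sums them.
    return [np.sum(bufs, axis=0) for _ in bufs]

def all_reduce_2stage(bufs):
    world = len(bufs)
    shards = [np.array_split(b, world) for b in bufs]
    # Stage 1 (reduce-scatter): rank r reduces shard r from all ranks.
    reduced = [np.sum([shards[src][r] for src in range(world)], axis=0)
               for r in range(world)]
    # Stage 2 (all-gather): every rank concatenates all reduced shards.
    out = np.concatenate(reduced)
    return [out.copy() for _ in range(world)]

bufs = [np.arange(8, dtype=np.float32) + r for r in range(4)]
assert all(np.allclose(a, b) for a, b in zip(all_reduce_1stage(bufs),
                                             all_reduce_2stage(bufs)))
```

The 2-stage variant trades an extra synchronization for reading only `1/world` of each peer's buffer per rank, which is why it tends to win at larger message sizes.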
Technical Details
Test Plan
Test accuracy and performance on MI308 and MI355.
Test Result
Detailed Results on MI308
1-stage Kernel Path
World Size = 2
World Size = 4
World Size = 8
2-stage Kernel Path
World Size = 4
World Size = 8
write-mode Kernel Path (World Size = 8 only)
Stress Test (Near 64 MB Limit)
Irregular Shapes
Detailed Results on MI355
1-stage Kernel Path
World Size = 2
World Size = 4
World Size = 8
2-stage Kernel Path
World Size = 4
World Size = 8
Stress Test (Near 64 MB Limit)
Irregular Shapes
Speedup = aiter_avg_time_us / flydsl_avg_time_us
Bold indicates speedup ≥ 1.2x; ⚠ indicates regression (speedup < 0.8).
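The legend above can be expressed as a small helper; the function and label names here are illustrative, not part of the benchmark harness:

```python
def classify_speedup(aiter_avg_time_us, flydsl_avg_time_us):
    """Compute speedup and the table marking used in the results above."""
    speedup = aiter_avg_time_us / flydsl_avg_time_us
    if speedup >= 1.2:
        label = "bold"        # highlighted as a clear win
    elif speedup < 0.8:
        label = "regression"  # flagged with a warning mark in the tables
    else:
        label = "neutral"
    return speedup, label

print(classify_speedup(120.0, 80.0))  # (1.5, 'bold')
print(classify_speedup(60.0, 100.0))  # (0.6, 'regression')
```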
Submission Checklist