Conversation
Add JSON-based config files to store tuned block_num/warp_per_block for dispatch and combine kernels. Benchmark scripts can auto-save tuning results; runtime loads config based on GPU arch, kernel type, EP size, and matches by dtype + hidden_dim + num_tokens. New files: tuning_config.py, batch tuning scripts, sample JSON.
Motivation
The current EP dispatch/combine kernel launch parameters (block_num, warp_per_block, rdma_block_num) are either hardcoded or manually configured via environment variables. Different GPU architectures, EP sizes, token counts, and hidden dimensions require different optimal parameters. This PR adds an automated tuning config system that stores benchmark-derived optimal parameters in JSON files and loads them at runtime.
Technical Details
- `python/mori/ops/tuning_config.py`: unified dtype registry, JSON config loader with validation/caching, runtime lookup (exact match on dtype + hidden_dim, ceiling match on num_tokens), and atomic save with a keep-best merge strategy (sketched below).
- Config files are named `{gpu_arch}_{kernel_type}_ep{ep_size}[_{quant}].json` and contain separate `dispatch_rules` and `combine_rules` lists so dispatch and combine can use independent dtypes (e.g., dispatch fp4 + combine bf16).
- `dispatch_combine.py`: AUTO mode loads the config at init; `_resolve_launch_params` is extended to accept dtype + tuning_rules, and the dispatch/combine/standard_moe methods pass the actual `input.dtype` for lookup. Falls back to the existing hardcoded defaults when no config entry matches.
- Benchmark scripts (`bench_dispatch_combine.py`, `test_dispatch_combine_internode.py`): new `--hidden-dim` and `--save-tuning-config` CLI args; tuning results are auto-saved to the source repo for distribution.
- `tools/batch_intranode_tuning.sh` (single-node, sweeps tokens x hidden_dims) and `tools/batch_internode_tuning.sh` (multi-node via SSH).
- `setup.py` and `MANIFEST.in` updated to include the JSON configs in the wheel/sdist.
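For illustration, a minimal sketch of the on-disk format and file selection, assuming hypothetical rule field names (`dtype`, `hidden_dim`, `num_tokens`, `block_num`, `warp_per_block`, `bandwidth_gbps`) and helpers (`config_path`, `load_tuning_config`); the real schema and API live in `tuning_config.py`.

```python
import json
import os

# Hypothetical contents of a file such as gfx942_IntraNode_ep4.json
# (naming scheme: {gpu_arch}_{kernel_type}_ep{ep_size}[_{quant}].json).
# Field names and values are illustrative only.
EXAMPLE_CONFIG = {
    "dispatch_rules": [
        {"dtype": "bf16", "hidden_dim": 7168, "num_tokens": 128,
         "block_num": 64, "warp_per_block": 8, "bandwidth_gbps": 180.0},
    ],
    "combine_rules": [
        {"dtype": "bf16", "hidden_dim": 7168, "num_tokens": 128,
         "block_num": 80, "warp_per_block": 4, "bandwidth_gbps": 170.0},
    ],
}

def config_path(config_dir, gpu_arch, kernel_type, ep_size, quant=None):
    """Build the expected config filename for the current GPU/EP setup."""
    suffix = f"_{quant}" if quant else ""
    return os.path.join(
        config_dir, f"{gpu_arch}_{kernel_type}_ep{ep_size}{suffix}.json"
    )

def load_tuning_config(path):
    """Load a tuning config, returning None when no file exists for this setup."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```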
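Runtime lookup matches exactly on dtype and hidden_dim and takes the ceiling match on num_tokens, falling back to the hardcoded defaults when nothing matches. A sketch of that matching logic, under the same assumed field names and with a hypothetical function name:

```python
def find_launch_params(rules, dtype, hidden_dim, num_tokens):
    """Pick tuned (block_num, warp_per_block) for one call.

    Exact match on dtype and hidden_dim; among those rules, take the one
    whose num_tokens is the smallest value >= the requested num_tokens
    (ceiling match). Returning None tells the caller to keep the existing
    hardcoded defaults.
    """
    candidates = [
        r for r in rules
        if r["dtype"] == dtype
        and r["hidden_dim"] == hidden_dim
        and r["num_tokens"] >= num_tokens
    ]
    if not candidates:
        return None
    best = min(candidates, key=lambda r: r["num_tokens"])
    return best["block_num"], best["warp_per_block"]

# With the illustrative config above:
# find_launch_params(EXAMPLE_CONFIG["dispatch_rules"], "bf16", 7168, 100)
# -> (64, 8), since the 128-token rule is the ceiling match for 100 tokens.
```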
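The save path merges new benchmark results into any existing rules, keeping the better result per (dtype, hidden_dim, num_tokens) key, and writes atomically so a partial write never leaves a truncated JSON behind. A sketch under the same assumptions; `merge_keep_best`, `atomic_save`, and the `bandwidth_gbps` metric field are hypothetical names for this example:

```python
import json
import os
import tempfile

def merge_keep_best(existing_rules, new_rules, metric="bandwidth_gbps"):
    """Merge new benchmark results into existing rules.

    Rules are keyed by (dtype, hidden_dim, num_tokens); when both lists
    contain the same key, keep the rule with the higher benchmark metric,
    so repeated tuning runs never regress a stored result.
    """
    key = lambda r: (r["dtype"], r["hidden_dim"], r["num_tokens"])
    merged = {key(r): r for r in existing_rules}
    for r in new_rules:
        prev = merged.get(key(r))
        if prev is None or r.get(metric, 0.0) > prev.get(metric, 0.0):
            merged[key(r)] = r
    return sorted(merged.values(), key=key)

def atomic_save(path, config):
    """Write to a temp file in the same directory, then rename into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(config, f, indent=2)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```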
Test Plan

- `--cmd tuning --save-tuning-config auto` with EP4/bf16 at multiple token counts (64, 128, 512, 4096); verify the JSON accumulates correctly with the keep-best merge.
- `--save-tuning-config auto` on a 2-node setup.

Test Result
Single-node tuning on MI308X (gfx942) EP4 bf16 completed successfully. Generated `gfx942_IntraNode_ep4.json` with 4 rules (64/128/512/4096 tokens). The JSON format validated, lookup returns the correct LaunchParams, and the keep-best merge preserves higher-bandwidth results across repeated runs.

Submission Checklist