feat(ep): add tuning config system for dispatch/combine #242

Open
isytwu wants to merge 12 commits into main from ep-tuning
Conversation


@isytwu isytwu commented Mar 31, 2026

Motivation

The current EP dispatch/combine kernel launch parameters (block_num, warp_per_block, rdma_block_num) are either hardcoded or manually configured via environment variables. Different GPU architectures, EP sizes, token counts, and hidden dimensions require different optimal parameters. This PR adds an automated tuning config system that stores benchmark-derived optimal parameters in JSON files and loads them at runtime.

Technical Details

  • New module python/mori/ops/tuning_config.py: unified dtype registry, JSON config loader with validation/caching, runtime lookup (exact match on dtype + hidden_dim, ceiling match on num_tokens), and atomic save with keep-best merge strategy.
  • JSON files split by {gpu_arch}_{kernel_type}_ep{ep_size}[_{quant}].json, containing separate dispatch_rules and combine_rules lists to support independent dtypes (e.g., dispatch fp4 + combine bf16).
  • dispatch_combine.py: AUTO mode loads the config at init; _resolve_launch_params is extended to accept dtype + tuning_rules, and the dispatch/combine/standard_moe methods pass the actual input.dtype for lookup. Falls back to the existing hardcoded defaults when no config matches.
  • Benchmark scripts (bench_dispatch_combine.py, test_dispatch_combine_internode.py): new --hidden-dim and --save-tuning-config CLI args. Tuning results auto-saved to source repo for distribution.
  • Batch tuning scripts: tools/batch_intranode_tuning.sh (single-node, sweeps tokens x hidden_dims) and tools/batch_internode_tuning.sh (multi-node via SSH).
  • Packaging: setup.py and MANIFEST.in updated to include JSON configs in wheel/sdist.
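
The matching rules described above (filename keyed by GPU arch/kernel type/EP size, exact match on dtype + hidden_dim, ceiling match on num_tokens, miss falling back to defaults) can be sketched roughly as follows. This is a minimal illustration, not the actual tuning_config.py code; the rule field names (dtype, hidden_dim, num_tokens, block_num, warp_per_block, rdma_block_num) and helper names are assumptions for the sketch.

```python
from dataclasses import dataclass

@dataclass
class LaunchParams:
    block_num: int
    warp_per_block: int
    rdma_block_num: int

def config_filename(gpu_arch, kernel_type, ep_size, quant=None):
    # Mirrors the {gpu_arch}_{kernel_type}_ep{ep_size}[_{quant}].json scheme,
    # e.g. "gfx942_IntraNode_ep4.json".
    suffix = f"_{quant}" if quant else ""
    return f"{gpu_arch}_{kernel_type}_ep{ep_size}{suffix}.json"

def lookup(rules, dtype, hidden_dim, num_tokens):
    """Exact match on dtype + hidden_dim; ceiling match on num_tokens
    (smallest tuned token count >= the requested one)."""
    candidates = [
        r for r in rules
        if r["dtype"] == dtype
        and r["hidden_dim"] == hidden_dim
        and r["num_tokens"] >= num_tokens
    ]
    if not candidates:
        return None  # caller falls back to hardcoded defaults
    best = min(candidates, key=lambda r: r["num_tokens"])
    return LaunchParams(best["block_num"], best["warp_per_block"],
                        best["rdma_block_num"])
```

A miss on any of the three keys returns None, so AUTO mode degrades gracefully to the legacy defaults described in the last bullet.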

Test Plan

  • Unit test: verify dtype registry, filename generation, save/load/merge, lookup matching (exact dtype + hidden_dim, ceiling num_tokens, miss → None)
  • End-to-end single-node: run --cmd tuning --save-tuning-config auto with EP4/bf16 at multiple token counts (64, 128, 512, 4096), verify JSON accumulates correctly with keep-best merge
  • End-to-end multi-node: run internode tuning with --save-tuning-config auto on 2-node setup
  • Backward compatibility: MANUAL mode unchanged, AUTO mode without JSON file falls back to legacy hardcoded values
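
The keep-best merge exercised by the end-to-end tests could look roughly like this: when a repeated run produces a rule for the same (dtype, hidden_dim, num_tokens) key, keep whichever entry measured higher bandwidth. A hypothetical sketch, not the PR's implementation; the bandwidth_gbps field name is an assumption.

```python
def merge_keep_best(existing, incoming):
    """Merge two lists of tuning rules, keeping the higher-bandwidth
    entry when the same (dtype, hidden_dim, num_tokens) key appears
    in both; new keys are simply appended."""
    key = lambda r: (r["dtype"], r["hidden_dim"], r["num_tokens"])
    merged = {key(r): r for r in existing}
    for r in incoming:
        k = key(r)
        # keep-best: only overwrite if the new measurement is faster
        if k not in merged or r["bandwidth_gbps"] > merged[k]["bandwidth_gbps"]:
            merged[k] = r
    return sorted(merged.values(), key=key)
```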

Test Result

Single-node tuning on MI308X (gfx942) EP4 bf16 completed successfully. Generated gfx942_IntraNode_ep4.json with 4 rules (64/128/512/4096 tokens). JSON format validated, lookup returns correct LaunchParams, keep-best merge preserves higher-bandwidth results across repeated runs.
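
For illustration, a file like the generated gfx942_IntraNode_ep4.json would plausibly have the shape below, with separate dispatch_rules and combine_rules lists as described in Technical Details. The field names and values here are assumptions, not the actual generated output.

```json
{
  "dispatch_rules": [
    {"dtype": "bf16", "hidden_dim": 4096, "num_tokens": 64,
     "block_num": 32, "warp_per_block": 8, "bandwidth_gbps": 123.4}
  ],
  "combine_rules": [
    {"dtype": "bf16", "hidden_dim": 4096, "num_tokens": 64,
     "block_num": 32, "warp_per_block": 8, "bandwidth_gbps": 118.7}
  ]
}
```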

Submission Checklist

@isytwu isytwu self-assigned this Mar 31, 2026
@isytwu isytwu force-pushed the ep-tuning branch 2 times, most recently from b1dde07 to ec27092 (April 1, 2026 05:24)
isytwu added 9 commits April 1, 2026 17:16
Add JSON-based config files to store tuned block_num/warp_per_block
for dispatch and combine kernels. Benchmark scripts can auto-save
tuning results; runtime loads config based on GPU arch, kernel type,
EP size, and matches by dtype + hidden_dim + num_tokens.

New files: tuning_config.py, batch tuning scripts, sample JSON.
