Add test coverage for Muon muon_lr/adam_lr overrides (#8047)

sowndappan5 · web-flow · commit 3dc98deb96d9 · 2026-06-06T02:18:34.000Z
## Summary

Add coverage for separate learning rate overrides in the Muon optimizer
path and fix the related Muon blog documentation.

## Background

Muon parameters and non-Muon parameters are automatically split into
separate optimizer groups. The intended behavior is:
- `muon_lr` applies to the Muon parameter groups
- `adam_lr` applies to the non-Muon (Adam) parameter groups
- `lr` remains the fallback for either group when its override is not
provided
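
The fallback rule above can be sketched in plain Python. This is only an illustration of the intended behavior, not DeepSpeed's implementation; `resolve_group_lrs` is a hypothetical helper name introduced here, and the values mirror the ones used in the new test.

```python
def resolve_group_lrs(optimizer_params):
    """Hypothetical helper: return (muon_lr, adam_lr) under the fallback rule.

    Each group uses its own override when present and falls back to `lr`
    otherwise. This is an assumption-based sketch, not DeepSpeed code.
    """
    base_lr = optimizer_params.get("lr")
    muon_lr = optimizer_params.get("muon_lr", base_lr)
    adam_lr = optimizer_params.get("adam_lr", base_lr)
    return muon_lr, adam_lr


# Legacy config: both groups inherit `lr`.
legacy = resolve_group_lrs({"lr": 0.02, "weight_decay": 0.01})
print(legacy)  # (0.02, 0.02)

# Override config: each group picks up its own learning rate.
overridden = resolve_group_lrs({
    "lr": 0.02,
    "muon_lr": 0.04,
    "adam_lr": 0.001,
    "weight_decay": 0.01,
})
print(overridden)  # (0.04, 0.001)
```

The two calls correspond to the two parameter sets the test exercises: the legacy `lr`-only case and the case with both overrides set.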

## Changes

- add a parameterized test covering:
  - legacy `lr` fallback behavior
  - separate `muon_lr` / `adam_lr` override behavior
- fix the Muon blog table header to label `muon_lr` and `adam_lr`
correctly

## Validation

Ran:
`python -m pytest DeepSpeed/tests/unit/ops/muon/test_muon_partial_training.py -k learning_rate_overrides -q -rs`

Result:
- the test was collected successfully
- it was skipped locally because this distributed test requires 2 GPUs,
while the local environment has only 1 GPU

---------

Signed-off-by: Sowndappan S <147894621+sowndappan5@users.noreply.github.com>
diff --git a/tests/unit/ops/muon/test_muon_partial_training.py b/tests/unit/ops/muon/test_muon_partial_training.py
@@ -22,6 +22,7 @@
 
 import torch.nn as nn
 import deepspeed
+import pytest
 from unit.common import DistributedTest
 
 
@@ -173,3 +174,43 @@ def test_muon_with_mixed_trainable_params(self):
 
         # Verify the model was initialized successfully
         assert model_engine is not None
+
+    @pytest.mark.parametrize(
+        "optimizer_params, expected_muon_lr, expected_adam_lr",
+        [
+            ({
+                "lr": 0.02,
+                "weight_decay": 0.01
+            }, 0.02, 0.02),
+            ({
+                "lr": 0.02,
+                "muon_lr": 0.04,
+                "adam_lr": 0.001,
+                "weight_decay": 0.01
+            }, 0.04, 0.001),
+        ],
+    )
+    def test_muon_adam_learning_rate_overrides(self, optimizer_params, expected_muon_lr, expected_adam_lr):
+        model = PartialTrainableModel()
+
+        ds_config = {
+            "train_micro_batch_size_per_gpu": 1,
+            "optimizer": {
+                "type": "Muon",
+                "params": optimizer_params
+            },
+            "zero_optimization": {
+                "stage": 2
+            },
+        }
+
+        model_engine, _, _, _ = deepspeed.initialize(model=model,
+                                                     model_parameters=model.parameters(),
+                                                     config=ds_config)
+
+        group_lrs = {
+            param_group["use_muon"]: param_group["lr"]
+            for param_group in model_engine.basic_optimizer.param_groups
+        }
+        assert group_lrs[True] == expected_muon_lr
+        assert group_lrs[False] == expected_adam_lr