Status: Open
Labels: bug, community-request, module: moe, needs-follow-up
Description
Problem
When training MoE models (e.g., Qwen3.5-35B-A3B) with optimizer_cpu_offload=True, HybridDeviceOptimizer raises a KeyError at the first optimizer step.
Error
```
File "hybrid_optimizer.py", line 160, in step
  self._sync_hdo_param_groups_to_sub_optimizers()
File "hybrid_optimizer.py", line 339, in _sync_hdo_param_groups_to_sub_optimizers
  inner_param = self.param_to_inner_param[param]
KeyError: tensor([...], device='cuda:6')
```
Cause
param_to_inner_param is built once in _init_sub_optimizers, keyed by tensor object identity (torch.Tensor hashes by object id). After initialization, an outer optimizer wrapper (e.g., MixedPrecisionOptimizer) may replace the tensor objects in self.param_groups with new, value-equal tensors, so lookups against the now-stale mapping fail with a KeyError.
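The failure mode can be reproduced in isolation. This sketch uses a plain class as a stand-in for torch.nn.Parameter (torch tensors likewise hash by object identity, not by value); the names mirror the issue but are not Megatron-LM's actual classes:

```python
class FakeParam:
    """Stand-in for a torch tensor: hashes by object identity,
    as torch.Tensor does."""
    def __init__(self, data):
        self.data = data

params = [FakeParam([0.0]), FakeParam([1.0])]

# Mapping built once at init time, keyed by object identity
# (analogous to param_to_inner_param in _init_sub_optimizers).
param_to_inner_param = {p: FakeParam(list(p.data)) for p in params}

# An outer wrapper later swaps in new objects holding equal data.
replaced = [FakeParam(list(p.data)) for p in params]

assert params[0] in param_to_inner_param        # original object: found
assert replaced[0] not in param_to_inner_param  # replacement: lookup fails
```

Once the replacement objects land in self.param_groups, the next `param_to_inner_param[param]` lookup raises the KeyError above.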
Config
- Model: Qwen3.5-35B-A3B (MoE, 64 experts)
- TP=2, PP=4, EP=2, bf16
- optimizer_cpu_offload=True, optimizer_offload_fraction=1.0
Suggested Fix
Key the mapping by a stable identifier (e.g., the parameter's position within its param group) instead of tensor identity, or rebuild the mapping at the start of each step().
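A minimal sketch of the rebuild-before-step option, assuming parameter order within param_groups is stable even when tensor objects are swapped; the class and method names are illustrative, not the actual Megatron-LM API:

```python
class HybridDeviceOptimizerSketch:
    """Illustrative only: refreshes the param -> inner-param mapping
    from positional order before every step, so tensor replacement by
    an outer wrapper cannot leave the mapping stale."""

    def __init__(self, param_groups, inner_params):
        self.param_groups = param_groups  # may be mutated by outer wrappers
        self.inner_params = inner_params  # flat list; order assumed stable

    def _rebuild_mapping(self):
        # Walk param_groups in order and pair each (possibly replaced)
        # param object with its inner counterpart by position.
        self.param_to_inner_param = {}
        flat_idx = 0
        for group in self.param_groups:
            for param in group["params"]:
                self.param_to_inner_param[id(param)] = self.inner_params[flat_idx]
                flat_idx += 1

    def step(self):
        self._rebuild_mapping()  # refresh before any lookup
        for group in self.param_groups:
            for param in group["params"]:
                inner = self.param_to_inner_param[id(param)]
                # ... sync hyperparameters / gradients to `inner` here ...
```

Rebuilding on every step costs one pass over the parameters but is robust to any wrapper that replaces tensor objects; keying by position at init time would achieve the same without the per-step pass, provided no wrapper reorders the groups.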