[Bug] HybridDeviceOptimizer KeyError with MoE models when using CPU offload #4042

@khazic

Description

Problem

When training MoE models (e.g., Qwen3.5-35B-A3B) with optimizer_cpu_offload=True, HybridDeviceOptimizer raises a KeyError at the first optimizer step.

Error

File "hybrid_optimizer.py", line 160, in step
    self._sync_hdo_param_groups_to_sub_optimizers()
File "hybrid_optimizer.py", line 339, in _sync_hdo_param_groups_to_sub_optimizers
    inner_param = self.param_to_inner_param[param]
KeyError: tensor([...], device='cuda:6')

Cause

param_to_inner_param is built once in _init_sub_optimizers, using tensor object identity as dict keys. After initialization, the outer optimizer wrapper (e.g., MixedPrecisionOptimizer) may replace the tensor objects in self.param_groups; lookups with the new objects then fail because the mapping still keys on the old identities.
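A minimal sketch of the failure mode, using a plain stand-in class instead of torch.nn.Parameter (torch tensors also hash by object identity, so the behavior is the same; all names here are illustrative, not the actual Megatron-LM code):

```python
class Param:
    """Stand-in for torch.nn.Parameter; default hashing is by object identity."""
    def __init__(self, data):
        self.data = data

params = [Param([0.0]), Param([1.0])]
inner = [Param(list(p.data)) for p in params]  # hypothetical CPU-side copies

# Mapping built once at init, keyed by tensor object identity,
# as in _init_sub_optimizers.
param_to_inner_param = {p: ip for p, ip in zip(params, inner)}

# An outer wrapper later swaps in a new, equal-valued object
# (same data, different identity) when it rebuilds param_groups.
params[0] = Param(params[0].data)

try:
    param_to_inner_param[params[0]]
except KeyError:
    print("KeyError: replaced object is no longer a dict key")
```

The lookup with the replaced object fails even though its values are unchanged, which is exactly the KeyError in the traceback above.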

Config

  • Model: Qwen3.5-35B-A3B (MoE, 64 experts)
  • TP=2, PP=4, EP=2, bf16
  • optimizer_cpu_offload=True, optimizer_offload_fraction=1.0

Suggested Fix

Key the mapping by a stable identifier (e.g., a (group index, param index) pair) instead of tensor identity, or rebuild the mapping at the start of each step().
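A sketch of the index-keyed alternative; the function names and the dict-of-"params" group layout follow the standard torch.optim param_groups shape, but are assumptions, not the actual HybridDeviceOptimizer code:

```python
def build_index_mapping(inner_param_groups):
    # Map (group_idx, param_idx) -> inner param. Positional keys stay
    # valid even after an outer wrapper swaps the tensor objects held
    # in the outer optimizer's param_groups.
    return {
        (gi, pi): ip
        for gi, group in enumerate(inner_param_groups)
        for pi, ip in enumerate(group["params"])
    }

def lookup_inner_param(mapping, group_idx, param_idx):
    # Called from step() instead of indexing by tensor identity.
    return mapping[(group_idx, param_idx)]

# Usage with placeholder strings standing in for parameter tensors:
inner_groups = [{"params": ["cpu_w0", "cpu_w1"]}, {"params": ["cpu_w2"]}]
mapping = build_index_mapping(inner_groups)
```

This relies on the outer and inner optimizers keeping their param_groups in the same order, which holds as long as the wrapper replaces tensors in place rather than reordering groups.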
