[Bug] HybridDeviceOptimizer KeyError with MoE models when using CPU offload #4042

@khazic

Description

Problem

When training MoE models (e.g., Qwen3.5-35B-A3B) with optimizer_cpu_offload=True, HybridDeviceOptimizer raises a KeyError at the first optimizer step.

Error

File "hybrid_optimizer.py", line 160, in step
    self._sync_hdo_param_groups_to_sub_optimizers()
File "hybrid_optimizer.py", line 339, in _sync_hdo_param_groups_to_sub_optimizers
    inner_param = self.param_to_inner_param[param]
KeyError: tensor([...], device='cuda:6')

Cause

param_to_inner_param is built once in _init_sub_optimizers, using tensor object identity as dict keys. After initialization, the outer optimizer wrapper (e.g., MixedPrecisionOptimizer) may replace the tensor objects in self.param_groups; lookups with the new objects then fail because the mapping still keys on the old identities.
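A minimal sketch of the failure mode, using a plain stand-in class instead of torch.nn.Parameter (torch tensors also hash by object identity, so the behavior is the same; all names here are illustrative, not the actual Megatron-LM code):

```python
class Param:
    """Stand-in for torch.nn.Parameter; default hashing is by object identity."""
    def __init__(self, data):
        self.data = data

params = [Param([0.0]), Param([1.0])]
inner = [Param(list(p.data)) for p in params]  # hypothetical CPU-side copies

# Mapping built once at init, keyed by tensor object identity,
# as in _init_sub_optimizers.
param_to_inner_param = {p: ip for p, ip in zip(params, inner)}

# An outer wrapper later swaps in a new, equal-valued object
# (same data, different identity) when it rebuilds param_groups.
params[0] = Param(params[0].data)

try:
    param_to_inner_param[params[0]]
except KeyError:
    print("KeyError: replaced object is no longer a dict key")
```

The lookup with the replaced object fails even though its values are unchanged, which is exactly the KeyError in the traceback above.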

Config

  • Model: Qwen3.5-35B-A3B (MoE, 64 experts)
  • TP=2, PP=4, EP=2, bf16
  • optimizer_cpu_offload=True, optimizer_offload_fraction=1.0

Suggested Fix

Key the mapping by a stable identifier (e.g., a (group index, param index) pair) instead of tensor identity, or rebuild the mapping at the start of each step().
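A sketch of the index-keyed alternative; the function names and the dict-of-"params" group layout follow the standard torch.optim param_groups shape, but are assumptions, not the actual HybridDeviceOptimizer code:

```python
def build_index_mapping(inner_param_groups):
    # Map (group_idx, param_idx) -> inner param. Positional keys stay
    # valid even after an outer wrapper swaps the tensor objects held
    # in the outer optimizer's param_groups.
    return {
        (gi, pi): ip
        for gi, group in enumerate(inner_param_groups)
        for pi, ip in enumerate(group["params"])
    }

def lookup_inner_param(mapping, group_idx, param_idx):
    # Called from step() instead of indexing by tensor identity.
    return mapping[(group_idx, param_idx)]

# Usage with placeholder strings standing in for parameter tensors:
inner_groups = [{"params": ["cpu_w0", "cpu_w1"]}, {"params": ["cpu_w2"]}]
mapping = build_index_mapping(inner_groups)
```

This relies on the outer and inner optimizers keeping their param_groups in the same order, which holds as long as the wrapper replaces tensors in place rather than reordering groups.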
