Skip to content

Fix silent MoE corruption in the legacy multi-chunk weight sync#1737

Open
jamesbraza wants to merge 4 commits into
NovaSky-AI:mainfrom
EdisonScientific:fix/vllm-worker-moe-reload-weights
Open

Fix silent MoE corruption in the legacy multi-chunk weight sync#1737
jamesbraza wants to merge 4 commits into
NovaSky-AI:mainfrom
EdisonScientific:fix/vllm-worker-moe-reload-weights

Conversation

@jamesbraza
Copy link
Copy Markdown
Contributor

@jamesbraza jamesbraza commented Jun 1, 2026

Builds on #1685, which switched the legacy WorkerWrap.load_weights path to reload_weights. With that, load_weights ran a self-contained reload_weights per chunk, so vLLM's per-call finalize_layerwise_processing restored every layer absent from that chunk — silently corrupting any multi-chunk weight sync into MoE gibberish after the first sync ( #1680; upstream vllm-project/vllm#42821).

This PR brackets the whole sync with a single layerwise-reload initialize/finalize via a shared LayerwiseReloadWorkerMixin, sharing this lifecycle with the 'new' inference path (new_inference_worker_wrap.py).

Closes #1680.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a shared vLLM layerwise-reload lifecycle (LayerwiseReloadWorkerMixin) to bracket multi-chunk weight syncs, preventing corruption on MoE models. The review feedback is highly constructive, pointing out critical deadlock risks in the broadcast and CUDA IPC strategies if rank 0 fails, importability issues on non-vLLM environments due to a top-level import in vllm_worker.py, and a potential worker-bricking state in layerwise_reload.py if an update is aborted. All comments provide actionable code suggestions and should be addressed.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread skyrl/backends/skyrl_train/weight_sync/broadcast_strategy.py
Comment thread skyrl/backends/skyrl_train/weight_sync/broadcast_strategy.py
Comment thread skyrl/backends/skyrl_train/weight_sync/cuda_ipc_strategy.py
Comment thread skyrl/backends/skyrl_train/weight_sync/cuda_ipc_strategy.py
Comment thread skyrl/backends/skyrl_train/inference_servers/vllm_worker.py Outdated
Comment thread skyrl/backends/skyrl_train/inference_servers/vllm_worker.py
Comment thread skyrl/backends/skyrl_train/inference_servers/layerwise_reload.py
@erictang000
Copy link
Copy Markdown
Collaborator

hey @jamesbraza, thanks for looking into this the approach/code all make sense to me, can you fix merge conflicts + address or resolve gemini comments?

jamesbraza and others added 4 commits June 1, 2026 17:42
Guards the legacy `WorkerWrap` weight-reload path against the unquantized-MoE
corruption on FlashInfer-CUTLASS / FlashInfer-TRTLLM backends since vllm 0.20
(vllm-project/vllm#42821): `process_weights_after_loading` swaps
`[w1;w3] -> [w3;w1]` in-place on `w13_weight` at load, so a subsequent reload
that does not invert that swap silently corrupts the model and collapses
greedy decode into multi-language subword soup.

The test (`test_vllm_worker_moe_reload.py`) drives the legacy `WorkerWrap`
path end-to-end on Qwen1.5-MoE-A2.7B-Chat (smallest MoE with the `w13_weight`
layout): it uses `AsyncLLMEngine` directly with a `WorkerWrap` subclass that
self-injects as its own weight receiver and iterates safetensors from disk --
the bug is downstream of `receive_weights`, so the real NCCL/IPC
`WeightTransferReceiver` is unnecessary and a 2-GPU trainer-side sender setup
is avoided.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ight_sync.py

Move the legacy `WorkerWrap.load_weights` MoE reload regression added in
the prior commit out of its standalone `test_vllm_worker_moe_reload.py`
and into `test_weight_sync.py`, alongside the existing NCCL-broadcast and
CUDA-IPC weight-sync scenarios.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…change)

`NewInferenceWorkerWrap` carried the vLLM layerwise-reload lifecycle
(`start_weight_update`/`finish_weight_update`, which initialize/finalize the
reload) and the `conv_weights` SKIP_TENSORS guard (Mamba/Gated-DeltaNet conv1d).
Move both verbatim into a shared `LayerwiseReloadWorkerMixin` that
`NewInferenceWorkerWrap` now inherits.

Pure refactor: the new inference path behaves identically. The mixin exists so
the legacy `WorkerWrap` path can reuse the same lifecycle next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cycle

Per-chunk `reload_weights` runs initialize+load+finalize on every chunk, so
vLLM's `finalize_layerwise_processing` restores each layer absent from the
current chunk to its pre-call snapshot (`load_numel <= 0` ->
`_place_kernel_tensors`), corrupting multi-chunk syncs and emitting one
"Failed to load weights" warning per such layer per chunk. Production syncs are
multi-chunk: one `WorkerWrap.load_weights` per `WeightChunk` (issue NovaSky-AI#1680).

Port the new path's three-phase lifecycle to the legacy path via the shared
`LayerwiseReloadWorkerMixin`: `WorkerWrap` inherits
`start_weight_update`/`finish_weight_update`, each chunk loads raw between them,
and both `_send_chunks_legacy` senders bracket the chunk loop. `load_weights`
falls back to a self-contained `reload_weights` when unbracketed, so single-call
callers stay correct.

Adds a multi-chunk real-sender regression test that drives the production
`CudaIpcWeightTransferSender._send_chunks_legacy` and asserts zero
"Failed to load weights" warnings across the sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@erictang000
Copy link
Copy Markdown
Collaborator

hmm actually @jamesbraza running with just the previous #1685 fix seems to be working fine with old inference for https://github.com/NovaSky-AI/SkyRL/blob/main/examples/train/megatron/run_megatron_grpo_glm4_7_30b.sh#L4, which definitely has more than 1 chunk.

this still seems valuable to avoid iterating model.modules() on every chunk, but i'm wondering what conditions we would expect to actually produce garbled outputs prior to the PR

@jamesbraza
Copy link
Copy Markdown
Contributor Author

Thanks for the reviews, appreciated.

hey @jamesbraza, thanks for looking into this the approach/code all make sense to me, can you fix merge conflicts + address or resolve gemini comments?

Yep just did this now.

hmm actually @jamesbraza running with just the previous #1685 fix seems to be working fine with old inference for main/examples/train/megatron/run_megatron_grpo_glm4_7_30b.sh#L4, which definitely has more than 1 chunk.

Yeah thanks for sharing that, and I see it has the legacy path _SKYRL_USE_NEW_INFERENCE=0.

Basically, severity of the issue we're fixing here scales with chunks. Since your Megatron + GLM 4.7 example opts into colocate_all, there's not many chunks. It's actually still being corrupted; if you did many syncs, the issue would compound.

I posted this comment: #1680 (comment), and I show GLM 4.7 corruption in it. Let me know what you're thinking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WorkerWrap.load_weights silently corrupts unquantized MoE rollouts after the first weight sync since vllm>=0.20.0

2 participants