Fix silent MoE corruption in the legacy multi-chunk weight sync by jamesbraza · Pull Request #1737 · NovaSky-AI/SkyRL

jamesbraza · 2026-06-01T23:49:02Z

Builds on #1685, which switched the legacy WorkerWrap.load_weights path to reload_weights. With that, load_weights ran a self-contained reload_weights per chunk, so vLLM's per-call finalize_layerwise_processing restored every layer absent from that chunk — silently corrupting any multi-chunk weight sync into MoE gibberish after the first sync ( #1680; upstream vllm-project/vllm#42821).

This PR brackets the whole sync with a single layerwise-reload initialize/finalize via a shared LayerwiseReloadWorkerMixin, sharing this lifecycle with the 'new' inference path (new_inference_worker_wrap.py).

Closes #1680.

gemini-code-assist

Code Review

This pull request introduces a shared vLLM layerwise-reload lifecycle (LayerwiseReloadWorkerMixin) to bracket multi-chunk weight syncs, preventing corruption on MoE models. The review feedback is highly constructive, pointing out critical deadlock risks in the broadcast and CUDA IPC strategies if rank 0 fails, importability issues on non-vLLM environments due to a top-level import in vllm_worker.py, and a potential worker-bricking state in layerwise_reload.py if an update is aborted. All comments provide actionable code suggestions and should be addressed.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

erictang000 · 2026-06-02T00:34:06Z

hey @jamesbraza, thanks for looking into this the approach/code all make sense to me, can you fix merge conflicts + address or resolve gemini comments?

Guards the legacy `WorkerWrap` weight-reload path against the unquantized-MoE corruption on FlashInfer-CUTLASS / FlashInfer-TRTLLM backends since vllm 0.20 (vllm-project/vllm#42821): `process_weights_after_loading` swaps `[w1;w3] -> [w3;w1]` in-place on `w13_weight` at load, so a subsequent reload that does not invert that swap silently corrupts the model and collapses greedy decode into multi-language subword soup. The test (`test_vllm_worker_moe_reload.py`) drives the legacy `WorkerWrap` path end-to-end on Qwen1.5-MoE-A2.7B-Chat (smallest MoE with the `w13_weight` layout): it uses `AsyncLLMEngine` directly with a `WorkerWrap` subclass that self-injects as its own weight receiver and iterates safetensors from disk -- the bug is downstream of `receive_weights`, so the real NCCL/IPC `WeightTransferReceiver` is unnecessary and a 2-GPU trainer-side sender setup is avoided. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ight_sync.py Move the legacy `WorkerWrap.load_weights` MoE reload regression added in the prior commit out of its standalone `test_vllm_worker_moe_reload.py` and into `test_weight_sync.py`, alongside the existing NCCL-broadcast and CUDA-IPC weight-sync scenarios. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…change) `NewInferenceWorkerWrap` carried the vLLM layerwise-reload lifecycle (`start_weight_update`/`finish_weight_update`, which initialize/finalize the reload) and the `conv_weights` SKIP_TENSORS guard (Mamba/Gated-DeltaNet conv1d). Move both verbatim into a shared `LayerwiseReloadWorkerMixin` that `NewInferenceWorkerWrap` now inherits. Pure refactor: the new inference path behaves identically. The mixin exists so the legacy `WorkerWrap` path can reuse the same lifecycle next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…cycle Per-chunk `reload_weights` runs initialize+load+finalize on every chunk, so vLLM's `finalize_layerwise_processing` restores each layer absent from the current chunk to its pre-call snapshot (`load_numel <= 0` -> `_place_kernel_tensors`), corrupting multi-chunk syncs and emitting one "Failed to load weights" warning per such layer per chunk. Production syncs are multi-chunk: one `WorkerWrap.load_weights` per `WeightChunk` (issue NovaSky-AI#1680). Port the new path's three-phase lifecycle to the legacy path via the shared `LayerwiseReloadWorkerMixin`: `WorkerWrap` inherits `start_weight_update`/`finish_weight_update`, each chunk loads raw between them, and both `_send_chunks_legacy` senders bracket the chunk loop. `load_weights` falls back to a self-contained `reload_weights` when unbracketed, so single-call callers stay correct. Adds a multi-chunk real-sender regression test that drives the production `CudaIpcWeightTransferSender._send_chunks_legacy` and asserts zero "Failed to load weights" warnings across the sync. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

erictang000 · 2026-06-02T01:16:55Z

hmm actually @jamesbraza running with just the previous #1685 fix seems to be working fine with old inference for https://github.com/NovaSky-AI/SkyRL/blob/main/examples/train/megatron/run_megatron_grpo_glm4_7_30b.sh#L4, which definitely has more than 1 chunk.

this still seems valuable to avoid iterating model.modules() on every chunk, but i'm wondering what conditions we would expect to actually produce garbled outputs prior to the PR

jamesbraza · 2026-06-02T18:43:11Z

Thanks for the reviews, appreciated.

hey @jamesbraza, thanks for looking into this the approach/code all make sense to me, can you fix merge conflicts + address or resolve gemini comments?

Yep just did this now.

hmm actually @jamesbraza running with just the previous #1685 fix seems to be working fine with old inference for main/examples/train/megatron/run_megatron_grpo_glm4_7_30b.sh#L4, which definitely has more than 1 chunk.

Yeah thanks for sharing that, and I see it has the legacy path _SKYRL_USE_NEW_INFERENCE=0.

Basically, severity of the issue we're fixing here scales with chunks. Since your Megatron + GLM 4.7 example opts into colocate_all, there's not many chunks. It's actually still being corrupted; if you did many syncs, the issue would compound.

I posted this comment: #1680 (comment), and I show GLM 4.7 corruption in it. Let me know what you're thinking.

gemini-code-assist Bot reviewed Jun 1, 2026

View reviewed changes

jamesbraza and others added 4 commits June 1, 2026 17:42

jamesbraza force-pushed the fix/vllm-worker-moe-reload-weights branch from f925ac6 to 2686488 Compare June 2, 2026 02:04

jamesbraza mentioned this pull request Jun 2, 2026

WorkerWrap.load_weights silently corrupts unquantized MoE rollouts after the first weight sync since vllm>=0.20.0 #1680

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix silent MoE corruption in the legacy multi-chunk weight sync#1737

Fix silent MoE corruption in the legacy multi-chunk weight sync#1737
jamesbraza wants to merge 4 commits into
NovaSky-AI:mainfrom
EdisonScientific:fix/vllm-worker-moe-reload-weights

jamesbraza commented Jun 1, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

erictang000 commented Jun 2, 2026

Uh oh!

erictang000 commented Jun 2, 2026

Uh oh!

jamesbraza commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jamesbraza commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

erictang000 commented Jun 2, 2026

Uh oh!

erictang000 commented Jun 2, 2026

Uh oh!

jamesbraza commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jamesbraza commented Jun 1, 2026 •

edited

Loading