Fix silent MoE corruption in the legacy multi-chunk weight sync#1737
Fix silent MoE corruption in the legacy multi-chunk weight sync#1737jamesbraza wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a shared vLLM layerwise-reload lifecycle (LayerwiseReloadWorkerMixin) to bracket multi-chunk weight syncs, preventing corruption on MoE models. The review feedback is highly constructive, pointing out critical deadlock risks in the broadcast and CUDA IPC strategies if rank 0 fails, importability issues on non-vLLM environments due to a top-level import in vllm_worker.py, and a potential worker-bricking state in layerwise_reload.py if an update is aborted. All comments provide actionable code suggestions and should be addressed.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
hey @jamesbraza, thanks for looking into this the approach/code all make sense to me, can you fix merge conflicts + address or resolve gemini comments? |
Guards the legacy `WorkerWrap` weight-reload path against the unquantized-MoE corruption on FlashInfer-CUTLASS / FlashInfer-TRTLLM backends since vllm 0.20 (vllm-project/vllm#42821): `process_weights_after_loading` swaps `[w1;w3] -> [w3;w1]` in-place on `w13_weight` at load, so a subsequent reload that does not invert that swap silently corrupts the model and collapses greedy decode into multi-language subword soup. The test (`test_vllm_worker_moe_reload.py`) drives the legacy `WorkerWrap` path end-to-end on Qwen1.5-MoE-A2.7B-Chat (smallest MoE with the `w13_weight` layout): it uses `AsyncLLMEngine` directly with a `WorkerWrap` subclass that self-injects as its own weight receiver and iterates safetensors from disk -- the bug is downstream of `receive_weights`, so the real NCCL/IPC `WeightTransferReceiver` is unnecessary and a 2-GPU trainer-side sender setup is avoided. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ight_sync.py Move the legacy `WorkerWrap.load_weights` MoE reload regression added in the prior commit out of its standalone `test_vllm_worker_moe_reload.py` and into `test_weight_sync.py`, alongside the existing NCCL-broadcast and CUDA-IPC weight-sync scenarios. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…change) `NewInferenceWorkerWrap` carried the vLLM layerwise-reload lifecycle (`start_weight_update`/`finish_weight_update`, which initialize/finalize the reload) and the `conv_weights` SKIP_TENSORS guard (Mamba/Gated-DeltaNet conv1d). Move both verbatim into a shared `LayerwiseReloadWorkerMixin` that `NewInferenceWorkerWrap` now inherits. Pure refactor: the new inference path behaves identically. The mixin exists so the legacy `WorkerWrap` path can reuse the same lifecycle next. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…cycle Per-chunk `reload_weights` runs initialize+load+finalize on every chunk, so vLLM's `finalize_layerwise_processing` restores each layer absent from the current chunk to its pre-call snapshot (`load_numel <= 0` -> `_place_kernel_tensors`), corrupting multi-chunk syncs and emitting one "Failed to load weights" warning per such layer per chunk. Production syncs are multi-chunk: one `WorkerWrap.load_weights` per `WeightChunk` (issue NovaSky-AI#1680). Port the new path's three-phase lifecycle to the legacy path via the shared `LayerwiseReloadWorkerMixin`: `WorkerWrap` inherits `start_weight_update`/`finish_weight_update`, each chunk loads raw between them, and both `_send_chunks_legacy` senders bracket the chunk loop. `load_weights` falls back to a self-contained `reload_weights` when unbracketed, so single-call callers stay correct. Adds a multi-chunk real-sender regression test that drives the production `CudaIpcWeightTransferSender._send_chunks_legacy` and asserts zero "Failed to load weights" warnings across the sync. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
hmm actually @jamesbraza running with just the previous #1685 fix seems to be working fine with old inference for https://github.com/NovaSky-AI/SkyRL/blob/main/examples/train/megatron/run_megatron_grpo_glm4_7_30b.sh#L4, which definitely has more than 1 chunk. this still seems valuable to avoid iterating |
f925ac6 to
2686488
Compare
|
Thanks for the reviews, appreciated.
Yep just did this now.
Yeah thanks for sharing that, and I see it has the legacy path Basically, severity of the issue we're fixing here scales with chunks. Since your Megatron + GLM 4.7 example opts into I posted this comment: #1680 (comment), and I show GLM 4.7 corruption in it. Let me know what you're thinking. |
Builds on #1685, which switched the legacy
WorkerWrap.load_weightspath toreload_weights. With that,load_weightsran a self-containedreload_weightsper chunk, so vLLM's per-callfinalize_layerwise_processingrestored every layer absent from that chunk — silently corrupting any multi-chunk weight sync into MoE gibberish after the first sync ( #1680; upstream vllm-project/vllm#42821).This PR brackets the whole sync with a single layerwise-reload initialize/finalize via a shared
LayerwiseReloadWorkerMixin, sharing this lifecycle with the 'new' inference path (new_inference_worker_wrap.py).Closes #1680.