chore: bump vLLM/vLLM-Omni to 0.22.0 and adapt the worker stack #70
Merged
timzsu
requested changes
Jun 12, 2026
38bb1e9 to
a76abbd
vllm and vllm-omni 0.22 require transformers >=5, which cascades through the GPU/inference stack: transformers 5.8.1, peft 0.19.1 (0.17 imports the removed HybridCache and unregisters every training executor), diffusers 0.38, pillow 12.2, torch 2.11, safetensors 0.8, and fastembed 0.8 (lifts the pillow<12 cap). Re-lock and regenerate the worker requirements.

vLLM's PyPI wheel for 0.22 is built for CUDA 13; the GPU worker runs CUDA 12.9. Pin the +cu129 release wheel for linux/x86_64 via [tool.uv.sources] so it matches torch and flashinfer, with a PyPI fallback for other platforms.

The bump clears most ignored pip-audit advisories (vllm, gradio, pillow, diffusers, transformers, starlette no longer fire at the new versions); prune them from security.yml and CODE_STYLE.md, leaving torch, lxml, and diskcache. The cu129 wheel is not on PyPI, so pip-audit skips it like flashinfer-jit-cache.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
… 0.38

transformers 5, trl 0.23, and diffusers 0.38 changed APIs the worker executors relied on: trl's PPOConfig dropped save_safetensors, diffusers made encode_prompt's do_classifier_free_guidance required for SD1.x/2.x/XL, transformers types tokenizer.decode() as str | list[str], and the bf16 GPU probe can now raise when CUDA can't initialize. Adapt the executors accordingly and cover the new fp16 fallback.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
transformers 5 eagerly initializes the CUDA device when a TrainingArguments-derived config is constructed, so config-mapping unit tests crash on a host whose driver can't init the installed torch build. Default CUDA_VISIBLE_DEVICES to empty; set it explicitly to run GPU tests.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
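One way to express "default to empty, but let an explicit value win" is a setdefault on the environment mapping. The sketch below is an assumption about the shape of the change, not the contents of tests/conftest.py in this PR; the function name default_to_cpu_only is hypothetical.

```python
import os
from collections.abc import MutableMapping


def default_to_cpu_only(env: MutableMapping[str, str] = os.environ) -> str:
    """Hide every GPU unless the caller already chose a device set."""
    # setdefault leaves an explicit CUDA_VISIBLE_DEVICES (e.g. "0") untouched.
    return env.setdefault("CUDA_VISIBLE_DEVICES", "")


# Usage with plain dicts, so the example does not touch the real environment.
unset_env: dict[str, str] = {}
gpu_env = {"CUDA_VISIBLE_DEVICES": "0"}
default_to_cpu_only(unset_env)  # unset_env now hides all GPUs
default_to_cpu_only(gpu_env)    # gpu_env keeps its explicit choice
```

For the default to take effect, it has to be applied before anything in the test process initializes CUDA, which is why a top-level statement in conftest.py is a natural place for it.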
transformers 5 types tokenizer.decode() as str | list[str]. The single-sequence calls always return str, so assert the type to verify the invariant at runtime instead of casting it away unchecked.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
is_bf16_supported() only raises when CUDA can't initialize, in which case the device is unusable and fp16 buys nothing: the 4-bit load fails moments later regardless. Catching it masked a fatal misconfiguration that a GPU worker should surface so the task retries on a healthy worker. The genuine no-bf16 case returns False and already falls through to fp16.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
vllm 0.22.0 hard-pins flashinfer-python==0.6.11.post2, so a >= floor there is misleading; match it exactly, as the group already does for vllm and vllm-omni. Bump the deepspeed floor to 0.19.1 (the latest, resolves cleanly).

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
a76abbd to
070d1e5
Purpose
Bumps vllm/vllm-omni from 0.18.0 to 0.22.0 to clear the bulk of the ignored pip-audit advisories. 0.22 requires transformers>=5, which cascades through the whole GPU/inference stack (transformers 4.57→5.8.1, peft 0.17→0.19.1, diffusers 0.36→0.38, torch 2.10→2.11, pillow 11.3→12.2, safetensors 0.6→0.8, fastembed 0.7→0.8, deepspeed 0.18→0.19), so this PR also adapts the worker executors to the changed APIs and pins vLLM to its CUDA-12.9 wheel (the PyPI default is built for CUDA 13, which the GPU workers can't run).
Changes
pyproject.toml, uv.lock, src/worker/requirements/requirements.txt, src/worker/requirements/requirements.gpu.txt: raise the floors and re-lock / regenerate. fastembed is moved past its <0.8 cap (it, not gradio, was the real pillow<12 capper); peft>=0.18 is required (0.17 imports the removed transformers.HybridCache and silently unregisters every training executor); deepspeed moves to >=0.19.1; and flashinfer-python is pinned to ==0.6.11.post2 to match vLLM 0.22's exact requirement (the GPU group already pins vllm/vllm-omni with == for the same ABI-locking reason).
pyproject.toml [tool.uv.sources]: pin vllm to its +cu129 release wheel for linux/x86_64 (PyPI fallback elsewhere) so it matches torch (UV_TORCH_BACKEND=cu129), flashinfer, and the CUDA 12.9 base image. The PyPI default links libcudart.so.13.
.github/workflows/security.yml, docs/CODE_STYLE.md: the bump clears most ignored advisories (vllm, gradio, pillow, diffusers, transformers, starlette no longer fire at the new versions). The ignore list collapses to what still fires: torch GHSA-rrmf-rvhw-rf47, lxml PYSEC-2026-87 (crawl4ai caps lxml<6), and diskcache GHSA-w8v5-vhqr-4h9v. The +cu129 wheel is not on PyPI and so cannot be audited; the GPU run skips it (documented, like flashinfer-jit-cache).
src/worker/executors/ppo_executor.py (drop the removed PPOConfig.save_safetensors; map fp16/bf16 like SFTConfig), diffusers_executor.py (pass do_classifier_free_guidance to encode_prompt when the signature accepts it; required for SD1.x/2.x/XL in diffusers 0.38), transformers_executor.py + ppo_executor.py (assert the single-sequence tokenizer.decode() result is str, now typed str | list[str]), sft_executor.py (del heavy locals instead of = None).
tests/conftest.py: default the unit suite to CPU-only (transformers 5 eagerly inits the CUDA device when a TrainingArguments-derived config is constructed).
Design
+cu129 release wheel, so pinning it is a self-contained build-config change that keeps the whole image coherent on CUDA 12, versus a fleet-wide driver upgrade to ≥580 (with reboots and co-tenant disruption) that would also cascade torch/flashinfer to CUDA 13. Same vLLM version either way, so the CVE wins are preserved.
CUDA_VISIBLE_DEVICES="" makes the suite deterministic anywhere; it's overridable for GPU-marked tests, and the one real-GPU test is already excluded in CI.
Test Plan
pre-commit run --all-files and pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py.
pip-audit against the three generated requirements files, exactly as CI runs it.
Trainer/TRL training, transformers CPU inference, fastembed RAG, the vLLM-Omni task types, and a 2-GPU DeepSpeed ZeRO-2 SFT run) to confirm the new wheels actually load and run.
Test Result
pytest tests/, and all three pip-audit scans pass.
DONE: the cu129 wheels load and run across vLLM, transformers-5 training/inference, diffusers 0.38, fastembed, all four vLLM-Omni task types, and DeepSpeed 0.19 multi-GPU training.
Pre-submission Checklist
pre-commit run --all-files and fixed any issues.
uv run pytest tests/ passes locally.
uv sync --all-packages --group ci --frozen). (No SDK/CLI code changes; dependency floors only.)
[BREAKING] and described migration steps above.