chore: bump vLLM/vLLM-Omni to 0.22.0 and adapt the worker stack by kaiitunnz · Pull Request #70 · mlsys-io/FlowMesh

kaiitunnz · 2026-06-12T09:06:02Z

Purpose

Bumps vllm / vllm-omni from 0.18.0 to 0.22.0 to clear the bulk of the ignored pip-audit advisories. 0.22 requires transformers>=5, which cascades through the whole GPU/inference stack — transformers 4.57→5.8.1, peft 0.17→0.19.1, diffusers 0.36→0.38, torch 2.10→2.11, pillow 11.3→12.2, safetensors 0.6→0.8, fastembed 0.7→0.8, deepspeed 0.18→0.19 — so this PR also adapts the worker executors to the changed APIs and pins vLLM to its CUDA-12.9 wheel (the PyPI default is built for CUDA 13, which the GPU workers can't run).

Changes

Dependency bump — pyproject.toml, uv.lock, src/worker/requirements/requirements.txt, src/worker/requirements/requirements.gpu.txt: raise the floors and re-lock / regenerate. fastembed is moved past its <0.8 cap (it, not gradio, was the real pillow<12 capper); peft>=0.18 is required (0.17 imports the removed transformers.HybridCache and silently unregisters every training executor); deepspeed moves to >=0.19.1; and flashinfer-python is pinned to ==0.6.11.post2 to match vLLM 0.22's exact requirement (the GPU group already pins vllm/vllm-omni with == for the same ABI-locking reason).
vLLM CUDA-12 pin — pyproject.toml [tool.uv.sources]: pin vllm to its +cu129 release wheel for linux/x86_64 (PyPI fallback elsewhere) so it matches torch (UV_TORCH_BACKEND=cu129), flashinfer, and the CUDA 12.9 base image. The PyPI default links libcudart.so.13.
pip-audit ignore prune — .github/workflows/security.yml, docs/CODE_STYLE.md: the bump clears most ignored advisories (vllm, gradio, pillow, diffusers, transformers, starlette no longer fire at the new versions). The ignore list collapses to what still fires: torch GHSA-rrmf-rvhw-rf47, lxml PYSEC-2026-87 (crawl4ai caps lxml<6), and diskcache GHSA-w8v5-vhqr-4h9v. The +cu129 wheel is unauditable on PyPI, so the GPU run skips it (documented, like flashinfer-jit-cache).
Worker executor adaptations — src/worker/executors/ppo_executor.py (drop the removed PPOConfig.save_safetensors; map fp16/bf16 like SFTConfig), diffusers_executor.py (pass do_classifier_free_guidance to encode_prompt when the signature accepts it — required for SD1.x/2.x/XL in diffusers 0.38), transformers_executor.py + ppo_executor.py (assert the single-sequence tokenizer.decode() result is str, now typed str | list[str]), sft_executor.py (del heavy locals instead of = None).
Tests — tests/conftest.py: default the unit suite to CPU-only (transformers 5 eagerly inits the CUDA device when a TrainingArguments-derived config is constructed).
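
The diffusers adaptation listed above (pass do_classifier_free_guidance to encode_prompt only when the signature accepts it) can be sketched roughly as follows. This is an illustration, not the code in diffusers_executor.py: the helper name build_encode_prompt_kwargs, the guidance_scale > 1.0 rule, and the two stand-in pipeline classes are assumptions made for the example. Only inspect.signature from the standard library is relied on.

```python
import inspect
from typing import Any, Callable


def build_encode_prompt_kwargs(
    encode_prompt: Callable[..., Any],
    prompt: str,
    guidance_scale: float,
) -> dict[str, Any]:
    """Build kwargs for encode_prompt, adding the CFG flag only if it is accepted."""
    kwargs: dict[str, Any] = {"prompt": prompt}
    params = inspect.signature(encode_prompt).parameters
    if "do_classifier_free_guidance" in params:
        # Assumption for this sketch: guidance is active when guidance_scale > 1.0.
        kwargs["do_classifier_free_guidance"] = guidance_scale > 1.0
    return kwargs


# Hypothetical stand-ins for two pipeline families (not real diffusers classes).
class FakeSDPipeline:
    def encode_prompt(self, prompt, do_classifier_free_guidance):
        return (prompt, do_classifier_free_guidance)


class FakeOtherPipeline:
    def encode_prompt(self, prompt):
        return (prompt,)


sd_kwargs = build_encode_prompt_kwargs(FakeSDPipeline().encode_prompt, "a cat", 7.5)
other_kwargs = build_encode_prompt_kwargs(FakeOtherPipeline().encode_prompt, "a cat", 7.5)
```

Inspecting the signature keeps a single call site working for pipelines that require the argument and for those that do not declare it. A pipeline that accepts it only through **kwargs would not be detected by this name check.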

Design

cu129 wheel pin over a driver upgrade. vLLM 0.22's PyPI wheel is CUDA 13; the GPU fleet is CUDA 12.9 / driver 560. vLLM publishes a +cu129 release wheel, so pinning it is a self-contained build-config change that keeps the whole image coherent on CUDA 12 — versus a fleet-wide driver upgrade to ≥580 (with reboots and co-tenant disruption) that would also cascade torch/flashinfer to CUDA 13. Same vLLM version either way, so the CVE wins are preserved.
Ignore list pruned to what actually fires. The advisory table was rebuilt empirically (running pip-audit with no ignores against the regenerated requirements), not by editing the old list — the new torch/transformers versions fall outside the affected ranges of many no-fix advisories, so they drop too.
CPU-only unit suite. transformers 5 resolves the CUDA device during config construction, which crashes on any host whose driver can't init the installed torch build. Defaulting CUDA_VISIBLE_DEVICES="" makes the suite deterministic anywhere; it's overridable for GPU-marked tests, and the one real-GPU test is already excluded in CI.
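
A minimal sketch of the CPU-only default described above, assuming it is implemented with setdefault on the process environment. The actual contents of tests/conftest.py are not shown in this PR description, and the helper name default_to_cpu_only is hypothetical.

```python
import os
from typing import MutableMapping


def default_to_cpu_only(env: MutableMapping[str, str]) -> str:
    """Hide all GPUs unless the caller already chose a device list.

    Returns the value of CUDA_VISIBLE_DEVICES that is in effect afterwards.
    """
    # setdefault leaves an explicit setting (for example "0") untouched,
    # which is what lets GPU-marked tests override the default.
    return env.setdefault("CUDA_VISIBLE_DEVICES", "")


# In a conftest.py this would run at import time, before torch or
# transformers get a chance to initialize a CUDA device.
default_to_cpu_only(os.environ)
```

Running the GPU tests would then amount to exporting CUDA_VISIBLE_DEVICES (for example CUDA_VISIBLE_DEVICES=0) before invoking pytest.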

Test Plan

pre-commit run --all-files and pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py.
pip-audit against the three generated requirements files, exactly as CI runs it.
End-to-end on the rebuilt server/worker images: one workflow per touched dependency (vLLM inference, diffusers, transformers Trainer/TRL training, transformers CPU inference, fastembed RAG, the vLLM-Omni task types, and a 2-GPU DeepSpeed ZeRO-2 SFT run) to confirm the new wheels actually load and run.

Test Result

pre-commit, pytest tests/, and all three pip-audit scans pass.
Every end-to-end workflow reaches DONE — the cu129 wheels load and run across vLLM, transformers-5 training/inference, diffusers 0.38, fastembed, all four vLLM-Omni task types, and DeepSpeed 0.19 multi-GPU training.

Pre-submission Checklist

I have read the contribution guidelines.
I have run pre-commit run --all-files and fixed any issues.
I have added or updated tests covering my changes (if applicable).
I have verified that uv run pytest tests/ passes locally.
If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker. (No schema/proto changes.)
If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen). (No SDK/CLI code changes; dependency floors only.)
If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
I have updated documentation or config examples if user-facing behavior changed.

timzsu

Some comments. PTAL.

vllm and vllm-omni 0.22 require transformers >=5, which cascades through the GPU/inference stack: transformers 5.8.1, peft 0.19.1 (0.17 imports the removed HybridCache and unregisters every training executor), diffusers 0.38, pillow 12.2, torch 2.11, safetensors 0.8, and fastembed 0.8 (lifts the pillow<12 cap). Re-lock and regenerate the worker requirements.

vLLM's PyPI wheel for 0.22 is built for CUDA 13; the GPU worker runs CUDA 12.9. Pin the +cu129 release wheel for linux/x86_64 via [tool.uv.sources] so it matches torch and flashinfer, with a PyPI fallback for other platforms.

The bump clears most ignored pip-audit advisories (vllm, gradio, pillow, diffusers, transformers, starlette no longer fire at the new versions); prune them from security.yml and CODE_STYLE.md, leaving torch, lxml, and diskcache. The cu129 wheel is not on PyPI, so pip-audit skips it like flashinfer-jit-cache. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

… 0.38

transformers 5, trl 0.23, and diffusers 0.38 changed APIs the worker executors relied on: trl's PPOConfig dropped save_safetensors, diffusers made encode_prompt's do_classifier_free_guidance required for SD1.x/2.x/XL, transformers types tokenizer.decode() as str | list[str], and the bf16 GPU probe can now raise when CUDA can't initialize. Adapt the executors accordingly and cover the new fp16 fallback. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

transformers 5 eagerly initializes the CUDA device when a TrainingArguments-derived config is constructed, so config-mapping unit tests crash on a host whose driver can't init the installed torch build. Default CUDA_VISIBLE_DEVICES to empty; set it explicitly to run GPU tests. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

transformers 5 types tokenizer.decode() as str | list[str]. The single-sequence calls always return str, so assert the type to verify the invariant at runtime instead of casting it away unchecked. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
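
As a rough sketch of the runtime check this commit describes: the helper name decode_single, the Decoder protocol, and the stand-in tokenizer below are hypothetical and exist only for illustration. The executors' real call sites are not shown here.

```python
from typing import Protocol, Sequence, Union


class Decoder(Protocol):
    """Anything with a decode() typed like the transformers 5 tokenizer method."""

    def decode(self, token_ids: Sequence[int]) -> Union[str, list[str]]: ...


def decode_single(tokenizer: Decoder, token_ids: Sequence[int]) -> str:
    """Decode one sequence and verify at runtime that the result is a str."""
    text = tokenizer.decode(token_ids)
    # Narrow str | list[str] to str with a check instead of an unchecked cast.
    assert isinstance(text, str), f"expected str, got {type(text).__name__}"
    return text


class FakeTokenizer:
    """Hypothetical stand-in; a real tokenizer would come from transformers."""

    def decode(self, token_ids):
        return " ".join(str(t) for t in token_ids)


decoded = decode_single(FakeTokenizer(), [1, 2, 3])
```

Compared with typing.cast, the assert also narrows the type for static checkers, and it fails loudly if a call ever returns a list.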

is_bf16_supported() only raises when CUDA can't initialize, in which case the device is unusable and fp16 buys nothing — the 4-bit load fails moments later regardless. Catching it masked a fatal misconfiguration that a GPU worker should surface so the task retries on a healthy worker. The genuine no-bf16 case returns False and already falls through to fp16. Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
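
The behavior argued for here can be sketched as below, under the assumption that the probe is passed in as a callable (on a worker it would plausibly be torch.cuda.is_bf16_supported). The function name pick_compute_dtype and the string return values are hypothetical, chosen so the example runs without torch installed.

```python
from typing import Callable


def pick_compute_dtype(bf16_supported: Callable[[], bool]) -> str:
    """Return "bfloat16" when the probe reports support, else "float16".

    Exceptions from the probe are deliberately not caught: if CUDA cannot
    initialize, the error propagates so the task fails and can be retried
    on a healthy worker instead of silently downgrading to fp16.
    """
    return "bfloat16" if bf16_supported() else "float16"


def broken_probe() -> bool:
    # Hypothetical stand-in for a probe on a host where CUDA cannot initialize.
    raise RuntimeError("CUDA initialization failed")


supported = pick_compute_dtype(lambda: True)
unsupported = pick_compute_dtype(lambda: False)
```

The genuine no-bf16 case takes the False branch and falls through to fp16. Only a failing probe raises.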

vllm 0.22.0 hard-pins flashinfer-python==0.6.11.post2, so a >= floor there is misleading — match it exactly, as the group already does for vllm and vllm-omni. Bump the deepspeed floor to 0.19.1 (the latest, resolves cleanly). Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>

timzsu

LGTM.

kaiitunnz marked this pull request as ready for review June 12, 2026 12:29

kaiitunnz requested a review from timzsu as a code owner June 12, 2026 12:29

timzsu requested changes Jun 12, 2026

kaiitunnz force-pushed the kaiitunnz/chore/bump-vllm-0.22 branch 2 times, most recently from 38bb1e9 to a76abbd Compare June 13, 2026 09:58

kaiitunnz added 6 commits June 13, 2026 18:16

kaiitunnz force-pushed the kaiitunnz/chore/bump-vllm-0.22 branch from a76abbd to 070d1e5 Compare June 13, 2026 10:16

kaiitunnz requested a review from timzsu June 13, 2026 11:13

timzsu approved these changes Jun 13, 2026

kaiitunnz merged commit 3a02e55 into main Jun 13, 2026
12 of 13 checks passed

kaiitunnz deleted the kaiitunnz/chore/bump-vllm-0.22 branch June 13, 2026 11:26

kaiitunnz merged 6 commits into main from kaiitunnz/chore/bump-vllm-0.22
