
chore: bump vLLM/vLLM-Omni to 0.22.0 and adapt the worker stack #70

Merged
kaiitunnz merged 6 commits into main from kaiitunnz/chore/bump-vllm-0.22
Jun 13, 2026



@kaiitunnz kaiitunnz commented Jun 12, 2026

Collaborator

Purpose

Bumps vllm / vllm-omni from 0.18.0 to 0.22.0 to clear the bulk of the ignored pip-audit advisories. 0.22 requires transformers>=5, which cascades through the whole GPU/inference stack — transformers 4.57→5.8.1, peft 0.17→0.19.1, diffusers 0.36→0.38, torch 2.10→2.11, pillow 11.3→12.2, safetensors 0.6→0.8, fastembed 0.7→0.8, deepspeed 0.18→0.19 — so this PR also adapts the worker executors to the changed APIs and pins vLLM to its CUDA-12.9 wheel (the PyPI default is built for CUDA 13, which the GPU workers can't run).

Changes

  • Dependency bump (pyproject.toml, uv.lock, src/worker/requirements/requirements.txt, src/worker/requirements/requirements.gpu.txt): raise the floors and re-lock / regenerate. fastembed is moved past its <0.8 cap (it, not gradio, was the real pillow<12 capper); peft>=0.18 is required (0.17 imports the removed transformers.HybridCache and silently unregisters every training executor); deepspeed moves to >=0.19.1; and flashinfer-python is pinned to ==0.6.11.post2 to match vLLM 0.22's exact requirement (the GPU group already pins vllm/vllm-omni with == for the same ABI-locking reason).
  • vLLM CUDA-12 pin (pyproject.toml [tool.uv.sources]): pin vllm to its +cu129 release wheel for linux/x86_64 (PyPI fallback elsewhere) so it matches torch (UV_TORCH_BACKEND=cu129), flashinfer, and the CUDA 12.9 base image. The PyPI default links libcudart.so.13.
  • pip-audit ignore prune (.github/workflows/security.yml, docs/CODE_STYLE.md): the bump clears most ignored advisories (vllm, gradio, pillow, diffusers, transformers, starlette no longer fire at the new versions). The ignore list collapses to what still fires: torch GHSA-rrmf-rvhw-rf47, lxml PYSEC-2026-87 (crawl4ai caps lxml<6), and diskcache GHSA-w8v5-vhqr-4h9v. The +cu129 wheel is not published on PyPI and therefore cannot be audited, so the GPU run skips it (documented, like flashinfer-jit-cache).
  • Worker executor adaptations: src/worker/executors/ppo_executor.py (drop the removed PPOConfig.save_safetensors; map fp16/bf16 like SFTConfig), diffusers_executor.py (pass do_classifier_free_guidance to encode_prompt when the signature accepts it, as required for SD1.x/2.x/XL in diffusers 0.38), transformers_executor.py + ppo_executor.py (assert the single-sequence tokenizer.decode() result is str, now typed str | list[str]), sft_executor.py (del heavy locals instead of = None).
  • Tests (tests/conftest.py): default the unit suite to CPU-only (transformers 5 eagerly inits the CUDA device when a TrainingArguments-derived config is constructed).
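
The "when the signature accepts it" check for encode_prompt can be sketched with inspect.signature. This is a minimal illustration, not the code in diffusers_executor.py: call_encode_prompt and the two stand-in encoders are hypothetical names, and the real encode_prompt methods take more arguments than shown here.

```python
import inspect
from typing import Any, Callable


def call_encode_prompt(
    encode_prompt: Callable[..., Any], prompt: str, guidance: bool
) -> Any:
    """Call encode_prompt, passing do_classifier_free_guidance only if it is a named parameter."""
    params = inspect.signature(encode_prompt).parameters
    kwargs: dict[str, Any] = {}
    if "do_classifier_free_guidance" in params:
        kwargs["do_classifier_free_guidance"] = guidance
    return encode_prompt(prompt, **kwargs)


# Hypothetical stand-ins for two pipeline families (simplified signatures).
def encode_prompt_with_cfg(prompt: str, do_classifier_free_guidance: bool) -> tuple:
    return ("with_cfg", prompt, do_classifier_free_guidance)


def encode_prompt_without_cfg(prompt: str) -> tuple:
    return ("without_cfg", prompt)
```

Inspecting the signature, rather than catching TypeError after a failed call, avoids masking a TypeError raised inside encode_prompt itself.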

Design

  • cu129 wheel pin over a driver upgrade. vLLM 0.22's PyPI wheel is CUDA 13; the GPU fleet is CUDA 12.9 / driver 560. vLLM publishes a +cu129 release wheel, so pinning it is a self-contained build-config change that keeps the whole image coherent on CUDA 12 — versus a fleet-wide driver upgrade to ≥580 (with reboots and co-tenant disruption) that would also cascade torch/flashinfer to CUDA 13. Same vLLM version either way, so the CVE wins are preserved.
  • Ignore list pruned to what actually fires. The advisory table was rebuilt empirically (running pip-audit with no ignores against the regenerated requirements), not by editing the old list — the new torch/transformers versions fall outside the affected ranges of many no-fix advisories, so they drop too.
  • CPU-only unit suite. transformers 5 resolves the CUDA device during config construction, which crashes on any host whose driver can't init the installed torch build. Defaulting CUDA_VISIBLE_DEVICES="" makes the suite deterministic anywhere; it's overridable for GPU-marked tests, and the one real-GPU test is already excluded in CI.
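
A minimal sketch of the CPU-only default, assuming it is applied at the top of tests/conftest.py before anything initializes CUDA. The helper name default_to_cpu is hypothetical; the actual conftest.py may simply set the variable inline.

```python
import os
from collections.abc import MutableMapping


def default_to_cpu(env: MutableMapping[str, str] = os.environ) -> str:
    """Hide all GPUs unless the caller already chose a CUDA_VISIBLE_DEVICES value.

    setdefault keeps an explicit value (for example "0" for GPU-marked tests)
    and only fills in the empty string when the variable is unset.
    """
    return env.setdefault("CUDA_VISIBLE_DEVICES", "")
```

Because setdefault never overwrites an existing value, running the suite with CUDA_VISIBLE_DEVICES=0 set in the shell still exposes that GPU.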

Test Plan

  • pre-commit run --all-files and pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py.
  • pip-audit against the three generated requirements files, exactly as CI runs it.
  • End-to-end on the rebuilt server/worker images: one workflow per touched dependency (vLLM inference, diffusers, transformers Trainer/TRL training, transformers CPU inference, fastembed RAG, the vLLM-Omni task types, and a 2-GPU DeepSpeed ZeRO-2 SFT run) to confirm the new wheels actually load and run.

Test Result

  • pre-commit, pytest tests/, and all three pip-audit scans pass.
  • Every end-to-end workflow reaches DONE — the cu129 wheels load and run across vLLM, transformers-5 training/inference, diffusers 0.38, fastembed, all four vLLM-Omni task types, and DeepSpeed 0.19 multi-GPU training.

Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that uv run pytest tests/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker. (No schema/proto changes.)
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen). (No SDK/CLI code changes; dependency floors only.)
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

@kaiitunnz kaiitunnz marked this pull request as ready for review June 12, 2026 12:29
@kaiitunnz kaiitunnz requested a review from timzsu as a code owner June 12, 2026 12:29

@timzsu timzsu left a comment

Collaborator


Some comments. PTAL.

Comment thread src/worker/executors/utils/huggingface.py Outdated
Comment thread pyproject.toml
Comment thread src/worker/executors/ppo_executor.py Outdated
Comment thread src/worker/executors/transformers_executor.py Outdated
Comment thread pyproject.toml Outdated
Comment thread pyproject.toml
Comment thread pyproject.toml
Comment thread pyproject.toml Outdated
Comment thread pyproject.toml
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/chore/bump-vllm-0.22 branch 2 times, most recently from 38bb1e9 to a76abbd June 13, 2026 09:58
vllm and vllm-omni 0.22 require transformers >=5, which cascades through
the GPU/inference stack: transformers 5.8.1, peft 0.19.1 (0.17 imports the
removed HybridCache and unregisters every training executor), diffusers
0.38, pillow 12.2, torch 2.11, safetensors 0.8, and fastembed 0.8 (lifts
the pillow<12 cap). Re-lock and regenerate the worker requirements.

vLLM's PyPI wheel for 0.22 is built for CUDA 13; the GPU worker runs CUDA
12.9. Pin the +cu129 release wheel for linux/x86_64 via [tool.uv.sources]
so it matches torch and flashinfer, with a PyPI fallback for other
platforms.

The bump clears most ignored pip-audit advisories (vllm, gradio, pillow,
diffusers, transformers, starlette no longer fire at the new versions);
prune them from security.yml and CODE_STYLE.md, leaving torch, lxml, and
diskcache. The cu129 wheel is not on PyPI, so pip-audit skips it like
flashinfer-jit-cache.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
… 0.38

transformers 5, trl 0.23, and diffusers 0.38 changed APIs the worker
executors relied on: trl's PPOConfig dropped save_safetensors, diffusers
made encode_prompt's do_classifier_free_guidance required for SD1.x/2.x/XL,
transformers types tokenizer.decode() as str | list[str], and the bf16 GPU
probe can now raise when CUDA can't initialize. Adapt the executors
accordingly and cover the new fp16 fallback.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
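
As a rough illustration of the PPOConfig adaptation described in this commit, the sketch below drops the removed save_safetensors key and maps a precision setting onto the bf16/fp16 flags. The function name build_ppo_kwargs and the "precision" key are assumptions made for the example, not the executor's actual config schema.

```python
from typing import Any


def build_ppo_kwargs(user_cfg: dict[str, Any]) -> dict[str, Any]:
    """Translate a user config into keyword arguments for a PPOConfig-like class."""
    # Drop keys the target config does not accept as-is.
    dropped = {"save_safetensors", "precision"}
    kwargs = {k: v for k, v in user_cfg.items() if k not in dropped}

    # Map the single precision setting onto the two boolean flags.
    precision = user_cfg.get("precision", "fp32")
    kwargs["bf16"] = precision == "bf16"
    kwargs["fp16"] = precision == "fp16"
    return kwargs
```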
transformers 5 eagerly initializes the CUDA device when a
TrainingArguments-derived config is constructed, so config-mapping unit
tests crash on a host whose driver can't init the installed torch build.
Default CUDA_VISIBLE_DEVICES to empty; set it explicitly to run GPU tests.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
transformers 5 types tokenizer.decode() as str | list[str]. The
single-sequence calls always return str, so assert the type to verify the
invariant at runtime instead of casting it away unchecked.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
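
The runtime check described in this commit can be sketched as follows. decode_single and FakeTokenizer are hypothetical names used only to keep the example self-contained; a real tokenizer from transformers would be passed in their place.

```python
def decode_single(tokenizer, token_ids: list[int]) -> str:
    """Decode one sequence and verify at runtime that the result is a str."""
    text = tokenizer.decode(token_ids, skip_special_tokens=True)
    # The annotation allows str | list[str]; a single sequence should yield str.
    assert isinstance(text, str), (
        f"expected str from single-sequence decode, got {type(text).__name__}"
    )
    return text


class FakeTokenizer:
    """Hypothetical stand-in that mimics the decode() call shape."""

    def decode(self, token_ids, skip_special_tokens=False):
        return " ".join(str(i) for i in token_ids)
```

Unlike typing.cast, the assert fails loudly if the invariant is ever violated, and type checkers still narrow the result to str.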
is_bf16_supported() only raises when CUDA can't initialize, in which case
the device is unusable and fp16 buys nothing — the 4-bit load fails moments
later regardless. Catching it masked a fatal misconfiguration that a GPU
worker should surface so the task retries on a healthy worker. The genuine
no-bf16 case returns False and already falls through to fp16.

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
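
A minimal sketch of the resulting behavior, assuming the probe is a zero-argument callable such as torch.cuda.is_bf16_supported. The function name pick_compute_dtype is hypothetical. There is deliberately no try/except: a False result falls through to fp16, while an exception from the probe propagates to the caller.

```python
from typing import Callable


def pick_compute_dtype(is_bf16_supported: Callable[[], bool]) -> str:
    """Return "bf16" when the probe reports support, otherwise "fp16".

    Errors raised by the probe are not caught, so a device that cannot
    initialize surfaces as a failure instead of a silent fp16 fallback.
    """
    return "bf16" if is_bf16_supported() else "fp16"
```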
vllm 0.22.0 hard-pins flashinfer-python==0.6.11.post2, so a >= floor there
is misleading — match it exactly, as the group already does for vllm and
vllm-omni. Bump the deepspeed floor to 0.19.1 (the latest, resolves cleanly).

Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
@kaiitunnz kaiitunnz force-pushed the kaiitunnz/chore/bump-vllm-0.22 branch from a76abbd to 070d1e5 June 13, 2026 10:16
@kaiitunnz kaiitunnz requested a review from timzsu June 13, 2026 11:13

@timzsu timzsu left a comment

Collaborator


LGTM.

@kaiitunnz kaiitunnz merged commit 3a02e55 into main Jun 13, 2026
12 of 13 checks passed
@kaiitunnz kaiitunnz deleted the kaiitunnz/chore/bump-vllm-0.22 branch June 13, 2026 11:26
2 participants