vLLM does not support per request user provided logits processors #1517

Open
cpfiffer opened this issue Mar 27, 2025 · 2 comments

Describe the issue as clearly as possible:

vLLM 0.8+ does not seem to work with outlines anymore due to the error below. vLLM V1 should likely not raise this error:

ValueError: vLLM V1 does not support per request user provided logits processors.

I suspect this is an upstream problem but wanted to flag it here in case anyone else is experiencing this issue. The current workaround is to downgrade to vllm==0.7.1.
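For context on where this comes from: outlines applies structured generation by attaching a logits processor to each request via `SamplingParams.logits_processors`, and it is exactly this per-request field that the V1 engine rejects (the traceback below ends in that validation in `vllm/v1/engine/processor.py`). A minimal sketch of the pattern against vLLM's Python API, with a hypothetical `constrain_logits` standing in for the processor outlines builds internally:

```
from vllm import LLM, SamplingParams

# Hypothetical stand-in for the logits processor outlines builds internally.
def constrain_logits(token_ids, logits):
    # A real processor would mask tokens that break the JSON schema;
    # here we just return the logits unchanged.
    return logits

llm = LLM(model="microsoft/Phi-3-mini-4k-instruct")
params = SamplingParams(
    max_tokens=64,
    logits_processors=[constrain_logits],  # per-request field rejected by the V1 engine
)
outputs = llm.generate("France: ", params)
```

Downgrading to vllm==0.7.1 presumably works because that release still defaults to the V0 engine, which accepts per-request logits processors.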

Steps/code to reproduce the bug:

from outlines import models, generate
from pydantic import BaseModel

model = models.vllm("microsoft/Phi-3-mini-4k-instruct")

class Example(BaseModel):
    name: str
    description: str


prompt = "France: "
generator = generate.json(model, Example)
response = generator(prompt)

print(response)

Expected result:

An `Example` object

Error message:

(oss-debug) λ ~/dottxt/oss-debug/ python vllm_test.py
INFO 03-27 10:53:14 [__init__.py:239] Automatically detected platform cuda.
INFO 03-27 10:53:24 [config.py:585] This model supports multiple tasks: {'reward', 'classify', 'embed', 'score', 'generate'}. Defaulting to 'generate'.
INFO 03-27 10:53:24 [config.py:1697] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 03-27 10:53:26 [core.py:54] Initializing a V1 LLM engine (v0.8.2) with config: model='microsoft/Phi-3-mini-4k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-mini-4k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=microsoft/Phi-3-mini-4k-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 03-27 10:53:26 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x763983533110>
INFO 03-27 10:53:27 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-27 10:53:27 [cuda.py:220] Using Flash Attention backend on V1 engine.
INFO 03-27 10:53:27 [gpu_model_runner.py:1174] Starting to load model microsoft/Phi-3-mini-4k-instruct...
WARNING 03-27 10:53:28 [topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 03-27 10:53:28 [weight_utils.py:265] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.37it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.74it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.67it/s]

INFO 03-27 10:53:29 [loader.py:447] Loading weights took 1.25 seconds
INFO 03-27 10:53:29 [gpu_model_runner.py:1186] Model loading took 7.1184 GB and 2.009474 seconds
INFO 03-27 10:53:38 [backends.py:415] Using cache directory: /home/cameron/.cache/vllm/torch_compile_cache/dc009c0fc6/rank_0_0 for vLLM's torch.compile
INFO 03-27 10:53:38 [backends.py:425] Dynamo bytecode transform time: 8.92 s
INFO 03-27 10:53:41 [backends.py:132] Cache the graph of shape None for later use
INFO 03-27 10:54:06 [backends.py:144] Compiling a graph for general shape takes 27.13 s
INFO 03-27 10:54:15 [monitor.py:33] torch.compile takes 36.05 s in total
INFO 03-27 10:54:16 [kv_cache_utils.py:566] GPU KV cache size: 92,752 tokens
INFO 03-27 10:54:16 [kv_cache_utils.py:569] Maximum concurrency for 4,096 tokens per request: 22.64x
INFO 03-27 10:54:35 [gpu_model_runner.py:1534] Graph capturing finished in 19 secs, took 0.47 GiB
INFO 03-27 10:54:35 [core.py:151] init engine (profile, create kv cache, warmup model) took 65.58 seconds
Traceback (most recent call last):
  File "/home/cameron/dottxt/oss-debug/vllm_test.py", line 13, in <module>
    response = generator(prompt)
               ^^^^^^^^^^^^^^^^^
  File "/home/cameron/dottxt/oss-debug/.venv/lib/python3.12/site-packages/outlines/generate/api.py", line 504, in __call__
    completions = self.model.generate(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/dottxt/oss-debug/.venv/lib/python3.12/site-packages/outlines/models/vllm.py", line 130, in generate
    results = self.model.generate(
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/dottxt/oss-debug/.venv/lib/python3.12/site-packages/vllm/utils.py", line 1072, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/dottxt/oss-debug/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 457, in generate
    self._validate_and_add_requests(
  File "/home/cameron/dottxt/oss-debug/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 1308, in _validate_and_add_requests
    self._add_request(
  File "/home/cameron/dottxt/oss-debug/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 1326, in _add_request
    self.llm_engine.add_request(
  File "/home/cameron/dottxt/oss-debug/.venv/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py", line 184, in add_request
    request = self.processor.process_inputs(request_id, prompt, params,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/cameron/dottxt/oss-debug/.venv/lib/python3.12/site-packages/vllm/v1/engine/processor.py", line 183, in process_inputs
    self._validate_params(params)
  File "/home/cameron/dottxt/oss-debug/.venv/lib/python3.12/site-packages/vllm/v1/engine/processor.py", line 114, in _validate_params
    self._validate_supported_sampling_params(params)
  File "/home/cameron/dottxt/oss-debug/.venv/lib/python3.12/site-packages/vllm/v1/engine/processor.py", line 97, in _validate_supported_sampling_params
    raise ValueError("vLLM V1 does not support per request "
ValueError: vLLM V1 does not support per request user provided logits processors.

Outlines/Python version information:

Version information

```
(oss-debug) λ ~/dottxt/oss-debug/ python -c "from outlines import _version; print(_version.version)"; python -c "import sys; print('Python', sys.version)"; uv pip freeze;
0.1.11
Python 3.12.8 (main, Jan 14 2025, 22:49:14) [Clang 19.1.6 ]
aiohappyeyeballs==2.6.1
aiohttp==3.11.14
aiosignal==1.3.2
airportsdata==20250224
annotated-types==0.7.0
anyio==4.9.0
astor==0.8.1
attrs==25.3.0
blake3==1.0.4
cachetools==5.5.2
certifi==2025.1.31
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
compressed-tensors==0.9.2
cupy-cuda12x==13.4.1
depyf==0.18.0
dill==0.3.9
diskcache==5.6.3
distro==1.9.0
dnspython==2.7.0
einops==0.8.1
email-validator==2.2.0
fastapi==0.115.12
fastapi-cli==0.0.7
fastrlock==0.8.3
filelock==3.18.0
frozenlist==1.5.0
fsspec==2025.3.0
gguf==0.10.0
h11==0.14.0
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.29.3
idna==3.10
importlib-metadata==8.6.1
interegular==0.3.3
jinja2==3.1.6
jiter==0.9.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
lark==1.2.2
llguidance==0.7.10
llvmlite==0.43.0
lm-format-enforcer==0.10.11
markdown-it-py==3.0.0
markupsafe==3.0.2
mdurl==0.1.2
mistral-common==1.5.4
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.2.0
nest-asyncio==1.6.0
networkx==3.4.2
ninja==1.11.1.4
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
openai==1.68.2
opencv-python-headless==4.11.0.86
outlines==0.1.11
outlines-core==0.1.26
packaging==24.2
partial-json-parser==0.2.1.1.post5
pillow==11.1.0
prometheus-client==0.21.1
prometheus-fastapi-instrumentator==7.1.0
propcache==0.3.1
protobuf==6.30.2
psutil==7.0.0
py-cpuinfo==9.0.0
pycountry==24.6.1
pydantic==2.10.6
pydantic-core==2.27.2
pygments==2.19.1
python-dotenv==1.1.0
python-json-logger==3.3.0
python-multipart==0.0.20
pyyaml==6.0.2
pyzmq==26.3.0
ray==2.44.1
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rich-toolkit==0.14.0
rpds-py==0.24.0
safetensors==0.5.3
scipy==1.15.2
sentencepiece==0.2.0
setuptools==78.1.0
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
starlette==0.46.1
sympy==1.13.1
tiktoken==0.9.0
tokenizers==0.21.1
torch==2.6.0
torchaudio==2.6.0
torchvision==0.21.0
tqdm==4.67.1
transformers==4.50.2
triton==3.2.0
typer==0.15.2
typing-extensions==4.13.0
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
vllm==0.8.2
watchfiles==1.0.4
websockets==15.0.1
xformers==0.0.29.post2
xgrammar==0.1.16
yarl==1.18.3
zipp==3.21.0
```

Context for the issue:

No response

@cpfiffer (Contributor, Author)

Upstream issue: vllm-project/vllm#15636

@limaolin2017

+1
