🐛 Describe the bug
When I serve llama3.1-70B quantized to w4a16 with the following parameters (an approximate serve command is sketched below):
--max-model-len: 127728
--enable-prefix-caching: True
--enable-chunked-prefill: False
--kv-cache-dtype: fp8_e4m3
VLLM_ATTENTION_BACKEND: FLASHINFER
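Roughly, this corresponds to a launch command like the following (the model path is a placeholder and the exact invocation may differ slightly):

```bash
# Rough sketch of the launch command (placeholder checkpoint path, not the exact invocation)
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve <llama-3.1-70B-w4a16-checkpoint> \
    --max-model-len 127728 \
    --enable-prefix-caching \
    --kv-cache-dtype fp8_e4m3
# chunked prefill is kept disabled, as in the settings above (flag syntax may vary by vLLM version)
```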
I have the following error:
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
self.engine = LLMEngine(*args,
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 339, in __init__
self._initialize_kv_caches()
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 487, in _initialize_kv_caches
self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 125, in initialize_cache
self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 258, in initialize_cache
raise_if_cache_size_invalid(num_gpu_blocks,
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 483, in raise_if_cache_size_invalid
raise ValueError(
ValueError: The model's max seq len (127728) is larger than the maximum number of tokens that can be stored in KV cache (4800). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
If I avoid using FLASHINFER, these are the logs (and the engine fails with the same error as above):
INFO 11-14 09:06:52 selector.py:227] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 11-14 09:06:52 selector.py:229] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable VLLM_ATTENTION_BACKEND=FLASHINFER
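For completeness, the workarounds suggested by the error message would correspond to flags like these (values are only illustrative; lowering max-model-len gives up the long context):

```bash
# Workarounds suggested by the error message, expressed as CLI flags (illustrative values):
# - raise --gpu-memory-utilization (default 0.90) to leave more memory for the KV cache, or
# - lower --max-model-len so the requested context fits in the available KV cache
vllm serve <llama-3.1-70B-w4a16-checkpoint> \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768
```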