
[Bug]: KV Cache Error with KV_cache_dtype=FP8 and Large Sequence Length: Losing Context Length of Model #10337

amakaido28 opened this issue Nov 14, 2024 · 0 comments

🐛 Describe the bug

When I serve llama3.1-70B quantized to W4A16 with the following parameters (a rough offline-API equivalent is sketched after the list):

  1. --max-model-len: 127728
  2. --enable-prefix-caching: True
  3. --enable-chunked-prefill: False
  4. --kv-cache-dtype: fp8_e4m3
  5. VLLM_ATTENTION_BACKEND: FLASHINFER
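
For reference, a rough offline-API equivalent of this launch, written as a sketch rather than the exact deployment: the model path is a placeholder, and options such as tensor_parallel_size are omitted because they are not part of this report.

import os

# The attention backend is chosen when the engine starts, so the variable
# must be set before vLLM is imported / the engine is created.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

# Placeholder path; the actual deployment serves a llama3.1-70B W4A16 checkpoint.
llm = LLM(
    model="<path-to-llama3.1-70B-w4a16>",
    max_model_len=127728,
    enable_prefix_caching=True,
    enable_chunked_prefill=False,
    kv_cache_dtype="fp8_e4m3",
)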

I get the following error:

Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
                  ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 339, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 487, in _initialize_kv_caches
    self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 125, in initialize_cache
    self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 258, in initialize_cache
    raise_if_cache_size_invalid(num_gpu_blocks,
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 483, in raise_if_cache_size_invalid
    raise ValueError(
ValueError: The model's max seq len (127728) is larger than the maximum number of tokens that can be stored in KV cache (4800). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
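
For context on why the reported capacity falls so far short of max_model_len, here is a rough back-of-envelope estimate of the FP8 KV-cache footprint; the architecture numbers (80 layers, 8 KV heads, head dim 128 for a Llama 3.1 70B-class model) are assumptions on my part, not values taken from the traceback.

# Assumed Llama 3.1 70B geometry: 80 layers, 8 KV heads (GQA), head_dim 128.
num_layers = 80
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 1  # fp8_e4m3 stores one byte per element

# Key + value, per token, summed over all layers.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(bytes_per_token / 1024)               # ~160 KiB per token

# Memory needed to cache the full requested context of 127,728 tokens.
print(127_728 * bytes_per_token / 1024**3)  # ~19.5 GiB

# Memory the engine actually found room for (4,800 tokens from the error).
print(4_800 * bytes_per_token / 1024**3)    # ~0.73 GiB

So after the W4A16 weights are loaded, only about 0.7 GiB appears to be left for the KV cache, while the full context would need roughly 19.5 GiB; the levers the error message itself suggests (raising gpu_memory_utilization or lowering max_model_len) are the usual ways to close that gap.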

If I avoid using FLASHINFER, these are the logs (and the engine fails with the same error as above):

INFO 11-14 09:06:52 selector.py:227] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 11-14 09:06:52 selector.py:229] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable  VLLM_ATTENTION_BACKEND=FLASHINFER

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.