
[Bug]: KV Cache Error with KV_cache_dtype=FP8 and Large Sequence Length: Losing Context Length of Model #10337

amakaido28 opened this issue Nov 14, 2024 · 0 comments

🐛 Describe the bug

When I serve llama3.1-70B quantized to W4A16 with the following parameters (a rough offline-API equivalent is sketched after the list):

  1. --max-model-len: 127728
  2. --enable-prefix-caching: True
  3. --enable-chunked-prefill: False
  4. --kv-cache-dtype: fp8_e4m3
  5. VLLM_ATTENTION_BACKEND: FLASHINFER
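
For reference, a rough offline-API equivalent of this launch, written as a sketch rather than the exact deployment: the model path is a placeholder, and options such as tensor_parallel_size are omitted because they are not part of this report.

import os

# The attention backend is chosen when the engine starts, so the variable
# must be set before vLLM is imported / the engine is created.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

# Placeholder path; the actual deployment serves a llama3.1-70B W4A16 checkpoint.
llm = LLM(
    model="<path-to-llama3.1-70B-w4a16>",
    max_model_len=127728,
    enable_prefix_caching=True,
    enable_chunked_prefill=False,
    kv_cache_dtype="fp8_e4m3",
)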

I get the following error:

Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
                  ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 339, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 487, in _initialize_kv_caches
    self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/gpu_executor.py", line 125, in initialize_cache
    self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 258, in initialize_cache
    raise_if_cache_size_invalid(num_gpu_blocks,
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 483, in raise_if_cache_size_invalid
    raise ValueError(
ValueError: The model's max seq len (127728) is larger than the maximum number of tokens that can be stored in KV cache (4800). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
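
For context on why the reported capacity falls so far short of max_model_len, here is a rough back-of-envelope estimate of the FP8 KV-cache footprint; the architecture numbers (80 layers, 8 KV heads, head dim 128 for a Llama 3.1 70B-class model) are assumptions on my part, not values taken from the traceback.

# Assumed Llama 3.1 70B geometry: 80 layers, 8 KV heads (GQA), head_dim 128.
num_layers = 80
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 1  # fp8_e4m3 stores one byte per element

# Key + value, per token, summed over all layers.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(bytes_per_token / 1024)               # ~160 KiB per token

# Memory needed to cache the full requested context of 127,728 tokens.
print(127_728 * bytes_per_token / 1024**3)  # ~19.5 GiB

# Memory the engine actually found room for (4,800 tokens from the error).
print(4_800 * bytes_per_token / 1024**3)    # ~0.73 GiB

So after the W4A16 weights are loaded, only about 0.7 GiB appears to be left for the KV cache, while the full context would need roughly 19.5 GiB; the levers the error message itself suggests (raising gpu_memory_utilization or lowering max_model_len) are the usual ways to close that gap.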

If I avoid using FLASHINFER, these are the logs (and the engine fails with the same error as above):

INFO 11-14 09:06:52 selector.py:227] Cannot use FlashAttention-2 backend for FP8 KV cache.
WARNING 11-14 09:06:52 selector.py:229] Please use FlashInfer backend with FP8 KV Cache for better performance by setting environment variable  VLLM_ATTENTION_BACKEND=FLASHINFER

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.