CUDA OutOfMemory even after warmup phase succeeded #2744

Open
martinigoyanes opened this issue Nov 13, 2024 · 0 comments
@martinigoyanes (Contributor)

We are running Llama 3.1 70B on 2 A100 GPUs with 80 GB of memory each.
From the logs we can see that the warmup phase succeeded in finding the right max_batch_total_tokens and that TGI automatically set max_batch_prefill_tokens. See below:

Default `max_batch_prefill_tokens` to 10049
...
Warming up model
Using optimized Triton indexing kernels.
Setting max batch total tokens to 30816

Expected behavior

I expect to see no Out Of Memory errors at all if warmup has succeeded.

If TGI's warmup succeeds, it means the maximum token budgets have been adapted to the available hardware, which should prevent TGI from hitting OOM errors during inference. Incoming requests are batched for the prefill/decode stages while respecting those budgets, so an OOM should NEVER occur. However, we are seeing OOM errors EVEN when the token budgets set by TGI are respected.

Is there a fundamental problem/bug with TGI's token budget computation, or maybe with the way memory is allocated/de-allocated?
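
For clarity, this is the invariant I expect to hold after warmup. It is a minimal sketch in the spirit of the router's admission check, not TGI's actual code; the names below are my own:

```python
# Illustrative only (not TGI's actual router code): the admission invariant
# I expect every scheduled batch to satisfy after warmup.
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    input_tokens: int     # prompt length
    max_new_tokens: int   # decode budget reserved for the request

def fits_budget(batch: List[Request],
                max_batch_prefill_tokens: int,  # 10049 in our logs
                max_batch_total_tokens: int     # 30816 in our logs
                ) -> bool:
    prefill_tokens = sum(r.input_tokens for r in batch)
    total_tokens = sum(r.input_tokens + r.max_new_tokens for r in batch)
    return (prefill_tokens <= max_batch_prefill_tokens
            and total_tokens <= max_batch_total_tokens)
```

Our Grafana metrics suggest every scheduled batch satisfied this check, which is why the OOM is surprising.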

Grafana Dashboard

As you can see in Grafana, the process dies around 8:25 AM, yet the max tokens per batch stay within the limits defined by TGI.

[Grafana dashboard screenshots, 2024-11-13 19:54]

Error Logs

Request failed during generation: Server error: CUDA out of memory. Tried to allocate 2.10 GiB. GPU 0 has a total capacity of 79.15 GiB of which 2.08 GiB is free. Process 3895904 has 77.06 GiB memory in use. Of the allocated memory 74.87 GiB is allocated by PyTorch, with 24.33 MiB allocated in private pools (e.g., CUDA Graphs), and 326.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)	
Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 116, in serve
    server.serve(
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 303, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
    return await response
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 171, in Prefill
    generations, next_batch, timings = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 2046, in generate_token
    torch.log_softmax(out, -1, out=out)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.10 GiB. GPU 1 has a total capacity of 79.15 GiB of which 2.08 GiB is free. Process 3895903 has 77.06 GiB memory in use. Of the allocated memory 74.87 GiB is allocated by PyTorch, with 24.33 MiB allocated in private pools (e.g., CUDA Graphs), and 326.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
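
For what it's worth, the 2.10 GiB allocation that fails is roughly the size of a single logits-shaped temporary for a near-full prefill batch. This is only my own back-of-the-envelope estimate, assuming the Llama 3.1 vocabulary of 128,256 tokens and bf16 activations (2 bytes per element), not something I traced through the TGI code:

```python
# Back-of-the-envelope only (my assumptions, not taken from TGI internals):
# size of one [num_tokens, vocab_size] tensor in bf16.
vocab_size = 128_256      # Llama 3.1 vocabulary size
bytes_per_elem = 2        # bf16
for num_tokens in (8_000, 8_800, 10_049):
    gib = num_tokens * vocab_size * bytes_per_elem / 2**30
    print(f"{num_tokens} tokens -> {gib:.2f} GiB")
# prints: 8000 -> 1.91 GiB, 8800 -> 2.10 GiB, 10049 -> 2.40 GiB
```

So if the log_softmax in flash_causal_lm.py needs even one extra vocabulary-sized temporary that warmup did not account for, that alone could be enough to tip the GPU over the edge, but I may be misreading where the allocation comes from.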

Environment:

{
    model_id: "/model_data/llama3.1-70b-dev",
    revision: None,
    validation_workers: 3,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    kv_cache_dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: Some(
        7999,
    ),
    max_input_length: None,
    max_total_tokens: Some(
        8000,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "0.0.0.0",
    port: 8080,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: None,
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 0.85,
    rope_scaling: None,
    rope_factor: None,
    json_output: true,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    api_key: None,
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
    usage_stats: On,
}