We are running Llama 3.1 70B on 2 A100 GPUs with 80 GB of memory each.
From the logs we see that the warmup phase succeeded in finding the right `max_batch_total_tokens` and that TGI automatically sets `max_batch_prefill_tokens`. See below:
Default `max_batch_prefill_tokens` to 10049
...
Warming up model
Using optimized Triton indexing kernels.
Setting max batch total tokens to 30816
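For reference, here is a rough sanity check (not TGI's actual warmup code) of why a figure around 30k total tokens is plausible for this setup. The layer/head numbers come from the published Llama 3.1 70B config; the fp16 KV cache and the tensor-parallel degree of 2 are assumptions on our side.

```python
# Back-of-the-envelope KV-cache budget, using the published Llama 3.1 70B config.
# Assumptions (not from the logs): fp16 KV cache and tensor parallelism of 2.
BYTES_FP16 = 2
N_LAYERS = 80            # hidden layers in Llama 3.1 70B
N_KV_HEADS = 8           # grouped-query attention KV heads
HEAD_DIM = 128
TP_DEGREE = 2            # assumed sharding across the 2 A100s

# K and V cache bytes per token, per GPU (KV heads are split across TP ranks).
kv_bytes_per_token = 2 * N_LAYERS * (N_KV_HEADS // TP_DEGREE) * HEAD_DIM * BYTES_FP16
print(f"KV cache per token per GPU: {kv_bytes_per_token / 1024:.0f} KiB")  # 160 KiB

# The sharded fp16 weights alone take roughly 65 GiB of each 80 GiB card, so only
# a few GiB remain for the cache after activations and CUDA graphs.
free_for_cache_gib = 4.7  # assumed value, picked to show it lands near the warmup figure
max_total_tokens = int(free_for_cache_gib * 1024**3 / kv_bytes_per_token)
print(f"Supported batch total tokens: ~{max_total_tokens}")                # ~30,800
```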
Expected behavior
I expect to see no Out Of Memory errors at all if warmup has succeeded.
If TGI warmup succeeds, it means the maximum token budgets have been adapted to the available hardware, preventing TGI from reaching OOM errors during inference. Requests come in and are batched for the prefill/decode stages while respecting those budgets, so an OOM should NEVER occur. However, we are seeing OOM errors EVEN when the token budgets set by TGI are respected.
Is there a fundamental problem/bug with TGI's token budget computation, or maybe with the way memory is allocated/de-allocated?
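To be explicit about what we mean by "respecting the token budgets": conceptually, we expect the router to only admit batches whose token counts stay under both limits above. The snippet below is a simplified sketch of that rule in Python (TGI's real scheduler lives in the Rust router; the names here are purely illustrative):

```python
from dataclasses import dataclass

# Budgets reported by TGI at startup / warmup (from the logs above).
MAX_BATCH_PREFILL_TOKENS = 10_049
MAX_BATCH_TOTAL_TOKENS = 30_816

@dataclass
class Request:
    input_tokens: int      # prompt length
    max_new_tokens: int    # decoding budget requested by the client

def fits_in_budget(batch: list[Request]) -> bool:
    """Simplified version of the admission check we expect the router to enforce."""
    prefill_tokens = sum(r.input_tokens for r in batch)
    total_tokens = sum(r.input_tokens + r.max_new_tokens for r in batch)
    return (prefill_tokens <= MAX_BATCH_PREFILL_TOKENS
            and total_tokens <= MAX_BATCH_TOTAL_TOKENS)

# Example: a batch like this stays within both budgets, yet we still hit OOM.
batch = [Request(input_tokens=2_000, max_new_tokens=500) for _ in range(4)]
assert fits_in_budget(batch)
```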
Grafana Dashboard
As you can see in Grafana, the process dies around 8:25 AM, but the max tokens per batch stay within the limits defined by TGI.
Error Logs
Request failed during generation: Server error: CUDA out of memory. Tried to allocate 2.10 GiB. GPU 0 has a total capacity of 79.15 GiB of which 2.08 GiB is free. Process 3895904 has 77.06 GiB memory in use. Of the allocated memory 74.87 GiB is allocated by PyTorch, with 24.33 MiB allocated in private pools (e.g., CUDA Graphs), and 326.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Method Prefill encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 116, in serve
server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 303, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 171, in Prefill
generations, next_batch, timings = self.model.generate_token(batch)
File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 2046, in generate_token
torch.log_softmax(out, -1, out=out)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.10 GiB. GPU 1 has a total capacity of 79.15 GiB of which 2.08 GiB is free. Process 3895903 has 77.06 GiB memory in use. Of the allocated memory 74.87 GiB is allocated by PyTorch, with 24.33 MiB allocated in private pools (e.g., CUDA Graphs), and 326.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
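For what it's worth, here is a back-of-the-envelope reading of the failed allocation, under two assumptions the log does not confirm (fp16 elements and Llama 3.1's 128,256-token vocabulary): 2.10 GiB matches a temporary of roughly prefill_tokens × vocab_size elements, i.e. something sized like the prefill logits at the log_softmax call, which is separate from the KV cache that warmup sized.

```python
# Hypothetical sizing of the tensor behind "Tried to allocate 2.10 GiB".
# Assumptions (not confirmed by the log): fp16 elements, Llama 3.1 vocab of 128,256.
VOCAB_SIZE = 128_256
BYTES_FP16 = 2

failed_alloc_bytes = 2.10 * 1024**3
tokens_implied = failed_alloc_bytes / (VOCAB_SIZE * BYTES_FP16)
print(f"Implied rows (tokens) in the tensor: ~{tokens_implied:,.0f}")  # ~8,790

# ~8,790 prefill tokens is under max_batch_prefill_tokens (10,049), so a batch can
# respect the budget while a logits-sized temporary alone still needs ~2 GiB on top
# of whatever is already resident.
```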
Environment: