why not relinquish some of the memory after preprocessing step? #9796
-
We generally try to pre-allocate the necessary amount of memory for the computation at the start, instead of doing it dynamically at runtime. This has the advantage that the program will not run out of memory in the middle of the computation.
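To make the trade-off concrete, here is a minimal sketch of the "reserve the worst case up front" strategy described above. It is not llama.cpp's actual allocator; the struct and function names are illustrative, and the only point is that the buffer is sized once, at context creation, for the most demanding graph that will ever run, so later evaluations cannot fail mid-generation.

```cpp
// Minimal sketch of reserving worst-case scratch memory once up front
// (illustrative only; not the llama.cpp allocator API).
#include <algorithm>
#include <cstddef>
#include <vector>

struct compute_buffer {
    std::vector<unsigned char> data;

    // Called once at context creation: size the buffer for whichever
    // phase (prompt processing or token generation) needs more scratch
    // memory, so no allocation happens during later evaluations.
    void reserve_worst_case(std::size_t prefill_bytes, std::size_t decode_bytes) {
        data.resize(std::max(prefill_bytes, decode_bytes));
    }

    // Every graph evaluation, prefill or decode, reuses the same buffer.
    unsigned char * base() { return data.data(); }
};

int main() {
    compute_buffer buf;
    // Hypothetical sizes: 512 MiB for prefill scratch, 1 MiB for decode.
    buf.reserve_worst_case(/*prefill_bytes=*/512u << 20, /*decode_bytes=*/1u << 20);
    return 0;
}
```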
-
For the preprocessing step (i.e., processing the tokens in the context), we can use batch processing. This requires a higher memory footprint than the inference step (i.e., token generation), because more memory is needed for the attention scores (theoretical peak is roughly `n_ctx * n_ubatch * n_heads * attn_byte_precision`), and similarly the MLP layers can take token embeddings in batches (theoretical peak is roughly `n_ff * n_ubatch * model_byte_precision`). The overall peak requirement is of course whichever of the two formulas is higher, on top of the model parameters and the KV cache.

However, we don't need to keep all that memory during the inference step, because batching no longer applies: we only compute attention vectors rather than matrices (`n_ctx * n_heads * attn_byte_precision`), and only push a single token's embeddings through the MLP layers (`n_ff * model_byte_precision`).

From my own experimentation, I can see that the memory reserved for the preprocessing step's large attention score matrices (large, at least, if the context length and `n_ubatch` are big enough) isn't deallocated at all before the inference step. Is there a reason for this?
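To show the size of the gap being asked about, here is a back-of-the-envelope check of the four formulas above. The model dimensions (`n_ctx`, `n_ubatch`, `n_heads`, `n_ff`) and byte precisions are assumptions chosen for illustration, not measured llama.cpp values; with these numbers the batched attention scores dominate the prefill peak at around 512 MiB, while the decode-time buffers are on the order of a MiB.

```cpp
// Back-of-the-envelope evaluation of the peak-memory formulas above.
// All dimensions and precisions below are hypothetical.
#include <algorithm>
#include <cstdio>

int main() {
    const double n_ctx       = 8192;   // context length
    const double n_ubatch    = 512;    // physical batch size for prompt processing
    const double n_heads     = 32;     // attention heads
    const double n_ff        = 14336;  // MLP hidden size
    const double attn_bytes  = 4;      // f32 attention scores
    const double model_bytes = 2;      // f16 activations

    const double MiB = 1024.0 * 1024.0;

    // Prompt processing (batched over n_ubatch tokens):
    const double attn_prefill = n_ctx * n_ubatch * n_heads * attn_bytes;
    const double mlp_prefill  = n_ff  * n_ubatch * model_bytes;

    // Token generation (a single token at a time):
    const double attn_decode = n_ctx * n_heads * attn_bytes;
    const double mlp_decode  = n_ff  * model_bytes;

    std::printf("prefill peak: %.1f MiB (attn %.1f MiB, mlp %.1f MiB)\n",
                std::max(attn_prefill, mlp_prefill) / MiB,
                attn_prefill / MiB, mlp_prefill / MiB);
    std::printf("decode peak : %.3f MiB (attn %.3f MiB, mlp %.3f MiB)\n",
                std::max(attn_decode, mlp_decode) / MiB,
                attn_decode / MiB, mlp_decode / MiB);
    return 0;
}
```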