why not relinquish some of the memory after preprocessing step? #9796
-
We generally try to pre-allocate the necessary amount of memory for the computation at the start, instead of doing it dynamically at runtime. This has the advantage that the program will not run out of memory in the middle of the computation.
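To make the trade-off concrete, here is a minimal sketch of the "reserve the worst case up front" strategy described above. It is not llama.cpp's actual allocator; the struct and function names are illustrative, and the only point is that the buffer is sized once, at context creation, for the most demanding graph that will ever run, so later evaluations cannot fail mid-generation.

```cpp
// Minimal sketch of reserving worst-case scratch memory once up front
// (illustrative only; not the llama.cpp allocator API).
#include <algorithm>
#include <cstddef>
#include <vector>

struct compute_buffer {
    std::vector<unsigned char> data;

    // Called once at context creation: size the buffer for whichever
    // phase (prompt processing or token generation) needs more scratch
    // memory, so no allocation happens during later evaluations.
    void reserve_worst_case(std::size_t prefill_bytes, std::size_t decode_bytes) {
        data.resize(std::max(prefill_bytes, decode_bytes));
    }

    // Every graph evaluation, prefill or decode, reuses the same buffer.
    unsigned char * base() { return data.data(); }
};

int main() {
    compute_buffer buf;
    // Hypothetical sizes: 512 MiB for prefill scratch, 1 MiB for decode.
    buf.reserve_worst_case(/*prefill_bytes=*/512u << 20, /*decode_bytes=*/1u << 20);
    return 0;
}
```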
-
For the preprocessing step (i.e., processing the tokens in the context), we can use batch processing. This requires a higher memory footprint than the inference step (i.e., token generation), because more memory is needed for the attention scores (theoretical peak is roughly `n_ctx * n_ubatch * n_heads * attn_byte_precision`), and similarly the MLP layers can take token embeddings in batches (theoretical peak is roughly `n_ff * n_ubatch * model_byte_precision`). The overall peak requirement is of course whichever of the two formulas is higher, on top of the model parameters and the KV cache.

However, we don't need to keep all that memory during the inference step, because batching no longer applies: we only compute attention vectors rather than matrices (`n_ctx * n_heads * attn_byte_precision`), and only push a single token's embeddings through the MLP layers (`n_ff * model_byte_precision`).

From my own experimentation, I can see that the memory reserved for the preprocessing step's large attention score matrices (large, at least, if the context length and `n_ubatch` are big enough) isn't deallocated at all before the inference step. Is there a reason for this?
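To show the size of the gap being asked about, here is a back-of-the-envelope check of the four formulas above. The model dimensions (`n_ctx`, `n_ubatch`, `n_heads`, `n_ff`) and byte precisions are assumptions chosen for illustration, not measured llama.cpp values; with these numbers the batched attention scores dominate the prefill peak at around 512 MiB, while the decode-time buffers are on the order of a MiB.

```cpp
// Back-of-the-envelope evaluation of the peak-memory formulas above.
// All dimensions and precisions below are hypothetical.
#include <algorithm>
#include <cstdio>

int main() {
    const double n_ctx       = 8192;   // context length
    const double n_ubatch    = 512;    // physical batch size for prompt processing
    const double n_heads     = 32;     // attention heads
    const double n_ff        = 14336;  // MLP hidden size
    const double attn_bytes  = 4;      // f32 attention scores
    const double model_bytes = 2;      // f16 activations

    const double MiB = 1024.0 * 1024.0;

    // Prompt processing (batched over n_ubatch tokens):
    const double attn_prefill = n_ctx * n_ubatch * n_heads * attn_bytes;
    const double mlp_prefill  = n_ff  * n_ubatch * model_bytes;

    // Token generation (a single token at a time):
    const double attn_decode = n_ctx * n_heads * attn_bytes;
    const double mlp_decode  = n_ff  * model_bytes;

    std::printf("prefill peak: %.1f MiB (attn %.1f MiB, mlp %.1f MiB)\n",
                std::max(attn_prefill, mlp_prefill) / MiB,
                attn_prefill / MiB, mlp_prefill / MiB);
    std::printf("decode peak : %.3f MiB (attn %.3f MiB, mlp %.3f MiB)\n",
                std::max(attn_decode, mlp_decode) / MiB,
                attn_decode / MiB, mlp_decode / MiB);
    return 0;
}
```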