My laptop has an i7-7700HQ CPU, an NVIDIA GTX 1050 GPU, 32GB of main RAM and 4GB of VRAM. I'm loading the Mistral-Nemo-Instruct-2407-Q4_0 model using the following params.

I'm analyzing the total memory usage (RAM + VRAM) using different combinations of -ngl and --no-warmup.
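For reference, here is roughly how each RAM + VRAM snapshot can be taken. This is a hypothetical helper, not part of llama.cpp; it assumes the psutil and pynvml packages are installed:

```python
# Hypothetical snapshot helper (assumes psutil and pynvml are installed).
import psutil
import pynvml

def snapshot_memory_gib():
    # System RAM currently in use, across all processes.
    ram_used = psutil.virtual_memory().used

    # VRAM currently in use on GPU 0, queried through NVML.
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    vram_used = pynvml.nvmlDeviceGetMemoryInfo(handle).used
    pynvml.nvmlShutdown()

    gib = 1024 ** 3
    return ram_used / gib, vram_used / gib

ram, vram = snapshot_memory_gib()
print(f"RAM: {ram:.1f} GiB + VRAM: {vram:.1f} GiB = {ram + vram:.1f} GiB total")
```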
- -ngl 0: total after warm-up = ~10GB RAM + ~0GB VRAM = ~10GB total.
- -ngl 0 with --no-warmup: ~10GB RAM + ~0GB VRAM = ~10GB total.
- -ngl 128: total after warm-up = ~10GB RAM + ~4GB VRAM = ~14GB total.
- -ngl 19: total after warm-up = ~10GB RAM + ~4GB VRAM = ~14GB total.
- -ngl 18: total after warm-up = ~6GB RAM + ~4GB VRAM = ~10GB total.
- -ngl 128 with --no-warmup: before warm-up = ~6GB RAM + ~4GB VRAM = ~10GB total; after warm-up it is again ~10GB RAM + ~4GB VRAM = ~14GB total.
NB: -ngl 14 is the max number of layers that fits entirely into the 4GB of VRAM. Everything above 14 needs UMA (unified memory).
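For a rough sense of scale (back-of-envelope, taking the numbers above at face value): if 14 layers fill the 4GB of VRAM, each offloaded layer costs on the order of a few hundred MB:

```python
# Back-of-envelope: average VRAM cost per offloaded layer,
# assuming the 14 offloaded layers account for most of the 4GB.
vram_mb = 4 * 1024
max_layers_in_vram = 14
print(f"~{vram_mb / max_layers_in_vram:.0f} MB per layer")  # ~293 MB
```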
Why is the total memory usage 14GB in some cases (when trying to allocate significantly more VRAM than is actually available), while in the rest of the cases it is 10GB? Shouldn't it always be roughly constant?
Why is there a sudden spike in RAM usage (from 6GB to 10GB) after warm-up when going from -ngl 18 to -ngl 19? The logs show only a few hundred MB of difference between those two settings, so this one extra layer is not extremely large compared to the other layers. Also, from -ngl 19 up to -ngl 128 (this model has 41 layers), the total memory usage is always 14GB.
Any ideas what I'm doing wrong?
Update: a few observations from today's testing.
- llama.cpp allocates two main buffers in VRAM: one for the model's layers (weights) and one for the KV cache (a rough size estimate is sketched after this list).
- The 4GB increase in total memory usage occurs when these two buffers together cannot fit in the free VRAM. I'm still looking into why this happens.
- Inference performance drops significantly (~10x) when the buffer for the model's layers does not fit in the free VRAM. It is totally fine (in terms of performance) if only the KV-cache buffer is allocated in RAM via unified memory.
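For reference, here is a rough way to estimate the two buffer sizes. The architecture numbers (8 KV heads, head dim 128, f16 KV cache, 4096-token context) are my assumptions about Mistral-Nemo, not something taken from the logs:

```python
# Rough estimate of the two VRAM buffers for a given -ngl value.
n_offloaded = 19               # layers offloaded with -ngl 19
n_ctx = 4096                   # assumed context size
n_kv_heads, head_dim = 8, 128  # assumed Mistral-Nemo attention shape (GQA)
bytes_f16 = 2                  # KV cache assumed to be stored as f16

# Weights buffer: reuse the ~4GB / 14 layers observation as a per-layer cost.
weights_gib = n_offloaded * 4.0 / 14

# KV-cache buffer: one K and one V tensor per offloaded layer, per token.
kv_bytes = 2 * n_offloaded * n_ctx * n_kv_heads * head_dim * bytes_f16
kv_gib = kv_bytes / 1024**3

print(f"weights: ~{weights_gib:.1f} GiB, KV cache: ~{kv_gib:.2f} GiB")
```

With these assumptions, -ngl 19 already asks for noticeably more than the 4GB of physical VRAM, which matches where the behavior changes.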