No effect of loading model in RAM? #14212
Unanswered
wouterverduin asked this question in Q&A
Replies: 1 comment 2 replies
- This is expected. mmap also loads the model into RAM, but it shows up as cache memory instead, and will get evicted first if your RAM starts running out due to other processes. It basically uses the file on disk as a page file for itself. See https://en.wikipedia.org/wiki/Mmap
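A rough way to see this with standard Linux tools (a sketch; the exact figures will differ on your system):

  # before loading: note the "used" and "buff/cache" columns
  free -h
  # start the server with the default mmap behaviour (no --no-mmap / --mlock)
  /root/llama.cpp/build/bin/llama-server -m /root/models/Qwen3-30B-A3B.gguf --host 0.0.0.0 --port 9090 &
  # once the weights have been read during inference, "buff/cache" grows by
  # roughly the model size while "used" stays almost flat; under memory
  # pressure those pages are dropped and re-read from the .gguf on demand.
  # With --no-mmap the same bytes show up under "used" instead.
  free -h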
- Hi all!

I have been scratching my head over this for quite some time now. I'm new to llama.cpp, so I'm hoping for some direction.

Currently I am experimenting with running models on CPU only on my server. The setup is Proxmox with an Ubuntu LXC running only llama.cpp. I have allocated 8 CPU cores and 32 GB of memory (DDR3 ECC). The storage is an M.2 SSD on the Proxmox host.

What I noticed is that when running llama-server, Proxmox doesn't show any memory usage, which seemed odd to me. After some googling I tried the --mlock and --no-mmap options, combined and separately, and noticed that with --no-mmap the model does get correctly loaded into RAM, but neither argument changes the inference speed at all.
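For completeness, the per-process view from inside the container can be checked with something like this (virtual size includes the whole mapped model file, resident size only what has actually been read so far):

  ps -o pid,vsz,rss,comm -C llama-server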
Models I have tried so far: MedGemma 4B-IT, DeepSeek R1 Qwen3 8B distill, Qwen3 8B, and Qwen3 30B A3B.
Example command:
/root/llama.cpp/build/bin/llama-server -m /root/models/Qwen3-30B-A3B.gguf --host 0.0.0.0 --port 9090
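The variants with the memory options mentioned above look something like this (my reading of the flags, going by llama-server --help):

  # disable mmap and read the whole model into RAM up front
  /root/llama.cpp/build/bin/llama-server -m /root/models/Qwen3-30B-A3B.gguf --host 0.0.0.0 --port 9090 --no-mmap
  # keep the default mmap but pin the mapped pages so they cannot be evicted or swapped
  /root/llama.cpp/build/bin/llama-server -m /root/models/Qwen3-30B-A3B.gguf --host 0.0.0.0 --port 9090 --mlock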
Is a speed increase to be expected from preloading the model in RAM? And are there ways to speed up CPU inference that I am not (yet) aware of? For example, Qwen3 30B A3B currently runs at ~5.5 tokens/s.