No effect of loading model in RAM? #14212
Unanswered
wouterverduin asked this question in Q&A
Replies: 1 comment 2 replies
- This is expected. mmap also loads the model into RAM, but it shows up as cache memory instead, and will get evicted first if your RAM starts running out due to other processes. It basically uses the file on disk as a page file for itself. See https://en.wikipedia.org/wiki/Mmap
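A rough way to see this with standard Linux tools (a sketch; the exact figures will differ on your system):

  # before loading: note the "used" and "buff/cache" columns
  free -h
  # start the server with the default mmap behaviour (no --no-mmap / --mlock)
  /root/llama.cpp/build/bin/llama-server -m /root/models/Qwen3-30B-A3B.gguf --host 0.0.0.0 --port 9090 &
  # once the weights have been read during inference, "buff/cache" grows by
  # roughly the model size while "used" stays almost flat; under memory
  # pressure those pages are dropped and re-read from the .gguf on demand.
  # With --no-mmap the same bytes show up under "used" instead.
  free -h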
- Hi all!

I have been scratching my head over this for quite some time now. I'm new to llama.cpp, so I'm hoping for some direction.

Currently I am experimenting with running models on CPU only on my server. The setup is Proxmox with an Ubuntu LXC running only llama.cpp. I have allocated 8 CPU cores and 32 GB of memory (DDR3 ECC). The storage is an M.2 SSD on the Proxmox host.

What I noticed is that when running llama-server, Proxmox doesn't show any memory usage, which seemed odd to me. After some googling I tried the --mlock and --no-mmap options, combined and separately, and noticed that with --no-mmap the model does get correctly loaded into RAM, but neither argument changes the inference speed at all.
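For completeness, the per-process view from inside the container can be checked with something like this (virtual size includes the whole mapped model file, resident size only what has actually been read so far):

  ps -o pid,vsz,rss,comm -C llama-server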
Models I have tried so far: MedGemma 4B-IT, DeepSeek R1 Qwen3 8B distill, Qwen3 8B, and Qwen3 30B A3B.
Example command:
/root/llama.cpp/build/bin/llama-server -m /root/models/Qwen3-30B-A3B.gguf --host 0.0.0.0 --port 9090
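The variants with the memory options mentioned above look something like this (my reading of the flags, going by llama-server --help):

  # disable mmap and read the whole model into RAM up front
  /root/llama.cpp/build/bin/llama-server -m /root/models/Qwen3-30B-A3B.gguf --host 0.0.0.0 --port 9090 --no-mmap
  # keep the default mmap but pin the mapped pages so they cannot be evicted or swapped
  /root/llama.cpp/build/bin/llama-server -m /root/models/Qwen3-30B-A3B.gguf --host 0.0.0.0 --port 9090 --mlock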
Is a speed increase to be expected from preloading the model in RAM? And are there ways to speed up CPU inference that I am not (yet) aware of? For example, Qwen3 30B A3B currently runs at ~5.5 tokens/s.