Memory consumption #15372

pirj · 2025-08-17T13:22:19Z

pirj
Aug 17, 2025

My machine is a MacBook Air M3 24 GB.

I've noticed that the same models that other tools can run, fail with out-of-memory errors with llama.cpp.

Qwen3:30b-a3b Q4_K_M: runs just fine with Ollama, but went out of memory with llama.cpp. Ollama's model fetched from their registry, llama.cpp's - from HF.
Is this a fair comparison?
EXAONE-4.0-32B-4bit Q4_K_M failed at first with mlx-lm, but after their hint how to increase the memory limit:

sudo sysctl iogpu.wired_limit_mb=20480

mlx_lm.chat --model mlx-community/EXAONE-4.0-32B-4bit --max-tokens=2048 worked just fine (5 t/s though).
llama.cpp's model from HF, MLX-lm's too, but from mlx-community.
Running

$ llama-cli -hf LGAI-EXAONE/EXAONE-4.0-32B-GGUF:Q4_K_M

with 20G wired_limit_mb worked:

llama_model_load_from_file_impl: using device Metal (Apple M3) - 20479 MiB free
...
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.01 GiB (4.83 BPW)
...
load_tensors: offloading 64 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   281.25 MiB
load_tensors: Metal_Mapped model buffer size = 18443.68 MiB

With the default 16384 wired_limit_mb, it failed:

llama_model_load_from_file_impl: using device Metal (Apple M3) - 16383 MiB free
...
ggml_backend_metal_log_allocated_size: warning: current allocated size is greater than the recommended max working set size
...
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)

which is on par with mlx-lm. I can't compare to Ollama, as this model's support only landed in a pre-release version.

I wouldn't recommend stretching the iogpu.wired_limit_mb setting, as once (when I forgot to quit mlx-lm before running llama.cpp), the whole laptop hang for half a minute, and after a minute of inference, rebooted.
Separately, both mlx-lm an llama.cpp seem to be quite stable with the 20480 setting, but it's just a sample of one, and not a big stretch either.

Answered by ggerganov

Aug 18, 2025

With Metal, you always want to add -fa.

Increasing the memory limit is needed for such big models because they cannot fit in the default memory limit - not sure you can do anything else here.

View full answer

abc-nix · 2025-08-17T14:36:31Z

abc-nix
Aug 17, 2025

I have seen on non-Apple iGPUs that memory use is much higher if not using --no-mmap when offloading some/all layers to GPU. Try to use this option and see how it goes.

Someone who uses Apple's devices should be able to tell if this is necessary or not.

0 replies

ggerganov · 2025-08-18T15:25:37Z

ggerganov
Aug 18, 2025
Maintainer

With Metal, you always want to add -fa.

Increasing the memory limit is needed for such big models because they cannot fit in the default memory limit - not sure you can do anything else here.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Memory consumption #15372

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Memory consumption #15372

Uh oh!

pirj Aug 17, 2025

Replies: 2 comments

Uh oh!

abc-nix Aug 17, 2025

Uh oh!

ggerganov Aug 18, 2025 Maintainer

pirj
Aug 17, 2025

abc-nix
Aug 17, 2025

ggerganov
Aug 18, 2025
Maintainer