Distributed inference #15796
Unanswered
alejandrods
asked this question in
Q&A
Replies: 0 comments
I have 3 Macs with 512 GB of unified memory each. I have been testing the distributed inference of llama.cpp with Qwen3-30B-A3B-Instruct-2507 (just for testing) and it works great, but I have a few questions:

- Is `llama-server` also used for inference, or only for tokenization? I have been looking at `llama-server.cpp`.
- I configured `llama-server` with `--parallel 3` to process 3 requests in parallel. Does `update_slots: id 2` mean that this is request number 2?
- Why do the `rpc-server`s show me "Null buffer for tensor passed to init_tensor function" before the connection is closed?

Thank you so much!
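For reference, here is a minimal sketch of the kind of setup described above. This assumes llama.cpp was built with RPC support enabled (`-DGGML_RPC=ON`); the hostnames `mac2`/`mac3`, the port `50052`, and the model filename are placeholders, not values from the original post:

```shell
# On each worker Mac: expose the local ggml backend over RPC.
# The port is arbitrary; it just has to match what the head node uses.
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the head Mac: llama-server splits the model across the RPC workers
# and serves the HTTP API; --parallel 3 creates 3 processing slots, so
# logs such as "update_slots: id 2" refer to slot ids 0..2, not to a
# running request counter.
./build/bin/llama-server \
  -m Qwen3-30B-A3B-Instruct-2507-Q8_0.gguf \
  --rpc mac2:50052,mac3:50052 \
  --parallel 3
```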