Replies: 2 comments
-
Are you sure the RTX 4070 supports Flash Attention?
-
I think you can change the
-
I'm using an RTX 4070 (12 GB) with 32 GB of DDR5 RAM. This is the command I use:
.\build\bin\llama-server.exe -m D:\llama.cpp\models\Qwen3-30B-A3B-UD-Q3_K_XL.gguf -c 32768 --port 9999 -ngl 99 --no-webui --device CUDA0 -fa -ot ".ffn_.*_exps.=CPU"
Long prompts take over a minute to process. Is there any way to increase prompt-processing speed? The command only uses ~5 GB of VRAM, so I suppose there's room for improvement.
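For context, the `-ot ".ffn_.*_exps.=CPU"` pattern overrides placement for every MoE expert tensor, sending them all to the CPU, which is why so little VRAM is in use. A hedged sketch of two variations that might help (the flag values and the layer split are assumptions to tune, not measured on this setup):

```
REM Sketch, not tested on this hardware. Two ideas:
REM  1. -ub raises the micro-batch used during prompt processing
REM     (llama.cpp default is 512); a larger value can speed up
REM     long-prompt ingestion at the cost of extra VRAM.
REM  2. A narrower -ot regex offloads only the experts of layers 10+
REM     to the CPU, keeping the earlier layers' experts on the GPU to
REM     use some of the spare ~7 GB (adjust the range to fit 12 GB).
.\build\bin\llama-server.exe -m D:\llama.cpp\models\Qwen3-30B-A3B-UD-Q3_K_XL.gguf -c 32768 --port 9999 -ngl 99 --no-webui --device CUDA0 -fa -ub 2048 -ot "blk\.[1-9][0-9]\.ffn_.*_exps\.=CPU"
```

The regex matches tensor names like `blk.12.ffn_gate_exps.weight` (layers 10-99) while leaving `blk.0.` through `blk.9.` unmatched, so with `-ngl 99` those layers stay fully on the GPU; watch VRAM usage and widen or narrow the range accordingly.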