Replies: 2 comments
-
Are you sure the RTX 4070 supports Flash Attention?
-
I think you can change the
-
I'm using an RTX 4070 (12 GB) with 32 GB of DDR5 RAM. This is the command I use:
.\build\bin\llama-server.exe -m D:\llama.cpp\models\Qwen3-30B-A3B-UD-Q3_K_XL.gguf -c 32768 --port 9999 -ngl 99 --no-webui --device CUDA0 -fa -ot ".ffn_.*_exps.=CPU"
Long prompts take over a minute to process. Is there any way to increase prompt-processing speed? The command only uses ~5 GB of VRAM, so I suppose there's room for improvement.
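For context, the `-ot ".ffn_.*_exps.=CPU"` pattern overrides placement for every MoE expert tensor, sending them all to the CPU, which is why so little VRAM is in use. A hedged sketch of two variations that might help (the flag values and the layer split are assumptions to tune, not measured on this setup):

```
REM Sketch, not tested on this hardware. Two ideas:
REM  1. -ub raises the micro-batch used during prompt processing
REM     (llama.cpp default is 512); a larger value can speed up
REM     long-prompt ingestion at the cost of extra VRAM.
REM  2. A narrower -ot regex offloads only the experts of layers 10+
REM     to the CPU, keeping the earlier layers' experts on the GPU to
REM     use some of the spare ~7 GB (adjust the range to fit 12 GB).
.\build\bin\llama-server.exe -m D:\llama.cpp\models\Qwen3-30B-A3B-UD-Q3_K_XL.gguf -c 32768 --port 9999 -ngl 99 --no-webui --device CUDA0 -fa -ub 2048 -ot "blk\.[1-9][0-9]\.ffn_.*_exps\.=CPU"
```

The regex matches tensor names like `blk.12.ffn_gate_exps.weight` (layers 10-99) while leaving `blk.0.` through `blk.9.` unmatched, so with `-ngl 99` those layers stay fully on the GPU; watch VRAM usage and widen or narrow the range accordingly.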