Hi,
I'm testing Gemma 3 27B via llama.cpp, but I'm running into attention/context issues during chat. After a few turns, the model seems to forget the system prompt and loses track of the conversation.
Performance-wise, llama.cpp is ~3x faster than Ollama, but Ollama handles context much better for Gemma 3.
I couldn't find an official chat template for Gemma 3, so I'm sending the conversation as a raw prompt and formatting messages like this:
<start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
Gemma<end_of_turn>
<start_of_turn>model
Gemma who?<end_of_turn>
Has anyone found a better way to structure prompts or maintain context? Tips appreciated!
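For concreteness, here's roughly how I build that prompt now (`build_gemma_prompt` is just my own helper, and folding the system prompt into the first user turn is a guess on my part, since Gemma doesn't define a system role):

```python
# Sketch of how I'm assembling the raw prompt. build_gemma_prompt is my
# own helper, not part of llama.cpp; Gemma has no separate system role,
# so prepending the system prompt to the first user turn is a guess.
def build_gemma_prompt(system_prompt, messages):
    """messages: list of {"role": "user" | "model", "content": str}"""
    parts = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        if i == 0 and system_prompt and msg["role"] == "user":
            content = system_prompt + "\n\n" + content
        parts.append(f"<start_of_turn>{msg['role']}\n{content}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # cue the model's next turn
    return "".join(parts)

prompt = build_gemma_prompt(
    "You are a helpful assistant.",
    [
        {"role": "user", "content": "knock knock"},
        {"role": "model", "content": "who is there"},
        {"role": "user", "content": "Gemma"},
    ],
)
print(prompt)
```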
Replies: 2 comments · 2 replies
-
Try adding the `--keep 1` argument (see `llama-server --help`).
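The same knob is also exposed per-request as `n_keep` on the server's `/completion` endpoint, if you drive it over HTTP. A minimal sketch, assuming a llama-server instance on localhost:8080:

```python
import requests

# Minimal sketch, assuming llama-server is running on localhost:8080.
# n_keep = number of prompt tokens the server retains when the context
# fills up and old tokens are shifted out; without it, the start of the
# prompt (typically the system prompt) is the first thing to be dropped.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "<start_of_turn>user\nknock knock<end_of_turn>\n"
                  "<start_of_turn>model\n",
        "n_predict": 128,
        "n_keep": 64,  # tune to cover your system prompt's token count
    },
)
print(resp.json()["content"])
```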
-
There is a problem with the activation range for f16 quants in Gemma 3 which might be influencing this: https://www.unsloth.ai/blog/gemma3. One option may be to move to bf16 or f32 for the activations.
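To see why the range matters: float16 tops out around 65504, so any activation beyond that overflows to inf, while bf16 keeps float32's exponent range. A quick illustration (the 1e5 value is a made-up stand-in for a large Gemma 3 activation):

```python
import numpy as np

# float16 overflows past ~65504, so a large activation becomes inf;
# bf16 keeps float32's 8-bit exponent (range up to ~3.4e38) and survives.
act = np.array([1.0e5], dtype=np.float32)  # made-up large activation value
print(act.astype(np.float16))  # [inf]      -- out of f16 range
print(act.astype(np.float32))  # [100000.]  -- fine in f32 (and bf16)
```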