Hi,
I'm testing Gemma 3 27B via llama.cpp, but I'm running into attention/context issues during chat. After a few turns, the model seems to forget the system prompt and loses track of the conversation.
Performance-wise, llama.cpp is ~3x faster than Ollama, but Ollama handles context much better for Gemma 3.
I couldn't find an official chat template for Gemma 3, so I'm sending the conversation as a raw prompt and formatting messages like this:
<start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
Gemma<end_of_turn>
<start_of_turn>model
Gemma who?<end_of_turn>
Has anyone found a better way to structure prompts or maintain context? Tips appreciated!
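For concreteness, here's roughly how I build that prompt now (`build_gemma_prompt` is just my own helper, and folding the system prompt into the first user turn is a guess on my part, since Gemma doesn't define a system role):

```python
# Sketch of how I'm assembling the raw prompt. build_gemma_prompt is my
# own helper, not part of llama.cpp; Gemma has no separate system role,
# so prepending the system prompt to the first user turn is a guess.
def build_gemma_prompt(system_prompt, messages):
    """messages: list of {"role": "user" | "model", "content": str}"""
    parts = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        if i == 0 and system_prompt and msg["role"] == "user":
            content = system_prompt + "\n\n" + content
        parts.append(f"<start_of_turn>{msg['role']}\n{content}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # cue the model's next turn
    return "".join(parts)

prompt = build_gemma_prompt(
    "You are a helpful assistant.",
    [
        {"role": "user", "content": "knock knock"},
        {"role": "model", "content": "who is there"},
        {"role": "user", "content": "Gemma"},
    ],
)
print(prompt)
```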
Replies: 2 comments · 2 replies
-
Try adding the `--keep 1` argument (see `llama-server --help`).
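The same knob is also exposed per-request as `n_keep` on the server's `/completion` endpoint, if you drive it over HTTP. A minimal sketch, assuming a llama-server instance on localhost:8080:

```python
import requests

# Minimal sketch, assuming llama-server is running on localhost:8080.
# n_keep = number of prompt tokens the server retains when the context
# fills up and old tokens are shifted out; without it, the start of the
# prompt (typically the system prompt) is the first thing to be dropped.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "<start_of_turn>user\nknock knock<end_of_turn>\n"
                  "<start_of_turn>model\n",
        "n_predict": 128,
        "n_keep": 64,  # tune to cover your system prompt's token count
    },
)
print(resp.json()["content"])
```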
-
There is a problem with the activation range for f16 quants in Gemma 3 which might be influencing this: https://www.unsloth.ai/blog/gemma3. One option may be to move to bf16 or f32 for the activations.
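To see why the range matters: float16 tops out around 65504, so any activation beyond that overflows to inf, while bf16 keeps float32's exponent range. A quick illustration (the 1e5 value is a made-up stand-in for a large Gemma 3 activation):

```python
import numpy as np

# float16 overflows past ~65504, so a large activation becomes inf;
# bf16 keeps float32's 8-bit exponent (range up to ~3.4e38) and survives.
act = np.array([1.0e5], dtype=np.float32)  # made-up large activation value
print(act.astype(np.float16))  # [inf]      -- out of f16 range
print(act.astype(np.float32))  # [100000.]  -- fine in f32 (and bf16)
```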