How should the client handle thinking blocks? #15333
Unanswered
GlasslessPizza asked this question in Q&A
I have a few questions about the correct handling of reasoning blocks, and I would really appreciate some feedback. I'm upgrading a simple Python client frontend to support reasoning models. At the moment I'm using streaming mode with the v1/chat/completions endpoint of llama-server.
Can llama-server, running in SSE streaming mode, mark tokens in some way so that thinking tokens can be differentiated from normal tokens?
This would spare the frontend from implementing fragile parsing of thinking delimiters on a per-model basis.
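For context, here is a minimal sketch of what I mean on the client side, assuming a hypothetical separate delta field (called `reasoning_content` below; I have not confirmed that llama-server emits this in streaming mode):

```python
# Sketch of the distinction I'm hoping for: thinking tokens arriving in their
# own streaming delta field. "reasoning_content" is an assumed field name here,
# NOT something I have confirmed llama-server emits in streaming mode.
import json
import requests

def stream_chat(messages, url="http://localhost:8080/v1/chat/completions"):
    payload = {"messages": messages, "stream": True}
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for raw in resp.iter_lines(decode_unicode=True):
            if not raw or not raw.startswith("data: "):
                continue
            data = raw[len("data: "):]
            if data == "[DONE]":
                break
            delta = json.loads(data)["choices"][0]["delta"]
            if delta.get("reasoning_content"):      # hypothetical thinking channel
                yield ("thinking", delta["reasoning_content"])
            if delta.get("content"):                # normal answer tokens
                yield ("answer", delta["content"])
```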
Does llama-server automatically remove past thinking blocks when processing, or should the frontend take care to remove them before sending?
According to this, models are typically trained to expect that thinking blocks attached to past messages are removed.
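Right now the frontend handles this defensively with a regex. The sketch below assumes `<think>...</think>` delimiters, which is exactly the per-model assumption I would like to avoid:

```python
# What the frontend does defensively today: strip thinking blocks from past
# assistant turns before resending the history. The <think>...</think> pair is
# an assumption -- the delimiters vary per model, which is the problem.
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_past_thinking(messages):
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant" and isinstance(msg.get("content"), str):
            msg = {**msg, "content": THINK_BLOCK.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned
```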
The HTTP API has the parameter "thinking_forced_open". Is there also a "thinking_forced_off"? ("reasoning-budget" is only available as a CLI argument.)
The coexistence of these two parameters would abstract the switching logic for hybrid models, so the frontend would not need to support per-model textual switches like "/no_think" or prefill an empty thinking block (or whatever proprietary delimiter string the model expects). It would also iron out another inconsistency: some models default to non-thinking mode with the option to enable thinking, while others default to thinking mode with the option to disable it.
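To illustrate, this is the kind of per-model switching the frontend has to carry today and that a "thinking_forced_off" parameter would replace. Both branches are placeholders of my own, not confirmed conventions for any particular model:

```python
# Illustrative only: the switch string and the empty-think-block prefill stand
# in for whatever each model family actually expects.
def disable_thinking(messages, model_family):
    if model_family == "textual-switch":
        # e.g. models that expect a magic string appended to the user turn
        messages[-1]["content"] += " /no_think"
    elif model_family == "prefill-switch":
        # e.g. models that expect the assistant turn prefilled with an empty
        # thinking block (the delimiter string is model-specific)
        messages.append({"role": "assistant", "content": "<think></think>"})
    return messages
```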
In the docs I found a few other parameters related to managing thinking blocks, but they were not of much use to me (a request sketch follows after the list):
CLI llama-server arguments:
- `--reasoning-format` | always set to "none" by default for streaming mode, so effectively ignored;
- `--reasoning-budget 0` | error "Assistant response prefill is incompatible with enable_thinking";
- `--chat_template_kwargs '{"enable_thinking": true/false}'` | only a few models support this, and it also fails with "Assistant response prefill is incompatible with enable_thinking";

HTTP API parameters:
- `reasoning_format` | I suspect this behaves the same as the CLI argument;
- `chat_template_kwargs: {"enable_thinking": true/false}` | only a few models support this, and it also fails with "Assistant response prefill is incompatible with enable_thinking".
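For reference, this is the request shape I am using for the `chat_template_kwargs` route on the models that honor it (a sketch; the server URL and the exact payload fields are just my reading of the docs above):

```python
# Sketch: passing enable_thinking via chat_template_kwargs in the HTTP body.
# This only works for the few models whose chat template actually reads the flag.
import requests

payload = {
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
    "chat_template_kwargs": {"enable_thinking": False},
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```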