How should the client handle thinking blocks? #15333
Unanswered
GlasslessPizza asked this question in Q&A
I have a few questions about the correct handling of reasoning blocks, and I would really appreciate some feedback. I'm upgrading a simple Python client frontend to support reasoning models. At the moment I'm using streaming mode with the v1/chat/completions endpoint of llama-server.
Can llama-server, running in SSE streaming mode, mark tokens in some way so that thinking tokens can be differentiated from normal tokens?
This would spare the frontend from implementing fragile parsing of thinking delimiters on a per-model basis.
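For context, here is a minimal sketch of what I mean on the client side, assuming a hypothetical separate delta field (called `reasoning_content` below; I have not confirmed that llama-server emits this in streaming mode):

```python
# Sketch of the distinction I'm hoping for: thinking tokens arriving in their
# own streaming delta field. "reasoning_content" is an assumed field name here,
# NOT something I have confirmed llama-server emits in streaming mode.
import json
import requests

def stream_chat(messages, url="http://localhost:8080/v1/chat/completions"):
    payload = {"messages": messages, "stream": True}
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for raw in resp.iter_lines(decode_unicode=True):
            if not raw or not raw.startswith("data: "):
                continue
            data = raw[len("data: "):]
            if data == "[DONE]":
                break
            delta = json.loads(data)["choices"][0]["delta"]
            if delta.get("reasoning_content"):      # hypothetical thinking channel
                yield ("thinking", delta["reasoning_content"])
            if delta.get("content"):                # normal answer tokens
                yield ("answer", delta["content"])
```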
Does llama-server automatically remove past thinking blocks when processing, or should the frontend take care to remove them before sending?
According to this, models are typically trained to expect that thinking blocks attached to past messages are removed.
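Right now the frontend handles this defensively with a regex. The sketch below assumes `<think>...</think>` delimiters, which is exactly the per-model assumption I would like to avoid:

```python
# What the frontend does defensively today: strip thinking blocks from past
# assistant turns before resending the history. The <think>...</think> pair is
# an assumption -- the delimiters vary per model, which is the problem.
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_past_thinking(messages):
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant" and isinstance(msg.get("content"), str):
            msg = {**msg, "content": THINK_BLOCK.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned
```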
The HTTP API has the parameter "thinking_forced_open". Is there also a "thinking_forced_off"? ("reasoning-budget" is only available as a CLI argument.)
The coexistence of these two parameters would abstract the switching logic for hybrid models, so the frontend would not need to support per-model textual switches like "/no_think" or prefill an empty thinking block (or whatever proprietary delimiter string the model expects). It would also iron out another inconsistency: some models default to non-thinking mode with the option to enable thinking, while others default to thinking mode with the option to disable it.
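To illustrate, this is the kind of per-model switching the frontend has to carry today and that a "thinking_forced_off" parameter would replace. Both branches are placeholders of my own, not confirmed conventions for any particular model:

```python
# Illustrative only: the switch string and the empty-think-block prefill stand
# in for whatever each model family actually expects.
def disable_thinking(messages, model_family):
    if model_family == "textual-switch":
        # e.g. models that expect a magic string appended to the user turn
        messages[-1]["content"] += " /no_think"
    elif model_family == "prefill-switch":
        # e.g. models that expect the assistant turn prefilled with an empty
        # thinking block (the delimiter string is model-specific)
        messages.append({"role": "assistant", "content": "<think></think>"})
    return messages
```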
In the docs I found a few other parameters related to managing thinking blocks, but they were not of much use to me (a request sketch follows after the list):
CLI llama-server arguments:
- `--reasoning-format` | always set to "none" by default for streaming mode, so effectively ignored;
- `--reasoning-budget 0` | error "Assistant response prefill is incompatible with enable_thinking";
- `--chat_template_kwargs '{"enable_thinking": true/false}'` | only a few models support this, and it also fails with "Assistant response prefill is incompatible with enable_thinking";

HTTP API parameters:
- `reasoning_format` | I suspect this behaves the same as the CLI argument;
- `chat_template_kwargs: {"enable_thinking": true/false}` | only a few models support this, and it also fails with "Assistant response prefill is incompatible with enable_thinking".
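For reference, this is the request shape I am using for the `chat_template_kwargs` route on the models that honor it (a sketch; the server URL and the exact payload fields are just my reading of the docs above):

```python
# Sketch: passing enable_thinking via chat_template_kwargs in the HTTP body.
# This only works for the few models whose chat template actually reads the flag.
import requests

payload = {
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
    "chat_template_kwargs": {"enable_thinking": False},
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```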