fix(responses): stream reasoning as a live reasoning item (#9658) #10284
Open
localai-bot wants to merge 1 commit into
…9658)

In the /v1/responses streaming handler, a reasoning model's thinking monologue was streamed to the client as normal message text (a msg_ output item with output_text.delta) and only reclassified into a reasoning item after the stream completed. Subsequent output_text.delta events also kept referencing the old msg_ item id instead of the reasoning_ id.

Root causes:

1. The live reasoning item was gated on extractor.Reasoning(), which is only updated by the Go-side raw-tag parser (ProcessToken). When the C++ autoparser drives reasoning through reasoning_content ChatDeltas, the reasoning delta is computed via ProcessChatDeltaReasoning into a separate accumulator, so extractor.Reasoning() stays empty and the gate never fired. The reasoning item was thus only reconstructed at end-of-stream.
2. The non-tool-call path created the message/msg_ output item eagerly before any token, forcing reasoning to a higher output index and making mis-split <think> text land on the pre-existing message item.
3. Neither path carried the sticky preferAutoparser flag, so a content-only autoparser (the non-jinja pure-content fallback, #9985) could leak <think>...</think> tokens into content.

Extract the per-token reasoning-vs-message classification into a pure, unit-tested streamReasoningRouter (mirroring chooseDeferredReasoning and processStream in the chat streaming worker): it gates the reasoning item on the reasoning delta, opens the message item lazily on the first content delta, and keeps a sticky preferAutoparser fallback. Both streaming paths now route reasoning deltas to the reasoning_ id and order the reasoning item ahead of the message at completion.

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
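The per-token routing described above can be sketched in Go as follows. This is a minimal illustration, not the actual streamReasoningRouter: the router, Event, newRouter, and Route names, the event kinds, and the item IDs are hypothetical, and each token is assumed to arrive already split into a reasoning delta and a content delta.

```go
package main

import "fmt"

// Event is a simplified streamed event. The kinds ("item.added", "delta") and
// field names are illustrative placeholders, not LocalAI's wire format.
type Event struct {
	Kind        string
	ItemID      string
	OutputIndex int
	Text        string
}

// router is a hypothetical sketch of per-token reasoning-vs-message routing:
// the reasoning item opens on the first non-empty reasoning delta, and the
// message item opens lazily on the first non-empty content delta.
type router struct {
	reasoningID, messageID   string
	reasoningIdx, messageIdx int // -1 until the item has been opened
	next                     int // next free output index
}

func newRouter(reasoningID, messageID string) *router {
	return &router{reasoningID: reasoningID, messageID: messageID, reasoningIdx: -1, messageIdx: -1}
}

// open assigns the next output index to an item and emits its "added" event.
func (r *router) open(id string, idx *int, out []Event) []Event {
	*idx = r.next
	r.next++
	return append(out, Event{Kind: "item.added", ItemID: id, OutputIndex: *idx})
}

// Route classifies one token's deltas. It gates on the per-token reasoning
// delta rather than on accumulated extractor state.
func (r *router) Route(reasoningDelta, contentDelta string) []Event {
	var out []Event
	if reasoningDelta != "" {
		if r.reasoningIdx < 0 {
			out = r.open(r.reasoningID, &r.reasoningIdx, out)
		}
		out = append(out, Event{Kind: "delta", ItemID: r.reasoningID, OutputIndex: r.reasoningIdx, Text: reasoningDelta})
	}
	if contentDelta != "" {
		if r.messageIdx < 0 {
			out = r.open(r.messageID, &r.messageIdx, out)
		}
		out = append(out, Event{Kind: "delta", ItemID: r.messageID, OutputIndex: r.messageIdx, Text: contentDelta})
	}
	return out
}

func main() {
	r := newRouter("reasoning_1", "msg_1")
	// Each pair is {reasoningDelta, contentDelta} for one streamed token.
	tokens := [][2]string{{"Let me think", ""}, {" it over.", ""}, {"", "The answer is 42."}}
	for _, t := range tokens {
		for _, ev := range r.Route(t[0], t[1]) {
			fmt.Printf("%-10s %-11s idx=%d %q\n", ev.Kind, ev.ItemID, ev.OutputIndex, ev.Text)
		}
	}
}
```

Because the message item is opened only when content first appears, the reasoning item takes the lower output index, and later reasoning deltas keep pointing at the reasoning_ id instead of msg_.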
Closes #9658
Problem
On the /v1/responses (Responses API) streaming path, a reasoning model's <think> monologue was streamed to the client as ordinary message text (output_text.delta on a msg_ item) and only reclassified into a reasoning output item after the stream completed. Subsequent deltas also kept referencing the old msg_ id.
Root causes
1. The live reasoning item was gated on extractor.Reasoning(), which only reflects the Go-side ProcessToken parser and never the autoparser's ProcessChatDeltaReasoning accumulator, so autoparser-driven reasoning was dropped live and rebuilt only at end-of-stream.
2. The non-tool-call path created the msg_ item before any token, forcing reasoning to a later index and mis-attributing deltas to msg_.
3. Neither path carried the sticky preferAutoparser flag, letting a content-only autoparser leak <think> into content (Regression: Reasoning/thinking output provided as regular output #9985).
Fix
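A sketch of the sticky preferAutoparser flag from the root causes above. It encodes one plausible reading rather than LocalAI's code: it assumes the autoparser's split is trusted only once it has produced a reasoning delta, and that the choice then sticks for the rest of the stream. The chooser and split types and the Choose method are hypothetical.

```go
package main

import "fmt"

// split is one token's reasoning/content split as produced by one parser.
type split struct {
	reasoning string
	content   string
}

// chooser sketches a sticky autoparser preference. Assumption: the
// autoparser's split is trusted only after it has emitted a reasoning delta;
// until then the Go-side raw-tag split is used, so a content-only autoparser
// cannot push raw <think> text into content.
type chooser struct {
	preferAutoparser bool
}

// Choose picks which split to route for the current token.
func (c *chooser) Choose(auto, raw split) split {
	if auto.reasoning != "" {
		c.preferAutoparser = true // sticky: never reset for the rest of the stream
	}
	if c.preferAutoparser {
		return auto
	}
	return raw
}

func main() {
	// Content-only autoparser: it forwards the tag verbatim, while the
	// raw-tag parser classifies the text as reasoning.
	c := &chooser{}
	fmt.Printf("%+v\n", c.Choose(split{content: "<think>hmm"}, split{reasoning: "hmm"}))

	// Reasoning-aware autoparser: once it emits reasoning, later tokens keep
	// using its split even when that token carries no reasoning.
	c2 := &chooser{}
	fmt.Printf("%+v\n", c2.Choose(split{reasoning: "plan"}, split{content: "plan"}))
	fmt.Printf("%+v\n", c2.Choose(split{content: "done"}, split{content: "</think>done"}))
}
```

Making the flag sticky avoids flip-flopping between parsers mid-stream: a single token where the autoparser reports no reasoning does not hand control back to the raw-tag split.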
Extracted a pure streamReasoningRouter helper (mirroring chat_stream_workers.go) that gates on reasoningDelta != "", opens the message item lazily, and keeps a sticky autoparser preference. Both streaming callbacks now route reasoning deltas to the reasoning_ id, and the completed-response assembly orders reasoning -> message -> tool_calls.
Behavior note
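The completed-response ordering described above (reasoning, then message, then tool calls), together with the pure-reasoning case, can be sketched as below. OutputItem, assembleOutput, and the placeholder IDs are hypothetical; function_call is the Responses API output item type for function tool calls.

```go
package main

import "fmt"

// OutputItem is a simplified completed-response output item. Text holds the
// reasoning text, the message text, or a tool call's JSON arguments.
type OutputItem struct {
	Type string // "reasoning", "message", or "function_call"
	ID   string
	Text string
}

// assembleOutput orders the completed output as reasoning -> message -> tool
// calls. The message item is only added when there is content, so a
// pure-reasoning turn yields no empty message item. IDs are placeholders.
func assembleOutput(reasoning, content string, toolCalls []OutputItem) []OutputItem {
	items := make([]OutputItem, 0, 2+len(toolCalls))
	if reasoning != "" {
		items = append(items, OutputItem{Type: "reasoning", ID: "reasoning_1", Text: reasoning})
	}
	if content != "" {
		items = append(items, OutputItem{Type: "message", ID: "msg_1", Text: content})
	}
	return append(items, toolCalls...)
}

func main() {
	calls := []OutputItem{{Type: "function_call", ID: "call_1", Text: `{"city":"Rome"}`}}
	// A turn that reasons, then calls a tool without any message text.
	for _, it := range assembleOutput("Need the weather first.", "", calls) {
		fmt.Printf("%-13s %-11s %s\n", it.Type, it.ID, it.Text)
	}
}
```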
A pure-reasoning turn with no content no longer emits an empty message item.
Test plan
1. Unit tests for streamReasoningRouter (red -> green).
2. go test ./core/http/endpoints/openresponses/... ./core/http/endpoints/openai/... green.
3. golangci-lint --new-from-merge-base=origin/master clean.

Assisted-by: claude:claude-opus-4-8 [Claude Code]