feat(continuity): preserve bounded reasoning checkpoints across model handoffs #2214
Open
warelik wants to merge 1 commit into
… handoffs
Adds an experimental, opt-in continuity layer that captures native reasoning from successful responses (OpenAI reasoning_content, Claude thinking, Gemini thought parts, Responses reasoning items, Ollama and AI SDK shapes), stores a bounded per-session sidecar, and replays recent checkpoints into the next request's system prompt.
The thought-buffer key is decoupled from the prompt-cache session id: it uses the client-provided session id when available (stable across context summarization), then falls back to apiKey, then continuitySessionId.
When enabled, it also strips content-embedded thinking tag markers (<think>/<thinking>/<thought>/<reasoning>/<analysis>) from response content in translate and passthrough modes while keeping the inner text visible; tagged text is not promoted into the sidecar.
Bounds: 100 thoughts/session, 1000 sessions (LRU via lastUsed), MAX_RECENT_THOUGHT_CHARS ceiling (oversize traces skipped, not truncated).
Default off. Dashboard toggle + count control in Token Saver UI.
Ships with open-sse/utils/continuitySelfCheck.mjs, a no-framework assert-based self-check (13/13) for tag stripping, thinking extraction, and continuity prompt fence behavior.
feat(continuity): preserve bounded reasoning checkpoints across model handoffs
Feature request
This PR proposes an experimental, opt-in continuity layer for 9router.
The feature is not a smart model selector and not an attempt to make switching providers “always better”. It solves a narrower routing problem:
The implementation attached here keeps a small reasoning sidecar per session. It captures native reasoning from successful responses, stores a bounded checkpoint buffer, and injects recent checkpoints into the next request’s system prompt when the user enables the feature.
Default: off.
Why this belongs in 9router
9router’s core value is not only “send request to provider X”. It is the translation and fallback layer between many clients, many API shapes, and many providers.
That makes reasoning continuity a router-level problem:
- … `reasoning_content`.
The normal message history remains client-owned and is still passed as before. This feature only carries the missing “reasoning trace” that tends to be lost at router boundaries.
This is closely related to issues already reported in the repo:
- … `reasoning_content` to be passed back in multi-turn conversations.
- … `reasoning_content` is not preserved across turns.
- … `content` through `<thinking>` tags.
This PR does not replace those narrower protocol fixes. It adds a router-level continuation mechanism for the class of cases where provider-native reasoning cannot be faithfully round-tripped through the client’s normal history.
What the feature does
1. Capture native reasoning from successful responses
The router extracts reasoning from the response shape it actually receives:
- OpenAI-compatible: `reasoning_content`
- Claude: `thinking`
- Gemini: `parts[]` with `thought: true`
- Responses: `reasoning` output items and reasoning delta events
- Ollama: `message.thinking`
- AI SDK: `reasoning-delta`
Streaming responses use the existing assembled `contentObj.thinking` path after the stream completes. Non-streaming and forced SSE→JSON paths use the same extractor utility.
Failed responses do not replace the buffer.
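As an illustration of the capture step, here is a minimal sketch of an extractor for four of the shapes listed above. The function name `extractThinking` is hypothetical, and the sketch is not the code in `thinkingExtractor.js`; Responses and AI SDK event shapes are left out for brevity.

```javascript
// Hypothetical helper, not the PR's actual extractor.
// Collects provider-native reasoning only; visible text is never inspected.
function extractThinking(response) {
  const traces = [];
  const push = (text) => {
    if (typeof text === "string" && text.trim() !== "") traces.push(text);
  };

  // OpenAI-compatible chat completions: choices[].message.reasoning_content
  for (const choice of response?.choices ?? []) {
    push(choice?.message?.reasoning_content);
  }

  // Claude messages: content[] blocks of type "thinking"
  if (Array.isArray(response?.content)) {
    for (const block of response.content) {
      if (block?.type === "thinking") push(block.thinking);
    }
  }

  // Gemini: candidates[].content.parts[] flagged with thought: true
  for (const candidate of response?.candidates ?? []) {
    for (const part of candidate?.content?.parts ?? []) {
      if (part?.thought === true) push(part.text);
    }
  }

  // Ollama chat: message.thinking
  push(response?.message?.thinking);

  return traces.join("\n");
}
```

A response that only carries `<think>` tags inside ordinary `content` yields an empty string here, which matches the rule that tagged text is not promoted into the sidecar.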
2. Store a bounded per-session sidecar
The feature adds a small in-memory `recentThoughtsStore` in `sessionManager.js`.
Bounds:
- up to 100 thoughts per session;
- up to 1000 sessions, evicted least-recently-used by `lastUsed`;
- a trace longer than `MAX_RECENT_THOUGHT_CHARS` is skipped with a warning, not truncated.
The “not truncated” part is intentional. Partial reasoning traces are worse than no trace because they can turn the checkpoint into corrupted instructions or misleading context. If a trace is too large, the safer behavior is to drop it and keep the normal visible conversation intact.
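The bounds above could be enforced roughly as follows. This is a sketch under assumptions: the helper names `rememberThought`, `recentThoughts`, and `evictLeastRecentlyUsed` are hypothetical, and the numeric value given to `MAX_RECENT_THOUGHT_CHARS` is a placeholder because the PR text does not state it.

```javascript
const MAX_THOUGHTS_PER_SESSION = 100;
const MAX_SESSIONS = 1000;
const MAX_RECENT_THOUGHT_CHARS = 32000; // assumed value; the real ceiling is not stated here

// thoughtKey -> { thoughts: string[], lastUsed: number }
const recentThoughtsStore = new Map();

// Drop the session whose lastUsed timestamp is oldest.
function evictLeastRecentlyUsed() {
  let oldestKey;
  let oldestUsed = Infinity;
  for (const [key, entry] of recentThoughtsStore) {
    if (entry.lastUsed < oldestUsed) {
      oldestUsed = entry.lastUsed;
      oldestKey = key;
    }
  }
  if (oldestKey !== undefined) recentThoughtsStore.delete(oldestKey);
}

// Returns true when the trace was stored, false when it was skipped.
function rememberThought(thoughtKey, thought, now = Date.now()) {
  if (typeof thought !== "string" || thought.length === 0) return false;
  if (thought.length > MAX_RECENT_THOUGHT_CHARS) {
    // Oversize traces are skipped whole rather than truncated.
    console.warn(`continuity: skipping oversize trace (${thought.length} chars)`);
    return false;
  }
  let entry = recentThoughtsStore.get(thoughtKey);
  if (!entry) {
    if (recentThoughtsStore.size >= MAX_SESSIONS) evictLeastRecentlyUsed();
    entry = { thoughts: [], lastUsed: now };
    recentThoughtsStore.set(thoughtKey, entry);
  }
  entry.thoughts.push(thought);
  if (entry.thoughts.length > MAX_THOUGHTS_PER_SESSION) entry.thoughts.shift();
  entry.lastUsed = now;
  return true;
}

// Returns up to `count` most recent traces, oldest first.
function recentThoughts(thoughtKey, count, now = Date.now()) {
  const entry = recentThoughtsStore.get(thoughtKey);
  if (!entry || count <= 0) return [];
  entry.lastUsed = now;
  return entry.thoughts.slice(-count);
}
```

The linear scan in `evictLeastRecentlyUsed` is acceptable at a 1000-session cap and keeps the store a plain `Map` with no extra bookkeeping.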
3. Replay recent checkpoints into the next request
When enabled, the next request receives up to `continuityCount` recent checkpoints through the existing system-prompt injection path.
The prompt is explicitly framed as a host continuation checkpoint.
This is deliberately injected into the system prompt, not as a synthetic user message. It mirrors the existing RTK/caveman style of system-level prompt augmentation and avoids polluting user history.
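A rough sketch of the replay step on an OpenAI-style `messages` array follows. The function name `injectCheckpoints` and the framing line are placeholders, not the wording or helper used by the PR, which goes through the existing system-prompt injection path.

```javascript
// Hypothetical helper; the framing text below is a placeholder, not the PR's prompt.
function injectCheckpoints(messages, checkpoints, continuityCount = 3) {
  const recent = continuityCount > 0 ? checkpoints.slice(-continuityCount) : [];
  if (recent.length === 0) return messages;

  const block = [
    "[host continuation checkpoint: placeholder framing]",
    ...recent.map((text, i) => `Checkpoint ${i + 1}:\n${text}`),
  ].join("\n\n");

  const [first, ...rest] = messages;
  if (first?.role === "system" && typeof first.content === "string") {
    // Extend the existing system prompt instead of adding a second one.
    return [{ ...first, content: `${first.content}\n\n${block}` }, ...rest];
  }
  // No system prompt yet: add one. User messages are never touched.
  return [{ role: "system", content: block }, ...messages];
}
```

Because only the system message changes, the client-owned user and assistant turns pass through exactly as they were sent.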
The important architecture detail: two different session IDs
The feature separates two concepts that look similar but behave differently:
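- `continuitySessionId`: existing prompt-cache/session identity used for upstream compatibility.
- `thoughtKey`: reasoning sidecar key used to keep checkpoints across route changes and summarization.
The thought key is resolved in order: the client-provided session id when available, then `apiKey`, then `continuitySessionId`.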
This is the result of the earlier debugging work in this patch series.
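The fallback order for the thought key can be sketched as below. The function name `resolveThoughtKey` and the parameter name `clientSessionId` are hypothetical; only the order (client session id, then `apiKey`, then `continuitySessionId`) comes from the PR.

```javascript
// Hypothetical helper showing the fallback order only.
function resolveThoughtKey({ clientSessionId, apiKey, continuitySessionId } = {}) {
  const candidates = [clientSessionId, apiKey, continuitySessionId];
  for (const candidate of candidates) {
    if (typeof candidate === "string" && candidate.trim() !== "") return candidate;
  }
  return null; // no usable key: the caller would skip capture and replay
}
```

The prompt-cache identity is not an input that gets rewritten here, which is why prompt caching behavior stays unchanged.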
Using only the existing `captureSessionId()` looked attractive at first, but it can fall back to `assistantTextSessionId`, which hashes assistant text. When a client summarizes or truncates a long conversation, that assistant-text hash changes. The normal visible history still exists, but the sidecar buffer becomes orphaned under the old hash.
Using `connectionId` also failed for combo routes because each provider/account can have a different connection id, so reasoning captured from provider A would not be visible when the route moved to provider B.
The current split keeps prompt caching behavior unchanged while giving the reasoning sidecar a stable key when the client supplies one. If no stable client session exists, it falls back to `apiKey`, then the existing session id.
Why this is not “full context transfer”
The feature intentionally does not claim to transfer the whole model/session state.
It does not preserve:
It only replays a bounded number of recent reasoning traces as continuation context.
That makes it safe enough to ship as an experimental opt-in feature without pretending it is a perfect cross-provider session migration system.
Why the tag normalizer is included
Some providers/translation paths surface reasoning markers as visible text, for example:
When continuity is enabled, this PR removes only the tag markers from visible content and keeps the inner text visible.
Important behavior:
This is intentionally conservative. It does not try to infer hidden reasoning from arbitrary visible text.
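A marker stripper of this kind can be very small. The sketch below uses the tag names listed in the commit message; the names `THINKING_TAG_MARKERS` and `stripThinkingTagMarkers` are hypothetical, and the actual `taggedThinkingNormalizer.js` may differ.

```javascript
// Matches opening and closing markers only, never the text between them.
const THINKING_TAG_MARKERS = /<\/?(?:think|thinking|thought|reasoning|analysis)>/gi;

// Hypothetical helper: removes the markers and keeps the inner text visible.
function stripThinkingTagMarkers(content) {
  if (typeof content !== "string") return content;
  return content.replace(THINKING_TAG_MARKERS, "");
}
```

For example, `stripThinkingTagMarkers("<think>plan</think>answer")` returns `"plananswer"`. Like the PR's normalizer, this sketch does not handle a marker split across two streaming chunks.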
Why it is opt-in
The research does not support the claim “switching models with prior reasoning is always better”. The effect is directional and pair/task dependent.
This PR therefore exposes the feature as a dashboard toggle:
- `continuityEnabled`: default `false`
- `continuityCount`: default `3`, range `1..100`
Users who are actively using combo/fallback routes with thinking models can enable it. Others see no behavior change.
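One way to apply these defaults and the `1..100` range when reading stored settings is sketched below. The function name `normalizeContinuitySettings` is hypothetical, and clamping out-of-range values is an assumption: the PR may reject them instead.

```javascript
// Hypothetical helper: applies the documented defaults and clamps the count.
function normalizeContinuitySettings(raw = {}) {
  // Only an explicit boolean true enables the feature (default off).
  const continuityEnabled = raw.continuityEnabled === true;
  const parsed = Number.parseInt(raw.continuityCount, 10);
  const continuityCount = Number.isNaN(parsed)
    ? 3 // documented default
    : Math.min(100, Math.max(1, parsed)); // documented range 1..100
  return { continuityEnabled, continuityCount };
}
```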
Research motivation
The design is based on a modest claim: a bounded reasoning checkpoint can be useful handoff context for some model pairs and tasks, but it must be configurable and measured.
Relevant work:
Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems studies multi-turn handoff where one model produces the dialogue prefix and another model completes the final turn. The reported effect is source→target specific: some switches improve, others degrade. This supports the feature being opt-in and route-dependent rather than enabled as a universal quality optimization.
Reasoning that Travels: Dissecting How Chain-of-Thought Transfers Across Models shows that reasoning traces can transfer between models in some settings. It also cautions that trace transfer can leak or nearly reveal answers on some tasks. This supports bounded replay and explicit framing instead of blindly copying unbounded reasoning.
Mixture-of-Agents Enhances Large Language Model Capabilities is not the same architecture as 9router fallback, but it is useful evidence that outputs from one model can serve as auxiliary context for another.
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? is the counterweight: heterogeneous model mixing is not automatically better. This is why the PR avoids any claim that cross-provider continuation always improves quality.
Implementation map
`open-sse/utils/thinkingExtractor.js`
Small recursive extractor for native reasoning fields across response families.
It recognizes provider-native reasoning fields but does not infer reasoning from ordinary visible text.
`open-sse/utils/taggedThinkingNormalizer.js`
Removes known thinking tag markers from response content while preserving the inner text.
This is deliberately a simple marker stripper. It does not try to solve split tags across streaming chunk boundaries with stateful buffering yet; that would add complexity before there is evidence it is needed.
`open-sse/rtk/continuityPrompt.js`
Builds the framed continuation checkpoint.
It uses dynamic backtick fences so checkpoint text containing backticks cannot prematurely close the fenced block.
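The dynamic-fence idea can be sketched as follows: pick a fence one backtick longer than the longest backtick run in the text, with a minimum of three. The function name `fenceCheckpoint` is hypothetical and the PR's builder may format the block differently.

```javascript
// Hypothetical helper: wraps text in a fence the text itself cannot close.
function fenceCheckpoint(text) {
  const runs = text.match(/`+/g) ?? [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  const fence = "`".repeat(Math.max(3, longest + 1));
  return `${fence}\n${text}\n${fence}`;
}
```

A checkpoint that itself contains a three-backtick code block is therefore wrapped in a four-backtick fence, so its inner fence is treated as content.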
`open-sse/rtk/continuity.js`
Thin wrapper around the existing system-prompt injection helper.
The feature does not create synthetic user messages.
`open-sse/utils/sessionManager.js`
Adds `recentThoughtsStore` next to existing in-memory session stores.
There is no DB, Redis, second persistent store, or provider-specific session scheme. Cleanup uses the same TTL rhythm as runtime session cleanup, while the thought store has its own `lastUsed` updates.
`open-sse/handlers/chatCore*.js`
Wires the feature through:
`open-sse/utils/stream.js`
Accumulates thinking/content from stream chunks across OpenAI, Claude, Gemini, Responses, Ollama, and CommandCode/AI SDK style events.
`src/app/(dashboard)/dashboard/token-saver/TokenSaverClient.js`
Adds the experimental toggle and replay count control.
Operational notes
Related PRs / prior work
This builds on the same compatibility direction as:
- … `thoughtSignature` and stream sentinel compatibility.
Those are protocol-shape fixes. This PR is the feature-layer continuation mechanism above those translators.