
feat(continuity): preserve bounded reasoning checkpoints across model handoffs#2214

Open
warelik wants to merge 1 commit into decolua:master from warelik:feat/continuity-reasoning-sidecar

warelik commented Jun 29, 2026

Contributor

feat(continuity): preserve bounded reasoning checkpoints across model handoffs

Feature request

This PR proposes an experimental, opt-in continuity layer for 9router.

The feature is not a smart model selector and not an attempt to make switching providers “always better”. It solves a narrower routing problem:

when 9router continues a multi-turn session through another provider, model family, or API format, the visible dialogue history usually survives, but provider-native reasoning often does not.

The implementation attached here keeps a small reasoning sidecar per session. It captures native reasoning from successful responses, stores a bounded checkpoint buffer, and injects recent checkpoints into the next request’s system prompt when the user enables the feature.

Default: off.

Why this belongs in 9router

9router’s core value is not only “send request to provider X”. It is the translation and fallback layer between many clients, many API shapes, and many providers.

That makes reasoning continuity a router-level problem:

  • Claude thinking blocks are not the same shape as OpenAI reasoning_content.
  • Gemini / Antigravity thought parts are not the same shape as Claude content blocks.
  • Responses API reasoning items/deltas are not the same shape as Chat Completions messages.
  • Some providers reject reasoning fields they do not understand.
  • Some clients drop reasoning fields when rebuilding conversation history.
  • Combo/fallback routes can switch provider families mid-session.

The normal message history remains client-owned and is still passed as before. This feature only carries the missing “reasoning trace” that tends to be lost at router boundaries.

This is closely related to issues already reported in the repo:

  • #1189 — DeepSeek thinking mode requires previous reasoning_content to be passed back in multi-turn conversations.
  • #1321 — Xiaomi/MiMo thinking models fail because reasoning_content is not preserved across turns.
  • #1459 — reasoning effort/config can be lost during Claude → Codex translation.
  • #2158 — reasoning leaks into visible content through <thinking> tags.
  • #2190 — another PR addresses a streaming leak where Claude thinking markers reach visible content.

This PR does not replace those narrower protocol fixes. It adds a router-level continuation mechanism for the class of cases where provider-native reasoning cannot be faithfully round-tripped through the client’s normal history.

What the feature does

1. Capture native reasoning from successful responses

The router extracts reasoning from the response shape it actually receives:

  • OpenAI-compatible: reasoning_content
  • Claude-compatible: thinking
  • Gemini / Antigravity: parts[] with thought: true
  • Responses API: reasoning output items and reasoning delta events
  • Ollama-like shape: message.thinking
  • AI SDK / CommandCode style: reasoning-delta

Streaming responses use the existing assembled contentObj.thinking path after the stream completes. Non-streaming and forced SSE→JSON paths use the same extractor utility.

Failed responses do not replace the buffer.
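As a rough illustration of the capture step, the sketch below pulls native reasoning out of four of the non-streaming shapes listed above. It is not the PR's thinkingExtractor.js: the function name extractNativeReasoning and the first-match-wins order are assumptions made for this example, and the Responses API and AI SDK shapes are left out.

```javascript
// Illustrative sketch only; not the PR's thinkingExtractor.js.
// Returns native reasoning text from a non-streaming response body,
// or "" when none of the known shapes match.
const asArray = (value) => (Array.isArray(value) ? value : []);

function extractNativeReasoning(body) {
  if (!body || typeof body !== "object") return "";

  // OpenAI-compatible Chat Completions: choices[].message.reasoning_content
  const openai = asArray(body.choices)
    .map((choice) => choice?.message?.reasoning_content)
    .filter((text) => typeof text === "string" && text.length > 0);
  if (openai.length) return openai.join("\n");

  // Claude-compatible: content[] blocks of type "thinking"
  const claude = asArray(body.content)
    .filter((block) => block?.type === "thinking" && typeof block.thinking === "string")
    .map((block) => block.thinking);
  if (claude.length) return claude.join("\n");

  // Gemini / Antigravity: parts[] flagged with thought: true
  const gemini = asArray(body.candidates)
    .flatMap((candidate) => asArray(candidate?.content?.parts))
    .filter((part) => part?.thought === true && typeof part.text === "string")
    .map((part) => part.text);
  if (gemini.length) return gemini.join("\n");

  // Ollama-like shape: message.thinking
  if (typeof body.message?.thinking === "string") return body.message.thinking;

  // Ordinary visible text is never treated as reasoning.
  return "";
}
```

Only provider-native fields are read, so a response whose visible content happens to contain `<think>` text yields an empty result here.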

2. Store a bounded per-session sidecar

The feature adds a small in-memory recentThoughtsStore in sessionManager.js.

Bounds:

  • max 100 reasoning checkpoints per session;
  • max 1000 active session keys;
  • session cap eviction is LRU using lastUsed;
  • a single checkpoint above MAX_RECENT_THOUGHT_CHARS is skipped with a warning, not truncated.

The “not truncated” part is intentional. Partial reasoning traces are worse than no trace because they can turn the checkpoint into corrupted instructions or misleading context. If a trace is too large, the safer behavior is to drop it and keep the normal visible conversation intact.
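The bounds above can be sketched as a small in-memory store. This is a minimal sketch, not the PR's recentThoughtsStore: createThoughtStore and its method names are hypothetical, the character ceiling is an assumed placeholder because the PR text does not state the value of MAX_RECENT_THOUGHT_CHARS, and a monotonic counter stands in for the lastUsed value.

```javascript
// Illustrative sketch only; names and the char ceiling are assumptions.
const MAX_THOUGHTS_PER_SESSION = 100; // from the PR text
const MAX_SESSIONS = 1000; // from the PR text
const MAX_RECENT_THOUGHT_CHARS = 32000; // assumed value, not stated in the PR

function createThoughtStore(limits = {}) {
  const maxThoughts = limits.maxThoughts ?? MAX_THOUGHTS_PER_SESSION;
  const maxSessions = limits.maxSessions ?? MAX_SESSIONS;
  const maxChars = limits.maxChars ?? MAX_RECENT_THOUGHT_CHARS;
  const sessions = new Map(); // key -> { thoughts: string[], lastUsed: number }
  let clock = 0; // monotonic stand-in for a lastUsed timestamp

  const touch = (entry) => {
    entry.lastUsed = ++clock;
  };

  function evictLeastRecentlyUsed() {
    let oldestKey = null;
    let oldest = Infinity;
    for (const [key, entry] of sessions) {
      if (entry.lastUsed < oldest) {
        oldest = entry.lastUsed;
        oldestKey = key;
      }
    }
    if (oldestKey !== null) sessions.delete(oldestKey);
  }

  return {
    // Returns true when the checkpoint was stored, false when it was skipped.
    push(key, thought) {
      if (typeof thought !== "string" || thought.length === 0) return false;
      if (thought.length > maxChars) {
        console.warn(`skipping oversize checkpoint (${thought.length} chars)`);
        return false; // dropped whole, never truncated
      }
      let entry = sessions.get(key);
      if (!entry) {
        if (sessions.size >= maxSessions) evictLeastRecentlyUsed();
        entry = { thoughts: [], lastUsed: 0 };
        sessions.set(key, entry);
      }
      entry.thoughts.push(thought);
      if (entry.thoughts.length > maxThoughts) entry.thoughts.shift();
      touch(entry);
      return true;
    },
    // Returns up to `count` most recent checkpoints, oldest first.
    recent(key, count) {
      const entry = sessions.get(key);
      if (!entry || count <= 0) return [];
      touch(entry);
      return entry.thoughts.slice(-count);
    },
    size() {
      return sessions.size;
    },
  };
}
```

A skipped oversize checkpoint leaves the existing buffer untouched, so the next request still sees the last complete traces.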

3. Replay recent checkpoints into the next request

When enabled, the next request receives up to continuityCount recent checkpoints through the existing system-prompt injection path.

The prompt is explicitly framed as a host continuation checkpoint:

[HOST CONTINUATION CHECKPOINT]
<continuation_checkpoint>
The block below is private continuation context from your immediately preceding turns.
It is the causal context behind the prior visible conversation history.
Use it to understand why that history unfolded as it did, then continue from the active user input below.
...
</continuation_checkpoint>
[HOST RESUME]

This is deliberately injected into the system prompt, not as a synthetic user message. It mirrors the existing RTK/caveman style of system-level prompt augmentation and avoids polluting user history.
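To make the difference from a synthetic user message concrete, here is a minimal sketch for an OpenAI-style messages array. The helper name injectIntoSystemPrompt is hypothetical and stands in for the existing injection helper, whose actual signature is not shown in this description.

```javascript
// Illustrative sketch only; a stand-in for the router's existing injection helper.
// Appends checkpoint text to the first string system message, or adds a new
// system message when none exists. User and assistant turns are left untouched.
function injectIntoSystemPrompt(body, checkpointText) {
  const messages = Array.isArray(body.messages) ? body.messages : [];
  const index = messages.findIndex(
    (message) => message.role === "system" && typeof message.content === "string"
  );

  const next =
    index === -1
      ? [{ role: "system", content: checkpointText }, ...messages]
      : messages.map((message, i) =>
          i === index
            ? { ...message, content: `${message.content}\n\n${checkpointText}` }
            : message
        );

  // Return a copy so the client-owned request body is not mutated.
  return { ...body, messages: next };
}
```

The count of user messages is the same before and after injection, which is the property the paragraph above relies on.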

The important architecture detail: two different session IDs

The feature separates two concepts that look similar but behave differently:

  1. continuitySessionId — existing prompt-cache/session identity used for upstream compatibility.
  2. thoughtKey — reasoning sidecar key used to keep checkpoints across route changes and summarization.

The thought key is:

extractClientSessionId(headers, body) || apiKey || continuitySessionId

This fallback order came out of debugging earlier iterations of this patch series, where two simpler keys were tried and found to break.

Using only the existing captureSessionId() looked attractive at first, but it can fall back to assistantTextSessionId, which hashes assistant text. When a client summarizes or truncates a long conversation, that assistant-text hash changes. The normal visible history still exists, but the sidecar buffer becomes orphaned under the old hash.

Using connectionId also failed for combo routes because each provider/account can have a different connection id, so reasoning captured from provider A would not be visible when the route moved to provider B.

The current split keeps prompt caching behavior unchanged while giving the reasoning sidecar a stable key when the client supplies one. If no stable client session exists, it falls back to apiKey, then to continuitySessionId.
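The fallback chain can be written out as a small function. This is a sketch under stated assumptions: resolveThoughtKey is a hypothetical name, and clientSessionId stands for whatever extractClientSessionId(headers, body) returns, which is not reimplemented here.

```javascript
// Illustrative sketch only; resolveThoughtKey is a hypothetical name.
// clientSessionId is assumed to be the result of extractClientSessionId(headers, body).
function resolveThoughtKey({ clientSessionId, apiKey, continuitySessionId }) {
  // Prefer the client-supplied id, then the API key, then the prompt-cache session id.
  return clientSessionId || apiKey || continuitySessionId || null;
}

// Two turns of one client session. The prompt-cache id changes after the client
// summarizes history, but the sidecar key stays the same.
const beforeSummary = resolveThoughtKey({
  clientSessionId: "client-123",
  apiKey: "key-abc",
  continuitySessionId: "hash-of-old-assistant-text",
});
const afterSummary = resolveThoughtKey({
  clientSessionId: "client-123",
  apiKey: "key-abc",
  continuitySessionId: "hash-of-new-assistant-text",
});
console.log(beforeSummary === afterSummary); // true
```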

Why this is not “full context transfer”

The feature intentionally does not claim to transfer the whole model/session state.

It does not preserve:

  • hidden provider runtime state;
  • exact decoder state;
  • provider cache state;
  • pending tool calls;
  • tool runtime state;
  • files/workspace state;
  • arbitrary provider-private metadata;
  • the full visible conversation history beyond what the client already sends.

It only replays a bounded number of recent reasoning traces as continuation context.

That makes it safe enough to ship as an experimental opt-in feature without pretending it is a perfect cross-provider session migration system.

Why the tag normalizer is included

Some providers/translation paths surface reasoning markers as visible text, for example:

<thinking>...</thinking>
<think>...</think>
<reasoning>...</reasoning>

When continuity is enabled, this PR removes only the tag markers from visible content and keeps the inner text visible.

Important behavior:

  • tagged visible text is not promoted into the reasoning sidecar;
  • native provider reasoning remains the source of sidecar capture;
  • tag stripping runs in both translate and passthrough paths;
  • this reduces downstream trigger-token leakage without deleting user-visible text.

This is intentionally conservative. It does not try to infer hidden reasoning from arbitrary visible text.
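A minimal sketch of marker stripping, assuming the tag list given in the commit message (think, thinking, thought, reasoning, analysis). The function name is hypothetical and this is not the PR's taggedThinkingNormalizer.js.

```javascript
// Illustrative sketch only; not the PR's taggedThinkingNormalizer.js.
// Matches opening and closing markers for the known tag names, without attributes.
const THINKING_TAG_MARKERS = /<\/?(?:think|thinking|thought|reasoning|analysis)>/gi;

// Removes only the markers; the text between them stays visible.
function stripThinkingTagMarkers(text) {
  if (typeof text !== "string") return text;
  return text.replace(THINKING_TAG_MARKERS, "");
}

console.log(stripThinkingTagMarkers("<think>check units</think>42 km"));
// prints "check units42 km"
```

Because the pattern needs a complete marker, a tag split across two chunks passes through unchanged in this sketch.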

Why it is opt-in

The research does not support the claim “switching models with prior reasoning is always better”. The effect is directional and pair/task dependent.

This PR therefore exposes the feature as a dashboard toggle:

  • continuityEnabled — default false
  • continuityCount — default 3, range 1..100

Users who are actively using combo/fallback routes with thinking models can enable it. Others see no behavior change.
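The two settings could be normalized along these lines before use. This is a sketch, assuming out-of-range counts are clamped rather than rejected; the PR text states only the defaults and the range, and normalizeContinuitySettings is a hypothetical name.

```javascript
// Illustrative sketch only; clamping behavior is an assumption.
const DEFAULT_CONTINUITY_COUNT = 3;
const MIN_CONTINUITY_COUNT = 1;
const MAX_CONTINUITY_COUNT = 100;

function normalizeContinuitySettings(raw = {}) {
  // Anything other than an explicit true leaves the feature off.
  const continuityEnabled = raw.continuityEnabled === true;

  const parsed = Number.parseInt(raw.continuityCount, 10);
  const continuityCount = Number.isNaN(parsed)
    ? DEFAULT_CONTINUITY_COUNT
    : Math.min(MAX_CONTINUITY_COUNT, Math.max(MIN_CONTINUITY_COUNT, parsed));

  return { continuityEnabled, continuityCount };
}
```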

Research motivation

The design is based on a modest claim: a bounded reasoning checkpoint can be useful handoff context for some model pairs and tasks, but it must be configurable and measured.

Relevant work:

  1. Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems studies multi-turn handoff where one model produces the dialogue prefix and another model completes the final turn. The reported effect is source→target specific: some switches improve, others degrade. This supports the feature being opt-in and route-dependent rather than enabled as a universal quality optimization.

  2. Reasoning that Travels: Dissecting How Chain-of-Thought Transfers Across Models shows that reasoning traces can transfer between models in some settings. It also cautions that trace transfer can leak or nearly reveal answers on some tasks. This supports bounded replay and explicit framing instead of blindly copying unbounded reasoning.

  3. Mixture-of-Agents Enhances Large Language Model Capabilities is not the same architecture as 9router fallback, but it is useful evidence that outputs from one model can serve as auxiliary context for another.

  4. Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? is the counterweight: heterogeneous model mixing is not automatically better. This is why the PR avoids any claim that cross-provider continuation always improves quality.

Implementation map

open-sse/utils/thinkingExtractor.js

Small recursive extractor for native reasoning fields across response families.

It recognizes provider-native reasoning fields but does not infer reasoning from ordinary visible text.

open-sse/utils/taggedThinkingNormalizer.js

Removes known thinking tag markers from response content while preserving the inner text.

This is deliberately a simple marker stripper. It does not try to solve split tags across streaming chunk boundaries with stateful buffering yet; that would add complexity before there is evidence it is needed.

open-sse/rtk/continuityPrompt.js

Builds the framed continuation checkpoint.

It uses dynamic backtick fences so checkpoint text containing backticks cannot prematurely close the fenced block.
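The dynamic fence idea can be sketched as follows: pick a fence one backtick longer than the longest backtick run in the text, with a minimum of three. The helper names are hypothetical and this is not the code in continuityPrompt.js.

```javascript
// Illustrative sketch only; not the PR's continuityPrompt.js.
// Chooses a fence that is longer than any run of backticks inside the text.
function fenceFor(text) {
  let longest = 0;
  for (const run of text.match(/`+/g) || []) {
    longest = Math.max(longest, run.length);
  }
  return "`".repeat(Math.max(3, longest + 1));
}

// Wraps text in the chosen fence so embedded backticks cannot close it early.
function wrapInFence(text) {
  const fence = fenceFor(text);
  return `${fence}\n${text}\n${fence}`;
}
```

A checkpoint that itself contains a three-backtick code block is therefore wrapped in a four-backtick fence.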

open-sse/rtk/continuity.js

Thin wrapper around the existing system-prompt injection helper.

The feature does not create synthetic user messages.

open-sse/utils/sessionManager.js

Adds recentThoughtsStore next to existing in-memory session stores.

There is no DB, Redis, second persistent store, or provider-specific session scheme. Cleanup runs on the same TTL schedule as the existing runtime session cleanup, while the thought store maintains its own lastUsed values.

open-sse/handlers/chatCore*.js

Wires the feature through:

  • normal streaming responses;
  • non-streaming responses;
  • forced SSE→JSON responses;
  • translate and passthrough paths.

open-sse/utils/stream.js

Accumulates thinking/content from stream chunks across OpenAI, Claude, Gemini, Responses, Ollama, and CommandCode/AI SDK style events.
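For two of these event families, accumulation might look like the sketch below. It covers only OpenAI-compatible chunks (delta.reasoning_content) and Claude content_block_delta events (thinking_delta); createThinkingAccumulator is a hypothetical name, and the other families handled in stream.js are omitted.

```javascript
// Illustrative sketch only; covers two event families, not the full stream.js logic.
function createThinkingAccumulator() {
  let thinking = "";
  let content = "";

  return {
    // Accepts one already-parsed stream event and appends any text it carries.
    add(chunk) {
      // OpenAI-compatible streaming chunk: choices[0].delta
      const delta = chunk?.choices?.[0]?.delta;
      if (delta) {
        if (typeof delta.reasoning_content === "string") thinking += delta.reasoning_content;
        if (typeof delta.content === "string") content += delta.content;
        return;
      }

      // Claude streaming event: content_block_delta with a typed delta
      if (chunk?.type === "content_block_delta") {
        const d = chunk.delta;
        if (d?.type === "thinking_delta" && typeof d.thinking === "string") thinking += d.thinking;
        if (d?.type === "text_delta" && typeof d.text === "string") content += d.text;
      }
    },
    // Call after the stream completes to read the assembled text.
    result() {
      return { thinking, content };
    },
  };
}
```

Reasoning and visible text are kept in separate strings, so only the thinking part would be handed to the sidecar once the stream ends.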

src/app/(dashboard)/dashboard/token-saver/TokenSaverClient.js

Adds the experimental toggle and replay count control.

Operational notes

  • Default off: no behavior change unless enabled.
  • Client-facing responses are unchanged except for optional thinking-tag marker stripping when enabled.
  • Reasoning checkpoints are bounded by count, session cap, and single-checkpoint size.
  • Oversize checkpoints are skipped, not truncated.
  • The normal visible conversation remains client-owned.
  • The sidecar is in-memory and follows the process lifetime, matching the lightweight nature of existing runtime session state.

Related PRs / prior work

This builds on the same compatibility direction as:

  • #2192 — Gemini contents normalization.
  • #2194 — Gemini thoughtSignature and stream sentinel compatibility.
  • #2196 — Claude tool schema normalization for strict gateways.
  • #2190 — keeping Claude thinking out of visible content in a narrower translator path.

Those are protocol-shape fixes. This PR is the feature-layer continuation mechanism above those translators.

… handoffs

Adds an experimental, opt-in continuity layer that captures native
reasoning from successful responses (OpenAI reasoning_content, Claude
thinking, Gemini thought parts, Responses reasoning items, Ollama and
AI SDK shapes), stores a bounded per-session sidecar, and replays
recent checkpoints into the next request's system prompt.

The thought-buffer key is decoupled from the prompt-cache session id:
uses client-provided session id when available (stable across context
summarization), falls back to apiKey, then continuitySessionId.

When enabled, also strips content-embedded thinking tag markers
(<think>/<thinking>/<thought>/<reasoning>/<analysis>) from response
content in translate and passthrough modes while keeping inner text
visible; tagged text is not promoted into the sidecar.

Bounds: 100 thoughts/session, 1000 sessions (LRU via lastUsed),
MAX_RECENT_THOUGHT_CHARS ceiling (oversize traces skipped, not
truncated). Default off. Dashboard toggle + count control in
Token Saver UI.

Ships with open-sse/utils/continuitySelfCheck.mjs — no-framework
assert-based self-check (13/13) for tag stripping, thinking
extraction, and continuity prompt fence behavior.