feat(usage): track cached tokens + correct input/output/cache cost by hodtien · Pull Request #2209 · decolua/9router

hodtien · 2026-06-29T09:22:03Z

Problem

The dashboard mis-reported both token counts and cost whenever an upstream returned cached tokens. Two distinct root causes:

Cache fields stripped before persist. saveUsageStats in open-sse/handlers/chatCore/requestDetail.js collapsed tokens to {prompt_tokens, completion_tokens} before writing to usageHistory. calculateCost already knew how to price cached + cache_creation, but it never received those fields — so cached tokens were billed at the full input rate and the displayed input count was wrong.
Mixed token conventions across providers. Claude reports prompt_tokens excluding cache_read_input_tokens / cache_creation_input_tokens. OpenAI Chat, OpenAI Responses, and Gemini report prompt_tokens including cached_tokens. The cost formula assumed inclusive (nonCachedInput = prompt − cached) so:
- Claude was undercharged and its input count was wrong.
- cache_creation_input_tokens was double-counted under the inclusive convention (prompt already contains it, then the cost formula added it again at its own rate).

Fix — one canonical convention

canonicalizeUsage() in open-sse/utils/usageTracking.js normalizes every provider to one shape before persist:

prompt_tokens               = total input INCLUDING cache read + cache creation
cached_tokens               = cache-read portion   (subset of prompt_tokens)
cache_creation_input_tokens = cache-write portion  (subset of prompt_tokens)
completion_tokens, reasoning_tokens, total_tokens

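Discriminator: the Claude path emits cache_read_input_tokens (prompt excludes cache), so cache read and cache creation are folded into prompt_tokens. OpenAI/Gemini emit cached_tokens (prompt already inclusive), so those values pass through unchanged. The function is idempotent: canonicalizing an already-canonical object leaves it unchanged.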
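The discriminator can be sketched as follows. This is a minimal illustration, not the code in open-sse/utils/usageTracking.js: the function body, the input_tokens / output_tokens fallbacks, and the treatment of missing fields as 0 are assumptions. It also assumes the extractor has already flattened nested fields such as prompt_tokens_details.cached_tokens into cached_tokens. Only the field names and the fold/pass-through rule come from the description above.

```javascript
// Hypothetical sketch of the canonicalization rule; not the PR's actual implementation.
function canonicalizeUsage(usage = {}) {
  const num = (v) => (Number.isFinite(v) ? v : 0);
  const cacheCreation = num(usage.cache_creation_input_tokens);

  // Claude-style payload: cache_read_input_tokens is present and the prompt
  // count EXCLUDES cache, so fold cache read + cache creation into prompt.
  if (usage.cache_read_input_tokens !== undefined) {
    const cacheRead = num(usage.cache_read_input_tokens);
    const prompt = num(usage.prompt_tokens ?? usage.input_tokens);
    const completion = num(usage.completion_tokens ?? usage.output_tokens);
    const promptTotal = prompt + cacheRead + cacheCreation;
    return {
      prompt_tokens: promptTotal,
      cached_tokens: cacheRead,
      cache_creation_input_tokens: cacheCreation,
      completion_tokens: completion,
      reasoning_tokens: num(usage.reasoning_tokens),
      total_tokens: promptTotal + completion,
    };
  }

  // OpenAI/Gemini-style (or already canonical) payload: prompt is already
  // inclusive, so values pass through unchanged.
  const prompt = num(usage.prompt_tokens);
  const completion = num(usage.completion_tokens);
  return {
    prompt_tokens: prompt,
    cached_tokens: num(usage.cached_tokens),
    cache_creation_input_tokens: cacheCreation,
    completion_tokens: completion,
    reasoning_tokens: num(usage.reasoning_tokens),
    total_tokens: num(usage.total_tokens) || prompt + completion,
  };
}

// Claude-style input: 50 uncached + 2400 cache read + 120 cache creation.
const canonical = canonicalizeUsage({
  prompt_tokens: 50,
  cache_read_input_tokens: 2400,
  cache_creation_input_tokens: 120,
  completion_tokens: 10,
});
console.log(canonical.prompt_tokens); // 2570
```

Because the canonical object no longer carries cache_read_input_tokens, a second pass takes the pass-through branch, which is what makes this sketch idempotent.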

Cost formula (open-sse/providers/pricing.js + src/lib/db/repos/usageRepo.js) now subtracts both cached and cache_creation from full-rate input:

const nonCachedInput = Math.max(0, prompt - cachedTokens - cacheCreationTokens);
cost += nonCachedInput        * pricing.input / 1e6
     + cachedTokens          * (pricing.cached         || pricing.input) / 1e6
     + cacheCreationTokens   * (pricing.cache_creation || pricing.input) / 1e6
     + output                * pricing.output / 1e6
     + reasoning             * (pricing.reasoning || pricing.output) / 1e6;
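A worked example may help check the arithmetic. The helper below mirrors the formula above on a canonical usage object; the function name and the pricing numbers are invented for illustration and are not taken from MODEL_PRICING. Rates are assumed to be per one million tokens, as the 1e6 divisor suggests.

```javascript
// Hypothetical helper mirroring the formula above; rates are per 1M tokens.
function costFromCanonical(tokens, pricing) {
  const prompt = tokens.prompt_tokens || 0;
  const cachedTokens = tokens.cached_tokens || 0;
  const cacheCreationTokens = tokens.cache_creation_input_tokens || 0;
  const output = tokens.completion_tokens || 0;
  const reasoning = tokens.reasoning_tokens || 0;

  // prompt is cache-inclusive, so both cache portions are removed from the
  // full-rate input before each is billed at its own rate.
  const nonCachedInput = Math.max(0, prompt - cachedTokens - cacheCreationTokens);

  return (
    nonCachedInput * pricing.input +
    cachedTokens * (pricing.cached || pricing.input) +
    cacheCreationTokens * (pricing.cache_creation || pricing.input) +
    output * pricing.output +
    reasoning * (pricing.reasoning || pricing.output)
  ) / 1e6;
}

// Made-up rates, for illustration only.
const examplePricing = { input: 3, cached: 0.3, cache_creation: 3.75, output: 15 };
const exampleTokens = {
  prompt_tokens: 2570,
  cached_tokens: 2400,
  cache_creation_input_tokens: 120,
  completion_tokens: 100,
};

// 50 * 3 + 2400 * 0.3 + 120 * 3.75 + 100 * 15 = 2820, divided by 1e6.
console.log(costFromCanonical(exampleTokens, examplePricing).toFixed(6)); // 0.002820
```

With the same tokens and a pricing entry that lacks cached and cache_creation, both cache portions fall back to the input rate, so the result equals billing all 2570 prompt tokens at full price.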

requestDetail.js now passes the canonical object through instead of stripping. extractUsageFromResponse also surfaces cachedContentTokenCount for Gemini (was dropped).

Provider coverage

| Provider | Cache extracted | Pricing has `cached` / `cache_creation` | Cost accurate |
| --- | --- | --- | --- |
| Anthropic (official) | `cache_read_input_tokens`, `cache_creation_input_tokens` | yes | yes |
| Anthropic Compatible | same as above (extractor is format-based, not provider-based) | yes (resolves via `MODEL_PRICING`/`PATTERN_PRICING`) | yes |
| OpenAI Chat (official) | `prompt_tokens_details.cached_tokens` | yes | yes |
| OpenAI Compatible | same | yes | yes |
| OpenAI Responses | `input_tokens_details.cached_tokens` | yes | yes |
| Gemini / Antigravity | `cachedContentTokenCount` | yes | yes |
| DeepSeek | `prompt_cache_hit_tokens` | yes | yes |
| Kiro (Amazon Q) | upstream does not expose cache fields today | n/a | n/a |
| Ollama / CommandCode | no cache concept | n/a | n/a |

Kiro executor + USAGE_EXTRACTORS.kiro are now defensive: if the upstream event shape grows cache_read_input_tokens / cache_creation_input_tokens / cachedTokens, they are picked up automatically — no second pass needed when Amazon Q starts emitting cache.
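A defensive pickup of this kind could look roughly like the sketch below. The function name and the inputTokens / outputTokens event fields are assumptions about the Kiro event shape, and mapping cachedTokens onto cached_tokens is likewise a guess. Only the three cache field names come from the description above.

```javascript
// Hypothetical sketch of a defensive extractor. The `inputTokens` / `outputTokens`
// names are assumptions about the Kiro event; only the three cache field names
// come from the PR description.
function extractKiroUsage(event = {}) {
  const usage = {
    prompt_tokens: Number(event.inputTokens) || 0,
    completion_tokens: Number(event.outputTokens) || 0,
  };

  // Copy cache fields only when the upstream actually sends numeric values,
  // so today's cache-less events produce the same object as before.
  for (const field of ["cache_read_input_tokens", "cache_creation_input_tokens"]) {
    if (typeof event[field] === "number") usage[field] = event[field];
  }
  if (typeof event.cachedTokens === "number") usage.cached_tokens = event.cachedTokens;

  return usage;
}

// Today: no cache fields, so nothing extra is emitted.
console.log(Object.keys(extractKiroUsage({ inputTokens: 10, outputTokens: 5 })).length); // 2
```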

84/84 entries in MODEL_PRICING already carry the cached and cache_creation fields, so the pricing UI/API needed no changes.

UI

Overview card: standalone Cached Tokens card (was a subline under Input).
Tokens table: new Cached column between Input and Output.
Cost table: new Cached Cost column between Input and Output. The server totalCost is rate-accurate, but the per-column split is a token-share allocation of that total: cachedCost = cachedTokens / totalTokens × totalCost. The split is therefore an approximation. If exact per-rate cached cost display is needed, the storage layer can be extended to return per-component cost.
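The token-share split for the Cost table can be sketched as below. The function name and argument order are hypothetical, and the sketch assumes totalTokens means cache-inclusive prompt tokens plus completion tokens, which the description does not spell out.

```javascript
// Hypothetical sketch of the display-side split; not the component's real code.
// Assumes promptTokens is cache-inclusive and totalTokens = prompt + completion.
function splitCostByTokenShare(promptTokens, cachedTokens, completionTokens, totalCost) {
  const totalTokens = promptTokens + completionTokens;
  if (totalTokens <= 0) return { inputCost: 0, cachedCost: 0, outputCost: 0 };

  // Each column gets the fraction of totalCost matching its share of tokens.
  const share = (n) => (n / totalTokens) * totalCost;
  return {
    inputCost: share(Math.max(0, promptTokens - cachedTokens)),
    cachedCost: share(cachedTokens),
    outputCost: share(completionTokens),
  };
}

// 1000 prompt tokens (600 of them cached) and 200 completion tokens.
const split = splitCostByTokenShare(1000, 600, 200, 0.012);
console.log(split.cachedCost.toFixed(4)); // 0.0060
```

The three columns always sum to totalCost, but the allocation ignores per-token rates. When the cached rate is lower than the input rate, the Cached Cost column overstates what the cache reads actually cost.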
Details tab: new Cached + Cache Creation columns in the list, matching the drawer fields.

Files

Backend:

open-sse/utils/usageTracking.js — canonicalizeUsage()
open-sse/providers/pricing.js — calculateCostFromTokens no double-count
open-sse/handlers/chatCore/requestDetail.js — canonicalize instead of strip; Gemini cachedContentTokenCount
open-sse/translator/concerns/usage.js — Kiro defensive pass-through
open-sse/executors/kiro.js — Kiro metricsEvent defensive cache pickup
src/lib/db/repos/usageRepo.js — calculateCost no double-count; aggregate cachedTokens through daily + 24h stats (totalCachedTokens, per-group cachedTokens)

UI:

src/app/(dashboard)/dashboard/usage/components/OverviewCards.js
src/app/(dashboard)/dashboard/usage/components/UsageTable.js
src/app/(dashboard)/dashboard/usage/components/RequestDetailsTab.js
src/shared/components/UsageStats.js

Tests:

tests/unit/cached-token-usage.test.js (new) — canonicalization + cost + Kiro forward-compat
tests/unit/cached-token-e2e.test.js (new) — end-to-end saveRequestUsage → getUsageStats asserts persisted cached_tokens, aggregated totalCachedTokens, correct cost

Verification

Unit + e2e tests green (13 new, no new regressions; baseline 68 fail → 59 fail with this work).
Production build passes.
Live /api/usage/stats now serves totalCachedTokens (top-level) and cachedTokens per byProvider / byModel group.
Seeded a Claude + OpenAI cache row in dev DB and confirmed the overview card shows Cached Tokens = 900, the Cost table shows the new column, and the Details tab shows Cached = 600 / Cache Creation = 50.

Historical rows written before this change have cached_tokens = null in usageHistory.tokens (the old strip path). They will continue to show 0 for cached. A one-shot backfill from requestDetails.providerResponse is possible but out of scope for this PR — easy to add as a follow-up script if needed.
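The aggregation behind totalCachedTokens and the per-group cachedTokens, including the "historical rows show 0" behaviour, might look like this. The row shape (provider, model, tokens) and the function name are assumptions made for the sketch and are not the usageRepo.js schema. The numbers are illustrative.

```javascript
// Hypothetical sketch of cached-token aggregation; the row shape is assumed.
function aggregateCachedTokens(rows) {
  const stats = { totalCachedTokens: 0, byProvider: {}, byModel: {} };

  for (const row of rows) {
    // Rows written before the change carry null/undefined cached_tokens
    // and therefore contribute 0.
    const cached = (row.tokens && row.tokens.cached_tokens) || 0;
    stats.totalCachedTokens += cached;

    const groups = [
      [stats.byProvider, row.provider],
      [stats.byModel, row.model],
    ];
    for (const [bucket, key] of groups) {
      if (!bucket[key]) bucket[key] = { cachedTokens: 0 };
      bucket[key].cachedTokens += cached;
    }
  }
  return stats;
}

const stats = aggregateCachedTokens([
  { provider: "claude", model: "model-a", tokens: { cached_tokens: 600 } },
  { provider: "openai", model: "model-b", tokens: { cached_tokens: 300 } },
  { provider: "openai", model: "model-b", tokens: { cached_tokens: null } }, // historical row
]);
console.log(stats.totalCachedTokens); // 900
```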

OpenAI/Anthropic Compatible nodes were hard-limited to one connection. Remove the guard so they hold a key pool; runtime getProviderCredentials already rotates/fails over across connections. Embedding nodes unchanged.

9router dropped cache tokens before persisting usage, so token counts and cost were wrong for cache-using providers. Two root causes fixed:
- saveUsageStats stripped tokens to {prompt,completion} before the DB write, so calculateCost never saw cached/cache_creation. Now canonicalizeUsage threads them through.
- Mixed conventions: Claude reports prompt EXCLUDING cache; OpenAI/Gemini INCLUDING it. canonicalizeUsage folds everything to one cache-inclusive convention, and the cost formula subtracts both cached + cache_creation from full-rate input (was double-counting cache_creation).

Cached tokens now persist, aggregate into daily/24h stats (totalCachedTokens + per-group cachedTokens), and surface in the dashboard (overview card, usage table column, request detail drawer).

- Cached Tokens as a standalone overview card (was a subline under Input).
- "Cached" column in the Tokens table; "Cached Cost" column in the Cost table (peeled out from input share via token-share allocation).
- "Cached" + "Cache Creation" columns in the Details tab list (matches drawer fields).
- Kiro executor + USAGE_EXTRACTORS.kiro: pass through cache_read_input_tokens / cache_creation_input_tokens / cachedTokens if the upstream event shape grows them. Amazon Q does not expose cache today; this keeps cost tracking working without a second pass.

Anthropic streaming splits usage across events: message_start carries input_tokens + cache_read_input_tokens + cache_creation_input_tokens, while message_delta carries only the final output_tokens. Three paths read cache solely from message_delta, so cache (and real input) were dropped and recorded as 0.
- extractUsage: add message_start branch + mergeUsage() field-wise max-merge so start/delta combine instead of clobbering
- stream.js: merge usage across events (passthrough + translate)
- claude-to-openai translator: capture message_start usage; delta falls back to it when cache fields are absent
- bypassHandler: merge message_start cache with message_delta output

hodtien · 2026-06-29T10:13:27Z

Update: fix cache tokens for Anthropic streaming (commit `b988021`)

Additional issue found

Anthropic streaming splits usage across two events:

message_start → usage: { input_tokens, cache_read_input_tokens, cache_creation_input_tokens }
message_delta → usage: { output_tokens } (output only, no input/cache)

Three code paths read cache only from message_delta, where Anthropic never sends cache fields, so the cache tokens and input_tokens were both lost (recorded as 0).

Fix

extractUsage: add a message_start branch plus a mergeUsage() helper (per-field max-merge) so that start and delta are combined instead of overwriting each other.
stream.js: merge usage across events (both the passthrough and translate paths).
claude-to-openai translator: capture usage from message_start; message_delta falls back to the captured cache values when the delta lacks those fields.
bypassHandler: merge the cache from message_start with the output from message_delta.
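A field-wise max-merge of the kind named above can be sketched as follows. This is an illustration of the technique and not the mergeUsage() from the commit: the signature and the choice to skip non-numeric fields are assumptions.

```javascript
// Hypothetical sketch of a field-wise max-merge; not the commit's actual helper.
function mergeUsage(previous = {}, next = {}) {
  const merged = { ...previous };
  for (const [field, value] of Object.entries(next)) {
    if (typeof value !== "number") continue; // ignore nested or missing values
    const current = typeof merged[field] === "number" ? merged[field] : 0;
    // Taking the max means a later event that omits or zeroes a field
    // cannot clobber a value captured from an earlier event.
    merged[field] = Math.max(current, value);
  }
  return merged;
}

// message_start carries input + cache; message_delta carries only output.
const fromStart = { input_tokens: 50, cache_read_input_tokens: 2400, cache_creation_input_tokens: 120 };
const fromDelta = { output_tokens: 30 };
const combined = mergeUsage(fromStart, fromDelta);
console.log(combined.input_tokens, combined.cache_read_input_tokens, combined.output_tokens); // 50 2400 30
```

Max-merge suits counters that only grow over a stream. It would be the wrong choice for a field that can legitimately decrease between events.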

Verification

Live test through provider xxxxx (anthropic-compatible): prompt_tokens is recorded correctly as 2413; before the fix it would have been 0 (because input only appears in message_start).
Unit test for an Anthropic stream with cache: prompt 50 + cache_read 2400 + cache_creation 120 → canonical 2570, with cached_tokens 2400 recorded correctly.
Full suite: no new regressions compared with the baseline.

hodtien added 3 commits June 29, 2026 16:21

hodtien mentioned this pull request Jun 29, 2026

feat(usage): track cached tokens + correct input/output/cache cost #2208

Closed

hodtien wants to merge 4 commits into decolua:master from hodtien:feature/cached-token-clean
