feat(usage): track cached tokens + correct input/output/cache cost#2209

Open
hodtien wants to merge 4 commits into decolua:master from hodtien:feature/cached-token-clean


hodtien commented Jun 29, 2026
Problem

The dashboard mis-reported both token counts and cost whenever an upstream returned cached tokens. Two distinct root causes:

  1. Cache fields stripped before persist. saveUsageStats in open-sse/handlers/chatCore/requestDetail.js collapsed tokens to {prompt_tokens, completion_tokens} before writing to usageHistory. calculateCost already knew how to price cached + cache_creation, but it never received those fields — so cached tokens were billed at the full input rate and the displayed input count was wrong.

  2. Mixed token conventions across providers. Claude reports prompt_tokens excluding cache_read_input_tokens / cache_creation_input_tokens. OpenAI Chat, OpenAI Responses, and Gemini report prompt_tokens including cached_tokens. The cost formula assumed inclusive (nonCachedInput = prompt − cached) so:

    • Claude was undercharged and its input count was wrong.
    • cache_creation_input_tokens was double-counted under the inclusive convention (prompt already contains it, then the cost formula added it again at its own rate).

Fix — one canonical convention

canonicalizeUsage() in open-sse/utils/usageTracking.js normalizes every provider to one shape before persist:

prompt_tokens               = total input INCLUDING cache read + cache creation
cached_tokens               = cache-read portion   (subset of prompt_tokens)
cache_creation_input_tokens = cache-write portion  (subset of prompt_tokens)
completion_tokens, reasoning_tokens, total_tokens

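Discriminator: the Claude path emits cache_read_input_tokens (prompt excludes cache), so cache read and cache creation are folded into prompt_tokens. OpenAI/Gemini emit cached_tokens (prompt already inclusive), so their values pass through unchanged. The function is idempotent: canonical output carries cached_tokens rather than cache_read_input_tokens, so a second pass changes nothing.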
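
The normalization described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in open-sse/utils/usageTracking.js: the `num` helper, the `total_tokens = prompt + completion` rule, and the made-up `completion_tokens: 30` are all hypothetical. The input numbers 50 / 2400 / 120 are the ones used in the PR's own unit test.

```javascript
// Hypothetical sketch of canonicalizeUsage(); the real implementation may differ.
function canonicalizeUsage(usage) {
  const u = usage || {};
  // Treat missing or non-numeric fields as 0 (assumption).
  const num = (v) => (Number.isFinite(v) ? v : 0);

  let prompt = num(u.prompt_tokens);
  let cached = num(u.cached_tokens);
  const cacheCreation = num(u.cache_creation_input_tokens);

  // Claude-style marker: cache_read_input_tokens present means prompt_tokens
  // EXCLUDES cache, so fold cache read + cache creation into it.
  if (u.cache_read_input_tokens != null) {
    cached = num(u.cache_read_input_tokens);
    prompt += cached + cacheCreation;
  }

  const completion = num(u.completion_tokens);
  return {
    prompt_tokens: prompt,
    cached_tokens: cached,
    cache_creation_input_tokens: cacheCreation,
    completion_tokens: completion,
    reasoning_tokens: num(u.reasoning_tokens),
    total_tokens: prompt + completion, // assumption: reasoning is not added on top
  };
}

// Claude-style usage: prompt excludes cache.
const claudeStyle = {
  prompt_tokens: 50,
  cache_read_input_tokens: 2400,
  cache_creation_input_tokens: 120,
  completion_tokens: 30, // made-up value for the example
};
const canonical = canonicalizeUsage(claudeStyle);
console.log(canonical.prompt_tokens, canonical.cached_tokens); // 2570 2400
```

Because the output no longer contains cache_read_input_tokens, feeding `canonical` back into the function takes the pass-through branch and returns the same values.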

Cost formula (open-sse/providers/pricing.js + src/lib/db/repos/usageRepo.js) now subtracts both cached and cache_creation from full-rate input:

const nonCachedInput = Math.max(0, prompt - cachedTokens - cacheCreationTokens);
cost += nonCachedInput      * pricing.input / 1e6
     + cachedTokens        * (pricing.cached         || pricing.input) / 1e6
     + cacheCreationTokens * (pricing.cache_creation || pricing.input) / 1e6
     + output              * pricing.output / 1e6
     + reasoning           * (pricing.reasoning || pricing.output) / 1e6;
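
As a worked example of the formula above, here is a small self-contained sketch. The function name `sketchCost` and the per-million-token prices are assumptions made up for illustration; they are not taken from MODEL_PRICING. The token counts reuse the canonical 2570 / 2400 / 120 case from the PR's unit test, with a made-up completion count.

```javascript
// Illustrative sketch only: sketchCost and the prices below are assumptions.
function sketchCost(tokens, pricing) {
  const prompt = tokens.prompt_tokens || 0;
  const cachedTokens = tokens.cached_tokens || 0;
  const cacheCreationTokens = tokens.cache_creation_input_tokens || 0;
  const output = tokens.completion_tokens || 0;
  const reasoning = tokens.reasoning_tokens || 0;

  // Under the canonical convention prompt INCLUDES cache read + cache creation,
  // so both are subtracted before applying the full input rate.
  const nonCachedInput = Math.max(0, prompt - cachedTokens - cacheCreationTokens);

  return (
    (nonCachedInput * pricing.input) / 1e6 +
    (cachedTokens * (pricing.cached || pricing.input)) / 1e6 +
    (cacheCreationTokens * (pricing.cache_creation || pricing.input)) / 1e6 +
    (output * pricing.output) / 1e6 +
    (reasoning * (pricing.reasoning || pricing.output)) / 1e6
  );
}

// Made-up prices in USD per 1M tokens.
const examplePricing = { input: 3, cached: 0.3, cache_creation: 3.75, output: 15 };
const exampleTokens = {
  prompt_tokens: 2570,
  cached_tokens: 2400,
  cache_creation_input_tokens: 120,
  completion_tokens: 100, // made-up value
};

// 50 * 3 + 2400 * 0.3 + 120 * 3.75 + 100 * 15 = 2820, divided by 1e6
console.log(sketchCost(exampleTokens, examplePricing).toFixed(5)); // 0.00282
```

Only 50 of the 2570 prompt tokens are billed at the full input rate here; each cache component is charged once, at its own rate.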

requestDetail.js now passes the canonical object through instead of stripping. extractUsageFromResponse also surfaces cachedContentTokenCount for Gemini (was dropped).

Provider coverage

| Provider | Cache extracted | Pricing has cached / cache_creation | Cost accurate |
|---|---|---|---|
| Anthropic (official) | cache_read_input_tokens, cache_creation_input_tokens | yes | yes |
| Anthropic Compatible | same as above (extractor is format-based, not provider-based) | yes (resolves via MODEL_PRICING/PATTERN_PRICING) | yes |
| OpenAI Chat (official) | prompt_tokens_details.cached_tokens | yes | yes |
| OpenAI Compatible | same | yes | yes |
| OpenAI Responses | input_tokens_details.cached_tokens | yes | yes |
| Gemini / Antigravity | cachedContentTokenCount | yes | yes |
| DeepSeek | prompt_cache_hit_tokens | yes | yes |
| Kiro (Amazon Q) | upstream does not expose cache fields today | | n/a |
| Ollama / CommandCode | no cache concept | | n/a |

Kiro executor + USAGE_EXTRACTORS.kiro are now defensive: if the upstream event shape grows cache_read_input_tokens / cache_creation_input_tokens / cachedTokens, they are picked up automatically — no second pass needed when Amazon Q starts emitting cache.
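
One way to picture this kind of defensive pickup is a helper that copies the three named fields only when they are present and numeric. This is a hypothetical sketch: `pickCacheFields` and the shape of the event and base usage objects are assumptions for illustration, not the actual Kiro executor code or the real Amazon Q event format.

```javascript
// Hypothetical helper: forward cache fields only if the upstream event has them.
const CACHE_KEYS = [
  "cache_read_input_tokens",
  "cache_creation_input_tokens",
  "cachedTokens",
];

function pickCacheFields(event) {
  const picked = {};
  for (const key of CACHE_KEYS) {
    const value = event?.[key];
    // Ignore absent, null, or non-numeric values so today's events are unaffected.
    if (Number.isFinite(value)) picked[key] = value;
  }
  return picked;
}

// Today: no cache fields, so nothing is added to the usage object.
const baseUsage = { prompt_tokens: 120, completion_tokens: 40 }; // made-up numbers
console.log({ ...baseUsage, ...pickCacheFields({}) });

// If the event shape later grows a cache field, it flows through unchanged.
console.log({ ...baseUsage, ...pickCacheFields({ cachedTokens: 80 }) });
```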

84/84 entries in MODEL_PRICING already carry the cached and cache_creation fields, so the pricing UI/API needed no changes.

UI

  • Overview card: standalone Cached Tokens card (was a subline under Input).
  • Tokens table: new Cached column between Input and Output.
  • Cost table: new Cached Cost column between Input and Output. The split is a token-share allocation of the server-computed totalCost (which is itself rate-accurate): cachedCost = cachedTokens / totalTokens × totalCost. If an exact per-rate cached cost is needed in the UI, the storage layer can be extended to return per-component cost.
  • Details tab: new Cached + Cache Creation columns in the list, matching the drawer fields.
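
The token-share allocation used for the Cached Cost column can be sketched as below. The function name `cachedCostShare`, the zero-guard, and the sample numbers are assumptions for illustration; only the formula cachedCost = cachedTokens / totalTokens × totalCost comes from the description above.

```javascript
// Hypothetical sketch of the UI-side split; not the actual UsageTable.js code.
function cachedCostShare(cachedTokens, totalTokens, totalCost) {
  // Guard against empty rows so the UI never shows NaN or Infinity.
  if (!Number.isFinite(totalTokens) || totalTokens <= 0) return 0;
  return (cachedTokens / totalTokens) * totalCost;
}

// Made-up row: 900 of 1200 tokens were cache reads, server totalCost is $0.004.
const share = cachedCostShare(900, 1200, 0.004);
console.log(share.toFixed(4)); // 0.0030
```

This is a display approximation: the total stays rate-accurate, but when cached tokens are priced below the input rate, a proportional share will tend to overstate the cached portion. That is the gap a per-component cost from the storage layer would close.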

Files

Backend:

  • open-sse/utils/usageTracking.js: canonicalizeUsage()
  • open-sse/providers/pricing.js: calculateCostFromTokens no longer double-counts
  • open-sse/handlers/chatCore/requestDetail.js: canonicalize instead of strip; Gemini cachedContentTokenCount
  • open-sse/translator/concerns/usage.js: Kiro defensive pass-through
  • open-sse/executors/kiro.js: Kiro metricsEvent defensive cache pickup
  • src/lib/db/repos/usageRepo.js: calculateCost no longer double-counts; aggregate cachedTokens through daily + 24h stats (totalCachedTokens, per-group cachedTokens)

UI:

  • src/app/(dashboard)/dashboard/usage/components/OverviewCards.js
  • src/app/(dashboard)/dashboard/usage/components/UsageTable.js
  • src/app/(dashboard)/dashboard/usage/components/RequestDetailsTab.js
  • src/shared/components/UsageStats.js

Tests:

  • tests/unit/cached-token-usage.test.js (new) — canonicalization + cost + Kiro forward-compat
  • tests/unit/cached-token-e2e.test.js (new) — end-to-end saveRequestUsage → getUsageStats asserts persisted cached_tokens, aggregated totalCachedTokens, correct cost

Verification

  • Unit + e2e tests green (13 new, no new regressions; baseline 68 fail → 59 fail with this work).
  • Production build passes.
  • Live /api/usage/stats now serves totalCachedTokens (top-level) and cachedTokens per byProvider / byModel group.
  • Seeded a Claude + OpenAI cache row in dev DB and confirmed the overview card shows Cached Tokens = 900, the Cost table shows the new column, and the Details tab shows Cached = 600 / Cache Creation = 50.

Historical rows written before this change have cached_tokens = null in usageHistory.tokens (the old strip path). They will continue to show 0 for cached. A one-shot backfill from requestDetails.providerResponse is possible but out of scope for this PR — easy to add as a follow-up script if needed.

hodtien added 3 commits June 29, 2026 16:21
OpenAI/Anthropic Compatible nodes were hard-limited to one connection.
Remove the guard so they hold a key pool; runtime getProviderCredentials
already rotates/fails over across connections. Embedding nodes unchanged.

9router dropped cache tokens before persisting usage, so token counts and
cost were wrong for cache-using providers. Two root causes fixed:

- saveUsageStats stripped tokens to {prompt,completion} before the DB write,
  so calculateCost never saw cached/cache_creation. Now canonicalizeUsage
  threads them through.
- Mixed conventions: Claude reports prompt EXCLUDING cache; OpenAI/Gemini
  INCLUDING it. canonicalizeUsage folds everything to one cache-inclusive
  convention, and the cost formula subtracts both cached + cache_creation
  from full-rate input (was double-counting cache_creation).

Cached tokens now persist, aggregate into daily/24h stats (totalCachedTokens
+ per-group cachedTokens), and surface in the dashboard (overview card,
usage table column, request detail drawer).

- Cached Tokens as a standalone overview card (was a subline under Input).
- "Cached" column in the Tokens table; "Cached Cost" column in the Cost
  table (peeled out from input share via token-share allocation).
- "Cached" + "Cache Creation" columns in the Details tab list (matches
  drawer fields).
- Kiro executor + USAGE_EXTRACTORS.kiro: pass through cache_read_input_tokens /
  cache_creation_input_tokens / cachedTokens if the upstream event shape
  grows them. Amazon Q does not expose cache today; this keeps cost tracking
  working without a second pass.

Anthropic streaming splits usage across events: message_start carries
input_tokens + cache_read_input_tokens + cache_creation_input_tokens,
while message_delta carries only the final output_tokens. Three paths
read cache solely from message_delta, so cache (and real input) were
dropped — recorded as 0.

- extractUsage: add message_start branch + mergeUsage() field-wise
  max-merge so start/delta combine instead of clobbering
- stream.js: merge usage across events (passthrough + translate)
- claude-to-openai translator: capture message_start usage; delta
  falls back to it when cache fields are absent
- bypassHandler: merge message_start cache with message_delta output

hodtien commented Jun 29, 2026

Update: fix cache tokens for Anthropic streaming (commit b988021)

Additional problem found

Anthropic streaming splits usage across two events:

  • message_start: usage: { input_tokens, cache_read_input_tokens, cache_creation_input_tokens }
  • message_delta: usage: { output_tokens } (output only, no input or cache fields)

Three code paths read cache only from message_delta, where Anthropic never sends cache fields, so the cache tokens and input_tokens were both lost (recorded as 0).

Fix

  • extractUsage: add a message_start branch plus a mergeUsage() helper (field-wise max-merge) so start and delta are combined instead of overwriting each other.
  • stream.js: merge usage across events (both the passthrough and translate paths).
  • claude-to-openai translator: capture usage from message_start; message_delta falls back to the captured cache values when the delta lacks those fields.
  • bypassHandler: merge the cache from message_start with the output from message_delta.

Verification

  • Live test through provider xxxxx (anthropic-compatible): prompt_tokens is recorded correctly as 2413; before the fix it would have been 0 (because input is carried only by message_start).
  • Unit test for an Anthropic stream with cache: prompt 50 + cache_read 2400 + cache_creation 120 → canonical 2570, with cached_tokens recorded correctly as 2400.
  • Full suite: 0 new regressions against the baseline.
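
The field-wise max-merge from the fix above can be sketched as follows. This is a minimal illustration, not the PR's code: the body of `mergeUsage`, the `usageFromEvent` helper, and the output count of 87 are assumptions. The sketch reads message_start usage from `event.message.usage` and message_delta usage from `event.usage`, which is how Anthropic's streaming events nest them; the input numbers match the unit test above.

```javascript
// Hypothetical field-wise max-merge: keep the larger numeric value per field
// so a later event cannot clobber a field it does not carry.
function mergeUsage(prev, next) {
  const merged = { ...(prev || {}) };
  for (const [key, value] of Object.entries(next || {})) {
    if (!Number.isFinite(value)) continue;
    const current = Number.isFinite(merged[key]) ? merged[key] : 0;
    merged[key] = Math.max(current, value);
  }
  return merged;
}

// Hypothetical helper: pull the usage object out of a streaming event.
function usageFromEvent(event) {
  if (event.type === "message_start") return event.message?.usage;
  if (event.type === "message_delta") return event.usage;
  return undefined; // other events carry no usage
}

const events = [
  {
    type: "message_start",
    message: {
      usage: {
        input_tokens: 50,
        cache_read_input_tokens: 2400,
        cache_creation_input_tokens: 120,
        output_tokens: 1,
      },
    },
  },
  { type: "content_block_delta", delta: { type: "text_delta", text: "hi" } },
  { type: "message_delta", delta: { stop_reason: "end_turn" }, usage: { output_tokens: 87 } },
];

const streamUsage = events.reduce((acc, ev) => mergeUsage(acc, usageFromEvent(ev)), {});
console.log(streamUsage.input_tokens, streamUsage.cache_read_input_tokens, streamUsage.output_tokens); // 50 2400 87
```

Taking the maximum per field works here because each counter only grows over a stream: the input and cache fields arrive once in message_start, and the final output_tokens in message_delta supersedes the placeholder value sent at the start.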
