feat(usage): track cached tokens + correct input/output/cache cost #2209
Open
hodtien wants to merge 4 commits into
added 3 commits
June 29, 2026 16:21
OpenAI/Anthropic Compatible nodes were hard-limited to one connection. Remove the guard so they hold a key pool; runtime getProviderCredentials already rotates/fails over across connections. Embedding nodes unchanged.
9router dropped cache tokens before persisting usage, so token counts and
cost were wrong for cache-using providers. Two root causes fixed:
- saveUsageStats stripped tokens to {prompt,completion} before the DB write,
so calculateCost never saw cached/cache_creation. Now canonicalizeUsage
threads them through.
- Mixed conventions: Claude reports prompt EXCLUDING cache; OpenAI/Gemini
INCLUDING it. canonicalizeUsage folds everything to one cache-inclusive
convention, and the cost formula subtracts both cached + cache_creation
from full-rate input (was double-counting cache_creation).
Cached tokens now persist, aggregate into daily/24h stats (totalCachedTokens
+ per-group cachedTokens), and surface in the dashboard (overview card,
usage table column, request detail drawer).
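The cache-inclusive convention described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `canonicalizeUsageSketch` and the output field `cache_creation_tokens` are assumptions, and treating either Claude-style field as the discriminator is a simplification.

```javascript
// Hypothetical sketch; the real canonicalizeUsage() lives in
// open-sse/utils/usageTracking.js and may differ in names and details.
function canonicalizeUsageSketch(usage = {}) {
  const completion = usage.completion_tokens ?? 0;

  // Claude-style usage: prompt_tokens EXCLUDES cache reads and cache creation.
  const isClaudeShape =
    usage.cache_read_input_tokens !== undefined ||
    usage.cache_creation_input_tokens !== undefined;

  if (isClaudeShape) {
    const cached = usage.cache_read_input_tokens ?? 0;
    const creation = usage.cache_creation_input_tokens ?? 0;
    return {
      // Fold cache into prompt so the result is cache-inclusive.
      prompt_tokens: (usage.prompt_tokens ?? 0) + cached + creation,
      completion_tokens: completion,
      cached_tokens: cached,
      cache_creation_tokens: creation, // assumed output field name
    };
  }

  // OpenAI/Gemini-style usage: prompt_tokens already INCLUDES cached tokens.
  return {
    prompt_tokens: usage.prompt_tokens ?? 0,
    completion_tokens: completion,
    cached_tokens: usage.cached_tokens ?? 0,
    cache_creation_tokens: usage.cache_creation_tokens ?? 0,
  };
}

// Usage: a Claude-style record is folded, an OpenAI-style record passes through.
const claudeCanonical = canonicalizeUsageSketch({
  prompt_tokens: 100,
  completion_tokens: 20,
  cache_read_input_tokens: 600,
  cache_creation_input_tokens: 50,
});
console.log(claudeCanonical.prompt_tokens); // 750 = 100 + 600 + 50
```

Because the output no longer carries the Claude-style field names, running the sketch on its own output takes the pass-through branch and returns the same values, which is one way to get the idempotence the commit relies on.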
- Cached Tokens as a standalone overview card (was a subline under Input).
- "Cached" column in the Tokens table; "Cached Cost" column in the Cost table (peeled out from input share via token-share allocation).
- "Cached" + "Cache Creation" columns in the Details tab list (matches drawer fields).
- Kiro executor + USAGE_EXTRACTORS.kiro: pass through cache_read_input_tokens / cache_creation_input_tokens / cachedTokens if the upstream event shape grows them. Amazon Q does not expose cache today; this keeps cost tracking working without a second pass.
Anthropic streaming splits usage across events: message_start carries input_tokens + cache_read_input_tokens + cache_creation_input_tokens, while message_delta carries only the final output_tokens. Three paths read cache solely from message_delta, so cache (and real input) were dropped and recorded as 0.
- extractUsage: add message_start branch + mergeUsage() field-wise max-merge so start/delta combine instead of clobbering
- stream.js: merge usage across events (passthrough + translate)
- claude-to-openai translator: capture message_start usage; delta falls back to it when cache fields are absent
- bypassHandler: merge message_start cache with message_delta output
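A field-wise max-merge of the kind this commit describes could look like the sketch below. It is an illustration under assumptions, not the PR's `mergeUsage()`: the name `mergeUsageSketch` and the choice to ignore non-numeric fields are hypothetical, and the event payloads are made-up numbers in the Anthropic usage shape.

```javascript
// Hypothetical sketch of a field-wise max-merge; the PR's mergeUsage() may differ.
function mergeUsageSketch(previous = {}, next = {}) {
  const merged = { ...previous };
  for (const [field, value] of Object.entries(next)) {
    if (typeof value !== "number") continue; // assumption: only numeric counters are merged
    const prior = typeof merged[field] === "number" ? merged[field] : 0;
    // Keep the larger value so a later event never clobbers an earlier count with 0.
    merged[field] = Math.max(prior, value);
  }
  return merged;
}

// Usage: message_start carries input + cache, message_delta carries the final output.
const fromMessageStart = {
  input_tokens: 12,
  cache_read_input_tokens: 600,
  cache_creation_input_tokens: 50,
  output_tokens: 1,
};
const fromMessageDelta = { output_tokens: 240 };

const streamUsage = mergeUsageSketch(fromMessageStart, fromMessageDelta);
console.log(streamUsage);
```

Taking the maximum per field makes the merge independent of event order and safe to apply repeatedly, at the cost of assuming counters only ever grow within one response.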
Author
Update: fix cache tokens for Anthropic streaming (commit b988021)
Additional problem found
Anthropic streaming splits usage across 2 events: message_start and message_delta.
3 code paths read cache only from message_delta.
Fix
Verification
Problem
The dashboard mis-reported both token counts and cost whenever an upstream returned cached tokens. Two distinct root causes:
1. Cache fields stripped before persist. `saveUsageStats` in `open-sse/handlers/chatCore/requestDetail.js` collapsed `tokens` to `{prompt_tokens, completion_tokens}` before writing to `usageHistory`. `calculateCost` already knew how to price cached + cache_creation, but it never received those fields, so cached tokens were billed at the full input rate and the displayed input count was wrong.
2. Mixed token conventions across providers. Claude reports `prompt_tokens` excluding `cache_read_input_tokens` / `cache_creation_input_tokens`. OpenAI Chat, OpenAI Responses, and Gemini report `prompt_tokens` including `cached_tokens`. The cost formula assumed inclusive (`nonCachedInput = prompt − cached`) so:
   - `cache_creation_input_tokens` was double-counted under the inclusive convention (`prompt` already contains it, then the cost formula added it again at its own rate).

Fix: one canonical convention
`canonicalizeUsage()` in `open-sse/utils/usageTracking.js` normalizes every provider to one shape before persist.

Discriminator: the Claude path emits `cache_read_input_tokens` (prompt excludes cache) → fold into prompt. OpenAI/Gemini emit `cached_tokens` (prompt already inclusive) → pass through. Idempotent.

Cost formula (`open-sse/providers/pricing.js` + `src/lib/db/repos/usageRepo.js`) now subtracts both cached and cache_creation from full-rate input.

`requestDetail.js` now passes the canonical object through instead of stripping. `extractUsageFromResponse` also surfaces `cachedContentTokenCount` for Gemini (was dropped).

Provider coverage
Cache fields read per upstream format, priced via the `cached` / `cache_creation` rates (`MODEL_PRICING` / `PATTERN_PRICING`):
- Claude: `cache_read_input_tokens`, `cache_creation_input_tokens`
- OpenAI Chat: `prompt_tokens_details.cached_tokens`
- OpenAI Responses: `input_tokens_details.cached_tokens`
- Gemini: `cachedContentTokenCount`
- `prompt_cache_hit_tokens`

Kiro executor + `USAGE_EXTRACTORS.kiro` are now defensive: if the upstream event shape grows `cache_read_input_tokens` / `cache_creation_input_tokens` / `cachedTokens`, they are picked up automatically, so no second pass is needed when Amazon Q starts emitting cache.

84/84 entries in `MODEL_PRICING` already carry the `cached` and `cache_creation` fields, so the pricing UI/API needed no changes.

UI
- Overview: standalone `Cached Tokens` card (was a subline under Input).
- Tokens table: `Cached` column between Input and Output.
- Cost table: `Cached Cost` column between Input and Output. Cost split is a token-share allocation of the server `totalCost` (rate-accurate): `cachedCost = cachedTokens / totalTokens × totalCost`. If exact per-rate cached cost display is needed, the storage layer can be extended to return per-component cost.
- Details tab: `Cached` + `Cache Creation` columns in the list, matching the drawer fields.

Files
Backend:
- `open-sse/utils/usageTracking.js`: `canonicalizeUsage()`
- `open-sse/providers/pricing.js`: `calculateCostFromTokens` no double-count
- `open-sse/handlers/chatCore/requestDetail.js`: canonicalize instead of strip; Gemini `cachedContentTokenCount`
- `open-sse/translator/concerns/usage.js`: Kiro defensive pass-through
- `open-sse/executors/kiro.js`: Kiro metricsEvent defensive cache pickup
- `src/lib/db/repos/usageRepo.js`: `calculateCost` no double-count; aggregate `cachedTokens` through daily + 24h stats (`totalCachedTokens`, per-group `cachedTokens`)

UI:
- `src/app/(dashboard)/dashboard/usage/components/OverviewCards.js`
- `src/app/(dashboard)/dashboard/usage/components/UsageTable.js`
- `src/app/(dashboard)/dashboard/usage/components/RequestDetailsTab.js`
- `src/shared/components/UsageStats.js`

Tests:
- `tests/unit/cached-token-usage.test.js` (new): canonicalization + cost + Kiro forward-compat
- `tests/unit/cached-token-e2e.test.js` (new): end-to-end saveRequestUsage → getUsageStats asserts persisted `cached_tokens`, aggregated `totalCachedTokens`, correct cost

Verification
- `/api/usage/stats` now serves `totalCachedTokens` (top-level) and `cachedTokens` per `byProvider` / `byModel` group.
- The overview card shows `Cached Tokens = 900`, the Cost table shows the new column, and the Details tab shows `Cached = 600 / Cache Creation = 50`.

Historical rows written before this change have `cached_tokens = null` in `usageHistory.tokens` (the old strip path). They will continue to show 0 for cached. A one-shot backfill from `requestDetails.providerResponse` is possible but out of scope for this PR; it is easy to add as a follow-up script if needed.
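For reviewers, the cost rule from the Fix section and the token-share split from the UI section can be sketched as below. This is a hedged illustration rather than the code in `pricing.js` or `usageRepo.js`: the function names, the `input` / `output` rate keys, the per-1M-token unit, the `cache_creation_tokens` field name, and the rates themselves are all assumptions. Only the `cached` / `cache_creation` rate keys and the allocation formula come from this description.

```javascript
// Illustrative rates in USD per 1M tokens; NOT taken from MODEL_PRICING.
const EXAMPLE_PRICING = { input: 3, output: 15, cached: 0.3, cache_creation: 3.75 };

// Hypothetical cost function over a canonical (cache-inclusive) token object.
function costFromCanonicalTokens(tokens, pricing) {
  const prompt = tokens.prompt_tokens ?? 0;
  const cached = tokens.cached_tokens ?? 0;
  const creation = tokens.cache_creation_tokens ?? 0; // assumed field name
  const completion = tokens.completion_tokens ?? 0;

  // Subtract BOTH cached and cache_creation from full-rate input, so neither
  // is billed twice (once inside prompt, once at its own rate).
  const nonCachedInput = Math.max(0, prompt - cached - creation);

  const perMillion =
    nonCachedInput * pricing.input +
    cached * pricing.cached +
    creation * pricing.cache_creation +
    completion * pricing.output;
  return perMillion / 1e6;
}

// Token-share allocation used for display: cachedTokens / totalTokens × totalCost.
function allocateCachedCost(cachedTokens, totalTokens, totalCost) {
  if (!totalTokens) return 0; // avoid dividing by zero on empty rows
  return (cachedTokens / totalTokens) * totalCost;
}

// Usage: 750 inclusive prompt tokens, of which 600 cached and 50 cache creation.
const exampleTokens = {
  prompt_tokens: 750,
  completion_tokens: 20,
  cached_tokens: 600,
  cache_creation_tokens: 50,
};
const exampleCost = costFromCanonicalTokens(exampleTokens, EXAMPLE_PRICING);
console.log(exampleCost, allocateCachedCost(600, 770, exampleCost));
```

The allocation is proportional to token counts, not rates, so the displayed cached cost will generally differ from `cached × cachedRate`; the total is still the rate-accurate server value.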