perf: add 3-layer early date filtering to skip irrelevant files and entries #869

Open
pbuchman wants to merge 3 commits into ryoppippi:main from pbuchman:perf/early-date-filtering

@pbuchman pbuchman commented Feb 27, 2026

Summary

  • Add file-level mtime pre-filter to skip JSONL files that can't contain entries for the requested date range
  • Add line-level substring pre-check (USAGE_LINE_MARKER) to skip JSON.parse on non-usage lines
  • Add entry-level early date skip to avoid cost calculation for out-of-range entries
  • Apply optimizations to loadDailyUsageData, loadSessionData, and loadSessionBlockData

Problem

When --since / --until are provided, ccusage currently:

  1. Globs all JSONL files
  2. Opens every file to extract its earliest timestamp (for sorting)
  3. Reads and parses every line of every file
  4. Calculates cost for every entry
  5. Groups and aggregates all entries
  6. Then filters by date range — discarding most of the work

This means a single-day query takes the same time as a full scan.

Benchmarks

Machine: Apple M2 Pro, 32 GB RAM, macOS 15.6
Dataset: 4,637 JSONL files, 572 MB total (largest file: 469 MB)

| Query | Before | After | Improvement |
| --- | --- | --- | --- |
| Today only (--since 20260227 --until 20260227) | 36.7s | 0.76s | 98% |
| Last 7 days (--since 20260220) | 36.7s | 3.0s | 92% |
| Full scan (no filters) | 36.7s | 31.6s | 14% |

The mtime filter provides the biggest wins for short date ranges (the most common use case for integrations like OpenUsage). The substring pre-check also improves full-scan performance by ~14% by skipping JSON.parse on non-usage lines.

How it works

1. File-level mtime pre-filter (filterFilesByMtime)

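Before reading any file content, filterFilesByMtime checks each file's mtime against the --since date, with a 1-day buffer for timezone safety. Files whose mtime is older than since - 1 day are skipped entirely. Only --since is used here: a file modified after --until may still contain older entries, so --until cannot rule a file out by mtime alone.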
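
A minimal sketch of such a pre-filter, not the PR's actual code. It assumes `since` is a YYYYMMDD string interpreted as UTC midnight; the names `mtimeCutoffMs` and `filterFilesByMtimeSketch` and the injectable `statFn` parameter are hypothetical and exist only to make the example testable.

```typescript
import { stat } from 'node:fs/promises';

const ONE_DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: turn a YYYYMMDD string into a cutoff timestamp one day
// before that date (UTC assumed). Returns null when no filter should apply.
// Calendar validity (e.g. 20230231) is not checked in this sketch.
function mtimeCutoffMs(since?: string): number | null {
	if (since == null || !/^\d{8}$/.test(since)) {
		return null;
	}
	const year = Number(since.slice(0, 4));
	const month = Number(since.slice(4, 6));
	const day = Number(since.slice(6, 8));
	return Date.UTC(year, month - 1, day) - ONE_DAY_MS;
}

type StatFn = (path: string) => Promise<{ mtimeMs: number }>;

// Keep files modified on or after the cutoff; keep files whose stat() fails.
async function filterFilesByMtimeSketch(
	files: string[],
	since?: string,
	statFn: StatFn = async (path) => stat(path),
): Promise<string[]> {
	const cutoff = mtimeCutoffMs(since);
	if (cutoff == null) {
		return files;
	}
	const keep = await Promise.all(
		files.map(async (file) => {
			try {
				const { mtimeMs } = await statFn(file);
				return mtimeMs >= cutoff;
			}
			catch {
				// A failed stat() (e.g. permissions) must not drop data.
				return true;
			}
		}),
	);
	return files.filter((_, i) => keep[i] === true);
}
```

Because the check needs only file metadata, a stale file costs one stat() call rather than a full read and parse.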

2. Line-level substring pre-check (USAGE_LINE_MARKER)

Before calling JSON.parse on each line, checks whether the line contains the "costUSD" substring (the USAGE_LINE_MARKER). In typical JSONL files, ~91.5% of lines are non-usage entries (conversation turns, system messages, etc.) that don't contain cost data. This skips expensive JSON parsing for those lines entirely.
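
The pre-check can be sketched roughly as follows. The marker value and the function name `parseUsageLines` are assumptions for illustration; the PR's real constant may include the surrounding quotes or differ in other ways.

```typescript
// Assumed marker value; the actual USAGE_LINE_MARKER may differ.
const USAGE_LINE_MARKER = 'costUSD';

// Parse only lines that could be usage entries. A cheap substring scan
// rejects most lines before the comparatively expensive JSON.parse runs.
function parseUsageLines(lines: string[]): unknown[] {
	const parsed: unknown[] = [];
	for (const line of lines) {
		if (!line.includes(USAGE_LINE_MARKER)) {
			continue;
		}
		try {
			parsed.push(JSON.parse(line));
		}
		catch {
			// Malformed JSON is skipped rather than aborting the whole file.
		}
	}
	return parsed;
}
```

A line that merely mentions the marker inside message text still passes the check, so the substring test is a pre-filter only; schema validation after parsing remains necessary.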

3. Entry-level early date skip

Inside the per-line processing loop, after formatDate but before the expensive calculateCostForEntry, the loader compares the entry's date against since/until and skips out-of-range entries. It uses the same comparison logic as the existing filterByDateRange in _date-utils.ts.
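
The ordering matters more than the comparison itself. The sketch below is illustrative only: `Entry`, `sumCostInRange`, and the `costOf` callback are hypothetical stand-ins for the loader's entry type and calculateCostForEntry, and slicing the ISO timestamp stands in for formatDate (which also handles timezones).

```typescript
interface Entry {
	timestamp: string; // ISO 8601, e.g. "2026-02-27T10:00:00Z"
	tokens: number;
}

// Compare a YYYY-MM-DD date against optional YYYYMMDD bounds.
function isOutsideDateRange(date: string, since?: string, until?: string): boolean {
	const compact = date.substring(0, 10).replace(/-/g, '');
	if (since != null && since !== '' && compact < since) {
		return true;
	}
	if (until != null && until !== '' && compact > until) {
		return true;
	}
	return false;
}

// Sum costs, running the date check before the (assumed expensive) cost
// callback so out-of-range entries never reach it.
function sumCostInRange(
	entries: Entry[],
	costOf: (entry: Entry) => number,
	since?: string,
	until?: string,
): number {
	let total = 0;
	for (const entry of entries) {
		const date = entry.timestamp.slice(0, 10); // stand-in for formatDate
		if (isOutsideDateRange(date, since, until)) {
			continue;
		}
		total += costOf(entry);
	}
	return total;
}
```

With no `since` or `until`, every entry passes the check, which matches the claim that behavior is unchanged when the flags are absent.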

Backwards compatibility

This is a backwards-compatible change:

  • mtime filtering uses a 1-day buffer to account for timezone edge cases
  • Substring pre-check only skips lines that can't possibly be usage entries
  • Entry-level skip uses identical comparison logic to the existing filterByDateRange
  • No behavioral change when --since/--until are not provided
  • Files that fail stat() are kept (not filtered), ensuring no data loss on permission issues
  • All existing tests continue to pass

Context

Discovered while debugging slow load times in OpenUsage, a menu bar app that uses ccusage to display Claude Code token usage. With 15-second timeouts per runner, ccusage always times out on machines with heavy usage, causing 60+ second delays. See: robinebers/openusage#249

Testing

  • Added unit tests for parseSinceDate (YYYYMMDD parsing, edge cases)
  • Added unit tests for filterFilesByMtime (mtime edge cases, stat errors, buffer behavior)
  • Added tests for USAGE_LINE_MARKER substring pre-check (true/false positives, edge cases)
  • Added tests for entry-level early skip (correctness parity with existing post-filter)
  • Added integration tests with multi-date fixtures and mtime simulation via utimes()
  • All 269 tests pass: pnpm run format && pnpm typecheck && pnpm run test

Summary by CodeRabbit

  • New Features

    • Date-range (since/until) support added to data loading options.
  • Improvements

    • Pre-filters files by modification time and applies date-range skipping earlier to reduce work and speed processing.
    • Skips irrelevant lines before parsing and ensures deduplication and cost calculations respect date filters and timezones.
  • Tests

    • Expanded coverage for date parsing, mtime filtering, line-level skipping, and end-to-end date/time behavior.

coderabbitai bot commented Feb 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds date-range parsing and mtime-based pre-filtering to the ccusage data loader: new helpers (parseSinceDate, filterFilesByMtime, isOutsideDateRange, USAGE_LINE_MARKER), a DateFilter type merged into exported LoadOptions, and entry-level skipping in daily/session/block loaders; tests and package/manifest touched.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Data loader: apps/ccusage/src/data-loader.ts | Added parseSinceDate, filterFilesByMtime, isOutsideDateRange, and USAGE_LINE_MARKER; introduced DateFilter and extended exported LoadOptions with since/until; applied mtime pre-filtering before sorting and per-entry date-range skipping across loadDailyUsageData, loadSessionData, and loadSessionBlockData; updated deduplication/cost paths to respect date filters. |
| Tests: apps/ccusage/...tests* | Added unit tests for parseSinceDate, filterFilesByMtime, isOutsideDateRange, and USAGE_LINE_MARKER checks; adjusted end-to-end tests for mtime pre-filtering and timezone-aware date-range behavior across daily/session/block flows. |
| Manifests / Metadata: manifest_file, package.json | Updated project manifest/package metadata (lines changed include dependency or metadata adjustments associated with the changes). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • ryoppippi

Poem

🐰 I hop through logs at morning light,
Sniffing mtimes, skipping those not right.
Since and until guide my tidy quest,
Fresh lines I keep, the stale ones rest.
Carrots, dates, and neat data — what a sight! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a 3-layer date filtering mechanism to optimize performance by skipping irrelevant files and entries. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
apps/ccusage/src/data-loader.ts (1)

853-860: Consolidate duplicated entry-level date skip logic

The same dateCompact + since/until checks are repeated in three loaders. Extracting one helper reduces drift risk and keeps behavior aligned with future date-range changes.

♻️ Suggested consolidation
+function isOutsideDateRange(date: string, since?: string, until?: string): boolean {
+	const dateCompact = date.substring(0, 10).replace(/-/g, '');
+	if (since != null && since !== '' && dateCompact < since) {
+		return true;
+	}
+	if (until != null && until !== '' && dateCompact > until) {
+		return true;
+	}
+	return false;
+}
-const dateCompact = entryDate.substring(0, 10).replace(/-/g, '');
-if (options?.since != null && dateCompact < options.since) {
-	return;
-}
-if (options?.until != null && dateCompact > options.until) {
+if (isOutsideDateRange(entryDate, options?.since, options?.until)) {
 	return;
 }

Also applies to: 1031-1039, 1470-1478

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/ccusage/src/data-loader.ts` around lines 853 - 860, Extract the
duplicated date-range skip logic into a single helper (e.g., isOutsideDateRange
or shouldSkipByDate) that accepts the raw entry date string and the loader
options and returns a boolean indicating whether to skip; inside the helper
compute dateCompact = date.substring(0,10).replace(/-/g,'') and check
options?.since and options?.until with proper null/undefined guards, then
replace the three inline blocks (the occurrences that compute dateCompact and
compare to options.since/options.until) by calling this helper and returning
when it indicates the entry is out of range so all loaders (the three places
shown) share the same behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/ccusage/src/data-loader.ts`:
- Around line 733-740: The current check in filterFilesByMtime (and the similar
block at the later occurrence) only tests for null/undefined (since == null) so
an empty string ("") is treated as a valid cutoff and leads to invalid
parse/comparisons; update the guard to treat empty or whitespace-only since
values as "no filter" (e.g. check if !since or since.trim() === "" before
returning files) and only call parseSinceDate when since is non-empty, ensuring
mtime filtering runs correctly.
- Around line 722-723: The functions parseSinceDate and filterFilesByMtime are
internal-only and should not be part of the module API; remove the export
keyword from their declarations (i.e., change "export function parseSinceDate"
and "export function filterFilesByMtime" to non-exported function declarations)
so they remain usable inside data-loader.ts and by in-file vitest blocks but are
not exported to other modules; verify there are no external imports relying on
these symbols and run type-check/tests.
- Line 16: Tests in apps/ccusage/src/data-loader.ts use forbidden dynamic `await
import()` inside describe callbacks; move those modules to top-level imports
(add `import { utimes } from 'node:fs/promises' to the existing import on line
16 and add `import { createFixture } from 'fs-fixture'` at top), then remove
`async` from the affected describe callbacks and replace each `await import()`
with the corresponding top-level imported symbol usages for the tests named
`filterFilesByMtime`, `mtime + loadDailyUsageData integration`,
`loadSessionUsageById`, `loadSessionData early date filtering`, and
`loadSessionBlockData early date filtering` so no dynamic imports remain.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c40ea6e and 80d9116.

📒 Files selected for processing (1)
  • apps/ccusage/src/data-loader.ts

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
apps/ccusage/src/data-loader.ts (1)

1599-1601: ⚠️ Potential issue | 🟠 Major

await import() still exists in test blocks and violates repo policy.

There are remaining dynamic imports in describe blocks. This is already a previously reported issue and appears unresolved.

#!/bin/bash
# Verify no dynamic imports remain in TypeScript files (expected: no matches)
rg -nP 'await\s+import\(' --type=ts

As per coding guidelines: apps/ccusage/**/*.ts: "NEVER use await import() dynamic imports anywhere (especially in tests)".

Also applies to: 4768-4774

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/ccusage/src/data-loader.ts` around lines 1599 - 1601, The test contains
a dynamic "await import('fs-fixture')" inside the describe block
(describe('loadSessionUsageById')) which violates the no-dynamic-import policy;
fix it by replacing the dynamic import with a static top-level import (e.g., add
"import { createFixture } from 'fs-fixture';" at the top of the file) and remove
the "await import" usage inside the describe block so the test uses the
statically imported createFixture symbol instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/ccusage/src/data-loader.ts`:
- Around line 1037-1040: The early per-entry skip using formatDate(...) and
isOutsideDateRange(...) (entryDate) changes session-level filtering semantics
because sessions are later filtered by aggregated lastActivity; instead preserve
previous behavior by removing or deferring the entry-level date skip in the
data-loading path that builds sessions (so mixed-range sessions still include
all entries for correct lastActivity calculation), and ensure session pruning
remains based on the aggregated lastActivity check (the code around
entryDate/isOutsideDateRange should not drop entries used by session
aggregation; adjust the filtering logic where sessions are assembled and keep
the existing lastActivity-based session filter).
- Around line 1471-1474: Remove the per-entry early skip (the formatDate +
isOutsideDateRange check) so identifySessionBlocks receives the full
chronological stream; instead apply the existing date-range filtering to the
constructed session blocks by checking each block.startTime (reuse the current
block-level date checks around block.startTime) so blocks are preserved and not
split/shifted by entry-level drops. Ensure identifySessionBlocks and any callers
still use the original entries array and that the only date filtering happens
after blocks are built using block.startTime.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 970eeeb and f2f7af3.

📒 Files selected for processing (1)
  • apps/ccusage/src/data-loader.ts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@pbuchman pbuchman force-pushed the perf/early-date-filtering branch from f2f7af3 to a837500 on February 27, 2026 16:47
@pbuchman pbuchman changed the title from "perf(ccusage): add early date filtering to skip irrelevant files and entries" to "perf: add 3-layer early date filtering to skip irrelevant files and entries" on Feb 27, 2026
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (2)
apps/ccusage/src/data-loader.ts (2)

1479-1482: ⚠️ Potential issue | 🟠 Major

Entry-level date skip can reshape session block boundaries

On Line 1479–Line 1482, removing entries before identifySessionBlocks(...) can split/shift blocks near boundaries, then block-level filtering operates on already-altered blocks.
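
A toy illustration of the boundary effect described above. The real identifySessionBlocks is not shown in this thread, so `groupIntoBlocks` is a hypothetical simplification that assumes fixed 5-hour windows opened by the first entry; only the ordering of filtering and grouping is the point.

```typescript
const BLOCK_MS = 5 * 60 * 60 * 1000; // assumed block duration

interface Block {
	startTime: number;
	entries: number[];
}

// Hypothetical simplification: a block starts at its first entry and absorbs
// every entry less than BLOCK_MS after that start.
function groupIntoBlocks(timestamps: number[]): Block[] {
	const blocks: Block[] = [];
	let current: Block | null = null;
	for (const t of timestamps.slice().sort((a, b) => a - b)) {
		if (current == null || t - current.startTime >= BLOCK_MS) {
			current = { startTime: t, entries: [] };
			blocks.push(current);
		}
		current.entries.push(t);
	}
	return blocks;
}

const since = Date.UTC(2026, 1, 27); // 2026-02-27T00:00Z
const stream = [
	Date.UTC(2026, 1, 26, 22), // 22:00 the day before
	Date.UTC(2026, 1, 27, 1), // 01:00
	Date.UTC(2026, 1, 27, 4), // 04:00
];

// Entry-level skip first: the 22:00 entry disappears, so a new block opens
// at 01:00 and also absorbs 04:00.
const entryLevel = groupIntoBlocks(stream.filter(t => t >= since));

// Block-level filter afterwards: blocks are built from the full stream
// (22:00 + 01:00, then 04:00) and only the block starting in range is kept.
const blockLevel = groupIntoBlocks(stream).filter(b => b.startTime >= since);
```

Under these assumptions `entryLevel` holds one block starting at 01:00 with two entries, while `blockLevel` holds one block starting at 04:00 with one entry, so the two orderings report different blocks for the same query.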

Suggested fix
-				const date = formatDate(data.timestamp, options?.timezone, DEFAULT_LOCALE);
-				if (isOutsideDateRange(date, options?.since, options?.until)) {
-					return;
-				}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/ccusage/src/data-loader.ts` around lines 1479 - 1482, Currently
individual entries are being filtered by date before calling
identifySessionBlocks (the formatDate + isOutsideDateRange check), which can
split or shift session blocks near the boundaries; instead, stop dropping
entries before identifySessionBlocks: pass the full set of entries into
identifySessionBlocks (reference identifySessionBlocks and the variables
formatDate, isOutsideDateRange, options?.since, options?.until, DEFAULT_LOCALE),
then after session blocks are constructed apply date-based filtering at the
block level (e.g., drop or trim session blocks whose timestamps fall outside
options?.since/options?.until) so block boundaries remain correct.

1044-1047: ⚠️ Potential issue | 🟠 Major

Entry-level date skip changes session aggregation semantics

On Line 1044–Line 1047, filtering entries before session aggregation can alter totals for sessions whose lastActivity is in range but contain earlier/later entries outside range. This is a behavior change versus filtering sessions by aggregated lastActivity.
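
The difference in totals can be reproduced with a small self-contained example. All names here (`SessionEntry`, `aggregate`, `filterThenAggregate`, `aggregateThenFilter`) are hypothetical, and dates are plain YYYYMMDD strings to keep the sketch short.

```typescript
interface SessionEntry {
	sessionId: string;
	date: string; // YYYYMMDD
	cost: number;
}

interface SessionSummary {
	sessionId: string;
	lastActivity: string; // YYYYMMDD
	totalCost: number;
}

// Group entries by session, summing cost and tracking the latest date.
function aggregate(entries: SessionEntry[]): SessionSummary[] {
	const bySession = new Map<string, SessionSummary>();
	for (const e of entries) {
		const s = bySession.get(e.sessionId)
			?? { sessionId: e.sessionId, lastActivity: e.date, totalCost: 0 };
		s.totalCost += e.cost;
		if (e.date > s.lastActivity) {
			s.lastActivity = e.date;
		}
		bySession.set(e.sessionId, s);
	}
	return Array.from(bySession.values());
}

const inRange = (d: string, since: string, until: string): boolean =>
	d >= since && d <= until;

// Entry-level skip: out-of-range entries never reach the aggregation.
function filterThenAggregate(entries: SessionEntry[], since: string, until: string): SessionSummary[] {
	return aggregate(entries.filter(e => inRange(e.date, since, until)));
}

// Session-level filter: aggregate everything, then filter by lastActivity.
function aggregateThenFilter(entries: SessionEntry[], since: string, until: string): SessionSummary[] {
	return aggregate(entries).filter(s => inRange(s.lastActivity, since, until));
}

const sessionEntries: SessionEntry[] = [
	{ sessionId: 'a', date: '20260226', cost: 1 },
	{ sessionId: 'a', date: '20260227', cost: 2 },
];

const early = filterThenAggregate(sessionEntries, '20260227', '20260227');
const late = aggregateThenFilter(sessionEntries, '20260227', '20260227');
// early[0].totalCost is 2, late[0].totalCost is 3
```

Both approaches return session "a", but only the post-aggregation filter keeps the cost of the entry from the previous day, which is the pre-existing behavior the review asks to preserve.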

Suggested fix
-				const entryDate = formatDate(data.timestamp, options?.timezone, DEFAULT_LOCALE);
-				if (isOutsideDateRange(entryDate, options?.since, options?.until)) {
-					return;
-				}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/ccusage/src/data-loader.ts` around lines 1044 - 1047, The current
early-return using formatDate(data.timestamp) and isOutsideDateRange(entryDate,
options?.since, options?.until) skips individual entries and changes session
totals; instead stop filtering at the entry level—remove or defer this check in
the entries processing loop and let all entries contribute to session
aggregation, then after sessions are aggregated filter sessions by their
aggregated lastActivity (e.g., call formatDate(session.lastActivity,
options?.timezone, DEFAULT_LOCALE) and isOutsideDateRange(...) against
options?.since/options?.until) so totals remain correct while still excluding
out-of-range sessions.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f2f7af3 and a837500.

📒 Files selected for processing (1)
  • apps/ccusage/src/data-loader.ts

Remove await import() dynamic imports from test blocks (repo policy).
Remove entry-level date skip from session/block loaders to preserve
aggregation semantics — sessions and blocks are filtered post-aggregation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/ccusage/src/data-loader.ts`:
- Around line 726-733: The current guard in filterFilesByMtime accepts any
8-digit string, so validate that the parsed date is a real calendar date before
enabling mtime pre-filter: call parseSinceDate(since) into sinceDate, then
verify sinceDate is valid (e.g., !isNaN(sinceDate.getTime())) and that its
year/month/day match the numeric components parsed from the original since
string (so 20230231 is rejected). If validation fails, return files without
applying the cutoff; only subtract one day and compute sinceMs when the parsed
date is confirmed valid.
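
One way to implement the round-trip check described in this comment is sketched below. `parseSinceDateStrict` is a hypothetical name, and interpreting the date as UTC is an assumption; the PR's parseSinceDate may differ on both points.

```typescript
// Parse a YYYYMMDD string, returning null for anything that is not a real
// calendar date. Date.UTC silently rolls over impossible dates (Feb 31
// becomes Mar 3), so the parsed components are compared with the input.
function parseSinceDateStrict(since: string): Date | null {
	if (!/^\d{8}$/.test(since)) {
		return null;
	}
	const year = Number(since.slice(0, 4));
	const month = Number(since.slice(4, 6));
	const day = Number(since.slice(6, 8));
	const date = new Date(Date.UTC(year, month - 1, day));
	const roundTrips
		= date.getUTCFullYear() === year
			&& date.getUTCMonth() === month - 1
			&& date.getUTCDate() === day;
	return roundTrips ? date : null;
}
```

Under this sketch, `parseSinceDateStrict('20230231')` returns null, so a caller can fall back to returning all files instead of computing a cutoff from a rolled-over date.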

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a837500 and 33eac4f.

📒 Files selected for processing (1)
  • apps/ccusage/src/data-loader.ts

Reject impossible dates like 20230231 that pass the 8-digit regex
but would produce an unintended cutoff via Date rollover.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robinebers

That is an excellent addition. Thank you both for letting us know in OpenUsage and for submitting the PR here.

I hope this can be merged soon.
