feat(store): Tier-3 content-addressed OHLCV chunks (epic #540 phase 3c) #546
Merged
MDUYN merged 1 commit on May 12, 2026
…e (epic #540 phase 3c)

Wires LocalTieredStore into the existing OHLCV side-store machinery so identical (symbol, timeframe) Parquet bytes are written exactly once and shared across every bundle that references them.

- write() now routes save_bundle's OHLCV writes to <root>/ohlcv/ whenever backtest.ohlcv is non-empty. The bundle envelope keeps its content-addressed manifest unchanged, so old bundles remain readable.
- open() forwards the same shared directory to open_bundle so OHLCV lookups resolve regardless of what path the bundle was originally written with.
- delete() intentionally does NOT touch ohlcv/. Chunks are globally shared; orphans are reclaimed via garbage_collect_ohlcv(dry_run=…).
- Introspection helpers required by the dedup-upload protocol (docs/design/ohlcv-dedup-protocol.md):
  * iter_ohlcv_hashes() / ohlcv_referenced_hashes()
  * ohlcv_stored_hashes()
  * ohlcv_stats() -> stored_blobs / stored_bytes / referenced_blobs / orphan_blobs / missing_blobs
  * garbage_collect_ohlcv(dry_run=False)

Manifests are decoded straight from the bundle envelope (_decode_payload), so the cost is one msgpack read per bundle, with no full Backtest instantiation.

9 new tests:
- No OHLCV -> no chunk dir created.
- Identical OHLCV is stored once across distinct handles (dedup).
- Different OHLCV yields separate chunks.
- Round-trip via store.open() resolves OHLCV from the shared dir.
- delete() keeps still-referenced chunks; orphans only after GC.
- garbage_collect_ohlcv(dry_run=True) lists without deleting; the real call removes them.
- iter_ohlcv_hashes() emits per-reference; ohlcv_referenced_hashes() dedups.
- Hash strings are 64-char lowercase hex (matches the upload protocol spec).

Targeted suite (backtest_store + backtest_index + cli): 110 / 110 passing.
Phase 3c — Tier-3 content-addressed OHLCV chunks
Stacked on #545 (Phase 3b). Part of epic #540.
This slice unlocks the headline storage win for users who run many
backtests over a shared universe: identical OHLCV is stored exactly
once across the entire store, no matter how many bundles reference it.
What's in the box
- LocalTieredStore.write() now routes save_bundle's OHLCV writes to <root>/ohlcv/ whenever the backtest carries price data. The bundle envelope still uses the existing content-addressed manifest ({key -> "<sha256>.parquet"}), so old bundles remain readable and new ones simply share the storage location.
- store.open() forwards the same shared directory to open_bundle, so OHLCV lookups resolve regardless of where the bundle was originally written.
- store.delete() deliberately leaves the ohlcv/ directory alone: chunks are globally shared. Orphans are reclaimed on demand.
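For reviewers unfamiliar with content addressing, here is a minimal sketch of the write-once idea behind the shared ohlcv/ directory. It is not the LocalTieredStore implementation: write_chunk, build_manifest, demo, and the "BTC/EUR:1h" style keys are hypothetical names chosen for illustration, and plain byte strings stand in for real Parquet payloads. Only the manifest shape ({key -> "<sha256>.parquet"}) and the 64-char lowercase hex hash come from this PR.

```python
import hashlib
import tempfile
from pathlib import Path


def write_chunk(ohlcv_dir: Path, payload: bytes) -> str:
    """Store payload under its SHA-256 digest and return the file name."""
    digest = hashlib.sha256(payload).hexdigest()  # 64-char lowercase hex
    name = f"{digest}.parquet"
    ohlcv_dir.mkdir(parents=True, exist_ok=True)
    target = ohlcv_dir / name
    if not target.exists():  # identical bytes are written only once
        target.write_bytes(payload)
    return name


def build_manifest(ohlcv_dir: Path, frames: dict[str, bytes]) -> dict[str, str]:
    """Map each (hypothetical) key to its content-addressed file name."""
    return {key: write_chunk(ohlcv_dir, data) for key, data in frames.items()}


def demo() -> tuple[dict[str, str], dict[str, str], int]:
    """Write two bundles that share one chunk and count the stored files."""
    with tempfile.TemporaryDirectory() as root:
        ohlcv_dir = Path(root) / "ohlcv"
        # Stand-in bytes; real chunks would be Parquet files.
        bundle_a = build_manifest(ohlcv_dir, {"BTC/EUR:1h": b"same-bytes"})
        bundle_b = build_manifest(
            ohlcv_dir,
            {"BTC/EUR:1h": b"same-bytes", "ETH/EUR:1h": b"other-bytes"},
        )
        stored = len(list(ohlcv_dir.glob("*.parquet")))
        return bundle_a, bundle_b, stored
```

In this sketch both manifests point at the same file for the shared key, so two bundles with three references produce only two files on disk. Because the file name depends only on the bytes, a reader needs nothing but the manifest and the shared directory to resolve a chunk.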
New introspection / GC API
- iter_ohlcv_hashes(): one entry per (handle, key) reference (with duplicates)
- ohlcv_referenced_hashes() -> set: deduplicated hashes referenced by bundles
- ohlcv_stored_hashes() -> set: hashes stored under ohlcv/
- ohlcv_stats() -> dict: {stored_blobs, stored_bytes, referenced_blobs, orphan_blobs, missing_blobs}
- garbage_collect_ohlcv(*, dry_run=False) -> list[str]: dry_run=True returns the audit list without deleting

Manifests are decoded straight from the bundle envelope via _decode_payload, with no full Backtest instantiation, so introspection across thousands of bundles stays cheap.
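The relationship between the referenced and stored hash sets can be sketched with plain set arithmetic. This is an illustration under assumptions, not the code in this PR: ohlcv_stats_sketch and garbage_collect_sketch are hypothetical helpers, an in-memory dict of hash -> size stands in for the ohlcv/ directory, and "orphan" is read as stored but unreferenced while "missing" is read as referenced but not stored.

```python
def ohlcv_stats_sketch(referenced: list[str], stored: dict[str, int]) -> dict[str, int]:
    """Summarise chunk usage from per-reference hashes and stored sizes."""
    referenced_set = set(referenced)      # duplicates collapse here
    stored_set = set(stored)
    orphans = stored_set - referenced_set  # on disk, no bundle points at them
    missing = referenced_set - stored_set  # pointed at, but not on disk
    return {
        "stored_blobs": len(stored_set),
        "stored_bytes": sum(stored.values()),
        "referenced_blobs": len(referenced_set),
        "orphan_blobs": len(orphans),
        "missing_blobs": len(missing),
    }


def garbage_collect_sketch(
    referenced: list[str], stored: dict[str, int], *, dry_run: bool = False
) -> list[str]:
    """Return orphaned hashes; remove them from `stored` unless dry_run."""
    orphans = sorted(set(stored) - set(referenced))
    if not dry_run:
        for digest in orphans:
            del stored[digest]
    return orphans
```

With references ["a", "a", "b", "c"] and stored chunks {"a": 10, "b": 20, "d": 5}, the sketch reports one orphan ("d") and one missing chunk ("c"). A dry run returns ["d"] and leaves the stored mapping untouched.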
This is exactly the set-difference primitive that
docs/design/ohlcv-dedup-protocol.md needs for its
POST /ohlcv/negotiate step.
Tests
9 new tests in tests/services/backtest_store/test_local_tiered_store_ohlcv.py:
- Round-trip via store.open() resolves OHLCV from the shared dir.
- delete() keeps still-referenced chunks; orphans only appear when the last reference is gone.
- garbage_collect_ohlcv(dry_run=True) lists without deleting; the real call removes them.
- iter_ohlcv_hashes() emits per-reference; ohlcv_referenced_hashes() dedups.
Targeted suite (tests/services/backtest_store/ + tests/services/backtest_index/ + tests/cli/): 110 / 110 passing.
What's left in Phase 3
iaf migrate-store --from local-dir --to local-tiered; byte-identical Tier-2 → Backtest reassembly (.iafbt becomes export-only); parameterised pytest fixture running every backtest test against both stores