[bm25 5/5] refactor: drop in-memory BM25Scorer; FTS5 is the only implementation#335
Open
raoabinav wants to merge 5 commits into
Open
[bm25 5/5] refactor: drop in-memory BM25Scorer; FTS5 is the only implementation#335raoabinav wants to merge 5 commits into
raoabinav wants to merge 5 commits into
Conversation
Pure refactor, no behavior change. Sets up follow-up PRs that add an FTS5- backed implementation behind the same contract. Currently the only consumer of BM25 (LeannSearcher._init_bm25 / _bm25_search) relies on `BM25Scorer.fit(passages)` and `BM25Scorer.search(query, top_k) -> list[SearchResult]`. Extracting those two methods into an ABC makes the follow-up FTS5 implementation drop-in. See yichuan-w#327 for the broader plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sub-PR 2 of 5 from yichuan-w#327. Builds on yichuan-w#328 (BM25Index ABC). Adds `LeannBuilder(prebuild_bm25=True)`. When set, build_index fits a BM25Scorer on the chunks and pickles it to <index>.bm25.pkl, then records the snapshot filename in meta.json under "bm25_snapshot". LeannSearcher._init_bm25 now checks for that snapshot first: if present and loads cleanly, it skips fitting; otherwise it falls back to today's behavior (scan passages.jsonl and fit). Older indexes are unaffected — no snapshot field in their meta.json, so the fit-on-search path runs. Default stays False so this PR changes nothing for existing callers. Default flip happens in sub-PR 4 once the FTS5 backend (sub-PR 3) lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sub-PR 3 of 5 from yichuan-w#327. Builds on yichuan-w#328 and yichuan-w#332. New `Fts5BM25Index(BM25Index)` class backed by SQLite FTS5 (`tokenize='unicode61 remove_diacritics 2'`). fit() bulk-inserts into a fresh virtual table; search() runs `MATCH` with `-bm25()` ordering so the rest of LeannSearcher (and hybrid fusion) keeps higher-is-better. Opt-in via `LeannBuilder(bm25_backend="fts5")`. When set, build_index writes `<index>.bm25.sqlite` and records `bm25_backend="fts5"` + `bm25_db` in meta.json. `LeannSearcher._init_bm25` honors the field: fts5 → mmap the sqlite; memory → use the pickle from sub-PR 2; absent → fall back to fit-on-search for older indexes. Default `bm25_backend="memory"` so nothing changes for existing callers. Default flip happens in sub-PR 4. Query tokenization matches BM25Scorer (strip punctuation, lowercase, OR terms) so the same query text behaves consistently across backends; FTS5 syntax surprises like `:` `*` get neutralized. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sub-PR 4 of 5 from yichuan-w#327. Builds on the FTS5 backend in sub-PR 3. Flips LeannBuilder(bm25_backend=...) default from "memory" to "fts5". New indexes built without specifying a backend now write a sqlite FTS5 database instead of in-memory state. Existing indexes are unaffected — their meta.json either lacks bm25_backend (older, fit-on-search path) or has it set explicitly. In-memory `BM25Scorer` stays as `bm25_backend="memory"` for tiny corpora where the FTS5 sqlite overhead isn't worth it. Sub-PR 5 removes it after this default has soaked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ce1ed22 to
329c903
Compare
…tion Sub-PR 5 of 5 from yichuan-w#327. Builds on sub-PR 4 (FTS5 default). Deletes BM25Scorer entirely along with the build-time pickle snapshot path (_build_bm25_snapshot, the "bm25_snapshot" meta.json field, the prebuild_bm25 kwarg) and the fit-on-search fallback inside _init_bm25. Drops the now-orphan `from collections import Counter, defaultdict`. `bm25_backend` is now fts5-only; passing "memory" raises ValueError. Indexes built before this had a BM25 artifact will get a clear RuntimeError on first hybrid/BM25 call pointing the user at `leann rebuild`. Net: -74 LOC of dead BM25 implementation, one canonical path. Breaking change for indexes without an FTS5 artifact. Documented in the error message; affects only indexes built pre-sub-PR-3 (or with bm25_backend="memory" during the soak window between sub-PRs 2-4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
329c903 to
0100431
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Sub-PR 5 of 5 from #327, the final step. Stacks on the default flip in the prior sub-PR.
Breaking change for indexes without an FTS5 artifact. Documented in the runtime error message; affects only indexes built before the FTS5 backend (or with
bm25_backend="memory"during the soak window). Users will see a clearRuntimeErroron first hybrid/BM25 call pointing them atleann rebuild.Deletes
BM25Scorerentirely along with the build-time pickle snapshot path (_build_bm25_snapshot, thebm25_snapshotmeta.json field, theprebuild_bm25kwarg) and the fit-on-search fallback inside_init_bm25. Drops the now-orphanfrom collections import Counter, defaultdict.bm25_backendis now fts5-only; passing"memory"raisesValueError.Net: -74 LOC of dead BM25 implementation, one canonical path.
BM25 persistence train (#327)
BM25IndexABC extractionBM25Scorer(breaking)