[bm25 1/5] refactor: extract BM25Index ABC, BM25Scorer inherits #328
Open
raoabinav wants to merge 1 commit into
Pure refactor, no behavior change. Sets up follow-up PRs that add an FTS5-backed implementation behind the same contract.

Currently the only consumer of BM25 (`LeannSearcher._init_bm25` / `_bm25_search`) relies on `BM25Scorer.fit(passages)` and `BM25Scorer.search(query, top_k) -> list[SearchResult]`. Extracting those two methods into an ABC makes the follow-up FTS5 implementation drop-in.

See yichuan-w#327 for the broader plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 20, 2026
raoabinav added a commit to raoabinav/LEANN that referenced this pull request May 20, 2026
Sub-PR 3 of 5 from yichuan-w#327. Builds on yichuan-w#328 and yichuan-w#332.

New `Fts5BM25Index(BM25Index)` class backed by SQLite FTS5 (`tokenize='unicode61 remove_diacritics 2'`). fit() bulk-inserts into a fresh virtual table; search() runs `MATCH` with `-bm25()` ordering so the rest of LeannSearcher (and hybrid fusion) keeps higher-is-better.

Opt-in via `LeannBuilder(bm25_backend="fts5")`. When set, build_index writes `<index>.bm25.sqlite` and records `bm25_backend="fts5"` + `bm25_db` in meta.json. `LeannSearcher._init_bm25` honors the field: fts5 → mmap the sqlite; memory → use the pickle from sub-PR 2; absent → fall back to fit-on-search for older indexes.

Default `bm25_backend="memory"` so nothing changes for existing callers. Default flip happens in sub-PR 4.

Query tokenization matches BM25Scorer (strip punctuation, lowercase, OR terms) so the same query text behaves consistently across backends; FTS5 syntax surprises like `:` `*` get neutralized.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
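For readers unfamiliar with FTS5, the approach described in this commit message could look roughly like the sketch below. It is not the code from the sub-PR: the table name `passages`, the helper names `sketch_fit`, `sketch_match_query`, and `sketch_search`, and the tuple return shape are all assumptions made for illustration. It only relies on Python's standard `sqlite3` module and assumes the underlying SQLite build has FTS5 enabled.

```python
import re
import sqlite3


def sketch_fit(conn: sqlite3.Connection, passages: list[str]) -> None:
    """Bulk-insert passages into a fresh FTS5 virtual table (hypothetical schema)."""
    conn.execute(
        "CREATE VIRTUAL TABLE passages USING fts5("
        "text, tokenize='unicode61 remove_diacritics 2')"
    )
    conn.executemany(
        "INSERT INTO passages(rowid, text) VALUES (?, ?)",
        enumerate(passages, start=1),
    )
    conn.commit()


def sketch_match_query(query: str) -> str:
    """Strip punctuation, lowercase, and OR the terms together.

    Each term is double-quoted so FTS5 treats it as a plain string and
    characters such as ':' or '*' never reach the FTS5 query parser.
    """
    terms = re.findall(r"\w+", query.lower())
    return " OR ".join(f'"{term}"' for term in terms)


def sketch_search(
    conn: sqlite3.Connection, query: str, top_k: int = 5
) -> list[tuple[int, float, str]]:
    """Return (rowid, score, text) tuples, best match first."""
    match = sketch_match_query(query)
    if not match:
        return []
    # FTS5's bm25() is lower-is-better; negating it gives a
    # higher-is-better score for downstream consumers.
    rows = conn.execute(
        "SELECT rowid, -bm25(passages) AS score, text "
        "FROM passages WHERE passages MATCH ? "
        "ORDER BY score DESC LIMIT ?",
        (match, top_k),
    ).fetchall()
    return [(rowid, score, text) for rowid, score, text in rows]


conn = sqlite3.connect(":memory:")
sketch_fit(conn, ["The cat sat", "Dogs bark loudly", "A cat and a dog"])
hits = sketch_search(conn, "cat: *", top_k=5)
```

Quoting every term is one simple way to neutralize FTS5 operators; the actual sub-PR may sanitize queries differently.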
Sub-PR 1 of 5 from #327.
Pure refactor, no behavior change. Extracts the two methods `LeannSearcher` actually uses from `BM25Scorer` (`fit(documents)`, `search(query, top_k) -> list[SearchResult]`) into a `BM25Index` ABC. `BM25Scorer` inherits, so all existing call sites work unchanged.

Sets up the next sub-PRs to drop in an FTS5-backed implementation behind the same contract without touching the consumer.
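As a rough illustration of the contract described above, the two-method ABC might be sketched as follows. This is not the diff from this PR: the `SearchResult` fields, the assumption that documents are plain strings, and the `ToyBM25Scorer` class (a minimal in-memory Okapi BM25 stand-in for the real `BM25Scorer`) are all hypothetical.

```python
from __future__ import annotations

import math
import re
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass


@dataclass
class SearchResult:
    # Hypothetical stand-in; the real SearchResult fields are not shown here.
    id: str
    score: float
    text: str


class BM25Index(ABC):
    """The two methods the consumer relies on, per the PR description."""

    @abstractmethod
    def fit(self, documents: list[str]) -> None: ...

    @abstractmethod
    def search(self, query: str, top_k: int = 5) -> list[SearchResult]: ...


def _tokenize(text: str) -> list[str]:
    # Lowercase and drop punctuation.
    return re.findall(r"\w+", text.lower())


class ToyBM25Scorer(BM25Index):
    """Minimal in-memory Okapi BM25, used only to show the inheritance."""

    def __init__(self, k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b

    def fit(self, documents: list[str]) -> None:
        self._docs = list(documents)
        self._tf = [Counter(_tokenize(d)) for d in self._docs]
        self._len = [sum(tf.values()) for tf in self._tf]
        self._avgdl = (sum(self._len) / len(self._len)) if self._len else 1.0
        self._avgdl = self._avgdl or 1.0
        n = len(self._docs)
        df = Counter(term for tf in self._tf for term in tf)
        self._idf = {
            t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()
        }

    def search(self, query: str, top_k: int = 5) -> list[SearchResult]:
        terms = _tokenize(query)
        results = []
        for i, tf in enumerate(self._tf):
            norm = self.k1 * (1 - self.b + self.b * self._len[i] / self._avgdl)
            score = sum(
                self._idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in terms
                if t in tf
            )
            if score > 0:
                results.append(SearchResult(str(i), score, self._docs[i]))
        results.sort(key=lambda r: r.score, reverse=True)
        return results[:top_k]


index: BM25Index = ToyBM25Scorer()
index.fit(["the cat sat", "dogs bark loudly", "a cat and a dog"])
results = index.search("cat", top_k=2)
```

Because the caller only depends on the `BM25Index` type, another backend that implements the same two methods can be swapped in without changing the calling code.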
BM25 persistence train (#327)
BM25Index ABC extraction
BM25Scorer (breaking)