
[bm25 1/5] refactor: extract BM25Index ABC, BM25Scorer inherits #328

Open
raoabinav wants to merge 1 commit into yichuan-w:main from raoabinav:refactor/bm25-index-interface

@raoabinav raoabinav commented May 20, 2026

Sub-PR 1 of 5 from #327.

Pure refactor, no behavior change. Extracts the two methods `LeannSearcher` actually uses from `BM25Scorer`, `fit(documents)` and `search(query, top_k) -> list[SearchResult]`, into a `BM25Index` ABC. `BM25Scorer` inherits from it, so all existing call sites work unchanged.

Sets up the next sub-PRs to drop in an FTS5-backed implementation behind the same contract without touching the consumer.
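
For illustration, a minimal sketch of what such a contract could look like. Only the names `BM25Index`, `BM25Scorer`, `SearchResult`, `fit`, and `search` come from the description above; the `SearchResult` fields, the document shape (dicts with `id` and `text`), the `run_search` helper, and the toy term-overlap scoring are assumptions and not the repository's actual code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SearchResult:
    # Hypothetical fields; the real SearchResult in LEANN may differ.
    id: str
    score: float
    text: str


class BM25Index(ABC):
    """Contract the searcher depends on: build once, then query."""

    @abstractmethod
    def fit(self, documents: list[dict]) -> None:
        ...

    @abstractmethod
    def search(self, query: str, top_k: int = 5) -> list[SearchResult]:
        ...


class BM25Scorer(BM25Index):
    """Stand-in in-memory backend. Scoring is plain term overlap, not real BM25."""

    def __init__(self) -> None:
        self._docs: list[dict] = []

    def fit(self, documents: list[dict]) -> None:
        self._docs = list(documents)

    def search(self, query: str, top_k: int = 5) -> list[SearchResult]:
        terms = set(query.lower().split())
        scored = []
        for doc in self._docs:
            overlap = len(terms & set(doc["text"].lower().split()))
            if overlap:
                scored.append(SearchResult(doc["id"], float(overlap), doc["text"]))
        # Higher score first, matching the higher-is-better convention.
        scored.sort(key=lambda r: r.score, reverse=True)
        return scored[:top_k]


def run_search(index: BM25Index, query: str) -> list[str]:
    # The consumer only sees the ABC, so any backend can be swapped in.
    return [r.id for r in index.search(query, top_k=2)]
```

Because `run_search` is typed against the ABC, a later FTS5-backed subclass could be passed in without changing the caller, which is the point of the refactor.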

BM25 persistence train (#327)

Pure refactor, no behavior change. Sets up follow-up PRs that add an FTS5-
backed implementation behind the same contract.

Currently the only consumer of BM25 (LeannSearcher._init_bm25 / _bm25_search)
relies on `BM25Scorer.fit(passages)` and `BM25Scorer.search(query, top_k) ->
list[SearchResult]`. Extracting those two methods into an ABC makes the
follow-up FTS5 implementation drop-in. See yichuan-w#327 for the broader plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raoabinav raoabinav marked this pull request as ready for review May 20, 2026 18:14
raoabinav added a commit to raoabinav/LEANN that referenced this pull request May 20, 2026
Sub-PR 3 of 5 from yichuan-w#327. Builds on yichuan-w#328 and yichuan-w#332.

New `Fts5BM25Index(BM25Index)` class backed by SQLite FTS5 (`tokenize='unicode61 remove_diacritics 2'`). fit() bulk-inserts into a fresh virtual table; search() runs `MATCH` with `-bm25()` ordering so the rest of LeannSearcher (and hybrid fusion) keeps higher-is-better.
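
A rough sketch of the FTS5 mechanics described above, using Python's standard `sqlite3` module. The table name `passages`, the `doc_id` column, and the helper names are hypothetical; only the tokenizer string and the `-bm25()` ordering come from the commit message. It assumes the local SQLite build has FTS5 enabled and is new enough (3.27+) to accept `remove_diacritics 2`.

```python
import sqlite3


def build_fts5(conn: sqlite3.Connection, docs: list[tuple[str, str]]) -> None:
    # Hypothetical schema: an unindexed id column plus the searchable text.
    conn.execute("DROP TABLE IF EXISTS passages")
    conn.execute(
        "CREATE VIRTUAL TABLE passages USING fts5("
        "doc_id UNINDEXED, text, tokenize='unicode61 remove_diacritics 2')"
    )
    # Bulk insert into the fresh virtual table.
    conn.executemany("INSERT INTO passages (doc_id, text) VALUES (?, ?)", docs)
    conn.commit()


def search_fts5(
    conn: sqlite3.Connection, match_expr: str, top_k: int = 5
) -> list[tuple[str, float]]:
    # bm25() is lower-is-better (better matches are more negative),
    # so negating it yields a higher-is-better score.
    return conn.execute(
        "SELECT doc_id, -bm25(passages) AS score FROM passages "
        "WHERE passages MATCH ? ORDER BY score DESC LIMIT ?",
        (match_expr, top_k),
    ).fetchall()
```

Negating the score inside the query keeps the sign convention in one place, so downstream fusion code does not need to know which backend produced the results.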

Opt-in via `LeannBuilder(bm25_backend="fts5")`. When set, build_index writes `<index>.bm25.sqlite` and records `bm25_backend="fts5"` + `bm25_db` in meta.json. `LeannSearcher._init_bm25` honors the field: fts5 → mmap the sqlite; memory → use the pickle from sub-PR 2; absent → fall back to fit-on-search for older indexes.

Default `bm25_backend="memory"` so nothing changes for existing callers. Default flip happens in sub-PR 4.
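
The dispatch on `meta.json` described above can be sketched roughly as follows. The helper name `resolve_bm25_backend` and its return labels are hypothetical, and treating `bm25_db` as a path relative to the meta file is an assumption; only the `bm25_backend` and `bm25_db` keys and the three cases come from the commit message.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def resolve_bm25_backend(meta_path: Path) -> Tuple[str, Optional[Path]]:
    """Return (mode, sqlite_path) for an index's meta.json (hypothetical helper)."""
    meta = json.loads(meta_path.read_text())
    backend = meta.get("bm25_backend")
    if backend == "fts5":
        # Assumption: bm25_db is stored relative to the meta file's directory.
        return "fts5", meta_path.parent / meta["bm25_db"]
    if backend == "memory":
        # The pickled in-memory index would be loaded by the caller.
        return "memory", None
    if backend is None:
        # Older index without the field: fit BM25 lazily at search time.
        return "fit-on-search", None
    raise ValueError(f"unknown bm25_backend: {backend!r}")
```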

Query tokenization matches BM25Scorer (strip punctuation, lowercase, OR terms) so the same query text behaves consistently across backends; FTS5 syntax surprises like `:` `*` get neutralized.
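
One way the query neutralization described above might look. The exact tokenization rules of `BM25Scorer` are not shown here, so the regex and the function name `to_fts5_match` are assumptions; the sketch only illustrates the stated idea of stripping punctuation, lowercasing, and OR-ing terms.

```python
import re


def to_fts5_match(query: str) -> str:
    """Turn free text into an FTS5 MATCH expression of OR-ed quoted terms."""
    # Lowercase, then keep only runs of word characters; this drops
    # FTS5-significant punctuation such as ':' '*' and '"'.
    terms = re.findall(r"\w+", query.lower())
    # Quoting each term makes FTS5 read it as a string rather than syntax,
    # so a literal word like "and" or "near" cannot act as an operator.
    return " OR ".join(f'"{t}"' for t in terms)
```

A query that contains no word characters yields an empty string; a caller would probably want to return no results in that case rather than pass an empty expression to `MATCH`.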

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raoabinav raoabinav changed the title from "refactor: extract BM25Index ABC, BM25Scorer inherits" to "[bm25 1/5] refactor: extract BM25Index ABC, BM25Scorer inherits" May 20, 2026