[bm25 3/5] feat: Fts5BM25Index — SQLite FTS5-backed BM25 implementation#333

Open
raoabinav wants to merge 3 commits into
yichuan-w:mainfrom
raoabinav:feat/bm25-fts5-backend

@raoabinav
Contributor

Sub-PR 3 of 5 from #327. Stacks on #328 (ABC) and #332 (snapshot path).

New Fts5BM25Index(BM25Index) class backed by SQLite FTS5 (tokenize='unicode61 remove_diacritics 2'). fit() bulk-inserts into a fresh virtual table; search() runs MATCH and orders by -bm25(). FTS5's bm25() returns lower (more negative) values for better matches, so negating it preserves the higher-is-better score convention that the rest of LeannSearcher (and hybrid fusion at api.py:1306-1314) expects.
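
As a rough illustration of the fit/search flow described above, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `passages`, its column names, and the helpers `build_fts5` and `search_fts5` are assumptions made for illustration, not this PR's actual code; only the tokenizer string and the `-bm25()` ordering come from the description. It also assumes the local SQLite build ships with FTS5.

```python
import sqlite3


def build_fts5(passages, db_path=":memory:"):
    """Create a fresh FTS5 table and bulk-insert (passage_id, text) pairs."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE VIRTUAL TABLE passages USING fts5("
        "passage_id UNINDEXED, text, "
        "tokenize='unicode61 remove_diacritics 2')"
    )
    conn.executemany(
        "INSERT INTO passages (passage_id, text) VALUES (?, ?)", passages
    )
    conn.commit()
    return conn


def search_fts5(conn, match_expr, top_k=5):
    """Return (passage_id, score) rows, best match first.

    bm25() is lower for better matches, so it is negated to give a
    higher-is-better score.
    """
    return conn.execute(
        "SELECT passage_id, -bm25(passages) AS score FROM passages "
        "WHERE passages MATCH ? ORDER BY score DESC LIMIT ?",
        (match_expr, top_k),
    ).fetchall()
```

Here `match_expr` is expected to be an already-sanitized FTS5 query string; raw user input would need the kind of tokenization this PR describes before reaching `MATCH`.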

Opt-in via LeannBuilder(bm25_backend="fts5"). When set, build_index writes <index>.bm25.sqlite and records bm25_backend="fts5" + bm25_db in meta.json. LeannSearcher._init_bm25 honors the field: fts5 → load the sqlite; memory → use the pickle from sub-PR 2; absent → fall back to fit-on-search for older indexes.
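
The three-way dispatch on meta.json can be sketched as a small pure function. This is an illustrative sketch only: the function name `resolve_bm25_source` and the assumption that `bm25_db` and `bm25_snapshot` hold filenames relative to the index directory are mine, not confirmed by the PR. The field names themselves come from this PR and sub-PR 2.

```python
from pathlib import Path


def resolve_bm25_source(meta: dict, index_dir: Path):
    """Pick a BM25 backend from meta.json contents (hypothetical helper).

    Returns (backend_label, path_or_None).
    """
    backend = meta.get("bm25_backend")
    if backend == "fts5" and meta.get("bm25_db"):
        # FTS5 backend: open the sqlite file written at build time.
        return ("fts5", index_dir / meta["bm25_db"])
    if backend == "memory" and meta.get("bm25_snapshot"):
        # In-memory backend: load the pickle snapshot from sub-PR 2.
        return ("memory", index_dir / meta["bm25_snapshot"])
    # Older indexes carry no backend field: fit at search time.
    return ("fit-on-search", None)
```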

Default bm25_backend="memory" so nothing changes for existing callers; default flip is sub-PR 4.

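Query tokenization matches BM25Scorer (strip punctuation, lowercase, OR terms), so the same query text behaves consistently across backends. It also neutralizes FTS5 query-syntax characters such as `:` and `*`, which would otherwise be parsed as column filters or prefix operators.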
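
One way to turn free-form query text into a safe FTS5 MATCH expression is sketched below. The helper name `to_match_expr` and the use of `\w+` as the token pattern are assumptions; BM25Scorer's exact tokenizer may differ. Each term is wrapped in double quotes so FTS5 treats it as a plain string rather than as query syntax.

```python
import re


def to_match_expr(query: str) -> str:
    """Lowercase, drop punctuation, and OR the remaining terms together."""
    terms = re.findall(r"\w+", query.lower())
    # Quoting each term keeps FTS5 from interpreting it as an operator.
    return " OR ".join(f'"{t}"' for t in terms)
```

For example, `to_match_expr("Title: foo*")` yields `"title" OR "foo"`. A query with no word characters yields an empty string, so a caller would likely want to skip the MATCH query in that case.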

BM25 persistence train (#327)

raoabinav and others added 2 commits May 20, 2026 11:04
Pure refactor, no behavior change. Sets up follow-up PRs that add an FTS5-
backed implementation behind the same contract.

Currently the only consumer of BM25 (LeannSearcher._init_bm25 / _bm25_search)
relies on `BM25Scorer.fit(passages)` and `BM25Scorer.search(query, top_k) ->
list[SearchResult]`. Extracting those two methods into an ABC makes the
follow-up FTS5 implementation drop-in. See yichuan-w#327 for the broader plan.
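
A minimal sketch of what such a contract might look like follows. The `SearchResult` fields shown here are hypothetical placeholders; the real class lives in the codebase and may carry different attributes.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SearchResult:
    # Hypothetical fields, for illustration only.
    id: str
    score: float


class BM25Index(ABC):
    """Two-method contract that any BM25 backend would implement."""

    @abstractmethod
    def fit(self, passages) -> None:
        """Index the given passages."""

    @abstractmethod
    def search(self, query: str, top_k: int) -> list[SearchResult]:
        """Return up to top_k results, highest score first."""
```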

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sub-PR 2 of 5 from yichuan-w#327. Builds on yichuan-w#328 (BM25Index ABC).

Adds `LeannBuilder(prebuild_bm25=True)`. When set, build_index fits a
BM25Scorer on the chunks and pickles it to <index>.bm25.pkl, then records
the snapshot filename in meta.json under "bm25_snapshot".

LeannSearcher._init_bm25 now checks for that snapshot first: if present and
loads cleanly, it skips fitting; otherwise it falls back to today's behavior
(scan passages.jsonl and fit). Older indexes are unaffected — no snapshot
field in their meta.json, so the fit-on-search path runs.
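
The snapshot-first, fit-as-fallback logic could look roughly like this. The helper `load_or_fit` and its `fit_fn` callback are hypothetical names used only to illustrate the control flow, not the actual `_init_bm25` code.

```python
import pickle
from pathlib import Path


def load_or_fit(snapshot_path, fit_fn):
    """Return (index, source), preferring a pickled snapshot when usable."""
    if snapshot_path is not None and Path(snapshot_path).exists():
        try:
            with open(snapshot_path, "rb") as f:
                return pickle.load(f), "snapshot"
        except Exception:
            # A corrupt or incompatible snapshot falls through to fitting.
            pass
    return fit_fn(), "fitted"
```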

Default stays False so this PR changes nothing for existing callers. Default
flip happens in sub-PR 4 once the FTS5 backend (sub-PR 3) lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sub-PR 3 of 5 from yichuan-w#327. Builds on yichuan-w#328 and yichuan-w#332.

New `Fts5BM25Index(BM25Index)` class backed by SQLite FTS5 (`tokenize='unicode61 remove_diacritics 2'`). fit() bulk-inserts into a fresh virtual table; search() runs `MATCH` with `-bm25()` ordering so the rest of LeannSearcher (and hybrid fusion) keeps higher-is-better.

Opt-in via `LeannBuilder(bm25_backend="fts5")`. When set, build_index writes `<index>.bm25.sqlite` and records `bm25_backend="fts5"` + `bm25_db` in meta.json. `LeannSearcher._init_bm25` honors the field: fts5 → mmap the sqlite; memory → use the pickle from sub-PR 2; absent → fall back to fit-on-search for older indexes.

Default `bm25_backend="memory"` so nothing changes for existing callers. Default flip happens in sub-PR 4.

Query tokenization matches BM25Scorer (strip punctuation, lowercase, OR terms) so the same query text behaves consistently across backends; FTS5 syntax surprises like `:` `*` get neutralized.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raoabinav raoabinav force-pushed the feat/bm25-fts5-backend branch from 3113bd2 to e4a3619 Compare May 20, 2026 23:25
@raoabinav raoabinav changed the title feat: Fts5BM25Index — SQLite FTS5-backed BM25 implementation [bm25 3/5] feat: Fts5BM25Index — SQLite FTS5-backed BM25 implementation May 20, 2026