[content-hash 5/5] feat: switch default passage_id_scheme to content-hash #338
Open
raoabinav wants to merge 5 commits into
Sub-PR 1 of 5 from the plan in yichuan-w#329. Purely additive: no behavior change for any caller, and existing index loaders ignore the field.

Writes a new `passage_id_scheme: "sequential"` field into the .meta.json produced by both build_index and build_index_from_arrays. Bumps version to "1.1" for human-inspectable schema tracking (no code reads version today, so the bump is safe).

Module-level constants PASSAGE_ID_SCHEME_SEQUENTIAL / _CONTENT_HASH document the value space; the content-hash scheme itself ships in sub-PR 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sub-PR 2 of 5 from yichuan-w#329. Builds on yichuan-w#330 (which added the meta.json field).

New behavior:
- `LeannBuilder(..., passage_id_scheme="content-hash")` makes add_text() key passages by sha256(text)[:16] instead of insertion index. Stable across file moves, reorderings, and re-runs of the same corpus.
- `leann build --id-scheme content-hash` exposes it at the CLI.
- Default unchanged ("sequential"). Existing indexes continue to work identically; no migration triggered.

Identical-text chunks collide (same hash). For this sub-PR the second occurrence overwrites the first in the offset map, which is the dedup behavior I'd want by default. A `--preserve-duplicates` escape hatch can land later if needed (see the open question in yichuan-w#329).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
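The ID derivation and the collision behavior described in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: it assumes the text is UTF-8 encoded before hashing and uses a plain dict in place of the real offset map. `content_hash_id` and `build_offset_map` are hypothetical names.

```python
import hashlib


def content_hash_id(text: str) -> str:
    # Assumption: the text is encoded as UTF-8 before hashing.
    # The ID is the first 16 hex characters of the SHA-256 digest.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def build_offset_map(chunks):
    """Map content-hash IDs to byte offsets (hypothetical helper).

    `chunks` is an iterable of (text, offset) pairs. Identical texts
    produce the same ID, so the later occurrence overwrites the earlier one.
    """
    offsets = {}
    for text, offset in chunks:
        offsets[content_hash_id(text)] = offset
    return offsets
```

Because the ID depends only on the text, reordering the input or moving the source file leaves every ID unchanged, while two identical chunks end up as a single entry in the map.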
Sub-PR 3 of 5 from yichuan-w#329. Builds on yichuan-w#330 / yichuan-w#331.

Two changes:
1. `LeannCLI._make_incremental_builder` now reads the existing index's `passage_id_scheme` from meta.json and uses that, ignoring any conflicting `--id-scheme` on the args (with a note printed). Otherwise an update command on a content-hash index would mix sequential IDs into a hash-keyed passages.jsonl and break lookups.
2. `LeannSearcher` exposes `self.passage_id_scheme` so consumers can introspect; defaults to "sequential" for older indexes that don't record it (pre-yichuan-w#330).

No behavior change for fresh builds: the CLI's --id-scheme still controls which scheme a brand-new index gets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
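The precedence rule in this commit (the scheme recorded in the existing index wins over a conflicting flag, and a missing field means "sequential") might look roughly like the sketch below. The function names and the wording of the printed note are assumptions made for illustration; only the `passage_id_scheme` key and the "sequential" fallback come from the commit message.

```python
import json
from pathlib import Path

# Fallback for indexes whose meta.json predates the field.
FALLBACK_SCHEME = "sequential"


def scheme_from_meta(meta: dict) -> str:
    # Older indexes do not record the field, so default to sequential.
    return meta.get("passage_id_scheme", FALLBACK_SCHEME)


def effective_scheme(meta_path, requested=None) -> str:
    """Return the scheme an incremental update should use (hypothetical helper).

    The scheme stored in the existing index always wins; a conflicting
    request is reported and ignored so the two ID styles are never mixed.
    """
    meta = json.loads(Path(meta_path).read_text(encoding="utf-8"))
    existing = scheme_from_meta(meta)
    if requested is not None and requested != existing:
        print(f"note: ignoring requested scheme {requested!r}; index uses {existing!r}")
    return existing
```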
…nt-hash

Sub-PR 4 of 5 from yichuan-w#329. Builds on the reader-side support in sub-PR 3.

`leann migrate-ids <index>` rewrites an existing sequential-ID index to use sha256(text)[:16] keys. Purely a Python-side rewrite: FAISS labels stay valid, only the string IDs they map to change.

Four artifacts get rewritten atomically (via .migrate sibling files + shutil.move):
- .passages.jsonl : new "id" field per line
- .passages.idx : new offset map keyed by new IDs (deduplicates collisions)
- .ids.txt : new label → ID mapping (preserves FAISS label count)
- .meta.json : passage_id_scheme = "content-hash", version = "1.1"

Collisions (identical-text chunks) get deduped; the later occurrence wins in the offset map. ids.txt still has one line per FAISS label so the graph keeps working; duplicate labels just point to the same passage.

Interactive confirmation required unless --yes is passed. --dry-run reports the collision count without writing anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
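As a rough sketch of the two ideas in this commit, the snippet below computes one new ID per FAISS label (so the label count is preserved and duplicates share an ID) and replaces a file by writing a sibling first and then moving it into place. Both helpers are hypothetical, and the exact sibling naming (`<name>.migrate`) and UTF-8 encoding are assumptions; the actual command may differ.

```python
import hashlib
import shutil
from pathlib import Path


def plan_migration(texts):
    """Return (new_ids, collisions) for passages listed in label order.

    new_ids keeps one entry per label; collisions counts how many labels
    share an ID with an earlier label (what a dry run could report).
    """
    new_ids = [hashlib.sha256(t.encode("utf-8")).hexdigest()[:16] for t in texts]
    collisions = len(new_ids) - len(set(new_ids))
    return new_ids, collisions


def replace_via_sibling(path, content: str) -> None:
    # Write the new content next to the target, then move it over the
    # original. On POSIX, a move within one filesystem is a rename.
    path = Path(path)
    tmp = path.with_name(path.name + ".migrate")  # assumed naming
    tmp.write_text(content, encoding="utf-8")
    shutil.move(str(tmp), str(path))
```

Calling `plan_migration` alone corresponds to the dry-run case: it yields the collision count without touching any file.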
This was referenced May 20, 2026
…hash"

Sub-PR 5 of 5 from yichuan-w#329. Builds on the migrate-ids command in sub-PR 4.

Flips the default at every layer:
- `LeannBuilder(passage_id_scheme=...)` default in api.py
- `leann build --id-scheme=...` default in cli.py
- Fallback in `_make_incremental_builder` when args has neither an existing index scheme nor an explicit flag (rare: brand-new builder, no args)

New indexes built without specifying a scheme now get content-hash IDs by default: file-move-stable, dedup-friendly. Existing indexes are unaffected (they continue with whatever scheme is in their meta.json). Users who explicitly want sequential can opt back via `--id-scheme sequential`.

This is the user-visible point at which the default flips. Combined with sub-PRs 1-4 it's the full migration: old indexes keep working, new builds get the better default, and `leann migrate-ids` is available for converting existing data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
83123ae to 52cee53
Sub-PR 5 of 5 from #329, the final step. Stacks on the migrate-ids command in the prior sub-PR.

Flips the default at every layer:
- `LeannBuilder(passage_id_scheme=...)` default in api.py
- `leann build --id-scheme=...` default in cli.py
- Fallback in `_make_incremental_builder` when args has neither an existing index scheme nor an explicit flag

New indexes built without specifying a scheme now get content-hash IDs by default: file-move-stable, dedup-friendly. Existing indexes are unaffected (they continue with whatever scheme is in their meta.json). Users who explicitly want sequential can opt back via `--id-scheme sequential`.

This is the user-visible point at which the default flips. Combined with sub-PRs 1-4 it's the full migration: old indexes keep working, new builds get the better default, and `leann migrate-ids` is available for converting existing data.

Content-hash passage IDs train (#329)
- `passage_id_scheme` field in meta.json
- `--id-scheme content-hash` builder + CLI flag
- `leann migrate-ids` command