[content-hash 5/5] feat: switch default passage_id_scheme to content-hash #338

Open
raoabinav wants to merge 5 commits into yichuan-w:main from raoabinav:feat/passage-id-scheme-default-content-hash

Conversation

@raoabinav
Contributor

Sub-PR 5 of 5 from #329, the final step. Stacks on the migrate-ids command in the prior sub-PR.

Flips the default at every layer:

  • LeannBuilder(passage_id_scheme=...) default in api.py
  • leann build --id-scheme=... default in cli.py
  • Fallback in _make_incremental_builder when there is neither an existing index scheme nor an explicit flag in args

New indexes built without specifying a scheme now get content-hash IDs by default, which are stable across file moves and dedup-friendly. Existing indexes are unaffected: they keep whatever scheme is recorded in their meta.json. Users who explicitly want sequential IDs can opt back in via --id-scheme sequential.
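
The precedence described above (existing index scheme first, then an explicit flag, then the new default) could be sketched roughly as follows. This is only an illustration: `resolve_scheme` and `DEFAULT_SCHEME` are hypothetical names, not code taken from api.py or cli.py.

```python
from typing import Optional

SEQUENTIAL = "sequential"
CONTENT_HASH = "content-hash"
DEFAULT_SCHEME = CONTENT_HASH  # assumption: stands in for the flipped default


def resolve_scheme(existing: Optional[str], flag: Optional[str]) -> str:
    """Hypothetical helper: choose the passage ID scheme for a build."""
    if existing is not None:
        # An existing index keeps the scheme recorded in its meta.json.
        return existing
    if flag is not None:
        # An explicit --id-scheme value applies to brand-new indexes.
        return flag
    # Neither is present: fall back to the default.
    return DEFAULT_SCHEME
```

Under this sketch, `resolve_scheme(None, "sequential")` is the opt-out path, and an existing sequential index stays sequential regardless of the flag.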

This is the user-visible point at which the default flips. Combined with sub-PRs 1-4 it's the full migration: old indexes keep working, new builds get the better default, and leann migrate-ids is available for converting existing data.

Content-hash passage IDs train (#329)

raoabinav and others added 4 commits May 20, 2026 11:07
Sub-PR 1 of 5 from the plan in yichuan-w#329. Purely additive — no behavior change
for any caller, existing index loaders ignore the field.

Writes a new `passage_id_scheme: "sequential"` field into the .meta.json
produced by both build_index and build_index_from_arrays. Bumps version
to "1.1" for human-inspectable schema tracking (no code reads version today,
so the bump is safe).

Module-level constants PASSAGE_ID_SCHEME_SEQUENTIAL / _CONTENT_HASH document
the value space; the content-hash scheme itself ships in sub-PR 2.
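
As a rough illustration of the metadata change, the sketch below writes only the two fields named in this commit. The constant names and values come from the description above; `write_meta` is a hypothetical helper, and a real .meta.json carries other fields that are omitted here.

```python
import json
from pathlib import Path

PASSAGE_ID_SCHEME_SEQUENTIAL = "sequential"
PASSAGE_ID_SCHEME_CONTENT_HASH = "content-hash"


def write_meta(path: Path, scheme: str = PASSAGE_ID_SCHEME_SEQUENTIAL) -> dict:
    """Hypothetical helper: write a minimal meta file recording the ID scheme."""
    meta = {
        "version": "1.1",             # informational schema version
        "passage_id_scheme": scheme,  # new field; older loaders ignore it
    }
    path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta
```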

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sub-PR 2 of 5 from yichuan-w#329. Builds on yichuan-w#330 (which added the meta.json field).

New behavior:
- `LeannBuilder(..., passage_id_scheme="content-hash")` makes add_text() key
  passages by sha256(text)[:16] instead of insertion index. Stable across file
  moves, reorderings, and re-runs of the same corpus.
- `leann build --id-scheme content-hash` exposes it at the CLI.
- Default unchanged ("sequential"). Existing indexes continue to work
  identically; no migration triggered.
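
A minimal sketch of the sha256(text)[:16] key, assuming the text is UTF-8 encoded and the ID is the first 16 characters of the hex digest (the commit message does not spell out either detail). `content_hash_id` is a hypothetical name.

```python
import hashlib


def content_hash_id(text: str) -> str:
    """Hypothetical helper: derive a passage ID from the passage text alone."""
    # Assumption: UTF-8 bytes in, hex digest out, truncated to 16 characters.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


# The ID depends only on the text, so it does not change with file path or order.
first = content_hash_id("The quick brown fox")
second = content_hash_id("The quick brown fox")
assert first == second
```

Because the insertion index never enters the hash, re-running the same corpus in a different order yields the same IDs.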

Identical-text chunks collide (same hash). For this sub-PR the second
occurrence overwrites the first in the offset map — that's the dedup
behavior I'd want by default. A `--preserve-duplicates` escape hatch can
land later if needed (see the open question in yichuan-w#329).
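
The overwrite-on-collision behavior can be pictured with a plain dict. In this sketch the offset is the byte position of each chunk in a hypothetical newline-delimited file; the real offset map format is not shown in this PR, so treat the layout as an assumption.

```python
import hashlib
from typing import Dict, Iterable


def build_offset_map(chunks: Iterable[str]) -> Dict[str, int]:
    """Hypothetical helper: map content-hash IDs to byte offsets."""
    offsets: Dict[str, int] = {}
    position = 0
    for text in chunks:
        pid = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        # A repeated text produces the same key, so the later offset wins.
        offsets[pid] = position
        # Assumption: one chunk per line, terminated by a single newline byte.
        position += len(text.encode("utf-8")) + 1
    return offsets


offsets = build_offset_map(["alpha", "beta", "alpha"])
assert len(offsets) == 2  # three chunks, two distinct texts
```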

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sub-PR 3 of 5 from yichuan-w#329. Builds on yichuan-w#330 / yichuan-w#331.

Two changes:
1. `LeannCLI._make_incremental_builder` now reads the existing index's
   `passage_id_scheme` from meta.json and uses that, ignoring any conflicting
   `--id-scheme` on the args (with a note printed). Otherwise an update
   command on a content-hash index would mix sequential IDs into a hash-keyed
   passages.jsonl and break lookups.
2. `LeannSearcher` exposes `self.passage_id_scheme` so consumers can
   introspect; defaults to "sequential" for older indexes that don't record
   it (pre-yichuan-w#330).
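
Both changes amount to "trust what meta.json says, and assume sequential when it says nothing". A hedged sketch, with `read_scheme` and `scheme_for_update` as hypothetical helpers rather than the actual LeannCLI or LeannSearcher code, and an invented wording for the note:

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def read_scheme(meta_path: Path) -> str:
    """Hypothetical helper: read the recorded scheme, defaulting for old indexes."""
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    # Indexes written before the field existed are treated as sequential.
    return meta.get("passage_id_scheme", "sequential")


def scheme_for_update(existing: str, requested: Optional[str]) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: the existing scheme always wins on an update."""
    if requested is not None and requested != existing:
        # Assumption: the exact text of the printed note is illustrative only.
        note = f"Ignoring --id-scheme {requested}; index already uses {existing}."
        return existing, note
    return existing, None
```

Returning the note instead of printing it keeps the decision testable; a caller would print it when it is not `None`.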

No behavior change for fresh builds — the CLI's --id-scheme still controls
which scheme a brand-new index gets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nt-hash

Sub-PR 4 of 5 from yichuan-w#329. Builds on the reader-side support in sub-PR 3.

`leann migrate-ids <index>` rewrites an existing sequential-ID index to use
sha256(text)[:16] keys. Purely a Python-side rewrite — FAISS labels stay
valid, only the string IDs they map to change. Four artifacts get rewritten
atomically (via .migrate sibling files + shutil.move):
  - .passages.jsonl  : new "id" field per line
  - .passages.idx    : new offset map keyed by new IDs (deduplicates collisions)
  - .ids.txt         : new label → ID mapping (preserves FAISS label count)
  - .meta.json       : passage_id_scheme = "content-hash", version = "1.1"
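
The sibling-file pattern mentioned above might look like the sketch below for a single artifact. `replace_via_sibling` is a hypothetical helper, and the `.migrate` suffix handling is an assumption about naming. Note that `shutil.move` is only a true atomic rename when source and destination are on the same filesystem; otherwise it falls back to copy and delete.

```python
import shutil
from pathlib import Path


def replace_via_sibling(target: Path, new_content: str) -> None:
    """Hypothetical helper: write next to the target, then move into place."""
    # Assumption: the sibling is the target name with ".migrate" appended.
    sibling = target.with_name(target.name + ".migrate")
    # Write the full new content first, so a crash here leaves the target untouched.
    sibling.write_text(new_content, encoding="utf-8")
    # Swap the finished file over the original.
    shutil.move(str(sibling), str(target))
```

Writing all four sibling files before moving any of them keeps the window in which the artifacts disagree as short as possible.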

Collisions (identical-text chunks) get deduped — the later occurrence wins
in the offset map. ids.txt still has one line per FAISS label so the graph
keeps working; duplicate labels just point to the same passage.

Interactive confirmation required unless --yes is passed. --dry-run reports
the collision count without writing anything.
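
To make the label bookkeeping concrete, here is a rough in-memory sketch of the ID rewrite and the collision count a dry run would report. `migrate_labels` is a hypothetical function; it assumes passage texts are available in FAISS label order and that IDs are hex digests of UTF-8 text.

```python
import hashlib
from typing import List, Tuple


def migrate_labels(texts_by_label: List[str]) -> Tuple[List[str], int]:
    """Hypothetical helper: compute new IDs per label and count collisions."""
    # One new ID per label, in label order, so the label count is preserved.
    new_ids = [
        hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        for text in texts_by_label
    ]
    # Labels whose text duplicates another label share an ID.
    collisions = len(new_ids) - len(set(new_ids))
    return new_ids, collisions


ids, collisions = migrate_labels(["alpha", "beta", "alpha"])
assert len(ids) == 3      # still one entry per label
assert ids[0] == ids[2]   # duplicate labels point at the same passage
assert collisions == 1
```

A dry run under this sketch would stop after computing `collisions` and write nothing.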

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hash"

Sub-PR 5 of 5 from yichuan-w#329. Builds on the migrate-ids command in sub-PR 4.

Flips the default at every layer:
- `LeannBuilder(passage_id_scheme=...)` default in api.py
- `leann build --id-scheme=...` default in cli.py
- Fallback in `_make_incremental_builder` when args has neither an existing
  index scheme nor an explicit flag (rare: brand-new builder, no args)

New indexes built without specifying a scheme now get content-hash IDs by
default — file-move-stable, dedup-friendly. Existing indexes are unaffected
(they continue with whatever scheme is in their meta.json). Users who
explicitly want sequential can opt back via `--id-scheme sequential`.

This is the user-visible point at which the default flips. Combined with
sub-PRs 1-4 it's the full migration: old indexes keep working, new builds
get the better default, and `leann migrate-ids` is available for converting
existing data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raoabinav raoabinav force-pushed the feat/passage-id-scheme-default-content-hash branch from 83123ae to 52cee53 Compare May 20, 2026 23:25
@raoabinav raoabinav changed the title feat: switch default passage_id_scheme from sequential to content-hash [content-hash 5/5] feat: switch default passage_id_scheme to content-hash May 20, 2026

1 participant