Design: Unify FTS into the Global Segment Model #6301

Xuanwo · 2026-03-26T10:34:47Z

Xuanwo
Mar 26, 2026
Maintainer

Background

Lance is unifying its index system around a framework-level segment architecture so that build, commit, query fan-out, compact, and GC all operate on the same physical unit.

FTS needs to align with this model.

How FTS Works Today

On-Disk Layout

An FTS index lives in a single root directory:

<dataset>/_indices/<index_uuid>/
├── metadata.lance            # partition list + FTS params
├── part_0_tokens.lance       # token dictionary (FST)
├── part_0_docs.lance         # doc_id → row_id + num_tokens
├── part_0_invert.lance       # posting lists + positions
├── part_1_tokens.lance
├── part_1_docs.lance
├── part_1_invert.lance
└── ...

metadata.lance + part_<id>_* form a stable FTS root format that the builder, loader, and distributed finalize all depend on.
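As a minimal sketch of the layout above, the file set of one FTS root can be derived from its partition list alone. The helper name `partition_files` is hypothetical and not part of Lance; only the file-name pattern is taken from the listing.

```python
from pathlib import PurePosixPath

# The three per-partition payload files named in the layout above.
PART_SUFFIXES = ("tokens", "docs", "invert")


def partition_files(index_root: str, partition_ids: list[int]) -> list[str]:
    """List every file of one FTS root: metadata.lance plus 3 files per partition."""
    root = PurePosixPath(index_root)
    files = [str(root / "metadata.lance")]
    for pid in partition_ids:
        for suffix in PART_SUFFIXES:
            files.append(str(root / f"part_{pid}_{suffix}.lance"))
    return files
```

For a root with partitions 0 and 1 this yields seven paths: one metadata file and six payload files.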

Build

Read source columns, tokenize.
When a worker's memory reaches the threshold (default 2 GiB, controlled by LANCE_FTS_PARTITION_SIZE), spill to part_<id>_* files.
Merge tail builders across workers.
Write metadata.lance.

Spilled part files are the final payload — there is no separate intermediate-to-final format conversion.
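The spill cadence can be illustrated with a toy model. This is a sketch under simplifying assumptions: it treats memory use as a sum of per-document sizes, whereas the real builder tracks its in-memory posting structures. `plan_spills` is a hypothetical name.

```python
def plan_spills(doc_sizes: list[int], threshold: int) -> tuple[list[list[int]], list[int]]:
    """Group consecutive documents into spilled partitions.

    A partition is spilled as soon as the buffered size reaches `threshold`.
    Whatever is still buffered at the end is the worker's tail builder.
    """
    partitions: list[list[int]] = []
    current: list[int] = []
    used = 0
    for doc_id, size in enumerate(doc_sizes):
        current.append(doc_id)
        used += size
        if used >= threshold:
            partitions.append(current)  # spill: this becomes one part_<id>_* set
            current, used = [], 0
    return partitions, current
```

With five documents of size 4 and a threshold of 10, the worker spills one partition holding documents 0 to 2 and keeps documents 3 and 4 in its tail builder.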

Distributed Build

Each worker processes a subset of fragments, writes part_<id>_tokens/docs/invert.lance + part_<id>_metadata.lance.
The coordinator collects per-part metadata, reassigns partition IDs, renames files, and writes the unified root metadata.lance.

Query

Open a single FTS root; load every partition's TokenSet (FST) and DocSet into memory.
Execute BM25 + WAND scoring in parallel across partitions.
Merge top-k within the root.

BM25 scoring depends on two kinds of global information:

Corpus-level: num_docs, avg_doc_length (aggregated from DocSet).
Per-token: document frequency (aggregated from each partition's TokenSet + PostingListReader).
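A minimal sketch of that aggregation, assuming each partition can report its document count, total token count, and per-token document frequency. `PartitionStats` is a hypothetical stand-in for what DocSet and TokenSet + PostingListReader provide, and the IDF shown is the common Lucene-style variant, which may differ from the one Lance uses.

```python
import math
from dataclasses import dataclass


@dataclass
class PartitionStats:
    num_docs: int
    total_tokens: int
    doc_freq: dict[str, int]  # token -> number of documents containing it


def aggregate_corpus(parts: list[PartitionStats]) -> tuple[int, float]:
    """Return (num_docs, avg_doc_length) across all partitions."""
    num_docs = sum(p.num_docs for p in parts)
    total_tokens = sum(p.total_tokens for p in parts)
    # Divide the summed totals; averaging per-partition averages would be
    # wrong whenever partitions hold different numbers of documents.
    avg_len = total_tokens / num_docs if num_docs else 0.0
    return num_docs, avg_len


def global_df(parts: list[PartitionStats], token: str) -> int:
    """Sum one token's document frequency over all partitions."""
    return sum(p.doc_freq.get(token, 0) for p in parts)


def bm25_idf(num_docs: int, df: int) -> float:
    """Lucene-style BM25 IDF (an assumption, see the lead-in)."""
    return math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
```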

Append

Build new partitions for the incoming data.
Copy each old partition's 3 files to the new index directory (server-side copy when the ObjectStore scheme matches).
Rewrite root metadata.lance.

Append cost grows linearly with the number of existing partitions.

Delete

Lazy: dead rows remain in posting lists and are filtered at query time. Physical cleanup depends on optimize.

Problems

The FTS root is currently the only physical management unit. This causes:

Unclear control-plane / data-plane boundary: the framework has no stable physical unit to operate on; partitions are internal implementation details of the root.
Linear append cost: every append must copy all existing partitions to a new directory.
No framework-level identity for distributed build output: worker artifacts are folded back into a single root instead of being committed as independent units.

Design

Core Decisions

Each FTS segment reuses the existing complete FTS root format.
New segments default to single-part (partitions: [0]).
A logical FTS index is the set of same-named segments in the manifest.

<dataset>/_indices/
├── <segment_uuid_1>/
│   ├── metadata.lance
│   ├── part_0_tokens.lance
│   ├── part_0_docs.lance
│   └── part_0_invert.lance
├── <segment_uuid_2>/
│   ├── metadata.lance
│   ├── part_0_tokens.lance
│   ├── part_0_docs.lance
│   └── part_0_invert.lance
└── ...

Legacy multi-part segments remain readable.
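The "set of same-named segments" rule can be sketched as a group-by over manifest entries. `IndexEntry` is a hypothetical simplification of the manifest-level IndexMetadata; it keeps only uuid and name, and models fragment_bitmap as a plain set of fragment IDs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    uuid: str                     # segment directory under _indices/
    name: str                     # logical index name
    fragment_ids: frozenset[int]  # stand-in for fragment_bitmap


def logical_indices(entries: list[IndexEntry]) -> dict[str, list[IndexEntry]]:
    """Group manifest entries by name; each group is one logical index."""
    groups: dict[str, list[IndexEntry]] = defaultdict(list)
    for entry in entries:
        groups[entry.name].append(entry)
    return dict(groups)


def covered_fragments(segments: list[IndexEntry]) -> set[int]:
    """Union of the fragments covered by the segments of one logical index."""
    covered: set[int] = set()
    for seg in segments:
        covered |= seg.fragment_ids
    return covered
```

Under this view, appending adds one more entry with the same name, and the query side fans out over every entry in the group.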

Metadata

No metadata schema changes in the initial phases. The existing manifest-level IndexMetadata (uuid, name, fields, fragment_bitmap, etc.) and segment-local metadata.lance (params, partitions, token_set_format, etc.) are sufficient to support the segment control plane.

Future work may promote segment-local fields (token_set_format, posting_tail_codec, etc.) to manifest metadata and add statistical fields such as num_docs_raw and total_tokens_raw so the planner can make decisions without opening payload files. This is not a prerequisite for segmentation.

Global BM25 Scoring

BM25 requires two kinds of global information:

Corpus-level statistics (num_docs, avg_doc_length): aggregated from each segment's DocSet.
Per-token df: probe each segment's TokenSet (FST lookup) + PostingListReader::lengths[token_id].

Phase 1 — dfs_query_then_search:

Coordinator collects candidate segments.
Each node probes its local segments for query-token df and returns results to the coordinator.
Coordinator aggregates global df and constructs a global BM25 scorer.
Each node executes local search with the global scorer and returns local top-k.
Coordinator merges top-k.

Cost: candidate_segments × query_tokens probes. For 100 segments and 3 query tokens this is 300 FST lookups + 300 length reads.
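The two rounds can be sketched end to end with toy in-memory segments. Everything here is illustrative: `Segment` is a hypothetical stand-in that stores tokenized documents in a dict, the scan is brute force rather than WAND, and the BM25 variant (Lucene-style IDF, k1 = 1.2, b = 0.75) is an assumption. At least one document is assumed to exist.

```python
import heapq
import math
from itertools import chain


class Segment:
    """Hypothetical stand-in for one FTS segment: row_id -> token list."""

    def __init__(self, docs: dict[int, list[str]]):
        self.docs = docs

    def corpus_stats(self) -> tuple[int, int]:
        """Return (num_docs, total_tokens) for this segment."""
        return len(self.docs), sum(len(toks) for toks in self.docs.values())

    def df(self, token: str) -> int:
        """Local document frequency: one probe per (segment, token)."""
        return sum(1 for toks in self.docs.values() if token in toks)

    def search(self, tokens, idf, avgdl, k, k1=1.2, b=0.75):
        """Score local documents with the *global* idf and avgdl; return local top-k."""
        scored = []
        for row_id, toks in self.docs.items():
            score = 0.0
            for t in tokens:
                tf = toks.count(t)
                if tf:
                    norm = k1 * (1 - b + b * len(toks) / avgdl)
                    score += idf[t] * tf * (k1 + 1) / (tf + norm)
            if score > 0:
                scored.append((score, row_id))
        return heapq.nlargest(k, scored)


def dfs_query_then_search(segments, tokens, k):
    # Round 1: collect corpus statistics and per-token df from every segment.
    stats = [s.corpus_stats() for s in segments]
    num_docs = sum(n for n, _ in stats)
    avgdl = sum(t for _, t in stats) / num_docs
    df = {t: sum(s.df(t) for s in segments) for t in tokens}
    idf = {t: math.log(1 + (num_docs - df[t] + 0.5) / (df[t] + 0.5)) for t in tokens}
    # Round 2: each segment searches locally with the global scorer inputs.
    local_topk = [s.search(tokens, idf, avgdl, k) for s in segments]
    # Coordinator merges the local top-k lists.
    return heapq.nlargest(k, chain.from_iterable(local_topk))
```

Because every segment scores with the same global inputs, splitting a corpus into segments does not change the ranking: querying two segments returns the same top-k as querying one segment holding the union of their documents.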

Phase 2 — metadata acceleration:

Write lightweight df summaries at segment build time to reduce query-time probes:

Per-segment top-K token df table (covers the high-frequency hot path).
Count-min sketch (space-efficient approximate df).
Bloom filter (token existence only; useful for segment pruning).
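To make the count-min option concrete, here is a small sketch of a mergeable df summary. It is a generic count-min sketch, not a proposed on-disk format; the class name, default dimensions, and hashing choice (salted BLAKE2b) are all assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate per-token counts; estimates never fall below the true count."""

    def __init__(self, width: int = 1024, depth: int = 4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, token: str):
        # One independent hash per row, derived by salting BLAKE2b with the row index.
        for i in range(self.depth):
            digest = hashlib.blake2b(
                token.encode("utf-8"), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield i, int.from_bytes(digest, "little") % self.width

    def add(self, token: str, count: int = 1) -> None:
        for i, j in self._slots(token):
            self.rows[i][j] += count

    def estimate(self, token: str) -> int:
        return min(self.rows[i][j] for i, j in self._slots(token))

    def merge(self, other: "CountMinSketch") -> "CountMinSketch":
        """Element-wise sum; valid only for sketches with identical dimensions."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions differ")
        out = CountMinSketch(self.width, self.depth)
        out.rows = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(self.rows, other.rows)]
        return out
```

Since sketches with identical dimensions merge by addition, a coordinator could sum per-segment sketches to get an upper bound on global df without probing any FST. The estimate can overcount, so it trades exact IDF for fewer probes.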

Build

Workers produce complete FTS segment roots directly.

In distributed builds each worker outputs independent segments; the coordinator performs a logical commit to the manifest. No cross-worker rename/finalize step is needed.

Append

New data produces a new segment. No files from existing segments are copied.

Compact

Select a set of old segments → re-scan source data by fragment coverage and live rows → rebuild into fewer new segments using the existing builder.

Goals: reduce segment fan-out, apply delete cleanup, produce more compact payload.

Cost is O(source_data), not O(index_size). The trade-off is simplicity: compact reuses the full build path and avoids cross-segment posting-list merge.

Delete

Segments are immutable. Deleted rows are lazily filtered at query time. Physical cleanup is performed by compact.
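A small sketch of query-time filtering, assuming hits are (score, row_id) pairs and deletions are available as a set of row IDs; both representations and the name `top_k_live` are hypothetical. The point it illustrates is ordering: dead rows must be dropped before truncating to k.

```python
import heapq


def top_k_live(scored, deleted: set[int], k: int):
    """Return the k best (score, row_id) hits whose rows are still live."""
    live = (hit for hit in scored if hit[1] not in deleted)
    return heapq.nlargest(k, live)
```

Truncating first and filtering afterwards can return fewer than k hits even when enough live matches exist, which is one reason heavily deleted segments benefit from compaction.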

Open Questions

1. Logical metadata location: continue with manifest-level grouping long-term, or introduce a dedicated logical index header?
2. Segment sizing policy: LANCE_FTS_PARTITION_SIZE (default 2 GiB) controls worker memory limits and spill cadence. Under the single-part-per-segment model, a worker that spills multiple times needs either a post-spill merge or to emit multiple segments. The sizing policy and the spill-to-segment mapping need to be co-designed.
3. Segment-level synopsis: minimum set of per-segment summaries for query pruning (bloom filters, token sketches, min/max statistics).
4. WAND upper bound under global IDF: today's block-max scores are precomputed with partition-local statistics. After switching to global IDF, block-max upper bounds must be recomputed; otherwise WAND pruning becomes loose.

BubbleCal · 2026-03-26T13:22:10Z

BubbleCal
Mar 26, 2026
Maintainer

For 1, I think it's fine to keep a metadata file for each segment, since they are small, but we can add a logical index metadata in the long term to make it cleaner.

For 2, I'm not too worried about this. We now have a memory_limit param for creating an FTS index; we can set it to a larger value to get fewer partitions, and the build merges all tail partitions, so I expect it won't create many partitions.

For 3, a bloom filter may help; the others may not. I don't think any of these would noticeably improve queries: FTS already returns empty results if all tokens are missing, and checking token existence is not expensive.

For 4, it may be fine to do in-segment scoring, which is also how other systems handle segments. We can add a param to enable global scoring and just let people know it would be slower.

1 reply

westonpace Mar 26, 2026
Maintainer

For 2:

For now, just create multiple segments per worker. In the future we will need segment merge anyway, because right now every time we run "optimize" we create a new tiny segment, and we need some way to merge those segments together. I think that if we write out posting lists sorted alphabetically by token (key), a low-RAM merge becomes possible. I think the merge routines we have today require enough RAM to hold both segments in memory at the same time, so the final segment size is RAM-limited. Even if we don't write out posting lists in any particular order, we could do a two-pass merge: first sort, then merge.
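A minimal sketch of the sorted-stream idea, assuming each segment can be read as an iterator of (token, row_ids) pairs already ordered by token. This layout is hypothetical and not today's Lance merge routine.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_sorted_postings(*streams):
    """Stream-merge posting lists from segments whose tokens are already sorted.

    Only one entry per input stream, plus the postings of the token currently
    being combined, is held in memory at a time.
    """
    merged = heapq.merge(*streams, key=itemgetter(0))
    for token, group in groupby(merged, key=itemgetter(0)):
        row_ids = []
        for _, posting in group:
            row_ids.extend(posting)
        yield token, sorted(row_ids)
```

Memory use then scales with the largest single posting list rather than with whole segments, which is what would lift the RAM limit on final segment size.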

For 3:

Let's get something working first and worry about optimization later. All the things you describe seem like they can be added on in a backwards-compatible fashion.
