Lance Vector Index: Multi-Segment Final State #6189
Replies: 5 comments 5 replies
Like this! Just a thought: is this new architecture really "only" about vector indexes, or is it a general topic for all kinds of indexes that could grow very large on big data sets?
Does the user specify one or both? How can Is
What is this?
I don't really have a problem with this if we want to go this route but it seems redundant. The manifest has a list of UUIDs. Wouldn't the most recent UUID in the manifest be the "active" UUID? Then we just look at it and see if it has space. If it does, we add to it; if it doesn't, we make a new index?
If you read https://lance.org/format/table/index/#basic-concepts, you'll find we already define the idea of index segments. So this isn't new. Could you describe more why we need a sealed and an active tag on segments? We don't have those for fragments. Does a sealed segment ever become active? Say I have a segment that covers fragments 0..100 and is sealed. Then fragments 50..100 are updated via data replacements, which invalidates that part of the index segment. And then fragments 25..50 are deleted. Now there are only fragments 0..25 active in this index, but it's no longer meeting the
Thank you @hpvd @westonpace @wjones127 for the review! I have rewritten the design doc, PTAL. Additionally, the new design doc eliminates the need for changes to the table format. As a result, a formal vote is no longer required.
1. Problem
Lance's format already defines an index segment model: an index consists of multiple segments, each being a self-contained physical index that describes its data coverage through a fragment_bitmap. However, the current distributed vector index build does not take advantage of this model. The build works in two phases:
The shard phase scales horizontally, but finalize must converge to a single point and rewrite all shard outputs in full — making it the scalability bottleneck. The root cause: the format already supports segmentation, but the build, transaction, and query semantics are not organized around it.
2. Goals
3. Design
3.1 Logical Index vs. Physical Segment
Users interact with vector indexes by index name (the logical index). A logical index is composed of a set of physical index segments that:
3.2 Distributed Build: Before and After
Current flow:
Proposed flow:
Finalize shifts from "full merge" to "validate, organize, and commit."
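A minimal sketch of what a "validate, organize, and commit" finalize could look like. The Segment class, the finalize function, and the disjoint-coverage check are hypothetical illustrations and not Lance APIs; the assumption that segments of one logical index must cover disjoint fragment sets is ours, since the doc's own segment rules are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for one IndexMetadata entry."""
    name: str                   # logical index name
    uuid: str                   # physical segment identity
    fragment_bitmap: frozenset  # fragment ids this segment covers


def finalize(name, shard_outputs):
    """Validate and organize shard outputs without rewriting them.

    shard_outputs: iterable of (uuid, fragment ids) pairs, one per shard.
    Returns the segments to commit under `name`.
    """
    segments = []
    seen = set()
    for uuid, fragments in shard_outputs:
        fragments = frozenset(fragments)
        if not fragments:
            raise ValueError(f"segment {uuid} covers no fragments")
        overlap = seen & fragments
        if overlap:
            # Assumed invariant: segments of one logical index are disjoint.
            raise ValueError(
                f"segment {uuid} overlaps fragments {sorted(overlap)}"
            )
        seen |= fragments
        segments.append(Segment(name, uuid, fragments))
    # "Organize": a deterministic order keeps the commit reproducible.
    return sorted(segments, key=lambda s: min(s.fragment_bitmap))


segments = finalize("vec_idx", [("seg-b", [2, 3]), ("seg-a", [0, 1])])
```

The point of the sketch is that finalize only inspects metadata: its cost grows with the number of segments, not with the size of the shard outputs.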
4. Transaction Semantics
4.1 Current Semantics and Limitations
CreateIndex's apply logic works as: delete all indexes with the same name → delete indexes specified by UUID in removed_indices → append new_indices. This assumes only one index survives per name and does not support long-lived multi-segment coexistence.
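To make the limitation concrete, the apply order described above can be modeled in a few lines. This is a simplified model and not the actual Lance implementation: the dict-based manifest is hypothetical, and we assume "same name" means the same name as an entry in new_indices.

```python
def apply_create_index(manifest, new_indices, removed_indices):
    """Model of the current apply order.

    manifest, new_indices, removed_indices: lists of
    {"name": ..., "uuid": ...} dicts (a hypothetical shape).
    """
    new_names = {idx["name"] for idx in new_indices}
    removed_uuids = {idx["uuid"] for idx in removed_indices}
    # Step 1: drop every existing index that shares a name with a new one.
    kept = [idx for idx in manifest if idx["name"] not in new_names]
    # Step 2: drop indexes listed by UUID in removed_indices.
    kept = [idx for idx in kept if idx["uuid"] not in removed_uuids]
    # Step 3: append the new indexes.
    return kept + list(new_indices)


manifest = [{"name": "vec_idx", "uuid": "seg-1"}]
after = apply_create_index(
    manifest,
    new_indices=[{"name": "vec_idx", "uuid": "seg-2"}],
    removed_indices=[],
)
# seg-1 is gone even though it was never listed in removed_indices.
```

Under this model, adding a second segment to an existing logical index is impossible: step 1 removes the first segment by name before the new one is appended.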
4.2 Proposed Semantics
Transactions must evolve from "replace all by name" to "precisely maintain the segment set by UUID":
4.3 Concurrency
With long-lived multi-segment coexistence, concurrency control must be defined around segment set mutations. The system must correctly handle:
5. Query Semantics
5.1 Query Flow
5.2 Phase 1 Trade-offs
Phase 1 accepts the following costs:
The goal is to establish correct multi-segment query semantics and provide a stable foundation for future optimization.
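One plausible reading of multi-segment query semantics is a fan-out and merge: search every segment of the logical index, then keep the global top k. The sketch below assumes that reading; search_logical_index and the per-segment callables are hypothetical stand-ins, and distances are assumed to be comparable across segments.

```python
import heapq


def search_logical_index(segments, query, k):
    """Fan out to every segment and merge the results.

    segments: callables mapping (query, k) -> list of (distance, row_id).
    """
    candidates = []
    for search_segment in segments:
        # Each segment returns at most k local candidates.
        candidates.extend(search_segment(query, k))
    # Smaller distance is better; keep the k best across all segments.
    return heapq.nsmallest(k, candidates)


def seg_a(query, k):
    return [(0.1, 10), (0.4, 11)][:k]


def seg_b(query, k):
    return [(0.2, 20), (0.3, 21)][:k]


top = search_logical_index([seg_a, seg_b], query=None, k=3)
```

In this sketch the work per query grows linearly with the number of segments.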
6. Integration with Existing Systems
6.1 Format / Manifest
Each segment remains a regular IndexMetadata entry. A logical index maps to multiple same-name entries in the manifest. No format-level changes are required.
6.2 Cleanup / GC
Since segments are standard IndexMetadata entries, the existing GC mechanism applies directly: manifest references determine liveness, and dereferenced index_uuids are cleaned up through the existing path.
6.3 Compaction / Remap
Compaction must remain compatible with multi-segment coexistence, continuing to maintain each segment's fragment_bitmap and its relationship to the underlying fragments.
7. Phase 1 Scope
7.1 Index Types
Covered: IVF_FLAT, IVF_PQ, IVF_SQ.
Deferred:
IVF_HNSW_*: stronger intra-partition semantics make segmented query and maintenance more involved.
IVF_RQ: on a separate evolution path.
7.2 Minimum Deliverables
8. Benefits