Lance Vector Index: Growing Beyond the Final Merge #6165
Replies: 3 comments 1 reply
This is actually exactly how the FTS index works today, so you could use that as inspiration. Also, the vector index already supports delta indices, which are basically what you are describing as segments. So we should already have the code in place to search segments in parallel.
A searchable segment should probably fit into the RAM of a system that will be used for queries. That way, if we wanted to, we could set up a distributed search system and fit the entire index into RAM. If we can't justify that RAM cost, we can still page in partitions as needed with some kind of LRU cache. Given this, I think 50GB is a reasonable default upper bound on segment size, but it should definitely be configurable.
At 50GB per segment, a 1T-row index would need ~200 segments (i.e., 200 query nodes to cache the whole index in RAM), which seems reasonable.
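As a rough sketch of the paging idea described above: the names below (`segments_needed`, `SegmentCache`, the loader callback) are hypothetical, not Lance APIs, and the ~10 index bytes per row is just the figure implied by 50GB × 200 segments for 1T rows.

```python
from collections import OrderedDict

# Assumed default; the comment above argues this should be configurable.
SEGMENT_SIZE_BYTES = 50 * 10**9  # 50 GB

def segments_needed(total_rows, index_bytes_per_row):
    """Ceiling of total index size over the per-segment cap."""
    total_bytes = total_rows * index_bytes_per_row
    return -(-total_bytes // SEGMENT_SIZE_BYTES)  # ceiling division

class SegmentCache:
    """Page segments into RAM on demand, evicting the least recently used."""
    def __init__(self, capacity, loader):
        self.capacity = capacity   # max segments resident at once
        self.loader = loader       # callable: segment_id -> segment data
        self._cache = OrderedDict()

    def get(self, segment_id):
        if segment_id in self._cache:
            self._cache.move_to_end(segment_id)  # mark as recently used
        else:
            if len(self._cache) >= self.capacity:
                self._cache.popitem(last=False)  # evict LRU entry
            self._cache[segment_id] = self.loader(segment_id)
        return self._cache[segment_id]

# Sanity check on the arithmetic above: 1T rows at ~10 index bytes per row
# fills exactly 200 segments of 50 GB.
print(segments_needed(10**12, 10))  # -> 200
```
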
We have code in place today to search multiple deltas. I think the existing delta code and this new segment code should converge; I don't want two distinct "segment / delta" concepts. We don't have code for distributed search, but that seems out of scope for lance-format anyway.
We already have a process for creating deltas to absorb new data and then periodically merging those deltas, so I think the segment-splitting question will come into play during that merge. When we merge deltas, we should merge them into the "current segment". If that segment grows too large, we seal it and start a new one. The new segment should inherit the centroids of the previous segment, but only as initial state; the centroids can migrate per the usual SPFresh process as the segment grows. As a result, each segment will have its own set of centroids, but those centroids will all be correct for the segment itself.
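A minimal sketch of that seal-and-inherit flow, purely illustrative: `Segment`, `merge_deltas`, and the size fields are hypothetical names, not Lance APIs, and deltas are represented here only by their size in bytes.

```python
MAX_SEGMENT_BYTES = 50 * 10**9  # assumed configurable cap, per the discussion above

class Segment:
    def __init__(self, centroids):
        # Centroids are inherited as *initial* state only; an SPFresh-style
        # process can migrate them as the segment grows.
        self.centroids = list(centroids)
        self.size_bytes = 0
        self.sealed = False

def merge_deltas(segments, delta_sizes):
    """Fold delta indices (given here just by size) into the current segment,
    sealing it and starting a new one whenever the cap would be exceeded."""
    current = segments[-1]
    for size in delta_sizes:
        if current.size_bytes + size > MAX_SEGMENT_BYTES:
            current.sealed = True
            current = Segment(current.centroids)  # inherit centroids
            segments.append(current)
        current.size_bytes += size
        # A real merge would also fold the delta's vectors/postings in here.
    return segments
```
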
I created a detailed design doc as a follow-up to this proposal: #6189. Feel free to take a look and review!
Background
Today, distributed vector index builds do a good job of spreading the heavy lifting across multiple shards. But at the end of the pipeline, the system still needs to converge all those shard outputs into a single final index.
As data grows, this "fan-out then funnel-in" pattern turns the final convergence step into a bottleneck. No matter how well the earlier stages scale horizontally, the last merge remains the hardest part of the pipeline to scale.
So the question this proposal raises isn't "how do we make the final merge faster?" — it's something more fundamental:
Does a vector index have to converge into a single monolithic index at the end, or can it instead evolve into a small number of stable, searchable segments as data grows?
The Core Idea
The basic idea is straightforward:
This is not "shard everything from the start." It's a progressive growth model:
Why This Direction
Two problems motivate this.
1. The final convergence step doesn't scale
If the end state always has to be a single index, then no matter how distributed the build is, there's always a centralized finalization step at the tail end. As data grows, this step gets more expensive — and there's no way to scale it horizontally.
Allowing the final index to remain as multiple large searchable segments removes the need to compress everything into a monolith at the very end.
2. Index structure should match data scale
A single searchable index makes perfect sense for small datasets — simple structure, direct query path, minimal overhead. But as data scales up, the marginal benefit of maintaining a monolith decreases while the maintenance cost keeps climbing.
A more natural progression is:
What This Would Look Like
If this direction holds up, the evolution path would go something like:
The goal is emphatically not unlimited sharding: the number of searchable segments should stay within a small constant range, e.g. 8 or 16 for a 1B-row dataset.
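To make that bound concrete, here is an illustrative (non-Lance) calculation. The 125M rows per segment is an assumption chosen so that 1B rows lands at 8 segments, and the cap of 16 reflects the "small constant range" above.

```python
ROWS_PER_SEGMENT = 125_000_000  # assumption: 1B rows / 8 segments
MAX_SEGMENTS = 16               # small constant upper bound on segments

def target_segment_count(total_rows):
    """Progressive growth: one index when small, capped segment count when large."""
    uncapped = max(1, -(-total_rows // ROWS_PER_SEGMENT))  # ceiling division
    return min(uncapped, MAX_SEGMENTS)

print(target_segment_count(10_000_000))       # small dataset: single index -> 1
print(target_segment_count(1_000_000_000))    # 1B rows -> 8
print(target_segment_count(100_000_000_000))  # growth saturates at the cap -> 16
```
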
How This Relates to What Already Exists
It's more of a natural extension of capabilities Lance already has:
What This Proposal Explicitly Defers
To keep the discussion focused, this proposal intentionally avoids trying to solve everything in phase one.
It does not attempt to address:
What it does want to establish:
A Systems Evolution Perspective
From a systems standpoint, this looks a lot like introducing a tiered maintenance model for vector indexes:
If the community agrees this direction makes sense, the follow-up discussions would naturally center on: