Lance Vector Index: Growing Beyond the Final Merge #6165
Replies: 3 comments 1 reply
This is actually exactly how the FTS index works today, so you could use that as inspiration. Also, the vector index already supports delta indices, which are basically what you are describing as segments. So we should already have the code in place to search segments in parallel.
A searchable segment should probably fit into the RAM of a system that will be used for queries. That way, if we wanted to, we could set up a distributed search system and fit the entire index into RAM. If we can't justify that RAM cost, we can still page in partitions as needed with some kind of LRU cache. Given this, I think 50GB is a reasonable default upper bound on segment size, but it should definitely be configurable.
At 50GB per segment, a 1T-row index would need ~200 segments (i.e., 200 query nodes to cache the whole index in RAM), which seems reasonable.
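As a rough sketch of the paging idea described above: the names below (`segments_needed`, `SegmentCache`, the loader callback) are hypothetical, not Lance APIs, and the ~10 index bytes per row is just the figure implied by 50GB × 200 segments for 1T rows.

```python
from collections import OrderedDict

# Assumed default; the comment above argues this should be configurable.
SEGMENT_SIZE_BYTES = 50 * 10**9  # 50 GB

def segments_needed(total_rows, index_bytes_per_row):
    """Ceiling of total index size over the per-segment cap."""
    total_bytes = total_rows * index_bytes_per_row
    return -(-total_bytes // SEGMENT_SIZE_BYTES)  # ceiling division

class SegmentCache:
    """Page segments into RAM on demand, evicting the least recently used."""
    def __init__(self, capacity, loader):
        self.capacity = capacity   # max segments resident at once
        self.loader = loader       # callable: segment_id -> segment data
        self._cache = OrderedDict()

    def get(self, segment_id):
        if segment_id in self._cache:
            self._cache.move_to_end(segment_id)  # mark as recently used
        else:
            if len(self._cache) >= self.capacity:
                self._cache.popitem(last=False)  # evict LRU entry
            self._cache[segment_id] = self.loader(segment_id)
        return self._cache[segment_id]

# Sanity check on the arithmetic above: 1T rows at ~10 index bytes per row
# fills exactly 200 segments of 50 GB.
print(segments_needed(10**12, 10))  # -> 200
```
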
We have code in place today to search multiple deltas. I think the existing delta code and this new segment code should converge; I don't want two distinct "segment / delta" concepts. We don't have code for distributed search, but that seems out of scope for lance-format anyway.
We already have a process for creating deltas to absorb new data and then periodically merging those deltas, so I think the segment-splitting question will come into play during that merge. When we merge deltas, we should merge them into the "current segment". If that segment grows too large, we seal it and start a new one. The new segment should inherit the centroids of the previous segment, but only as initial state; the centroids can migrate per the usual SPFresh process as the segment grows. As a result, each segment will have its own set of centroids, but those centroids will all be correct for the segment itself.
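A minimal sketch of that seal-and-inherit flow, purely illustrative: `Segment`, `merge_deltas`, and the size fields are hypothetical names, not Lance APIs, and deltas are represented here only by their size in bytes.

```python
MAX_SEGMENT_BYTES = 50 * 10**9  # assumed configurable cap, per the discussion above

class Segment:
    def __init__(self, centroids):
        # Centroids are inherited as *initial* state only; an SPFresh-style
        # process can migrate them as the segment grows.
        self.centroids = list(centroids)
        self.size_bytes = 0
        self.sealed = False

def merge_deltas(segments, delta_sizes):
    """Fold delta indices (given here just by size) into the current segment,
    sealing it and starting a new one whenever the cap would be exceeded."""
    current = segments[-1]
    for size in delta_sizes:
        if current.size_bytes + size > MAX_SEGMENT_BYTES:
            current.sealed = True
            current = Segment(current.centroids)  # inherit centroids
            segments.append(current)
        current.size_bytes += size
        # A real merge would also fold the delta's vectors/postings in here.
    return segments
```
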
I created a detailed design doc as a follow-up to this proposal: #6189. Feel free to take a look and review!
Background
Today, distributed vector index builds do a good job of spreading the heavy lifting across multiple shards. But at the end of the pipeline, the system still needs to converge all those shard outputs into a single final index.
As data grows, this "fan-out then funnel-in" pattern turns the final convergence step into a bottleneck. No matter how well the earlier stages scale horizontally, the last merge remains the hardest part of the pipeline to scale.
So the question this proposal raises isn't "how do we make the final merge faster?" — it's something more fundamental:
Does a vector index have to converge into a single monolithic index at the end, or can it instead evolve into a small number of stable, searchable segments as data grows?
The Core Idea
The basic idea is straightforward:
This is not "shard everything from the start." It's a progressive growth model:
Why This Direction
Two problems motivate this.
1. The final convergence step doesn't scale
If the end state always has to be a single index, then no matter how distributed the build is, there's always a centralized finalization step at the tail end. As data grows, this step gets more expensive — and there's no way to scale it horizontally.
Allowing the final index to remain as multiple large searchable segments removes the need to compress everything into a monolith at the very end.
2. Index structure should match data scale
A single searchable index makes perfect sense for small datasets — simple structure, direct query path, minimal overhead. But as data scales up, the marginal benefit of maintaining a monolith decreases while the maintenance cost keeps climbing.
A more natural progression is:
What This Would Look Like
If this direction holds up, the evolution path would go something like:
The goal is emphatically not unlimited sharding: the number of searchable segments should stay within a small constant range, e.g. 8 or 16 for a 1B-row dataset.
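To make that bound concrete, here is an illustrative (non-Lance) calculation. The 125M rows per segment is an assumption chosen so that 1B rows lands at 8 segments, and the cap of 16 reflects the "small constant range" above.

```python
ROWS_PER_SEGMENT = 125_000_000  # assumption: 1B rows / 8 segments
MAX_SEGMENTS = 16               # small constant upper bound on segments

def target_segment_count(total_rows):
    """Progressive growth: one index when small, capped segment count when large."""
    uncapped = max(1, -(-total_rows // ROWS_PER_SEGMENT))  # ceiling division
    return min(uncapped, MAX_SEGMENTS)

print(target_segment_count(10_000_000))       # small dataset: single index -> 1
print(target_segment_count(1_000_000_000))    # 1B rows -> 8
print(target_segment_count(100_000_000_000))  # growth saturates at the cap -> 16
```
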
How This Relates to What Already Exists
It's more of a natural extension of capabilities Lance already has:
What This Proposal Explicitly Defers
To keep the discussion focused, this proposal intentionally avoids trying to solve everything in phase one.
It does not attempt to address:
What it does want to establish:
A Systems Evolution Perspective
From a systems standpoint, this looks a lot like introducing a tiered maintenance model for vector indexes:
If the community agrees this direction makes sense, the follow-up discussions would naturally center on: