perf: fix O(N²) version column scan with deletion vectors by pengw0048 · Pull Request #6375 · lance-format/lance

pengw0048 · 2026-04-01T16:53:06Z

Problem

Scanning _row_created_at_version or _row_last_updated_at_version is extremely slow on fragments with deletion vectors — 53 seconds for 1M rows (vs 0.03s for a regular data column on the same table).

This makes the delta() API (get_inserted_rows() / get_updated_rows()) unusable on any table that has had deletions.

Root Cause

In apply_row_id_and_deletes() (lance-table/src/utils/stream.rs), the version column is built by:

sequence.versions()
    .skip(r.start as usize)
    .take((r.end - r.start) as usize)

versions() creates a fresh iterator from the start of the RLE sequence. For each range in the selection, skip(r.start) walks through all prior elements — O(rows) per range. With deletion vectors creating many small ranges, this becomes O(rows × ranges).

Fix

Replace with version_values_for_selection() that:

Fast-paths single-run fragments (common case: all rows same version) with vec![version; count] — O(1)
Binary search over precomputed run offsets for O(log(runs)) per position in the multi-run case

Benchmark

1M rows, 33% deleted, _row_created_at_version scan:

	Before	After
Time	53s	0.02s
Complexity	O(rows × ranges)	O(rows × log(runs))

Minimal Reproduction

import lance, pyarrow as pa, numpy as np, time

uri = "/tmp/test_version_col_perf.lance"
lance.write_dataset(
    pa.table({"val": np.random.randint(0, 1000, 1_000_000)}),
    uri, mode="overwrite", enable_stable_row_ids=True,
)
ds = lance.dataset(uri)
ds.delete("val < 333")
ds = lance.dataset(uri)

t0 = time.time()
ds.scanner(columns=["_row_created_at_version"]).to_table()
print(f"_row_created_at_version with deletions: {time.time()-t0:.2f}s")

t0 = time.time()
ds.scanner(columns=["val"]).to_table()
print(f"normal column with deletions: {time.time()-t0:.2f}s")

When scanning `_row_created_at_version` or `_row_last_updated_at_version` on fragments with deletion vectors, the previous implementation called `sequence.versions().skip(r.start).take(...)` for each range in the selection. Since `versions()` creates a fresh iterator from the start, `skip(r.start)` walks through all prior elements — O(rows) per range, leading to O(rows × ranges) total work. Replace with `version_values_for_selection()` that: - Fast-paths single-run fragments (common case) with `vec![version; count]` - Uses binary search over precomputed run offsets for O(log(runs)) per position in the multi-run case Benchmark (1M rows, 33% deleted): - Before: 53s - After: 0.02s

codecov · 2026-04-01T19:31:17Z

Codecov Report

❌ Patch coverage is 96.17834% with 6 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-table/src/utils/stream.rs	96.17%	6 Missing ⚠️

📢 Thoughts on this report? Let us know!

BubbleCal

Thanks for the great improvements!

BubbleCal · 2026-04-02T08:02:47Z

rust/lance-table/src/utils/stream.rs

+        for pos in r.start..r.end {
+            let pos = pos as usize;
+            if pos >= total_len {
+                versions.push(1); // fallback


i think we will get an error at https://github.com/lance-format/lance/pull/6375/changes#diff-e3ed3f2ff36ff45b28ec19c75b4b9fadcfd857e2ced681ceb22b07fdc4e705b4R360 if we don't have enough rows, but this would hide the error

BubbleCal · 2026-04-02T08:05:12Z

rust/lance-table/src/utils/stream.rs

+                continue;
+            }
+            // Binary search for the run containing this position
+            let run_idx = match run_offsets.binary_search(&pos) {


need one more test to cover this path

…n test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

BubbleCal

Good to go once we have all greens

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

pengw0048 · 2026-04-02T17:19:27Z

@BubbleCal could you approve the workflows again? Fixed the formatting and the windows test failure seemed unrelated.

github-actions bot added the performance label Apr 1, 2026

test: add test for version column with deletion vectors

27c9ce3

pengw0048 force-pushed the fix/version-column-perf-with-deletions branch from 7181475 to 27c9ce3 Compare April 2, 2026 04:40

pengw0048 marked this pull request as ready for review April 2, 2026 04:41

BubbleCal reviewed Apr 2, 2026

View reviewed changes

Address review: return error instead of silent fallback, add multi-ru…

649c8df

…n test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

pengw0048 requested a review from BubbleCal April 2, 2026 13:43

BubbleCal approved these changes Apr 2, 2026

View reviewed changes

fix: cargo fmt

34ae821

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

BubbleCal merged commit 8bd9b90 into lance-format:main Apr 2, 2026
28 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: fix O(N²) version column scan with deletion vectors#6375

perf: fix O(N²) version column scan with deletion vectors#6375
BubbleCal merged 4 commits intolance-format:mainfrom
pengw0048:fix/version-column-perf-with-deletions

pengw0048 commented Apr 1, 2026

Uh oh!

codecov bot commented Apr 1, 2026 •

edited

Loading

Uh oh!

BubbleCal left a comment

Uh oh!

BubbleCal Apr 2, 2026

Uh oh!

BubbleCal Apr 2, 2026

Uh oh!

BubbleCal left a comment

Uh oh!

pengw0048 commented Apr 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pengw0048 commented Apr 1, 2026

Problem

Root Cause

Fix

Benchmark

Minimal Reproduction

Uh oh!

codecov bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

BubbleCal left a comment

Choose a reason for hiding this comment

Uh oh!

BubbleCal Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

BubbleCal Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

BubbleCal left a comment

Choose a reason for hiding this comment

Uh oh!

pengw0048 commented Apr 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Apr 1, 2026 •

edited

Loading