
perf: skip row ID index rebuild in dataset.update() with stable row IDs#6406

Closed

pengw0048 wants to merge 1 commit into lance-format:main from pengw0048:perf/update-skip-rowid-index-rebuild


Conversation

@pengw0048
Contributor

dataset.update() is ~400x slower with enable_stable_row_ids=True because it rebuilds the full row ID index (loading row ID sequences + deletion vectors from every fragment) on each update call. The index is cached by manifest version, but each update creates a new version, causing a cache miss.
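The failure mode described above can be sketched in isolation. This is a hypothetical model (class and method names are ours, not Lance's) of a cache keyed by manifest version: because every update commit bumps the version, each call misses the cache and pays for a full index rebuild.

```python
class Dataset:
    """Toy model of a version-keyed row ID index cache (illustrative only)."""

    def __init__(self):
        self.version = 1
        self._index_cache = {}   # manifest version -> cached row ID index
        self.rebuilds = 0

    def _row_id_index(self):
        if self.version not in self._index_cache:
            # Cache miss: in the real system this loads row ID sequences
            # and deletion vectors from every fragment.
            self.rebuilds += 1
            self._index_cache[self.version] = f"index@v{self.version}"
        return self._index_cache[self.version]

    def update(self):
        self._row_id_index()     # needed to map stable IDs to addresses
        self.version += 1        # committing creates a new manifest version

ds = Dataset()
for _ in range(5):
    ds.update()
print(ds.rebuilds)  # prints 5: every update rebuilds the index
```

The cache itself works; it is the key choice (manifest version) combined with update's own version bump that makes the hit rate zero for this workload.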

Repro

import lance, pyarrow as pa, time, shutil

N = 1_000_000

for label, stable in [("stable_off", False), ("stable_on", True)]:
    uri = f"/tmp/repro_update_{label}.lance"
    shutil.rmtree(uri, ignore_errors=True)
    # Write a 1M-row table, with and without stable row IDs
    ds = lance.write_dataset(
        pa.table({"x": pa.array(range(N), type=pa.int32())}),
        uri, enable_stable_row_ids=stable,
    )
    # Time a single update touching 10% of the rows
    t0 = time.perf_counter()
    ds.update({"x": "0"}, where=f"x < {N // 10}")
    print(f"{label}: {time.perf_counter() - t0:.3f}s")
    shutil.rmtree(uri)
# stable_off: 0.017s
# stable_on:  65.784s

Fix

The scanner already knows the physical row addresses of the rows it reads. When stable row IDs are enabled, also request _rowaddr from the scanner and capture both:

  • Stable row IDs → used for new fragment metadata (unchanged)
  • Row addresses → used directly for applying deletions (new, bypasses index)

This makes dataset.update() avoid load_row_id_index() entirely when the addresses are already available from the scan.
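Why do raw addresses suffice for deletions? A Lance row address is a 64-bit value that packs the fragment ID into the upper 32 bits and the row's offset within that fragment into the lower 32 bits, so the captured addresses can be grouped per fragment without any index lookup. A minimal sketch of that grouping (helper names are ours, not Lance's API):

```python
from collections import defaultdict

def split_row_addr(addr: int) -> tuple[int, int]:
    """Decompose a 64-bit row address into (fragment_id, local_offset)."""
    return addr >> 32, addr & 0xFFFFFFFF

def group_deletions(row_addrs):
    """Group captured row addresses by fragment, ready for per-fragment
    deletion-vector updates."""
    per_fragment = defaultdict(list)
    for addr in row_addrs:
        frag, off = split_row_addr(addr)
        per_fragment[frag].append(off)
    return dict(per_fragment)

# Rows 0 and 7 of fragment 0, row 3 of fragment 2
addrs = [(0 << 32) | 0, (0 << 32) | 7, (2 << 32) | 3]
print(group_deletions(addrs))  # {0: [0, 7], 2: [3]}
```

With stable row IDs disabled this decomposition is exactly what the update path already does; the change extends the same shortcut to the stable-ID case by carrying `_rowaddr` through the scan.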

…ides addresses

When stable row IDs are enabled, `dataset.update()` loaded the full row
ID index (all fragments + deletion vectors) to map stable IDs back to
row addresses for applying deletions. This was O(total_fragments) I/O
regardless of how many rows were updated, causing ~400x slowdowns on
tables with many fragments.

The scanner already knows the physical row addresses, so request both
`_rowid` and `_rowaddr` during the update scan. The stable row IDs are
still captured for new fragment metadata, while the row addresses are
used directly for deletions — bypassing the index entirely.
@pengw0048
Contributor Author

Update after local testing: The row address capture approach is logically correct (verified by cargo test — 33 update + 14 delete tests pass), but end-to-end benchmarks show it doesn't address the actual bottleneck. With a 100K-row table: scan takes 2ms, write takes 1ms, but update() still takes 1.3s.

The real bottleneck appears to be elsewhere in the update pipeline — possibly in the commit path or row ID sequence computation. This needs profiling at the Rust level to pinpoint.

Leaving the PR as draft — the approach is sound but the performance issue has a different root cause than initially analyzed.

@pengw0048 pengw0048 closed this Apr 3, 2026
@codecov

codecov bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 76.92308% with 12 lines in your changes missing coverage. Please review.

Files with missing lines                  Patch %   Lines
rust/lance/src/dataset/utils.rs           75.55%    5 Missing and 6 partials ⚠️
rust/lance/src/dataset/write/update.rs    85.71%    1 Missing ⚠️

