
perf: skip row ID index rebuild in dataset.update() with stable row IDs#6406

Closed

pengw0048 wants to merge 1 commit into lance-format:main from pengw0048:perf/update-skip-rowid-index-rebuild


Conversation

@pengw0048
Contributor

dataset.update() is ~400x slower with enable_stable_row_ids=True because it rebuilds the full row ID index (loading row ID sequences + deletion vectors from every fragment) on each update call. The index is cached by manifest version, but each update creates a new version, causing a cache miss.
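The failure mode described above can be sketched in isolation. This is a hypothetical model (class and method names are ours, not Lance's) of a cache keyed by manifest version: because every update commit bumps the version, each call misses the cache and pays for a full index rebuild.

```python
class Dataset:
    """Toy model of a version-keyed row ID index cache (illustrative only)."""

    def __init__(self):
        self.version = 1
        self._index_cache = {}   # manifest version -> cached row ID index
        self.rebuilds = 0

    def _row_id_index(self):
        if self.version not in self._index_cache:
            # Cache miss: in the real system this loads row ID sequences
            # and deletion vectors from every fragment.
            self.rebuilds += 1
            self._index_cache[self.version] = f"index@v{self.version}"
        return self._index_cache[self.version]

    def update(self):
        self._row_id_index()     # needed to map stable IDs to addresses
        self.version += 1        # committing creates a new manifest version

ds = Dataset()
for _ in range(5):
    ds.update()
print(ds.rebuilds)  # prints 5: every update rebuilds the index
```

The cache itself works; it is the key choice (manifest version) combined with update's own version bump that makes the hit rate zero for this workload.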

Repro

import lance, pyarrow as pa, time, shutil

N = 1_000_000

for label, stable in [("stable_off", False), ("stable_on", True)]:
    uri = f"/tmp/repro_update_{label}.lance"
    shutil.rmtree(uri, ignore_errors=True)
    # Write a 1M-row table, with and without stable row IDs
    ds = lance.write_dataset(
        pa.table({"x": pa.array(range(N), type=pa.int32())}),
        uri, enable_stable_row_ids=stable,
    )
    # Time a single update touching 10% of the rows
    t0 = time.perf_counter()
    ds.update({"x": "0"}, where=f"x < {N // 10}")
    print(f"{label}: {time.perf_counter() - t0:.3f}s")
    shutil.rmtree(uri)
# stable_off: 0.017s
# stable_on:  65.784s

Fix

The scanner already knows the physical row addresses of the rows it reads. When stable row IDs are enabled, also request _rowaddr from the scanner and capture both:

  • Stable row IDs → used for new fragment metadata (unchanged)
  • Row addresses → used directly for applying deletions (new, bypasses index)

This makes dataset.update() avoid load_row_id_index() entirely when the addresses are already available from the scan.
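Why do raw addresses suffice for deletions? A Lance row address is a 64-bit value that packs the fragment ID into the upper 32 bits and the row's offset within that fragment into the lower 32 bits, so the captured addresses can be grouped per fragment without any index lookup. A minimal sketch of that grouping (helper names are ours, not Lance's API):

```python
from collections import defaultdict

def split_row_addr(addr: int) -> tuple[int, int]:
    """Decompose a 64-bit row address into (fragment_id, local_offset)."""
    return addr >> 32, addr & 0xFFFFFFFF

def group_deletions(row_addrs):
    """Group captured row addresses by fragment, ready for per-fragment
    deletion-vector updates."""
    per_fragment = defaultdict(list)
    for addr in row_addrs:
        frag, off = split_row_addr(addr)
        per_fragment[frag].append(off)
    return dict(per_fragment)

# Rows 0 and 7 of fragment 0, row 3 of fragment 2
addrs = [(0 << 32) | 0, (0 << 32) | 7, (2 << 32) | 3]
print(group_deletions(addrs))  # {0: [0, 7], 2: [3]}
```

With stable row IDs disabled this decomposition is exactly what the update path already does; the change extends the same shortcut to the stable-ID case by carrying `_rowaddr` through the scan.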

…ides addresses

When stable row IDs are enabled, `dataset.update()` loaded the full row
ID index (all fragments + deletion vectors) to map stable IDs back to
row addresses for applying deletions. This was O(total_fragments) I/O
regardless of how many rows were updated, causing ~400x slowdowns on
tables with many fragments.

The scanner already knows the physical row addresses, so request both
`_rowid` and `_rowaddr` during the update scan. The stable row IDs are
still captured for new fragment metadata, while the row addresses are
used directly for deletions — bypassing the index entirely.
@pengw0048
Contributor Author

Update after local testing: The row address capture approach is logically correct (verified by cargo test — 33 update + 14 delete tests pass), but end-to-end benchmarks show it doesn't address the actual bottleneck. With a 100K-row table: scan takes 2ms, write takes 1ms, but update() still takes 1.3s.

The real bottleneck appears to be elsewhere in the update pipeline — possibly in the commit path or row ID sequence computation. This needs profiling at the Rust level to pinpoint.

Leaving the PR as draft — the approach is sound but the performance issue has a different root cause than initially analyzed.

@pengw0048 pengw0048 closed this Apr 3, 2026
@codecov

codecov bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 76.92308% with 12 lines in your changes missing coverage. Please review.

Files with missing lines                  Patch %   Lines
rust/lance/src/dataset/utils.rs           75.55%    5 Missing and 6 partials ⚠️
rust/lance/src/dataset/write/update.rs    85.71%    1 Missing ⚠️

