perf: skip row ID index rebuild in dataset.update() with stable row IDs #6406
Closed
pengw0048 wants to merge 1 commit into lance-format:main from
Conversation
When stable row IDs are enabled, `dataset.update()` loaded the full row ID index (all fragments plus deletion vectors) to map stable IDs back to row addresses for applying deletions. This was O(total_fragments) I/O regardless of how many rows were updated, causing ~400x slowdowns on tables with many fragments. The scanner already knows the physical row addresses, so request both `_rowid` and `_rowaddr` during the update scan. The stable row IDs are still captured for new fragment metadata, while the row addresses are used directly for deletions, bypassing the index entirely.
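The two deletion paths can be contrasted with a toy model (hypothetical names, not the Lance API; the `fragment_id << 32 | offset` address encoding is an illustrative assumption):

```python
# Toy model of the two deletion paths. Each fragment stores a row ID
# sequence; a stable row ID maps to a physical row address encoded here
# as (fragment_id << 32) | row_offset.

def build_row_id_index(fragments):
    """Old path: touch every fragment to map stable IDs to addresses."""
    index = {}
    for frag_id, row_ids in fragments.items():
        for offset, row_id in enumerate(row_ids):
            index[row_id] = (frag_id << 32) | offset
    return index

def delete_via_index(fragments, matched_row_ids):
    """O(total_fragments) work even if one row matched."""
    index = build_row_id_index(fragments)
    return {index[rid] for rid in matched_row_ids}

def delete_via_addresses(matched_row_addrs):
    """New path: the scanner already produced _rowaddr for matched rows."""
    return set(matched_row_addrs)  # no index load at all

fragments = {0: [10, 11], 1: [12, 13], 2: [14, 15]}
# Scanner output for an update touching one row: (_rowid, _rowaddr) pairs.
scanned = [(13, (1 << 32) | 1)]

old = delete_via_index(fragments, [rid for rid, _ in scanned])
new = delete_via_addresses(addr for _, addr in scanned)
assert old == new == {(1 << 32) | 1}
```

Both paths produce the same set of addresses to delete; only the new path avoids touching fragments the update never read.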
Contributor
Author
Update after local testing: the row address capture approach is logically correct (verified by local testing), but the real bottleneck appears to be elsewhere in the update pipeline, possibly in the commit path or row ID sequence computation. This needs profiling at the Rust level to pinpoint. Leaving the PR as draft; the approach is sound, but the performance issue has a different root cause than initially analyzed.
`dataset.update()` is ~400x slower with `enable_stable_row_ids=True` because it rebuilds the full row ID index (loading row ID sequences and deletion vectors from every fragment) on each update call. The index is cached by manifest version, but each update creates a new version, causing a cache miss.

Repro
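Why the cache never hits can be sketched in a few lines (a minimal model with hypothetical structure, not the actual Lance cache):

```python
# Sketch: the row ID index cache is keyed by manifest version, and every
# update commits a new version, so each update sees a cold cache.

class Dataset:
    def __init__(self):
        self.version = 1
        self._index_cache = {}       # manifest version -> row ID index
        self.index_builds = 0

    def _row_id_index(self):
        if self.version not in self._index_cache:
            self.index_builds += 1   # O(total_fragments) I/O in real Lance
            self._index_cache[self.version] = object()
        return self._index_cache[self.version]

    def update(self):
        self._row_id_index()         # needed to map stable IDs -> addresses
        self.version += 1            # commit bumps the manifest version

ds = Dataset()
for _ in range(5):
    ds.update()
assert ds.index_builds == 5          # a full rebuild on every update
```

Caching by version is correct for readers, but it gives writers a guaranteed miss, which is why skipping the index lookup entirely is the better fix here.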
Fix
The scanner already knows the physical row addresses of the rows it reads. When stable row IDs are enabled, also request `_rowaddr` from the scanner and capture both columns. This lets `dataset.update()` avoid `load_row_id_index()` entirely, since the addresses are already available from the scan.
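The per-batch capture can be sketched as follows (the column names `_rowid` and `_rowaddr` are from the PR; the batch handling itself is an illustrative assumption, not the actual Rust implementation):

```python
# Sketch: route the scanner's two ID columns to their separate consumers.

def process_update_batch(batch):
    row_ids = batch["_rowid"]        # kept for the new fragment's metadata
    row_addrs = batch["_rowaddr"]    # used directly to apply deletions
    return row_ids, row_addrs

# A batch as the update scan might yield it, with addresses encoded as
# (fragment_id << 32) | row_offset.
batch = {"_rowid": [42, 43], "_rowaddr": [(0 << 32) | 5, (2 << 32) | 0]}
ids_for_metadata, addrs_for_deletion = process_update_batch(batch)
assert ids_for_metadata == [42, 43]
```

The point is that neither consumer needs the other's column translated: metadata wants stable IDs, deletion wants physical addresses, and the scan yields both.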