
feat: add planned blob reads with source-level coalescing#6352

Open
Xuanwo wants to merge 3 commits into main from xuanwo/blob-read-coalescing

Conversation

Collaborator

@Xuanwo Xuanwo commented Mar 31, 2026

This PR improves blob I/O in two complementary ways. First, BlobFile instances that resolve to the same physical object now share a lazy BlobSource and can opportunistically coalesce concurrent reads before handing them to Lance's existing scheduler. Second, datasets now expose a planned read_blobs API for materializing blob payloads directly. The PR also adds explicit cursor-preserving range reads for BlobFile across Rust, Python, and Java, with end-to-end Python coverage for the new API and the edge cases it uncovered.

This keeps the optimization aligned with Lance's existing scheduler model while giving callers a higher-level path for sequential and batched blob access.

Python example

import lance

dataset = lance.dataset("/path/to/dataset")

# Plan coalesced reads for the selected blob rows, then materialize them.
blobs = dataset.read_blobs(
    "images",
    indices=[0, 4, 8],                     # row indices of the blobs to fetch
    target_request_bytes=8 * 1024 * 1024,  # upper bound on a coalesced request
    max_gap_bytes=64 * 1024,               # merge ranges separated by at most this gap
    max_concurrency=4,                     # concurrent in-flight requests
    preserve_order=True,                   # yield payloads in the order of `indices`
)

for row_address, payload in blobs:
    print(row_address, len(payload))
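To make the planning parameters concrete, here is a minimal illustrative sketch, not the actual Lance implementation: sorted byte ranges are merged into one request when the gap between them is at most max_gap_bytes and the merged request stays within target_request_bytes. The plan_requests helper is hypothetical, introduced here only for illustration.

```python
def plan_requests(ranges, target_request_bytes, max_gap_bytes):
    """Merge sorted (start, end) byte ranges into coalesced requests.

    Hypothetical helper sketching how read_blobs' planning parameters
    could interact; not code from this PR.
    """
    requests = []
    current_start, current_end = ranges[0]
    for start, end in ranges[1:]:
        gap = start - current_end
        merged_size = end - current_start
        if gap <= max_gap_bytes and merged_size <= target_request_bytes:
            current_end = end  # small gap: fold into the current request
        else:
            requests.append((current_start, current_end))
            current_start, current_end = start, end
    requests.append((current_start, current_end))
    return requests

# Two blobs 10 bytes apart coalesce; a third one ~1 MB away does not.
print(plan_requests([(0, 100), (110, 200), (1_000_000, 1_000_100)],
                    8 * 1024 * 1024, 64 * 1024))
# → [(0, 200), (1000000, 1000100)]
```

With the defaults shown in the example above, three 100-byte blobs separated by small gaps would collapse into a single object store request.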

@github-actions github-actions bot added enhancement New feature or request python java labels Mar 31, 2026
@codecov

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 78.01418% with 155 lines in your changes missing coverage. Please review.

| Files with missing lines       | Patch % | Lines                          |
|--------------------------------|---------|--------------------------------|
| rust/lance/src/dataset/blob.rs | 78.26%  | 129 Missing and 21 partials ⚠️ |
| rust/lance/src/dataset/take.rs | 33.33%  | 2 Missing and 2 partials ⚠️    |
| rust/lance/src/dataset.rs      | 88.88%  | 0 Missing and 1 partial ⚠️     |


@Xuanwo Xuanwo requested a review from jackye1995 March 31, 2026 16:12

async fn drain_pending_reads(self: Arc<Self>, scheduler: FileScheduler) {
Contributor


Correct me if I am wrong, but I think this only drains the reads that are going on concurrently. In addition to this, I think we would like something more optimized for sequential read use cases; for example, we could expose a reader that just yields blobs from a list of row positions.

In that use case, we can further minimize the number of calls we make to the object store. Consider the example where we read blobs at positions [0, 1024), the caller is iterating through all of them, and each image is around 100KB. Then we will read around 100MB, and technically we could do as few as one request to the object store. The number of object store calls needed to read all images then becomes limited only by how much data we are willing to buffer in a single call.

Adding more complexity, suppose the user is taking some range with gaps, e.g. [0, 1020], [1023]. Then the stream reader has to decide whether it should still fetch the continuous range in a single call, or split it into separate calls. A threshold like max_gap_size could be set to determine this.

If we do that, I think this could also become a more controlled way to do concurrent reads. Considering again the case where we read rows [0, 4096): instead of reading all images concurrently and depending on the file scheduler to collapse the requests, we could split the work into 4 streams, each reading a continuous run of 1024 images, to match whatever hardware the workload runs on.

Let me know what you think about that!
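The sharded-stream idea in this comment can be sketched in a few lines. This is a hypothetical illustration, not code from this PR; shard_rows is an assumed helper name.

```python
def shard_rows(start, stop, num_streams):
    """Split the row range [start, stop) into num_streams contiguous
    shards whose sizes differ by at most one row.

    Hypothetical helper sketching the sharded sequential-read idea;
    not part of this PR.
    """
    total = stop - start
    base, extra = divmod(total, num_streams)
    shards, cursor = [], start
    for i in range(num_streams):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        shards.append((cursor, cursor + size))
        cursor += size
    return shards

# Reading rows [0, 4096) as 4 streams of 1024 continuous rows each.
print(shard_rows(0, 4096, 4))
# → [(0, 1024), (1024, 2048), (2048, 3072), (3072, 4096)]
```

Each shard would then be served by one sequential stream, so the concurrency level is fixed by the caller rather than emerging from per-blob scheduling.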

Collaborator Author


Thank you for the suggestion! I have implemented a new read_blobs API which gives users more control over blob reading.

@Xuanwo Xuanwo changed the title feat: coalesce blob reads by source feat: add planned blob reads with source-level coalescing Apr 3, 2026
@Xuanwo Xuanwo force-pushed the xuanwo/blob-read-coalescing branch from 0f8e779 to ebf8e8d Compare April 3, 2026 10:51
