feat: add planned blob reads with source-level coalescing #6352
async fn drain_pending_reads(self: Arc<Self>, scheduler: FileScheduler) {
Correct me if I am wrong, but I think this only drains the reads that are going on concurrently. In addition to this, I think we would like something more optimized for sequential read use cases; for example, we could expose a reader that yields blobs from a list of row positions.
In that use case, we can further minimize the number of calls we make to the object store. Consider reading blobs at positions [0, 1024), where the caller iterates through all blobs and each image is around 100KB. We would read around 100MB in total, and in principle a single request to the object store could suffice. The number of object store calls needed to read all the images then becomes limited purely by how much data we are willing to buffer per call.
Adding more complexity, suppose the user requests a range with gaps, e.g. [0, 1020] and [1023]. The stream reader then has to decide whether to fetch the surrounding contiguous range in a single call or split it into separate calls, and a threshold like max_gap_size could govern this.
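The gap decision above can be sketched as a small coalescing pass. This is a hypothetical helper, not code from this PR; `max_gap_size` is the threshold the comment proposes:

```python
def coalesce_positions(positions, max_gap_size):
    """Group row positions into contiguous half-open fetch ranges.

    Two positions land in the same range when the gap between them is at
    most max_gap_size; a larger gap starts a new (separate) fetch.
    """
    if not positions:
        return []
    positions = sorted(positions)
    ranges = [[positions[0], positions[0]]]
    for pos in positions[1:]:
        if pos - ranges[-1][1] <= max_gap_size:
            ranges[-1][1] = pos          # small gap: extend the current fetch
        else:
            ranges.append([pos, pos])    # gap too large: start a new fetch
    return [(start, end + 1) for start, end in ranges]
```

With the example from the comment, positions 0..1020 plus 1023 collapse into the single range `(0, 1024)` when `max_gap_size` is at least 3, and split into two fetches when it is smaller.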
If we do that, I think this could also become a more controlled way to do concurrent reads. Consider again reading rows [0, 4096): instead of issuing all image reads concurrently and relying on the file scheduler to collapse the requests, we could split the work into, say, 4 streams each reading a contiguous run of 1024 images, matched to whatever hardware the workflow runs on.
Let me know what you think about that!
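The sharding idea above (splitting [0, 4096) into 4 sequential streams of 1024) can be sketched as follows; this is an illustrative helper, not part of the PR:

```python
def shard_range(start, end, shard_size):
    """Split the half-open row range [start, end) into contiguous shards
    of at most shard_size rows, one shard per sequential stream."""
    return [
        (lo, min(lo + shard_size, end))
        for lo in range(start, end, shard_size)
    ]
```

Each shard would be handed to its own stream, so concurrency is bounded by the shard count rather than by the total number of blobs.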
Thank you for the suggestion! I have implemented a new read_blobs API which gives users more control over blob reading.
Force-pushed from 0f8e779 to ebf8e8d
This PR improves blob I/O in two complementary ways:
BlobFile instances that resolve to the same physical object now share a lazy BlobSource and can opportunistically coalesce concurrent reads before handing them to Lance's existing scheduler, and datasets now expose a planned read_blobs API for materializing blob payloads directly. It also adds explicit cursor-preserving range reads for BlobFile across Rust, Python, and Java, with end-to-end Python coverage for the new API and the edge cases it uncovered. This keeps the optimization aligned with Lance's existing scheduler model while giving callers a higher-level path for sequential and batched blob access.
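The source-level coalescing described above can be illustrated with a standalone sketch (not the PR's actual Rust implementation): before handing reads to the scheduler, overlapping or adjacent `(offset, length)` byte ranges from concurrent callers are merged into fewer physical requests.

```python
def coalesce_ranges(ranges):
    """Merge overlapping or touching (offset, length) byte ranges so a
    batch of concurrent reads turns into fewer physical requests."""
    if not ranges:
        return []
    spans = sorted((off, off + length) for off, length in ranges)
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        if start <= merged[-1][1]:       # overlaps or touches the last span
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])  # disjoint: issue a separate request
    return [(start, end - start) for start, end in merged]
```

For example, three concurrent reads at `(0, 10)`, `(10, 5)`, and `(40, 5)` collapse into two requests: `(0, 15)` and `(40, 5)`.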
Python example