feat: add planned blob reads with source-level coalescing #6352
async fn drain_pending_reads(self: Arc<Self>, scheduler: FileScheduler) {
Correct me if I am wrong, but I think this only drains the reads that are going on concurrently. In addition to this, I think we would like something more optimized for sequential read use cases; for example, we could expose a reader that yields blobs from a list of row positions.
In that use case, we can further minimize the number of calls we make to the object store. Consider reading blobs at positions [0, 1024), where the caller iterates through all blobs and each image is around 100KB. We would read around 100MB in total, and in principle a single request to the object store could suffice. The number of object store calls needed to read all the images then becomes limited purely by how much data we are willing to buffer per call.
Adding more complexity, suppose the user requests a range with gaps, e.g. [0, 1020] and [1023]. The stream reader then has to decide whether to fetch the surrounding contiguous range in a single call or split it into separate calls, and a threshold like max_gap_size could govern this.
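The gap decision above can be sketched as a small coalescing pass. This is a hypothetical helper, not code from this PR; `max_gap_size` is the threshold the comment proposes:

```python
def coalesce_positions(positions, max_gap_size):
    """Group row positions into contiguous half-open fetch ranges.

    Two positions land in the same range when the gap between them is at
    most max_gap_size; a larger gap starts a new (separate) fetch.
    """
    if not positions:
        return []
    positions = sorted(positions)
    ranges = [[positions[0], positions[0]]]
    for pos in positions[1:]:
        if pos - ranges[-1][1] <= max_gap_size:
            ranges[-1][1] = pos          # small gap: extend the current fetch
        else:
            ranges.append([pos, pos])    # gap too large: start a new fetch
    return [(start, end + 1) for start, end in ranges]
```

With the example from the comment, positions 0..1020 plus 1023 collapse into the single range `(0, 1024)` when `max_gap_size` is at least 3, and split into two fetches when it is smaller.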
If we do that, I think this could also become a more controlled way to do concurrent reads. Consider again reading rows [0, 4096): instead of issuing all image reads concurrently and relying on the file scheduler to collapse the requests, we could split the work into, say, 4 streams each reading a contiguous run of 1024 images, matched to whatever hardware the workflow runs on.
Let me know what you think about that!
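The sharding idea above (splitting [0, 4096) into 4 sequential streams of 1024) can be sketched as follows; this is an illustrative helper, not part of the PR:

```python
def shard_range(start, end, shard_size):
    """Split the half-open row range [start, end) into contiguous shards
    of at most shard_size rows, one shard per sequential stream."""
    return [
        (lo, min(lo + shard_size, end))
        for lo in range(start, end, shard_size)
    ]
```

Each shard would be handed to its own stream, so concurrency is bounded by the shard count rather than by the total number of blobs.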
Thank you for the suggestion! I have implemented a new read_blobs API which gives users more control over blob reading.
Force-pushed from 0f8e779 to ebf8e8d
This PR improves blob I/O in two complementary ways:
BlobFile instances that resolve to the same physical object now share a lazy BlobSource and can opportunistically coalesce concurrent reads before handing them to Lance's existing scheduler, and datasets now expose a planned read_blobs API for materializing blob payloads directly. It also adds explicit cursor-preserving range reads for BlobFile across Rust, Python, and Java, with end-to-end Python coverage for the new API and the edge cases it uncovered. This keeps the optimization aligned with Lance's existing scheduler model while giving callers a higher-level path for sequential and batched blob access.
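The source-level coalescing described above can be illustrated with a standalone sketch (not the PR's actual Rust implementation): before handing reads to the scheduler, overlapping or adjacent `(offset, length)` byte ranges from concurrent callers are merged into fewer physical requests.

```python
def coalesce_ranges(ranges):
    """Merge overlapping or touching (offset, length) byte ranges so a
    batch of concurrent reads turns into fewer physical requests."""
    if not ranges:
        return []
    spans = sorted((off, off + length) for off, length in ranges)
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        if start <= merged[-1][1]:       # overlaps or touches the last span
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])  # disjoint: issue a separate request
    return [(start, end - start) for start, end in merged]
```

For example, three concurrent reads at `(0, 10)`, `(10, 5)`, and `(40, 5)` collapse into two requests: `(0, 15)` and `(40, 5)`.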
Python example