feat: add planned blob reads with source-level coalescing #6352
Conversation
westonpace
left a comment
I think some parts of this PR are reinventing capabilities that are already in the file / scan scheduler. The scheduler is already capable of both merging nearby reads into larger reads and splitting extremely large reads into multiple concurrent reads.
You also have a max concurrency control. However, this is also handled already. If you are worried about using too much RAM, you can use the scan scheduler's I/O buffer size parameter, which controls how much RAM we allow in I/O buffers before we pause the scan. If you are worried about rate limits or overloading the object store, the object stores themselves already have controls for this (e.g. AIMD throttling).
I think all we need to do is wire up the scan scheduler's config to your user-facing config variables (removing max_concurrency in favor of I/O buffer size, which is easier to configure) and make one call to submit_request per file (which will allow the scheduler to do the coalescing and splitting of reads).
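To make the reviewer's point concrete, here is a small, self-contained sketch (not Lance's actual scheduler code; the threshold values and function name are made up for illustration) of the two behaviors the scan scheduler already provides: merging nearby byte-range reads into one request, and splitting oversized reads into multiple concurrent chunks.

```python
# Illustrative sketch (not Lance code): how a scheduler can coalesce nearby
# byte-range reads and split oversized ones before issuing I/O.
# The gap/size thresholds below are hypothetical example values.

def plan_reads(ranges, max_gap=4096, max_read=8 * 1024 * 1024):
    """ranges: list of (start, end) byte ranges, possibly unsorted."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous read: coalesce into one request.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Split any coalesced read that grew too large into concurrent chunks.
    planned = []
    for start, end in merged:
        while end - start > max_read:
            planned.append((start, start + max_read))
            start += max_read
        planned.append((start, end))
    return planned

# Two reads 50 bytes apart become one request; a 20 MiB read becomes three.
print(plan_reads([(0, 100), (150, 300)]))   # → [(0, 300)]
```

Because the scheduler already does this planning per file, a caller only needs to hand over the raw ranges and let it decide the physical request shape.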
westonpace
left a comment
Approving on behalf of Weston (via Claude Code).
The refactor in 153ef4b40 is a big improvement — the planner now correctly delegates merging, splitting, and backpressure to the existing FileScheduler instead of reimplementing it. The BlobSource sharing and per-source grouping are clean, and the selection_index approach for order preservation is simpler than the previous buffered/buffer_unordered split.
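The selection_index approach mentioned above can be sketched in a few lines: each request is tagged with its position in the caller's order, reads are awaited in whatever order they complete, and the tag restores the original ordering. This is an illustrative Python sketch, not the PR's Rust implementation; all names here are hypothetical.

```python
# Illustrative sketch (not Lance code): preserving caller order when reads
# complete out of order, by tagging each request with its selection index.

import asyncio
import random

async def read_one(index, request):
    # Simulate an I/O that completes after a random delay.
    await asyncio.sleep(random.random() * 0.01)
    return index, f"payload-for-{request}"

async def read_all_ordered(requests):
    tasks = [read_one(i, r) for i, r in enumerate(requests)]
    results = [None] * len(requests)
    # as_completed yields in completion order; the index restores caller order.
    for done in asyncio.as_completed(tasks):
        index, payload = await done
        results[index] = payload
    return results

# asyncio.run(read_all_ordered(["a", "b", "c"]))
# → ["payload-for-a", "payload-for-b", "payload-for-c"]
```

Compared with buffered/buffer_unordered, this lets every read proceed at full concurrency while still returning results in the order the caller asked for.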
Two items worth addressing (non-blocking):
1. V1 blob collection still uses .unwrap() on fragment lookup
rust/lance/src/dataset/blob.rs — collect_blob_entries_v1:
```rust
let frag = dataset.get_fragment(frag_id as usize).unwrap();
let data_file = frag.data_file_for_field(blob_field_id).unwrap();
```
The V2 path handles errors properly. These should use ok_or_else(|| Error::...) to avoid panicking on invalid row addresses.
2. drain_pending_reads leader can get permanently stuck
If fulfill_pending_blob_reads panics, is_draining stays true forever and no new leader will spawn, so all subsequent reads on that BlobSource will silently fail. A drop guard that resets is_draining = false would make this robust.
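The guard pattern being suggested can be sketched as follows. This is an illustrative Python analog of a Rust drop guard (try/finally playing the role of Drop); the class and method names are hypothetical, not the PR's actual code.

```python
# Illustrative sketch (not Lance code): reset the draining flag even if the
# drain loop raises, so a new leader can always be elected afterwards.

import threading

class BlobSourceSketch:
    def __init__(self):
        self.lock = threading.Lock()
        self.is_draining = False
        self.pending = []

    def drain_pending_reads(self, fulfill):
        with self.lock:
            if self.is_draining:
                return  # another leader is already draining
            self.is_draining = True
        try:
            while True:
                with self.lock:
                    if not self.pending:
                        return
                    batch, self.pending = self.pending, []
                fulfill(batch)
        finally:
            # The guard: without this, an exception in fulfill() would leave
            # is_draining stuck at True and no new leader would ever spawn.
            with self.lock:
                self.is_draining = False
```

In Rust the same effect is typically achieved with a small struct whose Drop impl clears the flag, so the reset happens even on panic unwinding.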
This PR improves blob I/O in two complementary ways: BlobFile instances that resolve to the same physical object now share a lazy BlobSource and can opportunistically coalesce concurrent reads before handing them to Lance's existing scheduler, and datasets now expose a planned read_blobs API for materializing blob payloads directly. It also adds explicit cursor-preserving range reads for BlobFile across Rust, Python, and Java, with end-to-end Python coverage for the new API and the edge cases it uncovered.
This keeps the optimization aligned with Lance's existing scheduler model while giving callers a higher-level path for sequential and batched blob access.
Python example
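The original example under this heading was not preserved in this extract. As a stand-in, here is a self-contained sketch (not the Lance API; every name here is hypothetical) of the call pattern a planned blob-read API implies: gather the requested row ids, issue one planned batch fetch, and return payloads in caller order.

```python
# Illustrative, self-contained sketch (NOT the Lance API): a planned
# "read_blobs" flow over an in-memory store. All names are hypothetical.

class InMemoryBlobStore:
    def __init__(self, payloads):
        # payloads: dict mapping a row id to its blob bytes
        self.payloads = payloads
        self.fetches = 0  # counts physical reads, to show the effect of planning

    def fetch_many(self, row_ids):
        # One planned request materializes a whole batch of blobs at once.
        self.fetches += 1
        return {rid: self.payloads[rid] for rid in row_ids}

def read_blobs(store, row_ids):
    """Plan all reads up front, fetch in one batch, return in caller order."""
    fetched = store.fetch_many(sorted(set(row_ids)))
    return [fetched[rid] for rid in row_ids]

store = InMemoryBlobStore({1: b"alpha", 2: b"beta", 3: b"gamma"})
blobs = read_blobs(store, [3, 1, 3])
# blobs == [b"gamma", b"alpha", b"gamma"], using a single physical fetch
```

The point of the planned variant is visible in the fetch counter: three logical reads collapse into one physical request, which is where the source-level coalescing described above pays off.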