feat: wire batch_size_bytes to Python and public Rust API #6428

Merged
westonpace merged 6 commits into lance-format:main from westonpace:feat/byte-sized-batches-file-reader
Apr 10, 2026

Conversation

@westonpace
Member

Summary

Stacked on #6388. Please merge that PR first.

  • Adds batch_size_bytes: Option<u64> to FileReaderOptions and propagates it through all 6 SchedulerDecoderConfig creation sites in the file reader
  • Adds batch_size_bytes field + setter to Scanner, wired through both scan_fragments (via LanceScanConfig) and pushdown_scan (via FileReaderOptions in ScanConfig)
  • Adds batch_size_bytes to LanceScanConfig, with try_new_v2 injecting it into FragReadConfig via FileReaderOptions
  • Exposes batch_size_bytes in the Python API: LanceDataset.scanner(), to_table(), to_batches(), ScannerBuilder
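A minimal usage sketch of the Python side, assuming a `pylance` build that includes this PR and an existing dataset at a caller-supplied location. The helper name `batch_byte_sizes` is hypothetical; `batch_size_bytes` is the keyword named in the summary above, and how closely each batch tracks the requested size depends on the decoder.

```python
def batch_byte_sizes(uri, budget_bytes):
    """Scan the dataset at `uri` and return the in-memory size of each batch.

    Assumes a pylance build that includes this PR; `uri` and `budget_bytes`
    are supplied by the caller.
    """
    # Imported lazily so this sketch can be loaded without pylance installed.
    import lance

    ds = lance.dataset(uri)
    # batch_size_bytes is the keyword this PR adds to to_batches().
    return [batch.nbytes for batch in ds.to_batches(batch_size_bytes=budget_bytes)]
```

Comparing the returned sizes against `budget_bytes` is one way to see what the option does on a real dataset.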

Test plan

  • cargo check -p lance-file -p lance --tests — clean
  • cargo clippy -p lance-file -p lance --tests -- -D warnings — clean
  • cargo fmt --all — applied
  • cargo test -p lance-encoding -- byte_sized — 3/3 pass
  • cargo test -p lance -- test_scan — 38/38 pass

🤖 Generated with Claude Code

@github-actions github-actions Bot added the enhancement and python labels Apr 7, 2026
@codecov

codecov Bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 56.06061% with 29 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
rust/lance/src/dataset/scanner.rs | 39.28% | 15 Missing and 2 partials ⚠️
rust/lance/src/io/exec/filtered_read.rs | 41.66% | 5 Missing and 2 partials ⚠️
rust/lance/src/io/exec/scan.rs | 76.92% | 1 Missing and 2 partials ⚠️
rust/lance-file/src/reader.rs | 84.61% | 2 Missing ⚠️

The encoding layer already supports byte-based batching via
SchedulerDecoderConfig.batch_size_bytes but all callers hardcoded it to
None. This wires the parameter through FileReaderOptions, Scanner,
LanceScanConfig, and the Python bindings so users can specify it when
scanning a dataset.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
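To illustrate what byte-based batching means in contrast to a fixed row count, here is a toy model in plain Python. It is not Lance's decoder logic: the function name `split_by_byte_budget`, the per-row size list, and the rule that an oversized row becomes its own batch are all assumptions made for illustration.

```python
def split_by_byte_budget(row_sizes, budget_bytes):
    """Group consecutive rows into batches whose total size fits the budget.

    Toy model only. A batch always holds at least one row, so a single row
    larger than the budget ends up alone in its own batch.
    """
    if budget_bytes <= 0:
        raise ValueError("budget_bytes must be positive")
    batches, current, current_bytes = [], [], 0
    for index, size in enumerate(row_sizes):
        # Close the current batch if adding this row would exceed the budget.
        if current and current_bytes + size > budget_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Row sizes in bytes; row 2 is much wider than the rest.
rows = [100, 100, 4000, 100, 100, 100]
print(split_by_byte_budget(rows, 1024))  # [[0, 1], [2], [3, 4, 5]]
```

With a row-count batch size the wide row would share a batch with its neighbours; with a byte budget the number of rows per batch varies with row width.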
@westonpace westonpace force-pushed the feat/byte-sized-batches-file-reader branch from f94ed45 to 9024fa5 Compare April 8, 2026 14:16
westonpace and others added 5 commits April 8, 2026 07:54
Instead of threading batch_size_bytes individually through LanceScanConfig
and FilteredReadOptions, pass the full FileReaderOptions bundle so future
options flow through automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… levels

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The parameter was added to the function but missing from the
#[pyo3(signature=...)] attribute, causing a positional argument mismatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ReadParams and Scanner are not in scope from lance-file, so use plain
backtick references instead of rustdoc links.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@westonpace westonpace marked this pull request as ready for review April 9, 2026 20:34
    .or_else(|| self.dataset.file_reader_options.clone());
match (base, self.batch_size_bytes) {
    (Some(mut opts), Some(bsb)) => {
        if opts.batch_size_bytes.is_none() {
Contributor

Should we always set opts.batch_size_bytes = Some(bsb), regardless of whether opts.batch_size_bytes is already set?

Member Author

If we set opts.batch_size_bytes then we will override batch_size.

I don't want to change the default (yet) so that this can stay a non-breaking change. The default is that the user doesn't set batch_size_bytes on either the scanner settings or the dataset settings and so we use batch_size instead.
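The precedence discussed in this thread can be sketched in Python as follows. Only the first branch mirrors the diff excerpt above; the handling of a missing `base` and the helper names `resolve_reader_options` and `uses_byte_batching` are assumptions, and the real `FileReaderOptions` struct carries more fields than the one modelled here.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FileReaderOptions:
    # Only the field discussed in this thread; a stand-in for the Rust struct.
    batch_size_bytes: Optional[int] = None


def resolve_reader_options(base, scanner_bsb):
    """Combine reader options with a scanner-level batch_size_bytes."""
    if base is not None and scanner_bsb is not None:
        # Visible in the excerpt: the scanner value only fills an unset field.
        if base.batch_size_bytes is None:
            return replace(base, batch_size_bytes=scanner_bsb)
        return base
    if base is None and scanner_bsb is not None:
        # Assumption: this arm is not shown in the excerpt.
        return FileReaderOptions(batch_size_bytes=scanner_bsb)
    # Scanner value unset: keep the base options (possibly None) unchanged.
    return base


def uses_byte_batching(opts):
    """True when byte-based batching would take over from batch_size."""
    return opts is not None and opts.batch_size_bytes is not None
```

When neither the scanner nor the dataset sets `batch_size_bytes`, `uses_byte_batching` returns False, which corresponds to the unchanged default of batching by `batch_size`.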

@westonpace westonpace merged commit e7369fb into lance-format:main Apr 10, 2026
33 checks passed

Labels

enhancement (New feature or request), python

2 participants