feat: add batch_size_bytes to encoding decode stream#6388
westonpace merged 6 commits into lance-format:main
Conversation
```rust
/// Compute the actual data size (in bytes) of a record batch,
/// accounting only for the portion of buffers that belongs to the
/// batch's row range. Unlike `get_array_memory_size()`, this does
/// not over-count when arrays share a larger underlying page buffer.
fn batch_data_size(batch: &RecordBatch) -> u64 {
    batch
        .columns()
        .iter()
        .map(|c| array_data_size(c.as_ref()))
        .sum()
}
```
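The `array_data_size` helper referenced above is not shown in this hunk. For variable-width arrays the idea is to count only the value bytes spanned by the batch's offset range, plus the offsets themselves, rather than the full shared page buffer. A minimal sketch of that idea, using mock inputs (plain slices, not the actual lance-encoding or Arrow types):

```rust
/// Sketch: per-batch data size of a string-like array whose values live in a
/// shared page buffer. Only the span between the first and last offset counts,
/// not the full buffer capacity. `offsets` has length `len + 1`.
fn variable_width_data_size(offsets: &[i32], len: usize) -> u64 {
    // Value bytes actually covered by this batch's row range.
    let value_bytes = (offsets[len] - offsets[0]) as u64;
    // Plus the offsets buffer for these rows.
    let offset_bytes = (std::mem::size_of::<i32>() * (len + 1)) as u64;
    value_bytes + offset_bytes
}

fn main() {
    // Batch covers 3 strings of lengths 4, 0, 6 starting at byte 100 of a page:
    // 10 value bytes + 4 offsets * 4 bytes = 26, regardless of page capacity.
    let offsets = [100, 104, 104, 110];
    assert_eq!(variable_width_data_size(&offsets, 3), 26);
}
```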
I don't like this. I'm going to make a prequel PR to address getting the size of decoded batches
Thread a new `batch_size_bytes: Option<u64>` option from `SchedulerDecoderConfig` through `create_decode_stream` into `StructuralBatchDecodeStream`. All existing call sites pass `None`, so there is no behavioral change. For legacy v2.0 files the option is ignored with a warning. Part of lance-format#6387 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When `batch_size_bytes` is `Some`, compute the number of rows to drain per batch from an estimated bytes-per-row instead of using `rows_per_batch`. The estimate is computed once from the schema using `estimate_bytes_per_row()`, which is exact for fixed-width types and uses rough defaults for variable-width types. Part of lance-format#6387 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
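The schema-based estimate described above can be sketched as follows. This uses a mock `DataType` enum rather than the real Arrow schema types, and the 64-byte variable-width default is the one the commit message mentions; the actual `estimate_bytes_per_row()` in lance-encoding may differ in detail:

```rust
/// Sketch of a schema-based bytes-per-row estimate: exact for fixed-width
/// types, a rough default for variable-width ones. Mock types, not the
/// actual lance-encoding code.
#[derive(Clone)]
enum DataType {
    Boolean,
    Int32,
    Float64,
    Utf8, // variable-width
}

fn estimated_width(dt: &DataType) -> f64 {
    match dt {
        DataType::Boolean => 1.0 / 8.0, // bitpacked
        DataType::Int32 => 4.0,
        DataType::Float64 => 8.0,
        DataType::Utf8 => 64.0, // rough default for variable-width data
    }
}

fn estimate_bytes_per_row(fields: &[DataType]) -> f64 {
    fields.iter().map(estimated_width).sum()
}

fn main() {
    // 4 Int32 columns -> 16 bytes/row, so a 1600-byte target would yield
    // 100-row batches (matching the fixed-width test in this PR).
    let schema = vec![DataType::Int32; 4];
    assert_eq!(estimate_bytes_per_row(&schema), 16.0);
}
```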
After each batch is decoded, measure the actual data bytes per row and feed it back so that the next `next_batch_task()` call uses the measured value instead of the schema-based estimate. This corrects for inaccurate initial estimates on variable-width data (strings, binary) where the schema default of 64 bytes may be far off. The measurement uses `batch_data_size()`, a new helper that computes the actual data contribution of a batch by walking column types and reading offsets for variable-width arrays. This avoids the over-counting from `get_array_memory_size()` which reports full shared page-buffer capacity rather than per-batch data. Part of lance-format#6387 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
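The guess-and-check loop described above — derive a row count from the byte target and the current estimate, then refresh the estimate from each decoded batch — can be sketched like this. The struct and method names here are illustrative, not the actual lance-encoding API, and the feedback shown is the simple "adopt the measured value" form this commit describes:

```rust
/// Sketch of byte-targeted row sizing with measurement feedback.
struct ByteSizer {
    batch_size_bytes: u64,
    bytes_per_row: f64, // starts at the schema-based estimate
}

impl ByteSizer {
    /// Rows to drain for the next batch, at least 1.
    fn rows_for_next_batch(&self) -> u64 {
        ((self.batch_size_bytes as f64 / self.bytes_per_row).floor() as u64).max(1)
    }

    /// Feed back the measured data size of the batch just decoded.
    fn observe(&mut self, batch_bytes: u64, batch_rows: u64) {
        self.bytes_per_row = batch_bytes as f64 / batch_rows as f64;
    }
}

fn main() {
    // Schema default says 64 B/row, but the strings are really 100 B/row.
    let mut sizer = ByteSizer { batch_size_bytes: 6400, bytes_per_row: 64.0 };
    let first = sizer.rows_for_next_batch();
    assert_eq!(first, 100); // first batch overshoots the byte budget
    sizer.observe(first * 100, first); // measured: 100 B/row
    assert_eq!(sizer.rows_for_next_batch(), 64); // subsequent batches corrected
}
```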
Force-pushed 4fcfa97 to 81b8873
- Feedback loop now degrades gradually when the actual bytes-per-row is smaller than the current estimate (midpoint) instead of snapping immediately. Larger values are still adopted immediately to avoid OOM.
- Remove "(legacy)" from user-facing v2.0 warning message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
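The asymmetric update rule this commit describes is small enough to sketch directly (illustrative function name, not the actual code):

```rust
/// Sketch of the asymmetric estimate update: a measured bytes-per-row larger
/// than the current estimate is adopted immediately (so batches shrink right
/// away and cannot balloon toward OOM), while a smaller measurement only
/// pulls the estimate down to the midpoint, degrading gradually.
fn update_bytes_per_row(current: f64, measured: f64) -> f64 {
    if measured > current {
        measured
    } else {
        (current + measured) / 2.0
    }
}

fn main() {
    assert_eq!(update_bytes_per_row(64.0, 128.0), 128.0); // adopt larger at once
    assert_eq!(update_bytes_per_row(64.0, 32.0), 48.0); // halve the gap downward
}
```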
Force-pushed f94ed45 to 7715379
wjones127 left a comment
All looks very reasonable 👍
```rust
    return w as f64;
}
match data_type {
    DataType::Boolean => 1.0,
```
nitpick: technically the data is bitpacked, so I would think this would be more like 1.0 / 8.0. But probably fine to leave as 1.0. Might be worth a comment.
Good point. Since we're using floats anyway, I went ahead and switched to 1.0 / 8.0. This also made me realize we forgot to account for validity bitmaps, but I don't see a simple way to do so, and since this is just an estimate I added a comment for now.
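A quick back-of-the-envelope shows why the bitpacked width matters for the estimate (the numbers below are illustrative, not from the PR's tests):

```rust
/// Sketch: 1000 boolean rows occupy ~125 bytes of bitpacked data, not 1000.
/// With a width of 1.0 the sizer would budget 8x too many bytes per boolean
/// column and emit correspondingly small batches. Validity bitmaps (another
/// bit per row per nullable column) are not modeled, per the comment above.
fn main() {
    let rows = 1000u64;
    let naive = rows; // width 1.0: one byte per value
    let bitpacked = (rows as f64 * (1.0 / 8.0)).ceil() as u64;
    assert_eq!(naive, 1000);
    assert_eq!(bitpacked, 125);
}
```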
## Summary

Stacked on #6388. Please merge that PR first.

- Adds `batch_size_bytes: Option<u64>` to `FileReaderOptions` and propagates it through all 6 `SchedulerDecoderConfig` creation sites in the file reader
- Adds a `batch_size_bytes` field + setter to `Scanner`, wired through both `scan_fragments` (via `LanceScanConfig`) and `pushdown_scan` (via `FileReaderOptions` in `ScanConfig`)
- Adds `batch_size_bytes` to `LanceScanConfig`, with `try_new_v2` injecting it into `FragReadConfig` via `FileReaderOptions`
- Exposes `batch_size_bytes` in the Python API: `LanceDataset.scanner()`, `to_table()`, `to_batches()`, `ScannerBuilder`

## Test plan

- [x] `cargo check -p lance-file -p lance --tests` — clean
- [x] `cargo clippy -p lance-file -p lance --tests -- -D warnings` — clean
- [x] `cargo fmt --all` — applied
- [x] `cargo test -p lance-encoding -- byte_sized` — 3/3 pass
- [x] `cargo test -p lance -- test_scan` — 38/38 pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary
This adds some initial framework for #6387. It adds the `batch_size_bytes` option, which will be wired up in a future PR. It provides a basic implementation that uses a guess-and-check strategy to figure out how many rows to use. This is inexact, and some batches it emits will be too large. Future PRs will add accurate sizing from the decode layers to hopefully avoid the need for guessing entirely. Still, this is enough to get the feature wired up, and we can improve it later.

- Adds `batch_size_bytes: Option<u64>` to `SchedulerDecoderConfig` and threads it through the structural v2.1 decode path
- When set, it takes precedence over the `batch_size` row count
- Only `StructuralBatchDecodeStream` is modified; the v2.0 `BatchDecodeStream` is unchanged (logs a warning if the option is set)

Test plan
- `test_estimate_bytes_per_row` — unit test for the schema-based byte estimator
- `test_byte_sized_batches_fixed_width` — 1000 rows × 4 Int32 columns, `batch_size_bytes=1600` → 10 batches of exactly 100 rows, roundtrip verified
- `test_byte_sized_batches_none_unchanged` — `batch_size_bytes=None` still uses `rows_per_batch` (no behavioral change)
- `test_byte_sized_batches_feedback_convergence` — 100-byte strings with a 64-byte schema estimate; verifies second/third batches converge to ~50 rows after feedback
- `cargo clippy -p lance-encoding --tests -p lance-file -- -D warnings` clean
- `cargo fmt --all -- --check` clean

🤖 Generated with Claude Code