
feat: add batch_size_bytes to encoding decode stream#6388

Merged

westonpace merged 6 commits into lance-format:main from westonpace:feat/byte-sized-batches-api on Apr 8, 2026

Conversation

@westonpace westonpace commented Apr 2, 2026

Summary

This adds the initial framework for #6387. It introduces the batch_size_bytes option, which will be wired up to the file reader in a future PR. The initial implementation uses a guess-and-check strategy to figure out how many rows to use per batch. This is inexact, and some of the batches it emits will be too large. Future PRs will feed accurate sizing up from the decode layers, hopefully avoiding the need to guess entirely. Still, this is enough to get the feature wired up so we can improve it later.

  • Add batch_size_bytes: Option<u64> to SchedulerDecoderConfig and thread it through the structural v2.1 decode path
  • When set, compute rows-per-batch from byte estimates instead of the fixed batch_size row count
  • After each batch decodes, measure actual bytes-per-row and feed it back so subsequent batches converge toward the target byte size
  • Feedback degrades gradually (midpoint) when actual size is smaller than the estimate; adopts immediately when larger to avoid OOM
  • Only the v2.1+ StructuralBatchDecodeStream is modified; v2.0 BatchDecodeStream is unchanged (logs a warning if the option is set)
  • Wiring this through to the file reader and Python/Java bindings will be done in a follow-up PR
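The sizing and feedback behavior described in the bullets above can be sketched roughly as follows. This is an illustrative standalone sketch, not the PR's actual code; the struct and method names are hypothetical.

```rust
/// Hypothetical sketch of the guess-and-check sizing described above.
struct ByteSizeEstimator {
    /// Target batch size in bytes (`batch_size_bytes`).
    target_bytes: u64,
    /// Current bytes-per-row estimate, seeded from a schema-based guess.
    bytes_per_row: f64,
}

impl ByteSizeEstimator {
    /// Rows to drain for the next batch, derived from the byte estimate.
    fn rows_for_next_batch(&self) -> u64 {
        ((self.target_bytes as f64 / self.bytes_per_row).floor() as u64).max(1)
    }

    /// Feed back the measured size of a decoded batch.
    fn record_batch(&mut self, rows: u64, actual_bytes: u64) {
        let actual_bpr = actual_bytes as f64 / rows as f64;
        if actual_bpr > self.bytes_per_row {
            // Larger-than-estimated rows are adopted immediately to avoid OOM.
            self.bytes_per_row = actual_bpr;
        } else {
            // Smaller measurements degrade the estimate gradually (midpoint).
            self.bytes_per_row = (self.bytes_per_row + actual_bpr) / 2.0;
        }
    }
}
```

With a 16 bytes/row schema estimate and batch_size_bytes=1600 (as in the fixed-width test below), the first call yields 1600 / 16 = 100 rows.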

Test plan

  • test_estimate_bytes_per_row — unit test for the schema-based byte estimator
  • test_byte_sized_batches_fixed_width — 1000 rows × 4 Int32 columns, batch_size_bytes=1600 → 10 batches of exactly 100 rows, roundtrip verified
  • test_byte_sized_batches_none_unchanged — batch_size_bytes=None still uses rows_per_batch (no behavioral change)
  • test_byte_sized_batches_feedback_convergence — 100-byte strings with 64-byte schema estimate; verifies second/third batches converge to ~50 rows after feedback
  • cargo clippy -p lance-encoding --tests -p lance-file -- -D warnings — clean
  • cargo fmt --all -- --check — clean

🤖 Generated with Claude Code

@github-actions github-actions Bot added the enhancement New feature or request label Apr 2, 2026
@westonpace westonpace marked this pull request as draft April 2, 2026 14:03
Comment thread rust/lance-encoding/src/decoder.rs Outdated
Comment on lines +1686 to +1696
/// Compute the actual data size (in bytes) of a record batch,
/// accounting only for the portion of buffers that belongs to the
/// batch's row range. Unlike `get_array_memory_size()`, this does
/// not over-count when arrays share a larger underlying page buffer.
fn batch_data_size(batch: &RecordBatch) -> u64 {
batch
.columns()
.iter()
.map(|c| array_data_size(c.as_ref()))
.sum()
}
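The per-array helper this sums (`array_data_size`, not shown in the quoted hunk) walks column types and reads offsets for variable-width arrays. A simplified standalone illustration of the idea, using plain slices instead of Arrow arrays (names and signatures are illustrative only):

```rust
/// For a fixed-width column, the data contribution of a batch is just
/// rows * byte width.
fn fixed_width_size(rows: usize, byte_width: usize) -> u64 {
    (rows * byte_width) as u64
}

/// For a variable-width column (strings, binary), only the span of the
/// value buffer between the first and last offset belongs to this batch,
/// even when the underlying shared page buffer is much larger — which is
/// exactly what `get_array_memory_size()` would over-count.
fn var_width_size(offsets: &[i32]) -> u64 {
    let first = *offsets.first().unwrap_or(&0);
    let last = *offsets.last().unwrap_or(&0);
    (last - first) as u64
}
```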
Member Author
I don't like this. I'm going to make a prequel PR to address getting the size of decoded batches

@codecov

codecov Bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 91.07981% with 19 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
rust/lance-encoding/src/decoder.rs | 91.90% | 10 Missing and 7 partials ⚠️
rust/lance-file/src/reader.rs | 0.00% | 2 Missing ⚠️


westonpace and others added 4 commits April 6, 2026 12:27
Thread a new `batch_size_bytes: Option<u64>` option from
`SchedulerDecoderConfig` through `create_decode_stream` into
`StructuralBatchDecodeStream`. All existing call sites pass `None`,
so there is no behavioral change. For legacy v2.0 files the option
is ignored with a warning.

Part of lance-format#6387

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When `batch_size_bytes` is `Some`, compute the number of rows to
drain per batch from an estimated bytes-per-row instead of using
`rows_per_batch`. The estimate is computed once from the schema
using `estimate_bytes_per_row()`, which is exact for fixed-width
types and uses rough defaults for variable-width types.

Part of lance-format#6387

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After each batch is decoded, measure the actual data bytes per row
and feed it back so that the next `next_batch_task()` call uses the
measured value instead of the schema-based estimate. This corrects
for inaccurate initial estimates on variable-width data (strings,
binary) where the schema default of 64 bytes may be far off.

The measurement uses `batch_data_size()`, a new helper that computes
the actual data contribution of a batch by walking column types and
reading offsets for variable-width arrays. This avoids the
over-counting from `get_array_memory_size()` which reports full
shared page-buffer capacity rather than per-batch data.

Part of lance-format#6387

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@westonpace westonpace force-pushed the feat/byte-sized-batches-api branch from 4fcfa97 to 81b8873 Compare April 6, 2026 20:30
- Feedback loop now degrades gradually when actual bpr is smaller than
  the current estimate (midpoint) instead of snapping immediately.
  Larger values are still adopted immediately to avoid OOM.
- Remove "(legacy)" from user-facing v2.0 warning message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@westonpace westonpace changed the title feat: add batch_size_bytes option to file reader feat: add batch_size_bytes to encoding decode stream Apr 7, 2026
@westonpace westonpace marked this pull request as ready for review April 7, 2026 15:39
@github-actions github-actions Bot added the python label Apr 7, 2026
@westonpace westonpace force-pushed the feat/byte-sized-batches-api branch from f94ed45 to 7715379 Compare April 7, 2026 16:45
Contributor

@wjones127 wjones127 left a comment

All looks very reasonable 👍

Comment thread rust/lance-encoding/src/decoder.rs Outdated
return w as f64;
}
match data_type {
DataType::Boolean => 1.0,
Contributor
nitpick: technically the data is bitpacked, so I would think this would be more like 1.0 / 8.0. But probably fine to leave as 1.0. Might be worth a comment.

Member Author
Good point. Since we're using floats I went ahead and switched to 1.0 / 8.0. This also made me realize we forgot to account for validity bitmaps, but I don't see a simple way to do that, and since this is just an estimate I left a comment for now.
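The outcome of this exchange can be illustrated with a small standalone sketch. The enum and default values are hypothetical stand-ins, not the crate's actual `DataType` or `estimate_bytes_per_row`:

```rust
/// Illustrative stand-in for the schema type being estimated.
enum SimpleType {
    Boolean,
    Int32,
    Utf8,
}

/// Simplified schema-based bytes-per-row estimate.
///
/// NOTE: validity bitmaps (1 bit per row when present) are not accounted
/// for; since this is only a seed estimate, that is acceptable.
fn estimate_bytes_per_row(ty: &SimpleType) -> f64 {
    match ty {
        // Boolean values are bitpacked: one bit, not one byte, per row.
        SimpleType::Boolean => 1.0 / 8.0,
        SimpleType::Int32 => 4.0,
        // Rough default for variable-width data; corrected at decode
        // time by the feedback loop.
        SimpleType::Utf8 => 64.0,
    }
}
```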

@westonpace westonpace merged commit 40d09bf into lance-format:main Apr 8, 2026
27 checks passed
westonpace added a commit that referenced this pull request Apr 10, 2026
## Summary

Stacked on #6388. Please merge that PR first.

- Adds `batch_size_bytes: Option<u64>` to `FileReaderOptions` and propagates it through all 6 `SchedulerDecoderConfig` creation sites in the file reader
- Adds `batch_size_bytes` field + setter to `Scanner`, wired through both `scan_fragments` (via `LanceScanConfig`) and `pushdown_scan` (via `FileReaderOptions` in `ScanConfig`)
- Adds `batch_size_bytes` to `LanceScanConfig`, with `try_new_v2` injecting it into `FragReadConfig` via `FileReaderOptions`
- Exposes `batch_size_bytes` in the Python API: `LanceDataset.scanner()`, `to_table()`, `to_batches()`, `ScannerBuilder`

## Test plan

- [x] `cargo check -p lance-file -p lance --tests` — clean
- [x] `cargo clippy -p lance-file -p lance --tests -- -D warnings` — clean
- [x] `cargo fmt --all` — applied
- [x] `cargo test -p lance-encoding -- byte_sized` — 3/3 pass
- [x] `cargo test -p lance -- test_scan` — 38/38 pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
