feat: add batch_size_bytes to encoding decode stream#6388
westonpace merged 6 commits into lance-format:main
Conversation
```rust
/// Compute the actual data size (in bytes) of a record batch,
/// accounting only for the portion of buffers that belongs to the
/// batch's row range. Unlike `get_array_memory_size()`, this does
/// not over-count when arrays share a larger underlying page buffer.
fn batch_data_size(batch: &RecordBatch) -> u64 {
    batch
        .columns()
        .iter()
        .map(|c| array_data_size(c.as_ref()))
        .sum()
}
```
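The `array_data_size` helper referenced above is not shown in this hunk. For variable-width arrays the idea is to count only the value bytes spanned by the batch's offset range, plus the offsets themselves, rather than the full shared page buffer. A minimal sketch of that idea, using mock inputs (plain slices, not the actual lance-encoding or Arrow types):

```rust
/// Sketch: per-batch data size of a string-like array whose values live in a
/// shared page buffer. Only the span between the first and last offset counts,
/// not the full buffer capacity. `offsets` has length `len + 1`.
fn variable_width_data_size(offsets: &[i32], len: usize) -> u64 {
    // Value bytes actually covered by this batch's row range.
    let value_bytes = (offsets[len] - offsets[0]) as u64;
    // Plus the offsets buffer for these rows.
    let offset_bytes = (std::mem::size_of::<i32>() * (len + 1)) as u64;
    value_bytes + offset_bytes
}

fn main() {
    // Batch covers 3 strings of lengths 4, 0, 6 starting at byte 100 of a page:
    // 10 value bytes + 4 offsets * 4 bytes = 26, regardless of page capacity.
    let offsets = [100, 104, 104, 110];
    assert_eq!(variable_width_data_size(&offsets, 3), 26);
}
```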
I don't like this. I'm going to make a prequel PR to address getting the size of decoded batches
Thread a new `batch_size_bytes: Option<u64>` option from `SchedulerDecoderConfig` through `create_decode_stream` into `StructuralBatchDecodeStream`. All existing call sites pass `None`, so there is no behavioral change. For legacy v2.0 files the option is ignored with a warning. Part of lance-format#6387 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When `batch_size_bytes` is `Some`, compute the number of rows to drain per batch from an estimated bytes-per-row instead of using `rows_per_batch`. The estimate is computed once from the schema using `estimate_bytes_per_row()`, which is exact for fixed-width types and uses rough defaults for variable-width types. Part of lance-format#6387 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
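The schema-based estimate described above can be sketched as follows. This uses a mock `DataType` enum rather than the real Arrow schema types, and the 64-byte variable-width default is the one the commit message mentions; the actual `estimate_bytes_per_row()` in lance-encoding may differ in detail:

```rust
/// Sketch of a schema-based bytes-per-row estimate: exact for fixed-width
/// types, a rough default for variable-width ones. Mock types, not the
/// actual lance-encoding code.
#[derive(Clone)]
enum DataType {
    Boolean,
    Int32,
    Float64,
    Utf8, // variable-width
}

fn estimated_width(dt: &DataType) -> f64 {
    match dt {
        DataType::Boolean => 1.0 / 8.0, // bitpacked
        DataType::Int32 => 4.0,
        DataType::Float64 => 8.0,
        DataType::Utf8 => 64.0, // rough default for variable-width data
    }
}

fn estimate_bytes_per_row(fields: &[DataType]) -> f64 {
    fields.iter().map(estimated_width).sum()
}

fn main() {
    // 4 Int32 columns -> 16 bytes/row, so a 1600-byte target would yield
    // 100-row batches (matching the fixed-width test in this PR).
    let schema = vec![DataType::Int32; 4];
    assert_eq!(estimate_bytes_per_row(&schema), 16.0);
}
```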
After each batch is decoded, measure the actual data bytes per row and feed it back so that the next `next_batch_task()` call uses the measured value instead of the schema-based estimate. This corrects for inaccurate initial estimates on variable-width data (strings, binary) where the schema default of 64 bytes may be far off. The measurement uses `batch_data_size()`, a new helper that computes the actual data contribution of a batch by walking column types and reading offsets for variable-width arrays. This avoids the over-counting from `get_array_memory_size()` which reports full shared page-buffer capacity rather than per-batch data. Part of lance-format#6387 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
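The guess-and-check loop described above — derive a row count from the byte target and the current estimate, then refresh the estimate from each decoded batch — can be sketched like this. The struct and method names here are illustrative, not the actual lance-encoding API, and the feedback shown is the simple "adopt the measured value" form this commit describes:

```rust
/// Sketch of byte-targeted row sizing with measurement feedback.
struct ByteSizer {
    batch_size_bytes: u64,
    bytes_per_row: f64, // starts at the schema-based estimate
}

impl ByteSizer {
    /// Rows to drain for the next batch, at least 1.
    fn rows_for_next_batch(&self) -> u64 {
        ((self.batch_size_bytes as f64 / self.bytes_per_row).floor() as u64).max(1)
    }

    /// Feed back the measured data size of the batch just decoded.
    fn observe(&mut self, batch_bytes: u64, batch_rows: u64) {
        self.bytes_per_row = batch_bytes as f64 / batch_rows as f64;
    }
}

fn main() {
    // Schema default says 64 B/row, but the strings are really 100 B/row.
    let mut sizer = ByteSizer { batch_size_bytes: 6400, bytes_per_row: 64.0 };
    let first = sizer.rows_for_next_batch();
    assert_eq!(first, 100); // first batch overshoots the byte budget
    sizer.observe(first * 100, first); // measured: 100 B/row
    assert_eq!(sizer.rows_for_next_batch(), 64); // subsequent batches corrected
}
```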
Force-pushed 4fcfa97 to 81b8873
- Feedback loop now degrades gradually when the actual bytes-per-row is smaller than the current estimate (midpoint) instead of snapping immediately. Larger values are still adopted immediately to avoid OOM.
- Remove "(legacy)" from user-facing v2.0 warning message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
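The asymmetric update rule this commit describes is small enough to sketch directly (illustrative function name, not the actual code):

```rust
/// Sketch of the asymmetric estimate update: a measured bytes-per-row larger
/// than the current estimate is adopted immediately (so batches shrink right
/// away and cannot balloon toward OOM), while a smaller measurement only
/// pulls the estimate down to the midpoint, degrading gradually.
fn update_bytes_per_row(current: f64, measured: f64) -> f64 {
    if measured > current {
        measured
    } else {
        (current + measured) / 2.0
    }
}

fn main() {
    assert_eq!(update_bytes_per_row(64.0, 128.0), 128.0); // adopt larger at once
    assert_eq!(update_bytes_per_row(64.0, 32.0), 48.0); // halve the gap downward
}
```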
Force-pushed f94ed45 to 7715379
wjones127 left a comment
All looks very reasonable 👍
```rust
    return w as f64;
}
match data_type {
    DataType::Boolean => 1.0,
```
nitpick: technically the data is bitpacked, so I would think this would be more like 1.0 / 8.0. But probably fine to leave as 1.0. Might be worth a comment.
Good point. Since we're using floats anyway, I went ahead and switched to 1.0 / 8.0. This also made me realize we forgot to account for validity bitmaps, but I don't see a simple way to do so, and since this is just an estimate I added a comment for now.
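A quick back-of-the-envelope shows why the bitpacked width matters for the estimate (the numbers below are illustrative, not from the PR's tests):

```rust
/// Sketch: 1000 boolean rows occupy ~125 bytes of bitpacked data, not 1000.
/// With a width of 1.0 the sizer would budget 8x too many bytes per boolean
/// column and emit correspondingly small batches. Validity bitmaps (another
/// bit per row per nullable column) are not modeled, per the comment above.
fn main() {
    let rows = 1000u64;
    let naive = rows; // width 1.0: one byte per value
    let bitpacked = (rows as f64 * (1.0 / 8.0)).ceil() as u64;
    assert_eq!(naive, 1000);
    assert_eq!(bitpacked, 125);
}
```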
## Summary

Stacked on #6388. Please merge that PR first.

- Adds `batch_size_bytes: Option<u64>` to `FileReaderOptions` and propagates it through all 6 `SchedulerDecoderConfig` creation sites in the file reader
- Adds a `batch_size_bytes` field + setter to `Scanner`, wired through both `scan_fragments` (via `LanceScanConfig`) and `pushdown_scan` (via `FileReaderOptions` in `ScanConfig`)
- Adds `batch_size_bytes` to `LanceScanConfig`, with `try_new_v2` injecting it into `FragReadConfig` via `FileReaderOptions`
- Exposes `batch_size_bytes` in the Python API: `LanceDataset.scanner()`, `to_table()`, `to_batches()`, `ScannerBuilder`

## Test plan

- [x] `cargo check -p lance-file -p lance --tests` — clean
- [x] `cargo clippy -p lance-file -p lance --tests -- -D warnings` — clean
- [x] `cargo fmt --all` — applied
- [x] `cargo test -p lance-encoding -- byte_sized` — 3/3 pass
- [x] `cargo test -p lance -- test_scan` — 38/38 pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary
This adds some initial framework for #6387. It adds the `batch_size_bytes` option, which will be wired up in a future PR. It provides a basic implementation that uses a guess-and-check strategy to figure out how many rows to use. This is inexact, and some batches it emits will be too large. Future PRs will add accurate sizing from the decode layers to hopefully avoid the need for guessing entirely. Still, this is enough to get the feature wired up, and we can improve it later.

- Adds `batch_size_bytes: Option<u64>` to `SchedulerDecoderConfig` and threads it through the structural v2.1 decode path
- When set, it takes precedence over the `batch_size` row count
- Only `StructuralBatchDecodeStream` is modified; the v2.0 `BatchDecodeStream` is unchanged (logs a warning if the option is set)

Test plan
- `test_estimate_bytes_per_row` — unit test for the schema-based byte estimator
- `test_byte_sized_batches_fixed_width` — 1000 rows × 4 Int32 columns, `batch_size_bytes=1600` → 10 batches of exactly 100 rows, roundtrip verified
- `test_byte_sized_batches_none_unchanged` — `batch_size_bytes=None` still uses `rows_per_batch` (no behavioral change)
- `test_byte_sized_batches_feedback_convergence` — 100-byte strings with a 64-byte schema estimate; verifies second/third batches converge to ~50 rows after feedback
- `cargo clippy -p lance-encoding --tests -p lance-file -- -D warnings` clean
- `cargo fmt --all -- --check` clean

🤖 Generated with Claude Code