Commit 40d09bf

westonpace and claude authored
feat: add batch_size_bytes to encoding decode stream (#6388)
## Summary

This adds some initial framework for #6387. It adds the `batch_size_bytes` option which will be wired up in a future PR. It provides a basic implementation that uses a guess-and-check strategy to try and figure out how many rows to use. This is inexact and some batches it emits will be too large. Future PRs will add accurate sizing from the decode layers to hopefully avoid the need for guessing entirely. Still, this is enough to get the feature wired up and then we can improve it later.

- Add `batch_size_bytes: Option<u64>` to `SchedulerDecoderConfig` and thread it through the structural v2.1 decode path
- When set, compute rows-per-batch from byte estimates instead of the fixed `batch_size` row count
- After each batch decodes, measure actual bytes-per-row and feed it back so subsequent batches converge toward the target byte size
- Feedback degrades gradually (midpoint) when actual size is smaller than the estimate; adopts immediately when larger to avoid OOM
- Only the v2.1+ `StructuralBatchDecodeStream` is modified; v2.0 `BatchDecodeStream` is unchanged (logs a warning if the option is set)
- Wiring this through to the file reader and Python/Java bindings will be done in a follow-up PR

## Test plan

- [x] `test_estimate_bytes_per_row`: unit test for the schema-based byte estimator
- [x] `test_byte_sized_batches_fixed_width`: 1000 rows × 4 Int32 columns, `batch_size_bytes=1600` → 10 batches of exactly 100 rows, roundtrip verified
- [x] `test_byte_sized_batches_none_unchanged`: `batch_size_bytes=None` still uses `rows_per_batch` (no behavioral change)
- [x] `test_byte_sized_batches_feedback_convergence`: 100-byte strings with 64-byte schema estimate; verifies second/third batches converge to ~50 rows after feedback
- [x] `cargo clippy -p lance-encoding --tests -p lance-file -- -D warnings` clean
- [x] `cargo fmt --all -- --check` clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
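The guess-and-check sizing described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `ByteBatchSizer`, `next_rows`, and `observe` are hypothetical names, and the integer rounding and the "at least one row" floor are assumptions made here for the sketch.

```rust
/// Hypothetical helper illustrating byte-targeted batch sizing.
/// Not the actual lance-encoding type; names and rounding are assumptions.
#[derive(Debug, Clone)]
pub struct ByteBatchSizer {
    /// Target size of each emitted batch, in bytes.
    target_bytes: u64,
    /// Current estimate of bytes per row (never zero).
    est_bytes_per_row: u64,
}

impl ByteBatchSizer {
    /// Start from a schema-based estimate of bytes per row.
    pub fn new(target_bytes: u64, initial_bytes_per_row: u64) -> Self {
        Self {
            target_bytes,
            est_bytes_per_row: initial_bytes_per_row.max(1),
        }
    }

    /// Rows to request for the next batch. Always at least one row so
    /// that decoding makes progress even if a single row exceeds the target.
    pub fn next_rows(&self) -> u64 {
        (self.target_bytes / self.est_bytes_per_row).max(1)
    }

    /// Feed back the measured size of a decoded batch.
    pub fn observe(&mut self, rows: u64, actual_bytes: u64) {
        if rows == 0 {
            return;
        }
        let actual = (actual_bytes / rows).max(1);
        if actual >= self.est_bytes_per_row {
            // Rows are larger than estimated: adopt immediately so the
            // next batch shrinks right away (the OOM-avoidance case).
            self.est_bytes_per_row = actual;
        } else {
            // Rows are smaller than estimated: move only to the midpoint,
            // so the batch row count grows gradually.
            self.est_bytes_per_row = actual + (self.est_bytes_per_row - actual) / 2;
        }
    }

    /// Current bytes-per-row estimate.
    pub fn bytes_per_row(&self) -> u64 {
        self.est_bytes_per_row
    }
}
```

With the fixed-width case from the test plan (4 Int32 columns, so 16 bytes per row, and a 1600-byte target), `ByteBatchSizer::new(1600, 16).next_rows()` yields 100 rows. For variable-width data the first batch can overshoot: with an illustrative 5200-byte target and a 64-byte initial estimate (the target value is chosen here, not taken from the commit), the first request is 81 rows, and only after observing roughly 104 bytes per row does the request drop to 50.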
1 parent 5310f36 commit 40d09bf

4 files changed

Lines changed: 339 additions & 2 deletions

rust/lance-encoding/benches/decoder.rs

Lines changed: 1 addition & 0 deletions
@@ -541,6 +541,7 @@ fn bench_decode_compressed_parallel(c: &mut Criterion) {
 false,
 false,
 rx,
+None,
 )
 .unwrap();
