Support byte-sized batch limits in file reader #6387

@westonpace

Description

Summary

Allow the batch size to be specified in bytes instead of rows when reading Lance files. This gives callers control over memory usage per batch, which is especially important for variable-width data, where row count is a poor proxy for memory consumption.

Approach

Add an estimate_decoded_bytes method to the page and field decoder traits. Before each drain, the batch stream queries this to compute a row count that fits the byte budget. A post-decode feedback loop measures actual batch sizes and corrects the estimate for subsequent batches. No file format changes required — all estimates use data already available at decode time.
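The estimate-then-drain flow above can be sketched roughly as follows. All names, the fallback constant, and the halving strategy are illustrative assumptions for this issue, not the actual Lance implementation:

```rust
// Hypothetical sketch of the proposed trait method and row-selection loop.
pub trait StructuralPageDecoder {
    /// Estimate the decoded size in bytes of the next `num_rows` rows.
    /// The default is a conservative fallback so the system works before
    /// every encoding provides an exact estimate.
    fn estimate_decoded_bytes(&self, num_rows: u64) -> u64 {
        const FALLBACK_BYTES_PER_ROW: u64 = 64; // assumed default, not from Lance
        num_rows * FALLBACK_BYTES_PER_ROW
    }
}

/// Example exact estimator for a fixed-width encoding (e.g. Flat).
pub struct FixedWidth {
    pub width: u64,
}

impl StructuralPageDecoder for FixedWidth {
    fn estimate_decoded_bytes(&self, num_rows: u64) -> u64 {
        num_rows * self.width
    }
}

/// Shrink the row count until its estimated decoded size fits the budget
/// (the estimate -> check -> adjust -> converge loop described above).
pub fn rows_for_budget(dec: &dyn StructuralPageDecoder, budget_bytes: u64, max_rows: u64) -> u64 {
    let mut rows = max_rows;
    while rows > 1 && dec.estimate_decoded_bytes(rows) > budget_bytes {
        rows /= 2;
    }
    rows
}
```

A real implementation would coordinate this across all columns of a struct/schema rather than a single decoder, but the shape of the loop is the same.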

See investigation notes and worktree at feat-byte-sized-batches-file-reader for full design details.

Tasks

  • Add batch_size_bytes config option — wire batch_size_bytes: Option<u64> through SchedulerDecoderConfig into BatchDecodeStream. When set, replaces fixed rows_per_batch. When unset, existing row-based behavior unchanged.
  • Modify BatchDecodeStream::next_batch_task() for byte-based row selection — query estimate_decoded_bytes across children, compute row count that fits the budget, use cross-column coordination loop (estimate → check → adjust → converge).
  • Add post-decode feedback loop — after into_batch() measures actual batch size (decoder.rs:2551), feed measured bytes-per-row back to refine row count for subsequent batches.
  • Add estimate_decoded_bytes to StructuralPageDecoder trait — default implementation returns conservative fallback so the system works before all encodings are covered.
  • Add estimate_decoded_bytes to StructuralFieldDecoder trait — field-level decoder spans pages, knows data type, delegates to page decoders. Struct decoder aggregates across children.
  • Implement estimates for exact fixed-width encodings — Flat, Constant, InlineBitpacking, OutOfLineBitpacking, RLE, ByteStreamSplit, PackedStruct. All compute num_rows * known_width.
  • Implement estimate for General compression (LZ4/Zstd) — read length prefix from compressed frame header (8-byte u64 for Zstd, 4-byte u32 for LZ4) to get exact decompressed size without decompressing.
  • Implement estimate for Dictionary encoding — inspect loaded dictionary DataBlock. Fixed-width: use value width. Variable-width: scan offsets for max value size, bound = num_rows * max_value_size.
  • Implement estimate for Variable encoding (no wrapper) — read offsets from loaded chunk data for exact size of N rows.
  • Implement estimate for FSST — apply 8x algorithmic bound on compressed data size. If wrapped in General, read General length prefix first then apply 8x.
  • Implement estimate for list types with miniblock rep index — binary-search the rep index to map rows to chunks, estimate at chunk granularity, and overestimate boundary chunks.
  • Testing — estimation accuracy per encoding, end-to-end byte-sized batch tests, edge cases (empty/single-row pages, extreme variance), backward compatibility.
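For the General compression task, reading the decompressed size from a length prefix might look like the sketch below. That the prefix is little-endian (a u64 for Zstd frames, a u32 for LZ4 frames) is an assumption about Lance's buffer framing made for illustration:

```rust
// Read the decompressed-size prefix from a compressed frame without
// decompressing. Returns None if the frame is too short to hold the prefix.
fn general_decompressed_len(frame: &[u8], is_zstd: bool) -> Option<u64> {
    if is_zstd {
        // Assumed layout: 8-byte little-endian u64 prefix for Zstd.
        Some(u64::from_le_bytes(frame.get(..8)?.try_into().ok()?))
    } else {
        // Assumed layout: 4-byte little-endian u32 prefix for LZ4.
        Some(u32::from_le_bytes(frame.get(..4)?.try_into().ok()?) as u64)
    }
}
```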
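The variable-width dictionary bound could be computed by scanning the loaded dictionary's offsets for the largest value, as in this sketch. An Arrow-style i32 offsets buffer (length = values + 1) is assumed for illustration; the actual DataBlock layout may differ:

```rust
/// Upper bound on decoded size for dictionary-encoded variable-width data:
/// num_rows * max_value_size, where max_value_size is the widest entry in
/// the dictionary's offsets buffer.
fn dict_variable_width_bound(offsets: &[i32], num_rows: u64) -> u64 {
    let max_value = offsets
        .windows(2)
        .map(|w| (w[1] - w[0]) as u64)
        .max()
        .unwrap_or(0);
    num_rows * max_value
}
```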
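The post-decode feedback loop could refine a running bytes-per-row estimate from measured batches, along these lines. The exponential smoothing factor is an assumption, not taken from the Lance code:

```rust
// Illustrative feedback estimator: blend measured bytes-per-row into the
// running estimate so later batches converge on the real batch size.
struct ByteRateEstimator {
    bytes_per_row: f64,
}

impl ByteRateEstimator {
    /// Fold one measured batch into the estimate (exponential moving
    /// average, weighting recent batches equally with history).
    fn observe(&mut self, batch_bytes: u64, batch_rows: u64) {
        if batch_rows == 0 {
            return;
        }
        let measured = batch_bytes as f64 / batch_rows as f64;
        self.bytes_per_row = 0.5 * self.bytes_per_row + 0.5 * measured;
    }

    /// Row count expected to fit the byte budget, always at least 1.
    fn rows_for(&self, budget_bytes: u64) -> u64 {
        ((budget_bytes as f64 / self.bytes_per_row).floor() as u64).max(1)
    }
}
```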
