feat(parquet): make PushBuffers boundary-agnostic for prefetch IO
The `PushDecoder` (introduced in #7997, #8080) is designed to decouple
IO and CPU. It holds non-contiguous byte ranges, with a
`NeedsData`/`push_range` protocol. However, it required each logical
read to be satisfied in full by a single physical buffer: `has_range`,
`get_bytes`, and `Read::read` all searched for one buffer that entirely
covered the requested range.
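The IO/CPU decoupling amounts to a pull loop: the decoder never performs IO itself, it reports which byte ranges it is blocked on and the caller pushes them in. A minimal sketch with hypothetical, simplified types (the real `PushDecoder` API in the parquet crate is richer than this):

```rust
use std::collections::HashMap;
use std::ops::Range;

// Hypothetical stand-ins for the decoder's state machine; illustrative only.
enum Poll {
    NeedsData(Vec<Range<u64>>), // CPU side is blocked on these byte ranges
    Ready(usize),               // stand-in for a decoded batch (range count)
}

struct MiniDecoder {
    wanted: Vec<Range<u64>>,
    buffers: HashMap<u64, Vec<u8>>, // keyed by start offset
}

impl MiniDecoder {
    fn poll(&self) -> Poll {
        let missing: Vec<_> = self
            .wanted
            .iter()
            .filter(|r| !self.buffers.contains_key(&r.start))
            .cloned()
            .collect();
        if missing.is_empty() {
            Poll::Ready(self.wanted.len())
        } else {
            Poll::NeedsData(missing)
        }
    }

    // The IO layer hands bytes back; no IO happens inside the decoder.
    fn push_range(&mut self, range: Range<u64>, data: Vec<u8>) {
        assert_eq!(data.len() as u64, range.end - range.start);
        self.buffers.insert(range.start, data);
    }
}
```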
This assumption conflates two orthogonal IO strategies:
- Coalescing: the IO layer merges adjacent requested ranges into fewer,
larger fetches.
- Prefetching: the IO layer pushes data ahead of what the decoder has
requested. This is an inversion of control: the IO layer speculatively
fills buffers at offsets not yet requested and for arbitrary buffer
sizes.
These two strategies interact poorly with the current release mechanism
(`clear_ranges`), which matches buffers by exact range equality:
- Coalescing is both rewarded and punished. It is load-bearing because
without it, the number of physical buffers scales with the number of
ranges requested, and `clear_ranges` performs an O(N×M) scan to remove
consumed ranges, producing quadratic overhead on wide schemas.
But it is also punished because a coalesced buffer never exactly
matches any individual requested range, so `clear_ranges` silently
skips it: the buffer leaks in `PushBuffers` until the decoder
finishes or the caller manually calls `release_all_ranges` (#9624).
This increases peak RSS proportionally to the amount of data coalesced
ahead of the current row group.
- Prefetching is structurally impossible: speculatively pushed
buffers will straddle future read boundaries, so the decoder
cannot consume them, and `clear_ranges` cannot release them.
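The exact-match failure reduces to a few lines. A hypothetical reduction (not the real `clear_ranges` signature): a buffer coalesced from two requested ranges compares unequal to both, so equality-based release never drops it.

```rust
use std::ops::Range;

// Hypothetical illustration of exact-match release: a buffer is dropped
// only when its range is *equal* to some consumed range, never when it
// merely *covers* one.
fn clear_ranges(buffers: &mut Vec<(Range<u64>, Vec<u8>)>, consumed: &[Range<u64>]) {
    buffers.retain(|(r, _)| !consumed.contains(r)); // exact equality only
}
```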
This commit makes `PushBuffers` boundary-agnostic, completing the
prefetching story, and changes the internals to scale with buffer count
instead of range count:
- Buffer stitching: `has_range`, `get_bytes`, and `Read::read` resolve
logical ranges across multiple contiguous physical buffers via binary
search, so the IO layer is free to push arbitrarily-sized parts
without knowing future read boundaries. This is a meaningful
improvement: some IO layers can be made much more efficient when using
uniform buffer sizes and vectorized reads.
- Incremental release (`release_through`): replaces `clear_ranges` with
a watermark-based release that drops all buffers below a byte offset,
trimming straddling buffers via zero-copy `Bytes::slice`.
The decoder calls this automatically at row-group boundaries.
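Both mechanisms fit in a short standalone sketch. The names mirror the commit, but this is a simplified stand-in, not the real `PushBuffers` implementation (which stores `Bytes` and trims via zero-copy `Bytes::slice` rather than copying):

```rust
use std::ops::Range;

// Simplified stand-in for PushBuffers: a sorted list of (start, bytes).
struct Buffers {
    parts: Vec<(u64, Vec<u8>)>, // sorted by start offset
}

impl Buffers {
    fn push(&mut self, start: u64, data: Vec<u8>) {
        let idx = self.parts.partition_point(|(s, _)| *s < start);
        self.parts.insert(idx, (start, data));
    }

    // Buffer stitching: resolve a logical range across several physical
    // buffers. Binary search locates the first candidate, then we walk
    // forward as long as the buffers are contiguous.
    fn get_bytes(&self, range: Range<u64>) -> Option<Vec<u8>> {
        let idx = self.parts.partition_point(|(s, _)| *s <= range.start);
        let mut idx = idx.checked_sub(1)?;
        let mut out = Vec::with_capacity((range.end - range.start) as usize);
        let mut pos = range.start;
        while pos < range.end {
            let (start, data) = self.parts.get(idx)?;
            let end = start + data.len() as u64;
            if pos < *start || pos >= end {
                return None; // gap: the range is not fully buffered yet
            }
            let from = (pos - start) as usize;
            let to = (range.end.min(end) - start) as usize;
            out.extend_from_slice(&data[from..to]);
            pos = *start + to as u64;
            idx += 1;
        }
        Some(out)
    }

    // Incremental release: drop every buffer entirely below the watermark
    // and trim a straddling one (the real code slices zero-copy).
    fn release_through(&mut self, watermark: u64) {
        self.parts.retain_mut(|(start, data)| {
            let end = *start + data.len() as u64;
            if end <= watermark {
                return false; // entirely consumed
            }
            if *start < watermark {
                *data = data.split_off((watermark - *start) as usize);
                *start = watermark;
            }
            true
        });
    }
}
```

Note how the IO layer can push fixed-size buffers at arbitrary offsets: a logical read straddling two of them is stitched, and release is a single watermark instead of a per-range scan.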
Benchmark results (vs baseline):
push_decoder/1buf/1000ranges 321.9 µs (was 323.5 µs, −1%)
push_decoder/1buf/10000ranges 3.26 ms (was 3.25 ms, +0%)
push_decoder/1buf/100000ranges 34.9 ms (was 34.6 ms, +1%)
push_decoder/1buf/500000ranges 192.2 ms (was 185.3 ms, +4%)
push_decoder/Nbuf/1000ranges 363.9 µs (was 437.2 µs, −17%)
push_decoder/Nbuf/10000ranges 3.82 ms (was 10.7 ms, −64%)
push_decoder/Nbuf/100000ranges 42.1 ms (was 711.6 ms, −94%)
Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>