Skip to content

PushBuffers::clear_ranges is quadratic and leaks #9695

@HippoBaro

Description

@HippoBaro

Describe the bug

PushBuffers::clear_ranges performs an O(N*M) scan to release consumed buffers, where N is the number of buffered ranges and M is the number of ranges to clear. On wide schemas (10k+ columns), this produces quadratic overhead in PushDecoder row group construction.

Additionally, clear_ranges matches buffers by exact range equality. When the IO layer coalesces adjacent requested ranges into fewer, larger fetches, the coalesced buffer never exactly matches any individual requested range, so clear_ranges silently skips it. The buffer leaks in PushBuffers until the decoder finishes or the caller manually calls release_all_ranges, increasing peak RSS proportionally to the amount of data coalesced ahead of the current row group.

This puts coalescing in a bind: without it, buffer count scales with range count and the quadratic clear_ranges dominates. With it, memory is not reclaimed incrementally.

To Reproduce

Use PushDecoder on a Parquet file with a wide schema (10k+ columns). Push data without coalesced buffers. Observe row group construction time scaling quadratically with column count. Conversely if using coalescing, observe memory not being released by clear_ranges, growing RSS until the decoder finishes.

Expected behavior

Buffer release should scale with buffer count (not range count), and coalesced or arbitrarily-sized buffers should be released incrementally as the decoder progresses.

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions