Skip to content

Commit 9d647dc

Browse files
adriangbclaude
andcommitted
perf(parquet): force-inline ByteBudgetChunker hot path, split cold path out
The previous `#[inline]` hint was no longer enough once `pick_sub_batch_size` grew the `ValueCountStrategy` match — LLVM silently stopped inlining and the most recent GKE bench bounced `string_dictionary/*` back to +46–81% (`default` +81%, `parquet_2` +86%, `bloom_filter` +46%). Fix: 1. Mark `pick_sub_batch_size` `#[inline(always)]`. The hot path is just `if static_always_fits || has_dictionary || chunk_size == 0 { return chunk_size; }` — one struct-field load + one virtual call — so unconditional inlining is the right call, not a heuristic suggestion. 2. Pull the byte-budget computation out into a separate `byte_budget_sub_batch_size` method marked `#[inline(never)]`. This keeps the inlined fast path small even as the slow path grows; the slow path is paid for explicitly when bypasses don't fire, not smuggled into every chunk's inline body. Same behavior, just compiler-friendlier code layout. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent bb19d3e commit 9d647dc

1 file changed

Lines changed: 34 additions & 5 deletions

File tree

parquet/src/column/writer/byte_budget_chunker.rs

Lines changed: 34 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -134,12 +134,14 @@ impl ByteBudgetChunker {
134134
/// dictionary fallback bound dict-encoded pages independently.
135135
/// - When `chunk_size == 0`, there's nothing to size.
136136
///
137-
/// Inlined because the hot path through the bypass conditions reduces
138-
/// to one struct-field load + one virtual call into the encoder and
139-
/// keeping it across the module boundary stops the compiler from
140-
/// folding it into the per-chunk loop's register schedule.
137+
/// Hot path: when one of the bypass conditions fires this returns
138+
/// `chunk_size` with one struct-field load and one virtual call into
139+
/// the encoder. Marked `#[inline(always)]` because LLVM's heuristic
140+
/// would otherwise refuse to inline now that the slow path lives
141+
/// nearby — the GKE bench showed a +80% regression on
142+
/// `string_dictionary/*` when the hint was just `#[inline]`.
141143
#[allow(clippy::too_many_arguments)]
142-
#[inline]
144+
#[inline(always)]
143145
pub(crate) fn pick_sub_batch_size<E: ColumnValueEncoder>(
144146
&self,
145147
encoder: &E,
@@ -154,6 +156,33 @@ impl ByteBudgetChunker {
154156
if self.static_always_fits || encoder.has_dictionary() || chunk_size == 0 {
155157
return chunk_size;
156158
}
159+
self.byte_budget_sub_batch_size::<E>(
160+
values,
161+
value_indices,
162+
chunk_def,
163+
strategy,
164+
values_offset,
165+
chunk_size,
166+
end_offset,
167+
)
168+
}
169+
170+
/// Cold path: the encoder is plain-encoding and the bypass conditions
171+
/// didn't fire, so we have to look at value sizes to decide whether
172+
/// the chunk fits. Pulled out of `pick_sub_batch_size` and marked
173+
/// `#[inline(never)]` so the inlined fast path stays small.
174+
#[allow(clippy::too_many_arguments)]
175+
#[inline(never)]
176+
fn byte_budget_sub_batch_size<E: ColumnValueEncoder>(
177+
&self,
178+
values: &E::Values,
179+
value_indices: Option<&[usize]>,
180+
chunk_def: LevelDataRef<'_>,
181+
strategy: ValueCountStrategy<'_>,
182+
values_offset: usize,
183+
chunk_size: usize,
184+
end_offset: usize,
185+
) -> usize {
157186
// Count how many values fall in this chunk's level range. The
158187
// strategy was picked once per `write_batch_internal` call so
159188
// the common all-non-null case (every level has a value) skips

0 commit comments

Comments
 (0)