perf(parquet): mark cold paths #[cold] so they move out of hot icache

adriangb · claude · adriangb · commit 0b13cb99a7ce · 2026-05-14T22:10:36.000-07:00
The GKE bench shows `string_dictionary/*` regressing by a consistent
~80% on every branch commit, even though the chunker's fast path
returns `chunk_size` with a single struct-field load whenever
`has_dictionary()` is true (which holds for the entire
`string_dictionary` bench, since `create_random_batch` produces a
low-cardinality dict that doesn't spill the writer's encoder).

Working hypothesis: the regression is icache pressure from the new
code's mere presence. The cold path (`byte_budget_sub_batch_size`,
`write_granular_chunk`) is never executed for `string_dictionary` but
sits inline near the encoder's hot path and pushes hot bytes out of
L1i.

Mark both cold paths `#[cold]` so LLVM places them in a separate text
section. The hot encoder loop should stay tighter in icache.
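
For illustration, the shape of the change is roughly as follows. This
is a minimal sketch, not the parquet code: `SketchChunker`, `pick`, and
`pick_by_bytes` are hypothetical stand-ins for the chunker, its inlined
fast path, and the byte-budget slow path, and the size logic is assumed.

```rust
/// Hypothetical stand-in for the chunker; not the real parquet type.
struct SketchChunker {
    chunk_size: usize,
    byte_budget: usize,
}

impl SketchChunker {
    /// Hot path: small enough to be inlined into the caller's loop.
    #[inline]
    fn pick(&self, has_dictionary: bool, value_sizes: &[usize]) -> usize {
        if has_dictionary {
            // Fast path: a single field load, no per-value work.
            return self.chunk_size;
        }
        self.pick_by_bytes(value_sizes)
    }

    /// Slow path: kept out of line and hinted as rarely called, so the
    /// compiler may place it away from the hot code.
    #[inline(never)]
    #[cold]
    fn pick_by_bytes(&self, value_sizes: &[usize]) -> usize {
        let mut used = 0usize;
        let mut count = 0usize;
        for &size in value_sizes.iter().take(self.chunk_size) {
            // Always accept the first value, then stop at the budget.
            if count > 0 && used.saturating_add(size) > self.byte_budget {
                break;
            }
            used = used.saturating_add(size);
            count += 1;
        }
        count
    }
}

fn main() {
    let chunker = SketchChunker { chunk_size: 4, byte_budget: 10 };
    // Dictionary case: the slow path is never entered.
    println!("{}", chunker.pick(true, &[]));
    // Plain case: 6 + 3 fits in the budget of 10, adding 5 does not.
    println!("{}", chunker.pick(false, &[6, 3, 5, 1]));
}
```

`#[cold]` is only a hint. Whether the body actually lands in a separate
section depends on the target and codegen settings, which is why the
bench result is the real test.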

This is a hypothesis-driven attempt: if the GKE numbers don't move,
the regression source is somewhere else and we keep digging.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/parquet/src/column/writer/byte_budget_chunker.rs b/parquet/src/column/writer/byte_budget_chunker.rs
@@ -170,9 +170,12 @@ impl ByteBudgetChunker {
     /// Cold path: the encoder is plain-encoding and the bypass conditions
     /// didn't fire, so we have to look at value sizes to decide whether
     /// the chunk fits. Pulled out of `pick_sub_batch_size` and marked
-    /// `#[inline(never)]` so the inlined fast path stays small.
+    /// `#[inline(never)]` + `#[cold]` so the inlined fast path stays
+    /// small and the cold-placement hint pushes this body physically
+    /// away from the hot encoder loop's icache footprint.
     #[allow(clippy::too_many_arguments)]
     #[inline(never)]
+    #[cold]
     fn byte_budget_sub_batch_size<E: ColumnValueEncoder>(
         &self,
         values: &E::Values,
diff --git a/parquet/src/column/writer/mod.rs b/parquet/src/column/writer/mod.rs
@@ -774,7 +774,14 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
     /// record never spans data pages, matching the parquet format rule.
     ///
     /// Returns the total number of values consumed across all sub-batches.
+    ///
+    /// Marked `#[cold]` because the byte-budget path that calls this
+    /// fires only on columns whose values are individually larger than
+    /// `data_page_size_limit / write_batch_size` (e.g. multi-MiB
+    /// BYTE_ARRAY blobs). Keeping it out of the hot section lets the
+    /// hot `write_mini_batch` path keep its icache locality.
     #[allow(clippy::too_many_arguments)]
+    #[cold]
     fn write_granular_chunk(
         &mut self,
         values: &E::Values,