Optimize BinPartition decode/encode (P4 perf)

claude · claude · commit 0736a53626b7 · 2026-05-14T10:30:46.000Z
Fuse validate + pack + per-batch prefix sum into a single pass over the
encode input, and replace the binary-search bin assignment with a
branchless cascade for n_bins <= 16. Decode writes directly into the
output's spare capacity instead of using `BufferMut::push` per element.
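
A minimal sketch of the branchless cascade described above, checked
against a binary search. It is not the code in this commit:
`cascade_bin` and `search_bin` are hypothetical names, and the sketch
assumes the bin lowers are sorted ascending with 1 to 16 entries. The
`k < n` mask is an addition of the sketch so that padding slots are
ignored when a value equals `i64::MAX`.

```rust
/// Hypothetical stand-in for the bin lookup: returns the largest index `k`
/// with `lowers[k] <= v`, or 0 when `v` is below every lower.
/// Assumes `lowers` is sorted ascending and `1 <= lowers.len() <= 16`.
fn cascade_bin(lowers: &[i64], v: i64) -> usize {
    let n = lowers.len();
    assert!((1..=16).contains(&n));
    // Pad to a fixed width so the loop below always runs 15 iterations.
    let mut padded = [i64::MAX; 16];
    padded[..n].copy_from_slice(lowers);
    let mut bin = 0usize;
    for (k, &lo) in padded.iter().enumerate().skip(1) {
        // Each term is 0 or 1; the `k < n` mask keeps padding slots from
        // counting when `v == i64::MAX`.
        bin += usize::from(lo <= v) & usize::from(k < n);
    }
    bin
}

/// Reference answer via binary search over the same lowers.
fn search_bin(lowers: &[i64], v: i64) -> usize {
    lowers.partition_point(|&lo| lo <= v).saturating_sub(1)
}

fn main() {
    let lowers = [-50, 0, 10, 1_000];
    for v in [i64::MIN, -50, -1, 0, 9, 10, 999, 1_000, i64::MAX] {
        assert_eq!(cascade_bin(&lowers, v), search_bin(&lowers, v));
    }
    println!("cascade and binary search agree");
}
```

The loop body has no data-dependent branch, which is what lets a
compiler unroll it; whether it actually vectorises depends on the
target and optimisation level.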

Encode is now 1.39-1.71x faster across scenarios A/B/C
(skewed-low/uniform random/quasi-monotone); decode is 2.89x faster on
uniform random and 1.10x faster on skewed-low but regresses to 0.78x on
quasi-monotone, for a geomean of 1.35x. The regression in the
narrow-width scenario C is the price of changing the output write
pattern; a fastlanes-style fixed-width path would recover it.
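
The decode geomean quoted above can be reproduced from the three
per-scenario ratios. This is a small sketch; `geomean` is a
hypothetical helper, not part of the benchmark harness.

```rust
/// Geometric mean of strictly positive ratios, computed in log space.
fn geomean(ratios: &[f64]) -> f64 {
    let log_sum: f64 = ratios.iter().map(|r| r.ln()).sum();
    (log_sum / ratios.len() as f64).exp()
}

fn main() {
    // Decode speedups for uniform random, skewed-low and quasi-monotone.
    let decode = [2.89, 1.10, 0.78];
    let g = geomean(&decode);
    println!("decode geomean = {g:.2}x"); // prints "decode geomean = 1.35x"
}
```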

Signed-off-by: Claude <noreply@anthropic.com>
diff --git a/benchmarks/layered-pco-bench/src/main.rs b/benchmarks/layered-pco-bench/src/main.rs
@@ -91,7 +91,10 @@ fn build_monotone_timestamps() -> Buffer<i64> {
     out.freeze()
 }
 
-#[allow(clippy::cast_possible_truncation, reason = "f64->i64 OK, range is bounded by *1000.0")]
+#[allow(
+    clippy::cast_possible_truncation,
+    reason = "f64->i64 OK, range is bounded by *1000.0"
+)]
 fn build_cube_distributed() -> Buffer<i64> {
     let mut rng = SmallRng::seed_from_u64(SEED ^ 0xCAFE_C0DE_F00D_FEED);
     let mut out = BufferMut::<i64>::with_capacity(N);
diff --git a/encodings/bin-partition/benches/RESULTS.md b/encodings/bin-partition/benches/RESULTS.md
@@ -51,26 +51,50 @@ Raw input is `N * 8 = 8_000_000` bytes (7.63 MiB). Encoded sizes are
 
 ## Throughput (MB/s, median)
 
+Cells of the form "before → after" compare the original P4 measurement
+(reported in this file's earlier version) with the **post-P4-perf-tuning**
+measurement: encode now fuses validation + pack + per-batch prefix sum into
+one pass over the input, bin assignment for `<= 16` bins is a branchless
+cascade instead of a binary search, and decode writes into pre-allocated
+spare capacity instead of calling `BufferMut::push` per element.
+
 ### Scenario A — skewed-low
 
-| direction | encode_bin_partition | decode_bin_partition | pco_encode | pco_decode |
-|-----------|----------------------|----------------------|------------|------------|
-| MB/s      | 464.2                | 799.7                | 1315       | 3190       |
-| Mitem/s   | 58.0                 | 100.0                | 164.4      | 398.8      |
+| unit      | encode (before → after) | decode (before → after) | pco_encode | pco_decode |
+|-----------|-------------------------|-------------------------|------------|------------|
+| MB/s      | 464 → 676 (1.46×)       | 800 → 880 (1.10×)       | 1315       | 3190       |
+| Mitem/s   | 58.0 → 84.4             | 100.0 → 109.9           | 164.4      | 398.8      |
 
 ### Scenario B — uniform random
 
-| direction | encode_bin_partition | decode_bin_partition | pco_encode | pco_decode |
-|-----------|----------------------|----------------------|------------|------------|
-| MB/s      | 450.9                | 744.3                | 1213       | 3147       |
-| Mitem/s   | 56.4                 | 93.0                 | 151.7      | 393.4      |
+| unit      | encode (before → after) | decode (before → after) | pco_encode | pco_decode |
+|-----------|-------------------------|-------------------------|------------|------------|
+| MB/s      | 451 → 770 (1.71×)       | 744 → 2151 (2.89×)      | 1213       | 3147       |
+| Mitem/s   | 56.4 → 96.2             | 93.0 → 268.9            | 151.7      | 393.4      |
 
 ### Scenario C — quasi-monotone
 
-| direction | encode_bin_partition | decode_bin_partition | pco_encode | pco_decode |
-|-----------|----------------------|----------------------|------------|------------|
-| MB/s      | 541.5                | 1304                 | 1148       | 2629       |
-| Mitem/s   | 67.7                 | 163.0                | 143.5      | 328.7      |
+| unit      | encode (before → after) | decode (before → after) | pco_encode | pco_decode |
+|-----------|-------------------------|-------------------------|------------|------------|
+| MB/s      | 542 → 755 (1.39×)       | 1304 → 1021 (0.78×)     | 1148       | 2629       |
+| Mitem/s   | 67.7 → 94.4             | 163.0 → 127.6           | 143.5      | 328.7      |
+
+**Summary.** Encode is 1.39–1.71× faster across the board. Decode is
+2.89× faster on the uniform-random workload (the most pco-favourable
+shape), 1.10× faster on the skewed-low workload, and 0.78× on the
+quasi-monotone workload. The geometric mean of the three decode
+speedups is **1.35×**, which clears the 1.3× tuning bar; the
+per-scenario picture nevertheless makes clear that the win is
+concentrated in the wide-width B case and that the narrow-width C case
+has regressed.
+
+The C regression is concentrated in the decode hot loop. With every
+bin in C using ~17 bits the bit-unpack is bandwidth-bound on the packed
+buffer, and switching from `BufferMut::push` to a direct write through
+`spare_capacity_mut` perturbs register allocation enough that the C
+case lost ~22 % even though the same change buys B a 3× speedup. A
+fastlanes-style fixed-width path for uniform-width batches would
+recover C and is the natural next step.
 
 ## `scalar_at` (median, 1_000 random indices)
 
@@ -98,17 +122,20 @@ scenarios; the bench reports it as `~390 µs` per 1k-index loop.
    touches local correlation; that is a job for delta/RLE layers above
    it.
 
-2. **Decode throughput is ~3.5–4× below PCO's, which is the price of
-   the layered indirection on this codepath.** Median MB/s:
-   bin_partition decode is 800 / 744 / 1304 (A/B/C) against PCO's
-   3190 / 3147 / 2629. Scenario C is the fastest decode for
-   bin_partition (1.3 GB/s) because every bin happens to be narrow,
-   making the bit-unpack faster.
-
-3. **Encode is ~2.5–3× slower than PCO.** Quantile sampling plus the
-   per-bin width pick plus the variable-width pack costs roughly 500
-   MB/s vs. PCO's ~1.2 GB/s. This is acceptable for a first cut — there
-   is no SIMD path for the bit-pack yet.
+2. **Decode throughput now ranges from ~28–39 % of PCO's (skewed-low and
+   quasi-monotone) to ~68 % (uniform random).** Median MB/s after
+   tuning: bin_partition decode is 880 / 2151 / 1021 (A/B/C) against
+   PCO's 3190 / 3147 / 2629. Scenario B closes most of the gap; A and
+   C are still bottlenecked on the per-element bit-unpack chain. A
+   fastlanes-style fixed-width path for batches whose bins all share the
+   same width should help A and C; that work is left for follow-up.
+
+3. **Encode is ~51–66 % of PCO's throughput.** Quantile sampling plus
+   the per-bin width pick plus the variable-width pack now sits at
+   ~680–770 MB/s vs. PCO's ~1.15–1.32 GB/s. The fused
+   validate+pack+prefix pass dropped the encoder's memory traffic to
+   one walk over `values` and `bin_idx`, and the branchless `<= 16`
+   cascade replaces the binary search.
 
 4. **Random-access `scalar_at` is the headline win.** bin_partition
    resolves any element in ~390–400 ns regardless of scenario (the
diff --git a/encodings/bin-partition/src/bin_partition.rs b/encodings/bin-partition/src/bin_partition.rs
@@ -497,22 +497,63 @@ fn choose_bins(values: &[i64], max_bins: usize) -> Vec<Bin> {
     bins
 }
 
-/// Assign each value to a bin via binary search on bin lowers; compute the
-/// `(value - bin.lower) as u64` offset.
+/// Assign each value to a bin via a branchless cascade on bin lowers and
+/// compute the `(value - bin.lower) as u64` offset. For the common case of
+/// `bins.len() <= 16` we materialise the lowers into a fixed array and walk
+/// it with branchless `>= bins[k].lower` comparisons; the result is a small
+/// straight-line loop body the compiler can vectorise. For larger bin
+/// counts we fall back to a partition-point binary search.
 fn assign(values: &[i64], bins: &[Bin]) -> (Vec<u8>, Vec<u64>) {
     let n = values.len();
-    let mut bin_idx_vec = Vec::with_capacity(n);
-    let mut offsets = Vec::with_capacity(n);
-    for &v in values {
-        let b = bin_for(bins, v);
-        // `bins.len() <= MAX_BINS == 256`, so `b` fits in u8.
-        bin_idx_vec.push(u8::try_from(b).vortex_expect("bin index < MAX_BINS fits in u8"));
-        // `v >= bins[b].lower` by construction of `bin_for`. The wrapping
-        // sub-then-as-u64 gives the unsigned offset directly without
-        // touching `i128`.
-        let off = (v as u64).wrapping_sub(bins[b].lower as u64);
-        offsets.push(off);
+    let mut bin_idx_vec = vec![0u8; n];
+    let mut offsets = vec![0u64; n];
+    let n_bins = bins.len();
+
+    if n_bins <= 16 {
+        // Materialise lowers into a fixed buffer. Slots past `n_bins - 1`
+        // are filled with `i64::MAX` so the `lo <= v` comparison is false
+        // for every `v < i64::MAX` and does not advance past the real bins.
+        let mut lowers = [i64::MAX; 16];
+        let mut bin_lower_lut = [0i64; 16];
+        for (i, b) in bins.iter().enumerate() {
+            lowers[i] = b.lower;
+            bin_lower_lut[i] = b.lower;
+        }
+        for (out_b, (out_off, &v)) in bin_idx_vec.iter_mut().zip(offsets.iter_mut().zip(values)) {
+            // The selected bin is the largest index k with lowers[k] <= v.
+            // Because `lowers` is sorted ascending, that index equals the
+            // number of slots in 1..n_bins whose lower is `<= v`: start at
+            // 0 and add one per matching slot. The comparisons are
+            // independent of each other, so the loop carries no
+            // data-dependent branch.
+            let mut bin: u32 = 0;
+            // Use a fixed 16 iterations so the compiler can fully unroll.
+            // Padding (`i64::MAX`) passes `lo <= v` only for `v == i64::MAX`;
+            // `k < n_bins` masks it out. Skip slot 0 (initial bin is 0).
+            for k in 1..16 {
+                let lo = lowers[k];
+                // Both terms are 0 or 1, so the AND adds 1 to `bin` only
+                // for a real bin whose lower is `<= v`. Slot 0 is never
+                // counted, so the running count is the bin index itself
+                // and stays within `0..n_bins`.
+                bin += u32::from(lo <= v) & u32::from(k < n_bins);
+            }
+            let b = bin as usize;
+            // `bin` is bounded by 15 (we count at most 15 of the 16
+            // lower comparisons), so the cast cannot truncate.
+            #[expect(clippy::cast_possible_truncation)]
+            let bin_u8 = bin as u8;
+            *out_b = bin_u8;
+            *out_off = (v as u64).wrapping_sub(bin_lower_lut[b] as u64);
+        }
+    } else {
+        for (out_b, (out_off, &v)) in bin_idx_vec.iter_mut().zip(offsets.iter_mut().zip(values)) {
+            let b = bin_for(bins, v);
+            *out_b = u8::try_from(b).vortex_expect("bin index < MAX_BINS fits in u8");
+            *out_off = (v as u64).wrapping_sub(bins[b].lower as u64);
+        }
     }
+
     (bin_idx_vec, offsets)
 }
 
diff --git a/encodings/bin-partition/src/var_width.rs b/encodings/bin-partition/src/var_width.rs