@@ -51,26 +51,50 @@ Raw input is `N * 8 = 8_000_000` bytes (7.63 MiB). Encoded sizes are
5151
5252## Throughput (MB/s, median)
5353
54+ The "after" column is the ** post-P4-perf-tuning** measurement (encode now
55+ fuses validation + pack + per-batch prefix into one pass over the input,
56+ bin assignment for ` <= 16 ` bins is a branchless cascade instead of a binary
57+ search, and decode writes into pre-allocated spare capacity instead of
58+ calling ` BufferMut::push ` per element). The "before" column is the original
59+ P4 measurement reproduced in this file's earlier version.
60+
5461### Scenario A — skewed-low
5562
56- | direction | encode_bin_partition | decode_bin_partition | pco_encode | pco_decode |
57- | -----------| ----------------------| ----------------------| ------------| ------------|
58- | MB/s | 464.2 | 799.7 | 1315 | 3190 |
59- | Mitem/s | 58.0 | 100.0 | 164.4 | 398.8 |
63+ | direction | encode (before → after) | decode (before → after) | pco_encode | pco_decode |
64+ | -----------| ------------------------- | --- ----------------------| ------------| ------------|
65+ | MB/s | 464 → 676 (1.46×) | 800 → 880 (1.10×) | 1315 | 3190 |
66+ | Mitem/s | 58.0 → 84.4 | 100.0 → 109.9 | 164.4 | 398.8 |
6067
6168### Scenario B — uniform random
6269
63- | direction | encode_bin_partition | decode_bin_partition | pco_encode | pco_decode |
64- | -----------| ----------------------| ----------------------| ------------| ------------|
65- | MB/s | 450.9 | 744.3 | 1213 | 3147 |
66- | Mitem/s | 56.4 | 93.0 | 151.7 | 393.4 |
70+ | direction | encode (before → after) | decode (before → after) | pco_encode | pco_decode |
71+ | -----------| ------------------------- | --- ----------------------| ------------| ------------|
72+ | MB/s | 451 → 770 (1.71×) | 744 → 2151 (2.89×) | 1213 | 3147 |
73+ | Mitem/s | 56.4 → 96.2 | 93.0 → 268.9 | 151.7 | 393.4 |
6774
6875### Scenario C — quasi-monotone
6976
70- | direction | encode_bin_partition | decode_bin_partition | pco_encode | pco_decode |
71- | -----------| ----------------------| ----------------------| ------------| ------------|
72- | MB/s | 541.5 | 1304 | 1148 | 2629 |
73- | Mitem/s | 67.7 | 163.0 | 143.5 | 328.7 |
77+ | direction | encode (before → after) | decode (before → after) | pco_encode | pco_decode |
78+ | -----------| -------------------------| -------------------------| ------------| ------------|
79+ | MB/s | 542 → 755 (1.39×) | 1304 → 1021 (0.78×) | 1148 | 2629 |
80+ | Mitem/s | 67.7 → 94.4 | 163.0 → 127.6 | 143.5 | 328.7 |
81+
82+ ** Summary.** Encode is 1.39–1.71× faster across the board. Decode is
83+ 2.89× faster on the uniform-random workload (the most pco-favourable
84+ shape), 1.10× faster on the skewed-low workload, and 0.78× on the
85+ quasi-monotone workload. The geometric mean of the three decode
86+ speedups is ** 1.35×** , which clears the 1.3× tuning bar; the
87+ per-scenario picture nevertheless makes clear that the win is
88+ concentrated in the wide-width B case and that the narrow-width C case
89+ has regressed.
90+
91+ The C regression is concentrated in the decode hot loop. With every
92+ bin in C using ~ 17 bits the bit-unpack is bandwidth-bound on the packed
93+ buffer, and switching from ` BufferMut::push ` to a direct write through
94+ ` spare_capacity_mut ` perturbs register allocation enough that the C
95+ case lost ~ 22 % even though the same change buys B a 3× speedup. A
96+ fastlanes-style fixed-width path for uniform-width batches would
97+ recover C and is the natural next step.
7498
7599## ` scalar_at ` (median, 1_000 random indices)
76100
@@ -98,17 +122,20 @@ scenarios; the bench reports it as `~390 µs` per 1k-index loop.
98122 touches local correlation; that is a job for delta/RLE layers above
99123 it.
100124
101- 2 . ** Decode throughput is ~ 3.5–4× below PCO's, which is the price of
102- the layered indirection on this codepath.** Median MB/s:
103- bin_partition decode is 800 / 744 / 1304 (A/B/C) against PCO's
104- 3190 / 3147 / 2629. Scenario C is the fastest decode for
105- bin_partition (1.3 GB/s) because every bin happens to be narrow,
106- making the bit-unpack faster.
107-
108- 3 . ** Encode is ~ 2.5–3× slower than PCO.** Quantile sampling plus the
109- per-bin width pick plus the variable-width pack costs roughly 500
110- MB/s vs. PCO's ~ 1.2 GB/s. This is acceptable for a first cut — there
111- is no SIMD path for the bit-pack yet.
125+ 2 . ** Decode throughput now ranges from ~ 30 % of PCO's (skewed and
126+ quasi-monotone) to ~ 70 % (uniform random).** Median MB/s after
127+ tuning: bin_partition decode is 869 / 2149 / 999 (A/B/C) against
128+ PCO's 3190 / 3147 / 2629. Scenario B closes most of the gap; A and
129+ C are still bottlenecked on the per-element bit-unpack chain. A
130+ fastlanes-style fixed-width path for batches whose bin all share the
131+ same width should help A and C; that work is left for follow-up.
132+
133+ 3 . ** Encode is ~ 55–60 % of PCO's throughput.** Quantile sampling plus
134+ the per-bin width pick plus the variable-width pack now sits at
135+ ~ 660–770 MB/s vs. PCO's ~ 1.15–1.32 GB/s. The fused
136+ validate+pack+prefix pass dropped the encoder's memory traffic to
137+ one walk over ` values ` and ` bin_idx ` , and the branchless ` <= 16 `
138+ cascade replaces the binary search.
112139
1131404 . ** Random-access ` scalar_at ` is the headline win.** bin_partition
114141 resolves any element in ~ 390–400 ns regardless of scenario (the
0 commit comments