Commit 552f93e

MagicalTux and claude committed
perf(brotli,zstd): faster encode on low-redundancy input
Audit of the codecs against their official CLIs found two encoders that were dramatically slower than the reference on incompressible/low-match data, both linear but with pathological constants. (Interop and the streaming contract were otherwise clean across every codec with an official tool; lh1/lh2's Unsupported-without-length is documented, not a bug.)

brotli: the literal-context histogram clustering was O(contexts^3 * 256): it rescanned all cluster pairs and recomputed each cluster's cost from scratch on every merge, which exploded on dense histograms (~37k instructions/byte on random input). Cache per-cluster costs and the pairwise-delta matrix, updating only the merged cluster each round. The merge sequence and compressed output are byte-for-byte identical; incompressible encode is ~8x faster.

zstd: the match finder used a fixed 64 Ki-bucket hash table over an up-to-8 MiB window (load factor in the hundreds), so each probe walked a full chain of useless far links. Size the table to the window. Also build the per-block match index incrementally: the chains persist across blocks (the history prefix is byte-stable until a window trim) instead of re-indexing all of history every block, which was O(history) per block and quadratic over a stream. Output is unchanged on single-block inputs and equal-or-smaller on multi-block ones (0 ratio regressions observed); random encode is ~3x faster.

Verified: brotli output byte-identical across inputs x quality 0..11; zstd 0 ratio regressions and interop both ways with the zstd CLI; 50-case zstd fuzz (incl. >8 MiB trim path and block boundaries) round-trips through both our decoder and the CLI; full suite, clippy, and fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 227f352 commit 552f93e

4 files changed

Lines changed: 164 additions & 36 deletions

CHANGELOG.md

Lines changed: 19 additions & 0 deletions
@@ -29,6 +29,25 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   streaming XXH64 implementation; the decompressed output is hashed and checked
   against the 4-byte frame trailer, reporting `ChecksumMismatch` on corruption.

+### Changed
+
+- *(brotli)* much faster encode on low-redundancy input. The literal-context
+  histogram clustering was O(contexts³ · 256) — it rescanned every cluster pair
+  and recomputed each cluster's cost from scratch on every merge — which blew up
+  on dense histograms (e.g. random/incompressible data: ~37k instructions per
+  byte). It now caches per-cluster costs and the pairwise-delta matrix and
+  updates only the merged cluster each round. The merge sequence, and therefore
+  the compressed output, is byte-for-byte identical; encode of incompressible
+  input is ~8× faster.
+- *(zstd)* faster encode, especially on low-match input, with equal-or-better
+  ratio. The match finder's hash table was a fixed 64 Ki buckets over an up-to
+  8 MiB window (load factor in the hundreds), so every probe walked a full chain
+  of useless far links; it is now sized to the window. The per-block match index
+  is also built incrementally — the chains persist across blocks instead of
+  re-indexing all of history every block (which was O(history) per block, i.e.
+  quadratic over a stream). Output is unchanged on single-block inputs and
+  equal-or-smaller on multi-block inputs (no ratio regression observed).
+
 ### Fixed

 - *(decoder bridge)* a decoder that buffers a whole block internally (notably

src/brotli/encoder_ctx.rs

Lines changed: 45 additions & 9 deletions
@@ -159,22 +159,49 @@ pub(crate) fn cluster(
         }
     }

+    // Agglomerative clustering. The naive form recomputes every pair's merge
+    // delta — including each cluster's own `histogram_bits` — on every iteration,
+    // which is O(active³ · 256) and blows up on dense histograms (e.g. random
+    // input, where every context spans all 256 symbols). Instead cache each
+    // cluster's self-cost and the pairwise deltas, keyed by stable cluster id,
+    // and after each merge recompute only the merged cluster's row. The merge
+    // sequence — and therefore the resulting model and compressed output — is
+    // byte-for-byte identical to the naive version; only redundant work is cut.
+    let mut self_bits = alloc::vec![0u64; NUM_CONTEXTS];
+    for &c in &active {
+        self_bits[c] = histogram_bits(&histograms[c], totals[c]);
+    }
+    // `delta[ci][cj]` for `ci < cj`; valid only for currently-active pairs.
+    let mut delta = alloc::vec![alloc::vec![0i64; NUM_CONTEXTS]; NUM_CONTEXTS];
+    let pair_delta = |ci: usize, cj: usize, sb: &[u64], hs: &[[u32; 256]], ts: &[u32]| -> i64 {
+        let bm = merged_bits(&hs[ci], ts[ci], &hs[cj], ts[cj]);
+        bm as i64 - sb[ci] as i64 - sb[cj] as i64 - HEADER_COST_BITS as i64
+    };
+    for ai in 0..active.len() {
+        for aj in (ai + 1)..active.len() {
+            let (ci, cj) = (active[ai], active[aj]);
+            delta[ci][cj] = pair_delta(ci, cj, &self_bits, &histograms, &totals);
+        }
+    }
+
     while active.len() > 1 {
         let force = active.len() > max_trees;
         let mut best_i = 0usize;
         let mut best_j = 0usize;
         let mut best_delta: i64 = i64::MAX;
+        // Same scan order and strict `<` tie-break as the naive loop, so the
+        // chosen pair is identical — but now a cheap matrix lookup, not a
+        // 256-symbol recomputation.
         for ai in 0..active.len() {
             for aj in (ai + 1)..active.len() {
-                let ci = active[ai];
-                let cj = active[aj];
-                let bi = histogram_bits(&histograms[ci], totals[ci]);
-                let bj = histogram_bits(&histograms[cj], totals[cj]);
-                let bm = merged_bits(&histograms[ci], totals[ci], &histograms[cj], totals[cj]);
-                // Merging trades a header allowance against extra data bits.
-                let delta = bm as i64 - bi as i64 - bj as i64 - HEADER_COST_BITS as i64;
-                if delta < best_delta {
-                    best_delta = delta;
+                let (ci, cj) = (active[ai], active[aj]);
+                let d = if ci < cj {
+                    delta[ci][cj]
+                } else {
+                    delta[cj][ci]
+                };
+                if d < best_delta {
+                    best_delta = d;
                     best_i = ai;
                     best_j = aj;
                 }
@@ -197,6 +224,15 @@ pub(crate) fn cluster(
             }
         }
         active.swap_remove(best_j);
+        // Only the merged cluster `ci`'s costs changed; refresh its self-cost
+        // and its delta against every other surviving cluster.
+        self_bits[ci] = histogram_bits(&histograms[ci], totals[ci]);
+        for &ck in &active {
+            if ck != ci {
+                let (lo, hi) = if ci < ck { (ci, ck) } else { (ck, ci) };
+                delta[lo][hi] = pair_delta(lo, hi, &self_bits, &histograms, &totals);
+            }
+        }
     }

     // Compress cluster ids to a dense 0..num_trees range.
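
The caching scheme in the `cluster` hunk can be sketched in isolation. This is a minimal illustration, not the crate's code: `cost`, `pair_delta`, `best_pair`, `merge_into`, `HEADER_COST`, and the stopping rule are hypothetical stand-ins, since the real `histogram_bits`, `merged_bits`, `HEADER_COST_BITS`, and loop exit are not visible in this diff. It is meant to show why the cached variant should reproduce the naive merge sequence: both scan `active` in the same order with the same strict `<` tie-break, and only deltas that touch the merged cluster can change.

```rust
// Hypothetical header allowance; the real HEADER_COST_BITS is not shown in the diff.
const HEADER_COST: i64 = 40;

/// Toy integer cost of coding a histogram (an assumption, not brotli's estimate).
fn cost(h: &[u32]) -> i64 {
    let total: u64 = h.iter().map(|&c| c as u64).sum();
    if total == 0 {
        return 0;
    }
    let total_len = 64 - total.leading_zeros() as i64;
    h.iter()
        .filter(|&&c| c > 0)
        .map(|&c| c as i64 * (total_len - (32 - c.leading_zeros() as i64) + 1))
        .sum()
}

/// Merge delta for one pair, given the two clusters' self-costs.
fn pair_delta(a: &[u32], b: &[u32], cost_a: i64, cost_b: i64) -> i64 {
    let merged: Vec<u32> = a.iter().zip(b).map(|(x, y)| x + y).collect();
    cost(&merged) - cost_a - cost_b - HEADER_COST
}

/// First pair in scan order with the strictly smallest delta.
fn best_pair(active: &[usize], delta: impl Fn(usize, usize) -> i64) -> (usize, usize, i64) {
    let mut best = (0, 0, i64::MAX);
    for ai in 0..active.len() {
        for aj in (ai + 1)..active.len() {
            let d = delta(active[ai], active[aj]);
            if d < best.2 {
                best = (ai, aj, d);
            }
        }
    }
    best
}

/// Folds histogram `cj` into histogram `ci`.
fn merge_into(hs: &mut [Vec<u32>], ci: usize, cj: usize) {
    let src = hs[cj].clone();
    for (x, y) in hs[ci].iter_mut().zip(src) {
        *x += y;
    }
}

/// Naive form: every delta is recomputed from the histograms each round.
fn cluster_naive(mut hs: Vec<Vec<u32>>, max_clusters: usize) -> Vec<(usize, usize)> {
    let mut active: Vec<usize> = (0..hs.len()).collect();
    let mut merges = Vec::new();
    while active.len() > 1 {
        let (bi, bj, d) = best_pair(&active, |ci, cj| {
            pair_delta(&hs[ci], &hs[cj], cost(&hs[ci]), cost(&hs[cj]))
        });
        // Assumed stopping rule: merge while forced, or while merging saves bits.
        if active.len() <= max_clusters && d >= 0 {
            break;
        }
        let (ci, cj) = (active[bi], active[bj]);
        merge_into(&mut hs, ci, cj);
        merges.push((ci, cj));
        active.swap_remove(bj);
    }
    merges
}

/// Cached form: self-costs and pairwise deltas are stored by cluster id, and
/// only the merged cluster's entries are refreshed after each merge.
fn cluster_cached(mut hs: Vec<Vec<u32>>, max_clusters: usize) -> Vec<(usize, usize)> {
    let n = hs.len();
    let mut active: Vec<usize> = (0..n).collect();
    let mut self_cost: Vec<i64> = hs.iter().map(|h| cost(h)).collect();
    let mut delta = vec![vec![0i64; n]; n]; // delta[lo][hi] with lo < hi
    for i in 0..n {
        for j in (i + 1)..n {
            delta[i][j] = pair_delta(&hs[i], &hs[j], self_cost[i], self_cost[j]);
        }
    }
    let mut merges = Vec::new();
    while active.len() > 1 {
        let (bi, bj, d) = best_pair(&active, |ci, cj| delta[ci.min(cj)][ci.max(cj)]);
        if active.len() <= max_clusters && d >= 0 {
            break;
        }
        let (ci, cj) = (active[bi], active[bj]);
        merge_into(&mut hs, ci, cj);
        merges.push((ci, cj));
        active.swap_remove(bj);
        self_cost[ci] = cost(&hs[ci]);
        for &ck in active.iter().filter(|&&ck| ck != ci) {
            let (lo, hi) = (ci.min(ck), ci.max(ck));
            delta[lo][hi] = pair_delta(&hs[lo], &hs[hi], self_cost[lo], self_cost[hi]);
        }
    }
    merges
}
```

Under these assumptions, `cluster_naive(hs.clone(), 3)` and `cluster_cached(hs, 3)` return the same list of `(kept, absorbed)` pairs. The cached form needs a number of cost evaluations per merge proportional to the surviving clusters, where the naive form needs one per pair.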

src/zstd/encoder.rs

Lines changed: 16 additions & 8 deletions
@@ -307,14 +307,22 @@ impl Encoder {
         let buffer = buffer.as_slice();
         let buf_len = buffer.len();

-        // Rebuild the chains for this buffer and pre-index only the retained
-        // history (`[0, start)`). Each parser then splices in the *current
-        // block's* positions lazily as it advances, so the hash chains never
-        // contain positions ahead of the probe — the standard LZ invariant that
-        // keeps match finding correct and the depth budget meaningful. Indexing
-        // history up front is what enables cross-block back-references.
-        self.matcher.resize_for(buf_len);
-        for i in 0..start.min(buf_len.saturating_sub(3)) {
+        // Pre-index the retained history (`[0, start)`) so cross-block
+        // back-references are findable; each parser then splices in the
+        // *current block's* positions lazily as it advances, preserving the LZ
+        // invariant that the chains never contain positions ahead of the probe.
+        //
+        // The chains persist across blocks (the history prefix is byte-stable
+        // until the window is trimmed), so we only index the positions not
+        // already indexed by earlier blocks — `[inserted_upto, start)`. The old
+        // code re-indexed all of history every block, which is O(history) per
+        // block and quadratic over a stream; this makes it amortised O(input).
+        // `prepare_incremental` keeps the existing chains (rebuilding only on a
+        // head-size change); window trims call `resize_for`, which resets the
+        // high-water so the next block re-indexes from scratch.
+        self.matcher.prepare_incremental(buf_len);
+        let index_to = start.min(buf_len.saturating_sub(3));
+        for i in self.matcher.inserted_upto()..index_to {
             self.matcher.insert(buffer, i);
         }
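
The effect of the high-water mark can be illustrated with a toy hash-chain index. This is a hedged sketch, not the crate's `MatchFinder`: `ChainIndex`, `index_history`, `reset`, `run`, and the `insert_calls` counter are hypothetical names introduced here, the table size is fixed rather than sized to the buffer, and the inner `for pos in start..end` loop merely stands in for the parser that splices in the current block's positions. It compares re-indexing all history per block against indexing only `[inserted_upto, start)`.

```rust
const NIL: u32 = u32::MAX;

/// Toy hash-chain index (assumed simplification of the diff's `MatchFinder`).
struct ChainIndex {
    head: Vec<u32>,
    prev: Vec<u32>,
    shift: u32,
    inserted_upto: usize,
    insert_calls: usize, // instrumentation for this demo only
}

impl ChainIndex {
    fn new(bits: u32) -> Self {
        Self {
            head: vec![NIL; 1usize << bits],
            prev: Vec::new(),
            shift: 32 - bits,
            inserted_upto: 0,
            insert_calls: 0,
        }
    }

    /// Records `buf[pos..pos + 4]`; positions must arrive in increasing order.
    fn insert(&mut self, buf: &[u8], pos: usize) {
        if pos + 4 > buf.len() {
            return;
        }
        let v = u32::from_le_bytes([buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3]]);
        let h = (v.wrapping_mul(0x9E37_79B1) >> self.shift) as usize;
        self.prev[pos] = self.head[h];
        self.head[h] = pos as u32;
        self.inserted_upto = self.inserted_upto.max(pos + 1);
        self.insert_calls += 1;
    }

    /// Indexes only the history not yet in the chains: `[inserted_upto, start)`.
    fn index_history(&mut self, buf: &[u8], start: usize) {
        if self.prev.len() < buf.len() {
            self.prev.resize(buf.len(), NIL);
        }
        let index_to = start.min(buf.len().saturating_sub(3));
        for i in self.inserted_upto..index_to {
            self.insert(buf, i);
        }
    }

    /// Forgets everything, as a full rebuild would.
    fn reset(&mut self) {
        self.head.fill(NIL);
        self.prev.clear();
        self.inserted_upto = 0;
    }
}

/// Feeds `data` block by block, either incrementally or rebuilding every block.
fn run(data: &[u8], block: usize, incremental: bool) -> ChainIndex {
    let mut ix = ChainIndex::new(12);
    let mut start = 0;
    while start < data.len() {
        let end = (start + block).min(data.len());
        let buf = &data[..end];
        if !incremental {
            ix.reset();
        }
        ix.index_history(buf, start);
        for pos in start..end {
            ix.insert(buf, pos); // stand-in for the parser's lazy splicing
        }
        start = end;
    }
    ix
}
```

For a 4000-byte input in 1000-byte blocks, the incremental run performs 3997 inserts (each hashable position once) while the rebuild-per-block run performs 9988, and both end with identical `head` and `prev` tables. The gap widens with the number of blocks, which is the quadratic-versus-linear behaviour the comment describes.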

src/zstd/matcher.rs

Lines changed: 84 additions & 19 deletions
@@ -16,8 +16,6 @@
 //! - `Match { length, distance }` returned by value, with `MIN_MATCH = 3`
 //! (zstd's minimum) and a generous `MAX_MATCH` cap.

-use alloc::boxed::Box;
-
 /// Minimum match length the matcher will report (RFC 8478 §3.1.1.3.2 implies
 /// a hard minimum of 3 via the match-length base table).
 pub const MIN_MATCH: usize = 3;
@@ -30,9 +28,15 @@ pub const MIN_MATCH: usize = 3;
 /// periodicity at distance ~445 bytes): each long match amortises the
 /// per-sequence FSE-table cost across thousands more output bytes.
 pub const MAX_MATCH: usize = 65535;
-/// Hash table size (must be a power of two).
-const HASH_BITS: u32 = 15;
-const HASH_SIZE: usize = 1 << HASH_BITS;
+/// Minimum hash-table size (power of two). The table is sized to the indexed
+/// buffer at construction / `resize_for` time and floored here for tiny inputs.
+const HASH_MIN_BITS: u32 = 15;
+/// Upper bound on the hash table (4 Mi buckets = 16 MiB). The matcher indexes
+/// up to an 8 MiB history; a fixed small table would give that window a load
+/// factor in the hundreds, so on low-match input every probe walked the full
+/// `max_chain` of useless far-distance links. Sizing the table to the buffer
+/// keeps chains short (the same reason liblzma sizes its hash to the dict).
+const HASH_MAX_BITS: u32 = 22;
 /// "Empty" marker in the hash table.
 const NIL: u32 = u32::MAX;

@@ -46,31 +50,86 @@ pub struct Match {

 /// Per-block matcher state.
 pub struct MatchFinder {
-    head: Box<[u32; HASH_SIZE]>,
+    head: Vec<u32>,
+    /// Right-shift applied to the 32-bit hash to land in `head`; `32 - log2(len)`.
+    head_shift: u32,
     /// Linked-list chain `prev[pos]` = position of the previous occurrence of
     /// the same 4-byte prefix.
     prev: Vec<u32>,
+    /// Number of leading positions already spliced into the chains. The chains
+    /// persist across blocks (the buffer prefix is byte-stable until the window
+    /// is trimmed), so each block only needs to insert positions `>= this`
+    /// rather than re-indexing all of history — turning the per-block O(history)
+    /// rebuild (quadratic over a stream) into amortised O(input).
+    inserted_upto: usize,
 }

 use alloc::vec;
 use alloc::vec::Vec;

-/// Hash function over four bytes. A multiplicative hash with a prime
-/// multiplier gives reasonable distribution and is cheap to compute.
+/// Full-width multiplicative hash over four bytes. The caller takes the top
+/// `head` bits via `head_shift`; the high bits of a golden-ratio multiply are
+/// the well-distributed ones.
+#[inline]
 fn hash4(b: &[u8]) -> u32 {
     let v = (b[0] as u32) | ((b[1] as u32) << 8) | ((b[2] as u32) << 16) | ((b[3] as u32) << 24);
-    // 0x9E3779B1 = golden-ratio multiplier; high bits are the well-distributed ones.
-    v.wrapping_mul(0x9E37_79B1) >> (32 - HASH_BITS)
+    v.wrapping_mul(0x9E37_79B1)
+}
+
+/// `(head_len, head_shift)` for a buffer of `buffer_len` bytes: the table is the
+/// buffer size rounded up to a power of two, clamped to `[HASH_MIN_BITS,
+/// HASH_MAX_BITS]`, so the average chain length stays O(1).
+fn head_params(buffer_len: usize) -> (usize, u32) {
+    let bits = buffer_len
+        .next_power_of_two()
+        .trailing_zeros()
+        .clamp(HASH_MIN_BITS, HASH_MAX_BITS);
+    (1usize << bits, 32 - bits)
 }

 impl MatchFinder {
     pub fn new(buffer_len: usize) -> Self {
+        let (head_len, head_shift) = head_params(buffer_len);
         Self {
-            head: Box::new([NIL; HASH_SIZE]),
+            head: vec![NIL; head_len],
+            head_shift,
             prev: vec![NIL; buffer_len.max(1)],
+            inserted_upto: 0,
         }
     }

+    /// How many leading positions are already in the chains.
+    #[inline]
+    pub fn inserted_upto(&self) -> usize {
+        self.inserted_upto
+    }
+
+    /// Prepare to index a buffer of `buffer_len` bytes *incrementally*, keeping
+    /// the chains built for the byte-stable prefix from earlier blocks. Grows
+    /// the per-position array (preserving entries) and only rebuilds the head
+    /// table when the ideal size changes (a power-of-two growth, O(log input)
+    /// times total) — a rebuild resets `inserted_upto` so the caller re-indexes
+    /// the prefix that round. Use [`resize_for`](Self::resize_for) instead when
+    /// the window is trimmed and absolute positions shift.
+    pub fn prepare_incremental(&mut self, buffer_len: usize) {
+        if self.prev.len() < buffer_len {
+            self.prev.resize(buffer_len.max(1), NIL);
+        }
+        let (head_len, head_shift) = head_params(buffer_len);
+        if head_len != self.head.len() {
+            self.head.clear();
+            self.head.resize(head_len, NIL);
+            self.head_shift = head_shift;
+            self.inserted_upto = 0;
+        }
+    }
+
+    /// Bucket index for the 4 bytes at `b`.
+    #[inline]
+    fn bucket(&self, b: &[u8]) -> usize {
+        (hash4(b) >> self.head_shift) as usize
+    }
+
     /// Forget every position recorded so far. The buffer length stays the
     /// same. Not currently called — [`MatchFinder::resize_for`] is used on
     /// each new block — but kept for completeness / future tuning.
@@ -89,20 +148,26 @@ impl MatchFinder {
     pub fn resize_for(&mut self, buffer_len: usize) {
         self.prev.clear();
         self.prev.resize(buffer_len.max(1), NIL);
-        for h in self.head.iter_mut() {
-            *h = NIL;
-        }
+        let (head_len, head_shift) = head_params(buffer_len);
+        self.head_shift = head_shift;
+        self.head.clear();
+        self.head.resize(head_len, NIL);
+        self.inserted_upto = 0;
     }

-    /// Record `buffer[pos..pos+4]`.
+    /// Record `buffer[pos..pos+4]`. Positions must be inserted in increasing
+    /// order (the standard LZ invariant); `inserted_upto` tracks the high-water
+    /// so later blocks can skip what is already indexed.
     pub fn insert(&mut self, buffer: &[u8], pos: usize) {
         if pos + 4 > buffer.len() {
             return;
         }
-        let h = hash4(&buffer[pos..pos + 4]) as usize;
-        // Safety: head is fixed size HASH_SIZE, h < HASH_SIZE.
+        let h = self.bucket(&buffer[pos..pos + 4]);
         self.prev[pos] = self.head[h];
         self.head[h] = pos as u32;
+        if pos + 1 > self.inserted_upto {
+            self.inserted_upto = pos + 1;
+        }
     }

     /// Find the longest match for `buffer[pos..]` against any earlier
@@ -126,7 +191,7 @@ impl MatchFinder {
             // Can't compute the 4-byte hash; just fail (rare; near end of buf).
             return None;
         }
-        let h = hash4(&buffer[pos..pos + 4]) as usize;
+        let h = self.bucket(&buffer[pos..pos + 4]);
         let max_dist = window.min(pos);
         let max_len = MAX_MATCH.min(buffer.len() - pos);
         if max_len < MIN_MATCH {
@@ -225,7 +290,7 @@ impl MatchFinder {
         if pos + MIN_MATCH > buffer.len() || pos + 4 > buffer.len() {
             return;
         }
-        let h = hash4(&buffer[pos..pos + 4]) as usize;
+        let h = self.bucket(&buffer[pos..pos + 4]);
         let max_dist = window.min(pos);
         let max_len = MAX_MATCH.min(buffer.len() - pos);
         if max_len < MIN_MATCH {
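
To make the table-sizing arithmetic concrete, the sketch below copies `head_params` and the two bit-width constants from the hunk above and adds two helpers that are assumptions for illustration only: a free-standing `bucket` (the diff's version is a method on `MatchFinder`) and `load_factor`, which does not exist in the crate. With the removed `HASH_BITS = 15`, the old table had `1 << 15` buckets regardless of buffer size.

```rust
const HASH_MIN_BITS: u32 = 15;
const HASH_MAX_BITS: u32 = 22;

/// Copied from the diff: table length and hash shift for a given buffer size.
fn head_params(buffer_len: usize) -> (usize, u32) {
    let bits = buffer_len
        .next_power_of_two()
        .trailing_zeros()
        .clamp(HASH_MIN_BITS, HASH_MAX_BITS);
    (1usize << bits, 32 - bits)
}

/// Free-standing variant of the diff's `bucket` method (assumed shape): take the
/// top bits of the golden-ratio multiplicative hash of four little-endian bytes.
fn bucket(bytes: [u8; 4], head_shift: u32) -> usize {
    (u32::from_le_bytes(bytes).wrapping_mul(0x9E37_79B1) >> head_shift) as usize
}

/// Hypothetical helper: average indexed positions per bucket when every
/// position of the buffer is inserted.
fn load_factor(buffer_len: usize, head_len: usize) -> f64 {
    buffer_len as f64 / head_len as f64
}
```

For an 8 MiB buffer, `head_params(8 << 20)` gives `(4_194_304, 10)`, so `load_factor` drops from 256.0 with a `1 << 15` table to 2.0. Tiny buffers are floored at 32 768 buckets, and because the shift is `32 - bits`, `bucket` always stays below the table length.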
