| 1 | +# Benchmarking |
| 2 | + |
| 3 | +> Status: design rationale for the benchmark suite under [`benches/`](../../benches) |
| 4 | +> and shared benchmark support under [`bench-support/`](../../bench-support). |
| 5 | +> Companion to [`design.md`](design.md) §10 and the benchmark reference docs. |
| 6 | + |
| 7 | +cachekit benchmarks are designed to answer cache questions, not just produce |
| 8 | +fast-looking numbers. A cache policy can be excellent on uniform keys and weak |
| 9 | +under scans, or fast on micro-operations and poor at preserving hit rate. The |
| 10 | +benchmark suite therefore separates micro-operation cost, policy effectiveness, |
| 11 | +trace-shaped workloads, reporting, and machine-readable artifacts. |
| 12 | + |
| 13 | +## Goals |
| 14 | + |
| 15 | +- Compare policies under workload shapes that resemble real cache traffic. |
| 16 | +- Keep measured loops free of allocator noise and dynamic dispatch. |
| 17 | +- Produce both human-readable reports and stable JSON artifacts. |
| 18 | +- Preserve enough metadata to reproduce a run: git commit, branch, dirty bit, |
| 19 | + rustc version, host triple, CPU model, capacity, universe, operations, seed. |
| 20 | +- Make adding a policy or workload a registry edit, not a benchmark rewrite. |
| 21 | + |
| 22 | +## Benchmark Layers |
| 23 | + |
| 24 | +The benchmark suite has four layers: |
| 25 | + |
| 26 | +| Layer | Files | Purpose | |
| 27 | +|---|---|---| |
| 28 | +| Criterion measurements | `benches/workloads.rs`, `benches/ops.rs`, `benches/comparison.rs`, `benches/policy/*.rs` | statistically sampled latency and throughput | |
| 29 | +| Console reports | `benches/reports.rs` | fast, readable tables without Criterion overhead | |
| 30 | +| JSON artifact runner | `benches/runner.rs` | structured output for docs, charts, CI, historical comparison | |
| 31 | +| Shared support crate | `bench-support/` | policy registry, workloads, metrics, JSON schema, doc renderer | |
| 32 | + |
| 33 | +This split is deliberate. Criterion is good for micro-benchmark statistics; the |
| 34 | +artifact runner is good for automation; console reports are good while tuning a |
| 35 | +policy locally. No single binary is forced to serve every audience. |
| 36 | + |
| 37 | +## Monomorphic Policy Registry |
| 38 | + |
| 39 | +Benchmarks iterate policies through `for_each_policy!` in |
| 40 | +[`bench-support/src/registry.rs`](../../bench-support/src/registry.rs): |
| 41 | + |
| 42 | +```rust,ignore |
| 43 | +for_each_policy! { |
| 44 | + with |policy_id, display_name, make_cache| { |
| 45 | + let mut cache = make_cache(CAPACITY); |
| 46 | + // measured workload... |
| 47 | + } |
| 48 | +} |
| 49 | +``` |
| 50 | + |
| 51 | +The macro expands to one block per concrete policy type. This avoids dynamic |
| 52 | +dispatch in the measured loop while keeping policy iteration centralized. |
| 53 | +`POLICIES` in the same module provides presentation metadata (stable id, |
| 54 | +display name, chart color) for renderers and reports. |
| 55 | + |
| 56 | +The trade-off is that adding a policy touches the macro and metadata table. A |
| 57 | +test (`policies_metadata_matches_macro`) keeps the two from drifting. This is |
| 58 | +the same explicit-boilerplate-over-magic choice as `DynCache`: more arms in |
| 59 | +source, fewer surprises in hot code. |
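A minimal sketch of the pattern, not the real `for_each_policy!`: a `macro_rules!` macro that pastes the caller's block once per concrete type, next to a metadata table that a drift check can compare against. The two toy policies, `PolicyMetaSketch`, `POLICIES_SKETCH`, and the two-argument macro shape are all hypothetical names chosen for illustration.

```rust
use std::collections::VecDeque;

/// Hypothetical policy: evicts the oldest key when full.
pub struct FifoSketch {
    capacity: usize,
    keys: VecDeque<u64>,
}

impl FifoSketch {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, keys: VecDeque::with_capacity(capacity) }
    }
    pub fn insert(&mut self, key: u64) {
        if self.keys.len() >= self.capacity {
            self.keys.pop_front();
        }
        self.keys.push_back(key);
    }
    pub fn contains(&self, key: u64) -> bool {
        self.keys.contains(&key)
    }
}

/// Hypothetical policy: rejects new keys once full.
pub struct KeepFirstSketch {
    capacity: usize,
    keys: Vec<u64>,
}

impl KeepFirstSketch {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, keys: Vec::with_capacity(capacity) }
    }
    pub fn insert(&mut self, key: u64) {
        if self.keys.len() < self.capacity {
            self.keys.push(key);
        }
    }
    pub fn contains(&self, key: u64) -> bool {
        self.keys.contains(&key)
    }
}

/// Presentation metadata kept next to the macro (assumed shape).
pub struct PolicyMetaSketch {
    pub id: &'static str,
    pub display_name: &'static str,
}

pub const POLICIES_SKETCH: [PolicyMetaSketch; 2] = [
    PolicyMetaSketch { id: "fifo", display_name: "FIFO" },
    PolicyMetaSketch { id: "keep_first", display_name: "Keep-first" },
];

/// Expands the body once per concrete type, so every call inside the body
/// is statically dispatched; no trait object is involved.
macro_rules! for_each_policy_sketch {
    (|$id:ident, $make:ident| $body:block) => {{
        {
            let $id: &'static str = "fifo";
            let $make = FifoSketch::new;
            $body
        }
        {
            let $id: &'static str = "keep_first";
            let $make = KeepFirstSketch::new;
            $body
        }
    }};
}

/// Runs the same body against every policy and records (id, key 3 survived).
pub fn run_all_policies_sketch() -> Vec<(&'static str, bool)> {
    let mut rows = Vec::new();
    for_each_policy_sketch!(|policy_id, make_cache| {
        let mut cache = make_cache(2);
        for key in 0..4u64 {
            cache.insert(key);
        }
        rows.push((policy_id, cache.contains(3)));
    });
    rows
}
```

A drift test in this style would collect the ids seen by the macro and compare them with the ids in the metadata table, which is presumably what `policies_metadata_matches_macro` does for the real registry.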
| 60 | + |
| 61 | +## Workload Registry |
| 62 | + |
| 63 | +Workload definitions live in `bench-support/src/registry.rs`; generators live in |
| 64 | +[`bench-support/src/workload.rs`](../../bench-support/src/workload.rs). The |
| 65 | +current standard workloads cover: |
| 66 | + |
| 67 | +- Uniform random keys for raw overhead baselines. |
| 68 | +- Hot-set access for explicit skew. |
| 69 | +- Sequential scan for scan-pollution stress. |
| 70 | +- Zipfian and scrambled Zipfian for power-law access. |
| 71 | +- Latest / recency-biased access. |
| 72 | +- Shifting hotspots and flash crowds for adaptation. |
| 73 | +- Composite scan-resistance mixes. |
| 74 | + |
| 75 | +[`docs/benchmarks/workloads.md`](../benchmarks/workloads.md) is the catalog. It |
| 76 | +also contains a large roadmap of planned workloads, which should not be confused |
| 77 | +with implemented cases. New workloads should land first in the support crate, then in |
| 78 | +the docs, then in reports. |
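As an illustration of what a seeded generator can look like, here is a minimal hot-set sketch using only the standard library. It is not the code in `workload.rs`: the SplitMix64 generator, the function name, and the parameter names are assumptions made for this example.

```rust
/// Small deterministic PRNG (SplitMix64), used so the sketch needs no crates.
pub struct SplitMix64Sketch {
    state: u64,
}

impl SplitMix64Sketch {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Generates `operations` keys in `0..universe`. Roughly `hot_percent` percent
/// of accesses land in the hot range `0..hot_keys`; the rest land in the cold
/// range `hot_keys..universe`. The same seed always yields the same trace.
pub fn hot_set_keys_sketch(
    seed: u64,
    universe: u64,
    hot_keys: u64,
    hot_percent: u64,
    operations: usize,
) -> Vec<u64> {
    assert!(hot_keys > 0 && hot_keys < universe && hot_percent <= 100);
    let mut rng = SplitMix64Sketch::new(seed);
    (0..operations)
        .map(|_| {
            let pick_hot = rng.next_u64() % 100 < hot_percent;
            let offset = rng.next_u64();
            if pick_hot {
                offset % hot_keys
            } else {
                hot_keys + offset % (universe - hot_keys)
            }
        })
        .collect()
}
```

Generating the whole trace up front, as here, also keeps random-number generation out of the measured loop.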
| 79 | + |
| 80 | +## Value Construction Discipline |
| 81 | + |
| 82 | +`benches/runner.rs` pre-allocates one `Arc<u64>` per key in the universe and |
| 83 | +passes a closure that returns `Arc::clone`: |
| 84 | + |
| 85 | +```rust,ignore |
| 86 | +fn preallocate_values() -> Vec<Arc<u64>> { |
| 87 | + (0..UNIVERSE).map(Arc::new).collect() |
| 88 | +} |
| 89 | +``` |
| 90 | + |
| 91 | +The rule is: **do not allocate values inside the measured operation loop**. |
| 92 | +Allocating on every miss makes the benchmark measure the allocator and value |
| 93 | +constructor, not the policy. A cheap `Arc::clone` isolates hit/miss behaviour, |
| 94 | +eviction order, and policy metadata overhead. |
| 95 | + |
| 96 | +This is especially important because policies store values differently: |
| 97 | +`FastLru` stores `V` directly, while LRU / LFU / Heap-LFU use `Arc<V>` in some |
| 98 | +paths. Pre-allocation keeps those representation differences from dominating |
| 99 | +the benchmark. |
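The closure side of this discipline might look like the sketch below. `UNIVERSE_SKETCH`, `run_loop_sketch`, and the checksum are illustrative assumptions; the real runner feeds the values into a cache instead of summing them.

```rust
use std::sync::Arc;

/// Hypothetical universe size for the sketch.
pub const UNIVERSE_SKETCH: u64 = 1_000;

/// One allocation per key, done before any measurement starts.
pub fn preallocate_values_sketch() -> Vec<Arc<u64>> {
    (0..UNIVERSE_SKETCH).map(Arc::new).collect()
}

/// Stand-in for the measured loop: each operation only bumps a reference
/// count via `Arc::clone`; nothing is allocated inside the loop.
pub fn run_loop_sketch(values: &[Arc<u64>], keys: &[u64]) -> u64 {
    let value_for = |key: u64| Arc::clone(&values[key as usize]);
    let mut checksum = 0u64;
    for &key in keys {
        let value = value_for(key);
        checksum = checksum.wrapping_add(*value);
    }
    checksum
}
```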
| 100 | + |
| 101 | +## Artifact Schema |
| 102 | + |
| 103 | +`bench-support/src/json_results.rs` defines the stable JSON schema for results: |
| 104 | + |
| 105 | +- `SCHEMA_VERSION` follows semantic-versioning-style rules. |
| 106 | +- Major bumps remove or rename required fields. |
| 107 | +- Minor bumps add optional fields. |
| 108 | +- Renderers accept any artifact with a matching major. |
| 109 | + |
| 110 | +Each `BenchmarkArtifact` contains: |
| 111 | + |
| 112 | +- `metadata`: timestamp, git commit, branch, dirty bit, rustc, host, CPU, |
| 113 | + benchmark config. |
| 114 | +- `results`: rows keyed by policy, workload, and `case_id`. |
| 115 | +- `metrics`: optional typed sections for hit rate, throughput, latency, |
| 116 | + eviction, scan resistance, adaptation speed. |
| 117 | + |
| 118 | +The schema is presentation-neutral. Markdown tables and charts are rendered |
| 119 | +later by `bench-support/src/bin/render_docs.rs`, so measurement and presentation |
| 120 | +can evolve independently. |
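The versioning rules above reduce to a small compatibility check. The sketch below assumes a `MAJOR.MINOR` string form and hypothetical type and function names; the actual representation of `SCHEMA_VERSION` in `json_results.rs` may differ.

```rust
/// Parsed schema version (assumed `MAJOR.MINOR` form).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SchemaVersionSketch {
    pub major: u32,
    pub minor: u32,
}

/// Returns `None` for anything that is not exactly `MAJOR.MINOR`.
pub fn parse_schema_version_sketch(text: &str) -> Option<SchemaVersionSketch> {
    let (major, minor) = text.split_once('.')?;
    Some(SchemaVersionSketch {
        major: major.parse().ok()?,
        minor: minor.parse().ok()?,
    })
}

/// A renderer accepts any artifact whose major version matches its own.
/// Minor differences only add optional fields, so they are tolerated.
pub fn renderer_accepts_sketch(
    renderer: SchemaVersionSketch,
    artifact: SchemaVersionSketch,
) -> bool {
    renderer.major == artifact.major
}
```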
| 121 | + |
| 122 | +## Case IDs |
| 123 | + |
| 124 | +Use `case_id::*` constants from `json_results.rs` instead of string literals: |
| 125 | + |
| 126 | +- `hit_rate` |
| 127 | +- `comprehensive` |
| 128 | +- `scan_resistance` |
| 129 | +- `adaptation` |
| 130 | + |
| 131 | +This catches typos at compile time and prevents a result section from silently |
| 132 | +disappearing from rendered docs. Adding a new case means adding a constant, |
| 133 | +teaching the runner to populate it, and teaching the renderer how to display it. |
| 134 | + |
| 135 | +## What Each Benchmark Answers |
| 136 | + |
| 137 | +| Benchmark | Question | |
| 138 | +|---|---| |
| 139 | +| `ops.rs` | What is the raw cost of `get` / `insert` / policy-specific operations? | |
| 140 | +| `workloads.rs` | Which policies preserve hit rate under standard workloads? | |
| 141 | +| `comparison.rs` | How does cachekit compare with external crates (`lru`, `quick_cache`)? | |
| 142 | +| `policy/*.rs` | What is the cost of each policy's unique operations? | |
| 143 | +| `reports.rs` | What should a human inspect while tuning? | |
| 144 | +| `runner.rs` | What should CI and docs consume? | |
| 145 | + |
| 146 | +Do not overload one benchmark to answer all questions. If you need policy |
| 147 | +micro-cost, use `ops.rs`; if you need hit rate under scans, use `workloads.rs` |
| 148 | +or `runner.rs`. |
| 149 | + |
| 150 | +## Reproducibility Rules |
| 151 | + |
| 152 | +- Seed every workload. The default seed is 42 unless a benchmark is explicitly |
| 153 | + sweeping seeds. |
| 154 | +- Record the git dirty bit. Dirty runs are useful locally but should not be |
| 155 | + published as release baselines without a note. |
| 156 | +- Keep capacity, universe, and operation count visible in the artifact. |
| 157 | +- Prefer `ScrambledZipfian` over raw `Zipfian` for cross-policy comparison when |
| 158 | + hardware prefetch could bias hot-key locality. |
| 159 | +- Do not compare results across machines without CPU metadata. Tail latency and |
| 160 | + pointer-heavy policy cost are machine-sensitive. |
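To make the `ScrambledZipfian` rule concrete: a raw Zipfian generator typically assigns the hottest ranks to the lowest key values, so hot keys sit next to each other. A common remedy is to hash each rank before mapping it into the key space. The sketch below uses 64-bit FNV-1a for that step; whether cachekit's `ScrambledZipfian` uses this hash is an assumption, and the function names are hypothetical.

```rust
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a over the little-endian bytes of `value`.
pub fn fnv1a_64_sketch(value: u64) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &byte in value.to_le_bytes().iter() {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Maps a popularity rank (0 = hottest) to a key in `0..universe`.
/// The mapping is deterministic, so seeded runs stay reproducible, but
/// neighbouring ranks no longer map to neighbouring keys.
pub fn scramble_rank_sketch(rank: u64, universe: u64) -> u64 {
    assert!(universe > 0);
    fnv1a_64_sketch(rank) % universe
}
```

Hashing can map two ranks to the same key, which slightly reshapes the distribution; that is usually acceptable for cross-policy comparison.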
| 161 | + |
| 162 | +## CI and Documentation Flow |
| 163 | + |
| 164 | +The docs pipeline runs the benchmark suite, writes |
| 165 | +`target/benchmarks/<run-id>/results.json`, and renders |
| 166 | +`docs/benchmarks/latest/` plus charts. Release-tag snapshots live under |
| 167 | +`docs/benchmarks/vX.Y.Z/`. |
| 168 | + |
| 169 | +Manual workflow: |
| 170 | + |
| 171 | +```bash |
| 172 | +cargo bench --bench runner |
| 173 | +./scripts/update_benchmark_docs.sh |
| 174 | +``` |
| 175 | + |
| 176 | +The script is the high-level path for refreshing published benchmark docs. Use |
| 177 | +individual benches (`cargo bench --bench ops`, `cargo bench --bench reports -- scan`) |
| 178 | +while developing a policy. |
| 179 | + |
| 180 | +## Adding a Policy to Benchmarks |
| 181 | + |
| 182 | +1. Add the policy to `for_each_policy!` with a concrete constructor. |
| 183 | +2. Add matching `PolicyMeta` in `POLICIES`. |
| 184 | +3. Run the registry drift test. |
| 185 | +4. Run `cargo bench --bench reports -- hit_rate` for a quick sanity check. |
| 186 | +5. Run `cargo bench --bench runner` before publishing docs. |
| 187 | + |
| 188 | +Keep constructors comparable. If one policy needs `Arc<u64>` and another stores |
| 189 | +`u64`, choose the value shape that preserves fairness and document the exception |
| 190 | +in the registry comment. |
| 191 | + |
| 192 | +## Adding a Workload |
| 193 | + |
| 194 | +1. Implement the generator in `bench-support/src/workload.rs`. |
| 195 | +2. Add a `WorkloadCase` in the registry with stable id and display name. |
| 196 | +3. Add docs in [`docs/benchmarks/workloads.md`](../benchmarks/workloads.md). |
| 197 | +4. Add renderer support if the workload needs a custom section. |
| 198 | +5. Run at least one policy family expected to behave differently (for example, |
| 199 | + LRU vs S3-FIFO for scan-heavy workloads). |
| 200 | + |
| 201 | +Do not add a workload just because it is mathematically interesting. It should |
| 202 | +answer a policy-selection question. |
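One quick way to check that a candidate workload answers a policy-selection question is to confirm that two policies score differently on it. The sketch below does this with a toy trace (two hot keys interleaved with a one-shot scan) and a toy queue cache that behaves as LRU or as plain FIFO. All names are hypothetical, and plain FIFO stands in for simplicity; it is not S3-FIFO.

```rust
use std::collections::VecDeque;

/// Toy cache: eviction always takes the front of the queue. With
/// `promote_on_hit` it behaves like LRU, without it like plain FIFO.
pub struct QueueCacheSketch {
    capacity: usize,
    promote_on_hit: bool,
    order: VecDeque<u64>,
}

impl QueueCacheSketch {
    pub fn new(capacity: usize, promote_on_hit: bool) -> Self {
        Self { capacity, promote_on_hit, order: VecDeque::with_capacity(capacity) }
    }

    /// Returns true on a hit. On a miss the key is inserted, evicting the
    /// front entry first if the cache is full.
    pub fn access(&mut self, key: u64) -> bool {
        if let Some(index) = self.order.iter().position(|&k| k == key) {
            if self.promote_on_hit {
                self.order.remove(index);
                self.order.push_back(key);
            }
            return true;
        }
        if self.order.len() >= self.capacity {
            self.order.pop_front();
        }
        self.order.push_back(key);
        false
    }
}

/// Each round touches hot keys 0 and 1, then one never-repeated scan key.
pub fn hot_pair_with_scan_sketch(rounds: u64) -> Vec<u64> {
    let mut trace = Vec::new();
    for round in 0..rounds {
        trace.extend_from_slice(&[0, 1, 1_000 + round]);
    }
    trace
}

/// Replays the trace and counts hits.
pub fn count_hits_sketch(cache: &mut QueueCacheSketch, trace: &[u64]) -> usize {
    trace.iter().filter(|&&key| cache.access(key)).count()
}
```

With capacity 3 and 10 rounds, the LRU variant keeps both hot keys resident after the first round, while the FIFO variant repeatedly evicts them, so the two hit counts diverge. A workload on which every policy scores the same would not justify a registry entry.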
| 203 | + |
| 204 | +## Non-goals |
| 205 | + |
| 206 | +- Benchmarks are not formal proofs of policy optimality. |
| 207 | +- Benchmarks are not stable ABI. The JSON schema is versioned, but Criterion |
| 208 | + names and report formatting can change. |
| 209 | +- Benchmarks do not hide hardware effects. They record enough metadata for the |
| 210 | + reader to judge them. |
| 211 | +- Benchmarks do not replace fuzzing or invariant tests; they measure behaviour |
| 212 | + under selected workloads. |
| 213 | + |
| 214 | +## See Also |
| 215 | + |
| 216 | +- [Design overview](design.md) - §10 frames benchmarking at the principles level |
| 217 | +- [Metrics](metrics.md) - recorder / snapshot / exporter split |
| 218 | +- [Benchmark docs](../benchmarks/README.md) |
| 219 | +- [Workload catalog](../benchmarks/workloads.md) |
| 220 | +- [`bench-support/src/registry.rs`](../../bench-support/src/registry.rs) |
| 221 | +- [`bench-support/src/json_results.rs`](../../bench-support/src/json_results.rs) |
| 222 | +- [`benches/runner.rs`](../../benches/runner.rs) |