| 1 | +--- |
| 2 | +name: benchmark-chonk |
| 3 | +description: Run realistic Chonk (client IVC) benchmarks using pinned protocol inputs. Covers native and WASM proving, per-circuit breakdowns, BB_BENCH instrumentation, and profiling code augmentation. Use when asked to benchmark, profile, or measure Chonk proving performance. |
| 4 | +argument-hint: <action> e.g. "run", "compare", "wasm", "instrument <area>", "per-circuit", "download-inputs" |
| 5 | +--- |
| 6 | + |
| 7 | +# Benchmark Chonk |
| 8 | + |
| 9 | +Run realistic Chonk IVC benchmarks using **pinned protocol inputs** (real transaction flows captured from end-to-end tests), not the synthetic `chonk_bench` target. The synthetic benchmark (`chonk_bench`) uses trivially small mock circuits — it is useful for quick regression checks but does NOT reflect production proving performance. Users invoking `/benchmark-chonk` want the real thing. |
| 10 | + |
| 11 | +## What makes this different from `chonk_bench` |
| 12 | + |
| 13 | +| | `chonk_bench` (synthetic) | This skill (realistic) | |
| 14 | +|---|---|---| |
| 15 | +| Input data | Mock circuits via `test_bench_shared.hpp` | Pinned msgpack from real Aztec transactions | |
| 16 | +| Circuit count | 2 or 5 tiny circuits | Full transaction flows (10+ circuits) | |
| 17 | +| Circuit variety | All identical | Mixed: app, kernel, tail, public | |
| 18 | +| BB command | `./chonk_bench --benchmark_filter=...` | `bb prove --scheme chonk --ivc_inputs_path ...` | |
| 19 | + |
| 20 | +## Step 1: Get pinned IVC inputs |
| 21 | + |
| 22 | +The real benchmark inputs are pinned to an S3 artifact keyed by a short hash. Download them: |
| 23 | + |
| 24 | +```bash |
| 25 | +cd barretenberg/cpp/scripts |
| 26 | +./test_chonk_standalone_vks_havent_changed.sh --download_pinned_inputs |
| 27 | +``` |
| 28 | + |
| 29 | +This populates `yarn-project/end-to-end/example-app-ivc-inputs-out/<flow>/ivc-inputs.msgpack`. |
| 30 | + |
| 31 | +Available flows (typical): |
| 32 | +- `ecdsar1+transfer_1_recursions+sponsored_fpc` |
| 33 | +- `schnorr+deploy_tokenContract_with_registration+sponsored_fpc` |
| 34 | +- `ecdsar1+amm_add_liquidity_1_recursions+sponsored_fpc` |
| 35 | +- `ecdsar1+transfer_1_recursions+private_fpc` |
| 36 | +- and more — run `ls yarn-project/end-to-end/example-app-ivc-inputs-out/` after downloading |
| 37 | + |
| 38 | +The pinned hash is maintained in `barretenberg/cpp/scripts/test_chonk_standalone_vks_havent_changed.sh` (variable `pinned_short_hash`). The S3 URL is: |
| 39 | +``` |
| 40 | +https://aztec-ci-artifacts.s3.us-east-2.amazonaws.com/protocol/bb-chonk-inputs-<hash>.tar.gz |
| 41 | +``` |
| 42 | + |
| 43 | +To update the pinned inputs (after protocol changes that affect VKs): |
| 44 | +```bash |
| 45 | +./test_chonk_standalone_vks_havent_changed.sh --update_inputs |
| 46 | +``` |
| 47 | + |
| 48 | +## Step 2: Build bb in release mode |
| 49 | + |
| 50 | +```bash |
| 51 | +cd barretenberg/cpp |
| 52 | +cmake --preset clang20-no-avm # AVM not needed for Chonk |
| 53 | +cmake --build --preset clang20-no-avm --target bb |
| 54 | +``` |
| 55 | + |
| 56 | +Build dir: `build-no-avm` (or `build` if using the `clang20` preset). If `bb` is not there, check the preset's `binaryDir` in `CMakePresets.json`. |
| 57 | + |
| 58 | +## Step 3: Run the benchmark |
| 59 | + |
| 60 | +**Always set `HARDWARE_CONCURRENCY=8` for local runs.** The remote benchmarking machine uses 16, but local/shared machines should use 8. See `/remote-bench` for remote execution. |
| 61 | + |
| 62 | +### Native |
| 63 | + |
| 64 | +```bash |
| 65 | +cd barretenberg/cpp |
| 66 | + |
| 67 | +FLOW="schnorr+deploy_tokenContract_with_registration+sponsored_fpc" |
| 68 | +OUTPUT_DIR="/tmp/chonk-bench-out" |
| 69 | +mkdir -p "$OUTPUT_DIR" |
| 70 | + |
| 71 | +HARDWARE_CONCURRENCY=8 ./build-no-avm/bin/bb prove \ |
| 72 | + -o "$OUTPUT_DIR" \ |
| 73 | + --ivc_inputs_path "../../yarn-project/end-to-end/example-app-ivc-inputs-out/$FLOW/ivc-inputs.msgpack" \ |
| 74 | + --scheme chonk \ |
| 75 | + -v \ |
| 76 | + --print_bench \ |
| 77 | + --bench_out_hierarchical "$OUTPUT_DIR/benchmark_breakdown.json" |
| 78 | +``` |
| 79 | + |
| 80 | +### WASM (via wasmtime) |
| 81 | + |
| 82 | +Build the WASM binary with threads enabled: |
| 83 | + |
| 84 | +```bash |
| 85 | +cd barretenberg/cpp |
| 86 | +cmake --preset wasm-threads |
| 87 | +cmake --build --preset wasm-threads --target bb |
| 88 | +``` |
| 89 | + |
| 90 | +Run via wasmtime (the `scripts/wasmtime.sh` wrapper sets standard flags): |
| 91 | + |
| 92 | +```bash |
| 93 | +cd barretenberg/cpp |
| 94 | + |
| 95 | +FLOW="schnorr+deploy_tokenContract_with_registration+sponsored_fpc" |
| 96 | +OUTPUT_DIR="/tmp/chonk-bench-wasm" |
| 97 | +mkdir -p "$OUTPUT_DIR" |
| 98 | + |
| 99 | +# Copy inputs to a working dir wasmtime can access |
| 100 | +cp "../../yarn-project/end-to-end/example-app-ivc-inputs-out/$FLOW/ivc-inputs.msgpack" "$OUTPUT_DIR/" |
| 101 | + |
| 102 | +cd "$OUTPUT_DIR" |
| 103 | +HARDWARE_CONCURRENCY=8 BB_BENCH=1 \ |
| 104 | + /path/to/barretenberg/cpp/scripts/wasmtime.sh \ |
| 105 | + /path/to/barretenberg/cpp/build-wasm-threads/bin/bb prove \ |
| 106 | + -o output \ |
| 107 | + --ivc_inputs_path ivc-inputs.msgpack \ |
| 108 | + --scheme chonk \ |
| 109 | + -v \ |
| 110 | + --print_bench \ |
| 111 | + --bench_out_hierarchical benchmark_breakdown.json |
| 112 | +``` |
| 113 | + |
| 114 | +The wasmtime wrapper sets: |
| 115 | +- `-Wthreads=y -Sthreads=y` — enable WASM threads and shared memory |
| 116 | +- `--env HARDWARE_CONCURRENCY` — thread count |
| 117 | +- `--env BB_BENCH` — enable operation counting (`ENABLE_WASM_BENCH=ON` is set by the `wasm-threads` preset) |
| 118 | +- `--dir=$HOME/.bb-crs --dir=.` — filesystem access for CRS and working directory |
| 119 | + |
| 120 | +## Local runs are noisy — average 3 runs |
| 121 | + |
| 122 | +Non-dedicated machines have variable CPU load. **Run the benchmark at least 3 times and average the results.** Only the remote benchmarking machine (see `/remote-bench` skill) provides stable, isolated CPU for single-run measurements. |
| 123 | + |
| 124 | +When iterating locally on profiling code changes, relative comparisons (before vs after your change) are still valid on noisy machines — just ensure you compare runs taken close together under similar load. |
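
The averaging step can be sketched as below. This is a minimal illustration, assuming you have noted the wall-clock time of each run in milliseconds; `RunSummary` and `summarize_runs` are hypothetical names and are not part of barretenberg.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper types for illustration only.
struct RunSummary {
    double mean_ms;    // arithmetic mean of the runs
    double stddev_ms;  // sample standard deviation (n - 1 denominator)
    double rel_spread; // stddev_ms / mean_ms, a rough noise indicator
};

// Summarize repeated wall-clock measurements of the same benchmark.
RunSummary summarize_runs(const std::vector<double>& times_ms)
{
    assert(times_ms.size() >= 2); // a spread needs at least two runs

    double sum = 0.0;
    for (double t : times_ms) {
        sum += t;
    }
    const double n = static_cast<double>(times_ms.size());
    const double mean = sum / n;

    double squared_deviation = 0.0;
    for (double t : times_ms) {
        squared_deviation += (t - mean) * (t - mean);
    }
    const double stddev = std::sqrt(squared_deviation / (n - 1.0));

    return RunSummary{ mean, stddev, stddev / mean };
}
```

If the relative spread is large compared with the change you are trying to measure, take more runs or move to the remote machine.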
| 125 | + |
| 126 | +## Using with the remote benchmarking machine |
| 127 | + |
| 128 | +For noise-free, publishable results, use the `/remote-bench` skill to run on the dedicated EC2 instance. The two skills compose naturally: |
| 129 | + |
| 130 | +1. `/benchmark-chonk download-inputs` — get pinned inputs locally |
| 131 | +2. `/remote-bench` — build locally, scp binary + inputs to remote, run there, copy results back |
| 132 | + |
| 133 | +See the `/remote-bench` skill for setup, lock management, and usage. |
| 134 | + |
| 135 | +## BB_BENCH instrumentation system |
| 136 | + |
| 137 | +### How it works |
| 138 | + |
| 139 | +`BB_BENCH` is an always-compiled, low-overhead RAII profiling system. |
| 140 | + |
| 141 | +**Header:** `barretenberg/cpp/src/barretenberg/common/bb_bench.hpp` |
| 142 | +**Implementation:** `barretenberg/cpp/src/barretenberg/common/bb_bench.cpp` |
| 143 | + |
| 144 | +**Macros:** |
| 145 | +```cpp |
| 146 | +BB_BENCH() // label = __func__ |
| 147 | +BB_BENCH_NAME("label") // custom label (preferred) |
| 148 | +BB_BENCH_ONLY_NAME("label") // no Tracy, no nesting — lightweight |
| 149 | +BB_BENCH_ENABLE_NESTING() // set parent context for child operations |
| 150 | +``` |
| 151 | + |
| 152 | +The macros create `BenchReporter` RAII objects that: |
| 153 | +1. On construction: capture parent context + start time |
| 154 | +2. On destruction: record elapsed time with parent association |
| 155 | +3. Build a hierarchical call tree automatically |
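
The three steps above can be illustrated with a small, single-threaded sketch of the RAII idea. It is not the real `BenchReporter`: `SketchReporter`, `SketchStats`, and `sketch_stats()` are hypothetical names, and the real implementation may differ (for example in how it handles threads and locking).

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; not the real bb_bench types.
struct SketchStats {
    std::uint64_t total_ns = 0;
    std::uint64_t count = 0;
};

// Keyed by (parent label, own label) so one label under two parents stays separate.
// Not thread-safe: this sketch assumes a single thread.
inline std::map<std::pair<std::string, std::string>, SketchStats>& sketch_stats()
{
    static std::map<std::pair<std::string, std::string>, SketchStats> stats;
    return stats;
}

class SketchReporter {
  public:
    explicit SketchReporter(std::string label)
        : label_(std::move(label))
        , parent_(current())                        // step 1: capture parent context
        , start_(std::chrono::steady_clock::now())  // step 1: capture start time
    {
        current() = this; // scopes opened from here on see this one as their parent
    }

    ~SketchReporter()
    {
        assert(current() == this); // scopes must close in strict LIFO order
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
        const std::string parent_label = parent_ ? parent_->label_ : std::string();
        // step 2: record elapsed time together with the parent association
        SketchStats& entry = sketch_stats()[{ parent_label, label_ }];
        entry.total_ns += static_cast<std::uint64_t>(ns);
        entry.count += 1;
        current() = parent_; // restore the enclosing scope
    }

    SketchReporter(const SketchReporter&) = delete;
    SketchReporter& operator=(const SketchReporter&) = delete;

  private:
    // Innermost live reporter on this thread; nullptr at top level.
    static SketchReporter*& current()
    {
        thread_local SketchReporter* active = nullptr;
        return active;
    }

    std::string label_;
    SketchReporter* parent_;
    std::chrono::steady_clock::time_point start_;
};
```

Declaring a `SketchReporter` at the top of a scope and another inside a nested scope yields one entry keyed `("", outer)` and one keyed `(outer, inner)`, which is the parent/child structure a hierarchical report needs (step 3).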
| 156 | + |
| 157 | +**Activation:** `BB_BENCH=1` env var, or `--print_bench` / `--bench_out_hierarchical` CLI flags. |
| 158 | + |
| 159 | +### Google Benchmark integration |
| 160 | + |
| 161 | +For `chonk_bench` and other `.bench.cpp` targets: |
| 162 | +```cpp |
| 163 | +#include "barretenberg/common/google_bb_bench.hpp" |
| 164 | + |
| 165 | +for (auto _ : state) { |
| 166 | + GOOGLE_BB_BENCH_REPORTER(state); // clears stats, collects on destruction |
| 167 | + // ... benchmark body ... |
| 168 | +} |
| 169 | +``` |
| 170 | + |
| 171 | +`GOOGLE_BB_BENCH_REPORTER(state)` creates a `GoogleBbBenchReporter` which: |
| 172 | +- **Constructor:** calls `GLOBAL_BENCH_STATS.clear()` — resets all accumulated stats |
| 173 | +- **Destructor:** aggregates stats into Google Benchmark counters (each operation becomes a counter with a `(s)` suffix) |
| 174 | + |
| 175 | +### Per-circuit / per-accumulate breakdown |
| 176 | + |
| 177 | +**Key function:** `bb::detail::GLOBAL_BENCH_STATS.clear()` |
| 178 | +(`barretenberg/cpp/src/barretenberg/common/bb_bench.cpp`) |
| 179 | + |
| 180 | +```cpp |
| 181 | +void GlobalBenchStatsContainer::clear() |
| 182 | +{ |
| 183 | + std::unique_lock<std::mutex> lock(mutex); |
| 184 | + for (std::shared_ptr<TimeStatsEntry>& entry : entries) { |
| 185 | + entry->count = TimeStats(); // resets to zero without losing entry structure |
| 186 | + } |
| 187 | +} |
| 188 | +``` |
| 189 | + |
| 190 | +**Usage pattern for per-circuit profiling:** |
| 191 | + |
| 192 | +The `--print_bench` output aggregates across all 19 circuits. To get per-circuit timing, temporarily instrument `barretenberg/cpp/src/barretenberg/bbapi/bbapi_chonk.cpp`: |
| 193 | + |
| 194 | +1. Add `#include <chrono>` at the top |
| 195 | +2. In `ChonkAccumulate::execute()`, wrap the `accumulate()` call: |
| 196 | + |
| 197 | +```cpp |
| 198 | + info("ChonkAccumulate - accumulating circuit '", request.loaded_circuit_name, "'"); |
| 199 | + bb::detail::GLOBAL_BENCH_STATS.clear(); |
| 200 | + auto circuit_start = std::chrono::steady_clock::now(); |
| 201 | + request.ivc_in_progress->accumulate(circuit, precomputed_vk); |
| 202 | + auto circuit_end = std::chrono::steady_clock::now(); |
| 203 | + auto circuit_ms = std::chrono::duration_cast<std::chrono::milliseconds>(circuit_end - circuit_start).count(); |
| 204 | + info("PER_CIRCUIT_TIME: circuit='", |
| 205 | + request.loaded_circuit_name, |
| 206 | + "' index=", |
| 207 | + request.ivc_stack_depth, |
| 208 | + " time_ms=", |
| 209 | + circuit_ms); |
| 210 | + bb::detail::GLOBAL_BENCH_STATS.print_aggregate_counts_hierarchical(std::cerr); |
| 211 | + request.ivc_stack_depth++; |
| 212 | +``` |
| 213 | + |
| 214 | +3. Rebuild with `ninja bb` from the build dir of the preset you configured (for example `cd build && ninja bb`); this recompiles only the changed file and relinks |
| 215 | +4. Run the benchmark, then grep for `PER_CIRCUIT_TIME` in the output |
| 216 | +5. **Revert the instrumentation** after collecting data: `git checkout -- barretenberg/cpp/src/barretenberg/bbapi/bbapi_chonk.cpp` |
| 217 | + |
| 218 | +This gives wall-clock time per circuit plus a per-circuit BB_BENCH breakdown. The `GLOBAL_BENCH_STATS.clear()` resets stats before each circuit so the hierarchical print shows only that circuit's work. |
| 219 | + |
| 220 | +The same pattern works at any granularity — clear before, print after. This is how `GOOGLE_BB_BENCH_REPORTER` works internally. |
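
As a rough illustration of "clear before, print after", the pattern can be wrapped in a scope guard. This is a sketch under stated assumptions: `SketchStatsContainer` and `StatsWindow` are hypothetical stand-ins, not the real `GLOBAL_BENCH_STATS` or `GoogleBbBenchReporter`.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a global stats container.
struct SketchStatsContainer {
    std::map<std::string, std::uint64_t> total_ns;

    // Zero every value but keep the entries, mirroring a reset that preserves structure.
    void clear()
    {
        for (auto& entry : total_ns) {
            entry.second = 0;
        }
    }

    // Print only operations that recorded time since the last clear.
    void print(std::ostream& os) const
    {
        for (const auto& entry : total_ns) {
            if (entry.second != 0) {
                os << entry.first << ": " << entry.second << " ns\n";
            }
        }
    }
};

// Scope guard: clears on entry, prints on exit, so the output covers only this window.
class StatsWindow {
  public:
    StatsWindow(SketchStatsContainer& stats, std::ostream& out)
        : stats_(stats)
        , out_(out)
    {
        stats_.clear();
    }

    ~StatsWindow() { stats_.print(out_); }

    StatsWindow(const StatsWindow&) = delete;
    StatsWindow& operator=(const StatsWindow&) = delete;

  private:
    SketchStatsContainer& stats_;
    std::ostream& out_;
};
```

Placing a `StatsWindow` around a single `accumulate()` call, a single phase, or a whole benchmark iteration changes only the size of the window; the mechanism stays the same.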
| 221 | + |
| 222 | +### Output formats |
| 223 | + |
| 224 | +| Flag | Format | Use case | |
| 225 | +|------|--------|----------| |
| 226 | +| `--print_bench` | Colorized tree on stderr | Human reading in terminal | |
| 227 | +| `--bench_out <file>` | Flat JSON `{"op": time_ns}` | Simple metrics | |
| 228 | +| `--bench_out_hierarchical <file>` | Nested JSON with parent/child | Dashboard, `extract_component_benchmarks.py` | |
| 229 | + |
| 230 | +The hierarchical JSON format: |
| 231 | +```json |
| 232 | +{ |
| 233 | + "operation_name": [ |
| 234 | + { |
| 235 | + "parent": "parent_operation", |
| 236 | + "time": 1234567890, |
| 237 | + "time_max": 1234567890, |
| 238 | + "time_mean": 1234567890.0, |
| 239 | + "time_stddev": 12345.0, |
| 240 | + "count": 5, |
| 241 | + "num_threads": 8 |
| 242 | + } |
| 243 | + ] |
| 244 | +} |
| 245 | +``` |
| 246 | + |
| 247 | +### Adding new instrumentation |
| 248 | + |
| 249 | +When profiling reveals "missing time" (the parent's time minus the sum of its children's times exceeds 20% of the parent's time), add `BB_BENCH_NAME` to the uninstrumented functions: |
| 250 | + |
| 251 | +```cpp |
| 252 | +#include "barretenberg/common/bb_bench.hpp" |
| 253 | + |
| 254 | +void MyProver::execute_phase() { |
| 255 | + BB_BENCH_NAME("MyProver::execute_phase"); |
| 256 | + BB_BENCH_ENABLE_NESTING(); // allow child operations to track this as parent |
| 257 | + // ... function body ... |
| 258 | +} |
| 259 | +``` |
| 260 | + |
| 261 | +**Rules:** |
| 262 | +- Place macro as the first statement in the scope you want to measure |
| 263 | +- Use descriptive names: `"Chonk::accumulate::oink_phase"` not `"oink"` |
| 264 | +- For templates: `BB_BENCH_NAME("ShpleminiProver<Flavor>::prove")`, since `__func__` yields only the unqualified function name (`prove`) |
| 265 | +- For sub-scopes, use braces to create a new scope |
| 266 | +- `BB_BENCH_ENABLE_NESTING()` is needed when you want child `BB_BENCH_NAME` calls inside this function to show this function as their parent in the hierarchy |
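
The "missing time" check from the start of this section can be sketched as follows. It assumes the hierarchical JSON has already been parsed into rows carrying the `parent` and `time` fields, and that `time` is in nanoseconds; `BenchRow` and `missing_fraction` are hypothetical names, not part of barretenberg.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One row of the hierarchical output: an operation's total time under a given parent.
// Field names mirror the JSON keys; the struct itself is hypothetical.
struct BenchRow {
    std::string name;
    std::string parent;    // empty for a root operation
    std::uint64_t time_ns; // assumed to be nanoseconds
};

// For each operation that has children, return the fraction of its time that the
// children do not account for: (parent - sum(children)) / parent.
std::map<std::string, double> missing_fraction(const std::vector<BenchRow>& rows)
{
    std::map<std::string, std::uint64_t> own_time;   // total time per operation
    std::map<std::string, std::uint64_t> child_time; // summed child time per parent
    for (const BenchRow& row : rows) {
        own_time[row.name] += row.time_ns;
        if (!row.parent.empty()) {
            child_time[row.parent] += row.time_ns;
        }
    }

    std::map<std::string, double> result;
    for (const auto& entry : child_time) {
        const auto it = own_time.find(entry.first);
        if (it == own_time.end() || it->second == 0) {
            continue; // parent has no recorded time of its own; nothing to compare
        }
        const double parent = static_cast<double>(it->second);
        result[entry.first] = (parent - static_cast<double>(entry.second)) / parent;
    }
    return result;
}
```

Operations whose fraction exceeds 0.2 are candidates for additional `BB_BENCH_NAME` scopes. A negative fraction is possible if child times are summed across threads, so treat such values as a sign that this simple comparison does not apply to that operation.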
| 267 | + |
| 268 | +### Extracting component benchmarks |
| 269 | + |
| 270 | +After running with `--bench_out_hierarchical`, extract key components: |
| 271 | + |
| 272 | +```bash |
| 273 | +python3 barretenberg/cpp/scripts/extract_component_benchmarks.py <output_dir> <name_path> |
| 274 | +``` |
| 275 | + |
| 276 | +This reads `benchmark_breakdown.json`, finds operations matching key components (sumcheck, pcs, pippenger, commitment, circuit, oink, compute), and appends them to `benchmarks.bench.json` with stacked chart markers for the dashboard. |
| 277 | + |
| 278 | +## A/B comparison scripts |
| 279 | + |
| 280 | +These use Google Benchmark's `compare.py` for statistical analysis. Note: these use the **remote machine** — see `/remote-bench`. |
| 281 | + |
| 282 | +| Script | What it compares | |
| 283 | +|--------|-----------------| |
| 284 | +| `scripts/compare_chonk_bench.sh` | Native ChonkBench/Full/6, branch vs baseline | |
| 285 | +| `scripts/compare_chonk_bench_wasm.sh` | WASM ChonkBench/Full/6, branch vs baseline | |
| 286 | +| `scripts/compare_branch_vs_baseline_remote.sh` | Generic native A/B | |
| 287 | +| `scripts/compare_branch_vs_baseline_remote_wasm.sh` | Generic WASM A/B | |
| 288 | + |
| 289 | +## Key scripts reference |
| 290 | + |
| 291 | +| Script | Purpose | |
| 292 | +|--------|---------| |
| 293 | +| `scripts/test_chonk_standalone_vks_havent_changed.sh` | Download/update/verify pinned inputs | |
| 294 | +| `scripts/ci_benchmark_ivc_flows.sh` | CI: proves a flow, extracts components, uploads to dashboard | |
| 295 | +| `scripts/benchmark_example_ivc_flow_remote.sh` | Proves a pinned flow on the remote machine (uses `/remote-bench`) | |
| 296 | +| `scripts/benchmark_chonk.sh` | Synthetic `chonk_bench` on remote | |
| 297 | +| `scripts/wasmtime.sh` | wasmtime wrapper with standard flags | |
| 298 | +| `scripts/extract_component_benchmarks.py` | Extract component timings from hierarchical breakdown | |
| 299 | + |
| 300 | +## Tips |
| 301 | + |
| 302 | +- **Set `HARDWARE_CONCURRENCY` explicitly.** Use 8 on local/shared machines and 16 on the remote benchmarking machine. |
| 303 | +- **Local iteration is fine** — you can build, instrument, and run locally. Just average 3 runs for reliable numbers, or use the remote machine via `/remote-bench` for single-run accuracy. |
| 304 | +- **Use `./bootstrap.sh` for initial builds** — it downloads cached artifacts and avoids build issues. Use `cmake --preset clang20 && cd build && ninja bb` for incremental rebuilds after code changes. |
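| 305 | +- **Check the build dir.** The `clang20` preset outputs to `build/`. The commands above assume `clang20-no-avm` outputs to `build-no-avm/`; if it also uses `build/` in your checkout (the preset disables AVM at the CMake level), adjust the paths accordingly. The `binaryDir` field in `CMakePresets.json` is authoritative. |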
| 306 | +- **If the zig cache breaks** (missing `libubsan_rt.a` errors), delete `build/` and reconfigure: `rm -rf build && cmake --preset clang20`. |
| 307 | +- **WASM preset:** `wasm-threads`. Build dir is `build-wasm-threads/`. The preset enables `ENABLE_WASM_BENCH=ON` automatically. |
| 308 | +- **WASM is ~2.8x slower than native** — this ratio is consistent across all circuit types. |
| 309 | +- **CRS:** Ensure `~/.bb-crs` exists. For WASM, wasmtime needs `--dir=$HOME/.bb-crs`. |
| 310 | +- **`BB_BENCH=1` vs `--print_bench`:** Either activates profiling. `--print_bench` also triggers the hierarchical tree output to stderr. In `chonk_bench`, the `GOOGLE_BB_BENCH_REPORTER` macro enables it automatically when `BB_BENCH=1` is set. |
| 311 | +- **Dashboard:** CI uploads breakdown data to `bench/bb-breakdown/` on S3. The dashboard at `ci3/dashboard/chonk-breakdowns/` visualizes it. |
| 312 | +- **Rebuilding after instrumentation changes:** Only `ninja bb` is needed — no need to reconfigure. |
| 313 | + |
| 314 | +## Presenting results |
| 315 | + |
| 316 | +When sharing benchmark results, create an **HTML gist** with an interactive visualization. Include: |
| 317 | + |
| 318 | +- **Native vs WASM tabs** with per-circuit comparison table |
| 319 | +- **Stacked bar charts** showing time distribution across circuits |
| 320 | +- **Aggregation by circuit type** (kernel vs app vs infra) |
| 321 | +- **Summary cards** with total time, slowdown ratio, and heaviest circuit |
| 322 | +- **Color-coded circuit types**: kernel (blue), app (red), infra (gray) |
| 323 | + |
| 324 | +Use `create_gist` / `update_gist` with a `.html` file. GitHub displays a gist's HTML as source rather than rendering it, so viewers need to save the file and open it in a browser to use the tabs and tooltips. This is still much more useful than plain markdown tables for benchmark data. |