Commit b261458

feat: merge-train/barretenberg (#22507)
BEGIN_COMMIT_OVERRIDE
feat: add benchmark tag to ECCOpQueue::mul_accumulate (#22506)
fix: skip heavy recursion tests in debug builds to fix nightly CI (#22521)
chore: add code comment guideline to barretenberg CLAUDE.md (#22517)
chore: use fixed lookup tables in recursive IPA verifier (#22320)
END_COMMIT_OVERRIDE
2 parents 3c0b870 + 7028539 commit b261458

15 files changed

Lines changed: 1196 additions & 19 deletions

Lines changed: 324 additions & 0 deletions
@@ -0,0 +1,324 @@
1+
---
2+
name: benchmark-chonk
3+
description: Run realistic Chonk (client IVC) benchmarks using pinned protocol inputs. Covers native and WASM proving, per-circuit breakdowns, BB_BENCH instrumentation, and profiling code augmentation. Use when asked to benchmark, profile, or measure Chonk proving performance.
4+
argument-hint: <action> e.g. "run", "compare", "wasm", "instrument <area>", "per-circuit", "download-inputs"
5+
---
6+
7+
# Benchmark Chonk
8+
9+
Run realistic Chonk IVC benchmarks using **pinned protocol inputs** (real transaction flows captured from end-to-end tests), not the synthetic `chonk_bench` target. The synthetic benchmark (`chonk_bench`) uses trivially small mock circuits — it is useful for quick regression checks but does NOT reflect production proving performance. Users invoking `/benchmark-chonk` want the real thing.
10+
11+
## What makes this different from `chonk_bench`
12+
13+
| | `chonk_bench` (synthetic) | This skill (realistic) |
14+
|---|---|---|
15+
| Input data | Mock circuits via `test_bench_shared.hpp` | Pinned msgpack from real Aztec transactions |
16+
| Circuit count | 2 or 5 tiny circuits | Full transaction flows (10+ circuits) |
17+
| Circuit variety | All identical | Mixed: app, kernel, tail, public |
18+
| BB command | `./chonk_bench --benchmark_filter=...` | `bb prove --scheme chonk --ivc_inputs_path ...` |
19+
20+
## Step 1: Get pinned IVC inputs
21+
22+
The real benchmark inputs are pinned to an S3 artifact keyed by a short hash. Download them:
23+
24+
```bash
25+
cd barretenberg/cpp/scripts
26+
./test_chonk_standalone_vks_havent_changed.sh --download_pinned_inputs
27+
```
28+
29+
This populates `yarn-project/end-to-end/example-app-ivc-inputs-out/<flow>/ivc-inputs.msgpack`.
30+
31+
Available flows (typical):
32+
- `ecdsar1+transfer_1_recursions+sponsored_fpc`
33+
- `schnorr+deploy_tokenContract_with_registration+sponsored_fpc`
34+
- `ecdsar1+amm_add_liquidity_1_recursions+sponsored_fpc`
35+
- `ecdsar1+transfer_1_recursions+private_fpc`
36+
- and more — run `ls yarn-project/end-to-end/example-app-ivc-inputs-out/` after downloading
37+
38+
The pinned hash is maintained in `barretenberg/cpp/scripts/test_chonk_standalone_vks_havent_changed.sh` (variable `pinned_short_hash`). The S3 URL is:
39+
```
40+
https://aztec-ci-artifacts.s3.us-east-2.amazonaws.com/protocol/bb-chonk-inputs-<hash>.tar.gz
41+
```
42+
43+
To update the pinned inputs (after protocol changes that affect VKs):
44+
```bash
45+
./test_chonk_standalone_vks_havent_changed.sh --update_inputs
46+
```
47+
48+
## Step 2: Build bb in release mode
49+
50+
```bash
51+
cd barretenberg/cpp
52+
cmake --preset clang20-no-avm # AVM not needed for Chonk
53+
cmake --build --preset clang20-no-avm --target bb
54+
```
55+
56+
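Build dir: the commands below assume `build-no-avm`. The Tips section states that both the `clang20` and `clang20-no-avm` presets output to `build/`, so check which directory your preset created and adjust the paths accordingly.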
57+
58+
## Step 3: Run the benchmark
59+
60+
**Always set `HARDWARE_CONCURRENCY=8` for local runs.** The remote benchmarking machine uses 16, but local/shared machines should use 8. See `/remote-bench` for remote execution.
61+
62+
### Native
63+
64+
```bash
65+
cd barretenberg/cpp
66+
67+
FLOW="schnorr+deploy_tokenContract_with_registration+sponsored_fpc"
68+
OUTPUT_DIR="/tmp/chonk-bench-out"
69+
mkdir -p $OUTPUT_DIR
70+
71+
HARDWARE_CONCURRENCY=8 ./build-no-avm/bin/bb prove \
72+
-o $OUTPUT_DIR \
73+
--ivc_inputs_path ../../yarn-project/end-to-end/example-app-ivc-inputs-out/$FLOW/ivc-inputs.msgpack \
74+
--scheme chonk \
75+
-v \
76+
--print_bench \
77+
--bench_out_hierarchical $OUTPUT_DIR/benchmark_breakdown.json
78+
```
79+
80+
### WASM (via wasmtime)
81+
82+
Build the WASM binary with threads enabled:
83+
84+
```bash
85+
cd barretenberg/cpp
86+
cmake --preset wasm-threads
87+
cmake --build --preset wasm-threads --target bb
88+
```
89+
90+
Run via wasmtime (the `scripts/wasmtime.sh` wrapper sets standard flags):
91+
92+
```bash
93+
cd barretenberg/cpp
94+
95+
FLOW="schnorr+deploy_tokenContract_with_registration+sponsored_fpc"
96+
OUTPUT_DIR="/tmp/chonk-bench-wasm"
97+
mkdir -p $OUTPUT_DIR
98+
99+
# Copy inputs to a working dir wasmtime can access
100+
cp ../../yarn-project/end-to-end/example-app-ivc-inputs-out/$FLOW/ivc-inputs.msgpack $OUTPUT_DIR/
101+
102+
cd $OUTPUT_DIR
103+
HARDWARE_CONCURRENCY=8 BB_BENCH=1 \
104+
/path/to/barretenberg/cpp/scripts/wasmtime.sh \
105+
/path/to/barretenberg/cpp/build-wasm-threads/bin/bb prove \
106+
-o output \
107+
--ivc_inputs_path ivc-inputs.msgpack \
108+
--scheme chonk \
109+
-v \
110+
--print_bench \
111+
--bench_out_hierarchical benchmark_breakdown.json
112+
```
113+
114+
The wasmtime wrapper sets:
115+
- `-Wthreads=y -Sthreads=y` — enable WASM threads and shared memory
116+
- `--env HARDWARE_CONCURRENCY` — thread count
117+
- `--env BB_BENCH` — enable operation counting (`ENABLE_WASM_BENCH=ON` is set by the `wasm-threads` preset)
118+
- `--dir=$HOME/.bb-crs --dir=.` — filesystem access for CRS and working directory
119+
120+
## Local runs are noisy — average 3 runs
121+
122+
Non-dedicated machines have variable CPU load. **Run the benchmark at least 3 times and average the results.** Only the remote benchmarking machine (see `/remote-bench` skill) provides stable, isolated CPU for single-run measurements.
123+
124+
When iterating locally on profiling code changes, relative comparisons (before vs after your change) are still valid on noisy machines — just ensure you compare runs taken close together under similar load.
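A minimal Python sketch of the averaging step, assuming you have noted the total proving time of each run yourself. The helper name `summarize_runs` and the sample timings are illustrative placeholders, not real measurements or part of the repository.

```python
import statistics


def summarize_runs(times_ms):
    """Summarize wall-clock times (in ms) from repeated runs of the same benchmark."""
    if not times_ms:
        raise ValueError("need at least one run")
    mean = statistics.mean(times_ms)
    # Sample standard deviation needs at least two data points.
    stdev = statistics.stdev(times_ms) if len(times_ms) > 1 else 0.0
    return {
        "runs": len(times_ms),
        "mean_ms": mean,
        "stdev_ms": stdev,
        # Coefficient of variation: a rough indicator of how noisy the machine was.
        "cv": stdev / mean if mean else 0.0,
    }


# Placeholder timings for three local runs (not real data).
summary = summarize_runs([100.0, 110.0, 90.0])
print(summary)
```

If the coefficient of variation is large relative to the change you are trying to measure, more runs or the remote machine are probably needed.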
125+
126+
## Using with the remote benchmarking machine
127+
128+
For noise-free, publishable results, use the `/remote-bench` skill to run on the dedicated EC2 instance. The two skills compose naturally:
129+
130+
1. `/benchmark-chonk download-inputs` — get pinned inputs locally
131+
2. `/remote-bench` — build locally, scp binary + inputs to remote, run there, copy results back
132+
133+
See the `/remote-bench` skill for setup, lock management, and usage.
134+
135+
## BB_BENCH instrumentation system
136+
137+
### How it works
138+
139+
`BB_BENCH` is an always-compiled, low-overhead RAII profiling system.
140+
141+
**Header:** `barretenberg/cpp/src/barretenberg/common/bb_bench.hpp`
142+
**Implementation:** `barretenberg/cpp/src/barretenberg/common/bb_bench.cpp`
143+
144+
**Macros:**
145+
```cpp
146+
BB_BENCH() // label = __func__
147+
BB_BENCH_NAME("label") // custom label (preferred)
148+
BB_BENCH_ONLY_NAME("label") // no Tracy, no nesting — lightweight
149+
BB_BENCH_ENABLE_NESTING() // set parent context for child operations
150+
```
151+
152+
The macros create `BenchReporter` RAII objects that:
153+
1. On construction: capture parent context + start time
154+
2. On destruction: record elapsed time with parent association
155+
3. Build a hierarchical call tree automatically
156+
157+
**Activation:** `BB_BENCH=1` env var, or `--print_bench` / `--bench_out_hierarchical` CLI flags.
158+
159+
### Google Benchmark integration
160+
161+
For `chonk_bench` and other `.bench.cpp` targets:
162+
```cpp
163+
#include "barretenberg/common/google_bb_bench.hpp"
164+
165+
for (auto _ : state) {
166+
GOOGLE_BB_BENCH_REPORTER(state); // clears stats, collects on destruction
167+
// ... benchmark body ...
168+
}
169+
```
170+
171+
`GOOGLE_BB_BENCH_REPORTER(state)` creates a `GoogleBbBenchReporter` which:
172+
- **Constructor:** calls `GLOBAL_BENCH_STATS.clear()` — resets all accumulated stats
173+
- **Destructor:** aggregates stats into Google Benchmark counters (each operation becomes a `(s)` suffixed counter)
174+
175+
### Per-circuit / per-accumulate breakdown
176+
177+
**Key function:** `bb::detail::GLOBAL_BENCH_STATS.clear()`
178+
(`barretenberg/cpp/src/barretenberg/common/bb_bench.cpp`)
179+
180+
```cpp
181+
void GlobalBenchStatsContainer::clear()
182+
{
183+
std::unique_lock<std::mutex> lock(mutex);
184+
for (std::shared_ptr<TimeStatsEntry>& entry : entries) {
185+
entry->count = TimeStats(); // resets to zero without losing entry structure
186+
}
187+
}
188+
```
189+
190+
**Usage pattern for per-circuit profiling:**
191+
192+
The `--print_bench` output aggregates across all 19 circuits. To get per-circuit timing, temporarily instrument `barretenberg/cpp/src/barretenberg/bbapi/bbapi_chonk.cpp`:
193+
194+
1. Add `#include <chrono>` at the top
195+
2. In `ChonkAccumulate::execute()`, wrap the `accumulate()` call:
196+
197+
```cpp
198+
info("ChonkAccumulate - accumulating circuit '", request.loaded_circuit_name, "'");
199+
bb::detail::GLOBAL_BENCH_STATS.clear();
200+
auto circuit_start = std::chrono::steady_clock::now();
201+
request.ivc_in_progress->accumulate(circuit, precomputed_vk);
202+
auto circuit_end = std::chrono::steady_clock::now();
203+
auto circuit_ms = std::chrono::duration_cast<std::chrono::milliseconds>(circuit_end - circuit_start).count();
204+
info("PER_CIRCUIT_TIME: circuit='",
205+
request.loaded_circuit_name,
206+
"' index=",
207+
request.ivc_stack_depth,
208+
" time_ms=",
209+
circuit_ms);
210+
bb::detail::GLOBAL_BENCH_STATS.print_aggregate_counts_hierarchical(std::cerr);
211+
request.ivc_stack_depth++;
212+
```
213+
214+
3. Rebuild with `ninja bb` from the build directory you configured in Step 2, e.g. `cd build && ninja bb` (only recompiles the changed file and relinks)
215+
4. Run the benchmark, then grep for `PER_CIRCUIT_TIME` in the output
216+
5. **Revert the instrumentation** after collecting data: `git checkout -- barretenberg/cpp/src/barretenberg/bbapi/bbapi_chonk.cpp`
217+
218+
This gives wall-clock time per circuit plus a per-circuit BB_BENCH breakdown. The `GLOBAL_BENCH_STATS.clear()` resets stats before each circuit so the hierarchical print shows only that circuit's work.
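Instead of reading the `grep` output by eye, the `PER_CIRCUIT_TIME` lines can be collected with a short Python script. This is a sketch that assumes `info()` concatenates its arguments without separators, so each line looks like `PER_CIRCUIT_TIME: circuit='<name>' index=<n> time_ms=<ms>`. The helper `parse_per_circuit`, the circuit names, and the timings are hypothetical.

```python
import re

# Matches the line produced by the temporary info() call shown above
# (assumed format; adjust the pattern if your output differs).
PER_CIRCUIT_RE = re.compile(
    r"PER_CIRCUIT_TIME: circuit='(?P<name>[^']*)' index=(?P<index>\d+) time_ms=(?P<ms>\d+)"
)


def parse_per_circuit(log_text):
    """Return a list of (index, circuit_name, time_ms) tuples in log order."""
    rows = []
    for line in log_text.splitlines():
        match = PER_CIRCUIT_RE.search(line)
        if match:
            rows.append((int(match["index"]), match["name"], int(match["ms"])))
    return rows


# Hypothetical log excerpt; circuit names and timings are made up.
sample_log = """\
ChonkAccumulate - accumulating circuit 'app_a'
PER_CIRCUIT_TIME: circuit='app_a' index=0 time_ms=120
PER_CIRCUIT_TIME: circuit='kernel_b' index=1 time_ms=340
"""

rows = parse_per_circuit(sample_log)
slowest = max(rows, key=lambda r: r[2])  # circuit with the largest time_ms
total_ms = sum(r[2] for r in rows)
print(rows, slowest, total_ms)
```

In practice you would pass the captured stderr of the `bb prove` run as `log_text` rather than an inline string.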
219+
220+
The same pattern works at any granularity — clear before, print after. This is how `GOOGLE_BB_BENCH_REPORTER` works internally.
221+
222+
### Output formats
223+
224+
| Flag | Format | Use case |
225+
|------|--------|----------|
226+
| `--print_bench` | Colorized tree on stderr | Human reading in terminal |
227+
| `--bench_out <file>` | Flat JSON `{"op": time_ns}` | Simple metrics |
228+
| `--bench_out_hierarchical <file>` | Nested JSON with parent/child | Dashboard, `extract_component_benchmarks.py` |
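For a quick before/after check on local runs, two `--bench_out` files can be diffed with a few lines of Python. This is a sketch that assumes the flat file is a single JSON object mapping operation name to nanoseconds, as the table describes. The helper `compare_flat` and the sample values are hypothetical.

```python
def compare_flat(before, after):
    """Compare two flat {"op": time_ns} mappings.

    Returns (op, before_ns, after_ns, percent_change) rows for operations
    present in both mappings, sorted with the largest regression first.
    """
    rows = []
    for op in sorted(before.keys() & after.keys()):
        b, a = before[op], after[op]
        if b > 0:  # skip zero baselines to avoid division by zero
            rows.append((op, b, a, (a - b) / b * 100.0))
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows


# Placeholder values (not real data). With real files you would load each
# mapping via json.load(open(path)).
before = {"sumcheck": 200, "pcs": 100}
after = {"sumcheck": 150, "pcs": 120}

rows = compare_flat(before, after)
for op, b, a, pct in rows:
    print(f"{op}: {b} -> {a} ns ({pct:+.1f}%)")
```

Operations that appear in only one of the two files are ignored here; listing them separately may be useful when instrumentation changed between runs.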
229+
230+
The hierarchical JSON format:
231+
```json
232+
{
233+
"operation_name": [
234+
{
235+
"parent": "parent_operation",
236+
"time": 1234567890,
237+
"time_max": 1234567890,
238+
"time_mean": 1234567890.0,
239+
"time_stddev": 12345.0,
240+
"count": 5,
241+
"num_threads": 8
242+
}
243+
]
244+
}
245+
```
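The hierarchical file can also be scanned for under-instrumented parents, meaning operations whose direct children account for much less than the parent's own time. The sketch below assumes the layout shown above, that the `time` values of all entries for an operation can be summed, and that root entries have an empty or null `parent`. None of this is confirmed against `bb_bench.cpp`, and `find_missing_time` is a hypothetical helper.

```python
from collections import defaultdict


def find_missing_time(breakdown, threshold=0.20):
    """Return {parent: (total_ns, children_ns, missing_fraction)} for parents
    whose direct children cover less than (1 - threshold) of their time."""
    total = defaultdict(int)     # operation name -> summed "time" of its entries
    children = defaultdict(int)  # parent name -> summed "time" of direct children
    for op, entries in breakdown.items():
        for entry in entries:
            total[op] += entry["time"]
            parent = entry.get("parent")
            if parent:  # skip root entries (assumed to have no parent)
                children[parent] += entry["time"]

    report = {}
    for parent, child_ns in children.items():
        parent_ns = total.get(parent, 0)
        if parent_ns <= 0:
            continue  # parent has no recorded time of its own
        missing = (parent_ns - child_ns) / parent_ns
        if missing > threshold:
            report[parent] = (parent_ns, child_ns, missing)
    return report


# Placeholder breakdown (not real data). With a real run you would load
# benchmark_breakdown.json via json.load(open(path)).
sample = {
    "prove": [{"parent": None, "time": 1000}],
    "sumcheck": [{"parent": "prove", "time": 300}],
    "pcs": [{"parent": "prove", "time": 400}],
}

report = find_missing_time(sample)
print(report)
```

Operations with no instrumented children are not reported at all, and with multi-threaded children the summed child time may exceed the parent's time, so treat the output as a hint about where to look rather than an exact accounting.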
246+
247+
### Adding new instrumentation
248+
249+
When profiling reveals "missing time" (the parent's time minus the sum of its children's times exceeds 20% of the parent's time), add `BB_BENCH_NAME` to the uninstrumented functions:
250+
251+
```cpp
252+
#include "barretenberg/common/bb_bench.hpp"
253+
254+
void MyProver::execute_phase() {
255+
BB_BENCH_NAME("MyProver::execute_phase");
256+
BB_BENCH_ENABLE_NESTING(); // allow child operations to track this as parent
257+
// ... function body ...
258+
}
259+
```
260+
261+
**Rules:**
262+
- Place macro as the first statement in the scope you want to measure
263+
- Use descriptive names: `"Chonk::accumulate::oink_phase"` not `"oink"`
264+
- For templates: `BB_BENCH_NAME("ShpleminiProver<Flavor>::prove")`, since `__func__` yields only the bare function name (here `prove`), which is ambiguous across classes and instantiations
265+
- To measure only part of a function, wrap that part in braces and place the macro as the first statement of the new scope
266+
- `BB_BENCH_ENABLE_NESTING()` is needed when you want child `BB_BENCH_NAME` calls inside this function to show this function as their parent in the hierarchy
267+
268+
### Extracting component benchmarks
269+
270+
After running with `--bench_out_hierarchical`, extract key components:
271+
272+
```bash
273+
python3 barretenberg/cpp/scripts/extract_component_benchmarks.py <output_dir> <name_path>
274+
```
275+
276+
This reads `benchmark_breakdown.json`, finds operations matching key components (sumcheck, pcs, pippenger, commitment, circuit, oink, compute), and appends them to `benchmarks.bench.json` with stacked chart markers for the dashboard.
277+
278+
## A/B comparison scripts
279+
280+
These use Google Benchmark's `compare.py` for statistical analysis. Note: these use the **remote machine** — see `/remote-bench`.
281+
282+
| Script | What it compares |
283+
|--------|-----------------|
284+
| `scripts/compare_chonk_bench.sh` | Native ChonkBench/Full/6, branch vs baseline |
285+
| `scripts/compare_chonk_bench_wasm.sh` | WASM ChonkBench/Full/6, branch vs baseline |
286+
| `scripts/compare_branch_vs_baseline_remote.sh` | Generic native A/B |
287+
| `scripts/compare_branch_vs_baseline_remote_wasm.sh` | Generic WASM A/B |
288+
289+
## Key scripts reference
290+
291+
| Script | Purpose |
292+
|--------|---------|
293+
| `scripts/test_chonk_standalone_vks_havent_changed.sh` | Download/update/verify pinned inputs |
294+
| `scripts/ci_benchmark_ivc_flows.sh` | CI: proves a flow, extracts components, uploads to dashboard |
295+
| `scripts/benchmark_example_ivc_flow_remote.sh` | Proves a pinned flow on the remote machine (uses `/remote-bench`) |
296+
| `scripts/benchmark_chonk.sh` | Synthetic `chonk_bench` on remote |
297+
| `scripts/wasmtime.sh` | wasmtime wrapper with standard flags |
298+
| `scripts/extract_component_benchmarks.py` | Extract component timings from hierarchical breakdown |
299+
300+
## Tips
301+
302+
- **`HARDWARE_CONCURRENCY=8` for local or shared machines, `16` for the remote benchmarking machine.** Always set this explicitly.
303+
- **Local iteration is fine** — you can build, instrument, and run locally. Just average 3 runs for reliable numbers, or use the remote machine via `/remote-bench` for single-run accuracy.
304+
- **Use `./bootstrap.sh` for initial builds** — it downloads cached artifacts and avoids build issues. Use `cmake --preset clang20 && cd build && ninja bb` for incremental rebuilds after code changes.
305+
- **Build dir is `build/`** — the `clang20` preset outputs to `build/`, not `build-no-avm`. The `clang20-no-avm` preset also uses `build/` (it disables AVM at cmake level, not via directory name).
306+
- **If the zig cache breaks** (missing `libubsan_rt.a` errors), delete `build/` and reconfigure: `rm -rf build && cmake --preset clang20`.
307+
- **WASM preset:** `wasm-threads`. Build dir is `build-wasm-threads/`. The preset enables `ENABLE_WASM_BENCH=ON` automatically.
308+
- **WASM is ~2.8x slower than native** — this ratio is consistent across all circuit types.
309+
- **CRS:** Ensure `~/.bb-crs` exists. For WASM, wasmtime needs `--dir=$HOME/.bb-crs`.
310+
- **`BB_BENCH=1` vs `--print_bench`:** Either activates profiling. `--print_bench` also triggers the hierarchical tree output to stderr. In `chonk_bench`, the `GOOGLE_BB_BENCH_REPORTER` macro enables it automatically when `BB_BENCH=1` is set.
311+
- **Dashboard:** CI uploads breakdown data to `bench/bb-breakdown/` on S3. The dashboard at `ci3/dashboard/chonk-breakdowns/` visualizes it.
312+
- **Rebuilding after instrumentation changes:** Only `ninja bb` is needed — no need to reconfigure.
313+
314+
## Presenting results
315+
316+
When sharing benchmark results, create an **HTML gist** with an interactive visualization. Include:
317+
318+
- **Native vs WASM tabs** with per-circuit comparison table
319+
- **Stacked bar charts** showing time distribution across circuits
320+
- **Aggregation by circuit type** (kernel vs app vs infra)
321+
- **Summary cards** with total time, slowdown ratio, and heaviest circuit
322+
- **Color-coded circuit types**: kernel (blue), app (red), infra (gray)
323+
324+
Use `create_gist` / `update_gist` with a `.html` file. GitHub shows an HTML gist as source rather than rendering it, so viewers need to download the file and open it in a browser to interact with tabs and tooltips. This is much more useful than plain markdown tables for benchmark data.
