324 changes: 324 additions & 0 deletions barretenberg/.claude/skills/benchmark-chonk/SKILL.md
@@ -0,0 +1,324 @@
---
name: benchmark-chonk
description: Run realistic Chonk (client IVC) benchmarks using pinned protocol inputs. Covers native and WASM proving, per-circuit breakdowns, BB_BENCH instrumentation, and profiling code augmentation. Use when asked to benchmark, profile, or measure Chonk proving performance.
argument-hint: <action> e.g. "run", "compare", "wasm", "instrument <area>", "per-circuit", "download-inputs"
---

# Benchmark Chonk

Run realistic Chonk IVC benchmarks using **pinned protocol inputs** (real transaction flows captured from end-to-end tests), not the synthetic `chonk_bench` target. The synthetic benchmark (`chonk_bench`) uses trivially small mock circuits — it is useful for quick regression checks but does NOT reflect production proving performance. Users invoking `/benchmark-chonk` want the real thing.

## What makes this different from `chonk_bench`

| | `chonk_bench` (synthetic) | This skill (realistic) |
|---|---|---|
| Input data | Mock circuits via `test_bench_shared.hpp` | Pinned msgpack from real Aztec transactions |
| Circuit count | 2 or 5 tiny circuits | Full transaction flows (10+ circuits) |
| Circuit variety | All identical | Mixed: app, kernel, tail, public |
| BB command | `./chonk_bench --benchmark_filter=...` | `bb prove --scheme chonk --ivc_inputs_path ...` |

## Step 1: Get pinned IVC inputs

The real benchmark inputs are pinned to an S3 artifact keyed by a short hash. Download them:

```bash
cd barretenberg/cpp/scripts
./test_chonk_standalone_vks_havent_changed.sh --download_pinned_inputs
```

This populates `yarn-project/end-to-end/example-app-ivc-inputs-out/<flow>/ivc-inputs.msgpack`.

Available flows (typical):
- `ecdsar1+transfer_1_recursions+sponsored_fpc`
- `schnorr+deploy_tokenContract_with_registration+sponsored_fpc`
- `ecdsar1+amm_add_liquidity_1_recursions+sponsored_fpc`
- `ecdsar1+transfer_1_recursions+private_fpc`
- and more — run `ls yarn-project/end-to-end/example-app-ivc-inputs-out/` after downloading
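To pick a flow programmatically rather than by eye, a minimal sketch like the following lists every flow directory that actually contains an `ivc-inputs.msgpack`. The function name `list_flows` is hypothetical; only the directory layout is taken from the text above.

```python
from pathlib import Path


def list_flows(inputs_root):
    """Return flow names whose directory contains ivc-inputs.msgpack, sorted."""
    root = Path(inputs_root)
    if not root.is_dir():
        # Inputs have not been downloaded yet (or the path is wrong).
        return []
    return sorted(
        child.name
        for child in root.iterdir()
        if (child / "ivc-inputs.msgpack").is_file()
    )
```

Call it from the repository root as `list_flows("yarn-project/end-to-end/example-app-ivc-inputs-out")`; an empty list usually means the download step has not been run.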

The pinned hash is maintained in `barretenberg/cpp/scripts/test_chonk_standalone_vks_havent_changed.sh` (variable `pinned_short_hash`). The S3 URL is:
```
https://aztec-ci-artifacts.s3.us-east-2.amazonaws.com/protocol/bb-chonk-inputs-<hash>.tar.gz
```

To update the pinned inputs (after protocol changes that affect VKs):
```bash
./test_chonk_standalone_vks_havent_changed.sh --update_inputs
```

## Step 2: Build bb in release mode

```bash
cd barretenberg/cpp
cmake --preset clang20-no-avm # AVM not needed for Chonk
cmake --build --preset clang20-no-avm --target bb
```

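Build dir: the commands below assume `build-no-avm` (or `build` if using the `clang20` preset). If the path does not exist after building, check the preset's `binaryDir` in `CMakePresets.json` and adjust the paths accordingly.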

## Step 3: Run the benchmark

**Always set `HARDWARE_CONCURRENCY=8` for local runs.** The remote benchmarking machine uses 16, but local/shared machines should use 8. See `/remote-bench` for remote execution.

### Native

```bash
cd barretenberg/cpp

FLOW="schnorr+deploy_tokenContract_with_registration+sponsored_fpc"
OUTPUT_DIR="/tmp/chonk-bench-out"
mkdir -p "$OUTPUT_DIR"

HARDWARE_CONCURRENCY=8 ./build-no-avm/bin/bb prove \
  -o "$OUTPUT_DIR" \
  --ivc_inputs_path "../../yarn-project/end-to-end/example-app-ivc-inputs-out/$FLOW/ivc-inputs.msgpack" \
  --scheme chonk \
  -v \
  --print_bench \
  --bench_out_hierarchical "$OUTPUT_DIR/benchmark_breakdown.json"
```

### WASM (via wasmtime)

Build the WASM binary with threads enabled:

```bash
cd barretenberg/cpp
cmake --preset wasm-threads
cmake --build --preset wasm-threads --target bb
```

Run via wasmtime (the `scripts/wasmtime.sh` wrapper sets standard flags):

```bash
cd barretenberg/cpp

FLOW="schnorr+deploy_tokenContract_with_registration+sponsored_fpc"
OUTPUT_DIR="/tmp/chonk-bench-wasm"
mkdir -p "$OUTPUT_DIR"

# Copy inputs to a working dir wasmtime can access
cp "../../yarn-project/end-to-end/example-app-ivc-inputs-out/$FLOW/ivc-inputs.msgpack" "$OUTPUT_DIR/"

cd "$OUTPUT_DIR"
HARDWARE_CONCURRENCY=8 BB_BENCH=1 \
  /path/to/barretenberg/cpp/scripts/wasmtime.sh \
  /path/to/barretenberg/cpp/build-wasm-threads/bin/bb prove \
  -o output \
  --ivc_inputs_path ivc-inputs.msgpack \
  --scheme chonk \
  -v \
  --print_bench \
  --bench_out_hierarchical benchmark_breakdown.json
```

The wasmtime wrapper sets:
- `-Wthreads=y -Sthreads=y` — enable WASM threads and shared memory
- `--env HARDWARE_CONCURRENCY` — thread count
- `--env BB_BENCH` — enable operation counting (`ENABLE_WASM_BENCH=ON` is set by the `wasm-threads` preset)
- `--dir=$HOME/.bb-crs --dir=.` — filesystem access for CRS and working directory

## Local runs are noisy — average 3 runs

Non-dedicated machines have variable CPU load. **Run the benchmark at least 3 times and average the results.** Only the remote benchmarking machine (see `/remote-bench` skill) provides stable, isolated CPU for single-run measurements.

When iterating locally on profiling code changes, relative comparisons (before vs after your change) are still valid on noisy machines — just ensure you compare runs taken close together under similar load.
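The "run 3 times and average" advice can be sketched as a small helper. This is an illustration, not a script from the repository: `time_command` and `summarize_runs` are hypothetical names, and the measurement is whole-process wall-clock time, which includes startup and input loading.

```python
import statistics
import subprocess
import time


def time_command(argv, env=None):
    """Run argv once and return wall-clock seconds; raises if the command fails."""
    start = time.perf_counter()
    subprocess.run(argv, check=True, env=env)
    return time.perf_counter() - start


def summarize_runs(seconds):
    """Return (mean, sample standard deviation) for a list of run times."""
    mean = statistics.mean(seconds)
    stddev = statistics.stdev(seconds) if len(seconds) > 1 else 0.0
    return mean, stddev
```

Typical use: `runs = [time_command(cmd, env=dict(os.environ, HARDWARE_CONCURRENCY="8")) for _ in range(3)]`, then `summarize_runs(runs)`. Note that `env` replaces the whole environment, so build it from `os.environ`. A large standard deviation relative to the mean suggests the machine was too noisy and the runs should be repeated.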

## Using with the remote benchmarking machine

For noise-free, publishable results, use the `/remote-bench` skill to run on the dedicated EC2 instance. The two skills compose naturally:

1. `/benchmark-chonk download-inputs` — get pinned inputs locally
2. `/remote-bench` — build locally, scp binary + inputs to remote, run there, copy results back

See the `/remote-bench` skill for setup, lock management, and usage.

## BB_BENCH instrumentation system

### How it works

`BB_BENCH` is an always-compiled, low-overhead RAII profiling system.

**Header:** `barretenberg/cpp/src/barretenberg/common/bb_bench.hpp`
**Implementation:** `barretenberg/cpp/src/barretenberg/common/bb_bench.cpp`

**Macros:**
```cpp
BB_BENCH() // label = __func__
BB_BENCH_NAME("label") // custom label (preferred)
BB_BENCH_ONLY_NAME("label") // no Tracy, no nesting — lightweight
BB_BENCH_ENABLE_NESTING() // set parent context for child operations
```

The macros create `BenchReporter` RAII objects that:
1. On construction: capture parent context + start time
2. On destruction: record elapsed time with parent association
3. Build a hierarchical call tree automatically

**Activation:** `BB_BENCH=1` env var, or `--print_bench` / `--bench_out_hierarchical` CLI flags.

### Google Benchmark integration

For `chonk_bench` and other `.bench.cpp` targets:
```cpp
#include "barretenberg/common/google_bb_bench.hpp"

for (auto _ : state) {
    GOOGLE_BB_BENCH_REPORTER(state); // clears stats, collects on destruction
    // ... benchmark body ...
}
```

`GOOGLE_BB_BENCH_REPORTER(state)` creates a `GoogleBbBenchReporter` which:
- **Constructor:** calls `GLOBAL_BENCH_STATS.clear()` — resets all accumulated stats
- **Destructor:** aggregates stats into Google Benchmark counters (each operation becomes a `(s)` suffixed counter)

### Per-circuit / per-accumulate breakdown

**Key function:** `bb::detail::GLOBAL_BENCH_STATS.clear()`
(`barretenberg/cpp/src/barretenberg/common/bb_bench.cpp`)

```cpp
void GlobalBenchStatsContainer::clear()
{
    std::unique_lock<std::mutex> lock(mutex);
    for (std::shared_ptr<TimeStatsEntry>& entry : entries) {
        entry->count = TimeStats(); // resets to zero without losing entry structure
    }
}
```

**Usage pattern for per-circuit profiling:**

The `--print_bench` output aggregates across all circuits in the flow (for example, 19). To get per-circuit timing, temporarily instrument `barretenberg/cpp/src/barretenberg/bbapi/bbapi_chonk.cpp`:

1. Add `#include <chrono>` at the top
2. In `ChonkAccumulate::execute()`, wrap the `accumulate()` call:

```cpp
info("ChonkAccumulate - accumulating circuit '", request.loaded_circuit_name, "'");
bb::detail::GLOBAL_BENCH_STATS.clear();
auto circuit_start = std::chrono::steady_clock::now();
request.ivc_in_progress->accumulate(circuit, precomputed_vk);
auto circuit_end = std::chrono::steady_clock::now();
auto circuit_ms = std::chrono::duration_cast<std::chrono::milliseconds>(circuit_end - circuit_start).count();
info("PER_CIRCUIT_TIME: circuit='",
request.loaded_circuit_name,
"' index=",
request.ivc_stack_depth,
" time_ms=",
circuit_ms);
bb::detail::GLOBAL_BENCH_STATS.print_aggregate_counts_hierarchical(std::cerr);
request.ivc_stack_depth++;
```

3. Rebuild with `ninja bb` from the build directory you configured in Step 2, e.g. `cd build && ninja bb` (only recompiles the changed file and relinks)
4. Run the benchmark, then grep for `PER_CIRCUIT_TIME` in the output
5. **Revert the instrumentation** after collecting data: `git checkout -- barretenberg/cpp/src/barretenberg/bbapi/bbapi_chonk.cpp`

This gives wall-clock time per circuit plus a per-circuit BB_BENCH breakdown. The `GLOBAL_BENCH_STATS.clear()` resets stats before each circuit so the hierarchical print shows only that circuit's work.

The same pattern works at any granularity — clear before, print after. This is how `GOOGLE_BB_BENCH_REPORTER` works internally.
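Instead of grepping by hand, the `PER_CIRCUIT_TIME` lines can be collected into a table with a short parser. This sketch assumes `info(...)` concatenates its arguments without separators, so each line has the shape `PER_CIRCUIT_TIME: circuit='<name>' index=<n> time_ms=<ms>`. Verify against your actual output before relying on it. The function name is hypothetical.

```python
import re

# Assumed line shape, derived from the info() call in the instrumentation:
# PER_CIRCUIT_TIME: circuit='<name>' index=<n> time_ms=<ms>
LINE_RE = re.compile(
    r"PER_CIRCUIT_TIME: circuit='(?P<name>[^']*)' index=(?P<index>\d+) time_ms=(?P<ms>\d+)"
)


def parse_per_circuit_times(log_text):
    """Return a list of (index, circuit_name, time_ms) tuples sorted by index."""
    rows = [
        (int(m.group("index")), m.group("name"), int(m.group("ms")))
        for m in LINE_RE.finditer(log_text)
    ]
    return sorted(rows)
```

Feed it the captured benchmark output (stdout and stderr combined is simplest, since it is not stated here which stream `info` writes to); lines that do not match the pattern are ignored.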

### Output formats

| Flag | Format | Use case |
|------|--------|----------|
| `--print_bench` | Colorized tree on stderr | Human reading in terminal |
| `--bench_out <file>` | Flat JSON `{"op": time_ns}` | Simple metrics |
| `--bench_out_hierarchical <file>` | Nested JSON with parent/child | Dashboard, `extract_component_benchmarks.py` |

The hierarchical JSON format:
```json
{
  "operation_name": [
    {
      "parent": "parent_operation",
      "time": 1234567890,
      "time_max": 1234567890,
      "time_mean": 1234567890.0,
      "time_stddev": 12345.0,
      "count": 5,
      "num_threads": 8
    }
  ]
}
```
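A minimal sketch for reading this format and ranking operations by total time. It assumes only the shape shown above (operation name mapped to a list of per-parent entries, each with a `time` field) and that `time` is in nanoseconds, as in the flat format. The function names are hypothetical.

```python
import json


def load_breakdown(path):
    """Load a hierarchical breakdown file written by --bench_out_hierarchical."""
    with open(path) as f:
        return json.load(f)


def total_time_by_operation(breakdown):
    """Sum the `time` field over all parent entries for each operation name."""
    return {
        op: sum(entry["time"] for entry in entries)
        for op, entries in breakdown.items()
    }


def slowest_operations(breakdown, limit=10):
    """Return [(operation, total_time)] sorted from slowest to fastest."""
    totals = total_time_by_operation(breakdown)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]
```

For example, `slowest_operations(load_breakdown("benchmark_breakdown.json"))` gives a quick top-10 without opening the dashboard.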

### Adding new instrumentation

When profiling reveals "missing time" (parent time minus the sum of its children's times exceeds 20% of the parent time), add `BB_BENCH_NAME` to the uninstrumented functions:

```cpp
#include "barretenberg/common/bb_bench.hpp"

void MyProver::execute_phase() {
    BB_BENCH_NAME("MyProver::execute_phase");
    BB_BENCH_ENABLE_NESTING(); // allow child operations to track this as parent
    // ... function body ...
}
```

**Rules:**
- Place macro as the first statement in the scope you want to measure
- Use descriptive names: `"Chonk::accumulate::oink_phase"` not `"oink"`
- For templates: `BB_BENCH_NAME("ShpleminiProver<Flavor>::prove")`, since `__func__` yields only the unqualified function name (`prove`) and loses the class and template context
- For sub-scopes, use braces to create a new scope
- `BB_BENCH_ENABLE_NESTING()` is needed when you want child `BB_BENCH_NAME` calls inside this function to show this function as their parent in the hierarchy
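To find where new instrumentation is needed, the "missing time" check can be run over the hierarchical JSON. This is a sketch under assumptions: it treats an operation's total time as the sum of its entries' `time` fields, and it uses the 20% threshold from the text. Function names are hypothetical. With multi-threaded children, summed child time can exceed the parent's wall time, so fractions may come out negative.

```python
def missing_time_fractions(breakdown):
    """Map each parent operation to (parent_time - children_time) / parent_time."""
    own_time = {
        op: sum(entry["time"] for entry in entries)
        for op, entries in breakdown.items()
    }
    children_time = {}
    for entries in breakdown.values():
        for entry in entries:
            parent = entry.get("parent")
            # Only count children whose parent is itself a recorded operation.
            if parent in own_time:
                children_time[parent] = children_time.get(parent, 0) + entry["time"]
    return {
        parent: (own_time[parent] - child_total) / own_time[parent]
        for parent, child_total in children_time.items()
        if own_time[parent] > 0
    }


def needs_instrumentation(breakdown, threshold=0.20):
    """Return parents whose unexplained time fraction exceeds the threshold, sorted."""
    fractions = missing_time_fractions(breakdown)
    return sorted(op for op, frac in fractions.items() if frac > threshold)
```

Operations with no recorded children do not appear in the result, so leaf functions that are slow on their own still need to be spotted from the tree output.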

### Extracting component benchmarks

After running with `--bench_out_hierarchical`, extract key components:

```bash
python3 barretenberg/cpp/scripts/extract_component_benchmarks.py <output_dir> <name_path>
```

This reads `benchmark_breakdown.json`, finds operations matching key components (sumcheck, pcs, pippenger, commitment, circuit, oink, compute), and appends them to `benchmarks.bench.json` with stacked chart markers for the dashboard.
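For ad-hoc exploration without the dashboard, the keyword selection can be approximated as below. This is not the logic of `extract_component_benchmarks.py`, whose matching rules may differ; it is a sketch that assumes a case-insensitive substring match on the operation name, using the keyword list from the text. The function name is hypothetical.

```python
COMPONENT_KEYWORDS = (
    "sumcheck", "pcs", "pippenger", "commitment", "circuit", "oink", "compute",
)


def select_component_operations(breakdown, keywords=COMPONENT_KEYWORDS):
    """Return {operation: total_time} for names containing any keyword (case-insensitive)."""
    selected = {}
    for op, entries in breakdown.items():
        lowered = op.lower()
        if any(keyword in lowered for keyword in keywords):
            selected[op] = sum(entry["time"] for entry in entries)
    return selected
```

Pass it the parsed contents of `benchmark_breakdown.json` to get a flat dictionary suitable for a quick table or chart.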

## A/B comparison scripts

These use Google Benchmark's `compare.py` for statistical analysis. Note: these use the **remote machine** — see `/remote-bench`.

| Script | What it compares |
|--------|-----------------|
| `scripts/compare_chonk_bench.sh` | Native ChonkBench/Full/6, branch vs baseline |
| `scripts/compare_chonk_bench_wasm.sh` | WASM ChonkBench/Full/6, branch vs baseline |
| `scripts/compare_branch_vs_baseline_remote.sh` | Generic native A/B |
| `scripts/compare_branch_vs_baseline_remote_wasm.sh` | Generic WASM A/B |

## Key scripts reference

| Script | Purpose |
|--------|---------|
| `scripts/test_chonk_standalone_vks_havent_changed.sh` | Download/update/verify pinned inputs |
| `scripts/ci_benchmark_ivc_flows.sh` | CI: proves a flow, extracts components, uploads to dashboard |
| `scripts/benchmark_example_ivc_flow_remote.sh` | Proves a pinned flow on the remote machine (uses `/remote-bench`) |
| `scripts/benchmark_chonk.sh` | Synthetic `chonk_bench` on remote |
| `scripts/wasmtime.sh` | wasmtime wrapper with standard flags |
| `scripts/extract_component_benchmarks.py` | Extract component timings from hierarchical breakdown |

## Tips

- **`HARDWARE_CONCURRENCY=8` for local, `16` for remote.** Always set this explicitly.
- **Local iteration is fine** — you can build, instrument, and run locally. Just average 3 runs for reliable numbers, or use the remote machine via `/remote-bench` for single-run accuracy.
- **Use `./bootstrap.sh` for initial builds** — it downloads cached artifacts and avoids build issues. Use `cmake --preset clang20 && cd build && ninja bb` for incremental rebuilds after code changes.
- **Check the build dir.** The `clang20` preset outputs to `build/`. The `clang20-no-avm` preset disables AVM at the CMake level; confirm its output directory from `binaryDir` in `CMakePresets.json` and use that path wherever the commands in this skill reference `build-no-avm` or `build`.
- **If the zig cache breaks** (missing `libubsan_rt.a` errors), delete `build/` and reconfigure: `rm -rf build && cmake --preset clang20`.
- **WASM preset:** `wasm-threads`. Build dir is `build-wasm-threads/`. The preset enables `ENABLE_WASM_BENCH=ON` automatically.
- **WASM is ~2.8x slower than native** — this ratio is consistent across all circuit types.
- **CRS:** Ensure `~/.bb-crs` exists. For WASM, wasmtime needs `--dir=$HOME/.bb-crs`.
- **`BB_BENCH=1` vs `--print_bench`:** Either activates profiling. `--print_bench` also triggers the hierarchical tree output to stderr. In `chonk_bench`, the `GOOGLE_BB_BENCH_REPORTER` macro enables it automatically when `BB_BENCH=1` is set.
- **Dashboard:** CI uploads breakdown data to `bench/bb-breakdown/` on S3. The dashboard at `ci3/dashboard/chonk-breakdowns/` visualizes it.
- **Rebuilding after instrumentation changes:** Only `ninja bb` is needed — no need to reconfigure.

## Presenting results

When sharing benchmark results, create an **HTML gist** with an interactive visualization. Include:

- **Native vs WASM tabs** with per-circuit comparison table
- **Stacked bar charts** showing time distribution across circuits
- **Aggregation by circuit type** (kernel vs app vs infra)
- **Summary cards** with total time, slowdown ratio, and heaviest circuit
- **Color-coded circuit types**: kernel (blue), app (red), infra (gray)

Use `create_gist` / `update_gist` with a `.html` file. GitHub shows HTML gists as source rather than rendering them, so viewers need to download the file and open it in a browser to interact with tabs and tooltips. This is still much more useful than plain markdown tables for benchmark data.
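The summary-card values can be derived from two per-circuit timing tables. A minimal sketch, assuming you already have `{circuit_name: time_ms}` dictionaries for native and WASM (for example from per-circuit instrumentation); the function and key names are hypothetical.

```python
def summarize(native_ms, wasm_ms):
    """Compute summary-card values from per-circuit times keyed by circuit name."""
    native_total = sum(native_ms.values())
    wasm_total = sum(wasm_ms.values())
    # Heaviest circuit is judged on native time.
    heaviest = max(native_ms, key=native_ms.get)
    return {
        "native_total_ms": native_total,
        "wasm_total_ms": wasm_total,
        "slowdown": wasm_total / native_total,
        "heaviest_circuit": heaviest,
    }
```

The `slowdown` value is the WASM-to-native ratio for the whole flow; computing it per circuit as well makes it easy to check whether the ratio really is uniform across circuit types.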