diff --git a/barretenberg/.claude/skills/benchmark-chonk/SKILL.md b/barretenberg/.claude/skills/benchmark-chonk/SKILL.md new file mode 100644 index 000000000000..ec480bcf79a0 --- /dev/null +++ b/barretenberg/.claude/skills/benchmark-chonk/SKILL.md @@ -0,0 +1,324 @@ +--- +name: benchmark-chonk +description: Run realistic Chonk (client IVC) benchmarks using pinned protocol inputs. Covers native and WASM proving, per-circuit breakdowns, BB_BENCH instrumentation, and profiling code augmentation. Use when asked to benchmark, profile, or measure Chonk proving performance. +argument-hint: e.g. "run", "compare", "wasm", "instrument ", "per-circuit", "download-inputs" +--- + +# Benchmark Chonk + +Run realistic Chonk IVC benchmarks using **pinned protocol inputs** (real transaction flows captured from end-to-end tests), not the synthetic `chonk_bench` target. The synthetic benchmark (`chonk_bench`) uses trivially small mock circuits — it is useful for quick regression checks but does NOT reflect production proving performance. Users invoking `/benchmark-chonk` want the real thing. + +## What makes this different from `chonk_bench` + +| | `chonk_bench` (synthetic) | This skill (realistic) | +|---|---|---| +| Input data | Mock circuits via `test_bench_shared.hpp` | Pinned msgpack from real Aztec transactions | +| Circuit count | 2 or 5 tiny circuits | Full transaction flows (10+ circuits) | +| Circuit variety | All identical | Mixed: app, kernel, tail, public | +| BB command | `./chonk_bench --benchmark_filter=...` | `bb prove --scheme chonk --ivc_inputs_path ...` | + +## Step 1: Get pinned IVC inputs + +The real benchmark inputs are pinned to an S3 artifact keyed by a short hash. Download them: + +```bash +cd barretenberg/cpp/scripts +./test_chonk_standalone_vks_havent_changed.sh --download_pinned_inputs +``` + +This populates `yarn-project/end-to-end/example-app-ivc-inputs-out//ivc-inputs.msgpack`. 
+ +Available flows (typical): +- `ecdsar1+transfer_1_recursions+sponsored_fpc` +- `schnorr+deploy_tokenContract_with_registration+sponsored_fpc` +- `ecdsar1+amm_add_liquidity_1_recursions+sponsored_fpc` +- `ecdsar1+transfer_1_recursions+private_fpc` +- and more — run `ls yarn-project/end-to-end/example-app-ivc-inputs-out/` after downloading + +The pinned hash is maintained in `barretenberg/cpp/scripts/test_chonk_standalone_vks_havent_changed.sh` (variable `pinned_short_hash`). The S3 URL is: +``` +https://aztec-ci-artifacts.s3.us-east-2.amazonaws.com/protocol/bb-chonk-inputs-.tar.gz +``` + +To update the pinned inputs (after protocol changes that affect VKs): +```bash +./test_chonk_standalone_vks_havent_changed.sh --update_inputs +``` + +## Step 2: Build bb in release mode + +```bash +cd barretenberg/cpp +cmake --preset clang20-no-avm # AVM not needed for Chonk +cmake --build --preset clang20-no-avm --target bb +``` + +Build dir: `build-no-avm` (or `build` if using the `clang20` preset). + +## Step 3: Run the benchmark + +**Always set `HARDWARE_CONCURRENCY=8` for local runs.** The remote benchmarking machine uses 16, but local/shared machines should use 8. See `/remote-bench` for remote execution. 
+ +### Native + +```bash +cd barretenberg/cpp + +FLOW="schnorr+deploy_tokenContract_with_registration+sponsored_fpc" +OUTPUT_DIR="/tmp/chonk-bench-out" +mkdir -p $OUTPUT_DIR + +HARDWARE_CONCURRENCY=8 ./build-no-avm/bin/bb prove \ + -o $OUTPUT_DIR \ + --ivc_inputs_path ../../yarn-project/end-to-end/example-app-ivc-inputs-out/$FLOW/ivc-inputs.msgpack \ + --scheme chonk \ + -v \ + --print_bench \ + --bench_out_hierarchical $OUTPUT_DIR/benchmark_breakdown.json +``` + +### WASM (via wasmtime) + +Build the WASM binary with threads enabled: + +```bash +cd barretenberg/cpp +cmake --preset wasm-threads +cmake --build --preset wasm-threads --target bb +``` + +Run via wasmtime (the `scripts/wasmtime.sh` wrapper sets standard flags): + +```bash +cd barretenberg/cpp + +FLOW="schnorr+deploy_tokenContract_with_registration+sponsored_fpc" +OUTPUT_DIR="/tmp/chonk-bench-wasm" +mkdir -p $OUTPUT_DIR + +# Copy inputs to a working dir wasmtime can access +cp ../../yarn-project/end-to-end/example-app-ivc-inputs-out/$FLOW/ivc-inputs.msgpack $OUTPUT_DIR/ + +cd $OUTPUT_DIR +HARDWARE_CONCURRENCY=8 BB_BENCH=1 \ + /path/to/barretenberg/cpp/scripts/wasmtime.sh \ + /path/to/barretenberg/cpp/build-wasm-threads/bin/bb prove \ + -o output \ + --ivc_inputs_path ivc-inputs.msgpack \ + --scheme chonk \ + -v \ + --print_bench \ + --bench_out_hierarchical benchmark_breakdown.json +``` + +The wasmtime wrapper sets: +- `-Wthreads=y -Sthreads=y` — enable WASM threads and shared memory +- `--env HARDWARE_CONCURRENCY` — thread count +- `--env BB_BENCH` — enable operation counting (`ENABLE_WASM_BENCH=ON` is set by the `wasm-threads` preset) +- `--dir=$HOME/.bb-crs --dir=.` — filesystem access for CRS and working directory + +## Local runs are noisy — average 3 runs + +Non-dedicated machines have variable CPU load. **Run the benchmark at least 3 times and average the results.** Only the remote benchmarking machine (see `/remote-bench` skill) provides stable, isolated CPU for single-run measurements. 
+ +When iterating locally on profiling code changes, relative comparisons (before vs after your change) are still valid on noisy machines — just ensure you compare runs taken close together under similar load. + +## Using with the remote benchmarking machine + +For noise-free, publishable results, use the `/remote-bench` skill to run on the dedicated EC2 instance. The two skills compose naturally: + +1. `/benchmark-chonk download-inputs` — get pinned inputs locally +2. `/remote-bench` — build locally, scp binary + inputs to remote, run there, copy results back + +See the `/remote-bench` skill for setup, lock management, and usage. + +## BB_BENCH instrumentation system + +### How it works + +`BB_BENCH` is an always-compiled, low-overhead RAII profiling system. + +**Header:** `barretenberg/cpp/src/barretenberg/common/bb_bench.hpp` +**Implementation:** `barretenberg/cpp/src/barretenberg/common/bb_bench.cpp` + +**Macros:** +```cpp +BB_BENCH() // label = __func__ +BB_BENCH_NAME("label") // custom label (preferred) +BB_BENCH_ONLY_NAME("label") // no Tracy, no nesting — lightweight +BB_BENCH_ENABLE_NESTING() // set parent context for child operations +``` + +The macros create `BenchReporter` RAII objects that: +1. On construction: capture parent context + start time +2. On destruction: record elapsed time with parent association +3. Build a hierarchical call tree automatically + +**Activation:** `BB_BENCH=1` env var, or `--print_bench` / `--bench_out_hierarchical` CLI flags. + +### Google Benchmark integration + +For `chonk_bench` and other `.bench.cpp` targets: +```cpp +#include "barretenberg/common/google_bb_bench.hpp" + +for (auto _ : state) { + GOOGLE_BB_BENCH_REPORTER(state); // clears stats, collects on destruction + // ... benchmark body ... 
+} +``` + +`GOOGLE_BB_BENCH_REPORTER(state)` creates a `GoogleBbBenchReporter` which: +- **Constructor:** calls `GLOBAL_BENCH_STATS.clear()` — resets all accumulated stats +- **Destructor:** aggregates stats into Google Benchmark counters (each operation becomes a `(s)` suffixed counter) + +### Per-circuit / per-accumulate breakdown + +**Key function:** `bb::detail::GLOBAL_BENCH_STATS.clear()` +(`barretenberg/cpp/src/barretenberg/common/bb_bench.cpp`) + +```cpp +void GlobalBenchStatsContainer::clear() +{ + std::unique_lock lock(mutex); + for (std::shared_ptr& entry : entries) { + entry->count = TimeStats(); // resets to zero without losing entry structure + } +} +``` + +**Usage pattern for per-circuit profiling:** + +The `--print_bench` output aggregates across all 19 circuits. To get per-circuit timing, temporarily instrument `barretenberg/cpp/src/barretenberg/bbapi/bbapi_chonk.cpp`: + +1. Add `#include ` at the top +2. In `ChonkAccumulate::execute()`, wrap the `accumulate()` call: + +```cpp + info("ChonkAccumulate - accumulating circuit '", request.loaded_circuit_name, "'"); + bb::detail::GLOBAL_BENCH_STATS.clear(); + auto circuit_start = std::chrono::steady_clock::now(); + request.ivc_in_progress->accumulate(circuit, precomputed_vk); + auto circuit_end = std::chrono::steady_clock::now(); + auto circuit_ms = std::chrono::duration_cast(circuit_end - circuit_start).count(); + info("PER_CIRCUIT_TIME: circuit='", + request.loaded_circuit_name, + "' index=", + request.ivc_stack_depth, + " time_ms=", + circuit_ms); + bb::detail::GLOBAL_BENCH_STATS.print_aggregate_counts_hierarchical(std::cerr); + request.ivc_stack_depth++; +``` + +3. Rebuild with `cd build && ninja bb` (only recompiles the changed file + relinks) +4. Run the benchmark, then grep for `PER_CIRCUIT_TIME` in the output +5. 
**Revert the instrumentation** after collecting data: `git checkout -- barretenberg/cpp/src/barretenberg/bbapi/bbapi_chonk.cpp` + +This gives wall-clock time per circuit plus a per-circuit BB_BENCH breakdown. The `GLOBAL_BENCH_STATS.clear()` resets stats before each circuit so the hierarchical print shows only that circuit's work. + +The same pattern works at any granularity — clear before, print after. This is how `GOOGLE_BB_BENCH_REPORTER` works internally. + +### Output formats + +| Flag | Format | Use case | +|------|--------|----------| +| `--print_bench` | Colorized tree on stderr | Human reading in terminal | +| `--bench_out ` | Flat JSON `{"op": time_ns}` | Simple metrics | +| `--bench_out_hierarchical ` | Nested JSON with parent/child | Dashboard, `extract_component_benchmarks.py` | + +The hierarchical JSON format: +```json +{ + "operation_name": [ + { + "parent": "parent_operation", + "time": 1234567890, + "time_max": 1234567890, + "time_mean": 1234567890.0, + "time_stddev": 12345.0, + "count": 5, + "num_threads": 8 + } + ] +} +``` + +### Adding new instrumentation + +When profiling reveals "missing time" (parent time - sum of children > 20%), add `BB_BENCH_NAME` to the uninstrumented functions: + +```cpp +#include "barretenberg/common/bb_bench.hpp" + +void MyProver::execute_phase() { + BB_BENCH_NAME("MyProver::execute_phase"); + BB_BENCH_ENABLE_NESTING(); // allow child operations to track this as parent + // ... function body ... 
+} +``` + +**Rules:** +- Place macro as the first statement in the scope you want to measure +- Use descriptive names: `"Chonk::accumulate::oink_phase"` not `"oink"` +- For templates: `BB_BENCH_NAME("ShpleminiProver::prove")` since `__func__` is ugly +- For sub-scopes, use braces to create a new scope +- `BB_BENCH_ENABLE_NESTING()` is needed when you want child `BB_BENCH_NAME` calls inside this function to show this function as their parent in the hierarchy + +### Extracting component benchmarks + +After running with `--bench_out_hierarchical`, extract key components: + +```bash +python3 barretenberg/cpp/scripts/extract_component_benchmarks.py +``` + +This reads `benchmark_breakdown.json`, finds operations matching key components (sumcheck, pcs, pippenger, commitment, circuit, oink, compute), and appends them to `benchmarks.bench.json` with stacked chart markers for the dashboard. + +## A/B comparison scripts + +These use Google Benchmark's `compare.py` for statistical analysis. Note: these use the **remote machine** — see `/remote-bench`. 
+ +| Script | What it compares | +|--------|-----------------| +| `scripts/compare_chonk_bench.sh` | Native ChonkBench/Full/6, branch vs baseline | +| `scripts/compare_chonk_bench_wasm.sh` | WASM ChonkBench/Full/6, branch vs baseline | +| `scripts/compare_branch_vs_baseline_remote.sh` | Generic native A/B | +| `scripts/compare_branch_vs_baseline_remote_wasm.sh` | Generic WASM A/B | + +## Key scripts reference + +| Script | Purpose | +|--------|---------| +| `scripts/test_chonk_standalone_vks_havent_changed.sh` | Download/update/verify pinned inputs | +| `scripts/ci_benchmark_ivc_flows.sh` | CI: proves a flow, extracts components, uploads to dashboard | +| `scripts/benchmark_example_ivc_flow_remote.sh` | Proves a pinned flow on the remote machine (uses `/remote-bench`) | +| `scripts/benchmark_chonk.sh` | Synthetic `chonk_bench` on remote | +| `scripts/wasmtime.sh` | wasmtime wrapper with standard flags | +| `scripts/extract_component_benchmarks.py` | Extract component timings from hierarchical breakdown | + +## Tips + +- **`HARDWARE_CONCURRENCY=8` for local, `16` for remote.** Always set this explicitly. Local/shared machines use 8; the remote benchmarking machine uses 16. +- **Local iteration is fine** — you can build, instrument, and run locally. Just average 3 runs for reliable numbers, or use the remote machine via `/remote-bench` for single-run accuracy. +- **Use `./bootstrap.sh` for initial builds** — it downloads cached artifacts and avoids build issues. Use `cmake --preset clang20 && cd build && ninja bb` for incremental rebuilds after code changes. +- **Build dir is `build/`** — the `clang20` preset outputs to `build/`, not `build-no-avm`. The `clang20-no-avm` preset also uses `build/` (it disables AVM at cmake level, not via directory name). +- **If the zig cache breaks** (missing `libubsan_rt.a` errors), delete `build/` and reconfigure: `rm -rf build && cmake --preset clang20`. +- **WASM preset:** `wasm-threads`. Build dir is `build-wasm-threads/`. 
The preset enables `ENABLE_WASM_BENCH=ON` automatically. +- **WASM is ~2.8x slower than native** — this ratio is consistent across all circuit types. +- **CRS:** Ensure `~/.bb-crs` exists. For WASM, wasmtime needs `--dir=$HOME/.bb-crs`. +- **`BB_BENCH=1` vs `--print_bench`:** Either activates profiling. `--print_bench` also triggers the hierarchical tree output to stderr. In `chonk_bench`, the `GOOGLE_BB_BENCH_REPORTER` macro enables it automatically when `BB_BENCH=1` is set. +- **Dashboard:** CI uploads breakdown data to `bench/bb-breakdown/` on S3. The dashboard at `ci3/dashboard/chonk-breakdowns/` visualizes it. +- **Rebuilding after instrumentation changes:** Only `ninja bb` is needed — no need to reconfigure. + +## Presenting results + +When sharing benchmark results, create an **HTML gist** with an interactive visualization. Include: + +- **Native vs WASM tabs** with per-circuit comparison table +- **Stacked bar charts** showing time distribution across circuits +- **Aggregation by circuit type** (kernel vs app vs infra) +- **Summary cards** with total time, slowdown ratio, and heaviest circuit +- **Color-coded circuit types**: kernel (blue), app (red), infra (gray) + +Use `create_gist` / `update_gist` with a `.html` file. GitHub renders HTML gists — viewers can open the raw HTML to interact with tabs and tooltips. This is much more useful than plain markdown tables for benchmark data. diff --git a/barretenberg/.claude/skills/remote-bench/SKILL.md b/barretenberg/.claude/skills/remote-bench/SKILL.md new file mode 100644 index 000000000000..56bab0fa04f0 --- /dev/null +++ b/barretenberg/.claude/skills/remote-bench/SKILL.md @@ -0,0 +1,248 @@ +--- +name: remote-bench +description: Run benchmarks on the dedicated remote EC2 benchmarking machine for noise-free, single-run results. Handles env var validation, lock management, binary transfer, and result collection. Use with /benchmark-chonk or any BB benchmark target. +argument-hint: e.g. 
"bb", "chonk_bench", "ultra_honk_bench", "wasm bb" +--- + +# Remote Bench + +Run barretenberg benchmarks on the dedicated remote EC2 instance for stable, noise-free measurements. This machine is isolated from shared workloads — results from a single run are publishable without averaging. + +**This skill is a transport layer.** It handles: env validation, locking, binary transfer, remote execution, result retrieval. It does NOT know what to benchmark — combine it with `/benchmark-chonk` or other benchmark skills for the actual workload. + +## Prerequisites — environment check + +**MANDATORY: Refuse to proceed if these are not set.** + +Before doing anything, verify the three required environment variables exist: + +```bash +# Check all three are set +if [[ -z "$BB_SSH_KEY" || -z "$BB_SSH_INSTANCE" || -z "$BB_SSH_CPP_PATH" ]]; then + echo "ERROR: Remote benchmarking environment not configured." + echo "Required variables:" + echo " BB_SSH_KEY — SSH key flag (e.g. -i /path/to/key.pem)" + echo " BB_SSH_INSTANCE — EC2 hostname" + echo " BB_SSH_CPP_PATH — Remote repo path (e.g. /home/ubuntu/aztec-packages/barretenberg/cpp)" + echo "" + echo "See barretenberg/cpp/scripts/README.md for setup instructions." + echo "Ask a crypto eng team member for the SSH key and hostname." + exit 1 +fi + +# Verify connectivity +ssh $BB_SSH_KEY $BB_SSH_INSTANCE "echo ok" || { + echo "ERROR: Cannot connect to remote machine. Check BB_SSH_KEY and BB_SSH_INSTANCE." + exit 1 +} +``` + +**If the env is not set up, stop and tell the user.** Do not attempt to run benchmarks locally as a fallback — the user invoked `/remote-bench` because they want stable results. 
+ +### Setup (one-time) + +Add to `~/.zshrc` (ask a crypto eng team member for actual values): +```bash +export BB_SSH_KEY="-i /mnt/user-data//remote-bb-worker.pem" +export BB_SSH_INSTANCE="" +export BB_SSH_CPP_PATH="/home/ubuntu/aztec-packages/barretenberg/cpp" +``` + +Full setup and troubleshooting: `barretenberg/cpp/scripts/README.md` + +## Lock mechanism + +The remote machine is a **shared resource**. A file lock (`~/BENCHMARK_IN_PROGRESS`) prevents concurrent benchmarks from corrupting results. + +### Acquiring the lock + +```bash +source barretenberg/cpp/scripts/_benchmark_remote_lock.sh +``` + +This script (meant to be **sourced**, not run): +1. Polls `~/BENCHMARK_IN_PROGRESS` on the remote machine (10 retries, 10s apart) +2. If still locked after 100s, **exits the calling script** with an error +3. Creates the lock file +4. Registers a trap to delete the lock on exit (including Ctrl-C) + +The lock auto-releases when the sourcing script exits. + +### When the lock is stuck + +If a previous session crashed without cleaning up: + +```bash +# Check if something is actually running +ssh $BB_SSH_KEY $BB_SSH_INSTANCE "pgrep -a bb || echo 'nothing running'" + +# If nothing is running, safe to remove the stale lock +ssh $BB_SSH_KEY $BB_SSH_INSTANCE "rm ~/BENCHMARK_IN_PROGRESS" +``` + +### Multi-session coordination + +When multiple Claude sessions need the remote machine: +- The lock is first-come-first-served. If locked, the session should **tell the user** and suggest waiting or doing other work. +- Sessions should **not** loop/poll for the lock — just report it's busy and let the user decide. + +## Usage patterns + +### Pattern 1: Native benchmark (any target) + +The standard flow used by `scripts/benchmark_remote.sh`: + +```bash +cd barretenberg/cpp + +BENCHMARK="bb" # or chonk_bench, ultra_honk_bench, etc. +PRESET="clang20-no-avm" # or clang20 +BUILD_DIR="build-no-avm" # matches preset + +# 1. 
Build locally +cmake --preset $PRESET +cmake --build --preset $PRESET --target $BENCHMARK + +# 2. Acquire lock + transfer +source scripts/_benchmark_remote_lock.sh +scp $BB_SSH_KEY ./$BUILD_DIR/bin/$BENCHMARK $BB_SSH_INSTANCE:$BB_SSH_CPP_PATH/build/ + +# 3. Run remotely +ssh $BB_SSH_KEY $BB_SSH_INSTANCE \ + "cd $BB_SSH_CPP_PATH/build && HARDWARE_CONCURRENCY=16 " + +# 4. Copy results back +scp $BB_SSH_KEY $BB_SSH_INSTANCE:$BB_SSH_CPP_PATH/build/ . +``` + +Or use the convenience script: +```bash +./scripts/benchmark_remote.sh "" +``` + +### Pattern 2: Chonk with pinned inputs on remote + +Combines with `/benchmark-chonk`: + +```bash +cd barretenberg/cpp + +FLOW="schnorr+deploy_tokenContract_with_registration+sponsored_fpc" + +# 1. Build bb locally +cmake --preset clang20-no-avm +cmake --build --preset clang20-no-avm --target bb + +# 2. Transfer inputs + binary +source scripts/_benchmark_remote_lock.sh +scp $BB_SSH_KEY \ + ../../yarn-project/end-to-end/example-app-ivc-inputs-out/$FLOW/ivc-inputs.msgpack \ + $BB_SSH_INSTANCE:$BB_SSH_CPP_PATH/build/ +scp $BB_SSH_KEY ./build-no-avm/bin/bb $BB_SSH_INSTANCE:$BB_SSH_CPP_PATH/build/ + +# 3. Run with full profiling +ssh $BB_SSH_KEY $BB_SSH_INSTANCE \ + "cd $BB_SSH_CPP_PATH/build && \ + HARDWARE_CONCURRENCY=16 BB_BENCH=1 ./bb prove \ + -o output \ + --ivc_inputs_path ivc-inputs.msgpack \ + --scheme chonk \ + -v \ + --print_bench \ + --bench_out_hierarchical benchmark_breakdown.json" + +# 4. Retrieve results +scp $BB_SSH_KEY $BB_SSH_INSTANCE:$BB_SSH_CPP_PATH/build/benchmark_breakdown.json . +``` + +Or use the convenience script: +```bash +./scripts/benchmark_example_ivc_flow_remote.sh bb "$FLOW" +``` + +### Pattern 3: WASM benchmark on remote + +```bash +cd barretenberg/cpp + +# 1. Build WASM locally +cmake --preset wasm-threads +cmake --build --preset wasm-threads --target bb + +# 2. 
Transfer +source scripts/_benchmark_remote_lock.sh +ssh $BB_SSH_KEY $BB_SSH_INSTANCE "mkdir -p $BB_SSH_CPP_PATH/build-wasm-threads" +scp $BB_SSH_KEY ./build-wasm-threads/bin/bb $BB_SSH_INSTANCE:$BB_SSH_CPP_PATH/build-wasm-threads/ + +# 3. Also transfer inputs if benchmarking Chonk (FLOW must be set, as in Pattern 2) +scp $BB_SSH_KEY \ + ../../yarn-project/end-to-end/example-app-ivc-inputs-out/$FLOW/ivc-inputs.msgpack \ + $BB_SSH_INSTANCE:$BB_SSH_CPP_PATH/build-wasm-threads/ + +# 4. Run via wasmtime on remote +ssh $BB_SSH_KEY $BB_SSH_INSTANCE \ + "cd $BB_SSH_CPP_PATH/build-wasm-threads && \ + HARDWARE_CONCURRENCY=16 \ + /home/ubuntu/.wasmtime/bin/wasmtime run \ + -Wthreads=y -Sthreads=y \ + --env HARDWARE_CONCURRENCY \ + --env HOME \ + --env BB_BENCH=1 \ + --dir=\$HOME/.bb-crs \ + --dir=. \ + ./bb " +``` + +Or use the convenience script: +```bash +./scripts/benchmark_wasm_remote.sh "" +``` + +### Pattern 4: A/B branch comparison + +Compare current branch vs baseline (builds and runs both on remote): + +```bash +# Native +./scripts/compare_chonk_bench.sh # ChonkBench/Full/6 +./scripts/compare_branch_vs_baseline_remote.sh '' + +# WASM +./scripts/compare_chonk_bench_wasm.sh # ChonkBench/Full/6 +./scripts/compare_branch_vs_baseline_remote_wasm.sh '' +``` + +These use Google Benchmark's `compare.py` for statistical analysis. Note: comparison scripts check out the baseline branch locally, so your working tree must be clean. 
+ +## Scripts reference + +| Script | Purpose | +|--------|---------| +| `scripts/benchmark_remote.sh` | Generic: build locally, scp, run remotely | +| `scripts/benchmark_wasm_remote.sh` | Same for WASM (wasmtime on remote) | +| `scripts/benchmark_example_ivc_flow_remote.sh` | Chonk with pinned inputs on remote | +| `scripts/benchmark_chonk.sh` | Synthetic chonk_bench on remote | +| `scripts/compare_chonk_bench.sh` | A/B native comparison | +| `scripts/compare_chonk_bench_wasm.sh` | A/B WASM comparison | +| `scripts/compare_branch_vs_baseline_remote.sh` | Generic A/B native | +| `scripts/compare_branch_vs_baseline_remote_wasm.sh` | Generic A/B WASM | +| `scripts/_benchmark_remote_lock.sh` | Lock mechanism (source it, don't run it) | + +## Remote machine details + +| Property | Value | +|----------|-------| +| User | `ubuntu` | +| wasmtime | `/home/ubuntu/.wasmtime/bin/wasmtime` | +| CRS | `~/.bb-crs` | +| Native build dir | `$BB_SSH_CPP_PATH/build` | +| WASM build dir | `$BB_SSH_CPP_PATH/build-wasm-threads` | +| Default `HARDWARE_CONCURRENCY` | `16` | + +## Tips + +- **Always check the env first.** If `BB_SSH_KEY`, `BB_SSH_INSTANCE`, or `BB_SSH_CPP_PATH` are missing, stop and tell the user. +- **One run is enough** on the remote machine — it's isolated. No need to average 3 runs like on shared machines. +- **Release the lock promptly** — don't hold it while analyzing results locally. +- **Build locally, run remotely** — the remote machine is for execution only. Never build on it. +- **HARDWARE_CONCURRENCY=16** is the standard on the remote machine. Match it for WASM comparisons. diff --git a/barretenberg/CLAUDE.md b/barretenberg/CLAUDE.md index 9e273c9da61a..bae8bdc0fe5b 100644 --- a/barretenberg/CLAUDE.md +++ b/barretenberg/CLAUDE.md @@ -46,6 +46,10 @@ Add GitHub labels to PRs to control what CI runs. 
Choose based on what changed: - **`ci-barretenberg-full`** or **`ci-full`** — Full builds including cross-compilation (macOS, iOS, ARM64 Linux), SMT verification, ASAN, and GCC syntax checks. Use when changing CMake presets, bootstrap.sh, or build infrastructure. - **`ci-release-pr`** — Creates a test release tag for pre-release validation. Use when changing release packaging or publish workflows. +## Code comments + +Comments must describe the code as it is, not relative to what it used to be. Never write comments like "replaces the old X", "no longer needs Y", "previously this was Z", or "eliminates the need for W". These become stale immediately after the commit lands. Instead, describe what the code does and why. + ## Handling noir/noir-repo submodule If `git status` shows `noir/noir-repo` as modified but your changes have nothing to do with updating noir, run: diff --git a/barretenberg/cpp/bootstrap.sh b/barretenberg/cpp/bootstrap.sh index 060f41086dba..b9c2f351d656 100755 --- a/barretenberg/cpp/bootstrap.sh +++ b/barretenberg/cpp/bootstrap.sh @@ -245,11 +245,12 @@ function test_cmds_native { awk '/^[a-zA-Z]/ {suite=$1} /^[ ]/ {print suite$1}' | \ grep -v 'DISABLED_' | \ while read -r test; do - # Skip heavy recursion tests in debug builds — they take 400-600s and the same + # Skip heavy recursion tests in debug builds — they take 400-600s+ and the same # code paths are already exercised (with assertions) by faster tests in the suite. # Keep WithoutPredicate/1.GenerateVKFromConstraints (241s) so that the debug-only # native_verification_debug path in honk_recursion_constraint.cpp is still exercised. - if [[ "$native_preset" == *debug* ]] && [[ "$test" =~ ^(HonkRecursionConstraintTest|ChonkRecursionConstraintTest|AvmRecursionInnerCircuitTests) ]]; then + # None of the other skipped suites exercise unique debug-only (#ifndef NDEBUG) code paths. 
+ if [[ "$native_preset" == *debug* ]] && [[ "$test" =~ ^(HonkRecursionConstraintTest|ChonkRecursionConstraintTest|AvmRecursionInnerCircuitTests|AvmRecursionConstraintTest|AvmRecursiveTests\.TwoLayer|PaddingVariants/AvmRecursiveTestsParameterized\.TwoLayer|BoomerangTwoLayerAvmRecursiveVerifierTests|ECCVMRecursiveTests|GoblinRecursiveVerifierTests|GoblinAvmRecursiveVerifierTests|BoomerangGoblinRecursiveVerifierTests|BoomerangGoblinAvmRecursiveVerifierTests) ]]; then if [[ "$test" != "HonkRecursionConstraintTestWithoutPredicate/1.GenerateVKFromConstraints" ]]; then continue fi diff --git a/barretenberg/cpp/src/barretenberg/commitment_schemes/ipa/ipa.hpp b/barretenberg/cpp/src/barretenberg/commitment_schemes/ipa/ipa.hpp index fe0b8bc09c0b..42868eaf5498 100644 --- a/barretenberg/cpp/src/barretenberg/commitment_schemes/ipa/ipa.hpp +++ b/barretenberg/cpp/src/barretenberg/commitment_schemes/ipa/ipa.hpp @@ -817,10 +817,19 @@ template class IPA } // Compute G_zero - // In the native verifier, this uses pippenger. Here we use batch_mul. + // In the native verifier, this uses pippenger. Here we use fixed_batch_mul since all SRS points are + // circuit constants, which uses plookup tables instead of ROM tables and is significantly cheaper. + // We use 8-bit tables (table_bits=8, 32 rounds) to minimise gate count. However, with N=32768 SRS points + // and 8-bit tables, the total table rows = 32768 × 256 = 2^23 exactly. The 5 mandatory overhead rows + // (NUM_DISABLED_ROWS_IN_SUMCHECK=4, NUM_ZERO_ROWS=1) push the total to 2^23+5, forcing dyadic_size = 2^24. + // To stay within 2^23 we handle the first SRS point separately using operator*. 
std::vector srs_elements = vk.get_monomial_points(); srs_elements.resize(poly_length); - Commitment computed_G_zero = Commitment::batch_mul(srs_elements, s_vec); + std::vector remaining_srs(srs_elements.begin() + 1, srs_elements.end()); + std::vector remaining_s(s_vec.begin() + 1, s_vec.end()); + Commitment first_term = srs_elements[0] * s_vec[0]; + Commitment remaining_term = Commitment::fixed_batch_mul(remaining_srs, remaining_s, {}, /*table_bits=*/8); + Commitment computed_G_zero = first_term + remaining_term; // check the computed G_zero and the claimed G_zero are the same. // The circuit constraint enforces correctness; mismatched witnesses will produce an unsatisfiable circuit. claimed_G_zero.assert_equal(computed_G_zero, "G_zero doesn't match received G_zero."); diff --git a/barretenberg/cpp/src/barretenberg/dsl/acir_format/gate_count_constants.hpp b/barretenberg/cpp/src/barretenberg/dsl/acir_format/gate_count_constants.hpp index 6950442ec33b..76dbef81782f 100644 --- a/barretenberg/cpp/src/barretenberg/dsl/acir_format/gate_count_constants.hpp +++ b/barretenberg/cpp/src/barretenberg/dsl/acir_format/gate_count_constants.hpp @@ -55,7 +55,7 @@ template inline constexpr size_t ASSERT_EQUALITY = ZERO_GATE // Honk Recursion Constants // ======================================== -inline constexpr size_t ROOT_ROLLUP_GATE_COUNT = 12904895; +inline constexpr size_t ROOT_ROLLUP_GATE_COUNT = 6351604; template constexpr std::tuple HONK_RECURSION_CONSTANTS( diff --git a/barretenberg/cpp/src/barretenberg/op_queue/ecc_op_queue.hpp b/barretenberg/cpp/src/barretenberg/op_queue/ecc_op_queue.hpp index 5526fdb82c16..efed42f0cb39 100644 --- a/barretenberg/cpp/src/barretenberg/op_queue/ecc_op_queue.hpp +++ b/barretenberg/cpp/src/barretenberg/op_queue/ecc_op_queue.hpp @@ -6,6 +6,7 @@ #pragma once +#include "barretenberg/common/bb_bench.hpp" #include "barretenberg/ecc/curves/bn254/bn254.hpp" #include "barretenberg/eccvm/eccvm_builder_types.hpp" #include 
"barretenberg/op_queue/ecc_ops_table.hpp" @@ -214,6 +215,7 @@ class ECCOpQueue { */ UltraOp mul_accumulate(const Point& to_mul, const Fr& scalar) { + BB_BENCH_NAME("ECCOpQueue::mul_accumulate"); // Update the accumulator natively accumulator = accumulator + to_mul * scalar; EccOpCode op_code{ .mul = true }; diff --git a/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/cycle_group.cpp b/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/cycle_group.cpp index cd1c8acfbe54..11b3280f80ad 100644 --- a/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/cycle_group.cpp +++ b/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/cycle_group.cpp @@ -7,6 +7,7 @@ #include "../field/field.hpp" #include "../field/field_utils.hpp" #include "barretenberg/common/assert.hpp" +#include "barretenberg/common/thread.hpp" #include "barretenberg/common/zip_view.hpp" #include "barretenberg/crypto/pedersen_commitment/pedersen.hpp" #include "barretenberg/ecc/curves/grumpkin/grumpkin.hpp" @@ -968,6 +969,220 @@ typename cycle_group::batch_mul_internal_output cycle_group::_ return { accumulator, offset_generator_accumulator }; } +/** + * @brief Internal algorithm to perform a fixed-base batch mul using plookup tables. + * + * @details Computes a batch mul of constant base points using the Straus multiscalar multiplication algorithm. + * For each constant base point, a plookup table (BasicTable) is created with (1 << table_bits) entries. + * Unlike ROM tables, plookup tables have zero construction cost and zero finalization overhead. + * Each table read costs exactly 1 lookup gate. 
+ * + * @param scalars Witness scalars to multiply with base points + * @param base_points Constant affine points (SRS elements or similar) + * @param offset_generators Offset points to prevent infinity edge cases (size = base_points.size() + 1) + * @return {accumulator, offset_generator_delta} where result = accumulator - offset_generator_delta + */ +template +typename cycle_group::batch_mul_internal_output cycle_group::_fixed_base_plookup_batch_mul_internal( + const std::span scalars, + const std::span base_points, + const std::span offset_generators, + const size_t table_bits) +{ + BB_ASSERT_EQ(!scalars.empty(), true, "Empty scalars provided to fixed base plookup batch mul!"); + BB_ASSERT_EQ(scalars.size(), base_points.size(), "Points/scalars size mismatch in fixed base plookup batch mul!"); + BB_ASSERT_EQ(offset_generators.size(), base_points.size() + 1, "Too few offset generators provided!"); + const size_t num_points = scalars.size(); + + Builder* context = nullptr; + for (const auto& scalar : scalars) { + if (context = scalar.get_context(); context != nullptr) { + break; + } + } + BB_ASSERT(context != nullptr); + BB_ASSERT_EQ(cycle_scalar::LO_BITS % table_bits, + 0UL, + "table_bits must evenly divide cycle_scalar::LO_BITS. The Straus algorithm splits the scalar " + "into lo/hi limbs and decomposes each separately; if LO_BITS is not a multiple of table_bits, " + "the hi-limb slices start at the wrong bit-offset and the MSM result is incorrect. " + "Valid values for table_bits (given LO_BITS=128) are: 1, 2, 4, 8, 16, 32, 64, 128."); + + const size_t num_rounds = numeric::ceil_div(cycle_scalar::NUM_BITS, table_bits); + + // Decompose each scalar into table_bits-bit slices (also enforces range constraints) + std::vector scalar_slices; + scalar_slices.reserve(num_points); + for (const auto& scalar : scalars) { + scalar_slices.emplace_back(context, scalar, table_bits); + } + + // Create plookup tables for each constant base point (zero gate cost). 
+ // Phase 1 (parallel): compute native table entries and BasicTable column data — no builder access. + // Phase 2 (serial): register each BasicTable with the builder (builder is not thread-safe). + std::vector precomputed_tables(num_points); + parallel_for(num_points, [&](size_t i) { + precomputed_tables[i] = + straus_plookup_table::build_precomputed_data(base_points[i], offset_generators[i + 1], table_bits); + }); + std::vector point_tables; + point_tables.reserve(num_points); + for (size_t i = 0; i < num_points; ++i) { + point_tables.emplace_back(context, std::move(precomputed_tables[i])); + } + + // Compute all intermediate points natively for use as hints in the in-circuit Straus algorithm. + // Using projective coordinates + batch normalize to avoid per-operation modular inversions. + // The per-point lookup tables (point_tables) already hold the precomputed affine entries; we reuse + // them directly rather than rebuilding projective copies. + std::vector operation_transcript; + Element offset_generator_accumulator = offset_generators[0]; + { + // Perform Straus algorithm natively + Element accumulator = offset_generators[0]; + for (size_t i = 0; i < num_rounds; ++i) { + if (i != 0) { + for (size_t j = 0; j < table_bits; ++j) { + accumulator = accumulator.dbl(); + operation_transcript.push_back(accumulator); + offset_generator_accumulator = offset_generator_accumulator.dbl(); + } + } + for (size_t j = 0; j < num_points; ++j) { + auto slice_value = static_cast(scalar_slices[j].slices_native[num_rounds - i - 1]); + const Element point(point_tables[j].get_native_table()[slice_value]); + accumulator += point; + operation_transcript.push_back(accumulator); + offset_generator_accumulator += Element(offset_generators[j + 1]); + } + } + } + + // Batch-normalize all hint points + Element::batch_normalize(operation_transcript.data(), operation_transcript.size()); + std::vector operation_hints; + operation_hints.reserve(operation_transcript.size()); + for (const 
Element& element : operation_transcript) { + operation_hints.emplace_back(element.x, element.y); + } + + // Execute Straus algorithm in-circuit using plookup reads and precomputed hints + AffineElement* hint_ptr = operation_hints.data(); + cycle_group accumulator = offset_generators[0]; + + for (size_t i = 0; i < num_rounds; ++i) { + if (i != 0) { + for (size_t j = 0; j < table_bits; ++j) { + accumulator = accumulator.dbl(*hint_ptr); + hint_ptr++; + } + } + for (size_t j = 0; j < num_points; ++j) { + const field_t scalar_slice = scalar_slices[j][num_rounds - i - 1]; + const cycle_group point = point_tables[j].read(scalar_slice); + // Safe to use unconditional_add: all base points are constants hence linearly independent of offset + // generators + accumulator = accumulator.unconditional_add(point, *hint_ptr); + hint_ptr++; + } + } + + accumulator.set_origin_tag(OriginTag::constant()); + return { accumulator, AffineElement(offset_generator_accumulator) }; +} + +/** + * @brief Fixed-base multiscalar multiplication using plookup tables. + * + * @details Optimized MSM for the case where all base points are circuit constants (e.g. SRS elements). + * Uses plookup tables instead of ROM tables, eliminating table construction gates and finalization overhead. + * All base points MUST be constants; witness base points will trigger an assertion failure. 
+ * + * @param constant_points Vector of constant cycle_group points + * @param scalars Vector of cycle_scalar values (may be witnesses or constants) + * @param context Generator context for offset generators + * @return cycle_group The result of sum(scalars[i] * constant_points[i]) + */ +template +cycle_group cycle_group::fixed_batch_mul(const std::vector& constant_points, + const std::vector& scalars, + const GeneratorContext& context, + const size_t table_bits) +{ + BB_ASSERT_EQ(scalars.size(), constant_points.size(), "Points/scalars size mismatch in fixed_batch_mul!"); + + if (scalars.empty()) { + return cycle_group{ Group::point_at_infinity }; + } + + // Merge all tags + OriginTag result_tag = OriginTag::constant(); + for (auto [point, scalar] : zip_view(constant_points, scalars)) { + result_tag = OriginTag(result_tag, OriginTag(point.get_origin_tag(), scalar.get_origin_tag())); + } + + std::vector plookup_scalars; + std::vector plookup_points; + bool has_non_constant_component = false; + Element constant_acc = Group::point_at_infinity; + + for (const auto [point, scalar] : zip_view(constant_points, scalars)) { + BB_ASSERT(point.is_constant()); + if (scalar.is_constant()) { + // Both constant: compute natively + constant_acc += point.get_value() * scalar.get_value(); + } else { + if (point.get_value().is_point_at_infinity()) { + // Constant infinity contributes nothing, but still need range constraints on scalar + auto* ctx = scalar.get_context(); + ctx->create_limbed_range_constraint(scalar.lo().get_witness_index(), + cycle_scalar::LO_BITS, + table_bits, + "fixed_batch_mul: lo range constraint for scalar with constant " + "infinity"); + ctx->create_limbed_range_constraint(scalar.hi().get_witness_index(), + cycle_scalar::HI_BITS, + table_bits, + "fixed_batch_mul: hi range constraint for scalar with constant " + "infinity"); + continue; + } + plookup_scalars.push_back(scalar); + plookup_points.push_back(point.get_value()); + has_non_constant_component = true; + 
} + } + + if (!has_non_constant_component) { + auto result = cycle_group(constant_acc); + result.set_origin_tag(result_tag); + return result; + } + + // Compute offset generators + const size_t num_offset_generators = plookup_points.size() + 1; + const std::span offset_generators = + context.generators->get(num_offset_generators, 0, OFFSET_GENERATOR_DOMAIN_SEPARATOR); + + // Run the plookup-based Straus algorithm + Element offset_accumulator = -constant_acc; + const auto [accumulator, offset_generator_delta] = + _fixed_base_plookup_batch_mul_internal(plookup_scalars, plookup_points, offset_generators, table_bits); + offset_accumulator += offset_generator_delta; + + // Subtract offset. Since all points are constants and linearly independent of offset generators, + // we can safely use unconditional_add when constant_acc is non-trivial. + cycle_group result; + if (!constant_acc.is_point_at_infinity()) { + result = accumulator.unconditional_add(AffineElement(-offset_accumulator)); + } else { + result = accumulator - cycle_group(AffineElement(offset_accumulator)); + } + + result.set_origin_tag(result_tag); + return result; +} + /** * @brief Multiscalar multiplication algorithm. 
* diff --git a/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/cycle_group.hpp b/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/cycle_group.hpp index 641f52e2fc1f..4d1cf831e3cc 100644 --- a/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/cycle_group.hpp +++ b/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/cycle_group.hpp @@ -14,6 +14,7 @@ #include "barretenberg/stdlib/primitives/field/field.hpp" #include "barretenberg/stdlib/primitives/group/cycle_scalar.hpp" #include "barretenberg/stdlib/primitives/group/straus_lookup_table.hpp" +#include "barretenberg/stdlib/primitives/group/straus_plookup_table.hpp" #include "barretenberg/stdlib/primitives/group/straus_scalar_slice.hpp" #include "barretenberg/stdlib_circuit_builders/plookup_tables/fixed_base/fixed_base_params.hpp" #include "barretenberg/transcript/origin_tag.hpp" @@ -52,6 +53,7 @@ template class cycle_group { using BigScalarField = stdlib::bigfield; using cycle_scalar = ::bb::stdlib::cycle_scalar; using straus_lookup_table = ::bb::stdlib::straus_lookup_table; + using straus_plookup_table = ::bb::stdlib::straus_plookup_table; using straus_scalar_slices = ::bb::stdlib::straus_scalar_slices; // Bit-size for scalars represented in the ROM lookup tables used in the variable-base MSM algorithm @@ -128,6 +130,22 @@ template class cycle_group { static cycle_group batch_mul(const std::vector& base_points, const std::vector& scalars, const GeneratorContext& context = {}); + + static cycle_group fixed_batch_mul(const std::vector& constant_points, + const std::vector& scalars, + GeneratorContext context = {}, + size_t table_bits = ROM_TABLE_BITS) + { + std::vector cycle_scalars; + for (auto scalar : scalars) { + cycle_scalars.emplace_back(scalar); + } + return fixed_batch_mul(constant_points, cycle_scalars, context, table_bits); + } + static cycle_group fixed_batch_mul(const std::vector& constant_points, + const std::vector& scalars, + const GeneratorContext& context = {}, + size_t 
table_bits = ROM_TABLE_BITS); cycle_group operator*(const cycle_scalar& scalar) const; cycle_group& operator*=(const cycle_scalar& scalar); cycle_group operator*(const BigScalarField& scalar) const; @@ -205,8 +223,9 @@ template class cycle_group { } private: - // Allow straus_lookup_table to access the private constructor for efficiency + // Allow straus_lookup_table and straus_plookup_table to access the private constructor for efficiency friend class ::bb::stdlib::straus_lookup_table; + friend class ::bb::stdlib::straus_plookup_table; // Private constructor that allows explicit control over infinity flag. // Use public constructors or factory methods instead - they auto-detect infinity from coordinates. @@ -225,6 +244,12 @@ template class cycle_group { static batch_mul_internal_output _fixed_base_batch_mul_internal(std::span scalars, std::span base_points); + static batch_mul_internal_output _fixed_base_plookup_batch_mul_internal( + std::span scalars, + std::span base_points, + std::span offset_generators, + size_t table_bits = ROM_TABLE_BITS); + // Internal implementation for unconditional_add and unconditional_subtract cycle_group _unconditional_add_or_subtract(const cycle_group& other, bool is_addition, diff --git a/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/cycle_group.test.cpp b/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/cycle_group.test.cpp index 15416d1a5988..a8482d37cebb 100644 --- a/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/cycle_group.test.cpp +++ b/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/cycle_group.test.cpp @@ -2080,4 +2080,79 @@ TYPED_TEST(CycleGroupTest, TestInfinityAutoDetectionInConstructor) EXPECT_FALSE(builder.failed()); EXPECT_TRUE(CircuitChecker::check(builder)); } + +/** + * @brief Test fixed_batch_mul correctness with constant points and witness scalars + */ +TYPED_TEST(CycleGroupTest, TestFixedBatchMul) +{ + STDLIB_TYPE_ALIASES; + auto builder = Builder(); + + constexpr size_t 
num_points = 8; + std::vector points; + std::vector scalars; + Element expected = Group::point_at_infinity; + + for (size_t i = 0; i < num_points; ++i) { + auto element = TestFixture::generators[i]; + typename Group::Fr scalar = Group::Fr::random_element(&engine); + expected += (element * scalar); + // Points are constant, scalars are witnesses + points.emplace_back(cycle_group_ct(element)); + scalars.emplace_back(cycle_group_ct::cycle_scalar::from_witness(&builder, scalar)); + } + + auto result = cycle_group_ct::fixed_batch_mul(points, scalars); + EXPECT_EQ(result.get_value(), AffineElement(expected)); + + EXPECT_FALSE(builder.failed()); + EXPECT_TRUE(CircuitChecker::check(builder)); +} + +/** + * @brief Test fixed_batch_mul with a single constant point + */ +TYPED_TEST(CycleGroupTest, TestFixedBatchMulSinglePoint) +{ + STDLIB_TYPE_ALIASES; + auto builder = Builder(); + + auto element = TestFixture::generators[0]; + typename Group::Fr scalar = Group::Fr::random_element(&engine); + Element expected = element * scalar; + + std::vector points{ cycle_group_ct(element) }; + std::vector scalars{ cycle_group_ct::cycle_scalar::from_witness(&builder, + scalar) }; + + auto result = cycle_group_ct::fixed_batch_mul(points, scalars); + EXPECT_EQ(result.get_value(), AffineElement(expected)); + + EXPECT_FALSE(builder.failed()); + EXPECT_TRUE(CircuitChecker::check(builder)); +} + +/** + * @brief Test fixed_batch_mul with a zero scalar + */ +TYPED_TEST(CycleGroupTest, TestFixedBatchMulZeroScalar) +{ + STDLIB_TYPE_ALIASES; + auto builder = Builder(); + + auto element = TestFixture::generators[0]; + typename Group::Fr zero_scalar = 0; + + std::vector points{ cycle_group_ct(element) }; + std::vector scalars{ cycle_group_ct::cycle_scalar::from_witness( + &builder, zero_scalar) }; + + auto result = cycle_group_ct::fixed_batch_mul(points, scalars); + EXPECT_TRUE(result.is_point_at_infinity().get_value()); + + EXPECT_FALSE(builder.failed()); + EXPECT_TRUE(CircuitChecker::check(builder)); 
+} + #pragma GCC diagnostic pop diff --git a/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/straus_plookup_table.cpp b/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/straus_plookup_table.cpp new file mode 100644 index 000000000000..0516e8439924 --- /dev/null +++ b/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/straus_plookup_table.cpp @@ -0,0 +1,150 @@ +#include "./straus_plookup_table.hpp" +#include "./cycle_group.hpp" +#include "barretenberg/stdlib/primitives/circuit_builders/circuit_builders.hpp" + +namespace bb::stdlib { + +/** + * @brief Compute native table entries and BasicTable column data without touching the circuit builder. + * + * @details This is the parallelizable part of table construction. It builds: + * - native_table: affine points { offset_generator + i * base_point } for i in [0, table_size) + * - basic_table: a BasicTable with columns populated but table_index NOT yet assigned + * + * @param base_point Constant base point + * @param offset_generator Offset to prevent point-at-infinity edge cases + * @param table_bits Number of bits per table (table has 1 << table_bits entries) + * @return PrecomputedData Contains native_table and basic_table (without table_index) + */ +template +typename straus_plookup_table::PrecomputedData straus_plookup_table::build_precomputed_data( + const AffineElement& base_point, const AffineElement& offset_generator, size_t table_bits) +{ + const size_t table_size = 1UL << table_bits; + + // Compute native table entries using projective coordinates, then batch-normalize + std::vector projective_points(table_size); + projective_points[0] = Element(offset_generator); + Element base_proj(base_point); + for (size_t i = 1; i < table_size; ++i) { + projective_points[i] = projective_points[i - 1] + base_proj; + } + Element::batch_normalize(projective_points.data(), table_size); + + PrecomputedData result; + result.native_table.resize(table_size); + for (size_t i = 0; i < table_size; ++i) { + 
result.native_table[i] = AffineElement(projective_points[i].x, projective_points[i].y); + } + + // Populate BasicTable columns (table_index is NOT set here — that requires the builder) + result.basic_table.id = plookup::BasicTableId::STRAUS_EC_POINT; + result.basic_table.use_twin_keys = false; + result.basic_table.column_1_step_size = bb::fr(0); + result.basic_table.column_2_step_size = bb::fr(0); + result.basic_table.column_3_step_size = bb::fr(0); + result.basic_table.get_values_from_key = nullptr; + + result.basic_table.column_1.resize(table_size); + result.basic_table.column_2.resize(table_size); + result.basic_table.column_3.resize(table_size); + for (size_t i = 0; i < table_size; ++i) { + result.basic_table.column_1[i] = bb::fr(i); + result.basic_table.column_2[i] = result.native_table[i].x; + result.basic_table.column_3[i] = result.native_table[i].y; + } + + return result; +} + +/** + * @brief Construct from precomputed data — serial Phase 2, only touches the circuit builder. + * + * @details Assigns table_index and pushes the BasicTable into the builder's lookup_tables deque. + * Must be called serially (builder is not thread-safe). + * + * @param context The circuit builder + * @param data Precomputed native table + BasicTable columns + */ +template +straus_plookup_table::straus_plookup_table(Builder* context, PrecomputedData data) + : _context(context) + , native_table(std::move(data.native_table)) +{ + _table = context->register_basic_lookup_table(std::move(data.basic_table)); + + // This table is built entirely from native constants, so the tag is pure constant. + tag = OriginTag::constant(); +} + +/** + * @brief Construct a plookup-based Straus lookup table for a constant base point. + * + * @details Creates a BasicTable with (1 << table_bits) entries of the form: + * { offset_generator + i * base_point } for i in [0, 1 << table_bits) + * + * The table is pushed directly into the builder's lookup_tables deque. 
Table data becomes part of the + * proving polynomial (zero gate cost). Each subsequent read costs exactly 1 lookup gate. + * + * @param context The circuit builder + * @param base_point Constant base point (must not be a witness) + * @param offset_generator Offset to prevent point-at-infinity edge cases + * @param table_bits Number of bits per table (table has 1 << table_bits entries) + */ +template +straus_plookup_table::straus_plookup_table(Builder* context, + const AffineElement& base_point, + const AffineElement& offset_generator, + size_t table_bits) + : straus_plookup_table(context, build_precomputed_data(base_point, offset_generator, table_bits)) +{} + +/** + * @brief Read from the plookup table at the given index. + * + * @details Creates a single lookup gate constraining (index, x, y) to a valid row in this table. + * The index witness is reused as the lookup key so the scalar slice is directly constrained. + * + * @param _index The lookup index (witness or constant, typically a scalar slice) + * @return cycle_group The point at native_table[index] + */ +template cycle_group straus_plookup_table::read(const field_t& _index) +{ + // A plookup gate key must be a witness; convert constants to a witness constrained to the constant value + // (mirrors the same pattern in straus_lookup_table::read and create_gates_from_plookup_accumulators). + field_t index(_index); + if (index.is_constant()) { + index = field_t::from_witness(_context, _index.get_value()); + index.assert_equal(_index.get_value()); + } + + // Get native index value and look up the corresponding point + auto native_index = static_cast(uint256_t(index.get_value())); + BB_ASSERT(native_index < native_table.size()); + const auto& point = native_table[native_index]; + + // Create witnesses for x and y outputs + auto x_idx = _context->add_variable(point.x); + auto y_idx = _context->add_variable(point.y); + + // Create a standalone lookup gate constraining (index, x, y) to a valid table row. 
+ plookup::BasicTable::LookupEntry entry; + entry.key = { uint256_t(native_index), 0 }; + entry.value = { point.x, point.y }; + _context->create_lookup_gate(index.get_witness_index(), x_idx, y_idx, *_table, entry); + + // Wrap output witnesses in field_t and propagate origin tag from the index + field_t x = field_t::from_witness_index(_context, x_idx); + field_t y = field_t::from_witness_index(_context, y_idx); + OriginTag merged_tag(tag, index.get_origin_tag()); + x.set_origin_tag(merged_tag); + y.set_origin_tag(merged_tag); + + // Result is never at infinity due to offset generator in every table entry + return cycle_group(x, y, /*is_infinity=*/bool_t(_context, false), /*assert_on_curve=*/false); +} + +template class straus_plookup_table; +template class straus_plookup_table; + +} // namespace bb::stdlib diff --git a/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/straus_plookup_table.hpp b/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/straus_plookup_table.hpp new file mode 100644 index 000000000000..0ac12c819855 --- /dev/null +++ b/barretenberg/cpp/src/barretenberg/stdlib/primitives/group/straus_plookup_table.hpp @@ -0,0 +1,72 @@ +#pragma once + +#include "barretenberg/stdlib/primitives/field/field.hpp" +#include "barretenberg/stdlib_circuit_builders/plookup_tables/types.hpp" +#include "barretenberg/transcript/origin_tag.hpp" +#include + +namespace bb::stdlib { + +// Forward declaration +template class cycle_group; + +/** + * @brief straus_plookup_table computes a plookup-based lookup table of size 1 << table_bits + * + * @details For a CONSTANT base_point [P] and offset_generator point [G], where N = 1 << table_bits, + * the following is computed: + * + * { [G] + 0.[P], [G] + 1.[P], ..., [G] + (N - 1).[P] } + * + * Unlike straus_lookup_table (which uses ROM tables), this class creates plookup BasicTable entries. 
+ * Plookup tables have zero construction cost (table data is part of the proving polynomial) and each + * read costs exactly 1 lookup gate with no finalization overhead. This makes them significantly cheaper + * than ROM tables for fixed/constant base points. + * + * @note This class requires the base point to be a circuit constant (not a witness). For witness base + * points, use straus_lookup_table instead. + * + * @note The offset generator [G] prevents point-at-infinity edge cases, same as in straus_lookup_table. + */ +template class straus_plookup_table { + public: + using field_t = stdlib::field_t; + using bool_t = stdlib::bool_t; + using Curve = typename Builder::EmbeddedCurve; + using Group = typename Curve::Group; + using Element = typename Curve::Element; + using AffineElement = typename Curve::AffineElement; + + /** + * @brief Precomputed data for two-phase construction. Contains all data computed without builder access. + */ + struct PrecomputedData { + std::vector native_table; + plookup::BasicTable basic_table; // columns populated; table_index is NOT yet assigned + }; + + straus_plookup_table() = default; + straus_plookup_table(Builder* context, + const AffineElement& base_point, + const AffineElement& offset_generator, + size_t table_bits); + // Construct from precomputed data — only performs builder registration (serial phase) + straus_plookup_table(Builder* context, PrecomputedData data); + + // Compute native table + BasicTable columns without touching the builder (parallelizable) + static PrecomputedData build_precomputed_data(const AffineElement& base_point, + const AffineElement& offset_generator, + size_t table_bits); + + cycle_group read(const field_t& index); + + const std::vector& get_native_table() const { return native_table; } + + private: + Builder* _context = nullptr; + plookup::BasicTable* _table = nullptr; // pointer into builder's lookup_tables deque + std::vector native_table; // precomputed table entries for witness generation + 
OriginTag tag; +}; + +} // namespace bb::stdlib diff --git a/barretenberg/cpp/src/barretenberg/stdlib_circuit_builders/plookup_tables/types.hpp b/barretenberg/cpp/src/barretenberg/stdlib_circuit_builders/plookup_tables/types.hpp index 0d9f7996a712..e755ff0434d1 100644 --- a/barretenberg/cpp/src/barretenberg/stdlib_circuit_builders/plookup_tables/types.hpp +++ b/barretenberg/cpp/src/barretenberg/stdlib_circuit_builders/plookup_tables/types.hpp @@ -86,6 +86,9 @@ enum BasicTableId { KECCAK_RHO_7, KECCAK_RHO_8, KECCAK_RHO_9, + // Used by straus_plookup_table for fixed-base MSM with constant EC points (e.g. IPA verifier SRS elements). + // Each table instance gets this id; uniqueness within a circuit is ensured by table_index, not id. + STRAUS_EC_POINT, }; enum MultiTableId { diff --git a/barretenberg/cpp/src/barretenberg/stdlib_circuit_builders/ultra_circuit_builder.cpp b/barretenberg/cpp/src/barretenberg/stdlib_circuit_builders/ultra_circuit_builder.cpp index da88199681e8..df881a274c62 100644 --- a/barretenberg/cpp/src/barretenberg/stdlib_circuit_builders/ultra_circuit_builder.cpp +++ b/barretenberg/cpp/src/barretenberg/stdlib_circuit_builders/ultra_circuit_builder.cpp @@ -513,6 +513,43 @@ plookup::BasicTable& UltraCircuitBuilder_::get_table(const plook return lookup_tables.back(); } +/** @copydoc UltraCircuitBuilder_::register_basic_lookup_table */ +template +plookup::BasicTable* UltraCircuitBuilder_::register_basic_lookup_table(plookup::BasicTable&& table) +{ + table.table_index = lookup_tables.size(); + lookup_tables.emplace_back(std::move(table)); + return &lookup_tables.back(); +} + +/** @copydoc UltraCircuitBuilder_::create_lookup_gate */ +template +void UltraCircuitBuilder_::create_lookup_gate(const uint32_t key_idx, + const uint32_t val1_idx, + const uint32_t val2_idx, + plookup::BasicTable& table, + const plookup::BasicTable::LookupEntry& entry, + const FF column_1_step_size, + const FF column_2_step_size, + const FF column_3_step_size) +{ + 
this->assert_valid_variables({ key_idx, val1_idx, val2_idx }); + + table.lookup_gates.emplace_back(entry); + + blocks.lookup.populate_wires(key_idx, val1_idx, val2_idx, this->zero_idx()); + blocks.lookup.set_gate_selector(1); + blocks.lookup.q_3().emplace_back(FF(table.table_index)); + blocks.lookup.q_2().emplace_back(column_1_step_size); + blocks.lookup.q_m().emplace_back(column_2_step_size); + blocks.lookup.q_c().emplace_back(column_3_step_size); + blocks.lookup.q_1().emplace_back(0); + blocks.lookup.q_4().emplace_back(0); + + check_selector_length_consistency(); + this->increment_num_gates(); +} + /** * @brief Create gates from pre-computed accumulator values which simultaneously establish individual basic-table * lookups and the reconstruction of the desired result from those components. @@ -559,7 +596,6 @@ plookup::ReadData UltraCircuitBuilder_::create_gates_f // Get basic lookup table; construct and add to builder.lookup_tables if not already present plookup::BasicTable& table = get_table(multi_table.basic_table_ids[i]); - table.lookup_gates.emplace_back(read_values.lookup_entries[i]); // Create witness variables: first lookup reuses user's input indices, subsequent create new variables const auto first_idx = is_first_lookup ? 
key_a_index : this->add_variable(read_values[ColumnIdx::C1][i]); @@ -571,21 +607,14 @@ plookup::ReadData UltraCircuitBuilder_::create_gates_f read_data[ColumnIdx::C1].push_back(first_idx); read_data[ColumnIdx::C2].push_back(second_idx); read_data[ColumnIdx::C3].push_back(third_idx); - this->assert_valid_variables({ first_idx, second_idx, third_idx }); - // Populate lookup gate: wire values and selectors - blocks.lookup.populate_wires(first_idx, second_idx, third_idx, this->zero_idx()); - blocks.lookup.set_gate_selector(1); // mark as lookup gate - blocks.lookup.q_3().emplace_back(FF(table.table_index)); // unique table identifier // Step size coefficients: zero for last lookup (no next accumulator), negative step sizes otherwise - blocks.lookup.q_2().emplace_back(is_last_lookup ? 0 : -multi_table.column_1_step_sizes[i + 1]); - blocks.lookup.q_m().emplace_back(is_last_lookup ? 0 : -multi_table.column_2_step_sizes[i + 1]); - blocks.lookup.q_c().emplace_back(is_last_lookup ? 0 : -multi_table.column_3_step_sizes[i + 1]); - blocks.lookup.q_1().emplace_back(0); // unused - blocks.lookup.q_4().emplace_back(0); // unused + const FF col1_step = is_last_lookup ? FF(0) : -multi_table.column_1_step_sizes[i + 1]; + const FF col2_step = is_last_lookup ? FF(0) : -multi_table.column_2_step_sizes[i + 1]; + const FF col3_step = is_last_lookup ? 
FF(0) : -multi_table.column_3_step_sizes[i + 1]; - check_selector_length_consistency(); - this->increment_num_gates(); + create_lookup_gate( + first_idx, second_idx, third_idx, table, read_values.lookup_entries[i], col1_step, col2_step, col3_step); } return read_data; } diff --git a/barretenberg/cpp/src/barretenberg/stdlib_circuit_builders/ultra_circuit_builder.hpp b/barretenberg/cpp/src/barretenberg/stdlib_circuit_builders/ultra_circuit_builder.hpp index 2bcfae4938f7..f9cb3b16423d 100644 --- a/barretenberg/cpp/src/barretenberg/stdlib_circuit_builders/ultra_circuit_builder.hpp +++ b/barretenberg/cpp/src/barretenberg/stdlib_circuit_builders/ultra_circuit_builder.hpp @@ -432,6 +432,26 @@ class UltraCircuitBuilder_ : public CircuitBuilderBase& get_lookup_tables() { return lookup_tables; } size_t get_num_lookup_tables() const { return lookup_tables.size(); } + /** + * @brief Register a BasicTable with the builder, assigning it a unique table_index. + * @return Stable pointer into the builder's lookup_tables deque. + */ + plookup::BasicTable* register_basic_lookup_table(plookup::BasicTable&& table); + + /** + * @brief Create a single plookup lookup gate. + * @details Records the lookup entry, populates one row of the lookup block with the given wire indices and + * step-size selectors, and increments the gate count. Step sizes are 0 for standalone or last-in-chain lookups. + */ + void create_lookup_gate(uint32_t key_idx, + uint32_t val1_idx, + uint32_t val2_idx, + plookup::BasicTable& table, + const plookup::BasicTable::LookupEntry& entry, + FF column_1_step_size = 0, + FF column_2_step_size = 0, + FF column_3_step_size = 0); + plookup::ReadData create_gates_from_plookup_accumulators( const plookup::MultiTableId& id, const plookup::ReadData& read_values,