21 commits
7847c67
Add streaming one-wesolowski compaction APIs.
hoffmang9 Feb 24, 2026
7be0752
Fix non-x86 build break in vdf_fast_pairindex.
hoffmang9 Feb 24, 2026
3755be2
Ensure cmake is present on macOS CI runners.
hoffmang9 Feb 24, 2026
073427a
Improve fast-path batch replay handling and harden pairindex slot all…
hoffmang9 Feb 24, 2026
fd000ab
Clarify batch iteration indexing in streaming callback.
hoffmang9 Feb 24, 2026
3f82dc2
Add streaming tuner diagnostics and batch fast-wrapper APIs.
hoffmang9 Feb 24, 2026
95f8ff1
Make fast-thread counter slots build-configurable.
hoffmang9 Feb 24, 2026
746ba2e
Fix fast pairindex allocator state across translation units.
hoffmang9 Feb 24, 2026
707b2f4
Guard streaming prover bucket shifts against invalid k.
hoffmang9 Feb 24, 2026
0c11002
Harden fast counter slot safety and macOS cmake setup.
hoffmang9 May 13, 2026
1e73421
Remove stale patch artifact and refine tuner update cost.
hoffmang9 May 13, 2026
61e9280
Adapt streaming callback checkpoint scheduling from fb0e2c2.
hoffmang9 May 13, 2026
91a2af9
Address slot reuse and logging consistency in vdf fast path.
hoffmang9 May 13, 2026
1e19548
trick 2
Ealrann Jan 21, 2026
81b7b0d
Event queue
Ealrann Jan 28, 2026
ea0a0ab
Fix unbounded RSS growth
Ealrann Feb 13, 2026
813e5b8
Fix batch fast-wrapper exception cleanup
hoffmang9 May 13, 2026
b15f01b
Fail closed on streaming replay corruption
hoffmang9 May 13, 2026
e508cce
Fix batch callback lifetimes across exception cleanup
hoffmang9 May 13, 2026
88602ba
Restore batch finalizer reducer API
hoffmang9 May 13, 2026
8e76d33
Remove unused batch progress dead code
hoffmang9 May 13, 2026
12 changes: 12 additions & 0 deletions .github/workflows/build-c-libraries.yml
@@ -82,6 +82,18 @@ jobs:
          fetch-depth: 1
          path: mpir_gc_x64

      - name: Ensure cmake available (macOS)
        if: matrix.os.matrix == 'macos'
        shell: bash
        run: |
          brew ls --versions cmake >/dev/null 2>&1 || brew install cmake
          CMAKE_BIN="$(brew --prefix cmake)/bin"
          if [ -d "$CMAKE_BIN" ]; then
            echo "$CMAKE_BIN" >> "$GITHUB_PATH"
            export PATH="$CMAKE_BIN:$PATH"
          fi
          cmake --version

      - name: Build
        working-directory: src
        env:
12 changes: 12 additions & 0 deletions .github/workflows/build.yml
@@ -102,6 +102,18 @@ jobs:
        with:
          python-version: ${{ matrix.python.major-dot-minor }}

      - name: Ensure cmake available (macOS)
        if: matrix.os.matrix == 'macos'
        shell: bash
        run: |
          brew ls --versions cmake >/dev/null 2>&1 || brew install cmake
          CMAKE_BIN="$(brew --prefix cmake)/bin"
          if [ -d "$CMAKE_BIN" ]; then
            echo "$CMAKE_BIN" >> "$GITHUB_PATH"
            export PATH="$CMAKE_BIN:$PATH"
          fi
          cmake --version

      - name: Install pipx
        run: |
          pip install pipx
308 changes: 308 additions & 0 deletions BBR_BLUEBOX_COMPACTION_OVERVIEW.md
@@ -0,0 +1,308 @@
# BBR / Chia Bluebox VDF Proof Compaction — Overview + Implemented Performance Tricks

This document summarizes how Chia “bluebox” compaction jobs are computed in this repo, and the performance tweaks we implemented in `bbr_chiavdf/` (a fork/copy of `chiavdf/`).

It is written to be understandable even if you’re not already fluent with Chia’s VDF / classgroup implementation details.

## 1) What “bluebox compaction” computes

Chia blocks include a VDF output (“VDFInfo”) for some VDF “slot” (end-of-slot, signage point, infusion point, etc.). A full node can accept a **compact proof of time** (“VDFProof”) for a given VDFInfo:

- Input element `x` (for compaction jobs this is always the **default/canonical classgroup element**, i.e. the fixed starting form that Chia uses as VDF input)
- Discriminant `D` derived from the VDF challenge
- Number of iterations `T`
- Output element `y` (already known from the block’s VDFInfo)

The bluebox worker’s job is to compute the **compact Wesolowski witness** `π` (witness_type = 0) such that the proof verifies for `(D, x, y, T)`.

### Inputs (per proof job)

All values are byte strings unless noted otherwise.

- `challenge`: 32 bytes (`VDFInfo.challenge`)
- `T`: `u64` (`VDFInfo.number_of_iterations`)
- `y_ref`: 100 bytes serialized classgroup element (`VDFInfo.output`)
- `size_bits`: discriminant size in bits (typically 1024; Chia consensus constant)
- `x0`: canonical input element for compaction:
  - `x0_bytes = ClassgroupElement.get_default_element().data` (100 bytes)

### Outputs (per proof job)

- `y`: serialized output element (should equal `y_ref`)
- `proof`: serialized witness element `π` (same size as `y`)
- In our C ABI wrappers we return `y || proof` (concatenation, typically 200 bytes for 1024-bit discriminants).
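
A caller on the receiving side has to split the concatenated buffer back into its two halves. A minimal sketch, assuming the buffer holds exactly two equal-sized serialized forms; `split_result` and `form_size` are hypothetical names used for illustration and are not part of the C ABI:

```python
from typing import Tuple


def split_result(buf: bytes, form_size: int = 100) -> Tuple[bytes, bytes]:
    """Split a `y || proof` buffer into (y, proof)."""
    # Reject short or oversized buffers instead of silently mis-slicing them.
    if len(buf) != 2 * form_size:
        raise ValueError(f"expected {2 * form_size} bytes, got {len(buf)}")
    return buf[:form_size], buf[form_size:]
```

The length check matters because a truncated buffer would otherwise yield a plausible-looking but invalid proof.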

## 2) Underlying primitives (high-level)

Chia’s VDF uses the class group of binary quadratic forms for a negative discriminant `D`:

- `D` is derived deterministically from `(challenge, size_bits)` via `CreateDiscriminant(...)`.
- Elements are represented as reduced forms `form { a, b, c }` (each is a GMP big integer).
- The VDF evaluation is the deterministic repeated squaring chain:
  - `f(0) = x0`
  - `f(t+1) = square(f(t))` (with reduction)
  - `y = f(T) = x0^(2^T)` in the class group

The compact proof is a Wesolowski proof, which (in this implementation) uses a per-proof prime `B` derived from the input and output:

- `B = GetB(D, x0_form, y_ref_form)` where `GetB` hashes serialized forms then runs `HashPrime(...)`.
- Because `B` depends on `y_ref`, if `y_ref` is known up front then **`B` is known before squaring starts**.
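
To make the shape of these primitives concrete, here is a toy sketch that swaps the class group for integers modulo a fixed prime `N`. This is an assumption made purely so the example runs without GMP or chiavdf. `toy_get_b` is a hypothetical stand-in for `GetB`/`HashPrime` (it hashes a string and walks to the next prime); it is not the real derivation.

```python
import hashlib

N = (1 << 127) - 1  # toy modulus (a Mersenne prime) standing in for the class group


def repeated_square(x0: int, T: int) -> int:
    """f(T) = x0^(2^T) mod N, computed as T sequential squarings."""
    f = x0 % N
    for _ in range(T):
        f = f * f % N
    return f


def is_prime(n: int) -> bool:
    """Trial division; adequate for the small toy primes used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def toy_get_b(x0: int, y: int, bits: int = 24) -> int:
    """Hash (x0, y) to a prime. Hypothetical stand-in for GetB/HashPrime."""
    digest = hashlib.sha256(f"{x0}|{y}".encode()).digest()
    # Keep the top `bits` bits, force the high bit and the low (odd) bit.
    cand = (int.from_bytes(digest, "big") >> (256 - bits)) | (1 << (bits - 1)) | 1
    while not is_prime(cand):
        cand += 2
    return cand


def verify(x0: int, y: int, proof: int, T: int, B: int) -> bool:
    """Wesolowski check: proof^B * x0^(2^T mod B) == y."""
    return pow(proof, B, N) * pow(x0, pow(2, T, B), N) % N == y


T = 200
x0 = 3
y_ref = repeated_square(x0, T)      # in compaction this value comes from the block
B = toy_get_b(x0, y_ref)            # depends only on (x0, y_ref), so known up front
proof = pow(x0, (1 << T) // B, N)   # direct witness x0^floor(2^T / B)
```

The point of the sketch is the data dependency: once `y_ref` is given, `B` needs no squaring work at all, which is what the streaming approach later in this document relies on.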

## 3) Baseline chiavdf “one-weso” compaction (two-phase)

The upstream chiavdf compact witness path (“one-weso”) is:

1. **Squaring phase**
   - Run the VDF evaluation (sequential squaring) from `x0` to iteration `T`.
   - Store many intermediate forms (“checkpoints”) in an array at a fixed cadence.

2. **Proof phase**
   - After squaring is finished, scan those stored checkpoints.
   - Multiply them into “buckets” `ys[j][b]` using a mapping (`GetBlock`) that depends on `B`.
   - Fold the bucket structure into a final `proof_form`.

### Proof parameters `k, l, kl`

chiavdf computes parameters:

- `(k, l) = ApproximateParameters(T)`
- `kl = k * l`
- Number of checkpoint indices: `limit = ceil(T / kl)` (the number of checkpoint positions that may be used)

These parameters control how many checkpoints are used and how bucket folding is structured.
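
As a small illustration of the arithmetic, the sketch below computes `kl`, `limit`, and the checkpoint times `i*kl` for given `(T, k, l)`. It takes `k` and `l` as inputs rather than reproducing `ApproximateParameters`; `checkpoint_plan` is a hypothetical helper name.

```python
def checkpoint_plan(T: int, k: int, l: int):
    """Return (kl, limit, checkpoint_times) for a proof over T iterations."""
    kl = k * l
    limit = -(-T // kl)  # ceil(T / kl) using integer arithmetic only
    times = [i * kl for i in range(limit)]  # checkpoint i sits at t = i * kl
    return kl, limit, times


kl, limit, times = checkpoint_plan(1001, 4, 2)
```

Every checkpoint time is strictly less than `T`, so the final output `f(T)` is never itself a checkpoint.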

### Costs (baseline)

- Memory: stores `O(ceil(T/kl))` checkpoint forms (each form holds several GMP big integers).
- Time: wall-clock is essentially `t_total = t_square + t_proof` because proof work happens after squaring.

## 4) Tweak / Trick 1 — “Streaming one-weso” using known output (`y_ref`)

### Key idea

For bluebox compaction, `y_ref` is already known from the block. Because `B` depends on `y_ref`, we can compute `B` before starting squaring.

That lets us avoid storing checkpoint forms and instead update the proof buckets **as soon as each checkpoint is reached**, using the current `f(t)` value.

### Algorithm (single job, streaming buckets)

Inputs: `(challenge, size_bits, x0_bytes, y_ref_bytes, T)`

1. Compute `D = CreateDiscriminant(challenge, size_bits)` and `L = root(-D, 4)` (chiavdf convention).
2. Deserialize:
- `x0_form = DeserializeForm(D, x0_bytes)`
- `y_ref_form = DeserializeForm(D, y_ref_bytes)`
3. Compute:
- `B = GetB(D, x0_form, y_ref_form)`
- `(k, l) = ApproximateParameters(T)` (fallback `k=10,l=1` for small `T`)
- `kl = k*l`
- `limit = ceil(T/kl)`
4. Allocate buckets:
- `ys[j][b]` for `j ∈ [0, l)` and `b ∈ [0, 2^k)`
- Initialize all buckets to the identity form.
5. Run the VDF squaring chain up to `T`, but:
   - At each checkpoint time `t = i*kl` (for `i ∈ [0, limit)`), take `checkpoint = f(t)` and call `process_checkpoint(i, checkpoint)`:
     - For each `j ∈ [0, l)`:
       - `p = i*l + j`
       - If `T >= k*(p+1)`: compute `b = GetBlock(p, k, T, B)` and multiply `ys[j][b] *= checkpoint` (via `nucomp_form`). Otherwise skip this `p`.
6. At the end, compute `y = f(T)` and check `y == y_ref_form` (debug/safety guard).
7. Fold buckets to compute the final proof form (same folding logic as chiavdf).
8. Serialize `y` and `proof` and return `y || proof`.
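
The steps above can be sketched end to end in a toy group. The code below is an illustration under stated assumptions, not the `fast_wrapper.cpp` implementation: integers modulo a prime `N` replace the class group, `B` is passed in as a fixed prime instead of being derived with `GetB`, the fold is a naive one rather than chiavdf's folding routine, and `B > 2^k` is required so that the base-`2^k` digits `b_p` cover all of `floor(2^T / B)`.

```python
N = (1 << 127) - 1  # toy modulus standing in for the class group


def get_block(p: int, k: int, T: int, B: int) -> int:
    """b_p = floor(2^k * (2^(T - k*(p+1)) mod B) / B)."""
    return ((1 << k) * pow(2, T - k * (p + 1), B)) // B


def prove_streaming(x0: int, y_ref: int, T: int, k: int, l: int, B: int):
    kl = k * l
    ys = [[1] * (1 << k) for _ in range(l)]  # buckets start at the identity
    f = x0 % N
    for t in range(T):
        if t % kl == 0:                      # checkpoint i = t / kl, f == f(t)
            i = t // kl
            for j in range(l):
                p = i * l + j
                if T >= k * (p + 1):         # skip positions past the end
                    b = get_block(p, k, T, B)
                    ys[j][b] = ys[j][b] * f % N
        f = f * f % N                        # one squaring step
    if f != y_ref:                           # safety guard from step 6
        raise ValueError("f(T) does not match y_ref")
    proof = 1
    for j in reversed(range(l)):             # naive fold, Horner style over j
        proof = pow(proof, 1 << k, N)
        for b in range(1, 1 << k):
            proof = proof * pow(ys[j][b], b, N) % N
    return f, proof


def verify(x0: int, y: int, proof: int, T: int, B: int) -> bool:
    """Wesolowski check: proof^B * x0^(2^T mod B) == y."""
    return pow(proof, B, N) * pow(x0, pow(2, T, B), N) % N == y


x0, T, k, l, B = 3, 1000, 4, 2, 1000003
y_ref = pow(x0, 1 << T, N)                   # "known output" for the toy job
y, proof = prove_streaming(x0, y_ref, T, k, l, B)
```

No checkpoint array exists anywhere in `prove_streaming`: the only state that grows with the parameters is the `l * 2^k` bucket table.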

### What changed vs baseline

- We no longer store an array of checkpoint forms.
- Bucket multiplication occurs “online” during squaring.
- Folding/finalization stays the same as chiavdf.

### Costs / tradeoffs

- Memory becomes `O(l * 2^k)` forms (the bucket table) instead of `O(ceil(T/kl))` checkpoint forms.
- Bucket updates are interleaved with squaring instead of being deferred to a separate proof phase; the resulting speedup depends on which part dominates (squaring vs `nucomp_form` multiplications).

### Where this lives in `bbr_chiavdf/`

- C ABI entrypoints:
  - `chiavdf_prove_one_weso_fast_streaming(...)`
  - `chiavdf_prove_one_weso_fast_streaming_with_progress(...)`
- Implementation:
  - `bbr_chiavdf/src/c_bindings/fast_wrapper.cpp`
  - `StreamingOneWesolowskiCallback` and the bucket helper (`StreamingWesolowskiBuckets`).

## 5) GetBlock optimization (precompute `GetBlock(p)` table per job)

In streaming (and in baseline), for each checkpoint update we need:

- `b = GetBlock(p, k, T, B)`

Naively this uses per-`p` modular exponentiation and division, which is expensive with GMP big integers.

### Optimization idea

For fixed `(T, k, B)`, define:

- `r_p = 2^{T - k*(p+1)} mod B`
- `b_p = floor((r_p * 2^k) / B)` (integer division)

Then:

- `r_{p+1} = r_p * inv(2^k) mod B` where `inv(2^k)` is the modular inverse of `2^k mod B`

So we can compute all `b_p` iteratively in `O(#p)` time with one modular inverse, instead of `O(#p)` modular exponentiations.
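
A short sketch of the recurrence, checked against the per-`p` definition. The function names are hypothetical, Python integers stand in for GMP values, and `pow(x, -1, B)` (Python 3.8+) computes the modular inverse; `B` is assumed to be an odd prime so the inverse of `2^k` exists.

```python
def blocks_naive(T: int, k: int, B: int, count: int):
    """One modular exponentiation per p."""
    return [((1 << k) * pow(2, T - k * (p + 1), B)) // B for p in range(count)]


def blocks_incremental(T: int, k: int, B: int, count: int):
    """One exponentiation and one inverse in total, then O(1) work per p."""
    inv = pow(pow(2, k, B), -1, B)   # inv(2^k) mod B
    r = pow(2, T - k, B)             # r_0 = 2^(T - k) mod B
    out = []
    for _ in range(count):
        out.append((r << k) // B)    # b_p = floor(r_p * 2^k / B)
        r = r * inv % B              # r_{p+1} = r_p * inv(2^k) mod B
    return out


T, k, B = 1000, 4, 1000003
count = T // k                       # number of p with k*(p+1) <= T
table = blocks_incremental(T, k, B, count)
```

Each entry satisfies `0 <= b_p < 2^k` because `r_p < B`, which is why a `u32` table entry is sufficient for the values of `k` discussed here.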

### Tradeoff

We store `precomputed_blocks[p]` for all `p` used by the proof:

- Memory: `O(limit * l)` `u32` values per job.
- For typical compaction-scale `T` this is often a few MB per job.

### Where this lives

- `bbr_chiavdf/src/c_bindings/fast_wrapper.cpp`:
  - `build_precomputed_getblocks(...)`
- Used by:
  - `chiavdf_prove_one_weso_fast_streaming_getblock_opt(...)`
  - `chiavdf_prove_one_weso_fast_streaming_getblock_opt_with_progress(...)`

## 6) Trick 2 — discriminant reuse (“multi-target VDF engine”)

### Key observation

For a fixed group key `(challenge, size_bits, x0_bytes)`, the discriminant `D` and the entire squaring trajectory `f(t)` are identical for all jobs:

- Only `T_j` and `y_ref_j` differ across jobs.

Therefore, if you have `N` jobs sharing a group key:

- Without reuse: total squaring work is `Σ T_j`
- With reuse: total squaring work is exactly `T_max = max(T_j)`

### Grouping key

Jobs can be grouped if and only if:

- Same `challenge`
- Same `size_bits`
- Same `x0_bytes`

For bluebox compaction, `x0_bytes` is always the default element, so grouping is mostly “same challenge”.

### Algorithm (batch)

Inputs (shared):

- `challenge`, `x0_bytes`, `size_bits`

Inputs (per job `j`):

- `T_j`, `y_ref_j`

Per job setup (done before squaring starts):

1. Deserialize `y_ref_form_j`
2. Compute `B_j = GetB(D, x0_form, y_ref_form_j)`
3. Compute `(k_j, l_j)`, `kl_j`, `limit_j`
4. Allocate `ys_j` buckets (Trick 1)
5. Precompute `GetBlock` table for that job (GetBlock opt)

Shared squaring run:

- Run `repeated_square(T_max, ...)` once to generate `f(t)` for all times up to `T_max`.
- Maintain per job:
  - `next_checkpoint_t_j` initialized to `kl_j` (we process `i=0` immediately at `t=0`)
  - completion time `T_j`
- At each “event time” `t`:
  1. For every job where `t == next_checkpoint_t_j`:
     - `i = t / kl_j`
     - `ys_j` bucket update with checkpoint form `f(t)` (Trick 1)
     - `next_checkpoint_t_j += kl_j`
  2. For every job where `t == T_j`:
     - Debug check: `f(T_j) == y_ref_form_j`
       - If mismatch: abort (signals backend grouping/data bug).
     - Finalize proof for that job (fold buckets → proof form) and serialize result.
     - Free job state (buckets, GetBlock table) to reduce peak RAM.
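
The batch flow can be sketched in the same toy setting. Everything here is an assumption for illustration: integers modulo `N` replace the class group, each job's `B` is a fixed prime rather than a `GetB` result, the fold is naive, and every pending job is scanned at every `t` (a simplification; nothing here reflects how the real callback schedules events). `Job`, `prove_batch`, and the helpers are hypothetical names.

```python
from dataclasses import dataclass, field

N = (1 << 127) - 1  # toy modulus standing in for the class group


@dataclass
class Job:
    T: int
    k: int
    l: int
    B: int
    y_ref: int
    next_cp: int = 0                      # next checkpoint time for this job
    ys: list = field(default_factory=list)

    def __post_init__(self):
        self.ys = [[1] * (1 << self.k) for _ in range(self.l)]


def get_block(p, k, T, B):
    return ((1 << k) * pow(2, T - k * (p + 1), B)) // B


def update_buckets(job, t, f):
    i = t // (job.k * job.l)
    for j in range(job.l):
        p = i * job.l + j
        if job.T >= job.k * (p + 1):
            b = get_block(p, job.k, job.T, job.B)
            job.ys[j][b] = job.ys[j][b] * f % N


def fold(job):
    proof = 1
    for j in reversed(range(job.l)):
        proof = pow(proof, 1 << job.k, N)
        for b in range(1, 1 << job.k):
            proof = proof * pow(job.ys[j][b], b, N) % N
    return proof


def prove_batch(x0, jobs):
    """Return (proofs, squarings) for jobs that share x0 and the group."""
    proofs = [None] * len(jobs)
    pending = set(range(len(jobs)))
    t_max = max(job.T for job in jobs)
    f, squarings = x0 % N, 0
    for t in range(t_max + 1):
        for idx in sorted(pending):       # sorted() copies, so discard is safe
            job = jobs[idx]
            if t == job.T:                # completion event
                if f != job.y_ref:
                    raise ValueError(f"job {idx}: f(T) != y_ref")
                proofs[idx] = fold(job)
                job.ys = []               # free bucket memory early
                pending.discard(idx)
            elif t == job.next_cp:        # checkpoint event
                update_buckets(job, t, f)
                job.next_cp += job.k * job.l
        if t < t_max:
            f = f * f % N                 # the single shared squaring chain
            squarings += 1
    return proofs, squarings


jobs = [
    Job(T=300, k=3, l=2, B=1000003, y_ref=pow(3, 1 << 300, N)),
    Job(T=700, k=4, l=3, B=999983, y_ref=pow(3, 1 << 700, N)),
]
proofs, squarings = prove_batch(3, jobs)
```

In this run the two jobs together cost 700 squarings rather than 1000, matching the `T_max` versus `Σ T_j` comparison above.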

### Concurrency / offloading finalization

- The shared squaring chain itself is sequential by definition.
- Bucket updates are triggered by exact `f(t)` values; in our implementation they are done on the squaring callback thread to avoid copying forms or storing a large checkpoint history.
- Finalization (folding + serialization) is **per job** and can be offloaded:
  - Once a job reaches `T_j` and passes the `f(T_j)==y_ref_j` check, its proof no longer depends on future squaring.
  - We offload finalization to a `std::thread` per completed job so the squaring run can continue toward larger `T`.

### Where this lives

- New C ABI:
  - `bbr_chiavdf/src/c_bindings/fast_wrapper.h`:
    - `ChiavdfBatchJob`
    - `chiavdf_prove_one_weso_fast_streaming_getblock_opt_batch(...)`
    - `chiavdf_free_byte_array_batch(...)`
- Implementation:
  - `bbr_chiavdf/src/c_bindings/fast_wrapper.cpp`:
    - `BatchOneWesolowskiCallback`
    - `BatchJobState`
    - Uses `StreamingWesolowskiBuckets` per job

### Error policy (mismatch)

We keep a strict mismatch check specifically for debugging backend grouping / job data issues:

- If the computed output `f(T_j)` differs from `y_ref_form_j`, the batch function returns `NULL` (fatal error).

This should never happen in normal operation, but the check detects wrong grouping inputs early.

## 7) Rough resource model (what consumes time and RAM)

### Time

Three main contributors:

1. **Squaring chain** (`repeated_square(...)`): inherently sequential per group.
2. **Bucket updates**: `nucomp_form` multiplications at checkpoint times; scales with number of jobs and number of checkpoints.
3. **Finalization**: folding buckets into a proof; per job.

Trick 2 reduces (1) across jobs by reusing squaring work.

### Memory (per job, within a group)

Dominant memory terms:

- Buckets: `l * 2^k` forms (each form holds multiple GMP big ints) — often several MB per job.
- GetBlock precompute: `limit * l` `u32` values — often a few MB per job.

Peak memory per group is roughly linear in the number of jobs active at the same time (and drops as jobs complete and are freed).
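
A back-of-the-envelope calculator for the two dominant terms. The `bytes_per_form` default of 400 is an assumed placeholder, not a measured value (real GMP-backed forms carry allocator and limb overhead that this ignores), and the `(T, k, l)` values in the usage line are hypothetical.

```python
def estimate_job_memory(T: int, k: int, l: int, bytes_per_form: int = 400):
    """Return (bucket_bytes, getblock_table_bytes) for one job."""
    kl = k * l
    limit = -(-T // kl)                          # ceil(T / kl)
    bucket_bytes = l * (1 << k) * bytes_per_form  # l * 2^k forms
    table_bytes = limit * l * 4                   # limit * l u32 entries
    return bucket_bytes, table_bytes


bucket_bytes, table_bytes = estimate_job_memory(10_000_000, k=10, l=8)
```

With these hypothetical inputs the sketch gives about 3.3 MB of buckets and 4 MB of table, in line with the "few MB per job" order of magnitude stated above; multiply by the number of simultaneously active jobs for a group-level estimate.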

## 8) Things to look at next (possible improvement areas)

This section is intentionally a “menu” for further investigation.

1. **Hotspots inside classgroup arithmetic**
   - If perf shows most time in `nucomp_form` / GMP, then:
     - reduce allocations (GMP mpz churn) with pooling or reuse
     - explore alternative big-int backends / tuned GMP build / CPU-specific flags
     - reduce constant factors in `nucomp_form` (algorithmic / assembly improvements)

2. **Reduce per-iteration callback overhead**
   - Today `OnIteration` is called for every iteration, even though we only act on sparse “event times”.
   - If this overhead becomes visible at huge `T`, consider:
     - extending the core loop to support “next event” iteration skipping (intrusive change)
     - or internal batching in the callback path

3. **Finalization optimization**
   - Each job finalization constructs a reducer and folds buckets.
   - Potential wins:
     - reuse reducers per thread
     - reduce intermediate `form` temporaries and copies

4. **Group sizing / scheduling**
   - For Trick 2, there’s a throughput vs RAM tradeoff.
   - Consider dynamic group size based on memory budget and `T` distribution.

5. **Optional: parallelize bucket updates (hard)**
   - Bucket updates need the checkpoint form `f(t)` at exact times.
   - Parallelizing this without copying/storing forms requires careful design (e.g. immutable snapshots, reference counting, or storing a checkpoint history).
   - This is the next “big step” if per-job proof work becomes the bottleneck even after squaring reuse.

49 changes: 49 additions & 0 deletions docs/bluebox_compaction.md
@@ -0,0 +1,49 @@
# Bluebox Compaction Optimizations

This document describes the compaction-oriented proving path exposed by
`src/c_bindings/fast_wrapper.h` and implemented in
`src/c_bindings/fast_wrapper.cpp`.

## Scope

These APIs are intended for workloads where the expected VDF output (`y_ref`) is
already known up front (for example, bluebox compaction jobs). They are additive
and do not change the existing `c_wrapper` APIs.

## Optimization 1: Streaming one-wesolowski

Given `y_ref`, the prover computes:

- `B = GetB(D, x, y_ref)` before squaring starts

This enables a streaming algorithm that updates proof buckets at each
checkpoint during repeated squaring, instead of materializing the full
intermediate checkpoint array and scanning it after the loop. In practice this
substantially reduces memory usage for compaction workloads.

## Optimization 2: Incremental GetBlock mapping

For streaming checkpoint updates, bucket index selection repeatedly calls
`GetBlock(p, k, T, B)`. The optimized mode keeps a rolling modular state and
advances sequential `p` values incrementally, avoiding full modular
exponentiation per call and avoiding a large lookup table.
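
One way to picture the rolling state is a generator that holds a single residue and advances it once per `p`. This is a sketch under assumptions, not the wrapper's code: `rolling_get_block` is a hypothetical name, Python integers stand in for GMP values, `B` is assumed to be an odd prime, and `pow(x, -1, B)` requires Python 3.8+.

```python
def rolling_get_block(T: int, k: int, B: int):
    """Yield GetBlock(p, k, T, B) for p = 0, 1, ... while k*(p+1) <= T."""
    if T < k:
        return                          # no valid p at all
    inv = pow(pow(2, k, B), -1, B)      # modular inverse of 2^k mod B
    r = pow(2, T - k, B)                # residue for p = 0
    for _ in range(T // k):
        yield (r << k) // B             # bucket index for the current p
        r = r * inv % B                 # advance the state to p + 1


blocks = rolling_get_block(1003, 5, 1000003)
first = next(blocks)                    # consumed lazily, one p at a time
```

Because the generator keeps only `r` and `inv`, memory stays constant regardless of `T`, at the cost of requiring callers to consume `p` values strictly in order.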

## Optimization 3: Memory-budgeted (k, l) tuning

The wrapper can tune `(k, l)` under a configured memory budget:

- `chiavdf_set_bucket_memory_budget_bytes(...)`

If no tuned candidate is found, the code falls back to the standard parameter
heuristics.

## Operational Notes

- The `fast_wrapper` code path sets one-wesolowski mode and uses `quiet_mode` to
avoid unsolicited stdout noise when embedded in multi-worker clients.
- Thread-slot assignment for the fast VDF counters is per-thread via
`vdf_fast_pairindex()`, avoiding slot collisions when multiple VDF computations
run in one process.
- The production default for `enable_threads` in `parameters.h` is unchanged from
upstream to preserve timelord expectations.
