| 1 | +# Batched Private Kernels Handoff |
| 2 | + |
| 3 | +## Current Branch |
| 4 | + |
| 5 | +Branch: `lde/n1-apps` |
| 6 | + |
| 7 | +Base observed locally: `merge-train/barretenberg` at `f2fd2bfbcc` |
| 8 | + |
| 9 | +Commits on top of the base: |
| 10 | + |
| 11 | +- `f172996cf5 additional test` |
| 12 | +- `ab8c0cfdc9 PoC N=3 test suite` |
| 13 | +- `cf50be9ea1 init_3 prototype` |
| 14 | +- `46529d485a inner_3` |
| 15 | +- `fd2b57bed1 share logic` |
| 16 | +- `4771dd8ee2 tests showing equivalence of *_3 kernels with old equivalents` |
| 17 | +- `ed18d04d03 more failure tests` |
| 18 | +- `c4e40a9c35 inner_6 PoC - no surprises` |
| 19 | + |
| 20 | +The shared next-call execution cleanup, `_3` equivalence tests, extra `_3` negative tests, and explicit `inner_6` |
| 21 | +compile/profile spike are all committed on the branch. This handoff doc is currently untracked. |
| 22 | + |
| 23 | +## How We Got Here |
| 24 | + |
| 25 | +The initial design direction was to make the first app call in the init kernel mirror the later app calls more closely: |
| 26 | +construct an output unconstrained, then validate that constructed output with constrained logic. The goal was to reduce |
| 27 | +the conceptual difference between init slot 0 and subsequent inner-call slots. |
| 28 | + |
| 29 | +That path was tried and then discarded. It increased gate counts and did not materially simplify the implementation. |
| 30 | +The asymmetry in init is real: slot 0 establishes transaction-wide state, handles protocol nullifier injection, and |
| 31 | +sets init-only fields. Treating it as just another app-slot transition obscured more than it helped. |
| 32 | + |
| 33 | +After dropping that path, the work moved to direct fixed-width prototypes: |
| 34 | + |
| 35 | +1. Build tests around init-kernel behavior with multiple apps. |
| 36 | +2. Extend the tests from two apps to three apps. |
| 37 | +3. Add a concrete `private_kernel_init_3` prototype. |
| 38 | +4. Add a concrete `private_kernel_inner_3` prototype. |
| 39 | +5. Notice that the post-slot-0 transition logic is identical for `init_3` slots 1 and 2 and all `inner_3` slots. |
| 40 | +6. Extract that common transition into a shared helper. |
| 41 | + |
| 42 | +## What The Branch Adds |
| 43 | + |
| 44 | +### Test Coverage |
| 45 | + |
| 46 | +`f172996cf5` adds an inner output-composition test: |
| 47 | + |
| 48 | +- `expiration_timestamp_pick_contract_update_horizon` |
| 49 | + |
| 50 | +This pins the rule that expiration timestamp reduction includes the contract update horizon derived from the anchor |
| 51 | +block timestamp plus `DEFAULT_UPDATE_DELAY - 1`. |
| 52 | + |
| 53 | +`ab8c0cfdc9` adds `private_kernel_batch_spike` tests. These exercise fixed three-call behavior without changing PXE |
| 54 | +scheduling: |
| 55 | + |
| 56 | +- `batch_3_accumulates_side_effects_across_slots` |
| 57 | +- `batch_3_linear_chain_consumes_all_calls` |
| 58 | +- `batch_3_linear_chain_matches_sequential_kernels` |
| 59 | +- `batch_3_depth_first_child_keeps_sibling_on_stack` |
| 60 | +- `batch_3_depth_first_with_sibling_matches_sequential_kernels` |
| 61 | +- `batch_3_second_call_must_match_first_call_stack_fails` |
| 62 | +- `batch_3_third_call_must_match_second_call_stack_fails` |
| 63 | +- `batch_3_fee_payer_conflict_fails` |
| 64 | +- `batch_3_public_teardown_conflict_fails` |
| 65 | +- `batch_3_min_revertible_side_effect_counter_conflict_fails` |
| 66 | +- `batch_3_static_call_requires_static_nested_private_call_fails` |
| 67 | +- `batch_3_static_call_restrictions_apply_to_next_slot_fails` |
| 68 | +- `inner_3_accumulates_side_effects_after_previous_kernel` |
| 69 | +- `inner_3_with_previous_side_effects_matches_sequential_kernels` |
| 70 | +- `inner_3_linear_chain_consumes_all_calls` |
| 71 | +- `inner_3_linear_chain_matches_sequential_kernels` |
| 72 | + |
| 73 | +The tests cover the main properties that first make batching interesting: accumulated side effects across slots, |
| 74 | +slot-to-slot private-call-stack chaining, depth-first ordering with a sibling left on the stack, and intra-batch |
| 75 | +set-once aggregate conflicts. |
| 76 | + |
| 77 | +The negative tests now pin the main cross-slot constraints for the fixed `N = 3` prototype: |
| 78 | + |
| 79 | +- slot 1 must consume the request produced or exposed by slot 0; |
| 80 | +- slot 2 must consume the request produced or exposed by slot 1; |
| 81 | +- fee payer cannot be set twice in one batch; |
| 82 | +- public teardown request cannot be set twice in one batch; |
| 83 | +- non-zero `min_revertible_side_effect_counter` cannot be set twice in one batch; |
| 84 | +- static calls can only create static nested private calls; |
| 85 | +- static-call side-effect restrictions apply to a later slot reached through a static request. |
| 86 | + |
| 87 | +The relation-equivalence tests compare full `PrivateKernelCircuitPublicInputs` field-by-field: |
| 88 | + |
| 89 | +- `init_3(private_call_0, private_call_1, private_call_2)` equals existing `init(private_call_0)` followed by two |
| 90 | + existing `inner` executions; |
| 91 | +- `inner_3(previous_kernel, private_call_0, private_call_1, private_call_2)` equals three existing `inner` executions; |
| 92 | +- coverage includes a linear chain, a depth-first shape with a sibling left on the stack, and an inner case with |
| 93 | + previous accumulated side effects. |
| 94 | + |
| 95 | +Validation performed: |
| 96 | + |
| 97 | +- `/mnt/user-data/luke/aztec-packages/noir/noir-repo/target/release/nargo test --package private_kernel_lib --silence-warnings --skip-brillig-constraints-check` |
| 98 | +- result: 833 tests passed. |
| 99 | + |
| 100 | +### Init 3 Prototype |
| 101 | + |
| 102 | +`cf50be9ea1` adds: |
| 103 | + |
| 104 | +- `crates/private-kernel-init-3/Nargo.toml` |
| 105 | +- `crates/private-kernel-init-3/src/main.nr` |
| 106 | +- `private_kernel_init_3.nr` |
| 107 | +- workspace wiring in `Nargo.template.toml` |
| 108 | + |
| 109 | +The `private-kernel-init-3` entrypoint accepts: |
| 110 | + |
| 111 | +- init scalars: `tx_request`, `vk_tree_root`, `protocol_contracts`, `is_private_only`, |
| 112 | + `first_nullifier_hint`, and `revertible_counter_hint`; |
| 113 | +- three `PrivateCallDataWithoutPublicInputs` values; |
| 114 | +- three app public input databus columns: `call_data(1)`, `call_data(2)`, and `call_data(3)`. |
| 115 | + |
| 116 | +The library implementation runs the existing one-app init kernel for `private_call_0`, then applies two inner-call |
| 117 | +transitions for `private_call_1` and `private_call_2`. |
| 118 | + |
| 119 | +### Inner 3 Prototype |
| 120 | + |
| 121 | +`46529d485a` adds: |
| 122 | + |
| 123 | +- `crates/private-kernel-inner-3/Nargo.toml` |
| 124 | +- `crates/private-kernel-inner-3/src/main.nr` |
| 125 | +- `private_kernel_inner_3.nr` |
| 126 | +- workspace wiring in `Nargo.template.toml` |
| 127 | + |
| 128 | +The `private-kernel-inner-3` entrypoint accepts: |
| 129 | + |
| 130 | +- one previous kernel, with public inputs on `call_data(0)`; |
| 131 | +- three `PrivateCallDataWithoutPublicInputs` values; |
| 132 | +- three app public input databus columns: `call_data(1)`, `call_data(2)`, and `call_data(3)`. |
| 133 | + |
| 134 | +The library implementation verifies the previous kernel, validates its VK against the allowed previous-circuit set, then |
| 135 | +applies three inner-call transitions in sequence. |
| 136 | + |
| 137 | +### Inner 6 Compile/Profile Spike |
| 138 | + |
| 139 | +An explicit `private_kernel_inner_6` spike has been added without introducing a Noir-level array loop: |
| 140 | + |
| 141 | +- `crates/private-kernel-inner-6/Nargo.toml` |
| 142 | +- `crates/private-kernel-inner-6/src/main.nr` |
| 143 | +- `private_kernel_inner_6.nr` |
| 144 | +- workspace wiring in `Nargo.template.toml` |
| 145 | + |
| 146 | +The implementation follows the same explicit pattern as `inner_3`: verify the external previous kernel once, validate |
| 147 | +its VK, then call `execute_next_private_call` six times. |
| 148 | + |
| 149 | +Validation performed: |
| 150 | + |
| 151 | +- `/mnt/user-data/luke/aztec-packages/noir/noir-repo/target/release/nargo compile --package private_kernel_inner_6 --force --silence-warnings --skip-brillig-constraints-check` |
| 152 | +- `/mnt/user-data/luke/aztec-packages/noir/noir-repo/target/release/noir-profiler opcodes --artifact-path target/private_kernel_inner_6.json --output /tmp/private-kernel-inner-6-opcodes` |
| 153 | +- `/mnt/user-data/luke/aztec-packages/noir/noir-repo/target/release/nargo test --package private_kernel_lib --silence-warnings --skip-brillig-constraints-check` |
| 154 | + |
| 155 | +Results: |
| 156 | + |
| 157 | +- `target/private_kernel_inner_6.json`: about 2.4 MiB, bytecode length `1413780` |
| 158 | +- `private_kernel_inner_6`: `main` has `89566` ACIR opcodes |
| 159 | +- `private_kernel_lib`: 833 tests passed |
| 160 | + |
| 161 | +The `inner_6` ACIR count matches the linear projection from `inner_1` and `inner_3`: |
| 162 | + |
| 163 | +- `inner_1`: `18256` main ACIR opcodes |
| 164 | +- `inner_3`: `46780` main ACIR opcodes |
| 165 | +- projected `inner_6`: `18256 + 5 * ((46780 - 18256) / 2) = 89566` |
| 166 | +- measured `inner_6`: `89566` |
| 167 | + |
| 168 | +This confirms that the current explicit repeated-transition design scales linearly at the ACIR level: each |
| 169 | +additional app slot costs `14262` main ACIR opcodes. |
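The projection arithmetic can be reproduced with a few lines of TypeScript. This is only a worked check of the numbers quoted above; the variable and function names are illustrative and are not part of the branch.

```typescript
// Measured `main` ACIR opcode counts quoted in this document.
const inner1 = 18256;
const inner3 = 46780;
const measuredInner6 = 89566;

// Marginal cost of one additional app slot, derived from widths 1 and 3.
const perSlot = (inner3 - inner1) / 2;

// Linear projection for a width n >= 1 (hypothetical helper, for illustration).
function projectInner(n: number): number {
  return inner1 + (n - 1) * perSlot;
}

const projectedInner6 = projectInner(6);
console.log(perSlot, projectedInner6, projectedInner6 === measuredInner6);
// prints: 14262 89566 true
```

The same helper gives a rough expectation for the widths that have not been built yet (`inner_2`, `inner_4`, `inner_5`), assuming the linear trend holds for them as well.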
| 170 | + |
| 171 | +### Shared Transition Helper |
| 172 | + |
| 173 | +`fd2b57bed1` adds `private_kernel_batch.nr` and wires it as `pub(crate)` from `private-kernel-lib`. |
| 174 | + |
| 175 | +The helper: |
| 176 | + |
| 177 | +1. validates the next app as an inner call against the previous kernel public inputs; |
| 178 | +2. composes the next output in unconstrained code by cloning the previous output, popping the top private call |
| 179 | +   request, and appending the current private call effects; |
| 180 | +3. optionally validates the composed output with `PrivateKernelCircuitOutputValidator::validate_as_inner_call`. |
| 181 | + |
| 182 | +Both `private_kernel_init_3` and `private_kernel_inner_3` now call this helper for every post-init app transition. |
| 183 | + |
| 184 | +### Entrypoint Compile / ACIR Integration Proof |
| 185 | + |
| 186 | +The actual `_3` circuit packages compile through their `main.nr` entrypoints with databus public inputs, not just |
| 187 | +through library tests: |
| 188 | + |
| 189 | +- `/mnt/user-data/luke/aztec-packages/noir/noir-repo/target/release/nargo compile --package private_kernel_init_3 --force --silence-warnings --skip-brillig-constraints-check` |
| 190 | +- `/mnt/user-data/luke/aztec-packages/noir/noir-repo/target/release/nargo compile --package private_kernel_inner_3 --force --silence-warnings --skip-brillig-constraints-check` |
| 191 | + |
| 192 | +Artifacts: |
| 193 | + |
| 194 | +- `target/private_kernel_init_3.json`: about 1.4 MiB, bytecode length `574152` |
| 195 | +- `target/private_kernel_inner_3.json`: about 1.6 MiB, bytecode length `756008` |
| 196 | + |
| 197 | +`noir-profiler opcodes` can inspect both artifacts: |
| 198 | + |
| 199 | +- `private_kernel_init_3`: `main` has `37381` ACIR opcodes |
| 200 | +- `private_kernel_inner_3`: `main` has `46780` ACIR opcodes |
| 201 | + |
| 202 | +BB gate counts are not currently available for these Chonk artifacts. `bb gates --scheme chonk` fails on the current |
| 203 | +artifacts, so ACIR opcode counts are the useful local inspection tool until the required barretenberg support exists. |
| 204 | + |
| 205 | +### Related BB Work: PR #22640 |
| 206 | + |
| 207 | +PR `#22640` (`29a4f46c95 Multi app scaffolding`) is relevant but not sufficient by itself. It starts generalizing |
| 208 | +barretenberg's Chonk databus shape from one secondary app calldata column to indexed app calldata slots: |
| 209 | + |
| 210 | +- introduces `NUM_APP_PER_KERNEL`; |
| 211 | +- renames the databus layout to kernel calldata, app calldata, and return data; |
| 212 | +- changes kernel public inputs to carry an array of app return-data commitments; |
| 213 | +- allows ACIR `call_data(id)` with app ids in `[1, NUM_APP_PER_KERNEL]`; |
| 214 | +- threads an app return-data index through Chonk recursive verification. |
| 215 | + |
| 216 | +The current PR still has `NUM_APP_PER_KERNEL = 1` and explicitly asserts that multiple app calldata witness columns are |
| 217 | +not wired yet. So it does not make `private_kernel_inner_3` or `private_kernel_inner_6` work under Chonk today. It does |
| 218 | +identify the next BB integration seam: raise `NUM_APP_PER_KERNEL` and finish wiring multiple app calldata witness |
| 219 | +commitments through Mega/Chonk, then retry `bb gates --scheme chonk` on the `_3` and `_6` artifacts. |
| 220 | + |
| 221 | +## Current Interpretation |
| 222 | + |
| 223 | +The branch is a fixed-width circuit prototype, not a final batching implementation. |
| 224 | + |
| 225 | +It now includes a committed `N = 3` relation proof suite and an explicit `inner_6` compile/profile spike. It |
| 226 | +intentionally does not yet include: |
| 227 | + |
| 228 | +- dynamic `num_apps`; |
| 229 | +- inactive-slot padding; |
| 230 | +- reset-aware batch selection; |
| 231 | +- PXE scheduling changes; |
| 232 | +- TypeScript input classes or witness conversion for the new circuits; |
| 233 | +- artifact naming or VK integration decisions; |
| 234 | +- the full fixed-width `N = 1..6` family beyond the committed `init_3`, `inner_3`, and `inner_6` artifacts. |
| 235 | + |
| 236 | +The useful result so far is narrower and clearer: after the init-only first slot, the app transition logic is the same |
| 237 | +for init-derived and inner-derived batches. That shared logic can be factored without pretending that init slot 0 is |
| 238 | +symmetrical with later slots. |
| 239 | + |
| 240 | +## Reset-Aware Scheduling Notes |
| 241 | + |
| 242 | +The current PXE loop already uses lookahead before processing a non-first app. It builds a reset input builder from the |
| 243 | +latest kernel output and the still-unpopped execution stack. If the top pending app would overflow one of the |
| 244 | +resettable dimensions, PXE runs one or more reset kernels first, then processes that app with an inner kernel. |
| 245 | + |
| 246 | +From Chonk's accumulated circuit-chain perspective, a mid-flow reset still always comes after a kernel has processed |
| 247 | +some prior app: |
| 248 | + |
| 249 | +- `app_{i-1}` |
| 250 | +- `inner_{i-1}(previous_kernel, app_{i-1})` |
| 251 | +- `reset(inner_{i-1})` |
| 252 | +- `app_i` |
| 253 | +- `inner_i(reset, app_i)` |
| 254 | + |
| 255 | +So the reset is "before app_i's inner" only from the PXE planner's perspective. It is not inserted between `app_i` and |
| 256 | +the kernel that consumes `app_i`; Chonk should see each app circuit immediately before the kernel that recursively |
| 257 | +verifies/consumes it. |
| 258 | + |
| 259 | +For fixed-width kernels, the intended rule is: |
| 260 | + |
| 261 | +- choose the largest contiguous prefix up to width 6 that can be processed without needing a reset before any app in |
| 262 | + that prefix; |
| 263 | +- emit the corresponding `init_N` or `inner_N`; |
| 264 | +- if the next app would overflow, emit one or more reset kernels after the batch; |
| 265 | +- continue with the next app after reset. |
| 266 | + |
| 267 | +Equivalently, a batch may end before a reset, but it must not cross a reset boundary. |
| 268 | + |
| 269 | +The extra artifact names and VKs for widths 1 through 6 are plumbing, not a conceptual blocker. The real scheduling |
| 270 | +risk is lookahead correctness. The existing `PrivateKernelResetPrivateInputsBuilder.needsReset()` can already answer |
| 271 | +"would this next app require a reset?" without oracle work, but it only accepts one `nextIteration` from the current |
| 272 | +top of the execution stack. A width planner needs to repeatedly ask that question against a hypothetical accumulated |
| 273 | +kernel state while tentatively appending apps to the candidate batch. |
| 274 | + |
| 275 | +There is not currently a production TypeScript equivalent of Noir's `PrivateKernelCircuitOutputComposer`. A planner |
| 276 | +therefore needs a small dry-run accumulator that mirrors enough of init/inner output composition to support reset |
| 277 | +lookahead: |
| 278 | + |
| 279 | +- append note hash read requests, nullifier read requests, key validation requests, note hashes, nullifiers, logs, |
| 280 | + public calls, and private call requests; |
| 281 | +- pop the private call request consumed by each tentative inner slot and push nested private calls in the same |
| 282 | + depth-first order as the current PXE loop; |
| 283 | +- track fee payer, public teardown request, min revertible counter, and expiration timestamp consistently with the |
| 284 | + Noir composer; |
| 285 | +- for `init_N`, account for init-only setup and possible protocol-nullifier injection before applying later slots. |
| 286 | + |
| 287 | +This should be efficient because the maximum lookahead width is 6 and the expensive reset-builder work happens in |
| 288 | +`build()`, not in `needsReset()`. The main implementation risk is divergence between the TypeScript dry-run accumulator |
| 289 | +and the Noir composer, not asymptotic cost. |
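The prefix rule and the dry-run accumulator described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the PXE API: `Step`, `DryRun`, `planBatches`, `MAX_RESETS`, and the toy model are all hypothetical names, and the sketch assumes a flat, already depth-first-ordered list of apps, whereas the real loop discovers nested calls as apps are processed.

```typescript
// Hypothetical types; none of these exist in the codebase.
type Step = { kind: "init" | "inner"; width: number } | { kind: "reset" };

interface DryRun<State, App> {
  // Stand-in for the reset builder's needsReset(): would `app` overflow `state`?
  needsReset(state: State, app: App): boolean;
  // Stand-in for init/inner output composition of one app slot.
  append(state: State, app: App): State;
  // Stand-in for the effect of one reset kernel on the accumulated state.
  reset(state: State): State;
}

const MAX_WIDTH = 6;
const MAX_RESETS = 8; // arbitrary guard against a reset loop that never converges

function planBatches<State, App>(apps: App[], initial: State, dry: DryRun<State, App>): Step[] {
  const steps: Step[] = [];
  let state = initial;
  let width = 0; // apps in the currently open batch
  let emittedInit = false;
  for (const app of apps) {
    // As in the current PXE loop, lookahead applies only to non-first apps.
    const isFirstApp = !emittedInit && width === 0;
    const mustReset = !isFirstApp && dry.needsReset(state, app);
    // Close the open batch before a reset boundary or once it is full.
    if (width > 0 && (mustReset || width === MAX_WIDTH)) {
      steps.push({ kind: emittedInit ? "inner" : "init", width });
      emittedInit = true;
      width = 0;
    }
    let resets = 0;
    while (mustReset && dry.needsReset(state, app)) {
      resets += 1;
      if (resets > MAX_RESETS) throw new Error("app does not fit even after resets");
      state = dry.reset(state);
      steps.push({ kind: "reset" });
    }
    state = dry.append(state, app);
    width += 1;
  }
  if (width > 0) steps.push({ kind: emittedInit ? "inner" : "init", width });
  return steps;
}

// Toy model: state counts pending side effects, capacity is 4, a reset clears it.
const toy: DryRun<number, number> = {
  needsReset: (used, size) => used + size > 4,
  append: (used, size) => used + size,
  reset: () => 0,
};
console.log(planBatches([1, 1, 1, 2, 1, 1], 0, toy));
```

With the toy model, the fourth app would overflow the capacity, so the plan is `init_3`, one reset, then `inner_3`. A batch ends before the reset and never crosses it.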
| 290 | + |
| 291 | +## Recommended Next Phase |
| 292 | + |
| 293 | +The Noir-level `_3` prototype is now hardened enough to move from "prove the relation" to "complete the fixed-width |
| 294 | +family and unblock real Chonk measurements." The next phase should keep the explicit fixed-width design and avoid a |
| 295 | +fused/dynamic circuit redesign. |
| 296 | + |
| 297 | +### 1. Scale Remaining Fixed Widths Mechanically |
| 298 | + |
| 299 | +Do not introduce a Noir-level generic fixed-array loop unless there is a clear measured reason. It may change the |
| 300 | +compiled circuit shape through array/indexing/loop lowering. Prefer explicit source in each circuit, either written |
| 301 | +manually or produced by a generator that emits explicit calls: |
| 302 | + |
| 303 | +- `let output_1 = execute_next_private_call(output_0, inputs.private_call_1);` |
| 304 | +- `let output_2 = execute_next_private_call(output_1, inputs.private_call_2);` |
| 305 | +- and so on. |
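If the generator route is chosen, the loop can live in the generator rather than in Noir, so the emitted source stays fully explicit. A minimal sketch, assuming a hypothetical `emitTransitionLines` helper that reuses the `output_k` and `inputs.private_call_k` naming from the lines above:

```typescript
// Hypothetical generator helper: emits the explicit transition lines for
// slots 1..width-1. Slot 0 (init, or the first inner transition) is assumed
// to be written by hand and to bind `output_0`.
function emitTransitionLines(width: number): string[] {
  if (!Number.isInteger(width) || width < 1 || width > 6) {
    throw new Error(`unsupported width: ${width}`);
  }
  const lines: string[] = [];
  for (let slot = 1; slot < width; slot++) {
    lines.push(
      `let output_${slot} = execute_next_private_call(output_${slot - 1}, inputs.private_call_${slot});`,
    );
  }
  return lines;
}

console.log(emitTransitionLines(3).join("\n"));
```

For width 3 this prints exactly the two example lines listed above. Width 1 yields no lines, because there is no transition after slot 0.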
| 306 | + |
| 307 | +The branch already has `init_3`, `inner_3`, and `inner_6`. Add the remaining fixed-width wrappers: |
| 308 | + |
| 309 | +- `private_kernel_init_1` if product integration wants a width-dispatched family rather than treating existing init as |
| 310 | + width 1; |
| 311 | +- `private_kernel_init_2`, `private_kernel_init_4`, `private_kernel_init_5`, `private_kernel_init_6`; |
| 312 | +- `private_kernel_inner_2`, `private_kernel_inner_4`, `private_kernel_inner_5`; |
| 313 | +- keep existing `private_kernel_inner` as width 1 or add an alias/package if the TypeScript dispatch layer wants a |
| 314 | + uniform `inner_1..6` naming scheme; |
| 315 | +- workspace wiring and minimal smoke/equivalence coverage for each. |
| 316 | + |
| 317 | +The implementation should be mostly mechanical: explicit entrypoints and structs per circuit, shared single-step |
| 318 | +transition execution in the library. |
| 319 | + |
| 320 | +### 2. Continue BB Multi-App Databus Work |
| 321 | + |
| 322 | +PR `#22640` gives a concrete BB starting point but leaves `NUM_APP_PER_KERNEL = 1`. The next useful BB spike is: |
| 323 | + |
| 324 | +1. apply or rebase onto the PR's multi-app scaffolding; |
| 325 | +2. raise `NUM_APP_PER_KERNEL` locally, preferably to `6`; |
| 326 | +3. fix the resulting witness-commitment/databus failures; |
| 327 | +4. run `bb gates --scheme chonk` against `private_kernel_init_3`, `private_kernel_inner_3`, and |
| 328 | + `private_kernel_inner_6`. |
| 329 | + |
| 330 | +This is the path to real Chonk gate counts. ACIR opcode counts are already enough to show linear Noir-level scaling, |
| 331 | +but not enough to decide product economics. |
| 332 | + |
| 333 | +### 3. Measure Before PXE Product Integration |
| 334 | + |
| 335 | +Before adding PXE or TypeScript integration, the fixed-width family should eventually be measured against the current |
| 336 | +one-app path: |
| 337 | + |
| 338 | +- bytecode size and compiled artifact size; |
| 339 | +- gate counts; |
| 340 | +- relevant ACIR opcode deltas; |
| 341 | +- proving-key or VK-size impact if available; |
| 342 | +- whether `init_N` is cheaper than `init + (N - 1) * inner`; |
| 343 | +- whether `inner_N` is cheaper than `N * inner`. |
| 344 | + |
| 345 | +Those measurements are not yet practical for these prototypes because realistic Chonk gate/proving measurements |
| 346 | +still require the BB multi-app databus work above. Do not block the remaining mechanical Noir wrappers on that, but do |
| 347 | +treat Chonk measurements as the gate before PXE/product integration. |
| 348 | + |
| 349 | +### 4. Prototype Reset-Aware TS Planning |
| 350 | + |
| 351 | +Once the fixed-width family and BB support are in place, the next product-facing task is a planner that chooses the |
| 352 | +largest safe prefix up to width 6 without crossing reset boundaries. The main new code should be a TypeScript dry-run |
| 353 | +accumulator that mirrors enough of Noir's private-kernel output composer to ask the existing reset builder whether the |
| 354 | +next candidate app would require a reset. |
| 355 | + |
| 356 | +This should be tested with synthetic app public inputs that force boundaries such as: |
| 357 | + |
| 358 | +- `init_3 -> reset -> inner_3`; |
| 359 | +- `init_2 -> reset -> inner_4`; |
| 360 | +- consecutive resets before the next batch; |
| 361 | +- depth-first nested-call ordering where processing one app exposes new candidate apps. |
| 362 | + |
| 363 | +## Open Questions |
| 364 | + |
| 365 | +- Should the shared helper keep the `private_kernel_batch` name, or use a more literal transition name until dynamic |
| 366 | + batching exists? |
| 367 | +- Should the remaining fixed-width Noir sources be maintained manually, or generated by a small source generator that |
| 368 | + emits explicit calls? |
| 369 | +- Should batched init/inner VKs get distinct named indexes, or use a reset-style range abstraction for allowed previous |
| 370 | + kernels? |
| 371 | +- What Chonk gate/prover threshold justifies moving from fixed-width experiments into PXE scheduling work? |