Commit e0a892a

georgehao authored
halo srs (#1805)
Co-authored-by: georgehao <hongfan@scroll.io>
1 parent 443d03c commit e0a892a

2 files changed

Lines changed: 144 additions & 0 deletions

tests/shadow-testing/docs/LESSONS_LEARNED.md

Lines changed: 109 additions & 0 deletions
@@ -1152,3 +1152,112 @@ After the fix, all bundles 13450–13454 finalized successfully on-chain:
- [ ] Verifier digests match proofs
- [ ] Relayer started with `--config <path>` and `--min-codec-version 10`
- [ ] Target bundles/batches reset to `rollup_status = 1`

---

## 2026-06-09: Mainnet Shadow Fork, Re-prove + Real On-Chain Finalize (Bundles 17297–17301)

Full end-to-end run on a **mainnet** Anvil fork using the dockerized coordinator/prover images
(`zhuoatscroll/{coordinator-api,prover}:v4.7.13-openvm16`). The run imported bundles 17297–17301
(batches 517761–517765, codec v10, one batch per bundle) from mainnet RDS, cleared the production proofs,
re-proved all 20 chunks + 5 batches + 5 bundles locally on 4×RTX 3090, deployed a fresh
`ZkEvmVerifierPostFeynman`, and finalized all 5 bundles with **real** `finalizeBundlePostEuclidV2`
transactions. `lastFinalizedBatchIndex` advanced 517760 → 517765 and `finalizedStateRoots[517765]`
matched the DB batch state root. Three new traps surfaced; they are documented below.

### Trap A: halo2 SRS files must live in `~/.openvm/params/`, NOT `~/.openvm/`

**Symptom**: chunk and batch proofs succeed, but the **first bundle proof** crashes the prover at the
halo2 wrapping stage:
```
thread 'tokio-rt-worker' panicked at .../halo2/utils.rs:127:
Params file "/root/.openvm/params/kzg_bn254_23.srs" does not exist
```
The container exits and the bundle stays stuck at `proving_status = 2`.

**Root cause**: `CacheHalo2ParamsReader` reads the KZG SRS from `$HOME/.openvm/params/kzg_bn254_{k}.srs`
(openvm `extensions/native/recursion/src/halo2/utils.rs`). Only the bundle proof's `halo2_outer` /
`halo2_wrapper` stages need it (k = 22/23/24); chunk/batch proofs use smaller in-tree params, so the
problem stays hidden until the first bundle reaches halo2. If the `.srs` files are downloaded or placed in
the `~/.openvm/` root (or any other directory), they are not found, and nothing reports it until the
bundle proof panics.

**Fix**: ensure the SRS files are under the `params/` subdirectory of the mounted openvm dir:
```bash
mkdir -p ~/.openvm/params
# Only needed if the files landed in the wrong place:
mv ~/.openvm/kzg_bn254_2{2,3,4}.srs ~/.openvm/params/
# Files: kzg_bn254_22.srs (~513MB), kzg_bn254_23.srs (~1.1GB), kzg_bn254_24.srs (~2.1GB)
```
When running the prover in Docker, mount the host openvm dir to `/root/.openvm` (writable) and confirm
that `/root/.openvm/params/kzg_bn254_23.srs` resolves inside the container.

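Since the failure only shows up at the first bundle proof, a preflight check before starting the provers can catch it earlier. The following is a minimal sketch, not part of openvm or the prover image: the function name `check_srs_params`, the `OPENVM_DIR` variable, and the messages are illustrative assumptions. It only checks that the three files named above exist and are non-empty; it does not validate their contents.

```bash
# Preflight sketch: verify the halo2 SRS files sit under <openvm_dir>/params/.
# check_srs_params and OPENVM_DIR are hypothetical names, not openvm APIs.

check_srs_params() {
  local openvm_dir="${1:-$HOME/.openvm}"
  local missing=0 k f
  for k in 22 23 24; do
    f="$openvm_dir/params/kzg_bn254_${k}.srs"
    if [ ! -s "$f" ]; then
      echo "MISSING: $f"
      # Point out the misplacement described in this trap.
      if [ -e "$openvm_dir/kzg_bn254_${k}.srs" ]; then
        echo "  found in $openvm_dir/ root; move it into params/"
      fi
      missing=1
    fi
  done
  return "$missing"
}

# Report status without aborting the calling shell.
if check_srs_params "${OPENVM_DIR:-$HOME/.openvm}"; then
  echo "SRS files present"
else
  echo "SRS files incomplete"
fi
```

Run it on the host against the directory that will be mounted at `/root/.openvm`, so the check covers the same files the container will see.
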
### Trap B: prover Docker `--gpus device=N` renumbers the GPU to index 0 inside the container

**Symptom**: the prover container exits immediately (code 139) with:
```
CudaError { code: 100, name: "cudaErrorNoDevice", message: "no CUDA-capable device is detected" }
```
Only the prover on GPU 0 works; the provers for GPUs 1/2/3 crash on boot.

**Root cause**: `docker run --gpus "device=N"` exposes **only** that one GPU to the container and
**renumbers it to index 0** inside. Setting `CUDA_VISIBLE_DEVICES=N` (the host index) then points at a
device that doesn't exist in the container.

**Fix**: pair `--gpus "device=$i"` with `CUDA_VISIBLE_DEVICES=0` (the only visible device in-container):
```bash
docker run -d --name shadow-prover-$i --network host \
  --gpus "device=$i" -e CUDA_VISIBLE_DEVICES=0 -e RUST_MIN_STACK=16777216 \
  -v .../prover-$i.json:/prover/conf/config.json:ro \
  -v .../prover-$i:/prover/.work -v ~/.openvm:/root/.openvm \
  zhuoatscroll/prover:v4.7.13-openvm16 --config /prover/conf/config.json
```
(Alternative: `--gpus all` + `CUDA_VISIBLE_DEVICES=$i`.)

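With four GPUs the same command is issued four times with only `$i` changing, so a small generator keeps the host index and the in-container index from being mixed up. This is a sketch under assumptions: `CONF_DIR` and `WORK_DIR` are hypothetical placeholders for the host paths elided in the command above, and the helper names are illustrative. It only prints the commands; nothing is executed.

```bash
# Sketch: print one `docker run` command per host GPU (nothing is executed).
# CONF_DIR and WORK_DIR are hypothetical placeholders for the host paths that
# are elided ("...") in the original command.
CONF_DIR="${CONF_DIR:-/path/to/conf}"
WORK_DIR="${WORK_DIR:-/path/to/work}"
IMAGE="zhuoatscroll/prover:v4.7.13-openvm16"

prover_cmd() {
  local i="$1"
  # --gpus "device=$i" selects host GPU $i, which the container sees as
  # index 0, so CUDA_VISIBLE_DEVICES is always 0 here.
  printf 'docker run -d --name shadow-prover-%s --network host' "$i"
  printf ' --gpus "device=%s" -e CUDA_VISIBLE_DEVICES=0 -e RUST_MIN_STACK=16777216' "$i"
  printf ' -v %s/prover-%s.json:/prover/conf/config.json:ro' "$CONF_DIR" "$i"
  printf ' -v %s/prover-%s:/prover/.work -v %s/.openvm:/root/.openvm' "$WORK_DIR" "$i" "$HOME"
  printf ' %s --config /prover/conf/config.json\n' "$IMAGE"
}

launch_provers() {
  local n="$1" i=0
  while [ "$i" -lt "$n" ]; do
    prover_cmd "$i"
    i=$((i + 1))
  done
}

launch_provers 4   # 4 GPUs, as on the host in this run
```

Review the printed commands first; once the placeholder paths are set, the output can be piped to `sh` to start the containers.
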
### Trap C: galileoV2 verifier assets are under S3 `v0.8.0/`, prover circuits under `galileov2/`

The coordinator verifier assets (`openVmVk.json`, `verifier.bin`, `root_verifier_vk`) for the galileoV2
fork are served from `scroll-zkvm/v0.8.0/verifier/` (the `galileov2/verifier/` path returns **403**),
while the prover downloads its circuits (`{chunk,batch,bundle}/<vk_hash>/app.vmexe`) from
`scroll-zkvm/galileov2/`. The two prefixes are nonetheless consistent: the VK hashes in
`v0.8.0/verifier/openVmVk.json` (`chunk 64cf16…`, `batch e9d653…`, `bundle 6b155f…`) match the circuit
objects available under `galileov2/`. Download coordinator assets from `v0.8.0/verifier/`; point the
prover `circuits.galileoV2.base_url` at `…/scroll-zkvm/galileov2/`.

### Other notes from this run

- **Coordinator and prover run from the prebuilt Docker images** (the native prover build fails on CUDA on
  this host). Run with `--network host` so the coordinator reaches the shadow DB on `localhost:5433`,
  the prover reaches the coordinator on `localhost:8390`, and the relayer reaches Anvil on
  `localhost:18545`. The coordinator entrypoint is `/bin/coordinator_api`; `LD_LIBRARY_PATH` for `libzkp.so`
  is already baked into the image. Coordinator config: `l2.chain_id = 534352` (Scroll **mainnet** L2),
  `l2.l2geth.endpoint` = internal debug-enabled proxy, one `verifiers[]` entry with
  `fork_name: galileoV2` + low `min_prover_version`.
- **`l2_block` export by JOIN on `chunk_hash` is pathologically slow** against the prod RDS (full scan of
  a huge table). Export by **block-number range** instead (`WHERE number BETWEEN <min_start> AND
  <max_end>`, PK-indexed, ~tens of seconds). The block range is the min `start_block_number` / max
  `end_block_number` across the target batches' chunks.
- **`chunk_proofs_status` / `batch_proofs_status` may not auto-promote** in the shadow setup. A small
  watcher loop that sets `batch.chunk_proofs_status = 2` once all of a batch's chunks reach
  `proving_status = 4`, and `bundle.batch_proofs_status = 2` once all of a bundle's batches reach
  `proving_status = 4`, keeps the chunk→batch→bundle pipeline flowing without stalls.
- **The batch committer fails harmlessly during a finalize-only run**: the relayer's commit sender
  (derived from `commit_sender_signer_config`, e.g. `0xBC732a76…`) is unfunded and not a sequencer, so
  `commitBatch` loops with "Insufficient funds"/`ErrorCallerIsNotSequencer`. This is expected and does
  **not** affect the bundle finalizer, which runs independently and uses the (funded, prover-authorized)
  finalize sender.
- **Set `bundle_index_seq` above the imported max** (e.g. `SELECT setval('bundle_index_seq', 18000)`)
  before starting the relayer, so any proposer-created bundle gets a higher index and cannot block
  `GetFirstPendingBundle` (which orders by `index ASC`). Also clear stale `finalize_tx_hash` values on
  imported bundles/batches.

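The promotion watcher mentioned in the notes above could look roughly like the sketch below. It rests on schema assumptions that this document does not state and that should be checked against the shadow DB first: that `chunk.batch_hash` references `batch.hash` and `batch.bundle_hash` references `bundle.hash`. `SHADOW_DSN` and `POLL_SECONDS` are hypothetical variables, and the function names are illustrative.

```bash
# Sketch of the chunk -> batch -> bundle promotion watcher.
# ASSUMPTIONS: chunk.batch_hash = batch.hash and batch.bundle_hash = bundle.hash;
# SHADOW_DSN is a hypothetical connection string for the shadow DB.

promotion_sql() {
  cat <<'SQL'
UPDATE batch b SET chunk_proofs_status = 2
WHERE b.chunk_proofs_status <> 2
  AND EXISTS (SELECT 1 FROM chunk c WHERE c.batch_hash = b.hash)
  AND NOT EXISTS (SELECT 1 FROM chunk c
                  WHERE c.batch_hash = b.hash AND c.proving_status <> 4);
UPDATE bundle u SET batch_proofs_status = 2
WHERE u.batch_proofs_status <> 2
  AND EXISTS (SELECT 1 FROM batch b WHERE b.bundle_hash = u.hash)
  AND NOT EXISTS (SELECT 1 FROM batch b
                  WHERE b.bundle_hash = u.hash AND b.proving_status <> 4);
SQL
}

watch_promotions() {
  # Poll until interrupted (Ctrl-C); each pass is idempotent.
  while :; do
    promotion_sql | psql -q "$SHADOW_DSN"
    sleep "${POLL_SECONDS:-15}"
  done
}

promotion_sql   # print the statements for review; call watch_promotions to run
```

The `EXISTS` guard keeps a batch or bundle with no imported children from being promoted by accident. The statements touch every row in the shadow DB, which is acceptable only because that DB holds just the imported range.
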
### Successful finalize transactions (mainnet fork)

| Bundle | Batch | Finalize Tx | Status |
|--------|-------|-------------|--------|
| 17297 | 517761 | `0x5e8a7e01…cd1b` | |
| 17298 | 517762 | `0xf55e9f00…c407f` | |
| 17299 | 517763 | `0xa6466c0a…412f` | |
| 17300 | 517764 | `0x1deca7d1…2d8f` | |
| 17301 | 517765 | `0x204b28de…a100` | |

Final `lastFinalizedBatchIndex = 517765`; verifier deployed at `0xf74BcAA17bbb3B0a996aF04a7b301E69501C4bf0`
(plonk `0x1d710357818776073705b29482486AbCF586f33b`), digests `0x00398b78…` / `0x0021785a…`.
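
The two on-chain values quoted in this section can be read back from the fork with Foundry's `cast call`. The sketch below only prints the commands. The rollup contract address is not stated in this section, so `ROLLUP_ADDR` is a hypothetical variable that must be set to the ScrollChain address used on the fork; the RPC URL is the Anvil endpoint mentioned earlier (`localhost:18545`).

```bash
# Sketch: print `cast call` commands that read back the finalization state.
# ROLLUP_ADDR is a hypothetical variable; the address is not given here.

finalize_check_cmds() {
  local rollup="$1" batch="$2" rpc="${3:-http://localhost:18545}"
  echo "cast call $rollup 'lastFinalizedBatchIndex()(uint256)' --rpc-url $rpc"
  echo "cast call $rollup 'finalizedStateRoots(uint256)(bytes32)' $batch --rpc-url $rpc"
}

finalize_check_cmds "${ROLLUP_ADDR:-<scrollchain-address>}" 517765
```

The first call is expected to return the last finalized batch index; the second returns a state root to compare by hand with the batch's state root in the DB.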

tests/shadow-testing/docs/TROUBLESHOOTING.md

Lines changed: 35 additions & 0 deletions
@@ -127,6 +127,37 @@ Before executing a single command:
./rollup_relayer --config /path/to/config.json --min-codec-version 10
```

### Trap 12: halo2 SRS Not in `~/.openvm/params/`
- **Symptom**: chunk/batch proofs succeed; the **first bundle proof** crashes the prover with
  `Params file ".../.openvm/params/kzg_bn254_23.srs" does not exist`. The bundle stays stuck at `proving_status=2`.
- **Cause**: openvm reads the KZG SRS from `$HOME/.openvm/params/kzg_bn254_{22,23,24}.srs`, and only at the
  bundle proof's halo2 stage; if the `.srs` files sit in the `~/.openvm/` root (or anywhere else), they are
  not found and nothing reports it until that stage.
- **Rule**: `mkdir -p ~/.openvm/params && mv ~/.openvm/kzg_bn254_2{2,3,4}.srs ~/.openvm/params/`. Mount the
  host openvm dir to `/root/.openvm` (writable) for the prover container and confirm the path resolves.

### Trap 13: Prover Docker `--gpus device=N` + Wrong `CUDA_VISIBLE_DEVICES`
- **Symptom**: prover container exits (code 139) with `cudaErrorNoDevice: no CUDA-capable device is detected`;
  only the GPU-0 prover works.
- **Cause**: `--gpus "device=N"` exposes only that GPU and **renumbers it to index 0** inside the container,
  so `CUDA_VISIBLE_DEVICES=N` points at a nonexistent device.
- **Rule**: use `--gpus "device=$i"` with `CUDA_VISIBLE_DEVICES=0` (or `--gpus all` with `CUDA_VISIBLE_DEVICES=$i`).

### Trap 14: Coordinator Verifier Assets vs Prover Circuit S3 Paths (galileoV2)
- **Symptom**: coordinator asset download 403s on `scroll-zkvm/galileov2/verifier/openVmVk.json`.
- **Cause**: galileoV2 verifier assets live under `scroll-zkvm/v0.8.0/verifier/`, while prover circuits live
  under `scroll-zkvm/galileov2/{chunk,batch,bundle}/<vk_hash>/`. Different prefixes, same VK hashes.
- **Rule**: download coordinator `openVmVk.json`/`verifier.bin`/`root_verifier_vk` from `v0.8.0/verifier/`;
  set prover `circuits.galileoV2.base_url` to `…/scroll-zkvm/galileov2/`.

### Trap 15: Slow `l2_block` Export by `chunk_hash` JOIN
- **Symptom**: `00-import-bundle-range.sh` hangs for minutes on the `l2_block` export (0-byte CSV).
- **Cause**: the `l2_block ⋈ chunk ON chunk_hash` JOIN full-scans the huge prod table.
- **Rule**: export `l2_block` by **block-number range** instead:
  `COPY (SELECT * FROM l2_block WHERE number BETWEEN <min_start_block> AND <max_end_block>) TO STDOUT …`
  (PK-indexed, seconds). Derive the range from the target batches' chunks' `start_block_number` /
  `end_block_number`.

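The range-based export can be split into two queries: one that derives the block range, and one that runs the `COPY`. The sketch below builds both as text. It assumes, without confirmation from this document, that `chunk.batch_hash` references `batch.hash` and that batches are selected by `batch.index`; the `WITH (FORMAT csv, HEADER)` options and the helper names are likewise illustrative choices, not what `00-import-bundle-range.sh` necessarily uses.

```bash
# Sketch: build the range query and the range-based COPY for l2_block.
# ASSUMPTIONS: chunk.batch_hash = batch.hash, batches selected by batch.index;
# the CSV options are illustrative.

is_uint() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

block_range_sql() {
  local first_batch="$1" last_batch="$2"
  is_uint "$first_batch" && is_uint "$last_batch" || { echo "batch bounds must be integers" >&2; return 1; }
  echo "SELECT min(c.start_block_number), max(c.end_block_number) FROM chunk c JOIN batch b ON c.batch_hash = b.hash WHERE b.index BETWEEN $first_batch AND $last_batch;"
}

l2_block_export_sql() {
  local min_block="$1" max_block="$2"
  # Reject non-numeric bounds before they reach SQL.
  is_uint "$min_block" && is_uint "$max_block" || { echo "block bounds must be integers" >&2; return 1; }
  [ "$min_block" -le "$max_block" ] || { echo "empty block range" >&2; return 1; }
  echo "COPY (SELECT * FROM l2_block WHERE number BETWEEN $min_block AND $max_block) TO STDOUT WITH (FORMAT csv, HEADER);"
}

block_range_sql 517761 517765   # batches from this run; prints the range query
```

Feed the two numbers returned by the first query into `l2_block_export_sql` and pass its output to `psql -c`, redirecting stdout to the CSV file.
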
## Step-by-Step Checklist

### Phase 0: Environment Validation
@@ -208,6 +239,10 @@ Before executing a single command:
| Relayer exits with `Required flag "min-codec-version" not set` | Missing CLI flags | Trap 11 |
| Coordinator assigns but prover gets nothing | L2 RPC missing `debug_executionWitness` | README.md |
| `CoordinatorEmptyProofData` | Prover crashed; reset stuck tasks | README.md |
| `Params file ".../kzg_bn254_23.srs" does not exist` (bundle proof crash) | halo2 SRS not in `~/.openvm/params/` | Trap 12 |
| Prover exits 139 `cudaErrorNoDevice` | `--gpus device=N` + wrong `CUDA_VISIBLE_DEVICES` | Trap 13 |
| Coordinator asset download 403 (`galileov2/verifier/...`) | Wrong S3 prefix; use `v0.8.0/verifier/` | Trap 14 |
| `l2_block` export hangs for minutes | Slow `chunk_hash` JOIN; export by block-number range | Trap 15 |

## Documentation Priority
