Commit cc38945

taratorio and claude authored
claude: add skill for running local kurtosis testnets (#20876)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 109d9d9 commit cc38945

1 file changed

Lines changed: 377 additions & 0 deletions

@@ -0,0 +1,377 @@
1+
---
2+
name: kurtosis-test
3+
description: Run a local Kurtosis Ethereum testnet against a locally-built erigon image, monitor EL/CL/assertoor/spamoor health, triage failures with a cross-client comparison methodology, and auto-iterate fix → rebuild → rerun. Use when the user wants to reproduce, debug, or validate erigon against an `ethereum-package` config locally — equivalent to the `test-kurtosis-assertoor` CI workflow but interactive. Handles image build, enclave lifecycle, block-progress + assertoor + log watching, log dumping on failure, and the erigon-source fix loop.
4+
argument-hint: "<config-yaml-path> [enclave-name] [duration=Nm] [auto=true|false] [max-attempts=N]"
5+
allowed-tools: Bash, Read, Write, Edit, Glob, Grep, WebFetch, Skill
6+
---
7+
8+
# Run a local Kurtosis Ethereum testnet against erigon
9+
10+
This skill mirrors the CI workflow at `.github/workflows/test-kurtosis-assertoor.yml`
11+
but runs **locally** via the raw `kurtosis` CLI. The CI uses the
12+
`ethpandaops/kurtosis-assertoor-github-action@v1` wrapper, which is not portable outside
13+
GitHub Actions; this skill drives `kurtosis run`, `kurtosis enclave inspect`,
14+
`kurtosis service logs`, and `kurtosis enclave dump` directly.
15+
16+
The skill takes an `ethereum-package` YAML config, builds the local
17+
`test/erigon:current` Docker image, starts a Kurtosis enclave, monitors
18+
EL/CL/assertoor/spamoor health, triages failures (challenging peer clients against
19+
erigon to identify the offender), and iterates a fix → rebuild → rerun loop until the
20+
testnet is stable or `max-attempts` is reached.
21+
22+
## Inputs
23+
24+
The model parses these arguments and binds them to the shell variables used in the
25+
bash blocks below: `$1`/`$2` are positional; `duration=Nm` → `duration_secs`,
26+
`auto=true|false` → `auto`, `max-attempts=N` → `max_attempts`.
27+
28+
| Argument | Default | Notes |
29+
|---|---|---|
30+
| `$1` config path | required | Path to an `ethereum-package` args YAML. The reference set lives in `.github/workflows/kurtosis/`, but that directory also contains assertoor playbooks (`id:` / `tasks:` schema) — only the files whose top-level keys are `participants:` or `participants_matrix:` are valid here. To list candidates: `grep -lE '^participants(_matrix)?:' .github/workflows/kurtosis/*.io`. |
31+
| `$2` enclave name | `kurtosis-test-<unix-ts>` | Used for `kurtosis run --enclave`. Each rerun gets a fresh timestamp. |
32+
| `duration=Nm` | `20m` | Wall-clock window the monitor watches before declaring "stable" if no failures trip. |
33+
| `auto=true\|false` | `true` | If `true`, the fix-rebuild-rerun loop runs autonomously up to `max-attempts`. If `false`, pause for user approval before each fix. |
34+
| `max-attempts=N` | `5` | Cap on fix-loop iterations. After hitting the cap, halt and surface the per-attempt triage history. |
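
The binding described above could look like the following sketch. `parse_skill_args` is a hypothetical helper (the skill leaves the actual parsing to the model); the variable names and defaults are taken from the table, and accepting `duration=N` without the trailing `m` is an assumption made here for leniency.

```bash
# Hypothetical helper: binds skill arguments to the variables used by the
# bash blocks in this document. Defaults mirror the Inputs table.
parse_skill_args() {
  config="" enclave="" duration_secs=1200 auto=true max_attempts=5
  for arg in "$@"; do
    case "$arg" in
      duration=*)
        mins=${arg#duration=}; mins=${mins%m}
        # Reject empty, zero-prefixed, or non-numeric minute counts.
        case "$mins" in
          ''|0*|*[!0-9]*) echo "bad duration: $arg" >&2; return 1 ;;
        esac
        duration_secs=$(( mins * 60 )) ;;
      auto=*)
        auto=${arg#auto=}
        case "$auto" in
          true|false) ;;
          *) echo "bad auto value: $arg" >&2; return 1 ;;
        esac ;;
      max-attempts=*)
        max_attempts=${arg#max-attempts=}
        case "$max_attempts" in
          ''|0*|*[!0-9]*) echo "bad max-attempts: $arg" >&2; return 1 ;;
        esac ;;
      *)
        # Anything else is positional: config path first, enclave name second.
        if [ -z "$config" ]; then config=$arg
        elif [ -z "$enclave" ]; then enclave=$arg
        else echo "unexpected argument: $arg" >&2; return 1
        fi ;;
    esac
  done
  [ -n "$config" ] || { echo "config path is required" >&2; return 1; }
  # The default enclave name gets a fresh timestamp on every call.
  enclave=${enclave:-kurtosis-test-$(date +%s)}
}
```

For example, `parse_skill_args .github/workflows/kurtosis/pectra.io duration=5m auto=false` would leave `duration_secs=300`, `auto=false`, and `max_attempts=5`.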
35+
36+
## Prerequisites
37+
38+
1. **Docker** running: `docker info >/dev/null` should succeed.
39+
2. **Kurtosis CLI** installed: `kurtosis version`. Install from
40+
https://docs.kurtosis.com/install if missing. CLI **≥ 1.18.1** is only needed when
41+
the `--package@branch` you run includes the `GpuConfig` Starlark built-in (i.e.
42+
`ethereum-package` `main` post commit `835dd9b`). The pinned branches in the
43+
mapping table below — including `glamsterdam`'s `6.1.0` — predate that change, so
44+
they work on older CLIs. See Troubleshooting if you hit a `GpuConfig` Starlark
45+
error.
46+
3. **Erigon source tree** at the cwd: `Makefile` exists and `go.mod` contains
47+
`module github.com/erigontech/erigon`.
48+
4. **`curl` and `jq`** on `$PATH` — used by the monitor / triage snippets below to
49+
poll the EL JSON-RPC endpoint and the assertoor API.
50+
5. **Fork detection** — skim the YAML for `_fork_epoch` keys. The repo's configs use
51+
the CL-side fork names: `deneb_fork_epoch` (Cancun on EL), `electra_fork_epoch`
52+
(Prague), `fulu_fork_epoch` (Osaka), `gloas_fork_epoch` (Amsterdam). Future fork
53+
keys will follow the same CL-naming convention. The forks found here feed the spec-lookup section below.
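
The checks above can be scripted. This is a minimal sketch: the helper names `missing_tools`, `fork_keys`, and `el_fork_name` are hypothetical, and the CL-to-EL fork pairs are the ones listed in item 5.

```bash
# Print the names of the given tools that are not on $PATH (one per line).
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# List the distinct *_fork_epoch keys present in an ethereum-package YAML.
fork_keys() {
  grep -oE '[a-z]+_fork_epoch' "$1" | sort -u
}

# Map a CL-side fork key to its EL fork name (pairs taken from item 5).
el_fork_name() {
  case "$1" in
    deneb_fork_epoch)   echo Cancun ;;
    electra_fork_epoch) echo Prague ;;
    fulu_fork_epoch)    echo Osaka ;;
    gloas_fork_epoch)   echo Amsterdam ;;
    *)                  echo unknown ;;
  esac
}
```

A preflight could then call `missing_tools docker kurtosis curl jq` and stop if anything is printed, and loop over `fork_keys "$CONFIG"` to decide which forks need a spec lookup.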
54+
55+
## Spec lookup (when debugging unfamiliar forks/EIPs)
56+
57+
If the YAML enables a fork under development, invoke `/erigon-implement-eip` Steps 2–4
58+
to fetch:
59+
60+
- **Step 2** — referenced/dependent EIPs.
61+
- **Step 3** — the meta EIP enumerating which EIPs the fork includes (CFI/SFI/PFI/DFI lists).
62+
- **Step 4** — the latest devnet specification at `https://notes.ethereum.org/@ethpandaops/<devnet>`.
63+
64+
Use these as ground truth when triaging. For specific opcodes / state transitions,
65+
also pull the EIP body via Step 1. If anything in the spec is contradictory or
66+
ambiguous, **stop and ask the user** rather than guessing — the same rule the EIP skill
67+
enforces.
68+
69+
## Build the erigon docker image
70+
71+
The image tag must be **exactly** `test/erigon:current` because every
72+
`.github/workflows/kurtosis/*.io` config references that tag.
73+
74+
```bash
75+
docker build -t test/erigon:current --build-arg BINARIES="erigon caplin" .
76+
```
77+
78+
`caplin` is required in `BINARIES` because some configs (e.g.
79+
`caplin-minimal-assertoor.io`) use erigon as the CL via the same image.
80+
81+
Always rebuild before each run — the same approach the CI uses. BuildKit's layer cache
82+
makes the no-op rebuild fast, and the fix → rebuild → rerun loop necessarily picks up
83+
uncommitted source edits this way (a freshness check against `git log` would miss them
84+
and silently run a stale image).
85+
86+
If the user asks for a from-scratch binary build instead of docker, point them at
87+
`/erigon-build`; this skill itself uses docker because the kurtosis configs reference a
88+
docker image tag.
89+
90+
## Suite → ethereum-package branch mapping (from CI)
91+
92+
The CI matrix pins different package branches per suite. Use the same pinning when the
93+
config matches a known CI file; for unknown configs, default to `5.0.1` and ask the
94+
user to confirm.
95+
96+
| Config file | `--package@branch` |
97+
|---|---|
98+
| `regular-assertoor.io` | `github.com/ethpandaops/ethereum-package@5.0.1` |
99+
| `pectra.io` | `github.com/ethpandaops/ethereum-package@5.0.1` |
100+
| `glamsterdam.io` | `github.com/ethpandaops/ethereum-package@6.1.0` |
101+
| `caplin-assertoor.io` | `github.com/erigontech/ethereum-package@erigontech/fix-caplin-launcher` |
102+
| `caplin-minimal-assertoor.io` | `github.com/erigontech/ethereum-package@erigontech/fix-caplin-launcher` |
103+
| (other / user-supplied) | default `5.0.1`, prompt user if unsure |
104+
105+
Note: `glamsterdam` is pinned to `6.1.0` rather than `main` because `main` introduced
106+
the `GpuConfig` Starlark built-in which requires kurtosis CLI ≥ 1.18.1. Caplin suites
107+
(`caplin-assertoor.io`, `caplin-minimal-assertoor.io`) require the `erigontech` fork —
108+
do not let them fall back to the default `5.0.1`.
109+
110+
## Start the testnet
111+
112+
```bash
113+
ENCLAVE="${2:-kurtosis-test-$(date +%s)}"
114+
CONFIG="$1"
115+
116+
# Map config basename → ethereum-package branch (mirrors the table above).
117+
case "$(basename "$CONFIG")" in
118+
glamsterdam.io)
119+
PACKAGE_REF="github.com/ethpandaops/ethereum-package@6.1.0" ;;
120+
caplin-assertoor.io|caplin-minimal-assertoor.io)
121+
PACKAGE_REF="github.com/erigontech/ethereum-package@erigontech/fix-caplin-launcher" ;;
122+
regular-assertoor.io|pectra.io)
123+
PACKAGE_REF="github.com/ethpandaops/ethereum-package@5.0.1" ;;
124+
*)
125+
PACKAGE_REF="github.com/ethpandaops/ethereum-package@5.0.1" ;;
126+
esac
127+
128+
kurtosis run \
129+
"$PACKAGE_REF" \
130+
--enclave "$ENCLAVE" \
131+
--args-file "$CONFIG" \
132+
--verbosity detailed --cli-log-level trace
133+
```
134+
135+
Once `kurtosis run` returns, capture service names and host-mapped ports:
136+
137+
```bash
138+
kurtosis enclave inspect "$ENCLAVE" --full-uuids
139+
140+
# Pick the first erigon EL service. `kurtosis enclave inspect` prints columnar
141+
# rows (UUID first, then service name), so we match against field 2.
142+
EL_SERVICE=$(kurtosis enclave inspect "$ENCLAVE" --full-uuids 2>/dev/null \
143+
| awk '$2 ~ /^el-[0-9]+-erigon-[a-z]+$/ {print $2; exit}')
144+
EL_RPC_PORT=$(kurtosis port print "$ENCLAVE" "$EL_SERVICE" rpc 2>/dev/null \
145+
| sed -E 's|.*:([0-9]+).*|\1|')
146+
147+
# CL endpoint (whichever client paired with that EL)
148+
CL_SERVICE=$(kurtosis enclave inspect "$ENCLAVE" --full-uuids 2>/dev/null \
149+
| awk '$2 ~ /^cl-[0-9]+-[a-z]+-erigon$/ {print $2; exit}')
150+
CL_HTTP_PORT=$(kurtosis port print "$ENCLAVE" "$CL_SERVICE" http 2>/dev/null \
151+
| sed -E 's|.*:([0-9]+).*|\1|')
152+
153+
# Optional services
154+
ASSERTOOR_PORT=$(kurtosis port print "$ENCLAVE" assertoor http 2>/dev/null | sed -E 's|.*:([0-9]+).*|\1|')
155+
DORA_PORT=$(kurtosis port print "$ENCLAVE" dora http 2>/dev/null | sed -E 's|.*:([0-9]+).*|\1|')
156+
SPAMOOR_PORT=$(kurtosis port print "$ENCLAVE" spamoor http 2>/dev/null | sed -E 's|.*:([0-9]+).*|\1|')
157+
```
158+
159+
Print the assertoor / dora URLs so the user can open the dashboards in a browser.
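
A small sketch of the port extraction and URL printing. `extract_port` wraps the same `sed` expression used in the capture block; `print_dashboard_urls` is a hypothetical helper, and serving each dashboard at the root path on `127.0.0.1` is an assumption.

```bash
# Extract the trailing port number from a host:port or URL string on stdin
# (same sed expression as the capture block above).
extract_port() {
  sed -E 's|.*:([0-9]+).*|\1|'
}

# Print a dashboard URL for each service whose port was resolved. Optional
# services that are absent from the enclave leave their variable empty.
print_dashboard_urls() {
  [ -n "${ASSERTOOR_PORT:-}" ] && echo "assertoor: http://127.0.0.1:${ASSERTOOR_PORT}/"
  [ -n "${DORA_PORT:-}" ]      && echo "dora:      http://127.0.0.1:${DORA_PORT}/"
  [ -n "${SPAMOOR_PORT:-}" ]   && echo "spamoor:   http://127.0.0.1:${SPAMOOR_PORT}/"
  return 0
}
```

Because the `sed` pattern is greedy, it keeps the digits after the last colon, so both `127.0.0.1:8545` and `http://127.0.0.1:8545` yield `8545`.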
160+
161+
## Monitor
162+
163+
Three checks run on a polling loop until either (a) `duration` elapses, or (b) any
164+
failure trips. All three are captured in the run history.
165+
166+
### Check A — Block height progress
167+
168+
`duration_secs` is the parsed `duration=Nm` input (default 1200). Three terminal
169+
outcomes:
170+
171+
- **STABLE** — chain produced blocks and the duration elapsed without a stall.
172+
- **STALL** — chain produced at least one block, then stopped advancing for
173+
`>3 × seconds_per_slot`.
174+
- **NO_PROGRESS** — chain never produced a block within `duration_secs` (e.g.
175+
validators didn't start).
176+
177+
```bash
178+
duration_secs=${duration_secs:-1200}
179+
prev=0
180+
# Take the first seconds_per_slot entry only; fall back to 12s when the key is absent.
slot_secs=$(awk '/^[[:space:]]*seconds_per_slot:/ {print $2; exit}' "$CONFIG"); slot_secs=${slot_secs:-12}
181+
poll_interval=$(( slot_secs * 2 ))
182+
stall_window=$(( slot_secs * 3 ))
183+
start=$(date +%s)
184+
end=$(( start + duration_secs ))
185+
stall_deadline=$(( start + stall_window ))
186+
outcome=""
187+
188+
while [ "$(date +%s)" -lt "$end" ]; do
189+
height_hex=$(curl -s --max-time 5 "http://127.0.0.1:${EL_RPC_PORT}" \
190+
-H 'Content-Type: application/json' \
191+
-d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
192+
| jq -r '.result // empty')
193+
if [[ "$height_hex" =~ ^0x[0-9a-fA-F]+$ ]]; then
194+
height=$(printf '%d\n' "$height_hex")
195+
echo "[$(date -u +%H:%M:%S)] height=$height"
196+
if [ "$height" -gt "$prev" ]; then
197+
prev=$height
198+
stall_deadline=$(( $(date +%s) + stall_window ))
199+
fi
200+
else
201+
echo "[$(date -u +%H:%M:%S)] RPC unreachable or invalid response — retrying"
202+
fi
203+
# Only declare a stall after seeing at least one block. Many configs set
204+
# genesis_delay > stall_window (e.g. glamsterdam.io: 20s delay vs 18s window
205+
# at 6s slots), so the pre-genesis gap would otherwise trip a false stall.
206+
if [ "$prev" -gt 0 ] && [ "$(date +%s)" -gt "$stall_deadline" ]; then
207+
outcome="STALL: chain not progressing for >${stall_window}s (last height=$prev)"
208+
break
209+
fi
210+
sleep "$poll_interval"
211+
done
212+
213+
if [ -z "$outcome" ]; then
214+
if [ "$prev" -eq 0 ]; then
215+
outcome="NO_PROGRESS: chain never produced a block within ${duration_secs}s"
216+
else
217+
outcome="STABLE: chain progressed for full ${duration_secs}s window (final height=$prev)"
218+
fi
219+
fi
220+
echo "$outcome"
221+
```
222+
223+
Pass: height advances ≥1 within every `3 × seconds_per_slot` (the stall window),
224+
sustained for the full `duration_secs` (poll cadence is `2 × seconds_per_slot`).
225+
Fail: STALL or NO_PROGRESS.
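
To make the timing and the pass/fail rule concrete, here is a minimal sketch. Both helper names are hypothetical; the multipliers (2× for polling, 3× for the stall window) and the outcome prefixes come from the loop above.

```bash
# Print the poll interval and stall window derived from a slot time in seconds.
monitor_windows() {
  echo "poll=$(( $1 * 2 ))s stall=$(( $1 * 3 ))s"
}

# Return 0 (failed) when the Check A outcome string is STALL or NO_PROGRESS,
# and 1 when it is STABLE. Any other prefix is treated as a failure.
check_a_failed() {
  case "$1" in
    STABLE:*) return 1 ;;
    *)        return 0 ;;
  esac
}
```

With 6-second slots, `monitor_windows 6` prints `poll=12s stall=18s`, which matches the 18s window mentioned in the loop's comment. A caller could write `if check_a_failed "$outcome"; then ...` to branch into triage.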
226+
227+
### Check B — Assertoor results
228+
229+
```bash
230+
curl -s "http://127.0.0.1:${ASSERTOOR_PORT}/api/v1/test_runs" \
231+
| jq '.data[] | {name, status, result}'
232+
```
233+
234+
Pass: every test_run has `result=success`. Fail: any `result=failure`, or any test
235+
stuck `pending` / `running` past 3× its expected duration. The assertoor web UI at
236+
`http://127.0.0.1:${ASSERTOOR_PORT}/` shows per-step trees; use it for deep dives.
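
The result-based part of this rule can be reduced to a single verdict with `jq`. This is a sketch under the assumption that the response has the shape implied by the filter above (a `.data` array whose entries carry a `result` field); `assertoor_verdict` is a hypothetical name, and the "stuck past 3× expected duration" condition is not covered because it needs timing data the snippet does not show.

```bash
# Read an assertoor /api/v1/test_runs response on stdin and print one of
# PASS, FAIL, or PENDING. Assumes each entry under .data has a "result" field.
assertoor_verdict() {
  jq -r '
    [.data[]?.result] as $r
    | if   ($r | length) == 0        then "PENDING"
      elif any($r[]; . == "failure") then "FAIL"
      elif all($r[]; . == "success") then "PASS"
      else "PENDING" end'
}

# Usage (needs a running enclave with assertoor enabled):
#   curl -s "http://127.0.0.1:${ASSERTOOR_PORT}/api/v1/test_runs" | assertoor_verdict
```

An empty list is reported as `PENDING` rather than `PASS`, so a run in which assertoor has not scheduled any test yet is not mistaken for success.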
237+
238+
### Check C — Erigon-focused log scan
239+
240+
```bash
241+
kurtosis service logs "$ENCLAVE" "$EL_SERVICE" 2>&1 \
242+
| grep -iE 'panic|fatal|^ERROR|\[EROR\]|\[CRIT\]|"lvl"="error"|consensus failure|invalid block' \
243+
| tail -200
244+
```
245+
246+
For cross-client comparison (used by the triage section), run the same scan across
247+
every EL/CL service:
248+
249+
```bash
250+
for svc in $(kurtosis enclave inspect "$ENCLAVE" --full-uuids \
251+
| awk '$2 ~ /^(el|cl|vc)-[0-9]+-[a-z]+-[a-z]+$/ {print $2}'); do
252+
echo "=== $svc ==="
253+
kurtosis service logs "$ENCLAVE" "$svc" 2>&1 \
254+
| grep -iE 'error|panic|fatal' | tail -30
255+
done
256+
```
257+
258+
If `snooper-engine-*` services exist (when `snooper_enabled: true` in the YAML), pull
259+
their logs too — they capture the full Engine API request/response trace, invaluable
260+
when an EL bug is suspected.
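
One way to pick out the snooper services, sketched under the same assumption as the earlier `awk` filters (UUID in field 1, service name in field 2). `snooper_services` is a hypothetical helper name.

```bash
# Filter `kurtosis enclave inspect --full-uuids` output (on stdin) down to
# the names of snooper-engine services.
snooper_services() {
  awk '$2 ~ /^snooper-engine-/ {print $2}'
}

# Usage (needs a running enclave):
#   kurtosis enclave inspect "$ENCLAVE" --full-uuids | snooper_services \
#     | while read -r svc; do
#         echo "=== $svc ==="
#         kurtosis service logs "$ENCLAVE" "$svc" 2>&1 | tail -100
#       done
```

If the filter prints nothing, the enclave was most likely started without `snooper_enabled: true`.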
261+
262+
## Issue detection criteria
263+
264+
| User check | Pass | Fail |
265+
|---|---|---|
266+
| 1. Block production / height progress | `eth_blockNumber` advances ≥1 within every `3 × seconds_per_slot`, sustained for `duration` | No advance for `>3 × seconds_per_slot`, OR explicit chain reorg / fork-choice loop in CL logs |
267+
| 2. EL/CL log errors (focus erigon) | No `panic`, `fatal`, or error-level lines in any erigon service | Any erigon-side panic/fatal/consensus failure. Non-erigon errors recorded but informational unless they crash the peer. |
268+
| 3. Assertoor test failures | All assertoor `test_runs` reach `result=success` | Any `result=failure`, OR a test stuck `pending`/`running` past 3× expected duration |
269+
270+
A single failed check trips the triage section. Block-stall + erigon panic + assertoor
271+
fail are independent signals — record all three in the run history; do not stop at the
272+
first.
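
Since all three signals are recorded rather than short-circuited, a run-history line could be built as in this sketch. `overall_verdict` and its `pass`/`fail` arguments are hypothetical; the order follows the table rows.

```bash
# Combine the three check results into one line for the run history.
# Arguments, in table order: block progress, erigon logs, assertoor.
# Each is "pass" or "fail"; every failing check is listed, not just the first.
overall_verdict() {
  local failed=""
  [ "$1" = pass ] || failed="$failed block-progress"
  [ "$2" = pass ] || failed="$failed erigon-logs"
  [ "$3" = pass ] || failed="$failed assertoor"
  if [ -z "$failed" ]; then echo "PASS"; else echo "FAIL:$failed"; fi
}
```

For instance, `overall_verdict fail pass fail` prints `FAIL: block-progress assertoor`.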
273+
274+
## Debugging methodology — triage erigon vs peer-client vs network/config
275+
276+
Decision tree:
277+
278+
1. **Reproduce.** A single one-shot failure gets one re-run before triaging. Truly
279+
intermittent failures still get triaged, but flag them as flaky.
280+
2. **Classify the symptom.** One of: block-stall, EL panic, EL invalid-payload, CL
281+
fork-choice mismatch, assertoor opcode/EIP test failure, spamoor tx-submission
282+
failure.
283+
3. **Cross-client comparison.** For each erigon-side error, find the equivalent moment
284+
in the peer-client log at the same slot/block. Three outcomes:
285+
- **Erigon wrong**: erigon rejects/panics; peer client + assertoor accept the
286+
block → erigon bug, fix locally.
287+
- **Peer wrong**: erigon accepts; peer rejects → check peer-client image tag
288+
against the fork's expected tag (often a stale image). Surface to user; do not
289+
fix erigon.
290+
- **Both disagree with spec**: clients produce different "valid" answers from what
291+
the EIP spec says → escalate to the user. Likely spec ambiguity or a misread.
292+
4. **Cross-reference the spec.** Pull the relevant EIP (`/erigon-implement-eip` Step 1)
293+
and the devnet spec (`/erigon-implement-eip` Step 4) for the failing block, opcode,
294+
or state transition.
295+
5. **Rule out config drift.** Diff the YAML's `el_extra_params`, `network_params`, and
296+
fork epochs against the equivalent CI suite under `.github/workflows/kurtosis/`.
297+
Mismatches there are config bugs, not erigon bugs.
298+
6. **Rule out enclave plumbing.** `kurtosis service exec <enclave> <svc> "ping
299+
<other_svc>"` to verify network reachability; check JWT mounting via
300+
`kurtosis service exec <enclave> <el-svc> "ls -la /jwt/"`. The CLI takes the
301+
command as a single positional arg (multi-word commands must be quoted) — there
302+
is no `--` separator.
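
Step 5 (config drift) lends itself to a quick scripted check. The sketch below compares only scalar keys (fork epochs, `seconds_per_slot`, `preset`, `genesis_delay`); list-valued settings such as `el_extra_params` still need a manual diff. `drift_keys` and `config_drift` are hypothetical helper names.

```bash
# Print the fork-epoch and timing keys of an ethereum-package YAML in a
# normalised, sorted form so two configs can be compared line by line.
drift_keys() {
  grep -E '^[[:space:]]*([a-z]+_fork_epoch|seconds_per_slot|preset|genesis_delay):' "$1" \
    | sed -E 's/^[[:space:]]+//; s/[[:space:]]+#.*$//' | sort
}

# Exit 0 when the two configs agree on those keys; otherwise print a diff
# and exit non-zero.
config_drift() {
  local a b status
  a=$(mktemp); b=$(mktemp)
  drift_keys "$1" > "$a"
  drift_keys "$2" > "$b"
  diff "$a" "$b" && status=0 || status=$?
  rm -f "$a" "$b"
  return "$status"
}
```

A typical call would be `config_drift "$CONFIG" .github/workflows/kurtosis/pectra.io`; key order and trailing comments are ignored, so only real value differences show up.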
303+
304+
### Triage table
305+
306+
| Symptom | Likely owner | Next action |
307+
|---|---|---|
308+
| Erigon panic with stack trace inside `execution/...` | Erigon | Capture stack, find offending call in repo, propose fix |
309+
| `engine_newPayloadV4` returns INVALID; CL logs say block is valid; assertoor passes elsewhere | Erigon (likely block-validation divergence) | Replay the payload via `debug_traceBlockByNumber` / `debug_traceBlockByHash`; check fork activation timestamp |
310+
| All EL clients stop progressing after a specific slot | Config (fork epoch wrong) or shared dep | Diff YAML against working CI suite; check ethereum-package branch |
311+
| Assertoor `block-proposal-check` fails on slot N for `vc-N-erigon-…` | Erigon block builder | Fetch block N body via RPC; replay locally |
312+
| Assertoor `synchronized-check` fails | Network plumbing | Inspect peer counts; `kurtosis service exec` connectivity test |
313+
| `caplin` panics but `lighthouse` runs fine on the same EL | Caplin | Edit YAML to swap CL to lighthouse for bisection; report to user |
314+
| Spamoor reports persistent "insufficient funds" / "nonce too low" | Spamoor config | Increase `funding_gas_limit`; lower `throughput`; check prefunded keys |
315+
| Snooper shows malformed Engine API request | Erigon RPC layer | Capture the request from snooper logs; inspect erigon engine handler |
316+
| Erigon accepts a payload that lighthouse + teku both reject | Erigon (single-client divergence) | Almost always an erigon bug — fix locally |
317+
| `eth_blockNumber` stays at 0x0 after >2 epochs | Validators not running | Check `vc-*` service logs; verify keystore mounting |
318+
319+
## Fix-rebuild-rerun loop
320+
321+
Auto-iterates by default (`auto=true`), capped at `max-attempts=5`. Per attempt:
322+
323+
1. **Tear down**: `kurtosis enclave rm -f "$ENCLAVE"`.
324+
2. **Apply the fix** to erigon source via `Edit`. Only auto-apply when the triage
325+
classified the issue as "Erigon wrong" with high confidence. If ambiguous (peer
326+
could be wrong, or spec interpretation unclear), pause and surface to the user
327+
even before the cap — that overrides `auto=true`.
328+
3. **Rebuild image**: `docker build -t test/erigon:current --build-arg BINARIES="erigon caplin" .`.
329+
4. **Re-launch**: same config, fresh timestamped enclave name (so each attempt's dump
330+
stays separate).
331+
5. **Re-run monitor**: same three checks.
332+
6. **Record**: per attempt — symptom, hypothesis, fix applied, outcome.
333+
334+
After `max-attempts` consecutive failures, halt and print the per-attempt history. Do
335+
not auto-apply more fixes once the cap is hit. Reference: `/autoresearch` follows the
336+
same iterate-and-record pattern.
337+
338+
If `auto=false`, pause for user approval before steps 2–4 of every attempt.
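
The capped loop can be sketched as below. `fix_loop` and the history file path are hypothetical; `attempt_cmd` stands for a caller-supplied command that performs steps 1 to 5 for one attempt and exits 0 when the testnet is stable. Only the outcome is logged here; symptom, hypothesis, and fix columns are left to the caller.

```bash
# Run attempt_cmd up to max_attempts times, appending one tab-separated
# line (attempt number, outcome) to the history file per attempt.
fix_loop() {
  local max_attempts=$1 attempt_cmd=$2
  local history=${3:-/tmp/kurtosis-test-history.tsv}
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$attempt_cmd" "$attempt"; then
      printf '%s\t%s\n' "$attempt" "stable" >> "$history"
      echo "stable after $attempt attempt(s)"
      return 0
    fi
    printf '%s\t%s\n' "$attempt" "failed" >> "$history"
    attempt=$(( attempt + 1 ))
  done
  # Cap reached: stop applying fixes and point at the recorded history.
  echo "halting: $max_attempts attempts failed; history in $history"
  return 1
}
```

Called as `fix_loop "$max_attempts" run_attempt`, where `run_attempt` is whatever function wraps the tear-down, fix, rebuild, launch, and monitor steps.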
339+
340+
## Cleanup (always run)
341+
342+
Run as the final step of every iteration regardless of outcome (success, failure,
343+
or user interrupt):
344+
345+
```bash
346+
DUMP_DIR="/tmp/kurtosis-dump-${ENCLAVE}"
347+
# `kurtosis enclave dump` refuses if the destination already exists — never pre-create it.
348+
kurtosis enclave dump "$ENCLAVE" "$DUMP_DIR" || true
349+
kurtosis enclave rm -f "$ENCLAVE" || true
350+
echo "Logs dumped to: $DUMP_DIR"
351+
```
352+
353+
The dump contains per-service logs (`el-*`, `cl-*`, `vc-*`, `assertoor`, `spamoor`,
354+
`dora`, `snooper-*`). Keep this directory until triage is complete: nothing is
355+
uploaded to GitHub storage in a local run, so the logs exist only on local disk.
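
To make the cleanup run on a user interrupt as well, it can be wrapped in a function and registered with `trap`. This is a sketch: `cleanup_enclave` is a hypothetical name, and the `KURTOSIS` variable is an assumption introduced only so the function can be exercised without the real CLI (it defaults to `kurtosis`).

```bash
# Dump and remove the enclave named by $ENCLAVE, tolerating failures of
# either step so the second one always runs.
cleanup_enclave() {
  local dump_dir="/tmp/kurtosis-dump-${ENCLAVE}"
  "${KURTOSIS:-kurtosis}" enclave dump "$ENCLAVE" "$dump_dir" || true
  "${KURTOSIS:-kurtosis}" enclave rm -f "$ENCLAVE" || true
  echo "Logs dumped to: $dump_dir"
}

# Usage: register once, right after ENCLAVE is set, so the cleanup also runs
# on Ctrl-C or an early exit:
#   trap cleanup_enclave EXIT
#   trap 'exit 130' INT TERM
```

Routing `INT` and `TERM` through `exit` means the `EXIT` trap fires exactly once, whichever way the script ends.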
356+
357+
After multiple iterations, also prune dangling docker images:
358+
359+
```bash
360+
docker image prune -f
361+
```
362+
363+
## Troubleshooting
364+
365+
| Problem | Solution |
366+
|---|---|
367+
| `kurtosis run` fails with Starlark error mentioning `GpuConfig` | Use one of the pinned `--package@branch` values from the mapping table (e.g. `6.1.0` for glamsterdam, `5.0.1` for regular/pectra) — they predate the `GpuConfig` built-in. OR upgrade kurtosis CLI to ≥ 1.18.1 if you need an `ethereum-package` branch that includes it. |
368+
| `kurtosis enclave dump` errors "destination exists" | Use a fresh dir name or `rm -rf` it first |
369+
| All `el-*-erigon-*` services missing from `enclave inspect` | Image build failed; check `docker images \| grep test/erigon` and rerun docker build |
370+
| `eth_blockNumber` returns `0x0` forever | Validators didn't start; check `vc-*` service logs; verify keystore mounting |
371+
| Connection refused on assertoor port | Service still booting; or `assertoor` not in `additional_services` in the YAML |
372+
| `kurtosis service logs` truncates very long logs | Use `kurtosis enclave dump` for the full per-service log files |
373+
| `caplin-minimal` config fails with "binary not found" | Confirm `BINARIES="erigon caplin"` in the docker build args |
374+
| `eth_blockNumber` advances but assertoor reports timeout | Slot time / preset mismatch; check `seconds_per_slot` and `preset` in YAML |
375+
| Erigon image stale despite rebuild | `docker image rm test/erigon:current && docker build ...` to force; check BuildKit cache scope |
376+
| Port already allocated | Another enclave is running — `kurtosis enclave ls` then `kurtosis enclave rm -f <old>` |
377+
| Engine API JWT mismatch in EL logs | Check `kurtosis service exec <enclave> el-1-erigon-… "ls -la /jwt/"`; restart enclave if missing |
