Commit 34c74a0

feat(zfp): zero-false-positive overhaul — 13-layer gate pipeline
Add a full Zero-False-Positive (ZFP) pipeline in front of the existing Vigilo workflow so that High/Critical findings are only promoted after surviving independent PoC, dup, severity, adversarial, and vaccine-loop gates.

## New agents (packages/claude/agents/)
- verifier.md — single ZFP quality gate, runs 8 gates including L13 RCA distinctness semantic check
- judge.md — cross-family severity calibrator using C4/Sherlock rubrics; auditor-family ≠ judge-family
- griller.md — adversarial FP hunter, 3 rounds, variant: max
- poc-generator.md — Foundry PoC emitter (gpt-5.2-codex)
- patcher.md — minimal fix (≤10 lines) tied to Root Cause
- re-verifier.md — vaccine loop closer; post-patch PoC must FAIL to confirm bug is real (opus-4-5, different tier)
- economic-auditor.md — GPT-primary auditor for invariant violations (LTV/share-price/no-free-lunch)
- invariant-tester.md — Foundry + Medusa invariant fuzz generator
- dup-detector.md — corpus similarity (haiku) with ~20k finding index

## 13-layer ZFP pipeline (vigilo.md Phase 3)
L1   static pre-pass deprio known-class
L2   auditor hypothesis w/ RCA
L3   PoC generation
L4   PoC compile
L5   PoC passes vulnerable state
L5'  invariant fuzzer counterexamples
L6   determinism (two runs)
L7   corpus dup-check
L8   non-vacuous assertion + impact match
L9   post-patch PoC FAIL = bug real
L10  severity judge (cross-family)
L11  3-round adversarial grill (variant: max)
L12  cross-auditor consensus boost
L13  RCA semantic distinctness

Findings promote only when every applicable gate PASSes.
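
The promotion rule above can be read as a small predicate. The following is a minimal Python sketch, not the verifier's actual logic: the `Verdict` and `GateResult` types and the treatment of a finding with no applicable gates are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_APPLICABLE = "n/a"  # the gate does not apply to this finding


@dataclass(frozen=True)
class GateResult:
    layer: str        # e.g. "L4", "L5'", "L13"
    verdict: Verdict


def promotes(results: List[GateResult]) -> bool:
    """True only if every applicable gate passed."""
    applicable = [r for r in results if r.verdict is not Verdict.NOT_APPLICABLE]
    # Assumption: a finding with no applicable gates is never promoted.
    return bool(applicable) and all(r.verdict is Verdict.PASS for r in applicable)


def first_failure(results: List[GateResult]) -> Optional[str]:
    """Layer name of the first failing gate, or None if nothing failed."""
    for r in results:
        if r.verdict is Verdict.FAIL:
            return r.layer
    return None
```

Skipped gates (for example L5' when no invariant harness exists) do not block promotion in this sketch; a single FAIL anywhere does.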

## Model routing rewrite (src/shared/model-requirements.ts)
- Opus-4-6 critical path (cheaper than 4-7 while keeping reasoning depth); Opus-4-5 secondary, Opus-3 reserve fallback
- GPT-5.2 / gpt-5.2-codex primary for code-gen + cross-family auditors
- pickJudgeForAuditor() helper enforces family diversity between auditor and judge to break shared-prior collusion
- `variant: max` reserved for griller only (single most expensive role)

## Finding schema (skills/vulnerability-base/SKILL.md)
- New Iron Law #5: Root Cause ≠ Symptom
- Top-level `## Root Cause` section required
- L13 semantic check: Verifier rejects findings where RCA paraphrases the symptom; two worked RCA examples (reentrancy + oracle) showing good vs bad framings
- Quality checklist extended

## Scripts
- scripts/static-prepass.sh — Slither + Semgrep + Aderyn parallel run, outputs .vigilo/prepass.md; handles missing tools gracefully
- scripts/corpus-ingest.py — clones top-N Code4rena + Sherlock findings repos in parallel, extracts severity via 5 strategies
- scripts/corpus-stats.sh — corpus dashboard (source/severity/protocol/year)
- scripts/dup-query.py — kNN query with ngram Jaccard + token overlap + protocol filter; JSON output consumed by dup-detector agent
- scripts/corpus-bootstrap.sh — wrapper + pgvector schema init for v2

## Infrastructure
- pgvector container on :5433 ready for v2 semantic similarity
- vigilo-corpus/ structure documented in docs/ZFP-OVERHAUL.md

## CI
- .github/workflows/zfp-bench.yml — runs ScaBench regression on pushes + PRs; fails if valid-finding rate regresses >2% vs baseline

## Build
- packages/opencode/build.mjs switched from `bun build` CLI to Bun.build() API because `bun build` collides with the `build` script slot on bun >= 1.3

## Docs
- docs/ZFP-OVERHAUL.md — design rationale, 13-layer table, roadmap
- docs/INSTALL-LOCAL.md — how to point opencode-web3 / Claude Code at the local build; cost budgeting per role

## Corpus (external, not in tree)
Populated at ~/.vigilo-corpus/ with 20,789 indexed findings across 120 repos (60 C4 + 60 Sherlock, 2022–2025). Severity extracted from path, filename suffix (-G/-Q), title tags [H-01], explicit "Severity:" lines, and Sherlock "Issue H-1" patterns.
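
The five severity strategies listed above could be chained roughly as follows. This is a hedged Python sketch, not the code in `corpus-ingest.py`: the regexes, the order in which strategies are tried, the directory names matched in the path, and the `extract_severity` name are all assumptions.

```python
import re
from typing import Optional

LETTER_TO_SEVERITY = {"H": "high", "M": "medium", "L": "low", "G": "gas", "Q": "qa"}

TITLE_TAG = re.compile(r"\[([HML])-\d+\]")                # "[H-01] ..."
SHERLOCK_ISSUE = re.compile(r"\bIssue ([HML])-\d+\b")      # "Issue H-1: ..."
SEVERITY_LINE = re.compile(r"^\s*Severity:\s*(high|medium|low)\b", re.I | re.M)
FILENAME_SUFFIX = re.compile(r"-([GQ])\.md$")              # "alice-G.md", "bob-Q.md"
PATH_SEGMENT = re.compile(r"(?:^|/)(high|medium|low)(?:/|$)", re.I)


def extract_severity(path: str, title: str, body: str) -> Optional[str]:
    """Try each strategy in turn; return None when nothing matches."""
    m = TITLE_TAG.search(title)
    if m:
        return LETTER_TO_SEVERITY[m.group(1)]
    m = SHERLOCK_ISSUE.search(title) or SHERLOCK_ISSUE.search(body)
    if m:
        return LETTER_TO_SEVERITY[m.group(1)]
    m = SEVERITY_LINE.search(body)
    if m:
        return m.group(1).lower()
    m = FILENAME_SUFFIX.search(path)
    if m:
        return LETTER_TO_SEVERITY[m.group(1)]
    m = PATH_SEGMENT.search(path)
    if m:
        return m.group(1).lower()
    return None
```

Trying the most explicit signals first (title tags, "Severity:" lines) and falling back to path heuristics keeps a misleading directory name from overriding a clearly labeled finding.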
1 parent 9759f82 commit 34c74a0

22 files changed

Lines changed: 3408 additions & 116 deletions

.github/workflows/zfp-bench.yml

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
name: zfp-bench

# Runs the Vigilo ScaBench regression suite on every push to main or a
# `zfp-*` branch, plus PRs into main. Fails the job if the valid-finding rate
# regresses by more than 2% (relative) vs the recorded baseline.
#
# The bench runner uses `packages/bench` which scores Vigilo against
# Code4rena ground truth. This workflow does NOT invoke live LLMs — it
# replays previously-cached audit outputs + re-scores. Live-LLM regression
# is a separate nightly workflow (not shipped in this PR — see roadmap).

on:
  push:
    branches: [main, "zfp-*"]
  pull_request:
    branches: [main]
  workflow_dispatch:
    inputs:
      baseline_ref:
        description: "Git ref to compare against"
        required: false
        default: "main"

permissions:
  contents: read
  pull-requests: write

jobs:
  bench:
    runs-on: ubuntu-latest
    timeout-minutes: 25
    defaults:
      run:
        working-directory: packages/bench

    steps:
      - uses: actions/checkout@v5
        with:
          fetch-depth: 0

      - uses: oven-sh/setup-bun@v2
        with:
          bun-version: "1.3.12"

      - uses: actions/setup-node@v5
        with:
          node-version: "22"

      # bun install has a name conflict with the `install` script slot on this
      # bun version — use npm for dependency install.
      - name: install deps
        run: npm ci --no-audit --no-fund

      - name: typecheck
        run: npx tsc --noEmit

      - name: build bench runner
        run: npm run build

      - name: verify bench CLI
        run: node dist/cli.js --help

      # ── Replay-only regression (fast, no live LLM) ────────────────────────
      - name: run ScaBench replay
        id: bench
        run: |
          # Without pipefail, tee's exit status would mask a failed bench run.
          set -o pipefail
          node dist/cli.js run \
            --dataset ./data/dataset.json \
            --baselines ./data/baselines \
            --out ./data/results-current.json \
            --mode replay \
            2>&1 | tee bench-output.log
          # Extract headline metrics for step summary
          node dist/cli.js summarize \
            --results ./data/results-current.json \
            --out ./data/summary.md \
            || echo "summary step skipped (no summarize subcommand)"

      - name: post summary
        if: always()
        run: |
          if [ -f ./data/summary.md ]; then
            cat ./data/summary.md >> "$GITHUB_STEP_SUMMARY"
          else
            echo "## Bench output" >> "$GITHUB_STEP_SUMMARY"
            echo '```' >> "$GITHUB_STEP_SUMMARY"
            tail -60 bench-output.log >> "$GITHUB_STEP_SUMMARY"
            echo '```' >> "$GITHUB_STEP_SUMMARY"
          fi

      - name: regression gate
        env:
          BENCH_MAX_REGRESSION_PCT: "2"
        run: |
          if [ ! -f ./data/baseline-summary.json ]; then
            echo "::notice::No baseline recorded yet — skipping regression gate"
            exit 0
          fi
          node - <<'JS'
          import { readFileSync } from "node:fs"
          const maxRegressionPct = Number(process.env.BENCH_MAX_REGRESSION_PCT || "2")
          const base = JSON.parse(readFileSync("./data/baseline-summary.json", "utf8"))
          const curr = JSON.parse(readFileSync("./data/results-current.json", "utf8"))
          // Score shape depends on bench CLI output. Guard for missing fields.
          const baseRate = Number(base.validFindingRate ?? base.valid_rate ?? 0)
          const currRate = Number(curr.validFindingRate ?? curr.valid_rate ?? 0)
          if (!Number.isFinite(baseRate) || !Number.isFinite(currRate) || baseRate === 0) {
            console.log(`No usable baseline (base=${baseRate}, curr=${currRate}) — skipping gate`)
            process.exit(0)
          }
          // Relative change vs baseline, in percent (not percentage points).
          const delta = ((currRate - baseRate) / baseRate) * 100
          console.log(`Baseline valid-rate: ${(baseRate * 100).toFixed(2)}%`)
          console.log(`Current valid-rate: ${(currRate * 100).toFixed(2)}%`)
          console.log(`Delta: ${delta >= 0 ? "+" : ""}${delta.toFixed(2)}%`)
          if (delta < -maxRegressionPct) {
            console.error(`::error::Valid-finding rate regressed ${delta.toFixed(2)}% (gate: -${maxRegressionPct}%)`)
            process.exit(1)
          }
          JS

      - name: upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: zfp-bench-results-${{ github.run_id }}
          path: |
            packages/bench/data/results-current.json
            packages/bench/data/summary.md
            packages/bench/bench-output.log
          retention-days: 30
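
The regression gate in this workflow measures the change relative to the baseline rate, not a difference in percentage points. A minimal Python restatement of the same arithmetic, useful for checking thresholds by hand (the function names are illustrative and not part of the bench CLI):

```python
import math
from typing import Optional


def regression_delta_pct(base_rate: float, curr_rate: float) -> Optional[float]:
    """Relative change of curr_rate vs base_rate in percent, or None if unusable."""
    if not (math.isfinite(base_rate) and math.isfinite(curr_rate)) or base_rate == 0:
        return None
    return (curr_rate - base_rate) / base_rate * 100


def gate_passes(base_rate: float, curr_rate: float, max_regression_pct: float = 2.0) -> bool:
    """Mirror of the workflow step: skip when there is no usable baseline."""
    delta = regression_delta_pct(base_rate, curr_rate)
    return delta is None or delta >= -max_regression_pct


# A one-point absolute drop from 40% to 39% is a 2.5% relative regression,
# so it trips a 2% gate even though the absolute change looks small.
print(gate_passes(0.40, 0.39))   # False
print(gate_passes(0.40, 0.396))  # True (1% relative regression)
```

With a low baseline rate, small absolute drops become large relative regressions, which is worth keeping in mind when choosing `BENCH_MAX_REGRESSION_PCT`.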

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -36,3 +36,4 @@ coverage/
 reference/
 nul
 .sisyphus/
+.omc/

docs/INSTALL-LOCAL.md

Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
# Local Vigilo Development — pointing OpenCode / Claude Code at the local build

This guide wires a local Vigilo source tree (e.g. `zfp-overhaul` branch) into
an existing OpenCode / opencode-web3 / Claude Code session so you can iterate
on agents, skills, and routing without publishing to npm.

## Prerequisites

- `bun ≥ 1.3.12`
- `node ≥ 22`
- `forge ≥ 1.5`
- (optional) `slither`, `halmos`, `medusa`, `semgrep`, `aderyn`
- Live worktree at `/home/void/Vigilo-zfp` (or your chosen path)

## 1 — Build the plugin

```bash
cd /home/void/Vigilo-zfp/packages/opencode
npm ci            # npm, not `bun install`: script-name conflict on bun ≥ 1.3
bun build.mjs     # uses Bun.build() API (see note below)
npx tsc --noEmit  # typecheck
```

### Note: bun script-name conflict

The `build` script in `package.json` and the `bun build` CLI subcommand
conflict on bun ≥ 1.3. This repo's `build.mjs` sidesteps the conflict by
using `Bun.build()` + `npx tsc` directly. Run `bun build.mjs`, not
`bun run build`.

## 2 — Option A: file reference in opencode-web3

```bash
# Back up your config
cp ~/.config/opencode-web3/opencode/opencode.json{,.bak}

# Edit opencode.json — replace "vigilo@latest" with local file reference
```

Replace the plugin line in `~/.config/opencode-web3/opencode/opencode.json`:

```diff
 "plugin": [
   "opencode-claude-auth",
   "opencode-openai-codex-auth",
-  "vigilo@latest"
+  "file:/home/void/Vigilo-zfp/packages/opencode"
 ],
```

Restart opencode-web3. The local build is now loaded.

## 3 — Option B: Claude Code plugin path

Claude Code auto-discovers agents from `packages/claude/agents/*.md`. Point
at the local plugin via `~/.claude/settings.json`:

```jsonc
{
  "extraKnownMarketplaces": {
    "vigilo-local": {
      "source": {
        "source": "local",
        "path": "/home/void/Vigilo-zfp/packages/claude"
      }
    }
  }
}
```

Then run `/plugin install vigilo@vigilo-local` from a Claude Code session.

## 4 — Verify new agents are registered

From an OpenCode / Claude Code session:

```
/agents list
```

Expected new agents (9):

- `verifier`
- `judge` (and `judge-gpt` variant once wired)
- `griller`
- `poc-generator`
- `patcher`
- `re-verifier`
- `economic-auditor`
- `invariant-tester`
- `dup-detector`

Plus existing: `vigilo`, `quaestor`, `explorator`, `speculator`, and the 8
specialist auditors.

## 5 — Run a smoke audit on alchemix-v3

```bash
cd /home/void/alchemix-v3

# Run the Phase 2.5 static pre-pass alone (no LLM cost)
/home/void/Vigilo-zfp/packages/claude/scripts/static-prepass.sh .
cat .vigilo/prepass.md

# Full audit (live LLMs — budget ~$3-8 per run for alchemix-v3 size)
# From opencode-web3 / Claude Code:
/audit
```

Expected pipeline:

1. Phase -1 classify → FULL_AUDIT
2. Phase 0 scope (scope.md already exists)
3. Phase 1 recon (explorator + speculator parallel)
4. Phase 1.5 risk-priority map
5. Phase 2 deep analysis (reentrancy + oracle + economic + … — parallel ≤3)
6. **Phase 2.5 static pre-pass** (parallel, non-blocking)
7. **Phase 3 ZFP pipeline** — PoC → verifier → dup-check → judge → griller →
   patcher → re-verifier
8. Phase 4 quality review
9. Phase 5 report → `.vigilo/reports/`

## 6 — Compare to prior findings

alchemix-v3 already has a `.vigilo/` from a prior run. Copy it to
`.vigilo.prior` before starting the ZFP audit so there is a baseline to diff
against. After the ZFP audit:

```bash
# Snapshot the new output
cp -r .vigilo .vigilo.zfp

# Diff
diff -r .vigilo.prior/findings .vigilo.zfp/findings | head -60
```

Metrics to extract:

- New findings vs prior (potential improvement)
- Prior findings dropped by ZFP (potential FP rejection or quality gate)
- Severity distribution shift
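
The first two metrics can be computed from the two snapshots directly. A small Python sketch, assuming one Markdown file per finding under `findings/` and that the same finding keeps the same relative path across runs (both are assumptions about the `.vigilo/` layout, and `compare_runs` is a hypothetical helper):

```python
from pathlib import Path
from typing import Dict, List, Set


def list_findings(root: Path) -> Set[str]:
    """Relative paths of all finding files under <root>/findings."""
    findings = root / "findings"
    if not findings.is_dir():
        return set()
    return {p.relative_to(findings).as_posix() for p in findings.rglob("*.md")}


def compare_runs(prior: Path, zfp: Path) -> Dict[str, List[str]]:
    """Classify findings as new, dropped, or kept between two snapshots."""
    before, after = list_findings(prior), list_findings(zfp)
    return {
        "new": sorted(after - before),
        "dropped": sorted(before - after),
        "kept": sorted(before & after),
    }


if __name__ == "__main__":
    report = compare_runs(Path(".vigilo.prior"), Path(".vigilo.zfp"))
    for bucket, names in report.items():
        print(f"{bucket}: {len(names)}")
```

Severity distribution shift is left out because it depends on how each finding file records its severity.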

## 7 — Configure the corpus (optional but recommended)

```bash
# Bootstrap ~/.vigilo-corpus/ with top-60 C4 + 60 Sherlock findings repos
python3 packages/claude/scripts/corpus-ingest.py --top-n 60 --workers 12

# Stats
packages/claude/scripts/corpus-stats.sh

# Test query
python3 packages/claude/scripts/dup-query.py \
  --title "Reentrancy in withdraw" --protocol vault --k 5
```
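
For intuition about what the test query scores, here is a minimal Python sketch of a similarity ranking that combines character n-gram Jaccard with token overlap and a protocol filter. It is not the code in `dup-query.py`: the trigram size, the equal 0.5/0.5 weighting, the `top_k` name, and the sample corpus are all assumptions.

```python
from typing import Dict, List, Optional, Set, Tuple


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Character n-grams over lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    if len(s) <= n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def token_overlap(a: str, b: str) -> float:
    """Shared tokens divided by the size of the smaller token set."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def top_k(title: str, corpus: List[Dict[str, str]], k: int = 5,
          protocol: Optional[str] = None) -> List[Tuple[float, str]]:
    """Rank corpus titles by a blended score, optionally filtered by protocol."""
    scored = []
    for entry in corpus:
        if protocol is not None and entry["protocol"] != protocol:
            continue
        score = (0.5 * jaccard(char_ngrams(title), char_ngrams(entry["title"]))
                 + 0.5 * token_overlap(title, entry["title"]))
        scored.append((round(score, 3), entry["title"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up sample entries, for illustration only.
CORPUS = [
    {"title": "Reentrancy in withdraw allows draining", "protocol": "vault"},
    {"title": "Stale oracle price", "protocol": "vault"},
    {"title": "Reentrancy in withdraw", "protocol": "dex"},
]
```

Calling `top_k("Reentrancy in withdraw", CORPUS, k=5, protocol="vault")` ranks the vault reentrancy entry first and never returns the `dex` entry, since the protocol filter runs before scoring.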

## 8 — Configure pgvector (optional, v2 semantic dup-detect)

```bash
# pgvector container (already running if set up during install)
docker run -d --name vigilo-pgvector \
  -e POSTGRES_PASSWORD=vigilo -e POSTGRES_DB=vigilo \
  -p 5433:5432 pgvector/pgvector:pg17

# Initialize schema
packages/claude/scripts/corpus-bootstrap.sh --pgvector
```

Connection string: `postgres://postgres:vigilo@localhost:5433/vigilo`

## 9 — Troubleshooting

### "agent `verifier` not found"
- Check `/agents list` — if missing, verify plugin is loaded (`/plugin list`)
- Restart opencode session after changing config
- Confirm `packages/claude/agents/verifier.md` exists in the linked path

### Slither compile error
The default filter `(/|^)(test|mock|script|lib|node_modules)(/|$)` excludes
common test paths. Test files in nested directories (e.g. `src/test/`) are
covered by the `\.t\.sol$` suffix rule. If Slither still fails on
`Type not found`, it may be a project-specific crytic-compile issue;
configure `slither.config.json` at the project root.
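
To check which paths the two patterns above catch, they can be exercised in isolation. A quick Python sketch; treating a path as excluded when either pattern matches is an assumption about how `static-prepass.sh` combines them, and `is_excluded` is a hypothetical helper:

```python
import re

# Patterns copied from the text above.
DIR_FILTER = re.compile(r"(/|^)(test|mock|script|lib|node_modules)(/|$)")
SUFFIX_FILTER = re.compile(r"\.t\.sol$")


def is_excluded(path: str) -> bool:
    """True if the path matches the directory filter or the suffix rule."""
    return bool(DIR_FILTER.search(path) or SUFFIX_FILTER.search(path))


for candidate in (
    "test/Vault.t.sol",            # excluded: top-level test dir
    "lib/forge-std/src/Test.sol",  # excluded: lib dir
    "src/invariants/Vault.t.sol",  # excluded: suffix rule only
    "src/mocks/MockToken.sol",     # kept: "mocks" is not the exact segment "mock"
    "src/Vault.sol",               # kept
):
    print(f"{candidate}: {'excluded' if is_excluded(candidate) else 'kept'}")
```

The directory pattern only matches exact path segments, so plural names such as `mocks/` or `tests/` pass through unless a `slither.config.json` adds them.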

### `bun install` fails with "Script not found"
Use `npm ci` or `npm install`. bun ≥ 1.3 treats `install` as a script run
because of the same script-slot conflict that affects `build` (see the note
in section 1).

### OpenCode doesn't pick up local changes
- Rebuild: `cd packages/opencode && bun build.mjs`
- Clear OpenCode plugin cache (location depends on version)
- Restart opencode-web3

## 10 — Run benchmark locally

```bash
cd packages/bench
npm ci
npm run build
node dist/cli.js --help
node dist/cli.js run --dataset ./data/dataset.json --baselines ./data/baselines \
  --out ./data/results-local.json --mode replay
```

## 11 — Cost budgeting

Expected LLM spend with the new ZFP pipeline, per candidate finding:

| Role | Calls/finding | Model | Est. cost/call |
|------|---------------|-------|----------------|
| Specialist auditors | 1 | Sonnet 4.6 | $0.15 |
| poc-generator | 1–3 | gpt-5.2-codex high | $0.08 |
| verifier | 1 | Opus 4.6 xhigh | $0.40 |
| judge | 1 | Opus 4.6 xhigh | $0.20 |
| griller | 3 rounds | Opus 4.6 **max** | $0.60 × 3 |
| patcher | 1–2 | gpt-5.2-codex high | $0.05 |
| re-verifier | 1 | Opus 4.5 high | $0.15 |
| dup-detector | 1 | Haiku 4.5 | $0.01 |

Per **candidate finding**: ~$3 end-to-end. Per full audit (~10 candidates):
~$30. A finding rejected before the grill skips all three griller rounds
(~$1.80 saved per reject).

Budget the griller carefully: it is the single most expensive role. Disable
it with the `--no-grill` flag when iterating on non-Critical findings.
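
The ~$3 figure follows from summing the table. A short Python check of that arithmetic, using the table's call counts and per-call estimates (the `cost_range` helper is illustrative only):

```python
# (role, min_calls, max_calls, usd_per_call), taken from the table above
ROLES = [
    ("specialist auditors", 1, 1, 0.15),
    ("poc-generator", 1, 3, 0.08),
    ("verifier", 1, 1, 0.40),
    ("judge", 1, 1, 0.20),
    ("griller", 3, 3, 0.60),
    ("patcher", 1, 2, 0.05),
    ("re-verifier", 1, 1, 0.15),
    ("dup-detector", 1, 1, 0.01),
]


def cost_range(roles, skip=()):
    """Return (low, high) USD per candidate finding, skipping named roles."""
    low = sum(lo * usd for name, lo, _hi, usd in roles if name not in skip)
    high = sum(hi * usd for name, _lo, hi, usd in roles if name not in skip)
    return round(low, 2), round(high, 2)


low, high = cost_range(ROLES)
no_grill_low, no_grill_high = cost_range(ROLES, skip=("griller",))
print(f"per candidate: ${low:.2f} to ${high:.2f}")
print(f"without griller: ${no_grill_low:.2f} to ${no_grill_high:.2f}")
print(f"per audit (10 candidates): ${low * 10:.2f} to ${high * 10:.2f}")
```

The range comes from the variable call counts for `poc-generator` and `patcher`; the gap between the full and no-griller totals is the $1.80 saved per rejected finding.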
