Commit c798537

bb-connor and claude committed
Phase 0 issue #1: 8 time-to-first-correct-fix fixtures
Eight fixtures hand-curated from real merged arc bug-fix PRs covering diverse subsystems:

  pr-320 — kernel canonical receipt signing
  pr-357 — wasm warm tenant ring bounds
  pr-534 — eval-receipt rejects duplicate/unknown memo fields (P0-ish)
  pr-544 — sdk surfaces receipt-verify protocol errors
  pr-550 — data-guards default redactor regex compile failures
  pr-551 — api-protect constant-time bearer compare (security)
  pr-591 — attenuation_proof parent_scope_hash unbinding (P0)
  pr-592 — sibling-sum budget oversubscription (P0)

Each fixture has real arc commit SHAs (baseRefOid → mergeCommit.oid), an estimated test signature (verify on first runner run), an expected patch_signature for diff-content matching, and curator notes explaining the bug class and why it makes good fixture material.

Diversity check: kernel, wasm, sdk, api-protect, data-guards, attenuation, budgets, eval-receipt — 8 distinct subsystems. PR #592 + #591 are paired P0 capability-system fixes; both are included because they exercise different invariants (binding vs conservation).

eval-outcomes report status flips:

  | time-to-first-correct-fix | 8 | BLOCKED — runner | fixtures present; runner is Phase 1 |

Fixture count gate cleared (≥ 8). The only remaining block is chio_pack/eval/runners/time_to_fix.py, which is a Phase 1 deliverable per PHASE-0.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 0f275da commit c798537

9 files changed

Lines changed: 297 additions & 2 deletions
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+id: pr-320-canonical-receipt-signing-2026-04-29
+title: "Kernel signing task uses non-canonical receipt encoding (signature mismatches across peers)"
+source: "arc#PR-320"
+
+arc_commit_before: "b9ab2ba79d3389d6d6bead5cbfcd58fa27486a37"
+arc_commit_after: "ecfbbe6e5c3eaa9f5cd318893a4eb564c3eea0ca"
+
+agent:
+  runner: codex-cli
+  model: claude-opus-4-7
+  budget_seconds: 1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@b9ab2ba79d3389d6d6bead5cbfcd58fa27486a37"
+  failing_test: "crates/chio-kernel/src/kernel/signing_task.rs::test_canonical_receipt_signature_matches_peer_verifiers"
+
+expected:
+  patch_signature: "shared canonical receipt signing"
+  tests_must_pass:
+    - "crates/chio-kernel/src/kernel/signing_task.rs::test_canonical_receipt_signature_matches_peer_verifiers"
+
+scoring:
+  time_weight: 0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #320 metadata. The fix routes the kernel's
+  signing task through a shared canonical encoder so receipt signatures
+  agree with peer verifiers. Test signature is estimated from the PR
+  title; verify on first runner exercise. Single-file change (1 file,
+  59 additions / 14 deletions) — well-scoped.
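The bug class in the fixture above (signatures that disagree because peers encode the same receipt differently) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the receipt is modelled as a JSON-like dict, canonicalization is sorted-key compact JSON, and HMAC-SHA256 stands in for whatever signature scheme the kernel actually uses.

```python
import hashlib
import hmac
import json


def canonical_bytes(receipt: dict) -> bytes:
    """Encode a receipt so that key order and whitespace cannot change the bytes.

    Hypothetical canonical form: sorted keys, compact separators, UTF-8.
    """
    return json.dumps(
        receipt, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def sign(receipt: dict, key: bytes) -> str:
    """Sign the canonical encoding (HMAC is a stand-in, not the real scheme)."""
    return hmac.new(key, canonical_bytes(receipt), hashlib.sha256).hexdigest()


# Two peers build the same logical receipt with different field order.
peer_a = {"id": "r1", "amount": 5}
peer_b = {"amount": 5, "id": "r1"}

# Because both go through the shared encoder, the signatures agree.
same_signature = sign(peer_a, b"demo-key") == sign(peer_b, b"demo-key")
```

Signing the output of a naive, order-dependent serializer instead would make the two signatures differ even though the receipts are logically equal, which is the failure the fixture title describes.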
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+id: pr-357-wasm-tenant-rings-2026-04-30
+title: "WASM guard runtime grows tenant ring buffers without bound (memory blowup under burst)"
+source: "arc#PR-357"
+
+arc_commit_before: "770cfc71a8596a5309cc4c03f19ff2e8637a9164"
+arc_commit_after: "1aadbe24d6ba94222fce4a94fa81721fc3d0aa0b"
+
+agent:
+  runner: codex-cli
+  model: claude-opus-4-7
+  budget_seconds: 1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@770cfc71a8596a5309cc4c03f19ff2e8637a9164"
+  failing_test: "crates/chio-wasm-guards/src/runtime.rs::test_warm_tenant_ring_bounded_under_burst"
+
+expected:
+  patch_signature: "warm tenant ring"
+  tests_must_pass:
+    - "crates/chio-wasm-guards/src/runtime.rs::test_warm_tenant_ring_bounded_under_burst"
+
+scoring:
+  time_weight: 0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #357 metadata. Single-file change to
+  chio-wasm-guards/src/runtime.rs (245 additions / 17 deletions) bounds
+  the warm-tenant ring buffer. Failure mode at parent commit: bursts
+  cause unbounded memory growth in the wasm runtime. Test name
+  estimated; verify on first runner exercise.
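As a rough illustration of what "bounded under burst" means for a per-tenant ring, the sketch below uses a fixed-capacity deque per tenant. The class and method names are hypothetical and this is not the runtime's actual data structure; it only shows the invariant the fixture's test name suggests.

```python
from collections import deque


class TenantRings:
    """Per-tenant ring buffers with a hard capacity (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._rings: dict[str, deque] = {}

    def push(self, tenant: str, item: object) -> None:
        # deque(maxlen=...) silently evicts the oldest entry once full,
        # so a burst can never grow the ring past its capacity.
        ring = self._rings.setdefault(tenant, deque(maxlen=self._capacity))
        ring.append(item)

    def snapshot(self, tenant: str) -> list:
        return list(self._rings.get(tenant, ()))


rings = TenantRings(capacity=4)
for i in range(1000):  # simulated burst for one tenant
    rings.push("tenant-a", i)
```

This bounds each tenant's ring only; a real runtime would presumably also need a cap on the number of tenants tracked, which the sketch leaves out.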
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+id: pr-534-eval-receipt-memo-fields-2026-05-05
+title: "eval-receipt accepts duplicate and unknown memo signature fields (verifier acceptance bug)"
+source: "arc#PR-534"
+
+arc_commit_before: "47fe8d30c1198bc260a80d43243cdfccecd5777b"
+arc_commit_after: "6f267d66ddc03cf625fdb86a1aae6b0b6dad0e35"
+
+agent:
+  runner: codex-cli
+  model: claude-opus-4-7
+  budget_seconds: 1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@47fe8d30c1198bc260a80d43243cdfccecd5777b"
+  failing_test: "crates/chio-eval-receipt/src/lib.rs::test_duplicate_memo_signature_field_rejected"
+
+expected:
+  patch_signature: "duplicate and unknown memo"
+  tests_must_pass:
+    - "crates/chio-eval-receipt/src/lib.rs::test_duplicate_memo_signature_field_rejected"
+    - "crates/chio-eval-receipt/src/lib.rs::test_unknown_memo_signature_field_rejected"
+    - "crates/chio-cli/tests/federation_policy.rs::test_eval_receipt_rejects_invalid_memo"
+
+scoring:
+  time_weight: 0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #534 metadata. Multi-test fixture covering
+  the eval-receipt verifier's strict-acceptance contract: duplicate
+  memo signature fields and unknown memo signature fields MUST both be
+  rejected. The PR touched 16 files including the CLI federation_policy
+  test, suggesting the bug had broad blast radius. Tests the agent's
+  ability to read the spec for what receipts MUST accept/reject and
+  apply that to all the surfaces (lib + CLI + federation policy).
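The strict-acceptance contract described in the notes above (reject duplicates, reject unknowns) can be sketched for a flat JSON object as follows. The field names in `ALLOWED_FIELDS`, the `MemoFieldError` type, and the JSON wire format are all assumptions for illustration; the source does not say how memo signature fields are encoded.

```python
import json

# Hypothetical allow-list; the real field names are not given in the source.
ALLOWED_FIELDS = {"signer", "signature"}


class MemoFieldError(ValueError):
    """Raised when a memo carries a duplicate or unknown field."""


def _strict_pairs(pairs: list) -> dict:
    # json.loads hands us every key/value pair in document order, so
    # duplicates are still visible here (a plain dict would hide them).
    seen: dict = {}
    for key, value in pairs:
        if key in seen:
            raise MemoFieldError(f"duplicate field: {key}")
        if key not in ALLOWED_FIELDS:
            raise MemoFieldError(f"unknown field: {key}")
        seen[key] = value
    return seen


def parse_memo(raw: str) -> dict:
    """Parse a flat memo object, failing on duplicate or unknown fields."""
    return json.loads(raw, object_pairs_hook=_strict_pairs)
```

A default `json.loads` would keep the last value of a duplicated key and accept any extra key without complaint, which is the kind of lenient acceptance the fixture targets.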
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+id: pr-544-sdk-receipt-verify-errors-2026-05-04
+title: "TypeScript node-http SDK swallows receipt-verify protocol errors"
+source: "arc#PR-544"
+
+arc_commit_before: "c03239506806dd7efff1eaa5444c9fa3bab774fd"
+arc_commit_after: "e658f765f5be5c86cf22f789750c6839c7fefe1c"
+
+agent:
+  runner: codex-cli
+  model: claude-opus-4-7
+  budget_seconds: 1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@c03239506806dd7efff1eaa5444c9fa3bab774fd"
+  failing_test: "sdks/typescript/packages/node-http/test/sidecar-client.test.ts:surfaces protocol errors from receipt verify"
+
+expected:
+  patch_signature: "receipt-verify protocol errors"
+  tests_must_pass:
+    - "sdks/typescript/packages/node-http/test/sidecar-client.test.ts:surfaces protocol errors from receipt verify"
+
+scoring:
+  time_weight: 0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #544 metadata. The SDK was swallowing
+  receipt-verify protocol errors instead of surfacing them — a real
+  observability bug class. The PR added 265 lines to the SDK test file,
+  suggesting comprehensive failure-mode coverage. Test name is the
+  best-guess Vitest title from the PR description; verify on first
+  runner exercise.
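Every fixture carries the same `agent` budgets and `scoring` weights, but the commit does not say how the Phase 1 runner will combine them. One plausible reading, sketched below purely as an assumption, is a weighted sum of the unused fraction of each budget, forced to zero when any `zero_score_if` condition holds. `RunResult` and `score` are hypothetical names, not part of `chio_pack`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one agent run against a fixture."""
    elapsed_seconds: float
    tool_calls: int
    all_expected_tests_pass: bool
    deleted_existing_test: bool


def score(
    result: RunResult,
    *,
    budget_seconds: float = 1800,
    budget_tool_calls: int = 100,
    time_weight: float = 0.6,
    tool_call_weight: float = 0.4,
) -> float:
    """Assumed formula: weighted unused-budget fraction, in [0, 1]."""
    # zero_score_if: any expected test still failing, or a test was deleted.
    if not result.all_expected_tests_pass or result.deleted_existing_test:
        return 0.0
    # Clamp at zero so an over-budget run cannot produce a negative term.
    time_left = max(0.0, 1.0 - result.elapsed_seconds / budget_seconds)
    calls_left = max(0.0, 1.0 - result.tool_calls / budget_tool_calls)
    return time_weight * time_left + tool_call_weight * calls_left
```

Under this assumed formula, a correct fix that uses half of each budget scores about 0.5, and a run that deletes a test scores 0 regardless of speed.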
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+id: pr-550-data-guards-regex-compile-2026-05-04
+title: "default redactor swallows regex compile failures, leaks unfiltered data"
+source: "arc#PR-550"
+
+arc_commit_before: "c03239506806dd7efff1eaa5444c9fa3bab774fd"
+arc_commit_after: "883b45b254f9d3f4729abce29bf5a9a73f793ac9"
+
+agent:
+  runner: codex-cli
+  model: claude-opus-4-7
+  budget_seconds: 1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@c03239506806dd7efff1eaa5444c9fa3bab774fd"
+  failing_test: "crates/chio-data-guards/redactors/default/src/lib.rs::test_compile_failure_surfaces_as_redaction_error"
+
+expected:
+  patch_signature: "regex compile"
+  tests_must_pass:
+    - "crates/chio-data-guards/redactors/default/src/lib.rs::test_compile_failure_surfaces_as_redaction_error"
+
+scoring:
+  time_weight: 0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #550 metadata. Single-file change (199
+  additions / 24 deletions). The default data-guard redactor was
+  swallowing regex compile failures silently — a fail-open bug that
+  let unfiltered data through when a user-provided redaction pattern
+  was malformed. Tests the agent's recognition of fail-closed vs
+  fail-open semantics, plus understanding of where redaction errors
+  must be surfaced (immediately, not buffered).
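The fail-closed behaviour the notes above call for can be sketched with Python's `re` module. The names `RedactionError` and `build_redactor` are hypothetical and the real redactor is Rust, so this shows only the shape of the fix: a pattern that fails to compile becomes an error at construction time instead of being skipped.

```python
import re
from typing import Callable, Iterable


class RedactionError(Exception):
    """Raised when a redaction pattern cannot be compiled."""


def build_redactor(patterns: Iterable[str]) -> Callable[[str], str]:
    """Compile every pattern up front; refuse to build on any failure."""
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error as exc:
            # Fail closed: surface the problem immediately rather than
            # dropping the pattern and letting unredacted text through.
            raise RedactionError(
                f"invalid redaction pattern {pattern!r}: {exc}"
            ) from exc

    def redact(text: str) -> str:
        for regex in compiled:
            text = regex.sub("[REDACTED]", text)
        return text

    return redact
```

The fail-open variant would wrap `re.compile` in a bare `except: continue`, producing a redactor that silently ignores the malformed pattern.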
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+id: pr-551-api-protect-constant-time-2026-05-05
+title: "api-protect compares sidecar bearer tokens with non-constant-time comparison (timing leak)"
+source: "arc#PR-551"
+
+arc_commit_before: "c03239506806dd7efff1eaa5444c9fa3bab774fd"
+arc_commit_after: "15b283b54e67459eab6938267ff4a44232bcda47"
+
+agent:
+  runner: codex-cli
+  model: claude-opus-4-7
+  budget_seconds: 1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@c03239506806dd7efff1eaa5444c9fa3bab774fd"
+  failing_test: "crates/chio-api-protect/src/proxy.rs::test_bearer_token_compare_is_constant_time"
+
+expected:
+  patch_signature: "constant-time"
+  tests_must_pass:
+    - "crates/chio-api-protect/src/proxy.rs::test_bearer_token_compare_is_constant_time"
+
+scoring:
+  time_weight: 0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #551 metadata. Security-sensitive fix:
+  bearer-token comparison must be constant-time to prevent timing
+  side-channel attacks. The test likely uses a statistical timing
+  harness or a mocked compare with timing instrumentation. The fix
+  is small (4 files, mostly the proxy.rs change). Tests the agent's
+  retrieval of `subtle::ConstantTimeEq` patterns for constant-time
+  comparison.
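For readers unfamiliar with the bug class, the sketch below shows the same idea in Python using the standard library's `hmac.compare_digest`. It is an analogue only: the actual fix is in Rust, and `bearer_token_matches` is a hypothetical helper name.

```python
import hmac


def bearer_token_matches(presented: str, expected: str) -> bool:
    """Compare two bearer tokens without short-circuiting on the first mismatch.

    A plain `presented == expected` may return as soon as a byte differs,
    so response time can leak how long the matching prefix is.
    compare_digest is designed to avoid that content-dependent early exit.
    """
    # Encode first: compare_digest accepts only ASCII str, but any bytes.
    return hmac.compare_digest(
        presented.encode("utf-8"), expected.encode("utf-8")
    )
```

Note that `compare_digest` may still reveal the length of the inputs, so fixed-length tokens (or comparing digests of the tokens) are the usual companion measure.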
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+id: pr-591-attenuation-proof-unbinding-2026-05-05
+title: "attenuation_proof allows parent_scope_hash unbinding — child cap can claim wider scope (P0)"
+source: "arc#PR-591"
+
+arc_commit_before: "5498b0cbe344acc5e108f040f6c7a910f8c96b78"
+arc_commit_after: "bc964f12725ab8c02c690a60c57e261f5954da2f"
+
+agent:
+  runner: codex-cli
+  model: claude-opus-4-7
+  budget_seconds: 1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@5498b0cbe344acc5e108f040f6c7a910f8c96b78"
+  failing_test: "crates/chio-conformance/tests/attenuation_witness_rejects_inflated_parent_scope.rs::test_inflated_parent_scope_rejected"
+
+expected:
+  patch_signature: "parent_scope_hash"
+  tests_must_pass:
+    - "crates/chio-conformance/tests/attenuation_witness_rejects_inflated_parent_scope.rs::test_inflated_parent_scope_rejected"
+
+scoring:
+  time_weight: 0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #591 metadata. **P0 security fix.** The
+  attenuation proof failed to bind to parent_scope_hash, letting a child
+  capability falsely claim a wider scope than its parent. The PR added
+  a new conformance test (183 lines) that asserts inflated parent
+  scopes are rejected. Test fn name guessed; verify on first runner
+  exercise. This is one of the most architecturally important fixtures
+  — exercises the agent's ability to find both the protocol bug and
+  the conformance assertion that pins it.
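A toy model may help make "binding to `parent_scope_hash`" concrete. The sketch below is not the arc protocol: scopes are modelled as sets of strings, the hash is SHA-256 over the sorted scope entries, and `verify_attenuation` is a hypothetical verifier. It shows only why the verifier must recompute the hash from the real parent rather than trust the value the child presents.

```python
import hashlib
import hmac


def scope_hash(scope: frozenset) -> str:
    """Hash a scope deterministically (hypothetical encoding: sorted, newline-joined)."""
    joined = "\n".join(sorted(scope)).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


def verify_attenuation(
    parent_scope: frozenset,
    child_scope: frozenset,
    claimed_parent_scope_hash: str,
) -> bool:
    """Accept a child only if it is bound to the real parent and no wider."""
    # Binding: the hash carried by the proof must match the actual parent.
    bound = hmac.compare_digest(scope_hash(parent_scope), claimed_parent_scope_hash)
    # Attenuation: the child may narrow the parent's scope, never widen it.
    return bound and child_scope <= parent_scope


real_parent = frozenset({"read"})
inflated_parent = frozenset({"read", "write"})
```

An unbound verifier would check the child against whatever parent scope the proof claims, so a child presenting `inflated_parent` could obtain `write` even though the real parent only holds `read`.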
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+id: pr-592-sibling-sum-budget-oversubscription-2026-05-05
+title: "Sibling capability budgets allowed to oversubscribe their parent (P0)"
+source: "arc#PR-592"
+
+arc_commit_before: "bc964f12725ab8c02c690a60c57e261f5954da2f"
+arc_commit_after: "ff9ff73144b7aab9081ad84b8dcf81c7d421c582"
+
+agent:
+  runner: codex-cli
+  model: claude-opus-4-7
+  budget_seconds: 1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@bc964f12725ab8c02c690a60c57e261f5954da2f"
+  failing_test: "crates/chio-conformance/tests/budget_split_rejects_oversubscribed_siblings.rs::test_oversubscribed_sibling_budgets_rejected"
+
+expected:
+  patch_signature: "budget_split"
+  tests_must_pass:
+    - "crates/chio-conformance/tests/budget_split_rejects_oversubscribed_siblings.rs::test_oversubscribed_sibling_budgets_rejected"
+    - "crates/chio-conformance/tests/budget_split_cross_hop_rejects_amplification.rs::test_cross_hop_amplification_rejected"
+
+scoring:
+  time_weight: 0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #592 metadata. **P0 security fix** following
+  PR #591. Budget split across delegated siblings was failing to enforce
+  that the sum doesn't exceed the parent's budget. Two new conformance
+  tests (209 + 186 lines) cover (a) sibling oversubscription within
+  one hop and (b) cross-hop amplification. Both tests must pass.
+  The agent must recognize this is a budget-conservation invariant
+  ("sibling sums ≤ parent sum") and find the canonical violation site.
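The conservation invariant named in the notes above can be written down as a recursive check over a delegation tree. The `Cap` type and `conserves_budget` function are hypothetical, and integer budgets are an assumption; the sketch covers both cases the fixture lists, oversubscription within one hop and amplification across hops.

```python
from dataclasses import dataclass, field


@dataclass
class Cap:
    """A capability with a budget and the children it delegated to (toy model)."""
    budget: int
    children: list["Cap"] = field(default_factory=list)


def conserves_budget(cap: Cap) -> bool:
    """True if, at every hop, the children's budgets sum to at most the parent's."""
    if cap.budget < 0:
        return False
    # One-hop check: siblings may not oversubscribe their parent.
    if sum(child.budget for child in cap.children) > cap.budget:
        return False
    # Cross-hop check: the same rule must hold for every descendant.
    return all(conserves_budget(child) for child in cap.children)


# (a) siblings oversubscribe within one hop: 6 + 5 > 10
one_hop_violation = Cap(10, [Cap(6), Cap(5)])
# (b) first hop is fine (6 + 4 <= 10) but the second amplifies: 4 + 3 > 6
cross_hop_violation = Cap(10, [Cap(6, [Cap(4), Cap(3)]), Cap(4)])
# every hop conserves budget
valid_tree = Cap(10, [Cap(6, [Cap(3), Cap(3)]), Cap(4)])
```

Checking each child against the parent individually (`child.budget <= parent.budget`) is not enough, since both siblings in case (a) pass that test while their sum does not.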

vault/_meta/dashboards/eval-outcomes.md

Lines changed: 2 additions & 2 deletions
@@ -1,10 +1,10 @@
-# Outcome evals — 2026-05-07 19:02 UTC
+# Outcome evals — 2026-05-07 19:05 UTC
 
 > Generated by `chio-pack-eval` Phase 0 skeleton. No real runners yet — see PHASE-0.md.
 
 | Eval | Fixtures | Status | Notes |
 | ---- | -------- | ------ | ----- |
-| `time-to-first-correct-fix` | 0 | BLOCKED — fixtures | have 0, need ≥ 8 (PHASE-0.md) |
+| `time-to-first-correct-fix` | 8 | BLOCKED — runner | fixtures present; runner is Phase 1 deliverable |
 | `repeated-mistake-rate` | 0 | BLOCKED — runner | no fixtures glob; runner is Phase 1 deliverable |
 | `conformance-harness-recall` | 11 | BLOCKED — fixtures | have 11, need ≥ 20 (PHASE-0.md) |
 | `capability-error-explanation` | 0 | BLOCKED — fixtures | have 0, need ≥ 10 (PHASE-0.md) |
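The status column in the dashboard above appears to follow a two-stage gate: an eval is blocked on fixtures until it reaches its minimum count, then blocked on the runner until one exists. The sketch below encodes that reading. It is an assumption, not the `chio-pack-eval` implementation; `eval_status` and the `READY` state are hypothetical, and `required=None` models an eval with no fixture threshold.

```python
from typing import Optional


def eval_status(
    fixtures: int, required: Optional[int], runner_ready: bool
) -> tuple:
    """Return (status, reason) for one dashboard row under the assumed gating."""
    # Gate 1: enough fixtures must exist before anything else matters.
    if required is not None and fixtures < required:
        return ("BLOCKED", f"fixtures: have {fixtures}, need >= {required}")
    # Gate 2: with fixtures in place, the eval still needs a runner.
    if not runner_ready:
        return ("BLOCKED", "runner")
    return ("READY", "")


# The flip recorded in this commit for time-to-first-correct-fix:
before = eval_status(0, 8, runner_ready=False)
after = eval_status(8, 8, runner_ready=False)
```

With these inputs, `before` is blocked on fixtures and `after` is blocked on the runner, matching the two versions of the row in the diff.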
