Phase 0 issue #1: 8 time-to-first-correct-fix fixtures

bb-connor · claude · bb-connor · commit c79853755ff8 · 2026-05-07T15:05:53.000-04:00
Eight fixtures hand-curated from real merged arc bug-fix PRs covering
diverse subsystems:

  pr-320 — kernel canonical receipt signing
  pr-357 — wasm warm tenant ring bounds
  pr-534 — eval-receipt rejects duplicate/unknown memo fields (P0-ish)
  pr-544 — sdk surfaces receipt-verify protocol errors
  pr-550 — data-guards default redactor regex compile failures
  pr-551 — api-protect constant-time bearer compare (security)
  pr-591 — attenuation_proof parent_scope_hash unbinding (P0)
  pr-592 — sibling-sum budget oversubscription (P0)

Each fixture has real arc commit SHAs (baseRefOid → mergeCommit.oid),
an estimated test signature (verify on first runner run), an expected
patch_signature for diff-content matching, and curator notes explaining
the bug class and why the PR makes good fixture material.
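
The field relationships above can be checked mechanically before the
runner exists. A minimal sketch, assuming a fixture has already been
parsed into a dict; check_fixture and its rules (full 40-hex SHAs,
starting_branch pinned as "main@<before>", failing_test listed in
tests_must_pass, weights summing to 1.0) are hypothetical conventions
inferred from the eight files, not an existing chio-pack API:

```python
import re

SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def check_fixture(fx: dict) -> list[str]:
    """Return a list of problems found in one parsed fixture mapping.

    Hypothetical helper: the rules are inferred from the fixture files
    in this commit, not taken from a published schema.
    """
    problems = []
    before = fx.get("arc_commit_before", "")
    after = fx.get("arc_commit_after", "")
    for name, sha in (("arc_commit_before", before), ("arc_commit_after", after)):
        if not SHA_RE.match(sha):
            problems.append(f"{name} is not a full 40-hex SHA")
    if before and before == after:
        problems.append("before and after commits are identical")

    seed = fx.get("seed_state", {})
    if seed.get("starting_branch") != f"main@{before}":
        problems.append("starting_branch does not pin arc_commit_before")

    expected = fx.get("expected", {})
    if not expected.get("patch_signature"):
        problems.append("missing expected.patch_signature")
    if seed.get("failing_test") not in expected.get("tests_must_pass", []):
        problems.append("failing_test is not listed in tests_must_pass")

    scoring = fx.get("scoring", {})
    total = scoring.get("time_weight", 0) + scoring.get("tool_call_weight", 0)
    if abs(total - 1.0) > 1e-9:
        problems.append("scoring weights do not sum to 1.0")
    return problems


# Values copied from the pr-320 fixture in this commit.
TEST_320 = (
    "crates/chio-kernel/src/kernel/signing_task.rs"
    "::test_canonical_receipt_signature_matches_peer_verifiers"
)
example = {
    "arc_commit_before": "b9ab2ba79d3389d6d6bead5cbfcd58fa27486a37",
    "arc_commit_after": "ecfbbe6e5c3eaa9f5cd318893a4eb564c3eea0ca",
    "seed_state": {
        "starting_branch": "main@b9ab2ba79d3389d6d6bead5cbfcd58fa27486a37",
        "failing_test": TEST_320,
    },
    "expected": {
        "patch_signature": "shared canonical receipt signing",
        "tests_must_pass": [TEST_320],
    },
    "scoring": {"time_weight": 0.6, "tool_call_weight": 0.4},
}
```

This only checks internal consistency. It cannot confirm that an
estimated test name exists in arc; that still needs the first runner
run.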

Diversity check: kernel, wasm, sdk, api-protect, data-guards, attenuation,
budgets, eval-receipt — 8 distinct subsystems. PR #592 + #591 are paired
P0 capability-system fixes; both included because they exercise different
invariants (binding vs conservation).
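
The diversity claim can be expressed as a small check. A sketch, using
the PR-to-subsystem labels from this message; the mapping constant and
repeated_subsystems are hypothetical names, not part of chio-pack:

```python
from collections import Counter

# Subsystem label per fixture, as listed in this commit message.
FIXTURE_SUBSYSTEMS = {
    "pr-320": "kernel",
    "pr-357": "wasm",
    "pr-534": "eval-receipt",
    "pr-544": "sdk",
    "pr-550": "data-guards",
    "pr-551": "api-protect",
    "pr-591": "attenuation",
    "pr-592": "budgets",
}


def repeated_subsystems(mapping: dict[str, str]) -> list[str]:
    """Return the subsystem labels that more than one fixture shares."""
    counts = Counter(mapping.values())
    return sorted(name for name, n in counts.items() if n > 1)


distinct = set(FIXTURE_SUBSYSTEMS.values())
```

An empty result from repeated_subsystems means every fixture covers its
own subsystem, which is the property claimed above.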

eval-outcomes report status flips:
  | time-to-first-correct-fix | 8 | BLOCKED — runner | fixtures present;
                                                       runner is Phase 1 |

Fixture count gate cleared (≥ 8). The only remaining block is
chio_pack/eval/runners/time_to_fix.py, which is a Phase 1 deliverable
per PHASE-0.md.
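
The status flip follows from a two-step gate: fixture count first, then
runner availability. A minimal sketch of that ordering; eval_status and
its return shape are assumptions for illustration, not the actual
chio-pack-eval skeleton code, and it does not model the "no fixtures
glob" case shown for repeated-mistake-rate:

```python
from typing import Optional


def eval_status(
    fixture_count: int, required: int, runner_ready: bool = False
) -> tuple[Optional[str], str]:
    """Return (blocker, note) for one eval row.

    blocker is "fixtures", "runner", or None when nothing blocks.
    Hypothetical helper mirroring the dashboard wording.
    """
    if fixture_count < required:
        return ("fixtures", f"have {fixture_count}, need ≥ {required} (PHASE-0.md)")
    if not runner_ready:
        return ("runner", "fixtures present; runner is Phase 1 deliverable")
    return (None, "")


before_commit = eval_status(0, 8)
after_commit = eval_status(8, 8)
```

With the counts from this commit, before_commit is blocked on fixtures
and after_commit is blocked only on the runner.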

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/chio-pack/eval/fixtures/time-to-fix/pr-320-canonical-receipt-signing.yml b/chio-pack/eval/fixtures/time-to-fix/pr-320-canonical-receipt-signing.yml
@@ -0,0 +1,35 @@
+id: pr-320-canonical-receipt-signing-2026-04-29
+title: "Kernel signing task uses non-canonical receipt encoding (signature mismatches across peers)"
+source: "arc#PR-320"
+
+arc_commit_before: "b9ab2ba79d3389d6d6bead5cbfcd58fa27486a37"
+arc_commit_after:  "ecfbbe6e5c3eaa9f5cd318893a4eb564c3eea0ca"
+
+agent:
+  runner: codex-cli
+  model:  claude-opus-4-7
+  budget_seconds:    1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@b9ab2ba79d3389d6d6bead5cbfcd58fa27486a37"
+  failing_test:    "crates/chio-kernel/src/kernel/signing_task.rs::test_canonical_receipt_signature_matches_peer_verifiers"
+
+expected:
+  patch_signature: "shared canonical receipt signing"
+  tests_must_pass:
+    - "crates/chio-kernel/src/kernel/signing_task.rs::test_canonical_receipt_signature_matches_peer_verifiers"
+
+scoring:
+  time_weight:      0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #320 metadata. The fix routes the kernel's
+  signing task through a shared canonical encoder so receipt signatures
+  agree with peer verifiers. Test signature is estimated from the PR
+  title; verify on first runner exercise. Single-file change (1 file,
+  59 additions / 14 deletions) — well-scoped.
diff --git a/chio-pack/eval/fixtures/time-to-fix/pr-357-wasm-tenant-rings.yml b/chio-pack/eval/fixtures/time-to-fix/pr-357-wasm-tenant-rings.yml
@@ -0,0 +1,35 @@
+id: pr-357-wasm-tenant-rings-2026-04-30
+title: "WASM guard runtime grows tenant ring buffers without bound (memory blowup under burst)"
+source: "arc#PR-357"
+
+arc_commit_before: "770cfc71a8596a5309cc4c03f19ff2e8637a9164"
+arc_commit_after:  "1aadbe24d6ba94222fce4a94fa81721fc3d0aa0b"
+
+agent:
+  runner: codex-cli
+  model:  claude-opus-4-7
+  budget_seconds:    1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@770cfc71a8596a5309cc4c03f19ff2e8637a9164"
+  failing_test:    "crates/chio-wasm-guards/src/runtime.rs::test_warm_tenant_ring_bounded_under_burst"
+
+expected:
+  patch_signature: "warm tenant ring"
+  tests_must_pass:
+    - "crates/chio-wasm-guards/src/runtime.rs::test_warm_tenant_ring_bounded_under_burst"
+
+scoring:
+  time_weight:      0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #357 metadata. Single-file change to
+  chio-wasm-guards/src/runtime.rs (245 additions / 17 deletions) bounds
+  the warm-tenant ring buffer. Failure mode at parent commit: bursts
+  cause unbounded memory growth in the wasm runtime. Test name
+  estimated; verify on first runner exercise.
diff --git a/chio-pack/eval/fixtures/time-to-fix/pr-534-eval-receipt-memo-fields.yml b/chio-pack/eval/fixtures/time-to-fix/pr-534-eval-receipt-memo-fields.yml
@@ -0,0 +1,39 @@
+id: pr-534-eval-receipt-memo-fields-2026-05-05
+title: "eval-receipt accepts duplicate and unknown memo signature fields (verifier acceptance bug)"
+source: "arc#PR-534"
+
+arc_commit_before: "47fe8d30c1198bc260a80d43243cdfccecd5777b"
+arc_commit_after:  "6f267d66ddc03cf625fdb86a1aae6b0b6dad0e35"
+
+agent:
+  runner: codex-cli
+  model:  claude-opus-4-7
+  budget_seconds:    1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@47fe8d30c1198bc260a80d43243cdfccecd5777b"
+  failing_test:    "crates/chio-eval-receipt/src/lib.rs::test_duplicate_memo_signature_field_rejected"
+
+expected:
+  patch_signature: "duplicate and unknown memo"
+  tests_must_pass:
+    - "crates/chio-eval-receipt/src/lib.rs::test_duplicate_memo_signature_field_rejected"
+    - "crates/chio-eval-receipt/src/lib.rs::test_unknown_memo_signature_field_rejected"
+    - "crates/chio-cli/tests/federation_policy.rs::test_eval_receipt_rejects_invalid_memo"
+
+scoring:
+  time_weight:      0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #534 metadata. Multi-test fixture covering
+  the eval-receipt verifier's strict-acceptance contract: duplicate
+  memo signature fields and unknown memo signature fields MUST both be
+  rejected. The PR touched 16 files including the CLI federation_policy
+  test, suggesting the bug had broad blast radius. Tests the agent's
+  ability to read the spec for what receipts MUST accept/reject and
+  apply that to all the surfaces (lib + CLI + federation policy).
diff --git a/chio-pack/eval/fixtures/time-to-fix/pr-544-sdk-receipt-verify-errors.yml b/chio-pack/eval/fixtures/time-to-fix/pr-544-sdk-receipt-verify-errors.yml
@@ -0,0 +1,36 @@
+id: pr-544-sdk-receipt-verify-errors-2026-05-04
+title: "TypeScript node-http SDK swallows receipt-verify protocol errors"
+source: "arc#PR-544"
+
+arc_commit_before: "c03239506806dd7efff1eaa5444c9fa3bab774fd"
+arc_commit_after:  "e658f765f5be5c86cf22f789750c6839c7fefe1c"
+
+agent:
+  runner: codex-cli
+  model:  claude-opus-4-7
+  budget_seconds:    1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@c03239506806dd7efff1eaa5444c9fa3bab774fd"
+  failing_test:    "sdks/typescript/packages/node-http/test/sidecar-client.test.ts:surfaces protocol errors from receipt verify"
+
+expected:
+  patch_signature: "receipt-verify protocol errors"
+  tests_must_pass:
+    - "sdks/typescript/packages/node-http/test/sidecar-client.test.ts:surfaces protocol errors from receipt verify"
+
+scoring:
+  time_weight:      0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #544 metadata. The SDK was swallowing
+  receipt-verify protocol errors instead of surfacing them — a real
+  observability bug class. The PR added 265 lines to the SDK test file,
+  suggesting comprehensive failure-mode coverage. Test name is the
+  best-guess Vitest title from the PR description; verify on first
+  runner exercise.
diff --git a/chio-pack/eval/fixtures/time-to-fix/pr-550-data-guards-regex-compile.yml b/chio-pack/eval/fixtures/time-to-fix/pr-550-data-guards-regex-compile.yml
@@ -0,0 +1,37 @@
+id: pr-550-data-guards-regex-compile-2026-05-04
+title: "default redactor swallows regex compile failures, leaks unfiltered data"
+source: "arc#PR-550"
+
+arc_commit_before: "c03239506806dd7efff1eaa5444c9fa3bab774fd"
+arc_commit_after:  "883b45b254f9d3f4729abce29bf5a9a73f793ac9"
+
+agent:
+  runner: codex-cli
+  model:  claude-opus-4-7
+  budget_seconds:    1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@c03239506806dd7efff1eaa5444c9fa3bab774fd"
+  failing_test:    "crates/chio-data-guards/redactors/default/src/lib.rs::test_compile_failure_surfaces_as_redaction_error"
+
+expected:
+  patch_signature: "regex compile"
+  tests_must_pass:
+    - "crates/chio-data-guards/redactors/default/src/lib.rs::test_compile_failure_surfaces_as_redaction_error"
+
+scoring:
+  time_weight:      0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #550 metadata. Single-file change (199
+  additions / 24 deletions). The default data-guard redactor was
+  swallowing regex compile failures silently — a fail-open bug that
+  let unfiltered data through when a user-provided redaction pattern
+  was malformed. Tests the agent's recognition of fail-closed vs
+  fail-open semantics, plus understanding of where redaction errors
+  must be surfaced (immediately, not buffered).
diff --git a/chio-pack/eval/fixtures/time-to-fix/pr-551-api-protect-constant-time.yml b/chio-pack/eval/fixtures/time-to-fix/pr-551-api-protect-constant-time.yml
@@ -0,0 +1,37 @@
+id: pr-551-api-protect-constant-time-2026-05-05
+title: "api-protect compares sidecar bearer tokens with non-constant-time comparison (timing leak)"
+source: "arc#PR-551"
+
+arc_commit_before: "c03239506806dd7efff1eaa5444c9fa3bab774fd"
+arc_commit_after:  "15b283b54e67459eab6938267ff4a44232bcda47"
+
+agent:
+  runner: codex-cli
+  model:  claude-opus-4-7
+  budget_seconds:    1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@c03239506806dd7efff1eaa5444c9fa3bab774fd"
+  failing_test:    "crates/chio-api-protect/src/proxy.rs::test_bearer_token_compare_is_constant_time"
+
+expected:
+  patch_signature: "constant-time"
+  tests_must_pass:
+    - "crates/chio-api-protect/src/proxy.rs::test_bearer_token_compare_is_constant_time"
+
+scoring:
+  time_weight:      0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #551 metadata. Security-sensitive fix:
+  bearer-token comparison must be constant-time to prevent timing
+  side-channel attacks. The test likely uses a statistical timing
+  harness or a mocked compare with timing instrumentation. The fix
+  is small (4 files, mostly the proxy.rs change). Tests the agent's
+  retrieval of constant-time comparison patterns such as the
+  `subtle::ConstantTimeEq` trait.
diff --git a/chio-pack/eval/fixtures/time-to-fix/pr-591-attenuation-proof-unbinding.yml b/chio-pack/eval/fixtures/time-to-fix/pr-591-attenuation-proof-unbinding.yml
@@ -0,0 +1,38 @@
+id: pr-591-attenuation-proof-unbinding-2026-05-05
+title: "attenuation_proof allows parent_scope_hash unbinding — child cap can claim wider scope (P0)"
+source: "arc#PR-591"
+
+arc_commit_before: "5498b0cbe344acc5e108f040f6c7a910f8c96b78"
+arc_commit_after:  "bc964f12725ab8c02c690a60c57e261f5954da2f"
+
+agent:
+  runner: codex-cli
+  model:  claude-opus-4-7
+  budget_seconds:    1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@5498b0cbe344acc5e108f040f6c7a910f8c96b78"
+  failing_test:    "crates/chio-conformance/tests/attenuation_witness_rejects_inflated_parent_scope.rs::test_inflated_parent_scope_rejected"
+
+expected:
+  patch_signature: "parent_scope_hash"
+  tests_must_pass:
+    - "crates/chio-conformance/tests/attenuation_witness_rejects_inflated_parent_scope.rs::test_inflated_parent_scope_rejected"
+
+scoring:
+  time_weight:      0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #591 metadata. **P0 security fix.** The
+  attenuation proof failed to bind to parent_scope_hash, letting a child
+  capability falsely claim a wider scope than its parent. The PR added
+  a new conformance test (183 lines) that asserts inflated parent
+  scopes are rejected. Test fn name guessed; verify on first runner
+  exercise. This is one of the most architecturally important fixtures
+  — exercises the agent's ability to find both the protocol bug and
+  the conformance assertion that pins it.
diff --git a/chio-pack/eval/fixtures/time-to-fix/pr-592-sibling-sum-budget-oversubscription.yml b/chio-pack/eval/fixtures/time-to-fix/pr-592-sibling-sum-budget-oversubscription.yml
@@ -0,0 +1,38 @@
+id: pr-592-sibling-sum-budget-oversubscription-2026-05-05
+title: "Sibling capability budgets allowed to oversubscribe their parent (P0)"
+source: "arc#PR-592"
+
+arc_commit_before: "bc964f12725ab8c02c690a60c57e261f5954da2f"
+arc_commit_after:  "ff9ff73144b7aab9081ad84b8dcf81c7d421c582"
+
+agent:
+  runner: codex-cli
+  model:  claude-opus-4-7
+  budget_seconds:    1800
+  budget_tool_calls: 100
+
+seed_state:
+  starting_branch: "main@bc964f12725ab8c02c690a60c57e261f5954da2f"
+  failing_test:    "crates/chio-conformance/tests/budget_split_rejects_oversubscribed_siblings.rs::test_oversubscribed_sibling_budgets_rejected"
+
+expected:
+  patch_signature: "budget_split"
+  tests_must_pass:
+    - "crates/chio-conformance/tests/budget_split_rejects_oversubscribed_siblings.rs::test_oversubscribed_sibling_budgets_rejected"
+    - "crates/chio-conformance/tests/budget_split_cross_hop_rejects_amplification.rs::test_cross_hop_amplification_rejected"
+
+scoring:
+  time_weight:      0.6
+  tool_call_weight: 0.4
+  zero_score_if:
+    - "any expected test still failing"
+    - "patch deletes any existing test"
+
+notes: |
+  Hand-curated from arc PR #592 metadata. **P0 security fix** following
+  PR #591. Budget split across delegated siblings was failing to enforce
+  that the sum doesn't exceed the parent's budget. Two new conformance
+  tests (209 + 186 lines) cover (a) sibling oversubscription within
+  one hop and (b) cross-hop amplification. Both tests must pass.
+  The agent must recognize this is a budget-conservation invariant
+  ("sibling sum ≤ parent budget") and find the canonical violation site.
diff --git a/vault/_meta/dashboards/eval-outcomes.md b/vault/_meta/dashboards/eval-outcomes.md
@@ -1,10 +1,10 @@
-# Outcome evals — 2026-05-07 19:02 UTC
+# Outcome evals — 2026-05-07 19:05 UTC
 
 > Generated by `chio-pack-eval` Phase 0 skeleton. No real runners yet — see PHASE-0.md.
 
 | Eval | Fixtures | Status | Notes |
 | ---- | -------- | ------ | ----- |
-| `time-to-first-correct-fix` | 0 | BLOCKED — fixtures | have 0, need ≥ 8 (PHASE-0.md) |
+| `time-to-first-correct-fix` | 8 | BLOCKED — runner | fixtures present; runner is Phase 1 deliverable |
 | `repeated-mistake-rate` | 0 | BLOCKED — runner | no fixtures glob; runner is Phase 1 deliverable |
 | `conformance-harness-recall` | 11 | BLOCKED — fixtures | have 11, need ≥ 20 (PHASE-0.md) |
 | `capability-error-explanation` | 0 | BLOCKED — fixtures | have 0, need ≥ 10 (PHASE-0.md) |