chio-pack/eval/fixtures/time-to-fix
1+ id : pr-320-canonical-receipt-signing-2026-04-29
2+ title : " Kernel signing task uses non-canonical receipt encoding (signature mismatches across peers)"
3+ source : " arc#PR-320"
4+
5+ arc_commit_before : " b9ab2ba79d3389d6d6bead5cbfcd58fa27486a37"
6+ arc_commit_after : " ecfbbe6e5c3eaa9f5cd318893a4eb564c3eea0ca"
7+
8+ agent :
9+ runner : codex-cli
10+ model : claude-opus-4-7
11+ budget_seconds : 1800
12+ budget_tool_calls : 100
13+
14+ seed_state :
15+ starting_branch : " main@b9ab2ba79d3389d6d6bead5cbfcd58fa27486a37"
16+ failing_test : " crates/chio-kernel/src/kernel/signing_task.rs::test_canonical_receipt_signature_matches_peer_verifiers"
17+
18+ expected :
19+ patch_signature : " shared canonical receipt signing"
20+ tests_must_pass :
21+ - " crates/chio-kernel/src/kernel/signing_task.rs::test_canonical_receipt_signature_matches_peer_verifiers"
22+
23+ scoring :
24+ time_weight : 0.6
25+ tool_call_weight : 0.4
26+ zero_score_if :
27+ - " any expected test still failing"
28+ - " patch deletes any existing test"
29+
30+ notes : |
31+ Hand-curated from arc PR #320 metadata. The fix routes the kernel's
32+ signing task through a shared canonical encoder so receipt signatures
33+ agree with peer verifiers. Test name is estimated from the PR
34+ title; verify on first runner exercise. Single-file change (59
35+ additions / 14 deletions), so it is well-scoped.
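
The note above says the fix routes signing through a shared canonical encoder. A minimal Python sketch of why that matters follows. It is not the Rust code from the PR: it assumes receipts are JSON-like maps, uses HMAC-SHA256 as a stand-in for the real signature scheme, and the names `canonical_bytes`, `sign`, and the receipt fields are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(receipt: dict) -> bytes:
    # Hypothetical canonical form: sorted keys, no insignificant whitespace, UTF-8.
    return json.dumps(
        receipt, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def sign(receipt: dict, key: bytes) -> str:
    # HMAC-SHA256 is only a stand-in for whatever scheme the kernel really uses.
    return hmac.new(key, canonical_bytes(receipt), hashlib.sha256).hexdigest()


# Two peers build the same receipt with different key order.
peer_a = {"id": "r1", "amount": 5}
peer_b = {"amount": 5, "id": "r1"}

# A naive encoding follows insertion order, so the signed bytes differ.
assert json.dumps(peer_a) != json.dumps(peer_b)

# The canonical encoding is identical, so the signatures agree.
assert sign(peer_a, b"demo-key") == sign(peer_b, b"demo-key")
```

The point of the sketch is that signer and verifier must share one encoder; any second encoding path is a place where signatures can silently diverge.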

1+ id : pr-357-wasm-tenant-rings-2026-04-30
2+ title : " WASM guard runtime grows tenant ring buffers without bound (memory blowup under burst)"
3+ source : " arc#PR-357"
4+
5+ arc_commit_before : " 770cfc71a8596a5309cc4c03f19ff2e8637a9164"
6+ arc_commit_after : " 1aadbe24d6ba94222fce4a94fa81721fc3d0aa0b"
7+
8+ agent :
9+ runner : codex-cli
10+ model : claude-opus-4-7
11+ budget_seconds : 1800
12+ budget_tool_calls : 100
13+
14+ seed_state :
15+ starting_branch : " main@770cfc71a8596a5309cc4c03f19ff2e8637a9164"
16+ failing_test : " crates/chio-wasm-guards/src/runtime.rs::test_warm_tenant_ring_bounded_under_burst"
17+
18+ expected :
19+ patch_signature : " warm tenant ring"
20+ tests_must_pass :
21+ - " crates/chio-wasm-guards/src/runtime.rs::test_warm_tenant_ring_bounded_under_burst"
22+
23+ scoring :
24+ time_weight : 0.6
25+ tool_call_weight : 0.4
26+ zero_score_if :
27+ - " any expected test still failing"
28+ - " patch deletes any existing test"
29+
30+ notes : |
31+ Hand-curated from arc PR #357 metadata. Single-file change to
32+ chio-wasm-guards/src/runtime.rs (245 additions / 17 deletions) bounds
33+ the warm-tenant ring buffer. Failure mode at parent commit: bursts
34+ cause unbounded memory growth in the wasm runtime. Test name
35+ estimated; verify on first runner exercise.
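
As an illustration of what "bounded under burst" means, here is a small Python sketch of a per-tenant ring with a fixed capacity. It is an assumption-laden analogue, not the runtime's Rust code: `TenantRings`, `push`, and `size` are hypothetical names, and the capacity value is arbitrary.

```python
from collections import deque


class TenantRings:
    """Per-tenant ring buffers with a fixed capacity (hypothetical sketch)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._rings = {}

    def push(self, tenant: str, item) -> None:
        # deque(maxlen=...) drops the oldest entry once the ring is full.
        ring = self._rings.get(tenant)
        if ring is None:
            ring = deque(maxlen=self._capacity)
            self._rings[tenant] = ring
        ring.append(item)

    def size(self, tenant: str) -> int:
        return len(self._rings.get(tenant, ()))

    def items(self, tenant: str) -> list:
        return list(self._rings.get(tenant, ()))


rings = TenantRings(capacity=4)
for i in range(1000):  # simulated burst
    rings.push("tenant-a", i)
assert rings.size("tenant-a") == 4
```

This sketch bounds each ring but not the number of tenants; a real runtime would presumably need a limit on both.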

1+ id : pr-534-eval-receipt-memo-fields-2026-05-05
2+ title : " eval-receipt accepts duplicate and unknown memo signature fields (verifier acceptance bug)"
3+ source : " arc#PR-534"
4+
5+ arc_commit_before : " 47fe8d30c1198bc260a80d43243cdfccecd5777b"
6+ arc_commit_after : " 6f267d66ddc03cf625fdb86a1aae6b0b6dad0e35"
7+
8+ agent :
9+ runner : codex-cli
10+ model : claude-opus-4-7
11+ budget_seconds : 1800
12+ budget_tool_calls : 100
13+
14+ seed_state :
15+ starting_branch : " main@47fe8d30c1198bc260a80d43243cdfccecd5777b"
16+ failing_test : " crates/chio-eval-receipt/src/lib.rs::test_duplicate_memo_signature_field_rejected"
17+
18+ expected :
19+ patch_signature : " duplicate and unknown memo"
20+ tests_must_pass :
21+ - " crates/chio-eval-receipt/src/lib.rs::test_duplicate_memo_signature_field_rejected"
22+ - " crates/chio-eval-receipt/src/lib.rs::test_unknown_memo_signature_field_rejected"
23+ - " crates/chio-cli/tests/federation_policy.rs::test_eval_receipt_rejects_invalid_memo"
24+
25+ scoring :
26+ time_weight : 0.6
27+ tool_call_weight : 0.4
28+ zero_score_if :
29+ - " any expected test still failing"
30+ - " patch deletes any existing test"
31+
32+ notes : |
33+ Hand-curated from arc PR #534 metadata. Multi-test fixture covering
34+ the eval-receipt verifier's strict-acceptance contract: duplicate
35+ memo signature fields and unknown memo signature fields MUST both be
36+ rejected. The PR touched 16 files including the CLI federation_policy
37+ test, suggesting the bug had a broad blast radius. Tests the agent's
38+ ability to read the spec for what the verifier MUST accept/reject and
39+ apply that to all the surfaces (lib + CLI + federation policy).
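
The strict-acceptance contract described above (reject duplicates, reject unknowns) can be sketched in Python as follows. This assumes the memo signature is a JSON object, which the source does not state; the field names in `ALLOWED_MEMO_FIELDS`, the `MemoError` type, and `parse_memo_signature` are all hypothetical.

```python
import json

# Hypothetical allow-list; the real field names are not given in the fixture.
ALLOWED_MEMO_FIELDS = {"signer", "signature"}


class MemoError(ValueError):
    """Raised when a memo signature object violates the strict contract."""


def _reject_duplicates(pairs):
    # json.loads hands us every (key, value) pair, including repeated keys.
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise MemoError(f"duplicate field: {key}")
        seen[key] = value
    return seen


def parse_memo_signature(raw: str) -> dict:
    memo = json.loads(raw, object_pairs_hook=_reject_duplicates)
    if not isinstance(memo, dict):
        raise MemoError("memo signature must be an object")
    unknown = set(memo) - ALLOWED_MEMO_FIELDS
    if unknown:
        raise MemoError(f"unknown field(s): {sorted(unknown)}")
    return memo
```

Without the `object_pairs_hook`, a default JSON parse keeps only the last duplicate key, which is exactly the kind of silent acceptance the fixture is meant to catch.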

1+ id : pr-544-sdk-receipt-verify-errors-2026-05-04
2+ title : " TypeScript node-http SDK swallows receipt-verify protocol errors"
3+ source : " arc#PR-544"
4+
5+ arc_commit_before : " c03239506806dd7efff1eaa5444c9fa3bab774fd"
6+ arc_commit_after : " e658f765f5be5c86cf22f789750c6839c7fefe1c"
7+
8+ agent :
9+ runner : codex-cli
10+ model : claude-opus-4-7
11+ budget_seconds : 1800
12+ budget_tool_calls : 100
13+
14+ seed_state :
15+ starting_branch : " main@c03239506806dd7efff1eaa5444c9fa3bab774fd"
16+ failing_test : " sdks/typescript/packages/node-http/test/sidecar-client.test.ts:surfaces protocol errors from receipt verify"
17+
18+ expected :
19+ patch_signature : " receipt-verify protocol errors"
20+ tests_must_pass :
21+ - " sdks/typescript/packages/node-http/test/sidecar-client.test.ts:surfaces protocol errors from receipt verify"
22+
23+ scoring :
24+ time_weight : 0.6
25+ tool_call_weight : 0.4
26+ zero_score_if :
27+ - " any expected test still failing"
28+ - " patch deletes any existing test"
29+
30+ notes : |
31+ Hand-curated from arc PR #544 metadata. The SDK was swallowing
32+ receipt-verify protocol errors instead of surfacing them — a real
33+ observability bug class. The PR added 265 lines to the SDK test file,
34+ suggesting comprehensive failure-mode coverage. Test name is the
35+ best-guess Vitest title from the PR description; verify on first
36+ runner exercise.

1+ id : pr-550-data-guards-regex-compile-2026-05-04
2+ title : " default redactor swallows regex compile failures, leaks unfiltered data"
3+ source : " arc#PR-550"
4+
5+ arc_commit_before : " c03239506806dd7efff1eaa5444c9fa3bab774fd"
6+ arc_commit_after : " 883b45b254f9d3f4729abce29bf5a9a73f793ac9"
7+
8+ agent :
9+ runner : codex-cli
10+ model : claude-opus-4-7
11+ budget_seconds : 1800
12+ budget_tool_calls : 100
13+
14+ seed_state :
15+ starting_branch : " main@c03239506806dd7efff1eaa5444c9fa3bab774fd"
16+ failing_test : " crates/chio-data-guards/redactors/default/src/lib.rs::test_compile_failure_surfaces_as_redaction_error"
17+
18+ expected :
19+ patch_signature : " regex compile"
20+ tests_must_pass :
21+ - " crates/chio-data-guards/redactors/default/src/lib.rs::test_compile_failure_surfaces_as_redaction_error"
22+
23+ scoring :
24+ time_weight : 0.6
25+ tool_call_weight : 0.4
26+ zero_score_if :
27+ - " any expected test still failing"
28+ - " patch deletes any existing test"
29+
30+ notes : |
31+ Hand-curated from arc PR #550 metadata. Single-file change (199
32+ additions / 24 deletions). The default data-guard redactor was
33+ swallowing regex compile failures silently — a fail-open bug that
34+ let unfiltered data through when a user-provided redaction pattern
35+ was malformed. Tests the agent's recognition of fail-closed vs
36+ fail-open semantics, plus understanding of where redaction errors
37+ must be surfaced (immediately, not buffered).
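
To make the fail-closed versus fail-open distinction concrete, here is a short Python sketch. It is not the redactor's Rust implementation; `RedactionError`, `compile_patterns`, and `redact` are hypothetical names, and the replacement marker is arbitrary.

```python
import re


class RedactionError(Exception):
    """Raised when a redaction pattern cannot be compiled."""


def compile_patterns(patterns):
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error as exc:
            # Fail closed: surface the error instead of skipping the pattern.
            raise RedactionError(
                f"invalid redaction pattern {pattern!r}: {exc}"
            ) from exc
    return compiled


def redact(text: str, patterns) -> str:
    # Compile everything before touching the text, so a bad pattern can never
    # result in partially filtered output being returned.
    for regex in compile_patterns(patterns):
        text = regex.sub("[REDACTED]", text)
    return text
```

A fail-open variant would catch `re.error`, skip the pattern, and return the text unfiltered, which matches the leak described in the fixture title.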

1+ id : pr-551-api-protect-constant-time-2026-05-05
2+ title : " api-protect compares sidecar bearer tokens with non-constant-time comparison (timing leak)"
3+ source : " arc#PR-551"
4+
5+ arc_commit_before : " c03239506806dd7efff1eaa5444c9fa3bab774fd"
6+ arc_commit_after : " 15b283b54e67459eab6938267ff4a44232bcda47"
7+
8+ agent :
9+ runner : codex-cli
10+ model : claude-opus-4-7
11+ budget_seconds : 1800
12+ budget_tool_calls : 100
13+
14+ seed_state :
15+ starting_branch : " main@c03239506806dd7efff1eaa5444c9fa3bab774fd"
16+ failing_test : " crates/chio-api-protect/src/proxy.rs::test_bearer_token_compare_is_constant_time"
17+
18+ expected :
19+ patch_signature : " constant-time"
20+ tests_must_pass :
21+ - " crates/chio-api-protect/src/proxy.rs::test_bearer_token_compare_is_constant_time"
22+
23+ scoring :
24+ time_weight : 0.6
25+ tool_call_weight : 0.4
26+ zero_score_if :
27+ - " any expected test still failing"
28+ - " patch deletes any existing test"
29+
30+ notes : |
31+ Hand-curated from arc PR #551 metadata. Security-sensitive fix:
32+ bearer-token comparison must be constant-time to prevent timing
33+ side-channel attacks. The test likely uses a statistical timing
34+ harness or a mocked compare with timing instrumentation. The fix
35+ is small (4 files, mostly the proxy.rs change). Tests the agent's
36+ retrieval of the `subtle::ConstantTimeEq` (`ct_eq`) pattern for
37+ constant-time comparison.
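
For readers unfamiliar with the pattern, a Python analogue of a constant-time token check is sketched below. The fix itself is in Rust; this only illustrates the idea using the standard library's `hmac.compare_digest`, and `token_matches` is a hypothetical helper name.

```python
import hmac


def token_matches(presented: str, expected: str) -> bool:
    # compare_digest is designed so its running time does not depend on where
    # the first differing byte occurs, unlike a plain `==` on strings.
    return hmac.compare_digest(
        presented.encode("utf-8"), expected.encode("utf-8")
    )
```

An ordinary equality check can return as soon as it finds a mismatch, which is what lets an attacker recover a token prefix by timing repeated guesses.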

1+ id : pr-591-attenuation-proof-unbinding-2026-05-05
2+ title : " attenuation_proof allows parent_scope_hash unbinding — child cap can claim wider scope (P0)"
3+ source : " arc#PR-591"
4+
5+ arc_commit_before : " 5498b0cbe344acc5e108f040f6c7a910f8c96b78"
6+ arc_commit_after : " bc964f12725ab8c02c690a60c57e261f5954da2f"
7+
8+ agent :
9+ runner : codex-cli
10+ model : claude-opus-4-7
11+ budget_seconds : 1800
12+ budget_tool_calls : 100
13+
14+ seed_state :
15+ starting_branch : " main@5498b0cbe344acc5e108f040f6c7a910f8c96b78"
16+ failing_test : " crates/chio-conformance/tests/attenuation_witness_rejects_inflated_parent_scope.rs::test_inflated_parent_scope_rejected"
17+
18+ expected :
19+ patch_signature : " parent_scope_hash"
20+ tests_must_pass :
21+ - " crates/chio-conformance/tests/attenuation_witness_rejects_inflated_parent_scope.rs::test_inflated_parent_scope_rejected"
22+
23+ scoring :
24+ time_weight : 0.6
25+ tool_call_weight : 0.4
26+ zero_score_if :
27+ - " any expected test still failing"
28+ - " patch deletes any existing test"
29+
30+ notes : |
31+ Hand-curated from arc PR #591 metadata. **P0 security fix.** The
32+ attenuation proof failed to bind to parent_scope_hash, letting a child
33+ capability falsely claim a wider scope than its parent. The PR added
34+ a new conformance test (183 lines) that asserts inflated parent
35+ scopes are rejected. Test fn name guessed; verify on first runner
36+ exercise. This is one of the most architecturally important fixtures
37+ — exercises the agent's ability to find both the protocol bug and
38+ the conformance assertion that pins it.
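
The binding property described above can be sketched in Python under strong simplifying assumptions: scopes are modeled as sets of strings, the scope hash is a SHA-256 over a sorted newline-joined encoding, and the verifier has the real parent scope at hand. None of this is taken from the PR; `scope_hash` and `verify_attenuation` are hypothetical.

```python
import hashlib
import hmac


def scope_hash(scope) -> str:
    # Illustrative encoding only: sorted entries joined by newlines.
    canonical = "\n".join(sorted(scope)).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def verify_attenuation(parent_scope, child_scope, claimed_parent_scope_hash: str) -> bool:
    # Binding check: the hash claimed in the proof must match the hash of the
    # actual parent scope, not whatever the child chose to present.
    bound = hmac.compare_digest(
        claimed_parent_scope_hash.encode("utf-8"),
        scope_hash(parent_scope).encode("utf-8"),
    )
    # Attenuation check: the child may only narrow the parent's scope.
    narrowed = set(child_scope) <= set(parent_scope)
    return bound and narrowed
```

If the verifier skipped the first check and compared the child only against the claimed hash, a child could supply the hash of an inflated parent scope and pass, which is the failure the conformance test name points at.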

1+ id : pr-592-sibling-sum-budget-oversubscription-2026-05-05
2+ title : " Sibling capability budgets allowed to oversubscribe their parent (P0)"
3+ source : " arc#PR-592"
4+
5+ arc_commit_before : " bc964f12725ab8c02c690a60c57e261f5954da2f"
6+ arc_commit_after : " ff9ff73144b7aab9081ad84b8dcf81c7d421c582"
7+
8+ agent :
9+ runner : codex-cli
10+ model : claude-opus-4-7
11+ budget_seconds : 1800
12+ budget_tool_calls : 100
13+
14+ seed_state :
15+ starting_branch : " main@bc964f12725ab8c02c690a60c57e261f5954da2f"
16+ failing_test : " crates/chio-conformance/tests/budget_split_rejects_oversubscribed_siblings.rs::test_oversubscribed_sibling_budgets_rejected"
17+
18+ expected :
19+ patch_signature : " budget_split"
20+ tests_must_pass :
21+ - " crates/chio-conformance/tests/budget_split_rejects_oversubscribed_siblings.rs::test_oversubscribed_sibling_budgets_rejected"
22+ - " crates/chio-conformance/tests/budget_split_cross_hop_rejects_amplification.rs::test_cross_hop_amplification_rejected"
23+
24+ scoring :
25+ time_weight : 0.6
26+ tool_call_weight : 0.4
27+ zero_score_if :
28+ - " any expected test still failing"
29+ - " patch deletes any existing test"
30+
31+ notes : |
32+ Hand-curated from arc PR #592 metadata. **P0 security fix** following
33+ PR #591. Budget split across delegated siblings was failing to enforce
34+ that the sum doesn't exceed the parent's budget. Two new conformance
35+ tests (209 + 186 lines) cover (a) sibling oversubscription within
36+ one hop and (b) cross-hop amplification. Both tests must pass.
37+ The agent must recognize this is a budget-conservation invariant
38+ ("sum of sibling budgets ≤ parent budget") and find the canonical violation site.
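
The budget-conservation invariant can be written down as a small recursive check. The Python sketch below assumes integer budgets and a simple tree of delegated capabilities; the `Cap` type and `conserves_budget` function are hypothetical and do not come from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class Cap:
    """A capability with a budget and its delegated children (hypothetical)."""

    budget: int
    children: list["Cap"] = field(default_factory=list)


def conserves_budget(cap: Cap) -> bool:
    if cap.budget < 0:
        return False
    # Single-hop rule: siblings together may not exceed their parent.
    if sum(child.budget for child in cap.children) > cap.budget:
        return False
    # Applying the rule at every hop rules out cross-hop amplification too.
    return all(conserves_budget(child) for child in cap.children)
```

For example, `Cap(10, [Cap(6), Cap(5)])` fails the single-hop rule, while `Cap(10, [Cap(5, [Cap(4), Cap(4)])])` passes at the first hop and fails at the second.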

1- # Outcome evals — 2026-05-07 19:02 UTC
1+ # Outcome evals — 2026-05-07 19:05 UTC
2
3 > Generated by `chio-pack-eval` Phase 0 skeleton. No real runners yet — see PHASE-0.md.
4
5 | Eval | Fixtures | Status | Notes |
6 | ---- | -------- | ------ | ----- |
7- | `time-to-first-correct-fix` | 0 | BLOCKED — fixtures | have 0, need ≥ 8 (PHASE-0.md) |
7+ | `time-to-first-correct-fix` | 8 | BLOCKED — runner | fixtures present; runner is Phase 1 deliverable |
8 | `repeated-mistake-rate` | 0 | BLOCKED — runner | no fixtures glob; runner is Phase 1 deliverable |
9 | `conformance-harness-recall` | 11 | BLOCKED — fixtures | have 11, need ≥ 20 (PHASE-0.md) |
10 | `capability-error-explanation` | 0 | BLOCKED — fixtures | have 0, need ≥ 10 (PHASE-0.md) |
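
The status column for the threshold-gated rows appears to follow a simple rule: fixtures first, then runner. A Python sketch of that reading follows. It is an inference from the table, not the `chio-pack-eval` implementation; `blocking_reason` and the `"ready"` value are hypothetical, and the `repeated-mistake-rate` row (no fixtures glob) is outside this rule.

```python
def blocking_reason(fixtures: int, required: int, runner_ready: bool) -> str:
    """Return what blocks an eval: 'fixtures', 'runner', or 'ready' (hypothetical)."""
    if fixtures < required:
        return "fixtures"
    if not runner_ready:
        return "runner"
    return "ready"


# Thresholds taken from the Notes column; no runner exists in Phase 0.
assert blocking_reason(0, 8, runner_ready=False) == "fixtures"   # before this change
assert blocking_reason(8, 8, runner_ready=False) == "runner"     # after this change
assert blocking_reason(11, 20, runner_ready=False) == "fixtures"
```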