feat: ADR 0032 M6 FOLD-T002 — cut production evidence over to the typed-stream fold

johnteee · claude · johnteee · commit 8b280bb47aa5 · 2026-06-14T01:17:40.000+08:00

build_run_evidence_bundle now derives evidence FROM the typed RunEvent stream
(read_run_events_from_audit + build_evidence_from_events), not raw audit dicts.
The typed stream is the production path; the raw-dict assembly survives only as
the shared _assemble_evidence_bundle helper that the fold also calls, so the two
cannot diverge. Every evidence-bearing event type is typed (M2+M3+M5), so the
typed reader is lossless here.
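
A minimal sketch of the routing described above, using toy stand-ins rather
than the real teaagent types. Every name here (ToyEvent, ToyBundle,
toy_assemble, toy_read_typed, toy_fold, toy_build_bundle, KNOWN_TYPES) is
hypothetical, and the assumption that the fold hands dicts back to the shared
helper is for illustration only; the actual hand-off inside
build_evidence_from_events may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Hypothetical toy names throughout; none of these are the teaagent API.
KNOWN_TYPES = {"command_finished", "test_finished"}  # stand-in for RunEventType


@dataclass(frozen=True)
class ToyEvent:
    event_type: str
    payload: Dict[str, Any]
    created_at: Optional[str] = None


@dataclass
class ToyBundle:
    run_id: str
    commands: List[Dict[str, Any]] = field(default_factory=list)


def toy_assemble(events: List[Dict[str, Any]], run_id: str) -> ToyBundle:
    """Shared helper: the single place where evidence is extracted."""
    bundle = ToyBundle(run_id=run_id)
    for event in events:
        if event.get("event_type") == "command_finished":
            bundle.commands.append(
                {"cmd": event["payload"]["cmd"], "at": event.get("created_at")}
            )
    return bundle


def toy_read_typed(events: List[Dict[str, Any]]) -> List[ToyEvent]:
    """Typed reader: keeps only events whose type is in the taxonomy."""
    return [
        ToyEvent(e["event_type"], e.get("payload", {}), e.get("created_at"))
        for e in events
        if e.get("event_type") in KNOWN_TYPES
    ]


def toy_fold(typed: List[ToyEvent], run_id: str) -> ToyBundle:
    """Fold: turns typed events back into dicts and calls the shared helper."""
    as_dicts = [
        {"event_type": t.event_type, "payload": t.payload, "created_at": t.created_at}
        for t in typed
    ]
    return toy_assemble(as_dicts, run_id)


def toy_build_bundle(events: List[Dict[str, Any]], run_id: str) -> ToyBundle:
    """Public builder after the cutover: typed reader first, then the fold."""
    return toy_fold(toy_read_typed(events), run_id)


raw = [
    {"event_type": "command_finished", "payload": {"cmd": "pytest"}, "created_at": "t1"},
    {"event_type": "untyped_noise", "payload": {}},  # no extractor reads this type
]
# Dropping the untyped event changes nothing, because extraction lives in one helper.
assert toy_build_bundle(raw, "r1") == toy_assemble(raw, "r1")
```

Because extraction exists only in the shared helper, the two paths can differ
only in which events reach it, which is the property the typed reader's
coverage (M2+M3+M5) is meant to secure.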

Finding: the plan expected to retire "synthetic receipt-only fixtures masking
real-path gaps", but no such fixtures exist. The receipt/evidence path was
already event-backed (tests/test_run_receipt.py writes real RunStore events;
test_real_run_receipt_completeness_from_plan validates a real run). The direct
RunEvidenceBundle(...) constructions in the suite are legitimate
downstream-consumer/checker unit tests, not gap-masking fixtures. No real-path
gaps surfaced under the cutover.

- teaagent/run_evidence.py: build_run_evidence_bundle routes through the fold.
- tests/test_run_evidence.py: re-anchored the FOLD-T001 parity test against
  _assemble_evidence_bundle (raw-dict path) so it stays non-circular now that
  the public builder itself folds.
- Plan §7 M6 row + FOLD-T002 ticket marked DONE; M6 COMPLETE.
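
Why the re-anchoring matters can be sketched with a toy model. All names below
(assemble, lossy_reader, fold, public_builder) are hypothetical and are not the
teaagent API; the reader is made deliberately lossy, dropping created_at in the
same way the FOLD-T001 gap did, to show which comparison would notice.

```python
from typing import Any, Dict, List, Optional, Tuple

Event = Dict[str, Any]
Evidence = List[Tuple[str, Optional[str]]]


def assemble(events: List[Event]) -> Evidence:
    """Toy raw-dict assembly: the baseline the parity test anchors against."""
    return [(e["type"], e.get("created_at")) for e in events]


def lossy_reader(events: List[Event]) -> List[Event]:
    """Toy typed reader with a deliberate bug: it drops created_at."""
    return [{"type": e["type"]} for e in events]


def fold(typed: List[Event]) -> Evidence:
    """Toy fold: reuses the same assembly on the reader's output."""
    return assemble(typed)


def public_builder(events: List[Event]) -> Evidence:
    """Toy public builder after the cutover: it folds through the reader."""
    return fold(lossy_reader(events))


events = [{"type": "command_finished", "created_at": "2026-06-13T00:00:00Z"}]

# Circular: both sides go through the same lossy reader, so the loss is invisible.
circular_passes = public_builder(events) == fold(lossy_reader(events))

# Anchored: the raw-dict baseline still carries created_at, so the loss shows up.
anchored_passes = assemble(events) == fold(lossy_reader(events))

assert circular_passes is True
assert anchored_passes is False
```

In this toy, only the comparison against the raw-dict baseline fails when the
reader loses data, which is the reason for baselining against
_assemble_evidence_bundle rather than the public builder.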

Constraint: public API unchanged; raw-dict assembly retained as shared helper so fold==assembly is structurally guaranteed; no behavior change observed (lossless typing).
Tested: tests/test_run_evidence.py 15 passed; evidence/receipt/summary/5-min-proof/first-hour/adversarial 47 passed; all bundle consumers (skill/route/completeness/tui-cost/goal/provenance/summary/ws4-observability/conversation-ux/p0-harness) 171 passed; mypy clean.
Not-tested: full suite not run on 3.12 (hypothesis missing in 3.14 sandbox).
Confidence: high
Roadmap-Status: unchanged
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
diff --git a/docs/generated/docs-inventory.md b/docs/generated/docs-inventory.md
@@ -414,7 +414,7 @@ Do not edit this file manually — regenerate instead.
 | `ops/security-hardening.md` | working | 11733 | `0a385c7dab82` |
 | `ops/troubleshooting.md` | working | 9127 | `4921b6d50f5c` |
 | `permission-and-approval-playbook.md` | working | 6560 | `813bc74bb156` |
-| `plans/adr-0032-m1-m6-work-plan-2026-06-13.md` | archive | 56141 | `37d1576baf5a` |
+| `plans/adr-0032-m1-m6-work-plan-2026-06-13.md` | archive | 57704 | `53c78129190f` |
 | `plans/agent-ecosystem-acceptance-roadmap-2026-05-31.md` | archive | 29099 | `7c4a4972cfeb` |
 | `plans/community-pain-points-response-plan-2026-06-05.md` | archive | 7276 | `571d010133ad` |
 | `plans/competitive-positioning-plan-2026-05-31.md` | archive | 8726 | `d16dfd2bdd99` |
diff --git a/docs/plans/adr-0032-m1-m6-work-plan-2026-06-13.md b/docs/plans/adr-0032-m1-m6-work-plan-2026-06-13.md
@@ -178,7 +178,7 @@ consumers by M6.
 | ADR-0032-M3 | Plan gate is an interceptor using `PlanValidator`, landed parity-first (§13.3): a shadow-parity test asserting interceptor==inline per reason code went green before the inline branch was deleted in a separate commit. Denials and reason codes match current behavior; adversarial and first-hour tests remain green. |
 | ADR-0032-M4 (CLOSED — owner decisions B + B-analog, 2026-06-13) | **No gate moves to an interceptor; approval AND budget enforcement both STAY INLINE.** Both proved runtime-stateful on assessment, a poor fit for the pure-interceptor model. **Approval** (decision B): live JIT/session state, tool handler, auto-mode-swappable policy — every coupling gap was invisible to a unit parity test (`docs/work-log/m4-approval-sliceB-blocked-2026-06-13.md`). **Budget** (decision B-analog): it is three mechanisms — only the global cost cap (`_assert_cost_budget`) is stateless; the phase budget (live `phase_tracker`) and the warning ladder (`_budget_warning_levels_emitted` + `BudgetMonitor._emitted_levels`/`_prompted` dedup sets + an interactive `on_prompt` side-effect handler — the same `assert_allowed` shadow-coexistence trap that blocked approval) are stateful, and even the cost cap is enforced at two evolving-cost points per iteration that do not map 1:1 to events (`docs/work-log/m4-budget-stays-inline-2026-06-13.md`). Both gates' observability is already provided by M2 (their audit events — `tool_call_*`, `approval_*`, `budget_warning`, `budget_prompt`, `phase_budget_warning` — are typed + reader-surfaced); the M6 fold reads them without owning enforcement. Approval/budget behavior unchanged. **Net: plan gate (M3) is the sole governance gate moved to an interceptor.** |
 | ADR-0032-M5 (REVISED — observability-only, 2026-06-13) | **Hook OBSERVABILITY folds onto the spine; hook EXECUTION stays in the tool-dispatch layer.** Assessment found the planned "HookRegistry on spine" unsuitable for the same runtime-coupling reason as approval/budget: PreToolUse/PostToolUse run in `teaagent/tools.py::execute` and **mutate in-flight `arguments`/`result`** (the spine has no channel to ferry mutated payloads back to the dispatch site), and the 6 session-lifecycle hooks (SessionStart/End, UserPromptSubmit, PreCompact, Stop, SubagentStop) have **no production caller** — nothing to strangle; wiring them is feature work. Done: the 5 dispatch-layer hook audit events (`tool_hook_pre_mutation`, `tool_hook_pre_mutation_blocked`, `tool_hook_vetoed`, `tool_hook_post_mutation`, `tool_hook_post_failed`) are typed in `RunEventType` + mapped both directions, so the M2-T001 reader surfaces hook veto/mutation activity from the audit JSONL for the M6 fold. Mapping/reader only; audit bytes unchanged; hook execution + mutation semantics unchanged. See `docs/work-log/m5-hooks-observability-only-2026-06-13.md`. |
-| ADR-0032-M6 (was M2 fold; corrected scope A) — **FOLD-T001 DONE** | Evidence and receipts are folded from the typed event stream and equal the legacy builder on success/failure/pending fixtures (cancelled once emitted in M2); the fold reads the full stream (no fallback flag, per Q1); synthetic receipt-only fixtures are retired or relabeled legacy. Runs only after M2 coverage + M3/M4 decision events exist. **FOLD-T001 landed**: `build_evidence_from_events()` is a parallel builder sharing `_assemble_evidence_bundle` with the legacy path (cannot drift; only the event *source* differs), parity-asserted on success/failure/pending fixtures (`tests/test_run_evidence.py::test_m6_fold_*`). Surfaced + fixed a structural gap: the typed `RunEvent` was **lossy** — it dropped the top-level `created_at` that the extractors thread into command/test/approval timestamps; added `RunEvent.created_at` (optional; reader populates it from audit). Legacy stays default. **FOLD-T002 (cutover: switch receipt/evidence default to the fold + retire synthetic fixtures) PENDING** — the behavior-changing slice. |
+| ADR-0032-M6 (was M2 fold; corrected scope A) — **COMPLETE (FOLD-T001 + T002)** | Evidence and receipts are folded from the typed event stream and equal the legacy builder on success/failure/pending fixtures (cancelled once emitted in M2); the fold reads the full stream (no fallback flag, per Q1). **FOLD-T001**: `build_evidence_from_events()` parallel builder sharing `_assemble_evidence_bundle` with the legacy path (cannot drift; only the event *source* differs), parity-asserted (`tests/test_run_evidence.py::test_m6_fold_*`). Fixed a structural gap: the typed `RunEvent` was lossy — dropped top-level `created_at` (threaded into command/test/approval timestamps); added optional `RunEvent.created_at`, reader populates it. **FOLD-T002 (cutover DONE)**: `build_run_evidence_bundle` now routes production evidence THROUGH the typed reader + fold — the typed stream is the production path; the raw-dict assembly survives only as the shared helper (so the two cannot diverge). Suite-wide green (evidence/receipt/summary/5-min-proof/first-hour/adversarial + all bundle consumers, ~218 tests). **Finding: no synthetic receipt-only fixtures existed to retire** — the receipt/evidence path was already event-backed (`test_run_receipt.py` writes real RunStore events; `test_real_run_receipt_completeness_from_plan` validates a real run); direct `RunEvidenceBundle(...)` constructions are legitimate downstream-consumer/checker unit tests, not masking fixtures. The plan anticipated a gap that does not exist. Parity test re-anchored against `_assemble_evidence_bundle` (the raw-dict path) so it stays meaningful post-cutover. |
 | ADR-0032-M7 (was M6) | ContextBus and webhook sinks consume the spine; inline emission paths are deleted; validator shows no orphaned eventing modules. |
 
 ## 8. Task Plan
@@ -805,7 +805,22 @@ commit once Slice A is green.
 - Risk: medium (parity-gated, additive). Parallelizable: no.
   Human Review Required: no.
 
-### ADR32-FOLD-T002: Receipt Fold + Synthetic Fixture Retirement (was ADR32-M2-T003)
+### ADR32-FOLD-T002: Receipt Fold + Synthetic Fixture Retirement (was ADR32-M2-T003) [DONE]
+
+> **DONE (2026-06-13, owner chose "do the cutover now").** `build_run_evidence_bundle`
+> now routes production evidence through `read_run_events_from_audit` +
+> `build_evidence_from_events` — the typed stream is the production path; raw-dict
+> assembly survives only as the shared `_assemble_evidence_bundle` helper. Green
+> suite-wide (~218 tests across evidence/receipt/summary/5-min-proof/first-hour/
+> adversarial + all bundle consumers). **Finding: there were no synthetic
+> receipt-only fixtures to retire** — the receipt/evidence path was already
+> event-backed (`test_run_receipt.py` writes real RunStore events;
+> `test_real_run_receipt_completeness_from_plan` validates a real run). The
+> direct `RunEvidenceBundle(...)` constructions in the suite are legitimate
+> downstream-consumer/checker unit tests, not gap-masking fixtures. The
+> anticipated real-path gaps did not materialize. The FOLD-T001 parity test was
+> re-anchored against `_assemble_evidence_bundle` so it remains non-circular
+> after the cutover.
 
 - Goal: build receipts from the folded evidence and retire synthetic
   receipt-only fixtures that mask real-path gaps.
diff --git a/teaagent/run_evidence.py b/teaagent/run_evidence.py
@@ -798,7 +798,17 @@ def build_run_evidence_bundle(
     except FileNotFoundError:
         return RunEvidenceBundle(run_id=run_id, goal_id=goal_id)
 
-    return _assemble_evidence_bundle(events, root=root, run_id=run_id, goal_id=goal_id)
+    # M6 FOLD-T002 cutover: production evidence is now derived from the TYPED
+    # event stream, not raw audit dicts. Every evidence-bearing audit event is
+    # typed in RunEventType (M2 + M3 + M5), so read_run_events_from_audit is
+    # lossless here; events whose type is not in the taxonomy are not read by any
+    # extractor anyway. The legacy raw-dict assembly is no longer the production
+    # path — it survives only as the shared _assemble_evidence_bundle helper that
+    # the fold also uses, so the two cannot diverge.
+    from teaagent.runner._events import read_run_events_from_audit
+
+    typed = read_run_events_from_audit(events)
+    return build_evidence_from_events(typed, root=root, run_id=run_id, goal_id=goal_id)
 
 
 def build_evidence_from_events(
diff --git a/tests/test_run_evidence.py b/tests/test_run_evidence.py
@@ -322,17 +322,27 @@ def _write_run(root: str, run_id: str, events: list[dict]) -> None:
 
 
 def _assert_fold_matches_legacy(events: list[dict], run_id: str) -> RunEvidenceBundle:
-    """Legacy bundle (RunStore source) must equal the typed-stream fold."""
-    from teaagent.run_evidence import build_evidence_from_events
-    from teaagent.run_store import RunStore
+    """The typed-stream fold must equal the raw-audit-dict assembly.
+
+    Baselines against ``_assemble_evidence_bundle`` (the raw-dict path) rather
+    than ``build_run_evidence_bundle`` — after the M6 FOLD-T002 cutover the
+    public builder itself folds through the typed reader, so comparing it to the
+    fold would be circular. The invariant that matters is that routing raw audit
+    dicts through ``read_run_events_from_audit`` (typed reader) loses no evidence
+    versus assembling directly from those dicts.
+    """
+    from teaagent.run_evidence import (
+        _assemble_evidence_bundle,
+        build_evidence_from_events,
+    )
     from teaagent.runner._events import read_run_events_from_audit
 
     with tempfile.TemporaryDirectory() as root:
-        _write_run(root, run_id, events)
-
-        legacy = build_run_evidence_bundle(root, run_id)
+        # Raw-dict assembly (pre-cutover production path) is the baseline.
+        legacy = _assemble_evidence_bundle(events, root=root, run_id=run_id)
 
-        typed = read_run_events_from_audit(RunStore(root).show_run(run_id))
+        # Typed-stream fold (current production path) must match it.
+        typed = read_run_events_from_audit(events)
         folded = build_evidence_from_events(typed, root=root, run_id=run_id)
 
         assert folded.to_dict() == legacy.to_dict()