fix: address post-migration review findings (F1 hook-fold claim, F2 cutover guard, F3 guard scope)

johnteee · claude · johnteee · commit 57cc614e3c37 · 2026-06-14T10:00:52.000+08:00
Strict "review again" of M5/M6/M7 surfaced three claim-vs-evidence gaps; code was
correct today but claims/guards were overstated or missing.

F2 [Med-High] — the FOLD-T002 cutover (build_run_evidence_bundle routes through
read_run_events_from_audit, which drops unmapped types) introduced an unguarded
losslessness invariant. Added test_m6_every_evidence_extractor_type_is_typed:
enumerates the audit types the evidence/proof-of-use extractors read and asserts
each is in RunEventType, so a future untyped evidence type fails loudly instead
of being silently dropped from production evidence.

F1 [Med] — M5's "hook observability folds into M6" was hollow: no extractor reads
tool_hook_*, and RunEvidenceBundle has no hooks field, so typed hook events never
reach any bundle/receipt. Corrected the claim in the plan §7 M5 row, ADR realized-
architecture M5 bullet, and memory: typing + reader-visibility ONLY; folding hook
activity into receipts is backlog (needs a bundle field + extractor).

F3 [Low-Med] — the M7 orphan-bus guard is a heuristic tripwire, not a proof: it
excludes generic publish/emit, so a RunEventStream-shaped bus (subscribe+emit) is
not detected. Documented the limitation in the validator docstring and the ADR.

No production code behavior changed; this is a guard test + honest doc corrections.

Constraint: docs + one additive guard test; no runtime behavior change; F2 guard is an enumerated-list coupling (loud failure on taxonomy drift), F3 limitation documented not closed.
Tested: tests/test_run_evidence.py + tests/test_event_spine_wiring.py 23 passed (incl. new F2 guard); validate_event_spine_wiring exits 0.
Not-tested: full suite not run on 3.12 (hypothesis missing in 3.14 sandbox).
Confidence: high
Roadmap-Status: unchanged
Co-Authored-By: Claude Fable 5 &lt;noreply@anthropic.com&gt;
diff --git a/docs/adr/0032-run-event-taxonomy.md b/docs/adr/0032-run-event-taxonomy.md
@@ -149,6 +149,10 @@ recorded here as the durable contract (see the per-phase work-logs under
 - **M5** — **hook OBSERVABILITY** is typed onto the spine; **hook EXECUTION stays
   in the tool-dispatch layer** (`teaagent/tools.py`) because PreToolUse/PostToolUse
   mutate in-flight args/results, and the session-lifecycle hooks are unwired.
+  *Scope note (review F1): the 5 hook audit events are typed and reader-visible
+  but NOT folded into evidence — no extractor reads them and `RunEvidenceBundle`
+  has no hooks field. Surfacing hook activity in receipts is backlog (needs a
+  bundle field + extractor), not delivered by M5.*
 - **M6** — **evidence/receipts fold over the typed stream** (`build_evidence_from_events`,
   now the production path inside `build_run_evidence_bundle`). Fixed a real
   lossiness gap (typed `RunEvent` now carries `created_at`).
@@ -173,9 +177,15 @@ EventSpine.emit ──(register_audit_consumer, M1)──▶ AuditLogger.record
   consumer would see only the spine-emitted subset (a coverage regression).
 - `ContextBus` and the integration `RunEventStream` are **unwired in production**;
   they are not lifecycle buses. The guard's allowlist names every sanctioned
-  event-delivery surface, so a **new competing lifecycle-event bus fails the gate**
-  and forces a conscious decision. The taxonomy-closure check proves no
-  `RunEventType` is orphaned from the audit record.
+  event-delivery surface. The taxonomy-closure check proves no `RunEventType` is
+  orphaned from the audit record.
+- *Guard scope (review F3): the orphan-bus check is a **heuristic tripwire**, not
+  a proof. It keys on specific high-signal method names (`register_consumer`,
+  `register_interceptor`, `add_sink`, `on_event`, `publish_delta`,
+  `subscribe_deltas`) and deliberately excludes generic `publish`/`emit` to avoid
+  noise — so a bus shaped like `RunEventStream` (`subscribe`+`emit`) is not
+  detected. It catches the common shapes and forces a conscious allowlist
+  decision for them; it does not guarantee detection of every conceivable bus.*
 
 **Lesson:** the spine's realized value is the **typed read side** (evidence →
 receipts) and a single typed lifecycle path — not wholesale relocation of
diff --git a/docs/generated/docs-inventory.md b/docs/generated/docs-inventory.md
@@ -42,7 +42,7 @@ Do not edit this file manually — regenerate instead.
 | `adr/0029-consensus-validation-deferred.md` | working | 1587 | `8a2da40abc07` |
 | `adr/0030-root-module-freeze.md` | working | 1297 | `bee25422e85f` |
 | `adr/0031-shadow-mode-exit-criteria.md` | working | 3598 | `46a9a0d5eaac` |
-| `adr/0032-run-event-taxonomy.md` | working | 14011 | `50c291b14941` |
+| `adr/0032-run-event-taxonomy.md` | working | 14751 | `259c84511705` |
 | `adr/README.md` | working | 7109 | `713a782f5411` |
 | `agent-contribution-contract.md` | constitution | 5204 | `9c2dad1195d2` |
 | `agent-mode-operator-guide.md` | working | 2778 | `25b258ab7bfe` |
@@ -414,7 +414,7 @@ Do not edit this file manually — regenerate instead.
 | `ops/security-hardening.md` | working | 11733 | `0a385c7dab82` |
 | `ops/troubleshooting.md` | working | 9127 | `4921b6d50f5c` |
 | `permission-and-approval-playbook.md` | working | 6560 | `813bc74bb156` |
-| `plans/adr-0032-m1-m6-work-plan-2026-06-13.md` | archive | 60885 | `70eeb1c1049b` |
+| `plans/adr-0032-m1-m6-work-plan-2026-06-13.md` | archive | 61230 | `54a436ad04b7` |
 | `plans/agent-ecosystem-acceptance-roadmap-2026-05-31.md` | archive | 29099 | `7c4a4972cfeb` |
 | `plans/community-pain-points-response-plan-2026-06-05.md` | archive | 7276 | `571d010133ad` |
 | `plans/competitive-positioning-plan-2026-05-31.md` | archive | 8726 | `d16dfd2bdd99` |
diff --git a/docs/plans/adr-0032-m1-m6-work-plan-2026-06-13.md b/docs/plans/adr-0032-m1-m6-work-plan-2026-06-13.md
@@ -177,7 +177,7 @@ consumers by M6.
 | ADR-0032-M2 (REDEFINED, taxonomy-only §16) | Every audit event the evidence bundle reads is typed in `RunEventType` and mapped both directions, so the M2-T001 reader surfaces it **from the audit JSONL** (mapper is sufficient; emit-site migration is NOT in M2 — it is deferred to the component milestones, §16). Covers routes, git-sandbox, skills, tests, undo, provenance, approval/tool-call decision events, cancelled/pending lifecycle. Pure additive; zero behavior change. (Old M2 "evidence/receipt fold" moved to M6 — §14.) |
 | ADR-0032-M3 | Plan gate is an interceptor using `PlanValidator`, landed parity-first (§13.3): a shadow-parity test asserting interceptor==inline per reason code went green before the inline branch was deleted in a separate commit. Denials and reason codes match current behavior; adversarial and first-hour tests remain green. |
 | ADR-0032-M4 (CLOSED — owner decisions B + B-analog, 2026-06-13) | **No gate moves to an interceptor; approval AND budget enforcement both STAY INLINE.** Both proved runtime-stateful on assessment, a poor fit for the pure-interceptor model. **Approval** (decision B): live JIT/session state, tool handler, auto-mode-swappable policy — every coupling gap was invisible to a unit parity test (`docs/work-log/m4-approval-sliceB-blocked-2026-06-13.md`). **Budget** (decision B-analog): it is three mechanisms — only the global cost cap (`_assert_cost_budget`) is stateless; the phase budget (live `phase_tracker`) and the warning ladder (`_budget_warning_levels_emitted` + `BudgetMonitor._emitted_levels`/`_prompted` dedup sets + an interactive `on_prompt` side-effect handler — the same `assert_allowed` shadow-coexistence trap that blocked approval) are stateful, and even the cost cap is enforced at two evolving-cost points per iteration that do not map 1:1 to events (`docs/work-log/m4-budget-stays-inline-2026-06-13.md`). Both gates' observability is already provided by M2 (their audit events — `tool_call_*`, `approval_*`, `budget_warning`, `budget_prompt`, `phase_budget_warning` — are typed + reader-surfaced); the M6 fold reads them without owning enforcement. Approval/budget behavior unchanged. **Net: plan gate (M3) is the sole governance gate moved to an interceptor.** |
-| ADR-0032-M5 (REVISED — observability-only, 2026-06-13) | **Hook OBSERVABILITY folds onto the spine; hook EXECUTION stays in the tool-dispatch layer.** Assessment found the planned "HookRegistry on spine" unsuitable for the same runtime-coupling reason as approval/budget: PreToolUse/PostToolUse run in `teaagent/tools.py::execute` and **mutate in-flight `arguments`/`result`** (the spine has no channel to ferry mutated payloads back to the dispatch site), and the 6 session-lifecycle hooks (SessionStart/End, UserPromptSubmit, PreCompact, Stop, SubagentStop) have **no production caller** — nothing to strangle; wiring them is feature work. Done: the 5 dispatch-layer hook audit events (`tool_hook_pre_mutation`, `tool_hook_pre_mutation_blocked`, `tool_hook_vetoed`, `tool_hook_post_mutation`, `tool_hook_post_failed`) are typed in `RunEventType` + mapped both directions, so the M2-T001 reader surfaces hook veto/mutation activity from the audit JSONL for the M6 fold. Mapping/reader only; audit bytes unchanged; hook execution + mutation semantics unchanged. See `docs/work-log/m5-hooks-observability-only-2026-06-13.md`. |
+| ADR-0032-M5 (REVISED — observability-only, 2026-06-13) | **Hook OBSERVABILITY folds onto the spine; hook EXECUTION stays in the tool-dispatch layer.** Assessment found the planned "HookRegistry on spine" unsuitable for the same runtime-coupling reason as approval/budget: PreToolUse/PostToolUse run in `teaagent/tools.py::execute` and **mutate in-flight `arguments`/`result`** (the spine has no channel to ferry mutated payloads back to the dispatch site), and the 6 session-lifecycle hooks (SessionStart/End, UserPromptSubmit, PreCompact, Stop, SubagentStop) have **no production caller** — nothing to strangle; wiring them is feature work. Done: the 5 dispatch-layer hook audit events (`tool_hook_pre_mutation`, `tool_hook_pre_mutation_blocked`, `tool_hook_vetoed`, `tool_hook_post_mutation`, `tool_hook_post_failed`) are typed in `RunEventType` + mapped both directions, so the M2-T001 reader can surface them as typed RunEvents. **Correction (post-migration review F1):** this is typing + reader-visibility ONLY — it is NOT yet folded into evidence/receipts. No evidence extractor reads `tool_hook_*` and `RunEvidenceBundle` has no hooks field, so hook veto/mutation activity does not currently appear in any bundle/receipt. Surfacing it would need a new `RunEvidenceBundle` hooks field + extractor (backlog). Mapping/reader only; audit bytes unchanged; hook execution + mutation semantics unchanged. See `docs/work-log/m5-hooks-observability-only-2026-06-13.md`. |
 | ADR-0032-M6 (was M2 fold; corrected scope A) — **COMPLETE (FOLD-T001 + T002)** | Evidence and receipts are folded from the typed event stream and equal the legacy builder on success/failure/pending fixtures (cancelled once emitted in M2); the fold reads the full stream (no fallback flag, per Q1). **FOLD-T001**: `build_evidence_from_events()` parallel builder sharing `_assemble_evidence_bundle` with the legacy path (cannot drift; only the event *source* differs), parity-asserted (`tests/test_run_evidence.py::test_m6_fold_*`). Fixed a structural gap: the typed `RunEvent` was lossy — dropped top-level `created_at` (threaded into command/test/approval timestamps); added optional `RunEvent.created_at`, reader populates it. **FOLD-T002 (cutover DONE)**: `build_run_evidence_bundle` now routes production evidence THROUGH the typed reader + fold — the typed stream is the production path; the raw-dict assembly survives only as the shared helper (so the two cannot diverge). Suite-wide green (evidence/receipt/summary/5-min-proof/first-hour/adversarial + all bundle consumers, ~218 tests). **Finding: no synthetic receipt-only fixtures existed to retire** — the receipt/evidence path was already event-backed (`test_run_receipt.py` writes real RunStore events; `test_real_run_receipt_completeness_from_plan` validates a real run); direct `RunEvidenceBundle(...)` constructions are legitimate downstream-consumer/checker unit tests, not masking fixtures. The plan anticipated a gap that does not exist. Parity test re-anchored against `_assemble_evidence_bundle` (the raw-dict path) so it stays meaningful post-cutover. |
 | ADR-0032-M7 (was M6) — **COMPLETE as guard + document, 2026-06-13** | Original goal ("ContextBus + webhook consume the spine; delete inline eventing") **NOT done — it is a regression or vacuous.** Webhook is an `audit.add_sink` already fed transitively by the M1 spine→audit consumer; a *direct* spine consumer would see only the spine-emitted subset (coverage regression). ContextBus + integration `RunEventStream` are **unwired in production** (no callers) — nothing to migrate. The inline `audit.record` calls are the **complete event record** (read by evidence/receipts/webhook), not redundant eventing to delete. **Done instead (owner: guard + document):** `scripts/validate_event_spine_wiring.py` + `tests/test_event_spine_wiring.py` enforce the realized invariant — one typed lifecycle path (EventSpine→audit consumer), an allowlist of sanctioned event-delivery surfaces so a NEW competing lifecycle bus fails the gate, and taxonomy closure (no RunEventType orphaned from the audit record). Added as a pre-commit hook. ADR 0032 "Realized architecture (M1–M7)" section documents the outcome. **MIGRATION COMPLETE.** |
 
diff --git a/scripts/validate_event_spine_wiring.py b/scripts/validate_event_spine_wiring.py
@@ -26,8 +26,15 @@
      (no orphaned typed event that cannot reach the audit record).
   B. No orphaned event bus — any class exposing a high-signal lifecycle-event
      delivery method must be in the curated allowlist below. A *new* competing
-     lifecycle-event bus fails the gate, forcing a conscious architecture
-     decision rather than silent drift.
+     lifecycle-event bus of a common shape fails the gate, forcing a conscious
+     architecture decision rather than silent drift.
+
+     LIMITATION (heuristic tripwire, not a proof): check B keys on specific
+     method names and deliberately excludes generic ``publish`` / ``emit`` to
+     avoid noise from skill-writer / marketplace / token-streaming. As a result
+     a bus shaped like the integration ``RunEventStream`` (``subscribe`` +
+     ``emit``) is NOT detected. The check catches the common shapes; it does not
+     guarantee detection of every conceivable event bus.
 
 Run: python3 scripts/validate_event_spine_wiring.py
 Exit code 0 when clean, 1 on any violation.
diff --git a/tests/test_run_evidence.py b/tests/test_run_evidence.py
@@ -501,3 +501,55 @@ def test_m6_fold_preserves_created_at_in_command_timestamps():
         assert typed[0].created_at == '2026-06-13T13:00:01+00:00'
         folded = build_evidence_from_events(typed, root=root, run_id='r-ts')
         assert folded.commands_run[0].timestamp == '2026-06-13T13:00:01+00:00'
+
+
+# Audit event types that the evidence/proof-of-use extractors filter on. After
+# the M6 FOLD-T002 cutover, build_run_evidence_bundle routes through
+# read_run_events_from_audit, which DROPS any audit event whose type is not in
+# RunEventType. So production evidence is lossless ONLY IF every type below is
+# typed. This list is the enforced coupling: if you teach an extractor to read a
+# NEW audit event type, add it here — the test will then fail until that type is
+# also added to RunEventType (otherwise the cutover would silently drop it).
+EVIDENCE_EXTRACTOR_AUDIT_TYPES: frozenset[str] = frozenset(
+    {
+        'run_started',
+        'run_failed',
+        'tool_use',
+        'tool_call_started',
+        'tool_call_completed',
+        'tool_error',
+        'test_run',
+        'approval_requested',
+        'approval_granted',
+        'approval_denied',
+        'tool_call_approved',
+        'tool_call_denied',
+        'tool_call_pending_approval',
+        'model_route',
+        'provenance_collected',
+        'skill_activated',
+        'skill_lifecycle_transition',
+        'git_sandbox_started',
+        'git_sandbox_resolved',
+        'undo_applied',
+    }
+)
+
+
+def test_m6_every_evidence_extractor_type_is_typed() -> None:
+    """Guard the FOLD-T002 cutover: every audit type the evidence extractors read
+    must be in RunEventType, or read_run_events_from_audit would silently drop it
+    from production evidence (F2 from the post-migration review).
+    """
+    from teaagent.runner._events import _AUDIT_EVENT_TO_RUN_EVENT_TYPE
+
+    missing = sorted(
+        t
+        for t in EVIDENCE_EXTRACTOR_AUDIT_TYPES
+        if t not in _AUDIT_EVENT_TO_RUN_EVENT_TYPE
+    )
+    assert not missing, (
+        f'evidence extractors read audit types not in RunEventType: {missing}. '
+        f'The M6 cutover would silently drop them — add them to RunEventType + '
+        f'the audit mapper in teaagent/runner/_events.py.'
+    )