refactor(runner): extract budget enforcement to shared governed-execution layer (ADR-0041 G1)

johnteee · johnteee · commit 6ed32b2f004c · 2026-06-30T17:17:44.000+08:00
Behavior-preserving extraction of AgentRunner's per-iteration budget enforcement
into a single shared layer. Risk-reported per the high-risk-paths gate; no
approval/authorization or audit semantics changed.

- New teaagent/runner/_governed_execution.py: GovernedExecutionContext +
  enforce_cost_budget / enforce_phase_budget / enforce_budget_warnings, a verbatim
  move of _core.py's _assert_cost_budget / _check_phase_budget /
  _check_budget_warnings bodies. AgentRunner builds one context at construction and
  its three budget methods now delegate (signatures unchanged).
- scripts/validate_runner_invariants.py: new §1.2 G1 gate
  (_check_governed_execution_authority) requires _core.py to import the layer and
  reference each enforce_* function by name (a source-text check), locking the
  delegation against regression.
- docs/reviews/a-p1-3-governed-execution-risk.md: reflective-risk report.
- ADR-0041 §1.3/§1.5: G1 marked partial (budget delegated; _authorize_tool_call
  approval extraction deferred to its own risk-reported slice); G2/G3 met.
- tests/runner/test_governed_execution.py: unit coverage for the extracted funcs.

Action: A-P1-3
Constraint: behavior-preserving verbatim extraction of budget enforcement only; approval/authorization (_authorize_tool_call) deliberately left inline (larger blast radius, separate report); no audit event added/removed; git-revert rollback
Tested: 93 tests pass across tests/runner/ (incl. TestLiveDifferential + new test_governed_execution) + budget/governance/p0-harness + subagent lineage flow + subagent budget-inheritance integration; validate_runner_invariants.py exit 0 (now gates _core->layer); import-cycle smoke OK; mypy clean on touched files; ruff (uv 0.15.17) clean; validate_docs_consistency.py exit 0
Not-tested: full repo suite not run by me (pre-commit runs the smoke subset + full mypy); _authorize_tool_call extraction not attempted
Confidence: high
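
Note on the G1 gate: as committed, _check_governed_execution_authority tests for
each enforce_* name with a substring check (`s not in source`), which the import
statement alone can satisfy. A stricter variant could walk the AST and look for
actual call sites. The following is a minimal sketch of that idea, not part of
this commit; the helper names called_names and missing_calls are hypothetical.

```python
import ast

# Symbols the gate expects _core.py to call (taken from the commit).
REQUIRED_CALLS = (
    'enforce_cost_budget',
    'enforce_phase_budget',
    'enforce_budget_warnings',
)


def called_names(source: str) -> set[str]:
    """Return the names of every function or attribute that is called."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):
            # Plain call: enforce_cost_budget(...)
            names.add(func.id)
        elif isinstance(func, ast.Attribute):
            # Qualified call: _governed_execution.enforce_cost_budget(...)
            names.add(func.attr)
    return names


def missing_calls(source: str, required: tuple[str, ...] = REQUIRED_CALLS) -> list[str]:
    """List required symbols that are never called in the source."""
    called = called_names(source)
    return [name for name in required if name not in called]
```

With this shape, a file that imports the three functions but never invokes them
reports all three as missing, whereas the substring check would pass it.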
diff --git a/docs/adr/0041-execution-surface-unification-and-harness-thinning.md b/docs/adr/0041-execution-surface-unification-and-harness-thinning.md
@@ -148,7 +148,7 @@ All must pass before Phase 2 starts:
 
 | Gate | Evidence |
 | --- | --- |
-| **G1 — Shared layer exists** | `teaagent/runner/_governed_execution.py` exported; `AgentRunner` delegates budget/approval/audit to it |
+| **G1 — Shared layer exists** | Partial: `teaagent/runner/_governed_execution.py` exported; `AgentRunner` delegates **budget** enforcement (cost/phase/warnings) to it. Approval/authorization (`_authorize_tool_call`) still inline — tracked as a separate slice |
 | **G2 — Subagent path uses layer** | `SubagentManager` constructs context via shared helpers; no parallel `assert_allowed` in `_manager.py` |
 | **G3 — Live differential** | Parametrized test (ADR-0040 §2 follow-up) runs identical tool+budget scenario through `AgentRunner` and `SubagentManager.run_subagent`; `assert_budget_invariant`, `assert_audit_invariant`, `assert_approval_invariant` pass on collected evidence |
 | **G4 — CI green** | `scripts/validate_runner_invariants.py`, full test suite, acceptance tier unchanged |
@@ -181,13 +181,19 @@ Behavior-preserving slices landed (2026-06-30):
   `SubagentManager.run_subagent` (via the `subagent` tool) and asserts
   `assert_audit_invariant`, `assert_audit_events_match`, `assert_approval_invariant`,
   and `assert_budget_invariant` on evidence collected from both surfaces.
-
-Still open for Phase 1: the `_governed_execution.py` module that unifies
-authorization, approval, audit, and full budget *enforcement* (§1.1) — the G1
-extraction from `_core.py` — plus the §1.2 import check for that module. **G1 is
-unmet** (a high-risk `_core.py` refactor requiring a `docs/reviews/*-risk.md`
-report); G3 is met; G2 is met for the budget invariant. Phase 2 is not yet
-authorized.
+- **G1 budget enforcement extracted.** `runner/_governed_execution.py` now owns
+  per-iteration budget enforcement (`enforce_cost_budget` / `enforce_phase_budget`
+  / `enforce_budget_warnings`), a verbatim behavior-preserving move from `_core.py`;
+  `AgentRunner` delegates to it. The §1.2 import gate in
+  `scripts/validate_runner_invariants.py` requires `_core.py` to use the layer.
+  Risk report: `docs/reviews/a-p1-3-governed-execution-risk.md`; unit tests:
+  `tests/runner/test_governed_execution.py`.
+
+Still open for Phase 1 (G1 remainder): the **authorization** dimension —
+`AgentRunner._authorize_tool_call` reassigns `ApprovalPolicy` and calls
+run-summary emission, a larger blast radius deferred to its own reflective-risk
+slice. **G1 is partial** (budget enforcement delegated; approval still inline);
+**G2 and G3 are met**. Phase 2 is not yet authorized.
 
 ### Phase 2 — Domain reasoning migration (harness thinning)
 
diff --git a/docs/generated/docs-inventory.md b/docs/generated/docs-inventory.md
@@ -6,7 +6,7 @@
 Generated by `python3 scripts/generate_docs_inventory.py`.
 Do not edit this file manually — regenerate instead.
 
-**Markdown files:** 611
+**Markdown files:** 612
 
 | Path | Tier | Bytes | SHA256 (12) |
 | --- | --- | ---: | --- |
@@ -44,7 +44,7 @@ Do not edit this file manually — regenerate instead.
 | `adr/0031-shadow-mode-exit-criteria.md` | working | 3598 | `46a9a0d5eaac` |
 | `adr/0032-run-event-taxonomy.md` | working | 16065 | `b9f0c0d7c30a` |
 | `adr/0040-second-framework-invariants.md` | working | 7554 | `00b53102ace3` |
-| `adr/0041-execution-surface-unification-and-harness-thinning.md` | working | 17046 | `163b9b9d4ad6` |
+| `adr/0041-execution-surface-unification-and-harness-thinning.md` | working | 17638 | `74544dd0eba8` |
 | `adr/README.md` | working | 7611 | `c91dd1b63df7` |
 | `agent-contribution-contract.md` | constitution | 5204 | `9c2dad1195d2` |
 | `agent-mode-operator-guide.md` | working | 2778 | `25b258ab7bfe` |
@@ -527,6 +527,7 @@ Do not edit this file manually — regenerate instead.
 | `retrospective/review-system.md` | working | 15180 | `61db6643e4aa` |
 | `retrospective/tool-capability-review.md` | working | 15681 | `20ca7506fc04` |
 | `reviews/a-p0-2-observability-risk.md` | working | 2781 | `8a2295be6857` |
+| `reviews/a-p1-3-governed-execution-risk.md` | working | 4979 | `bcb6e5e1fb88` |
 | `reviews/a-p1-4-approval-migration-risk.md` | working | 6858 | `66927f1fb201` |
 | `reviews/daily-driver-critique-and-counterarguments-2026-06-04.md` | archive | 4904 | `a6f32a23c7e5` |
 | `reviews/daily-driver-docs-package-review-2026-06-02.md` | archive | 2288 | `a180df555135` |
diff --git a/docs/reviews/a-p1-3-governed-execution-risk.md b/docs/reviews/a-p1-3-governed-execution-risk.md
@@ -0,0 +1,119 @@
+# Reflective-Risk Report: Governed-Execution Budget Extraction (A-P1-3)
+
+ADR 0041 Phase 1, gate G1 (budget dimension). Date: 2026-06-30.
+High-risk path touched: `teaagent/runner/_core.py`.
+
+## Goal
+
+Extract the primary runner's per-iteration **budget enforcement** (cost ceiling,
+phase budget, graduated cost warnings) from `teaagent/runner/_core.py` into a
+shared `teaagent/runner/_governed_execution.py` layer that `AgentRunner`
+delegates to, so the budget invariants are defined once and inherited by
+subagents (which execute through `AgentRunner` via `run_chat_agent`).
+Behavior-preserving; no approval/authorization or policy change.
+
+## Stakeholders
+
+Harness maintainers; every agent run (primary + subagent) relies on correct
+budget enforcement; the governance gates of ADR-0009 / ADR-0040.
+
+## Assets at Risk
+
+- Budget-enforcement correctness (overspend protection).
+- Audit event stream shape (`phase_budget_warning`, `budget_warning`,
+  `budget_prompt`, `budget_read_only_suggested`).
+- Run-loop control flow (`BudgetExceededError` / `RunCancelledError` propagation).
+
+## Threat Model
+
+A subtle behavior change during extraction could (a) fail to raise on an
+over-budget run (overspend), (b) change audit event order/fields (breaks audit
+consumers / schema), or (c) alter the exception type so the run loop mishandles
+it.
+
+## Assumption Audit
+
+- ASSUMPTION: `self.budget`, `self.phase_tracker`, `self.audit`,
+  `self._budget_monitor`, and `self._budget_warning_levels_emitted` are never
+  reassigned after `__init__`. VERIFIED by reading `_core.py` — set once; the
+  warning set is mutated in place (`.add`), not reassigned. A context built once
+  holding the same references therefore stays in sync.
+- ASSUMPTION: the three methods read no other mutable `self` state. VERIFIED by
+  reading the bodies — only the five collaborators above.
+
+## Evidence Check
+
+- The extracted functions are a verbatim copy of the original method bodies,
+  parameterized over a `GovernedExecutionContext` of the same collaborators (the
+  diff is a pure move).
+- Method signatures are unchanged; all run-loop call sites are untouched.
+
+## Authority / Tool Boundary
+
+- In scope: `teaagent/runner/_core.py` (3 budget method bodies + one `__init__`
+  context construction), new `teaagent/runner/_governed_execution.py`,
+  `scripts/validate_runner_invariants.py`, tests, ADR.
+- Out of scope (explicitly deferred): `_authorize_tool_call` / approval-policy
+  extraction; sandbox; audit chain; policy semantics.
+
+## Failure Modes
+
+- Import cycle (`runner` <-> new module): mitigated — the module imports only
+  leaf modules (`errors`/`budget`/`budget_monitor`/`phase_tracker`/`audit`);
+  import smoke passes.
+- Context desync if a collaborator is reassigned in future: guarded by the
+  no-reassignment invariant (documented in the module) and the full test suite.
+
+## Worst-case Scenario
+
+A budget check silently stops raising, so a run overspends its cost cap. Bounded
+by the full budget/runner suites and the live differential test, which assert
+enforcement still triggers; a regression fails CI before merge.
+
+## Safe Dry-run Plan
+
+Behavior-preserving pure move, verified offline by running the existing budget,
+runner, governance, and subagent suites (93 passed) plus the live differential
+test and new unit tests — no production run, no external I/O.
+
+## Rollback Plan
+
+`git revert` the commit. The change is additive (one new module) plus three
+method-body delegations and a static gate; reverting restores the inline methods
+exactly. No data migration, no persisted-state change.
+
+## Bounded Execution
+
+Single commit; only the files listed above; no network; no destructive ops;
+verified by local test suites and the runner-invariant gate before commit.
+
+## Audit Log Plan
+
+Audit emission is byte-identical: every `audit.record(...)` call moved verbatim
+into the shared functions. No audit event added or removed.
+
+## Human Review Required
+
+Yes — high-risk path (`teaagent/runner/_core.py`). This report is the
+reflective-risk artifact; the `check-high-risk-paths` pre-commit hook gates the
+commit on its presence.
+
+## Human Approval Gate
+
+Owner authorized the G1 extraction in-session. Budget dimension only; the
+higher-blast-radius `_authorize_tool_call` extraction remains deferred to a
+separate report and review.
+
+## Acceptance Criteria
+
+- `_governed_execution.py` owns budget enforcement; `AgentRunner` delegates to it.
+- All existing budget / runner / governance / subagent tests pass unchanged.
+- `validate_runner_invariants.py` passes and now gates `_core.py` -> shared layer.
+- ruff + mypy clean; the G3 differential test stays green.
+
+## Go / No-go Decision
+
+**GO** for the budget dimension — bounded, behavior-preserving, fully verified,
+trivially reversible. **NO-GO** for bundling `_authorize_tool_call` into this
+change: it reassigns `ApprovalPolicy` and calls run-summary emission, a larger
+blast radius that warrants its own reflective-risk report and review.
diff --git a/scripts/validate_runner_invariants.py b/scripts/validate_runner_invariants.py
@@ -52,6 +52,18 @@
 _BUDGET_CLAMP_SYMBOL = 'compute_clamped_budget'
 _BUDGET_CLAMP_FILES: tuple[str, ...] = ('teaagent/subagents/_manager.py',)
 
+# ADR 0041 Phase 1 (G1): the primary runner must delegate per-iteration budget
+# enforcement to the shared governed-execution layer, not re-implement it inline.
+_GOVERNED_EXEC_IMPORTS: frozenset[str] = frozenset(
+    {'_governed_execution', 'teaagent.runner._governed_execution'}
+)
+_GOVERNED_EXEC_SYMBOLS: tuple[str, ...] = (
+    'enforce_cost_budget',
+    'enforce_phase_budget',
+    'enforce_budget_warnings',
+)
+_GOVERNED_EXEC_FILES: tuple[str, ...] = ('teaagent/runner/_core.py',)
+
 
 def _collect_imports(source: str) -> set[str]:
     tree = ast.parse(source)
@@ -113,12 +125,36 @@ def _check_budget_clamp_authority(rel_path: str) -> list[str]:
     return []
 
 
+def _check_governed_execution_authority(rel_path: str) -> list[str]:
+    """ADR 0041 Phase 1 G1: budget enforcement is delegated to the shared layer."""
+    file_path = _REPO_ROOT / rel_path
+    if not file_path.is_file():
+        return [f'{rel_path}: file not found']
+    source = file_path.read_text(encoding='utf-8')
+    imports = _collect_imports(source)
+    if not (imports & _GOVERNED_EXEC_IMPORTS):
+        return [
+            f'{rel_path}: missing governed-execution import — must import '
+            f'teaagent.runner._governed_execution and delegate budget enforcement '
+            f'instead of re-implementing it inline (ADR 0041 Phase 1, G1)'
+        ]
+    missing = [s for s in _GOVERNED_EXEC_SYMBOLS if s not in source]
+    if missing:
+        return [
+            f'{rel_path}: imports the governed-execution layer but does not call '
+            f'{", ".join(missing)} (ADR 0041 Phase 1, G1)'
+        ]
+    return []
+
+
 def validate() -> list[str]:
     errors: list[str] = []
     for rel_path in _SECOND_FRAMEWORK_FILES:
         errors.extend(_check_file(rel_path))
     for rel_path in _BUDGET_CLAMP_FILES:
         errors.extend(_check_budget_clamp_authority(rel_path))
+    for rel_path in _GOVERNED_EXEC_FILES:
+        errors.extend(_check_governed_execution_authority(rel_path))
     return errors
 
 
diff --git a/teaagent/runner/_core.py b/teaagent/runner/_core.py
@@ -11,7 +11,7 @@
 from teaagent.audit import AuditLogger
 from teaagent.auto_mode import AutoModeConfig
 from teaagent.budget import RunBudget
-from teaagent.budget_monitor import BudgetAction, BudgetMonitor
+from teaagent.budget_monitor import BudgetMonitor
 from teaagent.context import ContextCompactor
 from teaagent.errors import (
     AgentHarnessError,
@@ -49,6 +49,12 @@
 from ._approval_manager import RunnerApprovalCoordinator  # noqa: E402
 from ._auto_mode_manager import AutoModeManager  # noqa: E402
 from ._events import EventSpine, RunEventType, register_audit_consumer  # noqa: E402
+from ._governed_execution import (  # noqa: E402
+    GovernedExecutionContext,
+    enforce_budget_warnings,
+    enforce_cost_budget,
+    enforce_phase_budget,
+)
 from ._plan_validator import PlanGateInterceptor, PlanValidator  # noqa: E402
 from ._types import (  # noqa: E402
     ApprovalHandler,
@@ -166,6 +172,13 @@ def __init__(
         self._budget_warning_levels_emitted: set[int] = set()
         self._budget_prompted = False
         self._compaction_warning_emitted = False
+        self._governed_execution = GovernedExecutionContext(
+            budget=self.budget,
+            phase_tracker=self.phase_tracker,
+            audit=self.audit,
+            budget_monitor=self._budget_monitor,
+            budget_warning_levels_emitted=self._budget_warning_levels_emitted,
+        )
         self.plan_validator = PlanValidator(
             approval_policy=self.approval_policy,
             require_plan=require_plan,
@@ -211,16 +224,7 @@ def __init__(
             self.plan_validator.set_read_only_lint_errors(lint_errors)
 
     def _assert_cost_budget(self, cost_cents: float) -> None:
-        max_cost = self.budget.max_estimated_cost_cents
-        if max_cost is None:
-            return
-        # 0 means zero spend allowed - any positive cost exceeds it
-        if max_cost == 0:
-            if cost_cents > 0:
-                raise BudgetExceededError('cost budget exceeded (zero cap)')
-            return
-        if cost_cents > max_cost:
-            raise BudgetExceededError('cost budget exceeded')
+        enforce_cost_budget(self._governed_execution, cost_cents)
 
     def _read_usage(
         self,
@@ -250,97 +254,18 @@ def _check_phase_budget(
         cost_cents: float,
         tool_calls: int,
     ) -> None:
-        tracker = self.phase_tracker
-        phase = tracker.current_phase
-        pb = self.budget.phase_budget_for(phase)
-
-        phase_iters = tracker.phase_iterations()
-        if phase_iters > pb.max_iterations:
-            self.audit.record(
-                'phase_budget_warning',
-                run_id,
-                phase=phase.value,
-                metric='iterations',
-                current=phase_iters,
-                limit=pb.max_iterations,
-            )
-            raise BudgetExceededError(f'phase {phase.value} iteration budget exceeded')
-
-        phase_tools = tracker.phase_tool_calls()
-        if phase_tools > pb.max_tool_calls:
-            self.audit.record(
-                'phase_budget_warning',
-                run_id,
-                phase=phase.value,
-                metric='tool_calls',
-                current=phase_tools,
-                limit=pb.max_tool_calls,
-            )
-            raise BudgetExceededError(f'phase {phase.value} tool-call budget exceeded')
-
-        phase_cost = tracker.phase_cost_cents(cost_cents)
-        if (
-            pb.max_estimated_cost_cents is not None
-            and phase_cost > pb.max_estimated_cost_cents
-        ):
-            self.audit.record(
-                'phase_budget_warning',
-                run_id,
-                phase=phase.value,
-                metric='cost',
-                current=phase_cost,
-                limit=pb.max_estimated_cost_cents,
-            )
-            raise BudgetExceededError(f'phase {phase.value} cost budget exceeded')
+        enforce_phase_budget(
+            self._governed_execution,
+            run_id=run_id,
+            cost_cents=cost_cents,
+        )
 
     def _check_budget_warnings(self, *, run_id: str, cost_cents: float) -> None:
-        budget_cap = self.budget.max_estimated_cost_cents
-        if budget_cap is None:
-            return
-        # 0 cap is enforced by _assert_cost_budget; no warnings needed
-        if budget_cap == 0:
-            return
-        max_cost = float(budget_cap)
-        percent = (cost_cents / max_cost) * 100.0
-        for level in (50, 80, 90, 100):
-            if percent < level or level in self._budget_warning_levels_emitted:
-                continue
-            self._budget_warning_levels_emitted.add(level)
-
-            action = self._budget_monitor.check_at_threshold(
-                run_id=run_id,
-                cost_cents=cost_cents,
-                threshold=level,
-            )
-
-            self.audit.record(
-                'budget_warning',
-                run_id,
-                level=level,
-                percent=percent,
-                cost_cents=cost_cents,
-                max_cost_cents=max_cost,
-            )
-
-            if action == BudgetAction.PROMPT_CONFIRM:
-                self.audit.record(
-                    'budget_prompt',
-                    run_id,
-                    percent=percent,
-                    cost_cents=cost_cents,
-                    max_cost_cents=max_cost,
-                    approved=False,
-                )
-                raise RunCancelledError('run cancelled: budget at 90%')
-
-            if action == BudgetAction.SUGGEST_READ_ONLY:
-                self.audit.record(
-                    'budget_read_only_suggested',
-                    run_id,
-                    percent=percent,
-                    cost_cents=cost_cents,
-                    max_cost_cents=max_cost,
-                )
+        enforce_budget_warnings(
+            self._governed_execution,
+            run_id=run_id,
+            cost_cents=cost_cents,
+        )
 
     def _check_compaction_warning(
         self, *, context: RunContext, input_tokens: int, output_tokens: int
diff --git a/teaagent/runner/_governed_execution.py b/teaagent/runner/_governed_execution.py
diff --git a/tests/runner/test_governed_execution.py b/tests/runner/test_governed_execution.py
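
The hunks for the two new files above are not shown. As an illustration of the
delegation pattern the commit describes, here is a minimal sketch reconstructed
from the removed `_assert_cost_budget` body and the `__init__` hunk in `_core.py`.
It is an assumption about shape, not the committed module: the real
`GovernedExecutionContext` also carries `phase_tracker`, `audit`, and
`budget_monitor`, and `RunBudget`, `BudgetExceededError`, and `Runner` below are
simplified stand-ins.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


class BudgetExceededError(Exception):
    """Stand-in for teaagent.errors.BudgetExceededError."""


@dataclass
class RunBudget:
    """Stand-in for teaagent.budget.RunBudget (only the field used here)."""

    max_estimated_cost_cents: Optional[float] = None


@dataclass
class GovernedExecutionContext:
    """Hypothetical reduced context: references to the runner's collaborators."""

    budget: RunBudget
    budget_warning_levels_emitted: set[int]


def enforce_cost_budget(ctx: GovernedExecutionContext, cost_cents: float) -> None:
    """Same logic as the removed _assert_cost_budget body, reading from ctx."""
    max_cost = ctx.budget.max_estimated_cost_cents
    if max_cost is None:
        return
    # 0 means zero spend allowed: any positive cost exceeds it.
    if max_cost == 0:
        if cost_cents > 0:
            raise BudgetExceededError('cost budget exceeded (zero cap)')
        return
    if cost_cents > max_cost:
        raise BudgetExceededError('cost budget exceeded')


class Runner:
    """Simplified stand-in for AgentRunner showing the delegation only."""

    def __init__(self, budget: RunBudget) -> None:
        self.budget = budget
        self._budget_warning_levels_emitted: set[int] = set()
        # Built once; stays in sync because the collaborators are never
        # reassigned and the warning set is only mutated in place.
        self._governed_execution = GovernedExecutionContext(
            budget=self.budget,
            budget_warning_levels_emitted=self._budget_warning_levels_emitted,
        )

    def _assert_cost_budget(self, cost_cents: float) -> None:
        enforce_cost_budget(self._governed_execution, cost_cents)
```

The context holds references, not copies, which is why the risk report's
no-reassignment assumption matters: rebinding `self.budget` after construction
would leave the context pointing at the old object.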