test: land ADR-0041 Phase-1 G3 differential test + G2 budget-clamp gate

johnteee · johnteee · commit 1837ae1dc9aa · 2026-06-30T17:02:16.000+08:00
Advances the execution-surface unification (ADR-0040/0041, A-P1-3) with two
low-risk, behavior-preserving slices; no production execution code changed.

- G3 live differential test (tests/runner/test_runner_invariants.py
  ::TestLiveDifferential): runs an identical-shape scenario through AgentRunner
  (direct) and SubagentManager.run_subagent (via the subagent tool), then asserts
  assert_audit_invariant / assert_audit_events_match / assert_approval_invariant /
  assert_budget_invariant on evidence collected from both surfaces. Closes the
  "not yet landed" gap from ADR-0040 §2 / ADR-0041 G3.
- G2-budget static gate (scripts/validate_runner_invariants.py): fails the build
  if subagents/_manager.py stops importing/calling compute_clamped_budget, locking
  the prior commit's clamp delegation against regression.
- ADR-0041 §1.5 updated: G2 (budget) and G3 met; G1 (the _governed_execution.py
  extraction from _core.py) remains the open high-risk piece.

Action: A-P1-3
Constraint: tests + a static import gate + docs only; no change to runner/_core.py or any high-risk path; the G1 _core.py extraction is deliberately deferred to a dedicated risk-reported change
Tested: pytest tests/runner/test_runner_invariants.py (19 passed incl. TestLiveDifferential); scripts/validate_runner_invariants.py exit 0; mypy clean on touched files; ruff (uv 0.15.17) format+check clean; validate_docs_consistency.py exit 0
Not-tested: full repo suite not run by me (pre-commit runs the smoke subset + full mypy)
Confidence: high
diff --git a/docs/adr/0041-execution-surface-unification-and-harness-thinning.md b/docs/adr/0041-execution-surface-unification-and-harness-thinning.md
@@ -166,19 +166,28 @@ If G3 or G4 fails after partial landing:
 
 #### 1.5 Phase 1 progress (2026-06-30)
 
-First behavior-preserving slice landed: the **budget-clamp** rule is now
-single-sourced. `subagents/_manager._resolve_budget_limits` delegates to the
-canonical `runner/_invariants.compute_clamped_budget` instead of re-implementing
-the `min(child, parent)` math (previously a hand-kept mirror). A delegation test
-(`tests/runner/test_runner_invariants.py::TestBudgetInvariant`
-`::test_resolve_budget_limits_delegates_to_canonical_clamp`) asserts equivalence
-across the input matrix, and `scripts/validate_runner_invariants.py` stays green.
-
-Still open for Phase 1: the `_governed_execution.py` module (authorization,
-approval, audit, full budget enforcement) per §1.1, the validate-script import
-check per §1.2, and the live differential test (G3). Gates **G1–G5 remain
-unmet** — this slice advances G2 for the budget invariant only and does not
-authorize starting Phase 2.
+Behavior-preserving slices landed (2026-06-30):
+
+- **Budget clamp single-sourced.** `subagents/_manager._resolve_budget_limits`
+  delegates to canonical `runner/_invariants.compute_clamped_budget` instead of a
+  hand-kept `min(child, parent)` mirror. Delegation test:
+  `tests/runner/test_runner_invariants.py::TestBudgetInvariant`
+  `::test_resolve_budget_limits_delegates_to_canonical_clamp`.
+- **G2-budget static gate.** `scripts/validate_runner_invariants.py` now fails if
+  `_manager.py` stops importing/calling `compute_clamped_budget` (regression lock
+  for the slice above), alongside the existing approval/audit import gates.
+- **G3 live differential test landed.** `TestLiveDifferential` runs an
+  identical-shape scenario through `AgentRunner` (direct) and
+  `SubagentManager.run_subagent` (via the `subagent` tool) and asserts
+  `assert_audit_invariant`, `assert_audit_events_match`, `assert_approval_invariant`,
+  and `assert_budget_invariant` on evidence collected from both surfaces.
+
+Still open for Phase 1: the `_governed_execution.py` module that unifies
+authorization, approval, audit, and full budget *enforcement* (§1.1) — the G1
+extraction from `_core.py` — plus the §1.2 import check for that module. **G1 is
+unmet** (a high-risk `_core.py` refactor requiring a `docs/reviews/*-risk.md`
+report); G3 is met; G2 is met for the budget invariant. Phase 2 is not yet
+authorized.
 
 ### Phase 2 — Domain reasoning migration (harness thinning)
 
diff --git a/docs/generated/docs-inventory.md b/docs/generated/docs-inventory.md
@@ -44,7 +44,7 @@ Do not edit this file manually — regenerate instead.
 | `adr/0031-shadow-mode-exit-criteria.md` | working | 3598 | `46a9a0d5eaac` |
 | `adr/0032-run-event-taxonomy.md` | working | 16065 | `b9f0c0d7c30a` |
 | `adr/0040-second-framework-invariants.md` | working | 7554 | `00b53102ace3` |
-| `adr/0041-execution-surface-unification-and-harness-thinning.md` | working | 16521 | `a4b8d6092d63` |
+| `adr/0041-execution-surface-unification-and-harness-thinning.md` | working | 17046 | `163b9b9d4ad6` |
 | `adr/README.md` | working | 7611 | `c91dd1b63df7` |
 | `agent-contribution-contract.md` | constitution | 5204 | `9c2dad1195d2` |
 | `agent-mode-operator-guide.md` | working | 2778 | `25b258ab7bfe` |
diff --git a/scripts/validate_runner_invariants.py b/scripts/validate_runner_invariants.py
@@ -46,6 +46,12 @@
     }
 )
 
+# ADR 0041 Phase 1 (G2-budget): the subagent path must delegate parent-clamping
+# to the canonical invariant rather than re-implementing min(child, parent).
+_BUDGET_CLAMP_IMPORT = 'teaagent.runner._invariants'
+_BUDGET_CLAMP_SYMBOL = 'compute_clamped_budget'
+_BUDGET_CLAMP_FILES: tuple[str, ...] = ('teaagent/subagents/_manager.py',)
+
 
 def _collect_imports(source: str) -> set[str]:
     tree = ast.parse(source)
@@ -86,10 +92,33 @@ def _check_file(rel_path: str) -> list[str]:
     return errors
 
 
+def _check_budget_clamp_authority(rel_path: str) -> list[str]:
+    """ADR 0041 Phase 1 G2: forbid a re-implemented parent budget clamp."""
+    file_path = _REPO_ROOT / rel_path
+    if not file_path.is_file():
+        return [f'{rel_path}: file not found']
+    source = file_path.read_text(encoding='utf-8')
+    imports = _collect_imports(source)
+    if _BUDGET_CLAMP_IMPORT not in imports:
+        return [
+            f'{rel_path}: missing budget-clamp authority — must import '
+            f'{_BUDGET_CLAMP_IMPORT} and call {_BUDGET_CLAMP_SYMBOL} instead of '
+            f're-implementing parent budget clamping (ADR 0041 Phase 1, G2)'
+        ]
+    if _BUDGET_CLAMP_SYMBOL not in source:
+        return [
+            f'{rel_path}: imports {_BUDGET_CLAMP_IMPORT} but never calls '
+            f'{_BUDGET_CLAMP_SYMBOL} (ADR 0041 Phase 1, G2)'
+        ]
+    return []
+
+
 def validate() -> list[str]:
     errors: list[str] = []
     for rel_path in _SECOND_FRAMEWORK_FILES:
         errors.extend(_check_file(rel_path))
+    for rel_path in _BUDGET_CLAMP_FILES:
+        errors.extend(_check_budget_clamp_authority(rel_path))
     return errors
 
 
diff --git a/tests/runner/test_runner_invariants.py b/tests/runner/test_runner_invariants.py
@@ -22,6 +22,7 @@
 
 import json
 from pathlib import Path
+from typing import Any
 
 import pytest
 from conftest import FakeAdapter
@@ -115,6 +116,14 @@ def _run_primary_path(
     return result, audit
 
 
+def _read_jsonl_events(path: Path) -> list[dict[str, Any]]:
+    return [
+        json.loads(line)
+        for line in path.read_text(encoding='utf-8').splitlines()
+        if line.strip()
+    ]
+
+
 class TestBudgetInvariant:
     def test_primary_budget_is_enforced(self, tmp_path: Path) -> None:
         max_iters = 2
@@ -370,3 +379,81 @@ def test_collect_budget_from_agent_runner(self, tmp_path: Path) -> None:
     def test_collect_budget_rejects_bad_type(self) -> None:
         with pytest.raises(TypeError, match='Unsupported runner type'):
             collect_budget_evidence(42)
+
+
+class TestLiveDifferential:
+    """ADR 0040 §2 / ADR 0041 G3: identical-shape scenarios through both
+    execution surfaces must satisfy the shared audit/budget/approval invariants."""
+
+    def test_primary_and_subagent_paths_satisfy_shared_invariants(
+        self, tmp_path: Path
+    ) -> None:
+        from contextlib import redirect_stdout
+        from io import StringIO
+        from unittest.mock import patch
+
+        from teaagent.cli import main
+
+        # secondary path: SubagentManager.run_subagent via the `subagent` tool
+        (tmp_path / 'README.md').write_text('hello', encoding='utf-8')
+        adapter = FakeAdapter(
+            [
+                '{"type":"tool","tool_name":"subagent","arguments":'
+                '{"task":"inspect README"},"call_id":"sub-1"}',
+                '{"type":"final","content":"child done"}',
+                '{"type":"final","content":"parent done"}',
+            ]
+        )
+        out = StringIO()
+        with (
+            patch('teaagent.cli.create_llm_adapter', return_value=adapter),
+            redirect_stdout(out),
+        ):
+            exit_code = main(
+                [
+                    'agent',
+                    'run',
+                    'gpt',
+                    'delegate inspection',
+                    '--subagent',
+                    '--root',
+                    str(tmp_path),
+                    '--permission-mode',
+                    'allow',
+                ]
+            )
+        payload = json.loads(out.getvalue())
+        assert exit_code == 0
+        assert payload['status'] == 'completed'
+
+        runs = tmp_path / '.teaagent' / 'runs'
+        parent_events = _read_jsonl_events(runs / f'{payload["run_id"]}.jsonl')
+        completed = next(
+            e
+            for e in parent_events
+            if e.get('event_type') == 'tool_call_completed'
+            and e.get('payload', {}).get('tool_name') == 'subagent'
+        )
+        child_run_id = completed['payload']['result']['run_id']
+        secondary_events = [
+            str(e['event_type'])
+            for e in _read_jsonl_events(runs / f'{child_run_id}.jsonl')
+        ]
+
+        # primary path: direct AgentRunner run
+        primary_root = tmp_path / 'primary'
+        primary_root.mkdir()
+        _result, audit = _run_primary_path(primary_root)
+        primary_events = [e.event_type for e in audit.events]
+
+        # both surfaces satisfy the ADR 0040 shared invariants on real evidence
+        assert_audit_invariant(primary_events, secondary_events)
+        assert_audit_events_match(primary_events, secondary_events)
+        assert_approval_invariant([], [])
+        secondary_iters = max(1, secondary_events.count('iteration_started'))
+        assert_budget_invariant(
+            RunnerEvidenceBundle(max_iterations=3, max_tool_calls=3),
+            RunnerEvidenceBundle(
+                max_iterations=secondary_iters, max_tool_calls=secondary_iters
+            ),
+        )