Skip to content

Commit 1837ae1

Browse files
committed
test: land ADR-0041 Phase-1 G3 differential test + G2 budget-clamp gate
Advances the execution-surface unification (ADR-0040/0041, A-P1-3) with two low-risk, behavior-preserving slices; no production execution code changed. - G3 live differential test (tests/runner/test_runner_invariants.py ::TestLiveDifferential): runs an identical-shape scenario through AgentRunner (direct) and SubagentManager.run_subagent (via the subagent tool), then asserts assert_audit_invariant / assert_audit_events_match / assert_approval_invariant / assert_budget_invariant on evidence collected from both surfaces. Closes the "not yet landed" gap from ADR-0040 §2 / ADR-0041 G3. - G2-budget static gate (scripts/validate_runner_invariants.py): fails the build if subagents/_manager.py stops importing/calling compute_clamped_budget, locking the prior commit's clamp delegation against regression. - ADR-0041 §1.5 updated: G2 (budget) and G3 met; G1 (the _governed_execution.py extraction from _core.py) remains the open high-risk piece. Action: A-P1-3 Constraint: tests + a static import gate + docs only; no change to runner/_core.py or any high-risk path; the G1 _core.py extraction is deliberately deferred to a dedicated risk-reported change Tested: pytest tests/runner/test_runner_invariants.py (19 passed incl. TestLiveDifferential); scripts/validate_runner_invariants.py exit 0; mypy clean on touched files; ruff (uv 0.15.17) format+check clean; validate_docs_consistency.py exit 0 Not-tested: full repo suite not run by me (pre-commit runs the smoke subset + full mypy) Confidence: high
1 parent c5f4130 commit 1837ae1

4 files changed

Lines changed: 139 additions & 14 deletions

File tree

docs/adr/0041-execution-surface-unification-and-harness-thinning.md

Lines changed: 22 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -166,19 +166,28 @@ If G3 or G4 fails after partial landing:
166166

167167
#### 1.5 Phase 1 progress (2026-06-30)
168168

169-
First behavior-preserving slice landed: the **budget-clamp** rule is now
170-
single-sourced. `subagents/_manager._resolve_budget_limits` delegates to the
171-
canonical `runner/_invariants.compute_clamped_budget` instead of re-implementing
172-
the `min(child, parent)` math (previously a hand-kept mirror). A delegation test
173-
(`tests/runner/test_runner_invariants.py::TestBudgetInvariant`
174-
`::test_resolve_budget_limits_delegates_to_canonical_clamp`) asserts equivalence
175-
across the input matrix, and `scripts/validate_runner_invariants.py` stays green.
176-
177-
Still open for Phase 1: the `_governed_execution.py` module (authorization,
178-
approval, audit, full budget enforcement) per §1.1, the validate-script import
179-
check per §1.2, and the live differential test (G3). Gates **G1–G5 remain
180-
unmet** — this slice advances G2 for the budget invariant only and does not
181-
authorize starting Phase 2.
169+
Behavior-preserving slices landed (2026-06-30):
170+
171+
- **Budget clamp single-sourced.** `subagents/_manager._resolve_budget_limits`
172+
delegates to canonical `runner/_invariants.compute_clamped_budget` instead of a
173+
hand-kept `min(child, parent)` mirror. Delegation test:
174+
`tests/runner/test_runner_invariants.py::TestBudgetInvariant`
175+
`::test_resolve_budget_limits_delegates_to_canonical_clamp`.
176+
- **G2-budget static gate.** `scripts/validate_runner_invariants.py` now fails if
177+
`_manager.py` stops importing/calling `compute_clamped_budget` (regression lock
178+
for the slice above), alongside the existing approval/audit import gates.
179+
- **G3 live differential test landed.** `TestLiveDifferential` runs an
180+
identical-shape scenario through `AgentRunner` (direct) and
181+
`SubagentManager.run_subagent` (via the `subagent` tool) and asserts
182+
`assert_audit_invariant`, `assert_audit_events_match`, `assert_approval_invariant`,
183+
and `assert_budget_invariant` on evidence collected from both surfaces.
184+
185+
Still open for Phase 1: the `_governed_execution.py` module that unifies
186+
authorization, approval, audit, and full budget *enforcement* (§1.1) — the G1
187+
extraction from `_core.py` — plus the §1.2 import check for that module. **G1 is
188+
unmet** (a high-risk `_core.py` refactor requiring a `docs/reviews/*-risk.md`
189+
report); G3 is met; G2 is met for the budget invariant. Phase 2 is not yet
190+
authorized.
182191

183192
### Phase 2 — Domain reasoning migration (harness thinning)
184193

docs/generated/docs-inventory.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -44,7 +44,7 @@ Do not edit this file manually — regenerate instead.
4444
| `adr/0031-shadow-mode-exit-criteria.md` | working | 3598 | `46a9a0d5eaac` |
4545
| `adr/0032-run-event-taxonomy.md` | working | 16065 | `b9f0c0d7c30a` |
4646
| `adr/0040-second-framework-invariants.md` | working | 7554 | `00b53102ace3` |
47-
| `adr/0041-execution-surface-unification-and-harness-thinning.md` | working | 16521 | `a4b8d6092d63` |
47+
| `adr/0041-execution-surface-unification-and-harness-thinning.md` | working | 17046 | `163b9b9d4ad6` |
4848
| `adr/README.md` | working | 7611 | `c91dd1b63df7` |
4949
| `agent-contribution-contract.md` | constitution | 5204 | `9c2dad1195d2` |
5050
| `agent-mode-operator-guide.md` | working | 2778 | `25b258ab7bfe` |

scripts/validate_runner_invariants.py

Lines changed: 29 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -46,6 +46,12 @@
4646
}
4747
)
4848

49+
# ADR 0041 Phase 1 (G2-budget): the subagent path must delegate parent-clamping
50+
# to the canonical invariant rather than re-implementing min(child, parent).
51+
_BUDGET_CLAMP_IMPORT = 'teaagent.runner._invariants'
52+
_BUDGET_CLAMP_SYMBOL = 'compute_clamped_budget'
53+
_BUDGET_CLAMP_FILES: tuple[str, ...] = ('teaagent/subagents/_manager.py',)
54+
4955

5056
def _collect_imports(source: str) -> set[str]:
5157
tree = ast.parse(source)
@@ -86,10 +92,33 @@ def _check_file(rel_path: str) -> list[str]:
8692
return errors
8793

8894

95+
def _check_budget_clamp_authority(rel_path: str) -> list[str]:
96+
"""ADR 0041 Phase 1 G2: forbid a re-implemented parent budget clamp."""
97+
file_path = _REPO_ROOT / rel_path
98+
if not file_path.is_file():
99+
return [f'{rel_path}: file not found']
100+
source = file_path.read_text(encoding='utf-8')
101+
imports = _collect_imports(source)
102+
if _BUDGET_CLAMP_IMPORT not in imports:
103+
return [
104+
f'{rel_path}: missing budget-clamp authority — must import '
105+
f'{_BUDGET_CLAMP_IMPORT} and call {_BUDGET_CLAMP_SYMBOL} instead of '
106+
f're-implementing parent budget clamping (ADR 0041 Phase 1, G2)'
107+
]
108+
if _BUDGET_CLAMP_SYMBOL not in source:
109+
return [
110+
f'{rel_path}: imports {_BUDGET_CLAMP_IMPORT} but never calls '
111+
f'{_BUDGET_CLAMP_SYMBOL} (ADR 0041 Phase 1, G2)'
112+
]
113+
return []
114+
115+
89116
def validate() -> list[str]:
90117
errors: list[str] = []
91118
for rel_path in _SECOND_FRAMEWORK_FILES:
92119
errors.extend(_check_file(rel_path))
120+
for rel_path in _BUDGET_CLAMP_FILES:
121+
errors.extend(_check_budget_clamp_authority(rel_path))
93122
return errors
94123

95124

tests/runner/test_runner_invariants.py

Lines changed: 87 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -22,6 +22,7 @@
2222

2323
import json
2424
from pathlib import Path
25+
from typing import Any
2526

2627
import pytest
2728
from conftest import FakeAdapter
@@ -115,6 +116,14 @@ def _run_primary_path(
115116
return result, audit
116117

117118

119+
def _read_jsonl_events(path: Path) -> list[dict[str, Any]]:
120+
return [
121+
json.loads(line)
122+
for line in path.read_text(encoding='utf-8').splitlines()
123+
if line.strip()
124+
]
125+
126+
118127
class TestBudgetInvariant:
119128
def test_primary_budget_is_enforced(self, tmp_path: Path) -> None:
120129
max_iters = 2
@@ -370,3 +379,81 @@ def test_collect_budget_from_agent_runner(self, tmp_path: Path) -> None:
370379
def test_collect_budget_rejects_bad_type(self) -> None:
371380
with pytest.raises(TypeError, match='Unsupported runner type'):
372381
collect_budget_evidence(42)
382+
383+
384+
class TestLiveDifferential:
385+
"""ADR 0040 §2 / ADR 0041 G3: identical-shape scenarios through both
386+
execution surfaces must satisfy the shared audit/budget/approval invariants."""
387+
388+
def test_primary_and_subagent_paths_satisfy_shared_invariants(
389+
self, tmp_path: Path
390+
) -> None:
391+
from contextlib import redirect_stdout
392+
from io import StringIO
393+
from unittest.mock import patch
394+
395+
from teaagent.cli import main
396+
397+
# secondary path: SubagentManager.run_subagent via the `subagent` tool
398+
(tmp_path / 'README.md').write_text('hello', encoding='utf-8')
399+
adapter = FakeAdapter(
400+
[
401+
'{"type":"tool","tool_name":"subagent","arguments":'
402+
'{"task":"inspect README"},"call_id":"sub-1"}',
403+
'{"type":"final","content":"child done"}',
404+
'{"type":"final","content":"parent done"}',
405+
]
406+
)
407+
out = StringIO()
408+
with (
409+
patch('teaagent.cli.create_llm_adapter', return_value=adapter),
410+
redirect_stdout(out),
411+
):
412+
exit_code = main(
413+
[
414+
'agent',
415+
'run',
416+
'gpt',
417+
'delegate inspection',
418+
'--subagent',
419+
'--root',
420+
str(tmp_path),
421+
'--permission-mode',
422+
'allow',
423+
]
424+
)
425+
payload = json.loads(out.getvalue())
426+
assert exit_code == 0
427+
assert payload['status'] == 'completed'
428+
429+
runs = tmp_path / '.teaagent' / 'runs'
430+
parent_events = _read_jsonl_events(runs / f'{payload["run_id"]}.jsonl')
431+
completed = next(
432+
e
433+
for e in parent_events
434+
if e.get('event_type') == 'tool_call_completed'
435+
and e.get('payload', {}).get('tool_name') == 'subagent'
436+
)
437+
child_run_id = completed['payload']['result']['run_id']
438+
secondary_events = [
439+
str(e['event_type'])
440+
for e in _read_jsonl_events(runs / f'{child_run_id}.jsonl')
441+
]
442+
443+
# primary path: direct AgentRunner run
444+
primary_root = tmp_path / 'primary'
445+
primary_root.mkdir()
446+
_result, audit = _run_primary_path(primary_root)
447+
primary_events = [e.event_type for e in audit.events]
448+
449+
# both surfaces satisfy the ADR 0040 shared invariants on real evidence
450+
assert_audit_invariant(primary_events, secondary_events)
451+
assert_audit_events_match(primary_events, secondary_events)
452+
assert_approval_invariant([], [])
453+
secondary_iters = max(1, secondary_events.count('iteration_started'))
454+
assert_budget_invariant(
455+
RunnerEvidenceBundle(max_iterations=3, max_tool_calls=3),
456+
RunnerEvidenceBundle(
457+
max_iterations=secondary_iters, max_tool_calls=secondary_iters
458+
),
459+
)

0 commit comments

Comments
 (0)