Commit 64f49f8

docs: re-sequence ADR 0032 plan (B-then-A) — insert evidence-event taxonomy phase, move fold late
Review of the evidence-fold task found it could never meet "event-derived bundle == legacy bundle": the bundle keys on ~20 audit event types but the M0 set + reader surface only ~5, and no milestone typed the non-gate evidence events (routes, git-sandbox, skills, tests, undo, provenance). Deferring the fold alone would not fix it — a taxonomy-coverage phase was missing.

Re-sequenced (owner decision B-then-A):
- M2 REDEFINED to evidence-event taxonomy coverage + dual-emit + byte-equivalence (M2-T001 reader stays; new M2-T002/T003)
- evidence + receipt fold moved to M6 with corrected scope A (full-stream fold, no fallback flag per Q1, parity-gated, cancelled as documented gap until run_cancelled is emitted) as ADR32-FOLD-T001/T002
- old M6 cleanup becomes M7
- §5 graph, §7 exit criteria, and §8 task slices updated to match
- §14 records Intent, the F1 finding, old->new task-ID mapping, a risk table, and the residual risk (fold payoff only arrives at M6; honest fallback is to drop the "structurally derived from events" claim rather than ship a fold that re-wraps untyped data)

Constraint: docs only; no code or behavior change; re-sequencing is plan-of-record, applied to future phases
Tested: docs validator 0 errors after inventory regeneration; git status shows no .py changes
Confidence: high
Roadmap-Status: unchanged
1 parent e3e854f commit 64f49f8

4 files changed

Lines changed: 685 additions & 64 deletions

Lines changed: 278 additions & 0 deletions
@@ -0,0 +1,278 @@
# ADR32-M2-T002: Evidence Bundle Fold (Parity Path)

> **Status**: Draft | **Risk**: High | **Dependencies**: ADR32-M2-T001 (T001 already exists — `read_run_events_from_audit()` at `_events.py:217`) | **Human Review Required**: Yes (parity assertion)

## 1. Goal

Add an alternative `build_run_evidence_bundle_from_events()` that derives `RunEvidenceBundle` from the EventSpine's typed `RunEvent` stream, preserving byte-equivalent output with the legacy `build_run_evidence_bundle()` for all four run outcomes (success, failure, cancelled, pending-approval).

## 2. Why This Is Risky

The evidence bundle is consumed by the receipt builder (`format_run_receipt`) and the 5-minute proof flow. A silent mismatch between legacy and event-derived bundles would produce incorrect receipts or fail completeness checks without raising an error. Per §13.4, the parity assertion must go green **before** the default switches.

## 3. Scope

### In scope

- New function `build_run_evidence_bundle_from_events()` in `teaagent/run_evidence.py`
- Reads typed `RunEvent` sequence (from EventSpine stream or stored event log)
- Falls back to raw audit event dicts for evidence categories not yet representable as RunEvents
- Parity test comparing legacy vs event-derived output for 4 fixtures
- Explicit gap reporting: which evidence categories still use audit fallback
- Feature gate: new path is opt-in via `use_event_stream=True` parameter

### Out of scope

- Adding new RunEventType members (that's M2-T001 scope or deferred to later phases)
- Changing the receipt builder (that's M2-T003)
- Retiring synthetic fixtures (that's M2-T003)
- Switching the default production path

## 4. Data Flow

```
EventSpine (typed RunEvent stream)
        │
        ▼
build_run_evidence_bundle_from_events(events, raw_audit_events)
        │   ├─ For M0 RunEvent types: extract from typed RunEvents
        │   └─ For all other evidence categories: fall back to raw_audit_events
        ▼
RunEvidenceBundle ──► parity test vs ──► build_run_evidence_bundle()
                                          (legacy path)
```

## 5. Functional Requirements

### FR-01: New function signature

```python
from pathlib import Path
from typing import Any


def build_run_evidence_bundle_from_events(
    root: str | Path,
    run_id: str,
    events: list[RunEvent],
    raw_audit_events: list[dict[str, Any]],
    *,
    goal_id: str = '',
) -> RunEvidenceBundle:
    ...  # signature only; the body is outlined in Step 3
```

- `events`: typed RunEvent sequence (from EventSpine)
- `raw_audit_events`: original audit dicts for fallback (from RunStore.show_run())
- Parity path: `use_event_stream=True` flag (optional, default `False` on `build_run_evidence_bundle`)

### FR-02: Evidence extraction matrix

Which evidence categories can be served from typed RunEvents (M0 set) vs require audit fallback:

| Evidence field | M0 RunEvent source | Audit fallback needed? | Notes |
|---|---|---|---|
| `commands_run` | `TOOL_CALL_STARTED.metadata`, `TOOL_CALL_COMPLETED.result` | Partial — `tool_use` events not represented | Only completed calls w/ result; start-of-run calls from audit |
| `tests` | ❌ no RunEvent | **Yes** — `test_run` audit events | Deferred to later M phase |
| `approvals` | ❌ no RunEvent | **Yes** — `approval_*` audit events | Deferred to M3 (approval gate interceptor) |
| `routes` | ❌ no RunEvent | **Yes** — `model_route` audit events | Deferred to later phase |
| `known_gaps` | `RUN_FAILED` | Partial — `tool_error` audit needed | RUN_FAILED gap available; tool_error falls back |
| `git_sandbox` | ❌ no RunEvent | **Yes** — `git_sandbox_*` audit events | Not yet in RunEventType |
| `provenance` | ❌ no RunEvent | **Yes** — `provenance_collected` audit events | Deferred |
| `skill_activations` | ❌ no RunEvent | **Yes** — `skill_lifecycle_transition` audit events | Deferred |
| `undo_evidence` | ❌ no RunEvent | **Yes** — `undo_applied` audit events | Deferred |
| `proof_of_use` | mixed | **Yes** — reads multiple audit event types | Full PoU requires audit |
| `cost_cents / cost_state / budget_cap_cents` | `RUN_COMPLETED.metadata` | Partial — fallback to audit for budget details | RUN_COMPLETED carries final cost if set |
| `context_health` | n/a | n/a | Computed externally, not from events |

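The matrix above can also be kept as data, so that the expected gap list checked by AC-06 is derived rather than hand-written. This is a minimal sketch and not code from `teaagent`: the names `EXTRACTION_MATRIX` and `expected_gap_categories` are hypothetical, the three cost fields are collapsed into one `cost` key for brevity, and treating only the full-fallback rows as gaps (leaving the "Partial" rows out) is an assumption this spec does not settle.

```python
# Hypothetical encoding of the FR-02 matrix. 'partial' rows are partly served
# from typed RunEvents, 'audit' rows fall back entirely, and 'external' rows
# are computed outside the event stream.
EXTRACTION_MATRIX: dict[str, str] = {
    'commands_run': 'partial',
    'tests': 'audit',
    'approvals': 'audit',
    'routes': 'audit',
    'known_gaps': 'partial',
    'git_sandbox': 'audit',
    'provenance': 'audit',
    'skill_activations': 'audit',
    'undo_evidence': 'audit',
    'proof_of_use': 'audit',
    'cost': 'partial',
    'context_health': 'external',
}


def expected_gap_categories(matrix: dict[str, str]) -> list[str]:
    """Return the fields that rely fully on audit fallback, in matrix order."""
    return [name for name, source in matrix.items() if source == 'audit']


print(expected_gap_categories(EXTRACTION_MATRIX))
```

If partial rows should also count as gaps, the filter would change to `source in ('audit', 'partial')`; that choice belongs to the parity review.
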
### FR-03: Gap reporting

The function MUST return a `gap_categories: list[str]` alongside the bundle, listing which evidence categories fell back to audit:

```python
from dataclasses import dataclass


@dataclass
class EventDerivedEvidenceResult:
    bundle: RunEvidenceBundle
    gap_categories: list[str]  # e.g. ['tests', 'approvals', 'routes', ...]
```

The gated entry point `build_run_evidence_bundle(..., use_event_stream=True)` still returns `RunEvidenceBundle`, so the public return type does not change; gap categories are logged via audit.

### FR-04: Feature gate

`build_run_evidence_bundle()` gets a new optional parameter:

```python
def build_run_evidence_bundle(
    root: str | Path,
    run_id: str,
    *,
    goal_id: str = '',
    use_event_stream: bool = False,  # NEW
) -> RunEvidenceBundle:
    ...  # signature only
```

- `False` (default): existing legacy path, unchanged
- `True`: calls `build_run_evidence_bundle_from_events()`, uses RunEvents where possible, falls back to audit

## 6. Acceptance Criteria

| # | Criterion | Verification |
|---|---|---|
| AC-01 | `build_run_evidence_bundle(use_event_stream=True)` returns same output as `build_run_evidence_bundle()` for **success** fixture | Parity test: `to_dict()` byte equality after normalization |
| AC-02 | Same for **failure** fixture | Same |
| AC-03 | Same for **cancelled** fixture | Same |
| AC-04 | Same for **pending-approval** fixture | Same |
| AC-05 | `check_evidence_completeness()` passes on event-derived bundle for all 4 outcomes | Completeness check returns empty list |
| AC-06 | Gap categories are non-empty for evidence types not yet in RunEventType | Gap list includes `['tests', 'approvals', 'routes', ...]` |
| AC-07 | `use_event_stream=False` (default) produces identical output to current code | No behavioral regression |

## 7. Tests

| Test file | New/Existing | Description |
|---|---|---|
| `tests/parity/test_evidence_fold_parity.py` | **NEW** | Golden parity: run legacy vs event-derived for 4 outcomes, compare `to_dict()` output |
| `tests/test_run_evidence.py` | Extend | Add test for `build_run_evidence_bundle_from_events()` with known event fixtures |
| `tests/lifecycle/test_run_event_spine.py` | Extend | Add fixture for each outcome type (success/failure/cancelled/pending) that produces both audit JSONL and RunEvent stream |

## 8. Fixture Strategy

For each of the 4 outcomes, create a fixture function that generates BOTH:
1. Raw audit event list (for legacy `build_run_evidence_bundle()`)
2. Typed RunEvent list (for new `build_run_evidence_bundle_from_events()`)

The fixture must produce equivalent outputs from both paths.

**Fixture location**: `tests/parity/fixtures.py` (or inline in parity test file).

A helper `_run_events_from_audit(audit_events: list[dict]) -> list[RunEvent]` converts the existing M1 golden fixture events to a `RunEvent` sequence where a mapping exists.

## 9. Implementation Steps

### Step 1: Fixture scaffold

Create `tests/parity/` directory with:
- `test_evidence_fold_parity.py` — the parity test harness
- `conftest.py` — 4 outcome fixture generators

### Step 2: `EventDerivedEvidenceResult` dataclass

Add to `teaagent/run_evidence.py`:
- `gap_categories` field
- Internal `_GapTracker` helper class accumulates which extractors fell back

### Step 3: `build_run_evidence_bundle_from_events()`

New function in `teaagent/run_evidence.py`. Structure:

```python
def build_run_evidence_bundle_from_events(
    root, run_id, events, raw_audit_events, *, goal_id=''
) -> RunEvidenceBundle:
    gap_tracker = _GapTracker()

    # Extract from typed RunEvents where possible
    commands = _extract_commands_from_events(events, gap_tracker)
    known_gaps = _extract_gaps_from_events(events, gap_tracker)
    cost_cents, cost_state, budget_cap_cents = _extract_cost_from_events(events, gap_tracker)

    # Fall back to audit for everything else
    tests = extract_tests(raw_audit_events)  # gap: 'tests'
    approvals = extract_approvals(raw_audit_events)  # gap: 'approvals'
    routes = extract_routes(raw_audit_events)  # gap: 'routes'
    git_sandbox = extract_git_sandbox(raw_audit_events)  # gap: 'git_sandbox'
    provenance = extract_provenance(raw_audit_events)  # gap: 'provenance'
    skill_activations = extract_skill_activations(raw_audit_events)  # gap: 'skills'
    undo_mechanism, undo_outcome = _extract_undo_evidence(raw_audit_events)  # gap: 'undo'
    proof_of_use = build_proof_of_use(raw_audit_events, '')

    # ... assemble bundle ...
```

### Step 4: Feature gate on `build_run_evidence_bundle()`

Add `use_event_stream: bool = False` parameter. When `True`, calls the new function.

### Step 5: Parity tests

Each fixture:
1. Generates both raw audit events and typed RunEvents
2. Calls legacy path → gets `RunEvidenceBundle` A
3. Calls event-derived path → gets `RunEvidenceBundle` B
4. Asserts `A.to_dict() == B.to_dict()` (after normalizing timestamps/metadata)
5. Calls `check_evidence_completeness(B, events, outcome_status)` → asserts passes

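The normalization mentioned in item 4 is not spelled out in this spec. One plausible shape, sketched here under that caveat, drops volatile keys recursively and then compares canonical JSON bytes. The key names in `VOLATILE_KEYS`, the helper names, and the sample values are assumptions; plain dicts stand in for `RunEvidenceBundle.to_dict()` output.

```python
import json
from typing import Any

# Assumed names of fields that may legitimately differ between the two paths.
VOLATILE_KEYS = frozenset({'timestamp', 'generated_at'})


def normalize(value: Any) -> Any:
    """Recursively drop volatile keys so only stable content is compared."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        return [normalize(item) for item in value]
    return value


def canonical_bytes(bundle_dict: dict[str, Any]) -> bytes:
    """Serialize with sorted keys so dict ordering cannot affect equality."""
    text = json.dumps(normalize(bundle_dict), sort_keys=True, separators=(',', ':'))
    return text.encode('utf-8')


legacy = {
    'run_id': 'r1',
    'commands_run': [{'cmd': 'pytest', 'timestamp': '2024-01-01T00:00:00Z'}],
}
derived = {
    'commands_run': [{'timestamp': '2024-01-01T00:00:05Z', 'cmd': 'pytest'}],
    'run_id': 'r1',
}

assert canonical_bytes(legacy) == canonical_bytes(derived)
```

Sorting keys makes the comparison insensitive to dict ordering but not to list ordering, so a difference in the order of `commands_run` entries would still surface as a parity failure.
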
### Step 6: Known gap test

Assert that `gap_categories` output includes the expected set of categories that still rely on audit fallback.

## 10. Edge Cases & Failure Modes

| Edge case | Expected behavior |
|---|---|
| No typed RunEvents available (empty list) | Full fallback to audit, no data loss |
| Partial RunEvent stream (missing events) | Extract what's available, gap-track missing |
| RunEvent metadata fields missing | Default to None/empty, same as legacy extractors |
| Raw audit events also missing | Empty bundle with `run_id` only |
| Mismatched ordering between RunEvent and audit streams | Sort both by timestamp before extraction |

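The last row can be sketched as a stable sort applied to each stream before extraction. The `timestamp` key and its ISO 8601 string format are assumptions, since this document does not define the event schema; dicts stand in for both audit entries and typed events.

```python
from datetime import datetime
from typing import Any


def sort_by_timestamp(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return entries ordered by their assumed ISO 8601 'timestamp' field.

    sorted() is stable, so entries sharing a timestamp keep their input order.
    """
    return sorted(entries, key=lambda entry: datetime.fromisoformat(entry['timestamp']))


audit_stream = [
    {'timestamp': '2024-01-01T00:00:02+00:00', 'event_type': 'run_completed'},
    {'timestamp': '2024-01-01T00:00:01+00:00', 'event_type': 'tool_use'},
    {'timestamp': '2024-01-01T00:00:01+00:00', 'event_type': 'test_run'},
]
ordered = sort_by_timestamp(audit_stream)
print([entry['event_type'] for entry in ordered])
```

Stability matters for parity: two events with the same timestamp must come out in the same relative order on both paths, or the normalized bundles can differ in list order.
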
## 11. M0 RunEvent Type Gap: RUN_CANCELLED and RUN_PAUSED

The `evidence_completeness_checklist()` expects these audit event types for two outcomes:

| Outcome | Required event | In M0 RunEventType? |
|---|---|---|
| cancelled | `event:run_cancelled` | ❌ Not in M0 set |
| pending_approval | `event:run_paused` | ❌ Not in M0 set |

This means **the `check_evidence_completeness()` call on the event-derived path will fail** for cancelled/pending-approval outcomes unless we:

**Option A** (Recommended): Add `RUN_CANCELLED` and `RUN_PAUSED` to `RunEventType` as part of T002. These are minimal additions with no interceptor/consumer changes — just enum members and audit mapping entries. `read_run_events_from_audit()` will convert them automatically once mapped.

```python
# In RunEventType (add to M0 section, they're essential for evidence completeness):
RUN_CANCELLED = 'run_cancelled'
RUN_PAUSED = 'run_paused'

# In _RUN_EVENT_TO_AUDIT_EVENT_TYPE:
RunEventType.RUN_CANCELLED: 'run_cancelled',
RunEventType.RUN_PAUSED: 'run_paused',
```

**Option B**: Modify `check_evidence_completeness()` to accept an event-derived flag that skips these two event-type checks. **Not recommended** — this creates divergent behavior between legacy and event-derived paths.

**Decision required before implementation.**

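To make the failure mode concrete, the sketch below models the checklist as a map from outcome to required event type. Only the two rows in the table above come from this document. The function name `missing_required_events`, the bare string form of the event types, and the sample event sets are hypothetical, and the real `check_evidence_completeness()` takes different arguments.

```python
# Required terminal event per outcome, taken from the table in this section.
REQUIRED_EVENT_BY_OUTCOME: dict[str, str] = {
    'cancelled': 'run_cancelled',
    'pending_approval': 'run_paused',
}


def missing_required_events(outcome: str, seen_event_types: set[str]) -> list[str]:
    """Return required event types absent from the stream (empty means pass)."""
    required = REQUIRED_EVENT_BY_OUTCOME.get(outcome)
    if required is None or required in seen_event_types:
        return []
    return [required]


# Typed stream without a cancelled event: the check reports what is missing.
print(missing_required_events('cancelled', {'tool_call_started', 'run_failed'}))

# With Option A applied, the mapped event is present and the check passes.
print(missing_required_events('cancelled', {'tool_call_started', 'run_cancelled'}))
```

Under Option A the same check serves both paths; Option B would need a second branch here, which is the divergence the recommendation warns about.
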
## 12. M2-T001 Dependency (Already Done)

T001 (`read_run_events_from_audit()`) is already implemented at `teaagent/runner/_events.py:217`. It reads audit JSONL entries and converts mapped event types to typed `RunEvent` objects, skipping unmapped legacy events. The new function should call:

```python
from teaagent.runner._events import read_run_events_from_audit

audit_entries = RunStore(root).show_run(run_id)  # list[dict]
typed_events = read_run_events_from_audit(audit_entries)
```

Event-derived extraction then operates on `typed_events` for supported types and falls back to `audit_entries` for the rest.

## 13. User Review Checklist (Parity Review)

When reviewing the T002 implementation, verify:

- [ ] **FR-01**: `build_run_evidence_bundle_from_events()` accepts both typed events and raw audit events
- [ ] **FR-02**: The evidence extraction matrix is followed — commands_run partial, known_gaps partial, cost partial from RunEvents; everything else from audit fallback
- [ ] **FR-03**: Gap categories are reported explicitly (log or audit event)
- [ ] **FR-04**: `use_event_stream=False` is still the default (no behavior change yet)
- [ ] **AC-01–04**: Parity tests pass for all 4 outcomes — `to_dict()` byte equality after timestamp/metadata normalization
- [ ] **AC-05**: `check_evidence_completeness()` passes on event-derived bundle
- [ ] **AC-06**: Gap categories are non-empty and correct
- [ ] **AC-07**: Legacy path unchanged
- [ ] **M0 gap**: RUN_CANCELLED and RUN_PAUSED decision resolved (Option A or B)
- [ ] All existing evidence tests still pass
- [ ] Pre-commit hooks pass (ruff, mypy)

## 14. Files Touched

- `teaagent/run_evidence.py` — new function, feature gate, gap dataclass
- `tests/parity/test_evidence_fold_parity.py` — parity tests
- `tests/parity/conftest.py` — fixture generators (or inline in parity test)
- `tests/test_run_evidence.py` — extend with event-derived path tests
- `tests/lifecycle/test_run_event_spine.py` — extend with 4-outcome fixtures

## 15. Definition of Done

- [ ] `build_run_evidence_bundle_from_events()` implemented
- [ ] Feature gate `use_event_stream` parameter added
- [ ] Parity tests pass for all 4 outcomes (legacy == event-derived)
- [ ] `check_evidence_completeness()` passes for event-derived bundles
- [ ] Gap categories reported correctly
- [ ] All existing evidence tests still pass
- [ ] Pre-commit hooks pass (ruff, mypy)