
Commit ef25605

johnteee and cursoragent committed
feat: wire release eval gate and conversational corpus (Sprint 3)
Add offline prompt + conversational eval corpora, release gate orchestration, CI/release workflow integration, and seeded failure verification for WDA-004 and WDD-001.

Constraint: offline scoring only; no live model calls in gate
Tested: test_release_eval_gate.py; run_release_eval_gate.py; validate_wiring.py
Confidence: high
Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 9f39baf commit ef25605

16 files changed

Lines changed: 402 additions & 15 deletions

.github/workflows/release.yml

Lines changed: 3 additions & 0 deletions
@@ -33,6 +33,9 @@ jobs:
           VERSION="${GITHUB_REF_NAME#v}"
           bash scripts/release-changelog.sh "$VERSION"

+      - name: Run release eval gate
+        run: python scripts/run_release_eval_gate.py --root . --report dist/evidence/release-eval-gate.json
+
       - name: Build release evidence bundle
         run: python scripts/build_release_evidence_bundle.py --output-dir dist/evidence


docs/analysis/eval-gate-design-2026-06-10.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+# Eval Gate Design — 2026-06-10
+
+> **Claim class:** Current truth for release eval gating (WDD-002 precursor).
+> **Requires:** WDA-004 wired at HEAD.
+
+## Claim
+
+Agent behavior changes that ship in a release must pass an offline eval gate
+covering prompt regression and conversational-quality corpora before the
+release workflow proceeds.
+
+## Corpora
+
+| Corpus | Module | Tests |
+| --- | --- | --- |
+| Prompt regression | `teaagent.prompt_regression` | 3 default cases |
+| Conversational quality | `teaagent.eval_corpus` | 4 axes: clarification, interruption, correction, long-context recall |
+
+## Commands
+
+```bash
+# Green path (should exit 0)
+python scripts/run_release_eval_gate.py --root .
+
+# Verify gate blocks on seeded failure (should exit 1)
+python scripts/run_release_eval_gate.py --root . --seed-failure
+
+# Unit tests
+python -m pytest tests/test_release_eval_gate.py -q
+```
+
+## CI integration
+
+- Tag releases: `.github/workflows/release.yml` runs the gate before packaging.
+- Evidence bundle: `scripts/build_release_evidence_bundle.py` includes the gate in `release` profile.
+
+## Artifacts
+
+- Gate report JSON: `--report dist/evidence/release-eval-gate.json` (release workflow)
+- Eval store: `.teaagent/eval/` under workspace root (gitignored)
+
+## Limits
+
+- Offline similarity scoring only (no live model calls in gate).
+- `repo_map_benchmark` remains in default critical categories but is not in the release corpus yet.
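
Review note: the design doc restricts the gate to offline similarity scoring, and the corpus entries in this commit carry `keywords`, `min_length`, and `tolerance_threshold`. The actual scorer in `teaagent.prompt_regression` is not part of this diff, so the following is only a minimal sketch of what such a check could look like, assuming a `difflib` ratio as the similarity measure. `score_offline` and `passes` are hypothetical names, not the project's API.

```python
from difflib import SequenceMatcher


def score_offline(expected: str, actual: str) -> float:
    """Similarity in [0, 1], computed locally with no model call."""
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()


def passes(expected: str, actual: str, behavior: dict, threshold: float) -> bool:
    """Assumed pass rule: all behavior checks hold and similarity >= threshold."""
    text = actual.lower()
    # Every required keyword must appear somewhere in the output.
    if any(kw.lower() not in text for kw in behavior.get('keywords', [])):
        return False
    # Optional minimum length, as used by the long-context recall case.
    if len(actual) < behavior.get('min_length', 0):
        return False
    return score_offline(expected, actual) >= threshold


expected = (
    'Understood. I will stop refactoring and focus only on adding tests '
    'from this point forward.'
)
behavior = {'keywords': ['stop', 'tests']}

print(passes(expected, expected, behavior, 0.65))
print(passes(expected, 'Sure, continuing the refactor.', behavior, 0.65))
```

An identical output passes, while the second output fails on the keyword check before similarity is even considered. Whether the real evaluator combines these checks in this order is an assumption.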

docs/generated/docs-inventory.md

Lines changed: 5 additions & 4 deletions
@@ -6,7 +6,7 @@
 Generated by `python3 scripts/generate_docs_inventory.py`.
 Do not edit this file manually — regenerate instead.

-**Markdown files:** 552
+**Markdown files:** 553

 | Path | Bytes | SHA256 (12) |
 | --- | ---: | --- |
@@ -40,7 +40,7 @@ Do not edit this file manually — regenerate instead.
 | `adr/0027-context-bus-architecture.md` | 543 | `6fa1d2ced665` |
 | `adr/0028-tournament-swarm-architecture.md` | 594 | `ee8dec0fdb60` |
 | `adr/0029-consensus-validation-deferred.md` | 1587 | `8a2da40abc07` |
-| `adr/README.md` | 6552 | `83d807309b2d` |
+| `adr/README.md` | 6668 | `7b2bbdfbda07` |
 | `agent-mode-operator-guide.md` | 2778 | `25b258ab7bfe` |
 | `analysis/active-findings-status-ledger-2026-06-06.md` | 4724 | `34c514f544b8` |
 | `analysis/agent-competitive-risks-2026-05-31.md` | 10235 | `6eff7629ff64` |
@@ -97,6 +97,7 @@ Do not edit this file manually — regenerate instead.
 | `analysis/dynamic-skill-generation-and-long-result-audit-2026-06-05.md` | 16539 | `321a208c082b` |
 | `analysis/engineering-architecture-critique-2026-06-06.md` | 27959 | `01228917cd9c` |
 | `analysis/engineering-critique-refresh-2026-06-10.md` | 8794 | `084a230ba62d` |
+| `analysis/eval-gate-design-2026-06-10.md` | 1432 | `ab7b4559a030` |
 | `analysis/governance-open-decisions-2026-05-31.md` | 2314 | `41bf6dbde3dc` |
 | `analysis/implementation-plan-overlap-review-2026-05-31.md` | 4537 | `0cc33e36385d` |
 | `analysis/integration-and-extensibility-critique-2026-06-06.md` | 23666 | `65bc5ef81b44` |
@@ -449,7 +450,7 @@ Do not edit this file manually — regenerate instead.
 | `plans/ticket-plans/WDG-002-plan.md` | 1712 | `16cb2bb47cbc` |
 | `plans/ux-improvement-roadmap-2026-05-31.md` | 15201 | `368416e593d4` |
 | `plans/work-direction-decomposition-2026-06-10.md` | 10371 | `cba4dd33a15d` |
-| `plans/work-direction-execution-index-2026-06-10.md` | 5011 | `94a33014dd66` |
+| `plans/work-direction-execution-index-2026-06-10.md` | 5135 | `586b534dbfca` |
 | `plugin-skill-catalog.md` | 4118 | `8d42b8f0c492` |
 | `processes/breaking-changes.md` | 820 | `2a43f4d37b6c` |
 | `processes/community-presence.md` | 5009 | `f33f69b2e8ff` |
@@ -488,7 +489,7 @@ Do not edit this file manually — regenerate instead.
 | `reviews/project-state-critical-questioning-2026-06-04.md` | 7340 | `78b9b54c3a9c` |
 | `reviews/security-risk-assessment-2026-06-02.md` | 24112 | `4c9e2e00d001` |
 | `reviews/seven-control-loops-critical-questioning-2026-06-05.md` | 7531 | `ae1e34b8369d` |
-| `roadmap-status.md` | 19926 | `3c7c01ed0772` |
+| `roadmap-status.md` | 19928 | `5a33e482a4ac` |
 | `run-evidence-and-audit-guide.md` | 1980 | `97b527c850b1` |
 | `security-whitepaper.md` | 9691 | `d65a19a755cb` |
 | `security/approval-abuse-cases-2026-06-02.md` | 1281 | `4c43296d1c66` |

docs/plans/work-direction-execution-index-2026-06-10.md

Lines changed: 2 additions & 1 deletion
@@ -70,7 +70,8 @@ any time after S1 WDA-001 lands (needs honest module labels for concept audit).
 | Sprint | IDs | Notes |
 | --- | --- | --- |
 | S2 | WDA-002, WDA-003, WDA-006 | **Closed** — shadow policy + RBAC enforce; ADR 0029 |
-| S3 | WDA-004, WDA-005, WDD-001, WDD-002 | Release gate CI; single-platform update proof |
+| S3 | WDA-004, WDD-001, WDD-002 | **Closed** — release eval gate + conversational corpus; [eval gate design](../analysis/eval-gate-design-2026-06-10.md) |
+| S3b | WDA-005 | Single-platform update proof (queued) |
 | S4 | WDC-002, WDC-003, WDC-004 | Three-concept onboarding; terminology freeze |
 | S5 | WDE-001, WDE-002, WDE-003, WDF-001, WDF-002 | Remote backend; root-module freeze |
 | S6 | WDH-001, WDH-002, WDH-003 | Stop surveys; external users; "when not to use" page |

docs/roadmap-status.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ Provide a single source of truth for roadmap item status, ownership, confidence,
 | H2 | Multi-surface continuity | CLI, TUI, IDE, dashboard, background, cloud, and gateway share one run-state contract | TBD | Partially fixed — M2 foundation wired | Medium | WDA-002 | M2 acceptance complete; full surface parity (IDE/dashboard/cloud) still open |
 | H3 | Ecosystem trust | MCP, plugins, skills, hooks, subagents, and automations are explainable, revocable, and testable | TBD | Partially fixed — M3 tests pass | Medium | WDC-002 | M3 acceptance complete; general-user trust onboarding simplification still open |
 | H4 | Durable team operations | Long-running and team workflows have durable execution, control-plane views, policy, audit, and cost attribution | TBD | Partially fixed — shadow wired | Low | WDA-004 | Policy/RBAC shadow-wired (WDA-002/003); consensus deferred (ADR 0029) |
-| H5 | Quality and eval loop | Prompt/runtime/model changes cannot silently degrade daily outcomes | TBD | Partially fixed — unwired | Low | WDA-004 | `context_health` wired via TUI; eval suite/release gate clusters unwired (ENG-R1) |
+| H5 | Quality and eval loop | Prompt/runtime/model changes cannot silently degrade daily outcomes | TBD | Partially fixed — release gate wired | Low | WDA-005 | Release eval gate in CI (WDA-004/WDD-001); offline conversational corpus |
 | H6 | Packaging and adoption | Desktop/client-server and external-facing release channels have supply-chain, update, and support plans | TBD | Partially fixed — unwired | Low | WDA-005 | `update/*` package implemented but unwired; no single-platform proof yet |

 ## Milestones

scripts/build_release_evidence_bundle.py

Lines changed: 7 additions & 0 deletions
@@ -132,6 +132,13 @@ def build_release_evidence_bundle(
             timeout_seconds=900,
         )
     )
+    results.append(
+        _run(
+            [python, 'scripts/run_release_eval_gate.py', '--root', '.'],
+            cwd=repo_root,
+            timeout_seconds=300,
+        )
+    )

     pytest_counts = _collect_pytest_counts(python=python, cwd=repo_root)

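
Review note: the bundle builder calls a `_run` helper whose body is outside this hunk, so its behavior on a non-zero exit or a timeout cannot be read from the diff. The sketch below shows one way a step runner with a timeout could record both outcomes, using only `subprocess.run` from the standard library. `run_step` and the shape of its result dictionary are hypothetical and are not the project's `_run`.

```python
import subprocess
import sys
from pathlib import Path


def run_step(command: list, *, cwd: Path, timeout_seconds: int) -> dict:
    """Run one evidence step and record its exit status or a timeout."""
    try:
        proc = subprocess.run(
            command,
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before the exception propagates.
        return {'command': command, 'returncode': None, 'timed_out': True}
    return {'command': command, 'returncode': proc.returncode, 'timed_out': False}


# A child process that exits 1 stands in for a blocking gate run.
result = run_step(
    [sys.executable, '-c', 'raise SystemExit(1)'],
    cwd=Path('.'),
    timeout_seconds=30,
)
print(result['returncode'], result['timed_out'])
```

Recording the return code rather than raising lets the bundle include a failed gate run as evidence; whether the real helper does this is an assumption.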

scripts/run_release_eval_gate.py

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+#!/usr/bin/env python3
+"""Run release eval gate (WDA-004 / WDD-001)."""
+
+from __future__ import annotations
+
+import argparse
+import sys
+from pathlib import Path
+
+_REPO_ROOT = Path(__file__).resolve().parents[1]
+if str(_REPO_ROOT) not in sys.path:
+    sys.path.insert(0, str(_REPO_ROOT))
+
+from teaagent.governance.release_eval import (  # noqa: E402
+    format_gate_summary,
+    run_release_eval_gate,
+    should_block_release,
+)
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument('--root', default='.', help='Workspace/repo root')
+    parser.add_argument(
+        '--seed-failure',
+        action='store_true',
+        help='Force a failing regression output to verify the gate blocks release.',
+    )
+    parser.add_argument(
+        '--report',
+        default='',
+        help='Optional path to write JSON gate report.',
+    )
+    args = parser.parse_args(argv)
+
+    report_path = Path(args.report) if args.report else None
+    result = run_release_eval_gate(
+        args.root,
+        seed_failure=args.seed_failure,
+        report_path=report_path,
+    )
+    print(format_gate_summary(result))
+    return 1 if should_block_release(result) else 0
+
+
+if __name__ == '__main__':
+    raise SystemExit(main())
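
Review note: the script maps `should_block_release(result)` to exit code 1, which is what makes the workflow step fail. `teaagent.governance.release_eval` is not shown in this view, so the sketch below uses a hypothetical `GateResult` stand-in and an assumed blocking rule (any failed test blocks) purely to illustrate the exit-code contract that `--seed-failure` is meant to exercise.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GateResult:
    """Hypothetical stand-in for the object returned by run_release_eval_gate."""

    failed_test_ids: list[str] = field(default_factory=list)


def should_block(result: GateResult) -> bool:
    """Assumed rule: any failing test blocks the release."""
    return bool(result.failed_test_ids)


def exit_code(result: GateResult) -> int:
    """Mirror the script's contract: 1 fails the CI step, 0 lets it continue."""
    return 1 if should_block(result) else 0


green = GateResult()
seeded = GateResult(failed_test_ids=['conv-clarify-001'])
print(exit_code(green), exit_code(seeded))  # prints "0 1"
```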

scripts/run_test_tier.py

Lines changed: 1 addition & 0 deletions
@@ -31,6 +31,7 @@
     str(_TESTS / 'test_governance_hardening.py'),
     str(_TESTS / 'test_validate_wiring.py'),
     str(_TESTS / 'test_h4_shadow_wiring.py'),
+    str(_TESTS / 'test_release_eval_gate.py'),
     str(_TESTS / 'regression'),
 )


scripts/validate_wiring.py

Lines changed: 0 additions & 2 deletions
@@ -30,9 +30,7 @@
 WATCH_MODULES: tuple[str, ...] = (
     'teaagent.policy_routing',
     'teaagent.consensus_validation',
-    'teaagent.release_gate',
     'teaagent.scope_creep',
-    'teaagent.prompt_regression',
     'teaagent.repo_map_benchmark',
     'teaagent.update',
     'teaagent.update.changelog',

teaagent/eval_corpus.py

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
+"""Eval corpora for release gating (WDD-001)."""
+
+from __future__ import annotations
+
+from teaagent.eval_suite import EvalCategory, EvalStore, EvalSuite, EvalTest
+from teaagent.prompt_regression import PromptRegressionEvaluator, PromptRegressionTest
+
+RELEASE_EVAL_SUITE_ID = 'release-eval-corpus'
+RELEASE_EVAL_SUITE_NAME = 'Release Eval Corpus'
+
+
+def create_conversational_quality_tests() -> list[PromptRegressionTest]:
+    """Conversational regression corpus: clarify, interrupt, correct, recall."""
+    return [
+        PromptRegressionTest(
+            test_id='conv-clarify-001',
+            name='Clarification before action',
+            prompt='Fix the auth bug.',
+            expected_output=(
+                'Before I change code, could you clarify which auth flow failed '
+                'and whether this is login, token refresh, or permission checks?'
+            ),
+            expected_behavior={'keywords': ['clarif', 'auth']},
+            tolerance_threshold=0.65,
+            metadata={'axis': 'clarification'},
+        ),
+        PromptRegressionTest(
+            test_id='conv-interrupt-001',
+            name='Graceful interruption handling',
+            prompt='Stop — switch to writing tests only, no more refactors.',
+            expected_output=(
+                'Understood. I will stop refactoring and focus only on adding tests '
+                'from this point forward.'
+            ),
+            expected_behavior={'keywords': ['stop', 'tests']},
+            tolerance_threshold=0.65,
+            metadata={'axis': 'interruption'},
+        ),
+        PromptRegressionTest(
+            test_id='conv-correct-001',
+            name='User correction acknowledgment',
+            prompt='No, the failing module is billing, not auth.',
+            expected_output=(
+                'Thanks for the correction — I will target the billing module instead '
+                'of auth and re-check the failing tests there.'
+            ),
+            expected_behavior={'keywords': ['billing', 'correction']},
+            tolerance_threshold=0.65,
+            metadata={'axis': 'correction'},
+        ),
+        PromptRegressionTest(
+            test_id='conv-recall-001',
+            name='Long-context recall',
+            prompt='What was the budget cap and rollback rule we agreed on earlier?',
+            expected_output=(
+                'Earlier you set a budget cap of 2000 cents with rollback required '
+                'before any destructive shell command.'
+            ),
+            expected_behavior={
+                'keywords': ['budget', 'rollback'],
+                'min_length': 40,
+            },
+            tolerance_threshold=0.6,
+            metadata={'axis': 'long_context_recall'},
+        ),
+    ]
+
+
+def _to_eval_test(
+    regression_test: PromptRegressionTest,
+    *,
+    category: EvalCategory,
+) -> EvalTest:
+    return EvalTest(
+        test_id=regression_test.test_id,
+        name=regression_test.name,
+        category=category,
+        description=f'Eval corpus test: {regression_test.name}',
+        metadata={
+            'prompt': regression_test.prompt,
+            'expected_output': regression_test.expected_output,
+            'expected_behavior': regression_test.expected_behavior,
+            'tolerance_threshold': regression_test.tolerance_threshold,
+            **regression_test.metadata,
+        },
+    )
+
+
+def register_release_eval_suite(store: EvalStore) -> str:
+    """Register prompt + conversational tests in the release eval suite."""
+    evaluator = PromptRegressionEvaluator()
+    existing = store.load_suite(RELEASE_EVAL_SUITE_ID)
+    if existing is not None:
+        return RELEASE_EVAL_SUITE_ID
+
+    suite = EvalSuite(
+        suite_id=RELEASE_EVAL_SUITE_ID,
+        name=RELEASE_EVAL_SUITE_NAME,
+        description='Prompt regression + conversational quality corpus (WDD-001).',
+    )
+    for regression_test in (
+        *evaluator.create_default_regression_tests(),
+        *create_conversational_quality_tests(),
+    ):
+        category = (
+            EvalCategory.CONVERSATIONAL
+            if regression_test.test_id.startswith('conv-')
+            else EvalCategory.PROMPT_REGRESSION
+        )
+        suite.add_test(_to_eval_test(regression_test, category=category))
+    store.save_suite(suite)
+    return suite.suite_id
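
Review note: `register_release_eval_suite` returns early when a suite with the release ID already exists, and it picks the category from the `conv-` ID prefix. The sketch below reproduces those two rules against a hypothetical in-memory store to show their effect. `InMemoryStore`, `categorize`, `register`, and the ID `prompt-001` are illustrative stand-ins, not the `EvalStore` API or real corpus entries.

```python
from __future__ import annotations

SUITE_ID = 'release-eval-corpus'


class InMemoryStore:
    """Hypothetical stand-in for EvalStore; keeps suites in a dict."""

    def __init__(self) -> None:
        self.suites: dict[str, list[tuple[str, str]]] = {}
        self.saves = 0

    def load_suite(self, suite_id: str) -> list[tuple[str, str]] | None:
        return self.suites.get(suite_id)

    def save_suite(self, suite_id: str, tests: list[tuple[str, str]]) -> None:
        self.suites[suite_id] = tests
        self.saves += 1


def categorize(test_id: str) -> str:
    """Same prefix rule as the commit: 'conv-' IDs are conversational."""
    return 'conversational' if test_id.startswith('conv-') else 'prompt_regression'


def register(store: InMemoryStore, test_ids: list[str]) -> str:
    """Register once; later calls return early and leave the stored suite untouched."""
    if store.load_suite(SUITE_ID) is not None:
        return SUITE_ID
    store.save_suite(SUITE_ID, [(tid, categorize(tid)) for tid in test_ids])
    return SUITE_ID


store = InMemoryStore()
register(store, ['prompt-001', 'conv-clarify-001'])
# Second call with one more test: the early return means it is not stored.
register(store, ['prompt-001', 'conv-clarify-001', 'conv-recall-001'])
print(store.saves, len(store.suites[SUITE_ID]))  # prints "1 2"
```

One consequence worth checking in review: if the real store persists under `.teaagent/eval/` between runs, tests added to the corpus later would not reach an already registered suite unless the store is cleared or the suite is rebuilt.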
