
Commit 48254ba

Round 93: cross-category eval_harness contract heading library registry
- DRY PROMPT_CONTRACT_HEADINGS, PROMPT_EVAL_MIN_SCORE, assert_prompt_contract_headings in prompt_eval_helpers.py
- Add test_prompt_contract_library_registry.py for library-wide contract parity
- Refactor all test_*_prompts_eval_harness.py to use shared contract constants
- Governance sync: GLOSSARY step 25, panel record, QUALITY_GATES 615+ floor
1 parent 6f21891 commit 48254ba

17 files changed

Lines changed: 238 additions & 101 deletions

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@ Full library docs: [reflective-prompt-library/README.md](reflective-prompt-libra
2121
## Governance
2222

2323
- **Contributing:** [CONTRIBUTING.md](CONTRIBUTING.md) — quality gates, routing maintenance (R8–R12), `make all`
24-
- **Panel record:** [multi-agent-panel-consensus](reflective-prompt-library/plans/multi-agent-panel-consensus-2026-06-25.md) — six-lens Socratic consensus (Rounds 1–92)
24+
- **Panel record:** [multi-agent-panel-consensus](reflective-prompt-library/plans/multi-agent-panel-consensus-2026-06-25.md) — six-lens Socratic consensus (Rounds 1–93)
2525
- **Operator playbook:** [GLOSSARY.md](reflective-prompt-library/GLOSSARY.md) — Governance Maintenance Playbook
2626

2727
The repository contains:

reflective-prompt-library/GLOSSARY.md

Lines changed: 2 additions & 1 deletion
@@ -337,7 +337,7 @@ Curated top-of-cheatsheet summary of high-confusion routing traps (ROUTE-002 hol
337337

338338
## Governance Maintenance Playbook / 治理維護手冊
339339

340-
Ongoing upkeep after panel close (Rounds 1–92). Not agent instructions — operator checklist.
340+
Ongoing upkeep after panel close (Rounds 1–93). Not agent instructions — operator checklist.
341341

342342
**Operational test:** Before router tuning, add fresh ROUTE-002/003 holdout phrases; run `make all`; record decisions in `PROJECT_KNOWLEDGE.md` Decision Index when governance surface changes.
343343

@@ -365,3 +365,4 @@ Ongoing upkeep after panel close (Rounds 1–92). Not agent instructions — ope
365365
22. When editing Human Review coverage on thinking lenses or composable prompts (`01-thinking`–`06-repo`), keep frozen `*_HUMAN_REVIEW_REQUIRED` / `*_HUMAN_REVIEW_EXEMPT` sets in `test_*_prompts_eval_harness.py` aligned with preamble `## Human Review` sections; use `prompt_eval_helpers.assert_human_review_*` parity helpers and run HR set partition tests.
366366
23. When adding composable prompts or new categories, keep `PROMPT_LIBRARY_CATEGORIES` and `test_human_review_library_registry.py` aligned so frozen HR sets cover every `00-core`–`06-repo` prompt exactly once.
367367
24. When adding composable prompts or editing `*_SKILL_LINKS` / `*_THINKING_LINKS`, keep per-category dict keys aligned with prompt globs and run `test_prompt_skill_links_library_registry.py` plus `test_all_*_prompts_have_skill_link` in `test_prompt_cross_links.py`.
368+
25. When adding composable prompts or editing eval_harness contract preambles, keep `PROMPT_CONTRACT_HEADINGS` / `PROMPT_EVAL_MIN_SCORE` defined only in `prompt_eval_helpers.py` (the shared source for every harness) and run `test_prompt_contract_library_registry.py` plus per-category `test_*_prompts_eval_harness.py` guards.
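
The parity check that step 25 describes can be sketched roughly as follows. This is a minimal illustration, not the contents of `test_prompt_contract_library_registry.py` (which this commit page does not show): `missing_contract_headings`, `run_demo`, and the two-category demo library are hypothetical names, and only the heading tuple and the preamble split are copied from `prompt_eval_helpers.py`.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Copied from the prompt_eval_helpers.py diff in this commit.
PROMPT_CONTRACT_HEADINGS = (
    "## Purpose",
    "## Scope",
    "## Acceptance Criteria",
    "## Falsifiability",
)

# Three backticks, built indirectly so this example can live inside a fence.
FENCE = "`" * 3


def prompt_preamble(prompt_path: Path) -> str:
    """Return the text before the first fenced block, as the commit's helper does."""
    return prompt_path.read_text(encoding="utf-8").split(FENCE, 1)[0]


def missing_contract_headings(
    library_root: Path, categories: tuple[str, ...]
) -> dict[str, list[str]]:
    """Hypothetical helper: map 'category/file.md' to headings absent from its preamble."""
    missing: dict[str, list[str]] = {}
    for category in categories:
        for prompt_path in sorted((library_root / category).glob("*.md")):
            preamble = prompt_preamble(prompt_path)
            absent = [h for h in PROMPT_CONTRACT_HEADINGS if h not in preamble]
            if absent:
                missing[f"{category}/{prompt_path.name}"] = absent
    return missing


def run_demo() -> dict[str, list[str]]:
    """Build a throwaway two-category library and report contract gaps."""
    good = "\n".join(PROMPT_CONTRACT_HEADINGS) + "\n"
    # The last heading appears only inside the fenced template, so it does not count.
    bad = (
        "\n".join(PROMPT_CONTRACT_HEADINGS[:3])
        + "\n" + FENCE + "\n## Falsifiability\n" + FENCE + "\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for category, name, text in (
            ("00-core", "good.md", good),
            ("04-agent", "bad.md", bad),
        ):
            (root / category).mkdir()
            (root / category / name).write_text(text, encoding="utf-8")
        return missing_contract_headings(root, ("00-core", "04-agent"))


if __name__ == "__main__":
    print(run_demo())
```

A registry-style test would presumably assert that this mapping is empty for every category in `PROMPT_LIBRARY_CATEGORIES`; collecting all gaps before failing makes a single run report every drifting prompt at once.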

reflective-prompt-library/PROJECT_KNOWLEDGE.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,7 @@ deferred promotions are recurrence-gated — see [panel backlog](plans/multi-age
7373
## Decision Index
7474

7575
- 2026-06-25 Round 85 panel — composable prompt Primary workflow surface preamble guards (`test_*_prompts_eval_harness.py`) + Supporting-lens exemption → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
76+
- 2026-06-25 Round 93 panel — cross-category eval_harness contract heading library registry (`test_prompt_contract_library_registry.py`, DRY `PROMPT_CONTRACT_HEADINGS`) → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
7677
- 2026-06-25 Round 92 panel — cross-category skill/thinking cross-link library registry (`test_prompt_skill_links_library_registry.py`) + missing `test_all_*_prompts_have_skill_link` guards → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
7778
- 2026-06-25 Round 91 panel — cross-category Human Review library registry (`test_human_review_library_registry.py`, `PROMPT_LIBRARY_CATEGORIES`) → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
7879
- 2026-06-25 Round 90 panel — library-wide Human Review required/exempt set parity (`01-thinking`–`06-repo`) + DRY `prompt_eval_helpers` HR set guards → [record](plans/multi-agent-panel-consensus-2026-06-25.md)

reflective-prompt-library/README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -30,7 +30,7 @@ Pick **Strictness L1–L6** first (`skills/reflective-dispatch/SKILL.md`, [GLOSS
3030

3131
## Governance Panel Record
3232

33-
Multi-agent Socratic consensus on project goals and the nine skills (Rounds 1–92, options A–GL) is recorded in [plans/multi-agent-panel-consensus-2026-06-25.md](plans/multi-agent-panel-consensus-2026-06-25.md). Run `make all` before claiming routing or governance changes are verified.
33+
Multi-agent Socratic consensus on project goals and the nine skills (Rounds 1–93, options A–GQ) is recorded in [plans/multi-agent-panel-consensus-2026-06-25.md](plans/multi-agent-panel-consensus-2026-06-25.md). Run `make all` before claiming routing or governance changes are verified.
3434

3535
## Directory Map
3636

reflective-prompt-library/plans/QUALITY_GATES_SUMMARY.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -314,7 +314,7 @@ ROUTE-002 measures unseen phrasing separately from ROUTE-001. Round 7 (2026-06-2
314314
2. **ROUTE-001/002/003 in CI** — 128 + 102 + 53 paraphrases at 100% consistency (seeded fixtures); `validate_route_fixture.py` gates minimum coverage
315315
3. **Governance validators** — links, lint, governance metadata, PROJECT_KNOWLEDGE, benchmark fixture, skill examples
316316
4. **Harness policy docs** — CONTRIBUTING, AGENTS, SKILL_INSTALLATION, maintenance playbook
317-
5. **Doc anti-drift** — `test_routing_contract.py`, cheatsheet parity tests, `test_readme_governance.py`, `test_thinking_prompts_eval_harness.py`, `test_engineering_prompts_eval_harness.py`, `test_prompt_cross_links.py`, `test_core_prompts_eval_harness.py`, `test_human_review_library_registry.py`, `test_prompt_skill_links_library_registry.py`, `test_agent_prompts_eval_harness.py`, `test_context_prompts_eval_harness.py`, `test_domain_prompts_eval_harness.py`, `test_repo_prompts_eval_harness.py`, `test_validate_governance.py`, `test_validate_links.py`, `test_lint_skills.py`, `test_skill_module_contract.py` (Escalation subsection + Trigger/Methods/Output/Never; 600+ pytest anti-drift suite in CI); reciprocal thinking-lens ↔ skill checks and `00-core` + composable `Primary workflow surface(s)` ↔ `*_SKILL_LINKS` parity in `test_prompt_cross_links.py` (including strict Primary workflow surfaces parity via `test_thinking_lens_primary_surfaces_match_consumer_graph`); Human Review + Escalation route-target guards in thinking/skill contract tests; composable `Primary workflow surface(s)` / Supporting-lens preamble guards and composable `## Human Review` preamble guards (route to `reflective-risk`) via `prompt_eval_helpers.assert_human_review_preamble` in `test_*_prompts_eval_harness.py`; frozen `*_HUMAN_REVIEW_REQUIRED` / `*_HUMAN_REVIEW_EXEMPT` set parity across all prompt categories (Round 90)
317+
5. **Doc anti-drift** — `test_routing_contract.py`, cheatsheet parity tests, `test_readme_governance.py`, `test_thinking_prompts_eval_harness.py`, `test_engineering_prompts_eval_harness.py`, `test_prompt_cross_links.py`, `test_core_prompts_eval_harness.py`, `test_human_review_library_registry.py`, `test_prompt_skill_links_library_registry.py`, `test_prompt_contract_library_registry.py`, `test_agent_prompts_eval_harness.py`, `test_context_prompts_eval_harness.py`, `test_domain_prompts_eval_harness.py`, `test_repo_prompts_eval_harness.py`, `test_validate_governance.py`, `test_validate_links.py`, `test_lint_skills.py`, `test_skill_module_contract.py` (Escalation subsection + Trigger/Methods/Output/Never; 615+ pytest anti-drift suite in CI); reciprocal thinking-lens ↔ skill checks and `00-core` + composable `Primary workflow surface(s)` ↔ `*_SKILL_LINKS` parity in `test_prompt_cross_links.py` (including strict Primary workflow surfaces parity via `test_thinking_lens_primary_surfaces_match_consumer_graph`); Human Review + Escalation route-target guards in thinking/skill contract tests; composable `Primary workflow surface(s)` / Supporting-lens preamble guards and composable `## Human Review` preamble guards (route to `reflective-risk`) via `prompt_eval_helpers.assert_human_review_preamble` in `test_*_prompts_eval_harness.py`; frozen `*_HUMAN_REVIEW_REQUIRED` / `*_HUMAN_REVIEW_EXEMPT` set parity across all prompt categories (Round 90); library-wide contract heading registry (`PROMPT_CONTRACT_HEADINGS`, Round 93)
318318

319319
### Ongoing maintenance (not blockers)
320320

@@ -384,4 +384,4 @@ Phase 1 quality-gate tooling and documentation are **complete**. Routing consist
384384
- ✅ Benchmark fixture gate plus optional manual benchmark runs
385385
- ✅ Research-backed design decisions
386386

387-
The project is positioned to grow sustainably with quality discipline built in from the start. **No open implementation blockers** remain from panel Rounds 1–92; work is recurrence-gated maintenance per playbook. The next measurable quality target is **holdout expansion before router tuning** and optional manual baseline-vs-skill benchmark runs — not shipping new core skills without promotion evidence.
387+
The project is positioned to grow sustainably with quality discipline built in from the start. **No open implementation blockers** remain from panel Rounds 1–93; work is recurrence-gated maintenance per playbook. The next measurable quality target is **holdout expansion before router tuning** and optional manual baseline-vs-skill benchmark runs — not shipping new core skills without promotion evidence.

reflective-prompt-library/plans/multi-agent-panel-consensus-2026-06-25.md

Lines changed: 52 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2677,3 +2677,55 @@ User directive (repeat): review prompts, plans, skills, and Socratic/critical-th
26772677

26782678
**Resealed 2026-06-25** after **Round 92** (options GH–GL). Primary workflow surface cross-links are now library-registry checked (`00-core`, `02-engineering`–`06-repo`) with thinking-lens consumer map parity for `01-thinking`. Holdout expansion remains recurrence-gated maintenance.
26792679

2680+
## Round 93 — cross-category eval_harness contract heading library registry (2026-06-25)
2681+
2682+
**Options GM–GQ** | Six-lens panel (Opus, Codex, Gemini, Composer, Sakana, GLM)
2683+
2684+
### Round 93 options
2685+
2686+
| ID | Proposal | Verdict |
2687+
| --- | --- | --- |
2688+
| GM | DRY `PROMPT_CONTRACT_HEADINGS` / `PROMPT_EVAL_MIN_SCORE` / `assert_prompt_contract_headings` in `prompt_eval_helpers.py`; refactor `test_*_prompts_eval_harness.py` | **Agree** |
2689+
| GN | `test_prompt_contract_library_registry.py` cross-category contract registry + library glob parity | **Agree** |
2690+
| GO | GLOSSARY playbook step 25 + governance sync | **Agree** |
2691+
| GP | ROUTE holdout expansion | **Defer** |
2692+
| GQ | Router / tenth skill / benchmark CI | **Reject** |
2693+
2694+
### Round 93 verdict table
2695+
2696+
| ID | Option | Verdict | Action |
2697+
| --- | --- | --- | --- |
2698+
| GM | Shared contract constants | **Agree** | `prompt_eval_helpers.py` + harness refactor |
2699+
| GN | Contract library registry | **Agree** | `test_prompt_contract_library_registry.py` |
2700+
| GO | Playbook + docs | **Agree** | step 25; panel round 93 sync |
2701+
| GP | Holdout expansion | **Defer** | maintenance |
2702+
| GQ | Router/tenth skill/benchmark CI | **Reject** | no change |
2703+
2704+
### Socratic rationale (Round 93)
2705+
2706+
- **Opus:** Rounds 91–92 closed HR and cross-link library registries; contract headings remain duplicated across seven harness files with no library-wide falsifiability.
2707+
- **Codex:** Centralizing `PROMPT_CONTRACT_HEADINGS` prevents per-category drift; registry test mirrors HR/cross-link pattern.
2708+
- **Gemini:** Low-risk maintenance; no router or skill-count changes.
2709+
- **Composer:** `assert_prompt_contract_headings` DRYs seven identical test bodies.
2710+
- **Sakana:** `01-thinking` included in contract registry (unlike cross-link composable-only registry) because all categories share the same preamble contract.
2711+
- **GLM:** Unanimous — implement GM–GO only.
2712+
2713+
## Implemented Changes (Round 93)
2714+
2715+
- `plans/tests/prompt_eval_helpers.py`: `PROMPT_CONTRACT_HEADINGS`, `PROMPT_EVAL_MIN_SCORE`, `assert_prompt_contract_headings`
2716+
- `plans/tests/test_*_prompts_eval_harness.py`: import shared contract constants; DRY contract heading guards
2717+
- `plans/tests/test_prompt_contract_library_registry.py`: cross-category contract registry + library glob parity
2718+
- `GLOSSARY.md`: playbook Rounds 1–93; step 25 for contract library registry
2719+
- `QUALITY_GATES_SUMMARY.md`: contract registry note; panel Rounds 1–93; 615+ pytest floor
2720+
- `PROJECT_KNOWLEDGE.md`: Decision Index Round 93 entry
2721+
- `README.md`, `reflective-prompt-library/README.md`, `test_readme_governance.py`: panel round 93 sync
2722+
2723+
## Verification (Round 93)
2724+
2725+
- `make all`: pytest + ROUTE-001/002/003 100%
2726+
2727+
---
2728+
2729+
**Resealed 2026-06-25** after **Round 93** (options GM–GQ). Eval_harness contract headings are now library-registry checked across all `00-core`–`06-repo` prompts with shared `PROMPT_CONTRACT_HEADINGS`. Holdout expansion remains recurrence-gated maintenance.
2730+
2731+

reflective-prompt-library/plans/tests/prompt_eval_helpers.py

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,24 @@
1515
"06-repo",
1616
)
1717

18+
PROMPT_CONTRACT_HEADINGS = (
19+
    "## Purpose",
20+
    "## Scope",
21+
    "## Acceptance Criteria",
22+
    "## Falsifiability",
23+
)
24+
25+
PROMPT_EVAL_MIN_SCORE = 80.0
26+
27+
28+
def assert_prompt_contract_headings(prompt_path: Path) -> None:
29+
    """Contract headings must appear in preamble outside fenced template blocks."""
30+
    preamble = prompt_preamble(prompt_path)
31+
    for heading in PROMPT_CONTRACT_HEADINGS:
32+
        assert heading in preamble, (
33+
            f"{prompt_path.name} missing {heading} outside template block"
34+
        )
35+
1836

1937
def prompt_preamble(prompt_path: Path) -> str:
2038
    return prompt_path.read_text(encoding="utf-8").split("```", 1)[0]
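
To make the preamble rule concrete, here is a small sketch of how the split on the first fence behaves. It assumes only what the diff shows (text before the first triple backtick is the preamble, and headings are matched by substring). `has_heading_line` and `demo` are hypothetical names, and the whole-line variant is not part of this commit; it is included only to show one edge case of substring matching.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Three backticks, built indirectly so this example can live inside a fence.
FENCE = "`" * 3


def prompt_preamble(prompt_path: Path) -> str:
    """Same logic as the helper in the diff: keep text before the first fence."""
    return prompt_path.read_text(encoding="utf-8").split(FENCE, 1)[0]


def has_heading_line(preamble: str, heading: str) -> bool:
    """Hypothetical stricter check: the heading must occupy a whole line."""
    return any(line.strip() == heading for line in preamble.splitlines())


def demo() -> dict[str, bool]:
    text = "\n".join(
        [
            "# Example prompt",
            "### Purpose",  # level-3 heading, not the level-2 contract heading
            "Body text.",
            FENCE + "markdown",
            "## Scope",  # inside the template block, so outside the preamble
            FENCE,
        ]
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.md"
        path.write_text(text, encoding="utf-8")
        preamble = prompt_preamble(path)
    return {
        # The fenced heading is excluded by the split.
        "scope_in_preamble": "## Scope" in preamble,
        # Substring matching accepts "### Purpose" because it contains "## Purpose".
        "purpose_substring": "## Purpose" in preamble,
        # Whole-line matching rejects the level-3 heading.
        "purpose_exact_line": has_heading_line(preamble, "## Purpose"),
    }


if __name__ == "__main__":
    print(demo())
```

Whether the looser substring behavior matters depends on how the prompts are written; if no prompt uses deeper headings with the same titles, the two checks agree.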

reflective-prompt-library/plans/tests/test_agent_prompts_eval_harness.py

Lines changed: 5 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -9,18 +9,13 @@
99
sys.path.insert(0, str(Path(__file__).parent))
1010

1111
from eval_harness import EvalHarness # noqa: E402
12-
from prompt_eval_helpers import assert_human_review_preamble, prompts_with_human_review, assert_human_review_required_matches_detection, assert_human_review_exempt_have_no_preamble_section, assert_human_review_sets_partition # noqa: E402
12+
from prompt_eval_helpers import assert_human_review_preamble, prompts_with_human_review, assert_human_review_required_matches_detection, assert_human_review_exempt_have_no_preamble_section, assert_human_review_sets_partition, PROMPT_CONTRACT_HEADINGS, PROMPT_EVAL_MIN_SCORE, assert_prompt_contract_headings # noqa: E402
13+
14+
REQUIRED_HEADINGS = PROMPT_CONTRACT_HEADINGS
15+
MIN_SCORE = PROMPT_EVAL_MIN_SCORE
1316

1417
AGENT_DIR = Path(__file__).parent.parent.parent / "04-agent"
1518
REPO_ROOT = str(Path(__file__).parent.parent.parent.parent)
16-
MIN_SCORE = 80.0
17-
18-
REQUIRED_HEADINGS = (
19-
    "## Purpose",
20-
    "## Scope",
21-
    "## Acceptance Criteria",
22-
    "## Falsifiability",
23-
)
2419

2520
AGENT_PROMPTS = tuple(sorted(AGENT_DIR.glob("*.md")))
2621
AGENT_PROMPTS_WITH_HUMAN_REVIEW = prompts_with_human_review(AGENT_PROMPTS)
@@ -49,10 +44,7 @@ def harness() -> EvalHarness:
4944

5045
@pytest.mark.parametrize("prompt_path", AGENT_PROMPTS, ids=lambda p: p.name)
5146
def test_agent_prompt_has_contract_headings(prompt_path: Path):
52-
    text = prompt_path.read_text(encoding="utf-8")
53-
    preamble = text.split("```", 1)[0]
54-
    for heading in REQUIRED_HEADINGS:
55-
        assert heading in preamble, f"{prompt_path.name} missing {heading} outside template block"
47+
    assert_prompt_contract_headings(prompt_path)
5648

5749

5850
@pytest.mark.parametrize("prompt_path", AGENT_PROMPTS, ids=lambda p: p.name)

reflective-prompt-library/plans/tests/test_context_prompts_eval_harness.py

Lines changed: 5 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -9,18 +9,13 @@
99
sys.path.insert(0, str(Path(__file__).parent))
1010

1111
from eval_harness import EvalHarness # noqa: E402
12-
from prompt_eval_helpers import assert_human_review_preamble, prompts_with_human_review, assert_human_review_required_matches_detection, assert_human_review_exempt_have_no_preamble_section, assert_human_review_sets_partition # noqa: E402
12+
from prompt_eval_helpers import assert_human_review_preamble, prompts_with_human_review, assert_human_review_required_matches_detection, assert_human_review_exempt_have_no_preamble_section, assert_human_review_sets_partition, PROMPT_CONTRACT_HEADINGS, PROMPT_EVAL_MIN_SCORE, assert_prompt_contract_headings # noqa: E402
13+
14+
REQUIRED_HEADINGS = PROMPT_CONTRACT_HEADINGS
15+
MIN_SCORE = PROMPT_EVAL_MIN_SCORE
1316

1417
CONTEXT_DIR = Path(__file__).parent.parent.parent / "03-context"
1518
REPO_ROOT = str(Path(__file__).parent.parent.parent.parent)
16-
MIN_SCORE = 80.0
17-
18-
REQUIRED_HEADINGS = (
19-
    "## Purpose",
20-
    "## Scope",
21-
    "## Acceptance Criteria",
22-
    "## Falsifiability",
23-
)
2419

2520
CONTEXT_PROMPTS = tuple(sorted(CONTEXT_DIR.glob("*.md")))
2621
CONTEXT_PROMPTS_WITH_HUMAN_REVIEW = prompts_with_human_review(CONTEXT_PROMPTS)
@@ -46,10 +41,7 @@ def harness() -> EvalHarness:
4641

4742
@pytest.mark.parametrize("prompt_path", CONTEXT_PROMPTS, ids=lambda p: p.name)
4843
def test_context_prompt_has_contract_headings(prompt_path: Path):
49-
    text = prompt_path.read_text(encoding="utf-8")
50-
    preamble = text.split("```", 1)[0]
51-
    for heading in REQUIRED_HEADINGS:
52-
        assert heading in preamble, f"{prompt_path.name} missing {heading} outside template block"
44+
    assert_prompt_contract_headings(prompt_path)
5345

5446

5547
@pytest.mark.parametrize("prompt_path", CONTEXT_PROMPTS, ids=lambda p: p.name)

reflective-prompt-library/plans/tests/test_core_prompts_eval_harness.py

Lines changed: 8 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -9,24 +9,22 @@
99
sys.path.insert(0, str(Path(__file__).parent))
1010

1111
from eval_harness import EvalHarness # noqa: E402
12-
from prompt_eval_helpers import ( # noqa: E402
12+
from prompt_eval_helpers import (  # noqa: E402
13+
    PROMPT_CONTRACT_HEADINGS,
14+
    PROMPT_EVAL_MIN_SCORE,
15+
    assert_prompt_contract_headings,
1316
    assert_human_review_exempt_have_no_preamble_section,
1417
    assert_human_review_preamble,
1518
    assert_human_review_required_matches_detection,
1619
    assert_human_review_sets_partition,
1720
    prompts_with_human_review,
1821
)
1922

23+
REQUIRED_HEADINGS = PROMPT_CONTRACT_HEADINGS
24+
MIN_SCORE = PROMPT_EVAL_MIN_SCORE
25+
2026
CORE_DIR = Path(__file__).parent.parent.parent / "00-core"
2127
REPO_ROOT = str(Path(__file__).parent.parent.parent.parent)
22-
MIN_SCORE = 80.0
23-
24-
REQUIRED_HEADINGS = (
25-
    "## Purpose",
26-
    "## Scope",
27-
    "## Acceptance Criteria",
28-
    "## Falsifiability",
29-
)
3028

3129
CORE_PROMPTS = tuple(sorted(CORE_DIR.glob("*.md")))
3230
CORE_HUMAN_REVIEW_REQUIRED = frozenset({
@@ -53,10 +51,7 @@ def harness() -> EvalHarness:
5351

5452
@pytest.mark.parametrize("prompt_path", CORE_PROMPTS, ids=lambda p: p.name)
5553
def test_core_prompt_has_contract_headings(prompt_path: Path):
56-
    text = prompt_path.read_text(encoding="utf-8")
57-
    preamble = text.split("```", 1)[0]
58-
    for heading in REQUIRED_HEADINGS:
59-
        assert heading in preamble, f"{prompt_path.name} missing {heading} outside template block"
54+
    assert_prompt_contract_headings(prompt_path)
6055

6156

6257
@pytest.mark.parametrize("prompt_path", CORE_PROMPTS, ids=lambda p: p.name)
