Commit ffc418b

Round 96: cross-category eval_harness score floor library registry
Add assert_prompt_meets_eval_harness_floor DRY helper, refactor per-category floor tests, and test_prompt_eval_harness_score_library_registry.py for library-wide score-floor falsifiability. Sync governance (GLOSSARY step 28, panel record, Decision Index). 652 pytest; ROUTE-001/002/003 100%.
1 parent d4f7f85 commit ffc418b

17 files changed

Lines changed: 210 additions & 60 deletions

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@ Full library docs: [reflective-prompt-library/README.md](reflective-prompt-libra
2121
## Governance
2222

2323
- **Contributing:** [CONTRIBUTING.md](CONTRIBUTING.md) — quality gates, routing maintenance (R8–R12), `make all`
24-
- **Panel record:** [multi-agent-panel-consensus](reflective-prompt-library/plans/multi-agent-panel-consensus-2026-06-25.md) — six-lens Socratic consensus (Rounds 1–95)
24+
- **Panel record:** [multi-agent-panel-consensus](reflective-prompt-library/plans/multi-agent-panel-consensus-2026-06-25.md) — six-lens Socratic consensus (Rounds 1–96)
2525
- **Operator playbook:** [GLOSSARY.md](reflective-prompt-library/GLOSSARY.md) — Governance Maintenance Playbook
2626

2727
The repository contains:

reflective-prompt-library/GLOSSARY.md

Lines changed: 2 additions & 1 deletion
@@ -337,7 +337,7 @@ Curated top-of-cheatsheet summary of high-confusion routing traps (ROUTE-002 hol
337337

338338
## Governance Maintenance Playbook / 治理維護手冊
339339

340-
Ongoing upkeep after panel close (Rounds 1–95). Not agent instructions — operator checklist.
340+
Ongoing upkeep after panel close (Rounds 1–96). Not agent instructions — operator checklist.
341341

342342
**Operational test:** Before router tuning, add fresh ROUTE-002/003 holdout phrases; run `make all`; record decisions in `PROJECT_KNOWLEDGE.md` Decision Index when governance surface changes.
343343

@@ -368,3 +368,4 @@ Ongoing upkeep after panel close (Rounds 1–95). Not agent instructions — ope
368368
25. When adding composable prompts or editing eval_harness contract preambles, keep `PROMPT_CONTRACT_HEADINGS` / `PROMPT_EVAL_MIN_SCORE` in `prompt_eval_helpers.py` and run `test_prompt_contract_library_registry.py` plus per-category `test_*_prompts_eval_harness.py` guards.
369369
26. When editing composable prompt Purpose preambles, keep `Primary workflow surface(s)` / Supporting-lens lines via `assert_primary_workflow_surface_preamble` in `prompt_eval_helpers.py`; update `SUPPORTING_LENS_PRIMARY_SURFACE_BY_CATEGORY` for exemptions; run `test_prompt_primary_workflow_surface_library_registry.py` plus per-category `test_*_prompts_eval_harness.py` guards.
370370
27. When editing category workflow skill coverage tuples, keep frozen `*_COVER_WORKFLOW_SKILLS` in `test_*_prompts_eval_harness.py` aligned with `assert_category_workflow_skill_coverage`; `01-thinking` stays exempt (consumer graph); run `test_workflow_skill_coverage_library_registry.py`.
371+
28. When editing eval_harness score floors, keep `PROMPT_EVAL_MIN_SCORE` in `prompt_eval_helpers.py` and use `assert_prompt_meets_eval_harness_floor` in per-category `test_*_prompts_eval_harness.py` guards; run `test_prompt_eval_harness_score_library_registry.py`.

reflective-prompt-library/PROJECT_KNOWLEDGE.md

Lines changed: 1 addition & 0 deletions
@@ -72,6 +72,7 @@ deferred promotions are recurrence-gated — see [panel backlog](plans/multi-age
7272

7373
## Decision Index
7474

75+
- 2026-06-25 Round 96 panel — cross-category eval_harness score floor library registry (`test_prompt_eval_harness_score_library_registry.py`, DRY `assert_prompt_meets_eval_harness_floor`) → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
7576
- 2026-06-25 Round 85 panel — composable prompt Primary workflow surface preamble guards (`test_*_prompts_eval_harness.py`) + Supporting-lens exemption → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
7677
- 2026-06-25 Round 94 panel — cross-category Primary workflow surface preamble library registry (`test_prompt_primary_workflow_surface_library_registry.py`, DRY `assert_primary_workflow_surface_preamble`) → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
7778
- 2026-06-25 Round 95 panel — cross-category workflow skill coverage library registry (`test_workflow_skill_coverage_library_registry.py`, DRY `assert_category_workflow_skill_coverage`) → [record](plans/multi-agent-panel-consensus-2026-06-25.md)

reflective-prompt-library/README.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ Pick **Strictness L1–L6** first (`skills/reflective-dispatch/SKILL.md`, [GLOSS
3030

3131
## Governance Panel Record
3232

33-
Multi-agent Socratic consensus on project goals and the nine skills (Rounds 1–95, options A–HA) is recorded in [plans/multi-agent-panel-consensus-2026-06-25.md](plans/multi-agent-panel-consensus-2026-06-25.md). Run `make all` before claiming routing or governance changes are verified.
33+
Multi-agent Socratic consensus on project goals and the nine skills (Rounds 1–96, options A–HF) is recorded in [plans/multi-agent-panel-consensus-2026-06-25.md](plans/multi-agent-panel-consensus-2026-06-25.md). Run `make all` before claiming routing or governance changes are verified.
3434

3535
## Directory Map
3636

reflective-prompt-library/plans/QUALITY_GATES_SUMMARY.md

Lines changed: 2 additions & 2 deletions
@@ -314,7 +314,7 @@ ROUTE-002 measures unseen phrasing separately from ROUTE-001. Round 7 (2026-06-2
314314
2. **ROUTE-001/002/003 in CI** — 128 + 102 + 53 paraphrases at 100% consistency (seeded fixtures); `validate_route_fixture.py` gates minimum coverage
315315
3. **Governance validators** — links, lint, governance metadata, PROJECT_KNOWLEDGE, benchmark fixture, skill examples
316316
4. **Harness policy docs** — CONTRIBUTING, AGENTS, SKILL_INSTALLATION, maintenance playbook
317-
5. **Doc anti-drift** — `test_routing_contract.py`, cheatsheet parity tests, `test_readme_governance.py`, `test_thinking_prompts_eval_harness.py`, `test_engineering_prompts_eval_harness.py`, `test_prompt_cross_links.py`, `test_core_prompts_eval_harness.py`, `test_human_review_library_registry.py`, `test_prompt_skill_links_library_registry.py`, `test_prompt_contract_library_registry.py`, `test_prompt_primary_workflow_surface_library_registry.py`, `test_workflow_skill_coverage_library_registry.py`, `test_agent_prompts_eval_harness.py`, `test_context_prompts_eval_harness.py`, `test_domain_prompts_eval_harness.py`, `test_repo_prompts_eval_harness.py`, `test_validate_governance.py`, `test_validate_links.py`, `test_lint_skills.py`, `test_skill_module_contract.py` (Escalation subsection + Trigger/Methods/Output/Never; 640+ pytest anti-drift suite in CI); reciprocal thinking-lens ↔ skill checks and `00-core` + composable `Primary workflow surface(s)` ↔ `*_SKILL_LINKS` parity in `test_prompt_cross_links.py` (including strict Primary workflow surfaces parity via `test_thinking_lens_primary_surfaces_match_consumer_graph`); Human Review + Escalation route-target guards in thinking/skill contract tests; composable `Primary workflow surface(s)` / Supporting-lens preamble guards and composable `## Human Review` preamble guards (route to `reflective-risk`) via `prompt_eval_helpers.assert_human_review_preamble` in `test_*_prompts_eval_harness.py`; frozen `*_HUMAN_REVIEW_REQUIRED` / `*_HUMAN_REVIEW_EXEMPT` set parity across all prompt categories (Round 90); library-wide contract heading registry (`PROMPT_CONTRACT_HEADINGS`, Round 93); workflow skill coverage registry (`*_COVER_WORKFLOW_SKILLS`, Round 95)
317+
5. **Doc anti-drift** — `test_routing_contract.py`, cheatsheet parity tests, `test_readme_governance.py`, `test_thinking_prompts_eval_harness.py`, `test_engineering_prompts_eval_harness.py`, `test_prompt_cross_links.py`, `test_core_prompts_eval_harness.py`, `test_human_review_library_registry.py`, `test_prompt_skill_links_library_registry.py`, `test_prompt_contract_library_registry.py`, `test_prompt_primary_workflow_surface_library_registry.py`, `test_workflow_skill_coverage_library_registry.py`, `test_prompt_eval_harness_score_library_registry.py`, `test_agent_prompts_eval_harness.py`, `test_context_prompts_eval_harness.py`, `test_domain_prompts_eval_harness.py`, `test_repo_prompts_eval_harness.py`, `test_validate_governance.py`, `test_validate_links.py`, `test_lint_skills.py`, `test_skill_module_contract.py` (Escalation subsection + Trigger/Methods/Output/Never; 650+ pytest anti-drift suite in CI); reciprocal thinking-lens ↔ skill checks and `00-core` + composable `Primary workflow surface(s)` ↔ `*_SKILL_LINKS` parity in `test_prompt_cross_links.py` (including strict Primary workflow surfaces parity via `test_thinking_lens_primary_surfaces_match_consumer_graph`); Human Review + Escalation route-target guards in thinking/skill contract tests; composable `Primary workflow surface(s)` / Supporting-lens preamble guards and composable `## Human Review` preamble guards (route to `reflective-risk`) via `prompt_eval_helpers.assert_human_review_preamble` in `test_*_prompts_eval_harness.py`; frozen `*_HUMAN_REVIEW_REQUIRED` / `*_HUMAN_REVIEW_EXEMPT` set parity across all prompt categories (Round 90); library-wide contract heading registry (`PROMPT_CONTRACT_HEADINGS`, Round 93); workflow skill coverage registry (`*_COVER_WORKFLOW_SKILLS`, Round 95); eval_harness score floor registry (`PROMPT_EVAL_MIN_SCORE`, Round 96)
318318

319319
### Ongoing maintenance (not blockers)
320320

@@ -384,4 +384,4 @@ Phase 1 quality-gate tooling and documentation are **complete**. Routing consist
384384
- ✅ Benchmark fixture gate plus optional manual benchmark runs
385385
- ✅ Research-backed design decisions
386386

387-
The project is positioned to grow sustainably with quality discipline built in from the start. **No open implementation blockers** remain from panel Rounds 1–95; work is recurrence-gated maintenance per playbook. The next measurable quality target is **holdout expansion before router tuning** and optional manual baseline-vs-skill benchmark runs — not shipping new core skills without promotion evidence.
387+
The project is positioned to grow sustainably with quality discipline built in from the start. **No open implementation blockers** remain from panel Rounds 1–96; work is recurrence-gated maintenance per playbook. The next measurable quality target is **holdout expansion before router tuning** and optional manual baseline-vs-skill benchmark runs — not shipping new core skills without promotion evidence.

reflective-prompt-library/plans/multi-agent-panel-consensus-2026-06-25.md

Lines changed: 52 additions & 1 deletion
@@ -2829,4 +2829,55 @@ User directive (repeat): review prompts, plans, skills, and Socratic/critical-th
28292829

28302830
---
28312831

2832-
**Resealed 2026-06-25** after **Round 95** (options GW–HA). Workflow skill coverage is now library-registry checked across all `00-core`–`06-repo` categories with shared `assert_category_workflow_skill_coverage` and frozen tuples per harness (`01-thinking` exempt via empty tuple). Holdout expansion remains recurrence-gated maintenance.
2832+
## Round 96 — cross-category eval_harness score floor library registry (2026-06-25)
2833+
2834+
**Options HB–HF** | Six-lens panel (Opus, Codex, Gemini, Composer, Sakana, GLM)
2835+
2836+
### Round 96 options
2837+
2838+
| ID | Proposal | Verdict |
2839+
| --- | --- | --- |
2840+
| HB | DRY `assert_prompt_meets_eval_harness_floor` in `prompt_eval_helpers.py` | **Agree** |
2841+
| HC | `test_prompt_eval_harness_score_library_registry.py` — score floor registry + library-wide sweep | **Agree** |
2842+
| HD | GLOSSARY playbook step 28 + governance sync | **Agree** |
2843+
| HE | ROUTE holdout expansion | **Defer** |
2844+
| HF | Router / tenth skill / benchmark CI | **Reject** |
2845+
2846+
### Round 96 verdict table
2847+
2848+
| ID | Option | Verdict | Action |
2849+
| --- | --- | --- | --- |
2850+
| HB | DRY eval_harness score floor helper | **Agree** | `assert_prompt_meets_eval_harness_floor` |
2851+
| HC | Score floor library registry | **Agree** | `test_prompt_eval_harness_score_library_registry.py` |
2852+
| HD | Playbook + docs | **Agree** | step 28; panel round 96 sync |
2853+
| HE | Holdout expansion | **Defer** | maintenance |
2854+
| HF | Router/tenth skill/benchmark CI | **Reject** | no change |
2855+
2856+
### Socratic rationale (Round 96)
2857+
2858+
- **Opus:** Rounds 91–95 closed HR, cross-link, contract, primary-surface, and workflow-coverage registries; per-category `meets_eval_harness_floor` guards remain duplicated with no library-wide falsifiability.
2859+
- **Codex:** Centralizing `assert_prompt_meets_eval_harness_floor` prevents score-floor drift; registry sweep catches regressions across all 49 prompts in one place.
2860+
- **Gemini:** Reject duplicating `EvalHarness.evaluate_file` logic in seven harness files — token/cost of maintenance, not runtime.
2861+
- **Composer:** IDE adopters need one playbook step when bumping `PROMPT_EVAL_MIN_SCORE`; library registry mirrors R93 contract pattern.
2862+
- **Sakana:** Score floor is orthogonal to thinking consumer graph — all seven categories including `01-thinking` belong in registry.
2863+
- **GLM:** TW surface unchanged; score floor stays English-only harness policy.
2864+
2865+
**All roles agree.**
2866+
2867+
## Implemented Changes (Round 96)
2868+
2869+
- `plans/tests/prompt_eval_helpers.py`: `assert_prompt_meets_eval_harness_floor`
2870+
- `plans/tests/test_*_prompts_eval_harness.py`: DRY eval_harness score floor guards
2871+
- `plans/tests/test_prompt_eval_harness_score_library_registry.py`: cross-category registry + library glob parity + library-wide sweep
2872+
- `GLOSSARY.md`: playbook Rounds 1–96; step 28 for eval_harness score floor library registry
2873+
- `QUALITY_GATES_SUMMARY.md`: score floor registry note; panel Rounds 1–96; 650+ pytest floor
2874+
- `PROJECT_KNOWLEDGE.md`: Decision Index Round 96 entry
2875+
- `README.md`, `reflective-prompt-library/README.md`, `test_readme_governance.py`: panel round 96 sync
2876+
2877+
## Verification (Round 96)
2878+
2879+
- `make all`: 652 pytest + ROUTE-001/002/003 100%
2880+
2881+
---
2882+
2883+
**Resealed 2026-06-25** after **Round 96** (options HB–HF). Eval_harness score floors are now library-registry checked across all `00-core`–`06-repo` categories with shared `assert_prompt_meets_eval_harness_floor` and per-category `MIN_SCORE` aliases to `PROMPT_EVAL_MIN_SCORE`. Holdout expansion remains recurrence-gated maintenance.
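
The body of `test_prompt_eval_harness_score_library_registry.py` is not shown in this view of the commit. As a rough illustration of what "cross-category registry + library glob parity + library-wide sweep" could mean, here is a minimal sketch. The registry name `CATEGORY_MIN_SCORES`, the helper names `registry_problems` and `sweep_below_floor`, the two-digit directory naming rule, and the floor value `80.0` are all assumptions made for this example, not taken from the repository.

```python
import tempfile
from pathlib import Path

# Assumed floor value for illustration; the real constant is defined in
# prompt_eval_helpers.py and its value is not visible in this diff.
PROMPT_EVAL_MIN_SCORE = 80.0

# Hypothetical registry: category directory -> that harness module's MIN_SCORE alias.
CATEGORY_MIN_SCORES = {
    "00-core": PROMPT_EVAL_MIN_SCORE,
    "01-thinking": PROMPT_EVAL_MIN_SCORE,
}


def registry_problems(library_root, registry, floor):
    """Report registry/disk mismatches and per-category floors that drifted."""
    problems = []
    # Glob parity: every numbered category directory must be registered, and vice versa.
    on_disk = {p.name for p in library_root.iterdir() if p.is_dir() and p.name[:2].isdigit()}
    for missing in sorted(on_disk - set(registry)):
        problems.append(f"{missing}: category on disk but not in registry")
    for stale in sorted(set(registry) - on_disk):
        problems.append(f"{stale}: registered but no such directory")
    # Alias parity: each per-category MIN_SCORE must equal the shared floor.
    for category, min_score in sorted(registry.items()):
        if min_score != floor:
            problems.append(f"{category}: MIN_SCORE {min_score} drifted from floor {floor}")
    return problems


def sweep_below_floor(library_root, registry, score_fn, floor):
    """Library-wide sweep: list (prompt name, score) for every prompt under the floor."""
    below = []
    for category in sorted(registry):
        category_dir = library_root / category
        if not category_dir.is_dir():
            continue
        for prompt in sorted(category_dir.glob("*.md")):
            score = score_fn(prompt)
            if score < floor:
                below.append((prompt.name, score))
    return below
```

In a real test module, `score_fn` would presumably wrap `EvalHarness.evaluate_file`, and an empty result from both functions would be the passing condition.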

reflective-prompt-library/plans/tests/prompt_eval_helpers.py

Lines changed: 14 additions & 0 deletions
@@ -128,3 +128,17 @@ def assert_category_workflow_skill_coverage(
128128
for skill in required_skills:
129129
assert skill in text, f"{category_label} should reference {skill}"
130130

131+
def assert_prompt_meets_eval_harness_floor(
132+
    prompt_path: Path,
133+
    harness,
134+
    repo_root: str,
135+
    min_score: float = PROMPT_EVAL_MIN_SCORE,
136+
) -> None:
137+
    """Prompt must meet eval_harness score floor (default PROMPT_EVAL_MIN_SCORE)."""
138+
    rel = str(prompt_path.relative_to(repo_root))
139+
    result = harness.evaluate_file(rel)
140+
    assert result["score"] >= min_score, (
141+
        f"{prompt_path.name} eval_harness score {result['score']}% < {min_score}%: "
142+
        f"{[(c['id'], c['result']) for c in result['checks']]}"
143+
    )

reflective-prompt-library/plans/tests/test_agent_prompts_eval_harness.py

Lines changed: 2 additions & 7 deletions
@@ -9,7 +9,7 @@
99
sys.path.insert(0, str(Path(__file__).parent))
1010

1111
from eval_harness import EvalHarness # noqa: E402
12-
from prompt_eval_helpers import assert_category_workflow_skill_coverage, assert_human_review_preamble, assert_primary_workflow_surface_preamble, prompts_with_human_review, assert_human_review_required_matches_detection, assert_human_review_exempt_have_no_preamble_section, assert_human_review_sets_partition, PROMPT_CONTRACT_HEADINGS, PROMPT_EVAL_MIN_SCORE, assert_prompt_contract_headings # noqa: E402
12+
from prompt_eval_helpers import assert_category_workflow_skill_coverage, assert_human_review_preamble, assert_primary_workflow_surface_preamble, prompts_with_human_review, assert_human_review_required_matches_detection, assert_human_review_exempt_have_no_preamble_section, assert_human_review_sets_partition, PROMPT_CONTRACT_HEADINGS, PROMPT_EVAL_MIN_SCORE, assert_prompt_contract_headings, assert_prompt_meets_eval_harness_floor # noqa: E402
1313

1414
REQUIRED_HEADINGS = PROMPT_CONTRACT_HEADINGS
1515
MIN_SCORE = PROMPT_EVAL_MIN_SCORE
@@ -55,12 +55,7 @@ def test_agent_prompt_has_contract_headings(prompt_path: Path):
5555

5656
@pytest.mark.parametrize("prompt_path", AGENT_PROMPTS, ids=lambda p: p.name)
5757
def test_agent_prompt_meets_eval_harness_floor(prompt_path: Path, harness: EvalHarness):
58-
    rel = str(prompt_path.relative_to(REPO_ROOT))
59-
    result = harness.evaluate_file(rel)
60-
    assert result["score"] >= MIN_SCORE, (
61-
        f"{prompt_path.name} eval_harness score {result['score']}% < {MIN_SCORE}%: "
62-
        f"{[(c['id'], c['result']) for c in result['checks']]}"
63-
    )
58+
    assert_prompt_meets_eval_harness_floor(prompt_path, harness, REPO_ROOT, MIN_SCORE)
6459

6560
def test_agent_prompts_reference_workflow_skills():
6661
    for prompt_path in AGENT_PROMPTS:

reflective-prompt-library/plans/tests/test_context_prompts_eval_harness.py

Lines changed: 2 additions & 7 deletions
@@ -9,7 +9,7 @@
99
sys.path.insert(0, str(Path(__file__).parent))
1010

1111
from eval_harness import EvalHarness # noqa: E402
12-
from prompt_eval_helpers import assert_category_workflow_skill_coverage, assert_human_review_preamble, assert_primary_workflow_surface_preamble, prompts_with_human_review, assert_human_review_required_matches_detection, assert_human_review_exempt_have_no_preamble_section, assert_human_review_sets_partition, PROMPT_CONTRACT_HEADINGS, PROMPT_EVAL_MIN_SCORE, assert_prompt_contract_headings # noqa: E402
12+
from prompt_eval_helpers import assert_category_workflow_skill_coverage, assert_human_review_preamble, assert_primary_workflow_surface_preamble, prompts_with_human_review, assert_human_review_required_matches_detection, assert_human_review_exempt_have_no_preamble_section, assert_human_review_sets_partition, PROMPT_CONTRACT_HEADINGS, PROMPT_EVAL_MIN_SCORE, assert_prompt_contract_headings, assert_prompt_meets_eval_harness_floor # noqa: E402
1313

1414
REQUIRED_HEADINGS = PROMPT_CONTRACT_HEADINGS
1515
MIN_SCORE = PROMPT_EVAL_MIN_SCORE
@@ -52,12 +52,7 @@ def test_context_prompt_has_contract_headings(prompt_path: Path):
5252

5353
@pytest.mark.parametrize("prompt_path", CONTEXT_PROMPTS, ids=lambda p: p.name)
5454
def test_context_prompt_meets_eval_harness_floor(prompt_path: Path, harness: EvalHarness):
55-
    rel = str(prompt_path.relative_to(REPO_ROOT))
56-
    result = harness.evaluate_file(rel)
57-
    assert result["score"] >= MIN_SCORE, (
58-
        f"{prompt_path.name} eval_harness score {result['score']}% < {MIN_SCORE}%: "
59-
        f"{[(c['id'], c['result']) for c in result['checks']]}"
60-
    )
55+
    assert_prompt_meets_eval_harness_floor(prompt_path, harness, REPO_ROOT, MIN_SCORE)
6156

6257
def test_context_prompts_reference_workflow_skills():
6358
    for prompt_path in CONTEXT_PROMPTS:

reflective-prompt-library/plans/tests/test_core_prompts_eval_harness.py

Lines changed: 3 additions & 7 deletions
@@ -13,7 +13,8 @@
1313
    PROMPT_CONTRACT_HEADINGS,
1414
    PROMPT_EVAL_MIN_SCORE,
1515
    assert_primary_workflow_surface_preamble,
16-
    assert_category_workflow_skill_coverage, assert_prompt_contract_headings, # noqa: E402
16+
    assert_category_workflow_skill_coverage, assert_prompt_contract_headings,
17+
    assert_prompt_meets_eval_harness_floor, # noqa: E402
1718
    assert_human_review_exempt_have_no_preamble_section,
1819
    assert_human_review_preamble,
1920
    assert_human_review_required_matches_detection,
@@ -61,12 +62,7 @@ def test_core_prompt_has_contract_headings(prompt_path: Path):
6162

6263
@pytest.mark.parametrize("prompt_path", CORE_PROMPTS, ids=lambda p: p.name)
6364
def test_core_prompt_meets_eval_harness_floor(prompt_path: Path, harness: EvalHarness):
64-
    rel = str(prompt_path.relative_to(REPO_ROOT))
65-
    result = harness.evaluate_file(rel)
66-
    assert result["score"] >= MIN_SCORE, (
67-
        f"{prompt_path.name} eval_harness score {result['score']}% < {MIN_SCORE}%: "
68-
        f"{[(c['id'], c['result']) for c in result['checks']]}"
69-
    )
65+
    assert_prompt_meets_eval_harness_floor(prompt_path, harness, REPO_ROOT, MIN_SCORE)
7066

7167

7268
def test_core_prompts_reference_workflow_skills():
