Commit 64e82e0

Round 90: library-wide Human Review required/exempt set parity
Add DRY HR set parity helpers in prompt_eval_helpers and frozen *_HUMAN_REVIEW_REQUIRED / *_HUMAN_REVIEW_EXEMPT sets with partition tests for 01-thinking through 06-repo harness files. Refactor 00-core tests to shared helpers. Sync governance (GLOSSARY step 22, panel record, READMEs, QUALITY_GATES 580+ pytest floor).
1 parent 6dd08db commit 64e82e0

16 files changed

Lines changed: 350 additions & 28 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ Full library docs: [reflective-prompt-library/README.md](reflective-prompt-libra
 ## Governance

 - **Contributing:** [CONTRIBUTING.md](CONTRIBUTING.md) — quality gates, routing maintenance (R8–R12), `make all`
-- **Panel record:** [multi-agent-panel-consensus](reflective-prompt-library/plans/multi-agent-panel-consensus-2026-06-25.md) — six-lens Socratic consensus (Rounds 1–89)
+- **Panel record:** [multi-agent-panel-consensus](reflective-prompt-library/plans/multi-agent-panel-consensus-2026-06-25.md) — six-lens Socratic consensus (Rounds 1–90)
 - **Operator playbook:** [GLOSSARY.md](reflective-prompt-library/GLOSSARY.md) — Governance Maintenance Playbook

 The repository contains:

reflective-prompt-library/GLOSSARY.md

Lines changed: 2 additions & 1 deletion
@@ -337,7 +337,7 @@ Curated top-of-cheatsheet summary of high-confusion routing traps (ROUTE-002 hol

 ## Governance Maintenance Playbook / 治理維護手冊

-Ongoing upkeep after panel close (Rounds 1–89). Not agent instructions — operator checklist.
+Ongoing upkeep after panel close (Rounds 1–90). Not agent instructions — operator checklist.

 **Operational test:** Before router tuning, add fresh ROUTE-002/003 holdout phrases; run `make all`; record decisions in `PROJECT_KNOWLEDGE.md` Decision Index when governance surface changes.

@@ -362,3 +362,4 @@ Ongoing upkeep after panel close (Rounds 1–89). Not agent instructions — ope
 19. When editing Human Review guards, use `prompt_eval_helpers.assert_human_review_preamble` in all `test_*_prompts_eval_harness.py` files (thinking lenses + composable categories).
 20. When adding or editing risk-bearing `00-core/` prompts with `## Human Review`, keep preamble escalation routed to `reflective-risk` and run `test_core_prompts_eval_harness.py` Human Review guards via `prompt_eval_helpers.py`.
 21. When editing `00-core/` Human Review coverage, keep `CORE_HUMAN_REVIEW_REQUIRED` and `CORE_HUMAN_REVIEW_EXEMPT` in `test_core_prompts_eval_harness.py` aligned with preamble `## Human Review` sections; run core HR parity tests.
+22. When editing Human Review coverage on thinking lenses or composable prompts (`01-thinking`–`06-repo`), keep frozen `*_HUMAN_REVIEW_REQUIRED` / `*_HUMAN_REVIEW_EXEMPT` sets in `test_*_prompts_eval_harness.py` aligned with preamble `## Human Review` sections; use `prompt_eval_helpers.assert_human_review_*` parity helpers and run HR set partition tests.
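
Step 22 leaves the frozen sets to be edited by hand. As a rough aid, the sketch below renders a sorted `frozenset` literal in the same layout the harness files use, so a maintainer could paste it after changing coverage. `render_frozenset` is a hypothetical helper, not part of this commit, and the example names are taken from the `03-context` sets in this diff.

```python
from collections.abc import Iterable


def render_frozenset(constant: str, names: Iterable[str]) -> str:
    """Render a frozenset literal, one sorted name per line.

    Hypothetical convenience helper (assumption): it only formats text and
    does not read or write any harness file.
    """
    lines = [f"{constant} = frozenset({{"]
    lines += [f'    "{name}",' for name in sorted(names)]
    lines.append("})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        render_frozenset(
            "CONTEXT_HUMAN_REVIEW_REQUIRED",
            {"small-context.md", "context-handoff.md", "low-token.md"},
        )
    )
```

Sorting the names keeps later diffs on these sets minimal, which matches how the committed sets are already laid out.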

reflective-prompt-library/PROJECT_KNOWLEDGE.md

Lines changed: 1 addition & 0 deletions
@@ -73,6 +73,7 @@ deferred promotions are recurrence-gated — see [panel backlog](plans/multi-age
 ## Decision Index

 - 2026-06-25 Round 85 panel — composable prompt Primary workflow surface preamble guards (`test_*_prompts_eval_harness.py`) + Supporting-lens exemption → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
+- 2026-06-25 Round 90 panel — library-wide Human Review required/exempt set parity (`01-thinking`–`06-repo`) + DRY `prompt_eval_helpers` HR set guards → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
 - 2026-06-25 Round 89 panel — `00-core` Human Review required/exempt set parity (`CORE_HUMAN_REVIEW_REQUIRED` / `CORE_HUMAN_REVIEW_EXEMPT`) → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
 - 2026-06-25 Round 88 panel — `00-core` Human Review preamble guards on risk-bearing prompts + `test_core_prompts_eval_harness.py` → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
 - 2026-06-25 Round 87 panel — Human Review helper DRY + GLOSSARY playbook step repair → [record](plans/multi-agent-panel-consensus-2026-06-25.md)

reflective-prompt-library/README.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ Pick **Strictness L1–L6** first (`skills/reflective-dispatch/SKILL.md`, [GLOSS

 ## Governance Panel Record

-Multi-agent Socratic consensus on project goals and the nine skills (Rounds 1–89, options A–FW) is recorded in [plans/multi-agent-panel-consensus-2026-06-25.md](plans/multi-agent-panel-consensus-2026-06-25.md). Run `make all` before claiming routing or governance changes are verified.
+Multi-agent Socratic consensus on project goals and the nine skills (Rounds 1–90, options A–GB) is recorded in [plans/multi-agent-panel-consensus-2026-06-25.md](plans/multi-agent-panel-consensus-2026-06-25.md). Run `make all` before claiming routing or governance changes are verified.

 ## Directory Map

reflective-prompt-library/plans/QUALITY_GATES_SUMMARY.md

Lines changed: 2 additions & 2 deletions
@@ -314,7 +314,7 @@ ROUTE-002 measures unseen phrasing separately from ROUTE-001. Round 7 (2026-06-2
 2. **ROUTE-001/002/003 in CI** — 128 + 102 + 53 paraphrases at 100% consistency (seeded fixtures); `validate_route_fixture.py` gates minimum coverage
 3. **Governance validators** — links, lint, governance metadata, PROJECT_KNOWLEDGE, benchmark fixture, skill examples
 4. **Harness policy docs** — CONTRIBUTING, AGENTS, SKILL_INSTALLATION, maintenance playbook
-5. **Doc anti-drift** — `test_routing_contract.py`, cheatsheet parity tests, `test_readme_governance.py`, `test_thinking_prompts_eval_harness.py`, `test_engineering_prompts_eval_harness.py`, `test_prompt_cross_links.py`, `test_core_prompts_eval_harness.py`, `test_agent_prompts_eval_harness.py`, `test_context_prompts_eval_harness.py`, `test_domain_prompts_eval_harness.py`, `test_repo_prompts_eval_harness.py`, `test_validate_governance.py`, `test_validate_links.py`, `test_lint_skills.py`, `test_skill_module_contract.py` (Escalation subsection + Trigger/Methods/Output/Never; 560+ pytest anti-drift suite in CI); reciprocal thinking-lens ↔ skill checks and `00-core` + composable `Primary workflow surface(s)` ↔ `*_SKILL_LINKS` parity in `test_prompt_cross_links.py` (including strict Primary workflow surfaces parity via `test_thinking_lens_primary_surfaces_match_consumer_graph`); Human Review + Escalation route-target guards in thinking/skill contract tests; composable `Primary workflow surface(s)` / Supporting-lens preamble guards and composable `## Human Review` preamble guards (route to `reflective-risk`) via `prompt_eval_helpers.assert_human_review_preamble` in `test_*_prompts_eval_harness.py`
+5. **Doc anti-drift** — `test_routing_contract.py`, cheatsheet parity tests, `test_readme_governance.py`, `test_thinking_prompts_eval_harness.py`, `test_engineering_prompts_eval_harness.py`, `test_prompt_cross_links.py`, `test_core_prompts_eval_harness.py`, `test_agent_prompts_eval_harness.py`, `test_context_prompts_eval_harness.py`, `test_domain_prompts_eval_harness.py`, `test_repo_prompts_eval_harness.py`, `test_validate_governance.py`, `test_validate_links.py`, `test_lint_skills.py`, `test_skill_module_contract.py` (Escalation subsection + Trigger/Methods/Output/Never; 580+ pytest anti-drift suite in CI); reciprocal thinking-lens ↔ skill checks and `00-core` + composable `Primary workflow surface(s)` ↔ `*_SKILL_LINKS` parity in `test_prompt_cross_links.py` (including strict Primary workflow surfaces parity via `test_thinking_lens_primary_surfaces_match_consumer_graph`); Human Review + Escalation route-target guards in thinking/skill contract tests; composable `Primary workflow surface(s)` / Supporting-lens preamble guards and composable `## Human Review` preamble guards (route to `reflective-risk`) via `prompt_eval_helpers.assert_human_review_preamble` in `test_*_prompts_eval_harness.py`; frozen `*_HUMAN_REVIEW_REQUIRED` / `*_HUMAN_REVIEW_EXEMPT` set parity across all prompt categories (Round 90)

 ### Ongoing maintenance (not blockers)

@@ -384,4 +384,4 @@ Phase 1 quality-gate tooling and documentation are **complete**. Routing consist
 - ✅ Benchmark fixture gate plus optional manual benchmark runs
 - ✅ Research-backed design decisions

-The project is positioned to grow sustainably with quality discipline built in from the start. **No open implementation blockers** remain from panel Rounds 1–89; work is recurrence-gated maintenance per playbook. The next measurable quality target is **holdout expansion before router tuning** and optional manual baseline-vs-skill benchmark runs — not shipping new core skills without promotion evidence.
+The project is positioned to grow sustainably with quality discipline built in from the start. **No open implementation blockers** remain from panel Rounds 1–90; work is recurrence-gated maintenance per playbook. The next measurable quality target is **holdout expansion before router tuning** and optional manual baseline-vs-skill benchmark runs — not shipping new core skills without promotion evidence.

reflective-prompt-library/plans/multi-agent-panel-consensus-2026-06-25.md

Lines changed: 47 additions & 0 deletions
@@ -2539,3 +2539,50 @@ User directive (repeat): review prompts, plans, skills, and Socratic/critical-th
 ## Panel status (updated)

 **Resealed 2026-06-25** after **Round 89** (options FT–FW). `00-core` Human Review coverage is now explicit via frozen required/exempt sets; full library HR contract parity closed (thinking R81, composable R86, core R88–R89). Holdout expansion remains recurrence-gated maintenance.
+
+---
+
+## Round 90 — library-wide Human Review required/exempt set parity (2026-06-25)
+
+**Options FX–GB** | Six-lens panel (Opus, Codex, Gemini, Composer, Sakana, GLM)
+
+### Round 90 options
+
+| ID | Proposal | Verdict |
+| --- | --- | --- |
+| FX | DRY Human Review set parity helpers in `prompt_eval_helpers.py` + refactor `test_core_prompts_eval_harness.py` | **Agree** |
+| FY | Frozen `*_HUMAN_REVIEW_REQUIRED` / `*_HUMAN_REVIEW_EXEMPT` sets + pytest parity for `01-thinking`–`06-repo` | **Agree** |
+| FZ | GLOSSARY playbook step 22 + governance sync | **Agree** |
+| GA | ROUTE holdout expansion | **Defer** |
+| GB | Router / tenth skill / benchmark CI | **Reject** |
+
+### Round 90 verdict table
+
+| ID | Option | Verdict | Action |
+| --- | --- | --- | --- |
+| FX | HR set parity helpers | **Agree** | `assert_human_review_required_matches_detection`, `assert_human_review_exempt_have_no_preamble_section`, `assert_human_review_sets_partition` |
+| FY | Library HR frozen sets | **Agree** | codify required/exempt per category in all `test_*_prompts_eval_harness.py` files |
+| FZ | Playbook + docs | **Agree** | step 22; panel round 90 sync |
+| GA | Holdout expansion | **Defer** | maintenance |
+| GB | Router/tenth skill/benchmark CI | **Reject** | no change |
+
+**All roles agree.**
+
+## Implemented Changes (Round 90)
+
+- `plans/tests/prompt_eval_helpers.py`: shared Human Review set parity helpers
+- `plans/tests/test_core_prompts_eval_harness.py`: refactor to shared helpers
+- `plans/tests/test_{thinking,engineering,context,agent,domain,repo}_prompts_eval_harness.py`: frozen HR required/exempt sets + partition parity tests
+- `GLOSSARY.md`: playbook Rounds 1–90; step 22 for library-wide HR set parity
+- `QUALITY_GATES_SUMMARY.md`: HR set parity note; panel Rounds 1–90; 580+ pytest floor
+- `PROJECT_KNOWLEDGE.md`: Decision Index Round 90 entry
+- `README.md`, `reflective-prompt-library/README.md`, `test_readme_governance.py`: panel round 90 sync
+
+## Verification (Round 90)
+
+- `make all`: pytest + ROUTE-001/002/003 100%
+
+## Panel status (updated)
+
+**Resealed 2026-06-25** after **Round 90** (options FX–GB). Human Review coverage is now explicit via frozen required/exempt sets across all prompt categories (`00-core`–`06-repo`). Holdout expansion remains recurrence-gated maintenance.
+

reflective-prompt-library/plans/tests/prompt_eval_helpers.py

Lines changed: 36 additions & 0 deletions
@@ -27,3 +27,39 @@ def assert_human_review_preamble(prompt_path: Path) -> None:
     assert "reflective-risk" in preamble, (
         f"{prompt_path.name} Human Review should route to reflective-risk"
     )
+
+
+def assert_human_review_required_matches_detection(
+    required: frozenset[str],
+    prompts: tuple[Path, ...],
+) -> None:
+    """Frozen required set must match prompts that declare ## Human Review in preambles."""
+    detected = {p.name for p in prompts_with_human_review(prompts)}
+    assert detected == required, (
+        f"detected Human Review preambles {sorted(detected)} "
+        f"!= frozen required {sorted(required)}"
+    )
+
+
+def assert_human_review_exempt_have_no_preamble_section(
+    exempt: frozenset[str],
+    prompts: tuple[Path, ...],
+) -> None:
+    """Exempt prompts keep Human Review cues in fenced templates only."""
+    by_name = {p.name: p for p in prompts}
+    for name in sorted(exempt):
+        assert not has_human_review_preamble(by_name[name]), (
+            f"{name} should not declare ## Human Review in preamble"
+        )
+
+
+def assert_human_review_sets_partition(
+    all_names: frozenset[str],
+    required: frozenset[str],
+    exempt: frozenset[str],
+) -> None:
+    """Required + exempt sets must cover all prompts without overlap."""
+    assert required | exempt == all_names, (
+        f"required ∪ exempt {sorted(required | exempt)} != all prompts {sorted(all_names)}"
+    )
+    assert not required & exempt, "required and exempt Human Review sets must not overlap"
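
The helpers above rely on `prompts_with_human_review` and `has_human_review_preamble`, whose bodies are not part of this diff. The following is a minimal sketch of how such detection might work, assuming that "preamble" means everything before the first fenced block and that the heading must match `## Human Review` exactly. The `sketch_*` names and the two temporary prompt files are hypothetical stand-ins, not the library's implementation.

```python
import tempfile
from pathlib import Path

# Built from a single backtick so this sketch can live inside a fenced block.
FENCE = "`" * 3


def sketch_preamble(text: str) -> str:
    """Assumed definition: the preamble is everything before the first fence."""
    kept = []
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            break
        kept.append(line)
    return "\n".join(kept)


def sketch_has_human_review_preamble(path: Path) -> bool:
    """True when a '## Human Review' heading appears in the preamble."""
    preamble = sketch_preamble(path.read_text(encoding="utf-8"))
    return any(line.strip() == "## Human Review" for line in preamble.splitlines())


def sketch_prompts_with_human_review(prompts: tuple[Path, ...]) -> tuple[Path, ...]:
    """Filter prompts down to those with a preamble-level Human Review heading."""
    return tuple(p for p in prompts if sketch_has_human_review_preamble(p))


def demo() -> set[str]:
    """Build one 'required' style and one 'exempt' style prompt, then detect."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Heading sits in the preamble, so this prompt counts as required.
        (root / "risky.md").write_text(
            "# Risky\n\n## Human Review\n\nEscalate to reflective-risk.\n\n"
            f"{FENCE}text\ntemplate body\n{FENCE}\n",
            encoding="utf-8",
        )
        # Heading appears only inside the fenced template, so it is exempt.
        (root / "opener.md").write_text(
            f"# Opener\n\n{FENCE}text\n## Human Review\n{FENCE}\n",
            encoding="utf-8",
        )
        prompts = tuple(sorted(root.glob("*.md")))
        return {p.name for p in sketch_prompts_with_human_review(prompts)}


if __name__ == "__main__":
    print(demo())
```

Under these assumptions only `risky.md` is detected, which is the distinction the required and exempt helpers depend on.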

reflective-prompt-library/plans/tests/test_agent_prompts_eval_harness.py

Lines changed: 41 additions & 1 deletion
@@ -9,7 +9,7 @@
 sys.path.insert(0, str(Path(__file__).parent))

 from eval_harness import EvalHarness  # noqa: E402
-from prompt_eval_helpers import assert_human_review_preamble, prompts_with_human_review  # noqa: E402
+from prompt_eval_helpers import assert_human_review_preamble, prompts_with_human_review, assert_human_review_required_matches_detection, assert_human_review_exempt_have_no_preamble_section, assert_human_review_sets_partition  # noqa: E402

 AGENT_DIR = Path(__file__).parent.parent.parent / "04-agent"
 REPO_ROOT = str(Path(__file__).parent.parent.parent.parent)
@@ -25,6 +25,22 @@
 AGENT_PROMPTS = tuple(sorted(AGENT_DIR.glob("*.md")))
 AGENT_PROMPTS_WITH_HUMAN_REVIEW = prompts_with_human_review(AGENT_PROMPTS)
 SUPPORTING_LENS_AGENT_PROMPTS = frozenset({"runtime-trust-boundary.md"})
+AGENT_HUMAN_REVIEW_REQUIRED = frozenset({
+    "agent-scaffold-provenance.md",
+    "agent-selection.md",
+    "memory-consolidation.md",
+    "retro.md",
+    "review-rating-fix.md",
+    "runtime-trust-boundary.md",
+    "sop-compiler.md",
+})
+AGENT_HUMAN_REVIEW_EXEMPT = frozenset({
+    "workflow-engine.md",
+    "workflow-recipes.md",
+})
+
+AGENT_PROMPTS_WITH_HUMAN_REVIEW = prompts_with_human_review(AGENT_PROMPTS)
+


 @pytest.fixture(scope="module")
@@ -87,3 +103,27 @@ def test_agent_prompts_have_workflow_surface_preamble_line():
 def test_agent_prompt_has_human_review_section(prompt_path: Path):
     """Prompts with Human Review declare escalation outside zh-TW templates."""
     assert_human_review_preamble(prompt_path)
+
+def test_agent_human_review_required_set_matches_detection():
+    """Frozen required set must match prompts that declare ## Human Review in preambles."""
+    assert_human_review_required_matches_detection(
+        AGENT_HUMAN_REVIEW_REQUIRED, AGENT_PROMPTS
+    )
+
+
+def test_agent_human_review_exempt_prompts_have_no_preamble_section():
+    """Exempt prompts keep Human Review cues in fenced templates only."""
+    assert_human_review_exempt_have_no_preamble_section(
+        AGENT_HUMAN_REVIEW_EXEMPT, AGENT_PROMPTS
+    )
+
+
+def test_agent_human_review_sets_partition_prompts():
+    """Required + exempt sets must cover all prompts without overlap."""
+    all_names = frozenset(p.name for p in AGENT_PROMPTS)
+    assert_human_review_sets_partition(
+        all_names,
+        AGENT_HUMAN_REVIEW_REQUIRED,
+        AGENT_HUMAN_REVIEW_EXEMPT,
+    )
+
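
To illustrate what the partition test is guarding against, the sketch below replays the `04-agent` sets from this diff against a directory listing that has gained one unclassified file. `partition_problems` and `new-agent-prompt.md` are hypothetical. The committed helper raises a single assertion instead, so this is only a way to see the failure modes separately.

```python
AGENT_HUMAN_REVIEW_REQUIRED = frozenset({
    "agent-scaffold-provenance.md",
    "agent-selection.md",
    "memory-consolidation.md",
    "retro.md",
    "review-rating-fix.md",
    "runtime-trust-boundary.md",
    "sop-compiler.md",
})
AGENT_HUMAN_REVIEW_EXEMPT = frozenset({
    "workflow-engine.md",
    "workflow-recipes.md",
})


def partition_problems(
    all_names: frozenset[str],
    required: frozenset[str],
    exempt: frozenset[str],
) -> list[str]:
    """List every way required/exempt fail to partition all_names.

    Hypothetical diagnostic (assumption): not part of prompt_eval_helpers.
    """
    classified = required | exempt
    problems = []
    if all_names - classified:
        # A prompt exists on disk but is in neither frozen set.
        problems.append(f"unclassified: {sorted(all_names - classified)}")
    if classified - all_names:
        # A frozen set still names a prompt that no longer exists.
        problems.append(f"stale: {sorted(classified - all_names)}")
    if required & exempt:
        # The same prompt is listed as both required and exempt.
        problems.append(f"overlap: {sorted(required & exempt)}")
    return problems


if __name__ == "__main__":
    on_disk = AGENT_HUMAN_REVIEW_REQUIRED | AGENT_HUMAN_REVIEW_EXEMPT
    print(partition_problems(
        on_disk, AGENT_HUMAN_REVIEW_REQUIRED, AGENT_HUMAN_REVIEW_EXEMPT
    ))
    print(partition_problems(
        on_disk | {"new-agent-prompt.md"},  # hypothetical new prompt
        AGENT_HUMAN_REVIEW_REQUIRED,
        AGENT_HUMAN_REVIEW_EXEMPT,
    ))
```

The practical effect is that adding a prompt file forces an explicit decision: the suite fails until the new name is placed in exactly one of the two sets.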

reflective-prompt-library/plans/tests/test_context_prompts_eval_harness.py

Lines changed: 39 additions & 1 deletion
@@ -9,7 +9,7 @@
 sys.path.insert(0, str(Path(__file__).parent))

 from eval_harness import EvalHarness  # noqa: E402
-from prompt_eval_helpers import assert_human_review_preamble, prompts_with_human_review  # noqa: E402
+from prompt_eval_helpers import assert_human_review_preamble, prompts_with_human_review, assert_human_review_required_matches_detection, assert_human_review_exempt_have_no_preamble_section, assert_human_review_sets_partition  # noqa: E402

 CONTEXT_DIR = Path(__file__).parent.parent.parent / "03-context"
 REPO_ROOT = str(Path(__file__).parent.parent.parent.parent)
@@ -24,6 +24,20 @@

 CONTEXT_PROMPTS = tuple(sorted(CONTEXT_DIR.glob("*.md")))
 CONTEXT_PROMPTS_WITH_HUMAN_REVIEW = prompts_with_human_review(CONTEXT_PROMPTS)
+CONTEXT_HUMAN_REVIEW_REQUIRED = frozenset({
+    "context-handoff.md",
+    "low-token.md",
+    "small-context.md",
+})
+CONTEXT_HUMAN_REVIEW_EXEMPT = frozenset({
+    "context-engineering.md",
+    "gemini-long-document.md",
+    "large-context.md",
+    "medium-context.md",
+})
+
+CONTEXT_PROMPTS_WITH_HUMAN_REVIEW = prompts_with_human_review(CONTEXT_PROMPTS)
+


 @pytest.fixture(scope="module")
@@ -80,3 +94,27 @@ def test_context_prompts_have_primary_workflow_surfaces_line():
 def test_context_prompt_has_human_review_section(prompt_path: Path):
     """Prompts with Human Review declare escalation outside zh-TW templates."""
     assert_human_review_preamble(prompt_path)
+
+def test_context_human_review_required_set_matches_detection():
+    """Frozen required set must match prompts that declare ## Human Review in preambles."""
+    assert_human_review_required_matches_detection(
+        CONTEXT_HUMAN_REVIEW_REQUIRED, CONTEXT_PROMPTS
+    )
+
+
+def test_context_human_review_exempt_prompts_have_no_preamble_section():
+    """Exempt prompts keep Human Review cues in fenced templates only."""
+    assert_human_review_exempt_have_no_preamble_section(
+        CONTEXT_HUMAN_REVIEW_EXEMPT, CONTEXT_PROMPTS
+    )
+
+
+def test_context_human_review_sets_partition_prompts():
+    """Required + exempt sets must cover all prompts without overlap."""
+    all_names = frozenset(p.name for p in CONTEXT_PROMPTS)
+    assert_human_review_sets_partition(
+        all_names,
+        CONTEXT_HUMAN_REVIEW_REQUIRED,
+        CONTEXT_HUMAN_REVIEW_EXEMPT,
+    )
+

reflective-prompt-library/plans/tests/test_core_prompts_eval_harness.py

Lines changed: 15 additions & 10 deletions
@@ -10,8 +10,10 @@

 from eval_harness import EvalHarness  # noqa: E402
 from prompt_eval_helpers import (  # noqa: E402
+    assert_human_review_exempt_have_no_preamble_section,
     assert_human_review_preamble,
-    has_human_review_preamble,
+    assert_human_review_required_matches_detection,
+    assert_human_review_sets_partition,
     prompts_with_human_review,
 )

@@ -98,20 +100,23 @@ def test_core_prompt_has_human_review_section(prompt_path: Path):

 def test_core_human_review_required_set_matches_detection():
     """Frozen required set must match prompts that declare ## Human Review in preambles."""
-    detected = {p.name for p in CORE_PROMPTS_WITH_HUMAN_REVIEW}
-    assert detected == CORE_HUMAN_REVIEW_REQUIRED
+    assert_human_review_required_matches_detection(
+        CORE_HUMAN_REVIEW_REQUIRED, CORE_PROMPTS
+    )


 def test_core_human_review_exempt_prompts_have_no_preamble_section():
     """L1 opener prompts keep Human Review cues in fenced templates only."""
-    for name in CORE_HUMAN_REVIEW_EXEMPT:
-        assert not has_human_review_preamble(CORE_DIR / name), (
-            f"{name} should not declare ## Human Review in preamble (exempt opener)"
-        )
+    assert_human_review_exempt_have_no_preamble_section(
+        CORE_HUMAN_REVIEW_EXEMPT, CORE_PROMPTS
+    )


 def test_core_human_review_sets_partition_core_prompts():
     """Required + exempt sets must cover all 00-core prompts without overlap."""
-    all_names = {p.name for p in CORE_PROMPTS}
-    assert CORE_HUMAN_REVIEW_REQUIRED | CORE_HUMAN_REVIEW_EXEMPT == all_names
-    assert not CORE_HUMAN_REVIEW_REQUIRED & CORE_HUMAN_REVIEW_EXEMPT
+    all_names = frozenset(p.name for p in CORE_PROMPTS)
+    assert_human_review_sets_partition(
+        all_names,
+        CORE_HUMAN_REVIEW_REQUIRED,
+        CORE_HUMAN_REVIEW_EXEMPT,
+    )
