Round 74: standardize 03-context prompt contracts + eval anti-drift

johnteee · johnteee · commit 35b52a7ecae8 · 2026-06-25T15:33:54.000+08:00
Six-lens panel consensus (DO–DQ): add Purpose/Scope/Acceptance/Falsifiability
preambles to all seven 03-context prompts with workflow-skill and thinking-lens
cross-links; add test_context_prompts_eval_harness.py and extend cross-link pytest.
Governance sync: panel record, Decision Index, QUALITY_GATES 330+ floor.
diff --git a/README.md b/README.md
@@ -21,7 +21,7 @@ Full library docs: [reflective-prompt-library/README.md](reflective-prompt-libra
 ## Governance
 
 - **Contributing:** [CONTRIBUTING.md](CONTRIBUTING.md) — quality gates, routing maintenance (R8–R12), `make all`
-- **Panel record:** [multi-agent-panel-consensus](reflective-prompt-library/plans/multi-agent-panel-consensus-2026-06-25.md) — six-lens Socratic consensus (Rounds 1–73)
+- **Panel record:** [multi-agent-panel-consensus](reflective-prompt-library/plans/multi-agent-panel-consensus-2026-06-25.md) — six-lens Socratic consensus (Rounds 1–74)
 - **Operator playbook:** [GLOSSARY.md](reflective-prompt-library/GLOSSARY.md) — Governance Maintenance Playbook
 
 The repository contains:
diff --git a/reflective-prompt-library/03-context/context-engineering.md b/reflective-prompt-library/03-context/context-engineering.md
@@ -2,6 +2,24 @@
 
 Use this before long tasks where context discipline matters.
 
+## Purpose
+
+Enforce context discipline before long tasks. Primary workflow surfaces: `reflective-dispatch` (context-load deferral) and `reflective-research`. Pairs with `01-thinking/falsifiability.md` and `01-thinking/why-what-how-done.md`.
+
+## Scope
+
+- In scope: selective reads, artifact summaries, index-then-batch processing, and missing-info flags.
+- Out of scope: replacing frozen workflow skill contracts or router fairness rules.
+
+## Acceptance Criteria
+
+- Context-used, context-ignored, and missing-info sections appear at the end.
+- Large inputs are indexed before synthesis.
+
+## Falsifiability
+
+State what would prove the agent read irrelevant material anyway.
+
 ```markdown
 請以 Context Engineering 模式處理任務。
 
diff --git a/reflective-prompt-library/03-context/context-handoff.md b/reflective-prompt-library/03-context/context-handoff.md
@@ -2,6 +2,28 @@
 
 Use this when switching models, tools, agents, or sessions.
 
+## Purpose
+
+Produce session handoff summaries when switching models, tools, agents, or sessions. Primary workflow surface: `reflective-handoff-retro`. Pairs with `01-thinking/why-what-how-done.md` and `01-thinking/socratic-reviewer.md`.
+
+## Scope
+
+- In scope: goal, state, decisions, artifacts, risks, blockers, and next action for a successor agent.
+- Out of scope: full retrospective synthesis or repository edits (`reflective-implement`).
+
+## Acceptance Criteria
+
+- Output follows the handoff field structure without narrative drift.
+- Do-not-do guidance is explicit when the blast radius warrants `reflective-risk`.
+
+## Falsifiability
+
+Name one handoff field that would be wrong if the successor could not resume work.
+
+## Human Review
+
+Require human confirmation before handoff when irreversible or high-blast-radius work remains open.
+
 ```markdown
 請將目前任務整理成 Context Handoff Summary，供下一個 Agent 接手。
 
diff --git a/reflective-prompt-library/03-context/gemini-long-document.md b/reflective-prompt-library/03-context/gemini-long-document.md
@@ -2,6 +2,24 @@
 
 Use this when processing long documents. It is especially suited for Gemini-style large-context workflows.
 
+## Purpose
+
+Structure-first processing for long documents (Gemini-style workflows). Primary workflow surface: `reflective-research`. Pairs with `01-thinking/critical-thinking-check.md` and `01-thinking/falsifiability.md`.
+
+## Scope
+
+- In scope: document map, relevant sections, claims, evidence, contradictions, and synthesis.
+- Out of scope: full verbatim summary or repository edits.
+
+## Acceptance Criteria
+
+- Seven output sections are populated before recommendation.
+- Missing information is flagged explicitly.
+
+## Falsifiability
+
+Name one contradiction that would change the recommendation.
+
 ```markdown
 你將處理長文件。請不要直接摘要全文，而是先建立結構索引。
 
diff --git a/reflective-prompt-library/03-context/large-context.md b/reflective-prompt-library/03-context/large-context.md
@@ -2,6 +2,24 @@
 
 Use this for 200K-1M context windows while avoiding context rot.
 
+## Purpose
+
+Use 200K–1M windows without context rot via index-extract-synthesize. Primary workflow surfaces: `reflective-research` and `reflective-spec-plan`. Pairs with `01-thinking/falsifiability.md` and `01-thinking/critical-thinking-check.md`.
+
+## Scope
+
+- In scope: three-stage pipeline, selective extraction, and synthesis artifacts.
+- Out of scope: assuming long context equals reliable understanding.
+
+## Acceptance Criteria
+
+- All three stages are completed in order.
+- Pairs with `context-engineering.md` per the composition note below.
+
+## Falsifiability
+
+State what contradiction in source material would invalidate the synthesis.
+
 ```markdown
 你在大型 context window 中工作，但不要假設長 context 等於可靠理解。
 
diff --git a/reflective-prompt-library/03-context/low-token.md b/reflective-prompt-library/03-context/low-token.md
@@ -2,6 +2,28 @@
 
 Use this when budget, latency, or model quota is tight.
 
+## Purpose
+
+Budget-aware terse output when latency or quota is tight. Primary workflow surfaces: `reflective-dispatch` (L1 fast path) and `reflective-brief`. Pairs with `01-thinking/critical-thinking-check.md` and `01-thinking/why-what-how-done.md`.
+
+## Scope
+
+- In scope: minimal decision, reason, plan, and acceptance output under strict length caps.
+- Out of scope: full spec slicing (`reflective-spec-plan`) or repository implementation.
+
+## Acceptance Criteria
+
+- Fixed output slots are filled without narrative padding.
+- Stop condition is explicit.
+
+## Falsifiability
+
+Name one omitted slot that would make the answer non-actionable.
+
+## Human Review
+
+Escalate to `reflective-risk` when compression would hide safety-critical assumptions.
+
 ```markdown
 低 token 模式。請只輸出必要內容。
 
diff --git a/reflective-prompt-library/03-context/medium-context.md b/reflective-prompt-library/03-context/medium-context.md
@@ -2,6 +2,24 @@
 
 Use this for 32K-128K context windows and ordinary ChatGPT / Claude / Codex tasks.
 
+## Purpose
+
+Balance completeness and context cost for 32K–128K windows. Primary workflow surfaces: `reflective-spec-plan` and `reflective-brief`. Pairs with `01-thinking/why-what-how-done.md` and `01-thinking/falsifiability.md`.
+
+## Scope
+
+- In scope: goal through self-check with cited evidence, not full input duplication.
+- Out of scope: repository edits without `reflective-implement`.
+
+## Acceptance Criteria
+
+- Uncertainty is explicitly marked.
+- Composable with `02-engineering/task-start.md` as noted below.
+
+## Falsifiability
+
+Name one acceptance criterion that would fail if evidence were misquoted.
+
 ```markdown
 你在中型 context window 中工作。請平衡完整性與節省 context。
 
diff --git a/reflective-prompt-library/03-context/small-context.md b/reflective-prompt-library/03-context/small-context.md
@@ -2,6 +2,28 @@
 
 Use this for 4K-16K context windows, small models, mobile, or low-cost model runs.
 
+## Purpose
+
+Operate under small context windows (4K–16K) or low-cost models. Primary workflow surfaces: `reflective-brief` and `reflective-dispatch`. Pairs with `01-thinking/critical-thinking-check.md` and `01-thinking/why-what-how-done.md`.
+
+## Scope
+
+- In scope: conclusion-first answers, minimal assumptions, and capped risks and plan steps.
+- Out of scope: long-chain reasoning or full engineering ticket packs.
+
+## Acceptance Criteria
+
+- At most three risks and three to five plan steps unless escalated.
+- Next action is directly executable.
+
+## Falsifiability
+
+State what evidence would require escalating to `medium-context.md`.
+
+## Human Review
+
+Escalate when window limits would hide safety-critical unknowns.
+
 ```markdown
 你在小 context window 中工作。請極度節省 token。
 
diff --git a/reflective-prompt-library/PROJECT_KNOWLEDGE.md b/reflective-prompt-library/PROJECT_KNOWLEDGE.md
@@ -75,6 +75,7 @@ deferred promotions are recurrence-gated — see [panel backlog](plans/multi-age
 > Pointers to the causal trail — plans, reflections, tests, commits. Detail is
 > not duplicated here; this is a map, not an archive.
 
+- 2026-06-25 Round 74 panel — standardize `03-context/` prompt contracts (Purpose/Scope/Acceptance/Falsifiability) + thinking/workflow cross-links + `test_context_prompts_eval_harness.py` → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
 - 2026-06-25 Round 73 panel — standardize `04-agent/` prompt contracts (Purpose/Scope/Acceptance/Falsifiability) + thinking/workflow cross-links + `test_agent_prompts_eval_harness.py` → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
 - 2026-06-25 Round 72 panel — standardize `00-core/` prompt contracts (Purpose/Scope/Acceptance/Falsifiability) + eval_harness anti-drift → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
 - 2026-06-25 Round 71 panel — thinking ↔ engineering cross-links (`01-thinking/` in all 8 engineering prompts; thinking Prompt Sources on implement/spec-plan/handoff-retro) + `test_prompt_cross_links.py` → [record](plans/multi-agent-panel-consensus-2026-06-25.md)
diff --git a/reflective-prompt-library/README.md b/reflective-prompt-library/README.md
@@ -30,7 +30,7 @@ Pick **Strictness L1–L6** first (`skills/reflective-dispatch/SKILL.md`, [GLOSS
 
 ## Governance Panel Record
 
-Multi-agent Socratic consensus on project goals and the nine skills (Rounds 1–73, options A–DN) is recorded in [plans/multi-agent-panel-consensus-2026-06-25.md](plans/multi-agent-panel-consensus-2026-06-25.md). Run `make all` before claiming routing or governance changes are verified.
+Multi-agent Socratic consensus on project goals and the nine skills (Rounds 1–74, options A–DQ) is recorded in [plans/multi-agent-panel-consensus-2026-06-25.md](plans/multi-agent-panel-consensus-2026-06-25.md). Run `make all` before claiming routing or governance changes are verified.
 
 ## Directory Map
 
diff --git a/reflective-prompt-library/plans/QUALITY_GATES_SUMMARY.md b/reflective-prompt-library/plans/QUALITY_GATES_SUMMARY.md
@@ -314,7 +314,7 @@ ROUTE-002 measures unseen phrasing separately from ROUTE-001. Round 7 (2026-06-2
 2. **ROUTE-001/002/003 in CI** — 128 + 102 + 53 paraphrases at 100% consistency (seeded fixtures); `validate_route_fixture.py` gates minimum coverage
 3. **Governance validators** — links, lint, governance metadata, PROJECT_KNOWLEDGE, benchmark fixture, skill examples
 4. **Harness policy docs** — CONTRIBUTING, AGENTS, SKILL_INSTALLATION, maintenance playbook
-5. **Doc anti-drift** — `test_routing_contract.py`, cheatsheet parity tests, `test_readme_governance.py`, `test_thinking_prompts_eval_harness.py`, `test_engineering_prompts_eval_harness.py`, `test_prompt_cross_links.py`, `test_core_prompts_eval_harness.py`, `test_agent_prompts_eval_harness.py` (290+ pytest anti-drift suite in CI)
+5. **Doc anti-drift** — `test_routing_contract.py`, cheatsheet parity tests, `test_readme_governance.py`, `test_thinking_prompts_eval_harness.py`, `test_engineering_prompts_eval_harness.py`, `test_prompt_cross_links.py`, `test_core_prompts_eval_harness.py`, `test_agent_prompts_eval_harness.py`, `test_context_prompts_eval_harness.py` (330+ pytest anti-drift suite in CI)
 
 ### Ongoing maintenance (not blockers)
 
diff --git a/reflective-prompt-library/plans/multi-agent-panel-consensus-2026-06-25.md b/reflective-prompt-library/plans/multi-agent-panel-consensus-2026-06-25.md
@@ -1763,4 +1763,61 @@ User directive (repeat): review prompts, plans, skills, and Socratic/critical-th
 
 ## Panel status (updated)
 
-**Resealed 2026-06-25** after **Round 73** (options DL–DN). Agent-prompt contract pass complete; 03-context / 05-domain Purpose sweep remains recurrence-gated.
+## Round 74 — Context prompt contract review (2026-06-25)
+
+User directive (repeat): review prompts, plans, skills, and Socratic/critical-thinking lenses in parallel until all roles agree, then implement.
+
+### DO: Standardize `03-context/` prompt contracts + cross-links?
+
+| Lens | Position |
+| --- | --- |
+| Opus | **Agree** — context layer was 50–67% eval_harness; contracts close the recurrence-gated gap from Round 73 |
+| Codex | **Agree** — seven files bounded; falsifiable via `test_context_prompts_eval_harness.py` + cross-link pytest |
+| Gemini | **Agree** — window-size and handoff prompts are cost-critical; defer 05-domain |
+| Composer | **Agree** — IDE users load context prompts with skills; reciprocal links needed |
+| Sakana | **Agree** — no tenth skill; supporting lenses for existing nine |
+| GLM | **Agree** — English contracts outside zh-TW fences; Human Review where eval risk triggers |
+
+**Socratic Q:** Why 03-context now?
+**Answer:** Round 73 deferred it; context discipline and handoff are prerequisites for strictness routing and session continuity.
+
+**Consensus:** **Agree** — Purpose/Scope/Acceptance Criteria/Falsifiability on all seven `03-context/` prompts; thinking + workflow cross-links; `test_context_prompts_eval_harness.py`; extend `test_prompt_cross_links.py`.
+
+### DP: Expand to `05-domain/` now?
+
+| Lens | Position |
+| --- | --- |
+| All six | **Reject** — recurrence-gated after context layer |
+
+### DQ: Router / holdout / tenth skill?
+
+| Lens | Position |
+| --- | --- |
+| All six | **Reject** — ROUTE-001/002/003 at 100%; nine-skill freeze holds |
+
+### Round 74 verdict table
+
+| ID | Option | Verdict | Action |
+| --- | --- | --- | --- |
+| DO | Context prompt contracts + cross-links | **Agree** | 7 files + pytest anti-drift |
+| DP | 05-domain Purpose sweep | **Reject** | backlog |
+| DQ | Router/holdout/tenth skill | **Reject** | no change |
+
+**All roles agree.**
+
+## Implemented Changes (Round 74)
+
+- `03-context/*.md`: Purpose, Scope, Acceptance Criteria, Falsifiability + workflow skill mapping; thinking lens links; Human Review where applicable
+- `plans/tests/test_context_prompts_eval_harness.py`: structural + 80%+ score floor anti-drift
+- `plans/tests/test_prompt_cross_links.py`: context ↔ thinking ↔ skill cross-links
+- `QUALITY_GATES_SUMMARY.md`: context prompt test mention; pytest floor 330+
+- `PROJECT_KNOWLEDGE.md`: Decision Index Round 74 entry
+- `README.md`, `reflective-prompt-library/README.md`, `test_readme_governance.py`: panel round 74 sync
+
+## Verification (Round 74)
+
+- `make all`: pytest + ROUTE-001/002/003 100%
+
+## Panel status (updated)
+
+**Resealed 2026-06-25** after **Round 74** (options DO–DQ). Context-prompt contract pass complete; `05-domain` Purpose sweep remains recurrence-gated.
diff --git a/reflective-prompt-library/plans/tests/test_context_prompts_eval_harness.py b/reflective-prompt-library/plans/tests/test_context_prompts_eval_harness.py
@@ -0,0 +1,63 @@
+"""Anti-drift: 03-context prompts must satisfy eval_harness structural rubric."""
+
+import sys
+from pathlib import Path
+
+import pytest
+
+sys.path.insert(0, str(Path(__file__).parent.parent))
+
+from eval_harness import EvalHarness  # noqa: E402
+
+CONTEXT_DIR = Path(__file__).parent.parent.parent / "03-context"
+REPO_ROOT = str(Path(__file__).parent.parent.parent.parent)
+MIN_SCORE = 80.0
+
+REQUIRED_HEADINGS = (
+    "## Purpose",
+    "## Scope",
+    "## Acceptance Criteria",
+    "## Falsifiability",
+)
+
+CONTEXT_PROMPTS = tuple(sorted(CONTEXT_DIR.glob("*.md")))
+
+
+@pytest.fixture(scope="module")
+def harness() -> EvalHarness:
+    return EvalHarness(repo_root=REPO_ROOT)
+
+
+@pytest.mark.parametrize("prompt_path", CONTEXT_PROMPTS, ids=lambda p: p.name)
+def test_context_prompt_has_contract_headings(prompt_path: Path):
+    text = prompt_path.read_text(encoding="utf-8")
+    preamble = text.split("```", 1)[0]  # everything before the first code fence
+    for heading in REQUIRED_HEADINGS:
+        assert heading in preamble, f"{prompt_path.name} missing {heading} outside template block"
+
+
+@pytest.mark.parametrize("prompt_path", CONTEXT_PROMPTS, ids=lambda p: p.name)
+def test_context_prompt_meets_eval_harness_floor(prompt_path: Path, harness: EvalHarness):
+    rel = str(prompt_path.relative_to(REPO_ROOT))
+    result = harness.evaluate_file(rel)
+    assert result["score"] >= MIN_SCORE, (
+        f"{prompt_path.name} eval_harness score {result['score']}% < {MIN_SCORE}%: "
+        f"{[(c['id'], c['result']) for c in result['checks']]}"
+    )
+
+
+def test_context_prompts_reference_workflow_skills():
+    for prompt_path in CONTEXT_PROMPTS:
+        text = prompt_path.read_text(encoding="utf-8")
+        assert "reflective-" in text, f"{prompt_path.name} should map to at least one workflow skill"
+
+
+def test_context_prompts_cover_context_workflow_surfaces():
+    text = "\n".join(p.read_text(encoding="utf-8") for p in CONTEXT_PROMPTS)
+    for skill in (
+        "reflective-dispatch",
+        "reflective-brief",
+        "reflective-handoff-retro",
+        "reflective-research",
+    ):
+        assert skill in text, f"03-context should reference {skill}"
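The heading check in `test_context_prompt_has_contract_headings` only inspects the text before the first code fence, so a heading that appears inside the template block does not satisfy the contract. The following is a minimal standalone sketch of that idea, not part of the commit: the helper name `missing_contract_headings` and the two sample documents are hypothetical and exist only for illustration, and the fence marker is built at runtime so the example stays readable inside a fenced block.

```python
from typing import Iterable, List

# Same four headings the test file requires.
REQUIRED_HEADINGS = (
    "## Purpose",
    "## Scope",
    "## Acceptance Criteria",
    "## Falsifiability",
)

FENCE = "`" * 3  # three backticks, built at runtime


def missing_contract_headings(
    text: str, required: Iterable[str] = REQUIRED_HEADINGS
) -> List[str]:
    """Return required headings absent from the preamble.

    The preamble is everything before the first code fence, mirroring the
    split used in the test. Hypothetical helper for illustration only.
    """
    preamble = text.split(FENCE, 1)[0]
    return [heading for heading in required if heading not in preamble]


# Hypothetical document with all four headings ahead of the template fence.
sample_ok = "\n".join([
    "# Demo prompt",
    "## Purpose",
    "## Scope",
    "## Acceptance Criteria",
    "## Falsifiability",
    FENCE + "markdown",
    "template body",
    FENCE,
])

# Hypothetical document where three headings sit inside the template fence.
sample_bad = "\n".join([
    "# Demo prompt",
    "## Purpose",
    FENCE + "markdown",
    "## Scope",
    "## Acceptance Criteria",
    "## Falsifiability",
    FENCE,
])

print(missing_contract_headings(sample_ok))   # []
print(missing_contract_headings(sample_bad))  # the three headings inside the fence
```

Like the test, the sketch uses a plain substring match, so a deeper heading such as `### Purpose` would also count as `## Purpose`. A stricter line-anchored match is possible, but this sketch does not attempt it.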
diff --git a/reflective-prompt-library/plans/tests/test_prompt_cross_links.py b/reflective-prompt-library/plans/tests/test_prompt_cross_links.py
diff --git a/reflective-prompt-library/plans/tests/test_readme_governance.py b/reflective-prompt-library/plans/tests/test_readme_governance.py