feat: substantially improve NLP-specific HotpotQA prompt templates

KhrulkovV · claude · KhrulkovV · commit b2bbaf0b5719 · 2026-03-03T23:52:46.000+03:00
Per prompt-engineer review, the initial overrides were too shallow: mostly
vocabulary substitution without domain-specific mechanistic guidance.

mutation/system.txt:
- Add PROMPT MUTATION CONSTRAINTS (hard rules: frozen steps, format invariants)
- Add PROMPT ENGINEERING PRINCIPLES (BM25 bag-of-words, system_prompt crowding,
  example_reasoning length, step-6 evidence blindspot, aim verbosity)
- Add MUTATION FOCUS AREAS (retrieval path > answer path > global > reasoning depth)
- Add ARCHETYPE INTERPRETATION for prompt-engineering domain

insights/system.txt:
- Add CHAIN-SPECIFIC FAILURE MODES section (BM25 query pollution, hop-2 entity
  bridging gap, system_prompt crowding, answer extraction brittleness, step-6
  evidence blindspot, example_reasoning length tax)
- Expand CATEGORIES with hop2_entity_naming, example_reasoning_cost,
  step6_evidence_blindspot, reasoning_scaffold, aim_verbosity

lineage/system.txt:
- Add MECHANISM VOCABULARY (BM25 query contamination, hop-2 entity bridging
  failure, context crowding, answer extraction failure, evidence consolidation
  gap, val-set overfitting signal)
- Split "refinement" into retrieval_refinement / synthesis_refinement /
  global_refinement

lineage/user.txt:
- Reorder REGRESSION CHECKLIST by impact (step 3 query contamination first,
  step 6 format loss second, then entity bridging, etc.)
- Add example_reasoning length check (#8)
- Add val-test gap connection to step 5 complexity check (#5)
- Add richer mechanism examples including val-test gap and context crowding

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
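
Note: the length budgets this commit introduces (system_prompt around 20 words, example_reasoning around 120 words, aim of 8-15 words) could be checked mechanically before a mutated chain is evaluated. The following is a minimal sketch only. The dict layout and the `check_budgets` helper are hypothetical; the diff does not show how gigaevo represents a chain, so adapt the field access to the real structure.

```python
# Hypothetical budget lint. The chain layout (a dict with "system_prompt" and
# a "steps" mapping) is an assumption for illustration, not gigaevo's schema.
SYSTEM_PROMPT_MAX_WORDS = 20       # "~20 words" budget from the prompts
EXAMPLE_REASONING_MAX_WORDS = 120  # "~120 words" budget from the prompts
AIM_MAX_WORDS = 15                 # upper end of the 8-15 word aim guideline


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def check_budgets(chain: dict) -> list[str]:
    """Return one warning string per exceeded length budget."""
    warnings = []
    if word_count(chain.get("system_prompt", "")) > SYSTEM_PROMPT_MAX_WORDS:
        warnings.append("system_prompt: exceeds 20 words")
    for step_id, step in sorted(chain.get("steps", {}).items()):
        if word_count(step.get("example_reasoning", "")) > EXAMPLE_REASONING_MAX_WORDS:
            warnings.append(f"step {step_id}: example_reasoning exceeds 120 words")
        if word_count(step.get("aim", "")) > AIM_MAX_WORDS:
            warnings.append(f"step {step_id}: aim exceeds 15 words")
    return warnings
```

A check like this only flags length; whether a long example_reasoning actually hurts format compliance still has to be judged from metrics.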
diff --git a/gigaevo/prompts/hotpotqa/insights/system.txt b/gigaevo/prompts/hotpotqa/insights/system.txt
@@ -9,7 +9,7 @@ AVAILABLE METRICS:
 ---
 
 ANALYSIS FRAMEWORK — For each insight, identify:
-1. **OBSERVATION**: What pattern/value/structure exists in the code? (cite evidence)
+1. **OBSERVATION**: What pattern/value/structure exists in the code? (cite evidence: step number, quoted instruction text, metrics)
 2. **IMPACT**: How does this affect the metric? (causal mechanism for prompt-chain execution)
 3. **MUTATION HINT**: What should the mutation agent do? (preserve, remove, tune, replace)
 
@@ -24,7 +24,12 @@ Specific 1-2 word label in snake_case describing the insight type.
   - "evidence_synthesis": how step 5 combines multi-hop facts
   - "instruction_clarity": whether a step's instruction is precise or ambiguous
   - "entity_bridging": how step 3 identifies and names the hop-2 entity from step 2
+  - "hop2_entity_naming": whether step 2 explicitly names the bridge entity for step 3
   - "format_constraint": explicit output format rules in a step's stage_action
+  - "example_reasoning_cost": whether a step's example_reasoning is disproportionately long
+  - "step6_evidence_blindspot": whether step 5 adequately consolidates for step 6's blind spot
+  - "reasoning_scaffold": whether reasoning_questions guide toward the right intermediate outputs
+  - "aim_verbosity": whether a step's aim is unnecessarily long and redundant with stage_action
 
 TAGS (pick one):
 - **beneficial**: Pattern that HELPS the metric — mutation should PRESERVE or EXTEND it
@@ -37,6 +42,31 @@ SEVERITY: high (major metric impact), medium (moderate), low (minor refinement)
 
 ---
 
+CHAIN-SPECIFIC FAILURE MODES TO CHECK:
+- **BM25 query pollution**: Step 3's output is used verbatim as the BM25 retrieval query
+  (via $history[-1]). Any prose, preamble, or reasoning text in step 3's output degrades
+  retrieval. Flag if step 3's stage_action does not include an explicit "output ONLY the
+  search terms" constraint. BM25 rewards named entities over relation words — shorter,
+  entity-focused queries outperform full sentences.
+- **hop-2 entity bridging gap**: Step 3 receives ONLY step 2's summary (not step 1's raw
+  passages). If step 2 fails to name the bridge entity explicitly, step 3 cannot form a
+  useful query. Flag if step 2's stage_action lacks an instruction to surface entity names
+  needed for the second-hop search.
+- **system_prompt crowding**: system_prompt is prepended to ALL 4 LLM steps. Every word
+  in system_prompt consumes context in every call. Flag if system_prompt exceeds ~20 words
+  or contains instructions that duplicate per-step aims.
+- **answer extraction brittleness**: The evaluator extracts answers via regex on
+  "Answer: <answer>". Flag if step 6 lacks an explicit format constraint, or if its
+  example_reasoning shows a multi-sentence response rather than a single "Answer: X" line.
+- **step 6 evidence blindspot**: Step 6 depends on steps 2 and 5 (not step 4). It never
+  sees raw second-hop passages. Flag if step 5's stage_action does not produce a unified
+  evidence summary — step 6 needs step 5 to fully consolidate, or it will lack second-hop facts.
+- **example_reasoning length tax**: example_reasoning is included verbatim in the assembled
+  prompt. Flag if any step's example_reasoning exceeds ~120 words — longer examples crowd
+  the actual instruction and may reduce LLM compliance with format constraints.
+
+---
+
 REQUIREMENTS:
 - Each insight ≤50 words with concrete evidence (step numbers, quoted instructions, metrics)
 - Ground in code/metrics/errors — NO speculation or hallucination
@@ -81,4 +111,4 @@ EXAMPLE:
     "severity": "medium",
     "insight": "step 6 instruction lacks explicit 'Answer: <answer>' format requirement; causes extraction failure on ~1% of samples; mutation should ADD explicit format constraint matching normalization."
   }}
-]
+]
diff --git a/gigaevo/prompts/hotpotqa/lineage/system.txt b/gigaevo/prompts/hotpotqa/lineage/system.txt
@@ -7,19 +7,29 @@ STRATEGY TYPES (pick the most precise):
 - "avoidance": Removed/bypassed failing prompt pattern (e.g., eliminated verbose conflict-resolution rules, removed ambiguous step instruction)
 - "generalization": Added flexibility (e.g., made an instruction question-type-aware, unified duplicate phrasings)
 - "exploration": Introduced novel prompt approach (e.g., new step decomposition strategy, different evidence-synthesis framing)
-- "refinement": Tuned existing prompt (e.g., tightened query format constraint, shortened system_prompt, clarified answer format)
+- "retrieval_refinement": Tuned step 2 or step 3 to improve BM25 query quality or hop-2 entity naming
+- "synthesis_refinement": Tuned step 5 or step 6 for better evidence consolidation or answer format
+- "global_refinement": Shortened or tightened system_prompt
 
-ANALYSIS FRAMEWORK - For each significant prompt change, explain:
+ANALYSIS FRAMEWORK — For each significant prompt change, explain:
 1. **WHAT changed**: The specific instruction or phrasing modification (reference diff hunks)
 2. **WHY it helped/hurt**: The underlying mechanism (how the change affects LLM step behavior and downstream retrieval/reasoning)
 3. **TRANSFERABLE LESSON**: What principle can be applied to future mutations
 
 DEPTH REQUIREMENTS:
 - Go beyond surface descriptions ("changed step 3 text") to root causes ("cleaner step 3 output reduces BM25 query noise, improving hop-2 document recall")
 - Connect prompt changes to chain execution behavior
-- Identify **generalizing vs. overfitting** patterns: did the change improve robustness or optimize for specific question types?
+- Identify **generalizing vs. overfitting** patterns: did the change improve robustness or optimize for specific question types? Note when reduced val-test gap indicates better generalization.
 - When metrics improved, explain the mechanism; when they degraded, explain what was lost
 
+MECHANISM VOCABULARY — use these precise terms when applicable:
+- "BM25 query contamination": step 3 emits prose/reasoning alongside the query, degrading retrieval
+- "hop-2 entity bridging failure": step 2 does not name the bridge entity, so step 3 queries for the wrong thing
+- "context crowding": system_prompt or long example_reasoning shrinks effective context for per-step instructions
+- "answer extraction failure": step 6 does not produce "Answer: <answer>" format, causing regex miss
+- "evidence consolidation gap": step 5 output is too thin, leaving step 6 without second-hop facts
+- "val-set overfitting signal": complex multi-rule instructions in steps 5 or 6 often improve val EM but widen val-test gap — flag this pattern explicitly
+
 FORMAT:
 - Each insight: JSON with "strategy" and "description" (≤50 words)
 - Reference diff blocks: "(@@ -X,Y +A,B @@)"
@@ -28,25 +38,25 @@ FORMAT:
 EXAMPLES:
 [
   {
-    "strategy": "refinement",
-    "description": "(@@ -12,3 +14,2 @@) Tightened step 3 to 'Output ONLY: <entity> <relation>', removing prose preamble; cleaner BM25 query improves hop-2 document recall; correlates with +0.018 val EM gain."
+    "strategy": "retrieval_refinement",
+    "description": "(@@ -12,3 +14,2 @@) Tightened step 3 to 'Output ONLY: <entity> <relation>', removing prose preamble; eliminates BM25 query contamination; cleaner query improves hop-2 document recall; correlates with +0.018 val EM gain."
   },
   {
     "strategy": "avoidance",
-    "description": "(@@ -45,8 +0,0 @@) Removed 5-rule conflict-resolution hierarchy in step 5; over-engineered rules were optimizing for val-set patterns; simplification reduced val-test gap from 8pp to 5pp."
+    "description": "(@@ -45,8 +0,0 @@) Removed 5-rule conflict-resolution hierarchy in step 5; over-engineered rules show val-set overfitting signal; simplification reduced val-test gap from 8pp to 5pp without harming val EM."
   },
   {
     "strategy": "exploration",
-    "description": "(@@ -22,2 +24,4 @@) Added question-type routing in step 2 (who/when/where tags); improved entity extraction for bridge questions; +0.022 val EM but risk of overfitting to question taxonomy."
+    "description": "(@@ -22,2 +24,4 @@) Added question-type routing in step 2 (who/when/where tags); improved hop-2 entity bridging for bridge questions; +0.022 val EM but risk of overfitting to question taxonomy."
   },
   {
     "strategy": "imitation",
-    "description": "(@@ -8,1 +8,1 @@) Preserved 'Answer: <answer>' format constraint in step 6; maintaining extraction reliability that prevented 3 failure cases in parent."
+    "description": "(@@ -8,1 +8,1 @@) Preserved 'Answer: <answer>' format constraint in step 6; maintaining extraction reliability prevented regex extraction failures seen in parent."
   },
   {
-    "strategy": "generalization",
-    "description": "(@@ -30 @@) + (@@ -67 @@) Unified duplicate entity-bridging instructions across steps 2 and 3; reduces redundancy and context crowding; single point of change for future mutation."
+    "strategy": "global_refinement",
+    "description": "(@@ -30 @@) + (@@ -67 @@) Compressed system_prompt from 45 to 12 words; reduced context crowding across all 4 LLM steps; unified duplicate instructions into per-step stage_action for cleaner mutation targets."
   }
 ]
 
-Respond with only valid JSON array, no commentary.
+Respond with only valid JSON array, no commentary.
diff --git a/gigaevo/prompts/hotpotqa/lineage/user.txt b/gigaevo/prompts/hotpotqa/lineage/user.txt
@@ -17,19 +17,35 @@ Child: {child_errors}
 
 Analyze the diff for **logical changes** (one insight per logical change). Related hunks implementing the same modification should be grouped into a single insight. For each logical change, explain:
 
-1. **The mechanism**: WHY did this specific prompt change affect the metric? (e.g., "tighter step 3 output format → cleaner BM25 query → better hop-2 retrieval", "removed verbose step 5 rules → simpler evidence synthesis → better generalization", "added explicit answer format → fewer extraction failures")
-
-2. **The trade-off**: What did the change gain vs lose? (e.g., "gained retrieval precision but lost flexibility for multi-entity questions", "faster convergence on common patterns but narrower coverage of edge cases")
-
-3. **Actionable lesson**: What should future mutations learn? (e.g., "preserve clean format constraints in step 3", "avoid adding more conflict-resolution rules to step 5", "always include explicit 'Answer: <answer>' in step 6")
-
-**REGRESSION CHECKLIST** (use when performance degraded):
-- Did a step's explicit output format constraint get weakened or removed?
-- Did step 3 instruction become more verbose or ambiguous (potentially polluting BM25 queries)?
-- Did step 5 conflict-resolution rules increase in complexity without accuracy gain?
-- Did the system_prompt grow significantly longer (crowding per-step instructions)?
-- Did a useful entity-bridging instruction in step 3 get removed or generalized away?
-- Did step 6 lose its explicit 'Answer: <answer>' format requirement?
+1. **The mechanism**: WHY did this specific prompt change affect the metric? (e.g.,
+   "tighter step 3 output format → cleaner BM25 query → better hop-2 retrieval",
+   "removed verbose step 5 rules → simpler evidence synthesis → better generalization AND narrower val-test gap",
+   "added explicit answer format → fewer extraction failures",
+   "step 6 example_reasoning grew to 150 words → context crowding → format non-compliance → regex extraction failure")
+
+2. **The trade-off**: What did the change gain vs lose? (e.g., "gained retrieval precision
+   but lost flexibility for multi-entity questions", "faster convergence on common patterns
+   but narrower coverage of edge cases")
+
+3. **Actionable lesson**: What should future mutations learn? (e.g., "preserve clean format
+   constraints in step 3", "avoid adding more conflict-resolution rules to step 5",
+   "always include explicit 'Answer: <answer>' in step 6")
+
+**REGRESSION CHECKLIST** (ordered by typical impact; use when performance degraded):
+1. Did step 3 instruction lose its "output ONLY search terms" constraint? (BM25 query
+   contamination — affects retrieval for ALL questions)
+2. Did step 6 lose the explicit 'Answer: <answer>' format requirement? (regex extraction
+   failures on all affected questions)
+3. Did step 2 lose an explicit instruction to name the bridge entity? (hop-2 query
+   formation fails; step 3 searches for the wrong thing)
+4. Did a step's explicit output format constraint get weakened or removed?
+5. Did step 5 conflict-resolution rules increase in complexity without measured gain?
+   (likely val-set overfitting — check val-test gap, not just val EM delta)
+6. Did the system_prompt grow significantly longer? (context crowding affects all 4 LLM
+   steps simultaneously)
+7. Did a useful entity-bridging instruction in step 3 get removed or generalized away?
+8. Did any step gain a long example_reasoning block (>120 words)? (context crowding
+   reduces format compliance; check step 3 and step 6 first)
 
 **QUALITY BAR:**
 ❌ BAD: "Changed step 3 text" (just restates the diff)
@@ -44,4 +60,4 @@ Reference hunks by their headers. Group related hunks when they implement one lo
 PARENT CODE (reference):
 ```python
 {parent_code}
-```
+```
diff --git a/gigaevo/prompts/hotpotqa/mutation/system.txt b/gigaevo/prompts/hotpotqa/mutation/system.txt
@@ -7,4 +7,58 @@ OBJECTIVE:
 {task_description}
 
 AVAILABLE METRICS:
-{metrics_description}
+{metrics_description}
+
+---
+
+PROMPT MUTATION CONSTRAINTS — HARD RULES:
+- You may ONLY modify text content in: system_prompt, and the aim, stage_action,
+  reasoning_questions, example_reasoning fields of non-frozen LLM steps (2, 3, 5, 6).
+- Step topology (types, dependencies, count) is FIXED. Do not alter it.
+- Frozen steps (1 and 4) must remain byte-identical to the baseline.
+- Preserve unless explicitly targeting: step 3's "output ONLY" constraint and step 6's
+  "Answer: <answer>" format — these protect BM25 retrieval quality and answer extraction.
+
+PROMPT ENGINEERING PRINCIPLES FOR THIS CHAIN:
+- Shorter instructions with explicit constraints outperform longer instructions with
+  implicit expectations. When in doubt, compress and constrain rather than expand and explain.
+- Step 3's output is used VERBATIM as a BM25 query. Any prose, preamble, or reasoning
+  text it emits will degrade retrieval for ALL questions. This is the highest-priority
+  format constraint in the chain.
+- BM25 is a bag-of-words model: it rewards named entities and key attributes, not relation
+  words or sentence structure. Step 3 instructions should push toward e.g. "Marie Curie
+  nationality" not "Find information about Marie Curie's national origin."
+- system_prompt is prepended to ALL 4 LLM steps. Every word costs context budget in every
+  call. Keep it under 20 words focused on global reasoning style, not per-step instructions.
+- example_reasoning fields are included verbatim in the assembled prompt. Long examples
+  (>120 words) crowd the actual instruction and reduce LLM compliance with format constraints.
+- Step 6 depends on steps 2 and 5 — it never sees raw second-hop passages. Step 5 must
+  fully consolidate evidence; step 6 can only answer from what step 5 produces.
+- aim should be a single-sentence objective (8-15 words). A verbose aim is redundant with
+  a well-written stage_action and wastes context budget on every call.
+
+MUTATION FOCUS AREAS (in priority order):
+1. RETRIEVAL PATH (steps 2, 3): Does step 2 name the bridge entity? Does step 3 emit
+   only bare search terms? Improvements here help ALL multi-hop questions.
+2. ANSWER PATH (steps 5, 6): Does step 5 produce a unified evidence set? Does step 6
+   emit a clean "Answer: X"? Improvements here prevent extraction failures.
+3. GLOBAL CONTEXT (system_prompt): Is it short and role-focused? Compression here
+   improves ALL steps simultaneously.
+4. REASONING DEPTH (reasoning_questions, example_reasoning): Add only when evidence
+   shows the LLM is failing to apply the right reasoning pattern. Remove when bloated.
+
+ARCHETYPE INTERPRETATION FOR PROMPT MUTATION:
+When selecting archetypes, interpret them in the prompt-engineering domain:
+- "Precision Optimization" → tighten a format constraint, compress a verbose instruction
+  (e.g., shorten step 3 stage_action to pure search-term format)
+- "Proven Pattern Extension" → replicate a working format constraint to another step
+  (e.g., if step 3's "ONLY output X" works, apply similar discipline to step 6)
+- "Harmful Pattern Removal" → remove verbose/complex rules that show no fitness gain
+  (e.g., strip a 5-rule conflict-resolution hierarchy from step 5)
+- "Computational Reinvention" → try a fundamentally different instruction strategy for a step
+  (e.g., replace prose step 2 instruction with a structured entity-extraction template)
+- "Solution Space Exploration" → experiment with a different reasoning scaffold
+  (e.g., add question-type routing in step 2, or a two-sentence evidence format in step 5)
+- "Approach Synthesis" → combine a clean format constraint with a focused reasoning question
+- "Guided Innovation" → preserve clean format constraints while modifying reasoning depth
+- "Conservative Exploration" → minor wording change to one step within format constraints
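
Note: the "BM25 is a bag-of-words model" principle added to mutation/system.txt can be illustrated with a toy scorer. This is a sketch under stated assumptions: a textbook BM25 formula with common defaults (k1=1.5, b=0.75), a three-passage corpus invented for the example, and hypothetical helper names. It is not the retriever used by the chain, whose implementation is not part of this diff.

```python
import math
import re
from collections import Counter

# Toy corpus invented for illustration; index 0 is the passage we want.
DOCS = [
    "Marie Curie was a Polish physicist",
    "Find information about national parks and their origin",
    "Pierre Curie studied magnetism",
]


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        norm = k1 * (1 - b + b * len(toks) / avgdl)  # length normalization
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


def top_doc(query, docs):
    """Index of the highest-scoring document."""
    scores = bm25_scores(query, docs)
    return max(range(len(docs)), key=scores.__getitem__)
```

On this toy corpus, `top_doc("Marie Curie nationality", DOCS)` selects the Curie passage, while `top_doc("Find information about Marie Curie's national origin.", DOCS)` selects the unrelated parks passage, because the filler words each contribute as much score as a named entity. The corpus is contrived to make the effect visible; real retrieval degradation depends on the actual index.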