Commit b2bbaf0

KhrulkovV and claude committed
feat: substantially improve NLP-specific HotpotQA prompt templates
Per prompt-engineer review, the initial overrides were too shallow: mostly vocabulary substitution without domain-specific mechanistic guidance.

mutation/system.txt:
- Add PROMPT MUTATION CONSTRAINTS (hard rules: frozen steps, format invariants)
- Add PROMPT ENGINEERING PRINCIPLES (BM25 bag-of-words, system_prompt crowding, example_reasoning length, step-6 evidence blindspot, aim verbosity)
- Add MUTATION FOCUS AREAS (retrieval path > answer path > global > reasoning depth)
- Add ARCHETYPE INTERPRETATION for prompt-engineering domain

insights/system.txt:
- Add CHAIN-SPECIFIC FAILURE MODES section (BM25 query pollution, hop-2 entity bridging gap, system_prompt crowding, answer extraction brittleness, step-6 evidence blindspot, example_reasoning length tax)
- Expand CATEGORIES with hop2_entity_naming, example_reasoning_cost, step6_evidence_blindspot, reasoning_scaffold, aim_verbosity

lineage/system.txt:
- Add MECHANISM VOCABULARY (BM25 query contamination, hop-2 entity bridging failure, context crowding, answer extraction failure, evidence consolidation gap, val-set overfitting signal)
- Split "refinement" into retrieval_refinement / synthesis_refinement / global_refinement

lineage/user.txt:
- Reorder REGRESSION CHECKLIST by impact (step 3 query contamination first, step 6 format loss second, then entity bridging, etc.)
- Add example_reasoning length check (#8)
- Add val-test gap connection to step 5 complexity check (#5)
- Add richer mechanism examples including val-test gap and context crowding

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent d585d05 commit b2bbaf0

4 files changed

Lines changed: 138 additions & 28 deletions

gigaevo/prompts/hotpotqa/insights/system.txt

Lines changed: 32 additions & 2 deletions
@@ -9,7 +9,7 @@ AVAILABLE METRICS:
 ---
 
 ANALYSIS FRAMEWORK — For each insight, identify:
-1. **OBSERVATION**: What pattern/value/structure exists in the code? (cite evidence)
+1. **OBSERVATION**: What pattern/value/structure exists in the code? (cite evidence: step number, quoted instruction text, metrics)
 2. **IMPACT**: How does this affect the metric? (causal mechanism for prompt-chain execution)
 3. **MUTATION HINT**: What should the mutation agent do? (preserve, remove, tune, replace)
 
@@ -24,7 +24,12 @@ Specific 1-2 word label in snake_case describing the insight type.
 - "evidence_synthesis": how step 5 combines multi-hop facts
 - "instruction_clarity": whether a step's instruction is precise or ambiguous
 - "entity_bridging": how step 3 identifies and names the hop-2 entity from step 2
+- "hop2_entity_naming": whether step 2 explicitly names the bridge entity for step 3
 - "format_constraint": explicit output format rules in a step's stage_action
+- "example_reasoning_cost": whether a step's example_reasoning is disproportionately long
+- "step6_evidence_blindspot": whether step 5 adequately consolidates for step 6's blind spot
+- "reasoning_scaffold": whether reasoning_questions guide toward the right intermediate outputs
+- "aim_verbosity": whether a step's aim is unnecessarily long and redundant with stage_action
 
 TAGS (pick one):
 - **beneficial**: Pattern that HELPS the metric — mutation should PRESERVE or EXTEND it
@@ -37,6 +42,31 @@ SEVERITY: high (major metric impact), medium (moderate), low (minor refinement)
 
 ---
 
+CHAIN-SPECIFIC FAILURE MODES TO CHECK:
+- **BM25 query pollution**: Step 3's output is used verbatim as the BM25 retrieval query
+(via $history[-1]). Any prose, preamble, or reasoning text in step 3's output degrades
+retrieval. Flag if step 3's stage_action does not include an explicit "output ONLY the
+search terms" constraint. BM25 rewards named entities over relation words — shorter,
+entity-focused queries outperform full sentences.
+- **hop-2 entity bridging gap**: Step 3 receives ONLY step 2's summary (not step 1's raw
+passages). If step 2 fails to name the bridge entity explicitly, step 3 cannot form a
+useful query. Flag if step 2's stage_action lacks an instruction to surface entity names
+needed for the second-hop search.
+- **system_prompt crowding**: system_prompt is prepended to ALL 4 LLM steps. Every word
+in system_prompt consumes context in every call. Flag if system_prompt exceeds ~20 words
+or contains instructions that duplicate per-step aims.
+- **answer extraction brittleness**: The evaluator extracts answers via regex on
+"Answer: <answer>". Flag if step 6 lacks an explicit format constraint, or if its
+example_reasoning shows a multi-sentence response rather than a single "Answer: X" line.
+- **step 6 evidence blindspot**: Step 6 depends on steps 2 and 5 (not step 4). It never
+sees raw second-hop passages. Flag if step 5's stage_action does not produce a unified
+evidence summary — step 6 needs step 5 to fully consolidate, or it will lack second-hop facts.
+- **example_reasoning length tax**: example_reasoning is included verbatim in the assembled
+prompt. Flag if any step's example_reasoning exceeds ~120 words — longer examples crowd
+the actual instruction and may reduce LLM compliance with format constraints.
+
+---
+
 REQUIREMENTS:
 - Each insight ≤50 words with concrete evidence (step numbers, quoted instructions, metrics)
 - Ground in code/metrics/errors — NO speculation or hallucination
@@ -81,4 +111,4 @@ EXAMPLE:
 "severity": "medium",
 "insight": "step 6 instruction lacks explicit 'Answer: <answer>' format requirement; causes extraction failure on ~1% of samples; mutation should ADD explicit format constraint matching normalization."
 }}
-]
+]
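
Reviewer note: the prompt text in this file says the evaluator extracts answers via regex on "Answer: <answer>". The evaluator's actual pattern is not part of this commit, so the following is only a minimal sketch of why a prose-style step 6 response fails extraction. `ANSWER_RE` and `extract_answer` are hypothetical names chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern; the real evaluator's regex is not shown in this commit.
# It accepts only lines that start with "Answer:" followed by non-empty text.
ANSWER_RE = re.compile(r"^Answer:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def extract_answer(step6_output: str) -> Optional[str]:
    """Return the text of the last 'Answer: ...' line, or None if there is none."""
    matches = ANSWER_RE.findall(step6_output)
    return matches[-1] if matches else None


if __name__ == "__main__":
    print(extract_answer("Answer: Marie Curie"))
    print(extract_answer("The answer is Marie Curie."))
```

Under this assumed pattern, a response that states the answer in a sentence yields `None`, which is the extraction failure the insight categories ask the analyzer to flag.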

gigaevo/prompts/hotpotqa/lineage/system.txt

Lines changed: 21 additions & 11 deletions
@@ -7,19 +7,29 @@ STRATEGY TYPES (pick the most precise):
 - "avoidance": Removed/bypassed failing prompt pattern (e.g., eliminated verbose conflict-resolution rules, removed ambiguous step instruction)
 - "generalization": Added flexibility (e.g., made an instruction question-type-aware, unified duplicate phrasings)
 - "exploration": Introduced novel prompt approach (e.g., new step decomposition strategy, different evidence-synthesis framing)
-- "refinement": Tuned existing prompt (e.g., tightened query format constraint, shortened system_prompt, clarified answer format)
+- "retrieval_refinement": Tuned step 2 or step 3 to improve BM25 query quality or hop-2 entity naming
+- "synthesis_refinement": Tuned step 5 or step 6 for better evidence consolidation or answer format
+- "global_refinement": Shortened or tightened system_prompt
 
-ANALYSIS FRAMEWORK - For each significant prompt change, explain:
+ANALYSIS FRAMEWORK For each significant prompt change, explain:
 1. **WHAT changed**: The specific instruction or phrasing modification (reference diff hunks)
 2. **WHY it helped/hurt**: The underlying mechanism (how the change affects LLM step behavior and downstream retrieval/reasoning)
 3. **TRANSFERABLE LESSON**: What principle can be applied to future mutations
 
 DEPTH REQUIREMENTS:
 - Go beyond surface descriptions ("changed step 3 text") to root causes ("cleaner step 3 output reduces BM25 query noise, improving hop-2 document recall")
 - Connect prompt changes to chain execution behavior
-- Identify **generalizing vs. overfitting** patterns: did the change improve robustness or optimize for specific question types?
+- Identify **generalizing vs. overfitting** patterns: did the change improve robustness or optimize for specific question types? Note when reduced val-test gap indicates better generalization.
 - When metrics improved, explain the mechanism; when they degraded, explain what was lost
 
+MECHANISM VOCABULARY — use these precise terms when applicable:
+- "BM25 query contamination": step 3 emits prose/reasoning alongside the query, degrading retrieval
+- "hop-2 entity bridging failure": step 2 does not name the bridge entity, so step 3 queries for the wrong thing
+- "context crowding": system_prompt or long example_reasoning shrinks effective context for per-step instructions
+- "answer extraction failure": step 6 does not produce "Answer: <answer>" format, causing regex miss
+- "evidence consolidation gap": step 5 output is too thin, leaving step 6 without second-hop facts
+- "val-set overfitting signal": complex multi-rule instructions in steps 5 or 6 often improve val EM but widen val-test gap — flag this pattern explicitly
+
 FORMAT:
 - Each insight: JSON with "strategy" and "description" (≤50 words)
 - Reference diff blocks: "(@@ -X,Y +A,B @@)"
@@ -28,25 +38,25 @@ FORMAT:
 EXAMPLES:
 [
 {
-"strategy": "refinement",
-"description": "(@@ -12,3 +14,2 @@) Tightened step 3 to 'Output ONLY: <entity> <relation>', removing prose preamble; cleaner BM25 query improves hop-2 document recall; correlates with +0.018 val EM gain."
+"strategy": "retrieval_refinement",
+"description": "(@@ -12,3 +14,2 @@) Tightened step 3 to 'Output ONLY: <entity> <relation>', removing prose preamble; eliminates BM25 query contamination; cleaner query improves hop-2 document recall; correlates with +0.018 val EM gain."
 },
 {
 "strategy": "avoidance",
-"description": "(@@ -45,8 +0,0 @@) Removed 5-rule conflict-resolution hierarchy in step 5; over-engineered rules were optimizing for val-set patterns; simplification reduced val-test gap from 8pp to 5pp."
+"description": "(@@ -45,8 +0,0 @@) Removed 5-rule conflict-resolution hierarchy in step 5; over-engineered rules show val-set overfitting signal; simplification reduced val-test gap from 8pp to 5pp without harming val EM."
 },
 {
 "strategy": "exploration",
-"description": "(@@ -22,2 +24,4 @@) Added question-type routing in step 2 (who/when/where tags); improved entity extraction for bridge questions; +0.022 val EM but risk of overfitting to question taxonomy."
+"description": "(@@ -22,2 +24,4 @@) Added question-type routing in step 2 (who/when/where tags); improved hop-2 entity bridging for bridge questions; +0.022 val EM but risk of overfitting to question taxonomy."
 },
 {
 "strategy": "imitation",
-"description": "(@@ -8,1 +8,1 @@) Preserved 'Answer: <answer>' format constraint in step 6; maintaining extraction reliability that prevented 3 failure cases in parent."
+"description": "(@@ -8,1 +8,1 @@) Preserved 'Answer: <answer>' format constraint in step 6; maintaining extraction reliability prevented regex extraction failures seen in parent."
 },
 {
-"strategy": "generalization",
-"description": "(@@ -30 @@) + (@@ -67 @@) Unified duplicate entity-bridging instructions across steps 2 and 3; reduces redundancy and context crowding; single point of change for future mutation."
+"strategy": "global_refinement",
+"description": "(@@ -30 @@) + (@@ -67 @@) Compressed system_prompt from 45 to 12 words; reduced context crowding across all 4 LLM steps; unified duplicate instructions into per-step stage_action for cleaner mutation targets."
 }
 ]
 
-Respond with only valid JSON array, no commentary.
+Respond with only valid JSON array, no commentary.
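
Reviewer note: "BM25 query contamination" can be made concrete with a toy scorer. The sketch below is an assumption-laden illustration, not the chain's retriever: it uses the common Okapi BM25 formula with a non-negative idf variant, default-style parameters `k1=1.5` and `b=0.75`, and a three-document corpus invented for this example. The two queries are the ones contrasted in mutation/system.txt.

```python
import math
import re
from collections import Counter

# Invented toy corpus; the real retrieval index is not part of this commit.
DOCS = [
    "Marie Curie was a physicist of Polish nationality",
    "Find information about national parks and their origin",
    "Information about how to find tax forms",
]


def tokenize(text):
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_doc(query, docs):
    """Index of the highest-scoring document."""
    scores = bm25_scores(query, docs)
    return max(range(len(docs)), key=scores.__getitem__)


if __name__ == "__main__":
    print(top_doc("Marie Curie nationality", DOCS))
    print(top_doc("Find information about Marie Curie's national origin", DOCS))
```

On this toy corpus the terse query ranks the Marie Curie document first, while the full-sentence query ranks the national-parks document first, because filler words such as "find", "information" and "about" contribute matching terms. Real corpora will behave differently in detail.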

gigaevo/prompts/hotpotqa/lineage/user.txt

Lines changed: 30 additions & 14 deletions
@@ -17,19 +17,35 @@ Child: {child_errors}
 
 Analyze the diff for **logical changes** (one insight per logical change). Related hunks implementing the same modification should be grouped into a single insight. For each logical change, explain:
 
-1. **The mechanism**: WHY did this specific prompt change affect the metric? (e.g., "tighter step 3 output format → cleaner BM25 query → better hop-2 retrieval", "removed verbose step 5 rules → simpler evidence synthesis → better generalization", "added explicit answer format → fewer extraction failures")
-
-2. **The trade-off**: What did the change gain vs lose? (e.g., "gained retrieval precision but lost flexibility for multi-entity questions", "faster convergence on common patterns but narrower coverage of edge cases")
-
-3. **Actionable lesson**: What should future mutations learn? (e.g., "preserve clean format constraints in step 3", "avoid adding more conflict-resolution rules to step 5", "always include explicit 'Answer: <answer>' in step 6")
-
-**REGRESSION CHECKLIST** (use when performance degraded):
-- Did a step's explicit output format constraint get weakened or removed?
-- Did step 3 instruction become more verbose or ambiguous (potentially polluting BM25 queries)?
-- Did step 5 conflict-resolution rules increase in complexity without accuracy gain?
-- Did the system_prompt grow significantly longer (crowding per-step instructions)?
-- Did a useful entity-bridging instruction in step 3 get removed or generalized away?
-- Did step 6 lose its explicit 'Answer: <answer>' format requirement?
+1. **The mechanism**: WHY did this specific prompt change affect the metric? (e.g.,
+"tighter step 3 output format → cleaner BM25 query → better hop-2 retrieval",
+"removed verbose step 5 rules → simpler evidence synthesis → better generalization AND narrower val-test gap",
+"added explicit answer format → fewer extraction failures",
+"step 6 example_reasoning grew to 150 words → context crowding → format non-compliance → regex extraction failure")
+
+2. **The trade-off**: What did the change gain vs lose? (e.g., "gained retrieval precision
+but lost flexibility for multi-entity questions", "faster convergence on common patterns
+but narrower coverage of edge cases")
+
+3. **Actionable lesson**: What should future mutations learn? (e.g., "preserve clean format
+constraints in step 3", "avoid adding more conflict-resolution rules to step 5",
+"always include explicit 'Answer: <answer>' in step 6")
+
+**REGRESSION CHECKLIST** (ordered by typical impact; use when performance degraded):
+1. Did step 3 instruction lose its "output ONLY search terms" constraint? (BM25 query
+contamination — affects retrieval for ALL questions)
+2. Did step 6 lose the explicit 'Answer: <answer>' format requirement? (regex extraction
+failures on all affected questions)
+3. Did step 2 lose an explicit instruction to name the bridge entity? (hop-2 query
+formation fails; step 3 searches for the wrong thing)
+4. Did a step's explicit output format constraint get weakened or removed?
+5. Did step 5 conflict-resolution rules increase in complexity without measured gain?
+(likely val-set overfitting — check val-test gap, not just val EM delta)
+6. Did the system_prompt grow significantly longer? (context crowding affects all 4 LLM
+steps simultaneously)
+7. Did a useful entity-bridging instruction in step 3 get removed or generalized away?
+8. Did any step gain a long example_reasoning block (>120 words)? (context crowding
+reduces format compliance; check step 3 and step 6 first)
 
 **QUALITY BAR:**
 ❌ BAD: "Changed step 3 text" (just restates the diff)
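
Reviewer note: checklist items 6 and 8 are word-count checks and could be automated. The following is a rough sketch only. The ~20-word and ~120-word thresholds come from the prompt text in this commit, but the data layout (a `system_prompt` string plus a dict of step number to field dict) and the names `word_count` and `length_flags` are assumptions made for illustration; the repository's actual chain representation is not shown here.

```python
from typing import Dict, List


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def length_flags(
    system_prompt: str,
    steps: Dict[int, Dict[str, str]],
    max_system_words: int = 20,
    max_example_words: int = 120,
) -> List[str]:
    """Return one message per field that exceeds its word budget."""
    flags = []
    n = word_count(system_prompt)
    if n > max_system_words:
        flags.append(f"system_prompt: {n} words (> {max_system_words})")
    for number in sorted(steps):
        # A step without example_reasoning counts as zero words.
        n = word_count(steps[number].get("example_reasoning", ""))
        if n > max_example_words:
            flags.append(
                f"step {number} example_reasoning: {n} words (> {max_example_words})"
            )
    return flags


if __name__ == "__main__":
    chain_steps = {
        3: {"example_reasoning": "Marie Curie nationality"},
        6: {"example_reasoning": "w " * 150},
    }
    for message in length_flags("word " * 25, chain_steps):
        print(message)
```

Whitespace splitting is a crude proxy for token cost; a tokenizer-based count would be closer to the real context budget.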
@@ -44,4 +60,4 @@ Reference hunks by their headers. Group related hunks when they implement one lo
 PARENT CODE (reference):
 ```python
 {parent_code}
-```
+```

gigaevo/prompts/hotpotqa/mutation/system.txt

Lines changed: 55 additions & 1 deletion
@@ -7,4 +7,58 @@ OBJECTIVE:
 {task_description}
 
 AVAILABLE METRICS:
-{metrics_description}
+{metrics_description}
+
+---
+
+PROMPT MUTATION CONSTRAINTS — HARD RULES:
+- You may ONLY modify text content in: system_prompt, and the aim, stage_action,
+reasoning_questions, example_reasoning fields of non-frozen LLM steps (2, 3, 5, 6).
+- Step topology (types, dependencies, count) is FIXED. Do not alter it.
+- Frozen steps (1 and 4) must remain byte-identical to the baseline.
+- Preserve unless explicitly targeting: step 3's "output ONLY" constraint and step 6's
+"Answer: <answer>" format — these protect BM25 retrieval quality and answer extraction.
+
+PROMPT ENGINEERING PRINCIPLES FOR THIS CHAIN:
+- Shorter instructions with explicit constraints outperform longer instructions with
+implicit expectations. When in doubt, compress and constrain rather than expand and explain.
+- Step 3's output is used VERBATIM as a BM25 query. Any prose, preamble, or reasoning
+text it emits will degrade retrieval for ALL questions. This is the highest-priority
+format constraint in the chain.
+- BM25 is a bag-of-words model: it rewards named entities and key attributes, not relation
+words or sentence structure. Step 3 instructions should push toward e.g. "Marie Curie
+nationality" not "Find information about Marie Curie's national origin."
+- system_prompt is prepended to ALL 4 LLM steps. Every word costs context budget in every
+call. Keep it under 20 words focused on global reasoning style, not per-step instructions.
+- example_reasoning fields are included verbatim in the assembled prompt. Long examples
+(>120 words) crowd the actual instruction and reduce LLM compliance with format constraints.
+- Step 6 depends on steps 2 and 5 — it never sees raw second-hop passages. Step 5 must
+fully consolidate evidence; step 6 can only answer from what step 5 produces.
+- aim should be a single-sentence objective (8-15 words). A verbose aim is redundant with
+a well-written stage_action and wastes context budget on every call.
+
+MUTATION FOCUS AREAS (in priority order):
+1. RETRIEVAL PATH (steps 2, 3): Does step 2 name the bridge entity? Does step 3 emit
+only bare search terms? Improvements here help ALL multi-hop questions.
+2. ANSWER PATH (steps 5, 6): Does step 5 produce a unified evidence set? Does step 6
+emit a clean "Answer: X"? Improvements here prevent extraction failures.
+3. GLOBAL CONTEXT (system_prompt): Is it short and role-focused? Compression here
+improves ALL steps simultaneously.
+4. REASONING DEPTH (reasoning_questions, example_reasoning): Add only when evidence
+shows the LLM is failing to apply the right reasoning pattern. Remove when bloated.
+
+ARCHETYPE INTERPRETATION FOR PROMPT MUTATION:
+When selecting archetypes, interpret them in the prompt-engineering domain:
+- "Precision Optimization" → tighten a format constraint, compress a verbose instruction
+(e.g., shorten step 3 stage_action to pure search-term format)
+- "Proven Pattern Extension" → replicate a working format constraint to another step
+(e.g., if step 3's "ONLY output X" works, apply similar discipline to step 6)
+- "Harmful Pattern Removal" → remove verbose/complex rules that show no fitness gain
+(e.g., strip a 5-rule conflict-resolution hierarchy from step 5)
+- "Computational Reinvention" → try a fundamentally different instruction strategy for a step
+(e.g., replace prose step 2 instruction with a structured entity-extraction template)
+- "Solution Space Exploration" → experiment with a different reasoning scaffold
+(e.g., add question-type routing in step 2, or a two-sentence evidence format in step 5)
+- "Approach Synthesis" → combine a clean format constraint with a focused reasoning question
+- "Guided Innovation" → preserve clean format constraints while modifying reasoning depth
+- "Conservative Exploration" → minor wording change to one step within format constraints
