You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
- "reasoning_scaffold": whether reasoning_questions guide toward the right intermediate outputs
32
+
- "aim_verbosity": whether a step's aim is unnecessarily long and redundant with stage_action
28
33
29
34
TAGS (pick one):
30
35
- **beneficial**: Pattern that HELPS the metric — mutation should PRESERVE or EXTEND it
@@ -37,6 +42,31 @@ SEVERITY: high (major metric impact), medium (moderate), low (minor refinement)
37
42
38
43
---
39
44
45
+
CHAIN-SPECIFIC FAILURE MODES TO CHECK:
46
+
- **BM25 query pollution**: Step 3's output is used verbatim as the BM25 retrieval query
47
+
(via $history[-1]). Any prose, preamble, or reasoning text in step 3's output degrades
48
+
retrieval. Flag if step 3's stage_action does not include an explicit "output ONLY the
49
+
search terms" constraint. BM25 rewards named entities over relation words — shorter,
50
+
entity-focused queries outperform full sentences.
51
+
- **hop-2 entity bridging gap**: Step 3 receives ONLY step 2's summary (not step 1's raw
52
+
passages). If step 2 fails to name the bridge entity explicitly, step 3 cannot form a
53
+
useful query. Flag if step 2's stage_action lacks an instruction to surface entity names
54
+
needed for the second-hop search.
55
+
- **system_prompt crowding**: system_prompt is prepended to ALL 4 LLM steps. Every word
56
+
in system_prompt consumes context in every call. Flag if system_prompt exceeds ~20 words
57
+
or contains instructions that duplicate per-step aims.
58
+
- **answer extraction brittleness**: The evaluator extracts answers via regex on
59
+
"Answer: <answer>". Flag if step 6 lacks an explicit format constraint, or if its
60
+
example_reasoning shows a multi-sentence response rather than a single "Answer: X" line.
61
+
- **step 6 evidence blindspot**: Step 6 depends on steps 2 and 5 (not step 4). It never
62
+
sees raw second-hop passages. Flag if step 5's stage_action does not produce a unified
63
+
evidence summary — step 6 needs step 5 to fully consolidate, or it will lack second-hop facts.
64
+
- **example_reasoning length tax**: example_reasoning is included verbatim in the assembled
65
+
prompt. Flag if any step's example_reasoning exceeds ~120 words — longer examples crowd
66
+
the actual instruction and may reduce LLM compliance with format constraints.
67
+
68
+
---
69
+
40
70
REQUIREMENTS:
41
71
- Each insight ≤50 words with concrete evidence (step numbers, quoted instructions, metrics)
42
72
- Ground in code/metrics/errors — NO speculation or hallucination
@@ -81,4 +111,4 @@ EXAMPLE:
81
111
"severity": "medium",
82
112
"insight": "step 6 instruction lacks explicit 'Answer: <answer>' format requirement; causes extraction failure on ~1% of samples; mutation should ADD explicit format constraint matching normalization."
- Connect prompt changes to chain execution behavior
20
-
- Identify **generalizing vs. overfitting** patterns: did the change improve robustness or optimize for specific question types?
22
+
- Identify **generalizing vs. overfitting** patterns: did the change improve robustness or optimize for specific question types? Note when reduced val-test gap indicates better generalization.
21
23
- When metrics improved, explain the mechanism; when they degraded, explain what was lost
22
24
25
+
MECHANISM VOCABULARY — use these precise terms when applicable:
- "hop-2 entity bridging failure": step 2 does not name the bridge entity, so step 3 queries for the wrong thing
28
+
- "context crowding": system_prompt or long example_reasoning shrinks effective context for per-step instructions
29
+
- "answer extraction failure": step 6 does not produce "Answer: <answer>" format, causing regex miss
30
+
- "evidence consolidation gap": step 5 output is too thin, leaving step 6 without second-hop facts
31
+
- "val-set overfitting signal": complex multi-rule instructions in steps 5 or 6 often improve val EM but widen val-test gap — flag this pattern explicitly
32
+
23
33
FORMAT:
24
34
- Each insight: JSON with "strategy" and "description" (≤50 words)
25
35
- Reference diff blocks: "(@@ -X,Y +A,B @@)"
@@ -28,25 +38,25 @@ FORMAT:
28
38
EXAMPLES:
29
39
[
30
40
{
31
-
"strategy": "refinement",
32
-
"description": "(@@ -12,3 +14,2 @@) Tightened step 3 to 'Output ONLY: <entity> <relation>', removing prose preamble; cleaner BM25 query improves hop-2 document recall; correlates with +0.018 val EM gain."
41
+
"strategy": "retrieval_refinement",
42
+
"description": "(@@ -12,3 +14,2 @@) Tightened step 3 to 'Output ONLY: <entity> <relation>', removing prose preamble; eliminates BM25 query contamination; cleaner query improves hop-2 document recall; correlates with +0.018 val EM gain."
33
43
},
34
44
{
35
45
"strategy": "avoidance",
36
-
"description": "(@@ -45,8 +0,0 @@) Removed 5-rule conflict-resolution hierarchy in step 5; over-engineered rules were optimizing for val-set patterns; simplification reduced val-test gap from 8pp to 5pp."
46
+
"description": "(@@ -45,8 +0,0 @@) Removed 5-rule conflict-resolution hierarchy in step 5; over-engineered rules show val-set overfitting signal; simplification reduced val-test gap from 8pp to 5pp without harming val EM."
37
47
},
38
48
{
39
49
"strategy": "exploration",
40
-
"description": "(@@ -22,2 +24,4 @@) Added question-type routing in step 2 (who/when/where tags); improved entity extraction for bridge questions; +0.022 val EM but risk of overfitting to question taxonomy."
50
+
"description": "(@@ -22,2 +24,4 @@) Added question-type routing in step 2 (who/when/where tags); improved hop-2 entity bridging for bridge questions; +0.022 val EM but risk of overfitting to question taxonomy."
41
51
},
42
52
{
43
53
"strategy": "imitation",
44
-
"description": "(@@ -8,1 +8,1 @@) Preserved 'Answer: <answer>' format constraint in step 6; maintaining extraction reliability that prevented 3 failure cases in parent."
54
+
"description": "(@@ -8,1 +8,1 @@) Preserved 'Answer: <answer>' format constraint in step 6; maintaining extraction reliability prevented regex extraction failures seen in parent."
45
55
},
46
56
{
47
-
"strategy": "generalization",
48
-
"description": "(@@ -30 @@) + (@@ -67 @@) Unified duplicate entity-bridging instructions across steps 2 and 3; reduces redundancy and context crowding; single point of change for future mutation."
57
+
"strategy": "global_refinement",
58
+
"description": "(@@ -30 @@) + (@@ -67 @@) Compressed system_prompt from 45 to 12 words; reduced context crowding across all 4 LLM steps; unified duplicate instructions into per-step stage_action for cleaner mutation targets."
49
59
}
50
60
]
51
61
52
-
Respond with only valid JSON array, no commentary.
62
+
Respond with only valid JSON array, no commentary.
Copy file name to clipboardExpand all lines: gigaevo/prompts/hotpotqa/lineage/user.txt
+30-14Lines changed: 30 additions & 14 deletions
Original file line number
Diff line number
Diff line change
@@ -17,19 +17,35 @@ Child: {child_errors}
17
17
18
18
Analyze the diff for **logical changes** (one insight per logical change). Related hunks implementing the same modification should be grouped into a single insight. For each logical change, explain:
19
19
20
-
1. **The mechanism**: WHY did this specific prompt change affect the metric? (e.g., "tighter step 3 output format → cleaner BM25 query → better hop-2 retrieval", "removed verbose step 5 rules → simpler evidence synthesis → better generalization", "added explicit answer format → fewer extraction failures")
21
-
22
-
2. **The trade-off**: What did the change gain vs lose? (e.g., "gained retrieval precision but lost flexibility for multi-entity questions", "faster convergence on common patterns but narrower coverage of edge cases")
23
-
24
-
3. **Actionable lesson**: What should future mutations learn? (e.g., "preserve clean format constraints in step 3", "avoid adding more conflict-resolution rules to step 5", "always include explicit 'Answer: <answer>' in step 6")
25
-
26
-
**REGRESSION CHECKLIST** (use when performance degraded):
27
-
- Did a step's explicit output format constraint get weakened or removed?
28
-
- Did step 3 instruction become more verbose or ambiguous (potentially polluting BM25 queries)?
29
-
- Did step 5 conflict-resolution rules increase in complexity without accuracy gain?
30
-
- Did the system_prompt grow significantly longer (crowding per-step instructions)?
31
-
- Did a useful entity-bridging instruction in step 3 get removed or generalized away?
32
-
- Did step 6 lose its explicit 'Answer: <answer>' format requirement?
20
+
1. **The mechanism**: WHY did this specific prompt change affect the metric? (e.g.,
0 commit comments