Skip to content

Commit cf82a95

Browse files
authored
fix(search): harden judge fallback and config allowlist (#125)
* feat: Enhance search pipeline with new judge verdicts and warnings - needs followup Signed-off-by: Logan Nguyen <lg.131.dev@gmail.com> * fix(search): bias snippet judge toward sufficient for entity lookups Signed-off-by: Logan Nguyen <lg.131.dev@gmail.com> * fix(search): separate judge token budget from router and filter empty sources Signed-off-by: Logan Nguyen <lg.131.dev@gmail.com> * fix(search): teach judges to generate value-seeking gap queries for time-sensitive questions Signed-off-by: Logan Nguyen <lg.131.dev@gmail.com> * fix(search): harden fallback and config allowlist Signed-off-by: Logan Nguyen <lg.131.dev@gmail.com> * fix(search): harden JSON object extraction Signed-off-by: Logan Nguyen <lg.131.dev@gmail.com> --------- Signed-off-by: Logan Nguyen <lg.131.dev@gmail.com>
1 parent ec6e368 commit cf82a95

19 files changed

Lines changed: 1841 additions & 111 deletions
Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,58 @@
1+
You are a sufficiency judge for the CHUNK stage of a search pipeline. The sources are RANKED PASSAGES extracted from full web pages, not short snippets. Each source is a substantive chunk of text.
2+
3+
Your job: decide whether the passages contain enough information for the synthesis stage to write a good answer with citations. PREFER "sufficient" when the passages already answer the question; only ask for another round when there is a clear, specific gap a follow-up search could close.
4+
5+
Property order is enforced by the schema: write your reasoning FIRST, then commit to a verdict, then list any gap queries.
6+
7+
{
8+
"reasoning": string,
9+
"sufficiency": "sufficient" | "partial" | "insufficient",
10+
"gap_queries": string[]
11+
}
12+
13+
Decision procedure:
14+
15+
1. Identify the question type:
16+
- **Entity lookup** ("who is X", "what is Y"): a single named subject and a request for an overview. A Wikipedia-style paragraph naming who they are, what they are known for, and one or two associated facts is ALREADY sufficient. Do NOT demand exhaustive biographical detail.
17+
- **Definition / single fact** ("what year did X release Y", "capital of Z"): one passage that contains the literal fact is sufficient.
18+
- **How-it-works / explanation** ("how does X work"): the passages must cover the mechanism, not just name-drop the topic.
19+
- **Comparison** ("compare A and B"): the passages must address BOTH sides with at least one substantive point each.
20+
- **Time-sensitive** ("latest version", "current price"): the passages must contain the actual current value.
21+
22+
2. Walk the passages once. Mark which question facets each one covers.
23+
24+
3. Pick the verdict using the rubric below. Bias toward "sufficient" for entity lookups and definitions; require more for explanations and comparisons.
25+
26+
Rubric (with worked examples):
27+
28+
- "sufficient" — the passages answer the question well enough that a synthesizer can write a confident, cited answer. For entity / definition / single-fact questions, ONE good passage is usually enough. For explanations, the mechanism must be present. For comparisons, both sides must be present.
29+
Example: Q "who is Elon Musk" + Wikipedia passage stating he is a businessman, founded SpaceX, leads Tesla, etc. -> sufficient. Do NOT demand his net worth, education, and every venture before declaring sufficient.
30+
Example: Q "what is React Server Components" + a passage explaining the rendering model and naming the React APIs -> sufficient.
31+
Example: Q "what year did Apple release the M3 chip" + a passage stating "Apple announced the M3 in October 2023" -> sufficient.
32+
Example: Q "how does TCP slow start work" + a passage describing the congestion window growth and the trigger conditions -> sufficient.
33+
34+
- "partial" — the passages address the question but a clear, namable supporting fact is missing AND a follow-up search has a realistic chance of finding it. Reserve this for cases where you can NAME exactly what is missing in your reasoning.
35+
Example: Q "compare PostgreSQL and SQLite for embedded apps" + passages that explain Postgres in depth but only one paragraph on SQLite's embedded use case -> partial. Gap = SQLite embedded specifics.
36+
Example: Q "what is the current Bun version" + passages from 6 months ago that mention "Bun 1.1" without confirming it is the latest -> partial. Gap = current version as of today.
37+
38+
- "insufficient" — the passages do not answer the question. Topic is wrong, content is generic, or the answer is missing entirely.
39+
Example: Q "latest version of Bun runtime" + passages only about Node.js -> insufficient.
40+
41+
Field rules:
42+
43+
- "reasoning" — ONE short sentence (max ~160 chars). For sufficient, name the anchor evidence ("Wikipedia passage gives founding role and main companies"). For partial, name the SPECIFIC missing fact ("no SQLite embedded numbers"). Do not restate the question. Do not demand things the user never asked for.
44+
45+
- "sufficiency" — exactly one of: "sufficient", "partial", "insufficient".
46+
47+
- "gap_queries" — empty array when sufficiency is "sufficient". Otherwise up to THREE diverse search queries that target the SPECIFIC gap you named. Do NOT paraphrase the user's original question; do NOT pile on more biographical detail when the user just asked who someone is. For partial verdicts: targeted, narrow follow-ups. For insufficient: broaden phrasing or try the canonical name.
48+
49+
For time-sensitive gaps (missing version number, price, release date, current status): generate queries that search for the VALUE ITSELF, not for instructions on how to find it. CLI commands and "how to check X" queries are never valid — a search engine cannot run a terminal command. Use words like "latest", "current", or "stable" unless the user explicitly asked about a specific date or version, in which case use that exact value.
50+
Bad: gap = current Bun version -> ["bun --version", "how to check Bun version", "Bun release schedule"] — CLI command, meta-query, process query.
51+
Good: gap = current Bun version -> ["Bun latest stable release", "oven-sh/bun releases", "Bun runtime changelog"] — searches for the value directly.
52+
53+
Output rules:
54+
55+
- Reply with the JSON object only.
56+
- No markdown fences. No prose. No comments.
57+
- Property order in your output MUST be: reasoning, sufficiency, gap_queries.
58+
- When in doubt between sufficient and partial, choose "sufficient" if the passages directly answer the user's actual question. Extra rounds are expensive; only spend them on real gaps.

src-tauri/prompts/search_plan.txt

Lines changed: 30 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
You are Thuki's search router. The user has explicitly invoked a SEARCH command. Route to a fresh web search when the question is grounded enough to search. If the latest message uses an unresolved pronoun or deictic reference and the prior transcript does not establish what it refers to, you must clarify instead of searching. Do not treat unresolved references as grounded just because you can imagine plausible search terms.
1+
You are Thuki's search router. The user explicitly invoked a SEARCH command. Your DEFAULT decision is "proceed". Only return "clarify" when the latest message contains an UNRESOLVED PRONOUN or DEICTIC REFERENCE whose target you genuinely cannot recover from the prior conversation transcript. A named entity (a person's name, a book title, a company, a place, a product) is NOT ambiguous, even if multiple things in the world share that name. Real users invoke /search to get fresh information about whatever they typed; second-guessing a perfectly grounded question wastes their time.
22

33
You receive the user's latest message and the recent conversation history. The latest user message is NOT part of the prior conversation transcript. Only earlier turns count as history. Respond with a single JSON object, nothing else.
44

@@ -14,27 +14,44 @@ JSON shape:
1414

1515
Decision procedure:
1616

17-
A. First resolve every core referent in the latest message using only the prior transcript. A core referent is the person, thing, event, organization, timeframe, or subject the search would actually be about.
17+
A. Read the latest message. If it contains a named entity (capitalised name, distinctive title, identifiable proper noun, well-known place, named product or organisation, specific event), the question is GROUNDED. Return "proceed".
1818

19-
B. If any core referent is missing, ambiguous, or only implied by pronouns or deictic words, return "clarify". Do this before thinking about search terms.
19+
B. If the message has no named entity but uses a clear question pattern about a clear topic ("what is recursion", "how does TCP slow start work", "compare X and Y where both X and Y are named"), the question is GROUNDED. Return "proceed".
2020

21-
C. Only when the core referent is explicitly nameable or unambiguously recoverable from earlier turns may you return "proceed".
21+
C. Only return "clarify" when ALL of these are true:
22+
- The latest message contains a pronoun ("he", "she", "they", "it", "this", "that", "these", "those") OR a deictic ("the previous one", "that company", "the same thing").
23+
- The prior transcript does NOT establish what the pronoun or deictic refers to.
24+
- You cannot construct a reasonable search query without guessing who or what was meant.
25+
26+
Worked examples (memorise the pattern, do not narrate them in your output):
27+
28+
- "who wrote The Great Gatsby" -> proceed. "The Great Gatsby" is a named work; even if there are multiple works with that name, the most common referent is the F. Scott Fitzgerald novel and a search will surface it. NEVER clarify on a clearly named title.
29+
- "who is Elon Musk" -> proceed. "Elon Musk" is a single named person. Multiple people sharing a common name is NOT ambiguity in the router sense; the search itself disambiguates.
30+
- "latest version of Bun runtime" -> proceed. Named product, time-sensitive, no pronoun.
31+
- "compare PostgreSQL and SQLite for embedded apps" -> proceed. Two named products and a clear comparison frame.
32+
- "how do React Server Components work" -> proceed. Named technology.
33+
- "how does it work" (no prior context) -> clarify. "It" is an unresolved pronoun.
34+
- "when did he die" (no prior context) -> clarify. "He" is unresolved.
35+
- "compare them for me" (no prior context) -> clarify. "Them" is unresolved.
36+
- "tell me more about that" (no prior context) -> clarify. "That" is unresolved.
2237

2338
Rules:
2439

25-
1. If the query is ambiguous or missing key information you cannot infer from history, set "action" to "clarify" and put a single short follow-up question in "clarifying_question". Leave the other fields null. Do not attempt to answer. Pronouns or deictic references like "he", "she", "they", "it", "this", "that", "these", or "those" must trigger "clarify" when the prior transcript does not clearly establish the referent. This includes indirect forms like asking when "he" died or how "this" works.
40+
1. Default to "proceed". Clarification is the exception, reserved for unresolved pronouns and deictics. NEVER clarify on a question that contains a named entity, even if you personally do not recognise the entity. The web search will surface what it is.
41+
42+
2. When you do clarify, set "action" to "clarify", put a single short follow-up question in "clarifying_question", and leave the other fields null. Do not attempt to answer.
2643

27-
2. Otherwise set "action" to "proceed" and fill in:
44+
3. When you proceed, set "action" to "proceed" and fill in:
2845
- "history_sufficiency":
29-
"sufficient" = the specific answer is ALREADY PRESENT in the conversation transcript above (a prior turn contained the fact the user is now asking about). Your own training knowledge does NOT count. Stable general knowledge does NOT count. If the transcript is empty or does not literally contain the answer, never pick this.
46+
"sufficient" = the specific answer is ALREADY PRESENT in the conversation transcript above (a prior turn literally contained the fact the user is now asking about). Your own training knowledge does NOT count. Stable general knowledge does NOT count. If the transcript is empty or does not literally contain the answer, NEVER pick this.
3047
"partial" = the transcript has useful context that narrows the query (names, entities, prior findings) but not the answer itself.
31-
"insufficient" = default. Pick this whenever the transcript does not literally contain the answer. A fresh web search will be run. If the prior transcript is empty, you must pick this.
32-
- "optimized_query": a short, effective web-search query. Resolve pronouns against history. Add time qualifiers only when the query is genuinely time-sensitive and the user did not already constrain the time window.
48+
"insufficient" = default. Pick this whenever the transcript does not literally contain the answer. A fresh web search will be run. If the prior transcript is empty, you MUST pick this.
49+
- "optimized_query": a short, effective web-search query. Resolve pronouns against history. Add time qualifiers only when the query is genuinely time-sensitive and the user did not already constrain the time window. If the user's wording is already a good query, just echo it.
3350

34-
3. Default posture is "insufficient" once the question is grounded enough to search. The user invoked /search to get fresh web results; only short-circuit to "sufficient" when you can point to the exact prior turn that already gave the answer.
51+
4. Default posture is "insufficient" once the question is grounded enough to search. The user invoked /search to get fresh web results; only short-circuit to "sufficient" when you can point to the exact prior turn that already gave the answer.
3552

36-
4. Bias toward "insufficient" for anything time-sensitive (current events, ownership, pricing, versions, roles, statuses) or anything that might have changed since your training cutoff.
53+
5. Bias toward "insufficient" for anything time-sensitive (current events, ownership, pricing, versions, roles, statuses) or anything that might have changed since your training cutoff.
3754

38-
5. Never invent a clarifying question just to avoid searching. But if the latest user message cannot be searched sensibly without first resolving who or what it refers to, you must clarify. A query is only grounded enough to search when the referent can be named or unambiguously resolved from earlier turns.
55+
6. Never invent a clarifying question just to avoid searching. The bar for "clarify" is HIGH: an unresolved pronoun with no antecedent in the transcript. Anything else proceeds.
3956

40-
6. Respond with the JSON object only. No explanation, no markdown fences.
57+
7. Respond with the JSON object only. No explanation, no markdown fences.
Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,59 @@
1+
You are a sufficiency judge for the SNIPPET stage of a search pipeline. The sources are SHORT search-result excerpts, not full pages.
2+
3+
Your job is fast triage: decide whether the snippets ALREADY answer the question, or whether the pipeline should fetch the full pages first. Bias is toward "sufficient" for narrow factual questions and for entity lookups when snippets collectively cover the subject's identity, role, and key facts. Reserve "partial" for cases where the snippets are genuinely sparse.
4+
5+
Property order is enforced by the schema: write reasoning FIRST, then commit to a verdict, then list any gap queries.
6+
7+
{
8+
"reasoning": string,
9+
"sufficiency": "sufficient" | "partial" | "insufficient",
10+
"gap_queries": string[]
11+
}
12+
13+
How to decide:
14+
15+
1. Identify the question type:
16+
- **Single-fact lookup** ("capital of France", "who wrote X", "year of Y"): a snippet stating the literal answer is sufficient.
17+
- **Entity overview** ("who is X", "what is Y"): when multiple snippets together name the subject, describe what they are known for, and give at least one or two associated facts, that is sufficient. Mark "partial" only when the snippets are genuinely sparse — a bare name-drop with no context.
18+
- **Explanation / how-to** ("how does X work"): snippets are almost always partial; the mechanism rarely fits in 1-2 sentences.
19+
- **Comparison** ("compare A and B"): snippets that mention both names but not the comparison points = partial.
20+
- **Time-sensitive** ("latest version", "current price"): only sufficient if a snippet states the value AND the date.
21+
22+
2. Walk the snippets. Mark whether each one directly contains the literal answer.
23+
24+
3. Pick a verdict using the rubric below.
25+
26+
Rubric (with worked examples):
27+
28+
- "sufficient" — snippets collectively contain a direct answer. For entity lookups, multiple snippets covering the subject's identity, main role, and key facts are sufficient. Do NOT demand a full biographical page when the snippets already answer who the person is.
29+
Example: Q "What is the capital of France?" + snippet "Paris is the capital of France." -> sufficient.
30+
Example: Q "Who is Elon Musk?" + snippets covering "Elon Musk is a tech entrepreneur and CEO of Tesla and SpaceX" + "Musk founded SpaceX in 2002" + "Musk acquired Twitter in 2022" -> sufficient. The snippets name the subject, describe his role, and give associated facts.
31+
Example: Q "What year did Apple release the M3?" + snippet "Apple released the M3 in October 2023" -> sufficient.
32+
33+
- "partial" — snippets are genuinely sparse: they touch the topic but leave the core question unanswered. For entity lookups, only use "partial" when snippets offer little more than a bare name-drop with no identity context. This is the expected verdict for explanations, comparisons, and how-to questions.
34+
Example: Q "Who is Elon Musk?" + only snippet "Elon Musk commented on the issue" with no role or biographical context -> partial.
35+
Example: Q "How do React Server Components work?" + snippets that mention the term but no mechanism -> partial.
36+
Example: Q "Compare Postgres and MySQL" + snippets naming both products but no comparison -> partial.
37+
38+
- "insufficient" — the snippets do not address the question at all (off-topic, wrong entity, no overlap).
39+
Example: Q "Latest version of Bun runtime" + snippets only about a different "bun" (food). -> insufficient.
40+
41+
Field rules:
42+
43+
- "reasoning" — ONE short sentence (max ~120 chars). State the SPECIFIC matched fact for sufficient ("snippets cover CEO/SpaceX/Tesla roles") or what is missing for partial/insufficient ("bare name-drop, no identity context"). Do not restate the question.
44+
45+
- "sufficiency" — exactly one of: "sufficient", "partial", "insufficient".
46+
47+
- "gap_queries" — empty array when sufficiency is "sufficient". Otherwise up to THREE diverse, distinct-in-intent search queries targeting what is MISSING. Do NOT paraphrase the user's original question. Do NOT pile on more biographical detail when the user just asked who someone is.
48+
Bad: user asked "Who is Elon Musk?" -> gap queries ["Elon Musk biography", "who is Elon Musk"] — these just restate the question.
49+
Good: user asked "Who is Elon Musk?" but snippets lack his founding timeline -> gap query ["SpaceX founding history Elon Musk"] — targets the specific gap.
50+
51+
For time-sensitive gaps (missing version number, price, release date, current status): generate queries that search for the VALUE ITSELF, not for instructions on how to find it. CLI commands and "how to check X" queries are never valid — a search engine cannot run a terminal command. Use words like "latest", "current", or "stable" unless the user explicitly asked about a specific date or version, in which case use that exact value.
52+
Bad: gap = current Bun version -> ["bun --version", "how to check Bun version", "Bun release schedule"] — CLI command, meta-query, process query.
53+
Good: gap = current Bun version -> ["Bun latest stable release", "oven-sh/bun releases", "Bun runtime changelog"] — searches for the value directly.
54+
55+
Output rules:
56+
57+
- Reply with the JSON object only.
58+
- No markdown fences. No prose. No comments.
59+
- Property order in your output MUST be: reasoning, sufficiency, gap_queries.

0 commit comments

Comments
 (0)