We constructed a dataset of 100 queries spanning informational, factual, and argumentative query types.
We evaluated 3,996 cited sources across 12 systems, including search-equipped LLMs (e.g., GPT-5, Gemini-3-Pro, Grok-4.1), a traditional SERP (Google), and AI search tools (e.g., Exa, Tavily, Gensee).

The full leaderboard is presented in the table below. GPT-5 leads the pack (89.1) by a substantial margin, particularly on the Meta Metric (4.5), suggesting an internal filtering mechanism that rigorously prioritizes institutional authority. Gensee secures the #3 spot by optimizing for Content Relevance (4.3).
There is a striking inverse relationship between a model's SourceBench score and its overlap with Google's results.
Top-performing systems like **GPT-5** overlap with Google only 16% of the time, functioning as "**Discovery Engines**" that find high-quality, buried evidence.
Conversely, lower-scoring systems (e.g., Tavily) overlap 55% with Google, essentially acting as "**Summarization Layers**" over standard SERPs.
<div class="caption">Figure 2: SourceBench Score (Green) vs. Google Overlap (Gray).</div>
</div>
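The overlap numbers above can be approximated with a short sketch. The exact matching rule SourceBench uses is not stated here, so matching by registered domain (and the `serp_overlap` helper itself) is an assumption for illustration:

```python
from urllib.parse import urlparse

def serp_overlap(system_sources, google_results):
    """Fraction of a system's cited sources that also appear in Google's
    results for the same query. Matching by registered domain is an
    assumption; SourceBench's actual matching rule may differ."""
    def domain(url):
        # Normalize "www.example.com" and "example.com" to the same key.
        return urlparse(url).netloc.lower().removeprefix("www.")
    if not system_sources:
        return 0.0
    google_domains = {domain(u) for u in google_results}
    hits = sum(domain(u) in google_domains for u in system_sources)
    return hits / len(system_sources)

# Hypothetical example: one of two cited sources also surfaces on the SERP.
cited = ["https://archive.org/details/report", "https://www.nature.com/articles/x1"]
serp = ["https://en.wikipedia.org/wiki/Topic", "https://nature.com/articles/x1"]
print(serp_overlap(cited, serp))  # → 0.5
```

A system printing a value near 0.16 here would behave like the "Discovery Engine" profile above; one near 0.55 would look like a "Summarization Layer."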
### Insight 3: Better Search > Better Reasoning
Instead of relying on a model to "think" its way through noise, providing superior, well-curated context allows simpler models to achieve better outcomes. In our controlled experiment with DeepSeek, a non-reasoning model ("Chat") with high-quality search tools outperformed a reasoning model with low-quality search tools.