Commit 18a94fa

update sourcebench md
1 parent: 5be887a

1 file changed (+8 −6 lines)


content/posts/sourcebench.md (8 additions, 6 deletions)

--- a/content/posts/sourcebench.md
+++ b/content/posts/sourcebench.md
@@ -105,8 +105,8 @@ We constructed a dataset of 100 queries spanning informational, factual, argumen
 
 We evaluated 3,996 cited sources across 12 systems, including search-equipped LLMs (e.g., GPT-5, Gemini-3-Pro, Grok-4.1), traditional SERP (Google), and AI search tools (e.g., Exa, Tavily, Gensee).
 
-<div class="relative h-[400px] w-full border border-gray-100 rounded-lg p-6 bg-white shadow-sm mt-8">
-<canvas id="leaderboardChart"></canvas>
+<div class="chart-wrapper">
+<canvas id="leaderboardChart"></canvas>
 </div>
 
 The full leaderboard is presented in the table below. GPT-5 leads the pack (89.1) with a substantial margin, particularly in the Meta Metric (4.5), suggesting an internal filtering mechanism that rigorously prioritizes institutional authority. Gensee secures the #3 spot by optimizing for Content Relevance (4.3).
@@ -146,16 +146,16 @@ There is a striking inverse relationship between a model's SourceBench score and
 Top-performing systems like **GPT-5** overlap with Google only 16% of the time, functioning as "**Discovery Engines**" that find high-quality, buried evidence.
 Conversely, lower-scoring systems (e.g., Tavily) overlap 55% with Google, essentially acting as "**Summarization Layers**" over standard SERPs.
 
-<div class="relative h-[300px] w-full border border-gray-100 rounded-lg p-6 bg-white shadow-sm">
-<canvas id="inverseChart"></canvas>
+<div class="chart-wrapper">
+<canvas id="inverseChart"></canvas>
 </div>
 <div class="caption">Figure 2: SourceBench Score (Green) vs. Google Overlap (Gray).</div>
 </div>
 
 ### Insight 3: Better Search > Better Reasoning
 Instead of relying on a model to "think" its way through noise, providing superior, well-curated context allows simpler models to achieve better outcomes. In our controlled experiment with DeepSeek, a non-reasoning model ("Chat") with high-quality search tools outperformed a reasoning model with low-quality search tools.
-<div class="relative h-[250px] w-full border border-gray-100 rounded-lg p-6 bg-white shadow-sm">
-<canvas id="deepseekChart"></canvas>
+<div class="chart-wrapper">
+<canvas id="deepseekChart"></canvas>
 </div>
 <div class="caption">Figure 3: DeepSeek experiment results.</div>
 </div>
@@ -273,6 +273,7 @@ const deepSeekData = [
 { name: 'Chat + High Search', score: 75.9, color: '#8b5cf6' },
 ];
 
+document.addEventListener('DOMContentLoaded', function() {
 new Chart(document.getElementById('leaderboardChart'), {
 type: 'bar',
 data: {
@@ -350,6 +351,7 @@ new Chart(document.getElementById('deepseekChart'), {
 scales: { x: { min: 65, max: 80, grid: { color: '#f1f5f9' } }, y: { grid: { display: false } } }
 }
 });
+});
 
 
 </script>
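The `document.addEventListener('DOMContentLoaded', ...)` wrapper added in the last two hunks defers the `new Chart(...)` calls until the page's `<canvas>` elements have been parsed. Below is a minimal, browser-free sketch of that deferred-init pattern; the `document` object and `Chart` class here are stand-in stubs (not the real browser DOM or Chart.js) so the sequencing can be shown outside a browser:

```javascript
// Sketch of the deferred-init pattern applied in this commit. The stubs below
// stand in for the browser `document` and the Chart.js `Chart` class; wrapping
// chart construction in a DOMContentLoaded listener guarantees the <canvas>
// elements exist before `new Chart(...)` tries to look them up.

const listeners = [];
const document = {
  // Collect handlers the way the browser would before the DOM is ready.
  addEventListener(event, fn) {
    if (event === 'DOMContentLoaded') listeners.push(fn);
  },
  // Stand-in for the real element lookup.
  getElementById(id) {
    return { id };
  },
};

let created = null;
class Chart {
  constructor(el, config) {
    created = { el, config }; // record the construction for inspection
  }
}

// The pattern from the diff: defer chart creation until the DOM is parsed.
document.addEventListener('DOMContentLoaded', function () {
  new Chart(document.getElementById('leaderboardChart'), { type: 'bar' });
});

// Nothing has been constructed yet; the chart appears only once the
// (simulated) DOMContentLoaded event fires.
listeners.forEach((fn) => fn());
```

Without the wrapper, a `<script>` placed before the markup would call `getElementById` on canvases that do not exist yet; deferring to DOMContentLoaded (or moving the script to the end of `<body>`) avoids that race.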
