Commit 9f9a406

Add blog post: Comparing frontier Claude models (June 2026) (#1144)
## Summary

- New blog post comparing Sonnet 4.6, Opus 4.6, and Opus 4.8 on a fixed set of 5 RHEL triage issues using the e2e test harness
- Adds recharts-based interactive charts (duration, tool calls, input tokens, output tokens, cost)
- Introduces `src/components/ModelEvalCharts/` as the first reusable chart component in the repo
- Post converted to `.mdx` to support JSX chart components

## Test plan

- [x] Charts render correctly in browser (verified locally via `npm start`)
- [x] All 60 data points in charts verified against original raw tables
- [x] All GitHub profile links in acknowledgments verified (HTTP 200)

Example:

<img width="2510" height="1761" alt="image" src="https://github.com/user-attachments/assets/11e9b595-6a78-4b32-a72f-303640633710" />

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2 parents c4bdbba + 06065b4 commit 9f9a406

4 files changed

Lines changed: 2002 additions & 1601 deletions

package.json

Lines changed: 2 additions & 1 deletion
@@ -23,7 +23,8 @@
     "clsx": "^1.2.1",
     "prism-react-renderer": "^1.3.5",
     "react": "^17.0.2",
-    "react-dom": "^17.0.2"
+    "react-dom": "^17.0.2",
+    "recharts": "^2.15.4"
   },
   "devDependencies": {
     "@docusaurus/module-type-aliases": "2.4.1",
Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
---
title: Comparing frontier Claude models for our AI workflows (June 2026)
authors: ttomecek
tags:
  - AI
  - automation
  - development
  - eval
---

import {
  DurationChart,
  ToolCallsChart,
  InputTokensChart,
  OutputTokensChart,
  CostChart,
} from "@site/src/components/ModelEvalCharts";

We evaluated three Claude frontier models, Sonnet 4.6, Opus 4.6, and Opus 4.8,
on a fixed set of five RHEL triage issues using our [end-to-end test
harness](https://github.com/packit/ai-workflows/blob/main/e2e-ci-setup.md). The
harness validates our triaging agent against known issues. Triage is the first
and most consequential step in the workflow: the agent reads a Jira CVE issue
and decides how it should be resolved, whether by backporting an upstream
patch, rebasing to a newer version, or marking it as not-affected. A wrong call
here invalidates everything downstream, which is why it's the natural starting
point for a model benchmark.

The 4.6 models ran with `REASONING_EFFORT=high`, which enables native extended
thinking. Opus 4.8 ran without it: a LiteLLM/BeeAI provider ID mismatch caused
the model to reject the thinking parameter, so we disabled it entirely
([tracked here](https://github.com/i-am-bee/beeai-framework/issues/1463)). The
harness runs all five issues concurrently and captures per-issue metrics from
the agent framework: wall-clock duration, tool call count, and total token
usage.

This is the first time we've done this kind of work, and it was quite a
learning experience. This blog post contains data from a single run with each
model.

I already have ideas for improving a future run. We need to extend the scope to
the backporting agent as well and have at least ten cases. Doing multiple runs
per model and aggregating them would also give us more grounded results.

<!-- truncate -->

## Analysis

### Resolutions

All three models reached identical conclusions on all five issues. The triage
decisions were unambiguous: three backports (RHEL-15216, RHEL-112546,
RHEL-177992) and two not-affected (RHEL-114607, RHEL-174694). RHEL-114607 is
worth highlighting: the issue concerns CVE-2025-59375 in expat, which only
affects versions before 2.7.2. RHEL 10 already ships 2.7.3, so the models
correctly concluded not-affected rather than recommending a rebase to 2.7.5.

### Speed

Sonnet 4.6 was the fastest model on three of the five issues, sometimes by a
large margin. RHEL-112546 illustrates the biggest spread: Sonnet resolved it in
98s with just 9 tool calls, while Opus 4.6 took 267s and 37 tool calls, and
Opus 4.8 took 154s and 25 tool calls. Both Opus models invested significantly
more effort on this libtiff CVE, exploring more patch URLs and performing
deeper code analysis.

<DurationChart />

RHEL-174694 shows the opposite pattern: Opus 4.8 took 389s with 17 tool calls,
more than twice Sonnet's 167s, despite using fewer tool calls than Sonnet's 20.
Extended thinking was disabled for Opus 4.8, so a thinking budget cannot explain
this; the model appears to have spent far longer per step before settling on a
conclusion that Sonnet reached more directly.

<ToolCallsChart />

Opus 4.8 was faster than Opus 4.6 on three of the five issues (the exceptions
are RHEL-114607 and RHEL-174694), which may suggest that the architecture
improvements between the two generations translate into more efficient
reasoning chains.

The stark differences between the numbers are caused not only by model
evolution but also by how non-deterministic this triage task is. We are also
not utilizing Opus 4.8's adaptive thinking.
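
The per-issue speed comparison can be checked mechanically. The sketch below is not part of the harness: it copies the durations from `DURATION_DATA` in the chart component and counts on how many issues each model was fastest, along with the summed duration per model. The names `fastestModel`, `wins`, and `totals` are made up for illustration.

```javascript
// Durations in seconds, copied from DURATION_DATA in this commit.
const DURATION_DATA = [
  { issue: "RHEL-15216", "Sonnet 4.6": 208, "Opus 4.6": 140, "Opus 4.8": 130 },
  { issue: "RHEL-112546", "Sonnet 4.6": 98, "Opus 4.6": 267, "Opus 4.8": 154 },
  { issue: "RHEL-114607", "Sonnet 4.6": 104, "Opus 4.6": 48, "Opus 4.8": 124 },
  { issue: "RHEL-177992", "Sonnet 4.6": 109, "Opus 4.6": 127, "Opus 4.8": 115 },
  { issue: "RHEL-174694", "Sonnet 4.6": 167, "Opus 4.6": 186, "Opus 4.8": 389 },
];
const MODELS = ["Sonnet 4.6", "Opus 4.6", "Opus 4.8"];

// Hypothetical helper: the model with the lowest duration in one row.
function fastestModel(row) {
  return MODELS.reduce((best, m) => (row[m] < row[best] ? m : best));
}

// Count how many issues each model finished first.
const wins = Object.fromEntries(MODELS.map((m) => [m, 0]));
for (const row of DURATION_DATA) wins[fastestModel(row)] += 1;

// Sum the durations per model across the five issues.
const totals = Object.fromEntries(
  MODELS.map((m) => [m, DURATION_DATA.reduce((sum, row) => sum + row[m], 0)])
);

console.log(wins); // { 'Sonnet 4.6': 3, 'Opus 4.6': 1, 'Opus 4.8': 1 }
console.log(totals); // { 'Sonnet 4.6': 686, 'Opus 4.6': 768, 'Opus 4.8': 912 }
```

With a single run per model, these counts are indicative only; the same helper could be applied to aggregated durations once multiple runs exist.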

### Token usage and cost

The token numbers reveal a notable pattern: Opus 4.8 uses more input tokens
than Opus 4.6 on four of the five issues, while its output stays within a
narrow band of roughly 4,400 to 5,900 tokens per issue. RHEL-112546 is the
exception on input and the extreme on output: Opus 4.6 read more context there
and produced 11,268 output tokens versus Opus 4.8's 5,864.
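
To put the per-issue token counts in perspective, the following sketch sums them per model. It is illustrative only: the numbers are copied from `INPUT_TOKENS_DATA` and `OUTPUT_TOKENS_DATA` in the chart component, and `sumPerModel` is a hypothetical helper name.

```javascript
// Per-issue token counts copied from the chart data in this commit, in the
// order RHEL-15216, RHEL-112546, RHEL-114607, RHEL-177992, RHEL-174694.
const INPUT_TOKENS = {
  "Sonnet 4.6": [1236132, 395277, 421587, 514430, 1307129],
  "Opus 4.6": [1099444, 1974683, 239754, 933522, 805407],
  "Opus 4.8": [1421854, 1440652, 1045918, 1113364, 1244698],
};
const OUTPUT_TOKENS = {
  "Sonnet 4.6": [10376, 4646, 5159, 4829, 8553],
  "Opus 4.6": [4820, 11268, 1680, 4603, 3635],
  "Opus 4.8": [4376, 5864, 4630, 4488, 4507],
};

// Hypothetical helper: sum each model's per-issue counts.
function sumPerModel(table) {
  return Object.fromEntries(
    Object.entries(table).map(([model, counts]) => [
      model,
      counts.reduce((sum, n) => sum + n, 0),
    ])
  );
}

const inputTotals = sumPerModel(INPUT_TOKENS);
const outputTotals = sumPerModel(OUTPUT_TOKENS);

console.log(inputTotals);
// { 'Sonnet 4.6': 3874555, 'Opus 4.6': 5052810, 'Opus 4.8': 6266486 }
console.log(outputTotals);
// { 'Sonnet 4.6': 33563, 'Opus 4.6': 26006, 'Opus 4.8': 23865 }
```

Input tokens dominate by two orders of magnitude for every model, so they are the main driver of any token-based cost.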

<InputTokensChart />

<OutputTokensChart />

Cost differences are significant, but the concrete numbers depend heavily on
the actual pricing plan.
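
Because absolute prices vary by plan, ratios are easier to compare than dollar amounts. As a rough sketch, the snippet below expresses each model's cost as a multiple of Sonnet 4.6, using the values from `COST_DATA` in the chart component; `relativeCost` is a hypothetical helper, not part of the repository.

```javascript
// Cost per model in USD, copied from COST_DATA in this commit.
const COST_USD = { "Sonnet 4.6": 12, "Opus 4.6": 26, "Opus 4.8": 96 };

// Hypothetical helper: express every cost as a multiple of a baseline model.
function relativeCost(costs, baseline) {
  return Object.fromEntries(
    Object.entries(costs).map(([model, usd]) => [
      model,
      (usd / costs[baseline]).toFixed(1) + "x",
    ])
  );
}

const ratios = relativeCost(COST_USD, "Sonnet 4.6");
console.log(ratios);
// { 'Sonnet 4.6': '1.0x', 'Opus 4.6': '2.2x', 'Opus 4.8': '8.0x' }
```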

<CostChart />

### Takeaway

For the triage workload we tested, Sonnet 4.6 offers the best price-performance
ratio by a wide margin. The Opus models invested more in investigation on hard
issues but did not produce different conclusions. Opus 4.8 was faster than
Opus 4.6 on three of the five issues, but that does not close the cost gap.

On the other hand, this is only an evaluation harness, so the real judgement
has to come from our day-to-day work processing real issues.

None of this analysis would be possible without the incredible work of the
whole team and especially [Tomas Korbar](https://github.com/TomasKorbar), who
authored the E2E test suite, [Ondrej Pohorelsky](https://github.com/opohorel),
who contributed the initial support for Opus 4.8 to
[ai-workflows](https://github.com/packit/ai-workflows), [Nikola
Forro](https://github.com/nforro), the author of our minimal trace-server,
[Laura Barcziova](https://github.com/lbarcziova), whose scripts and Claude
skills I used for this research, [Matej Focko](https://github.com/mfocko) for
consulting with me throughout, and [Maja
Massarini](https://github.com/majamassarini) for the polished Makefile &
compose setup that carried the test runs.
Lines changed: 262 additions & 0 deletions
@@ -0,0 +1,262 @@
import React, { useState, useEffect } from "react";
import {
  BarChart,
  Bar,
  XAxis,
  YAxis,
  CartesianGrid,
  Tooltip,
  Legend,
  ResponsiveContainer,
} from "recharts";

// Cost per model in USD, as plotted by CostChart.
const COST_DATA = [
  { model: "Sonnet 4.6", "Cost (USD)": 12 },
  { model: "Opus 4.6", "Cost (USD)": 26 },
  { model: "Opus 4.8", "Cost (USD)": 96 },
];

// Input tokens per issue and model.
const INPUT_TOKENS_DATA = [
  {
    issue: "RHEL-15216",
    "Sonnet 4.6": 1236132,
    "Opus 4.6": 1099444,
    "Opus 4.8": 1421854,
  },
  {
    issue: "RHEL-112546",
    "Sonnet 4.6": 395277,
    "Opus 4.6": 1974683,
    "Opus 4.8": 1440652,
  },
  {
    issue: "RHEL-114607",
    "Sonnet 4.6": 421587,
    "Opus 4.6": 239754,
    "Opus 4.8": 1045918,
  },
  {
    issue: "RHEL-177992",
    "Sonnet 4.6": 514430,
    "Opus 4.6": 933522,
    "Opus 4.8": 1113364,
  },
  {
    issue: "RHEL-174694",
    "Sonnet 4.6": 1307129,
    "Opus 4.6": 805407,
    "Opus 4.8": 1244698,
  },
];

// Output tokens per issue and model.
const OUTPUT_TOKENS_DATA = [
  {
    issue: "RHEL-15216",
    "Sonnet 4.6": 10376,
    "Opus 4.6": 4820,
    "Opus 4.8": 4376,
  },
  {
    issue: "RHEL-112546",
    "Sonnet 4.6": 4646,
    "Opus 4.6": 11268,
    "Opus 4.8": 5864,
  },
  {
    issue: "RHEL-114607",
    "Sonnet 4.6": 5159,
    "Opus 4.6": 1680,
    "Opus 4.8": 4630,
  },
  {
    issue: "RHEL-177992",
    "Sonnet 4.6": 4829,
    "Opus 4.6": 4603,
    "Opus 4.8": 4488,
  },
  {
    issue: "RHEL-174694",
    "Sonnet 4.6": 8553,
    "Opus 4.6": 3635,
    "Opus 4.8": 4507,
  },
];

// Tool call counts per issue and model.
const TOOL_CALLS_DATA = [
  { issue: "RHEL-15216", "Sonnet 4.6": 26, "Opus 4.6": 24, "Opus 4.8": 26 },
  { issue: "RHEL-112546", "Sonnet 4.6": 9, "Opus 4.6": 37, "Opus 4.8": 25 },
  { issue: "RHEL-114607", "Sonnet 4.6": 10, "Opus 4.6": 6, "Opus 4.8": 20 },
  { issue: "RHEL-177992", "Sonnet 4.6": 12, "Opus 4.6": 21, "Opus 4.8": 21 },
  { issue: "RHEL-174694", "Sonnet 4.6": 20, "Opus 4.6": 13, "Opus 4.8": 17 },
];

// Wall-clock duration in seconds per issue and model.
const DURATION_DATA = [
  { issue: "RHEL-15216", "Sonnet 4.6": 208, "Opus 4.6": 140, "Opus 4.8": 130 },
  { issue: "RHEL-112546", "Sonnet 4.6": 98, "Opus 4.6": 267, "Opus 4.8": 154 },
  { issue: "RHEL-114607", "Sonnet 4.6": 104, "Opus 4.6": 48, "Opus 4.8": 124 },
  { issue: "RHEL-177992", "Sonnet 4.6": 109, "Opus 4.6": 127, "Opus 4.8": 115 },
  { issue: "RHEL-174694", "Sonnet 4.6": 167, "Opus 4.6": 186, "Opus 4.8": 389 },
];

// Axis tick formatter: millions as "1.2M", thousands as "395k", else unchanged.
const fmtM = (v) =>
  v >= 1e6
    ? `${(v / 1e6).toFixed(1)}M`
    : v >= 1e3
    ? `${(v / 1e3).toFixed(0)}k`
    : v;

// Grouped bar chart of input tokens; renders nothing until mounted on the client.
export function InputTokensChart() {
  const [isClient, setIsClient] = useState(false);
  useEffect(() => {
    setIsClient(true);
  }, []);

  if (!isClient) return null;

  return (
    <ResponsiveContainer width="100%" height={350}>
      <BarChart
        data={INPUT_TOKENS_DATA}
        margin={{ top: 5, right: 30, left: 40, bottom: 5 }}
      >
        <CartesianGrid strokeDasharray="3 3" />
        <XAxis dataKey="issue" tick={{ fontSize: 12 }} />
        <YAxis
          tickFormatter={fmtM}
          label={{
            value: "input tokens",
            angle: -90,
            position: "insideLeft",
            offset: -5,
          }}
        />
        <Tooltip formatter={(value) => [value.toLocaleString()]} />
        <Legend />
        <Bar dataKey="Sonnet 4.6" fill="#4e79a7" />
        <Bar dataKey="Opus 4.6" fill="#f28e2b" />
        <Bar dataKey="Opus 4.8" fill="#e15759" />
      </BarChart>
    </ResponsiveContainer>
  );
}

// Grouped bar chart of output tokens; renders nothing until mounted on the client.
export function OutputTokensChart() {
  const [isClient, setIsClient] = useState(false);
  useEffect(() => {
    setIsClient(true);
  }, []);

  if (!isClient) return null;

  return (
    <ResponsiveContainer width="100%" height={350}>
      <BarChart
        data={OUTPUT_TOKENS_DATA}
        margin={{ top: 5, right: 30, left: 40, bottom: 5 }}
      >
        <CartesianGrid strokeDasharray="3 3" />
        <XAxis dataKey="issue" tick={{ fontSize: 12 }} />
        <YAxis
          tickFormatter={fmtM}
          label={{
            value: "output tokens",
            angle: -90,
            position: "insideLeft",
            offset: -5,
          }}
        />
        <Tooltip formatter={(value) => [value.toLocaleString()]} />
        <Legend />
        <Bar dataKey="Sonnet 4.6" fill="#4e79a7" />
        <Bar dataKey="Opus 4.6" fill="#f28e2b" />
        <Bar dataKey="Opus 4.8" fill="#e15759" />
      </BarChart>
    </ResponsiveContainer>
  );
}

// Grouped bar chart of tool calls; renders nothing until mounted on the client.
export function ToolCallsChart() {
  const [isClient, setIsClient] = useState(false);
  useEffect(() => {
    setIsClient(true);
  }, []);

  if (!isClient) return null;

  return (
    <ResponsiveContainer width="100%" height={350}>
      <BarChart
        data={TOOL_CALLS_DATA}
        margin={{ top: 5, right: 30, left: 20, bottom: 5 }}
      >
        <CartesianGrid strokeDasharray="3 3" />
        <XAxis dataKey="issue" tick={{ fontSize: 12 }} />
        <YAxis
          label={{
            value: "tool calls",
            angle: -90,
            position: "insideLeft",
            offset: -5,
          }}
        />
        <Tooltip />
        <Legend />
        <Bar dataKey="Sonnet 4.6" fill="#4e79a7" />
        <Bar dataKey="Opus 4.6" fill="#f28e2b" />
        <Bar dataKey="Opus 4.8" fill="#e15759" />
      </BarChart>
    </ResponsiveContainer>
  );
}

// Horizontal bar chart of cost per model; renders nothing until mounted on the client.
export function CostChart() {
  const [isClient, setIsClient] = useState(false);
  useEffect(() => {
    setIsClient(true);
  }, []);

  if (!isClient) return null;

  return (
    <ResponsiveContainer width="100%" height={220}>
      <BarChart
        data={COST_DATA}
        layout="vertical"
        margin={{ top: 5, right: 40, left: 10, bottom: 5 }}
      >
        <CartesianGrid strokeDasharray="3 3" />
        <XAxis type="number" tickFormatter={(v) => `$${v}`} />
        <YAxis type="category" dataKey="model" width={80} />
        <Tooltip formatter={(value) => [`$${value}`]} />
        <Bar dataKey="Cost (USD)" fill="#4e79a7" />
      </BarChart>
    </ResponsiveContainer>
  );
}

// Grouped bar chart of durations in seconds; renders nothing until mounted on the client.
export function DurationChart() {
  const [isClient, setIsClient] = useState(false);
  useEffect(() => {
    setIsClient(true);
  }, []);

  if (!isClient) return null;

  return (
    <ResponsiveContainer width="100%" height={350}>
      <BarChart
        data={DURATION_DATA}
        margin={{ top: 5, right: 30, left: 20, bottom: 5 }}
      >
        <CartesianGrid strokeDasharray="3 3" />
        <XAxis dataKey="issue" tick={{ fontSize: 12 }} />
        <YAxis unit="s" />
        <Tooltip formatter={(value) => [`${value}s`]} />
        <Legend />
        <Bar dataKey="Sonnet 4.6" fill="#4e79a7" />
        <Bar dataKey="Opus 4.6" fill="#f28e2b" />
        <Bar dataKey="Opus 4.8" fill="#e15759" />
      </BarChart>
    </ResponsiveContainer>
  );
}
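
For reference, here is a quick standalone check of how the `fmtM` tick formatter from this file labels values of different magnitudes. The formatter is copied verbatim; the sample inputs are token and tool-call counts taken from the data arrays above.

```javascript
// Same formatter as in the component file above.
const fmtM = (v) =>
  v >= 1e6
    ? `${(v / 1e6).toFixed(1)}M`
    : v >= 1e3
    ? `${(v / 1e3).toFixed(0)}k`
    : v;

console.log(fmtM(1236132)); // "1.2M"
console.log(fmtM(395277)); // "395k"
console.log(fmtM(11268)); // "11k"
console.log(fmtM(26)); // 26 (values below 1000 are returned unchanged, as numbers)
```

Note that the formatter returns a string for large values and the original number for small ones, which works for chart tick labels but is worth keeping in mind if the helper is reused elsewhere.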
