Add blog post: Comparing frontier Claude models (June 2026) (#1144)

TomasTomecek · web-flow · commit 9f9a4066567a · 2026-06-09T13:51:53.000+02:00
## Summary

- New blog post comparing Sonnet 4.6, Opus 4.6, and Opus 4.8 on a fixed set of 5 RHEL triage issues using the e2e test harness
- Adds recharts-based interactive charts (duration, tool calls, input tokens, output tokens, cost)
- Introduces `src/components/ModelEvalCharts/` as the first reusable chart component in the repo
- Post converted to `.mdx` to support JSX chart components

## Test plan

- [x] Charts render correctly in browser (verified locally via `npm start`)
- [x] All 60 data points in charts verified against original raw tables
- [x] All GitHub profile links in acknowledgments verified (HTTP 200)

Example:

<img width="2510" height="1761" alt="image" src="https://github.com/user-attachments/assets/11e9b595-6a78-4b32-a72f-303640633710" />

🤖 Generated with [Claude Code](https://claude.com/claude-code)
diff --git a/package.json b/package.json
@@ -23,7 +23,8 @@
     "clsx": "^1.2.1",
     "prism-react-renderer": "^1.3.5",
     "react": "^17.0.2",
-    "react-dom": "^17.0.2"
+    "react-dom": "^17.0.2",
+    "recharts": "^2.15.4"
   },
   "devDependencies": {
     "@docusaurus/module-type-aliases": "2.4.1",
diff --git a/posts/comparing-frontier-models-june2026/index.mdx b/posts/comparing-frontier-models-june2026/index.mdx
@@ -0,0 +1,118 @@
+---
+title: Comparing frontier Claude models for our AI workflows (June 2026)
+authors: ttomecek
+tags:
+  - AI
+  - automation
+  - development
+  - eval
+---
+
+import {
+  DurationChart,
+  ToolCallsChart,
+  InputTokensChart,
+  OutputTokensChart,
+  CostChart,
+} from "@site/src/components/ModelEvalCharts";
+
+We evaluated three Claude frontier models (Sonnet 4.6, Opus 4.6, and Opus 4.8)
+on a fixed set of five RHEL triage issues using our [end-to-end
+test
+harness](https://github.com/packit/ai-workflows/blob/main/e2e-ci-setup.md). The
+harness validates our triage agent against these known issues. Triage is the
+first and most consequential step in the workflow: the agent reads a Jira CVE
+issue and decides how it should be resolved, choosing between backporting an
+upstream patch, rebasing to a newer version, or marking the issue as
+not-affected. A wrong call here invalidates everything downstream, which is why
+it's the natural starting point for a model benchmark.
+
+The 4.6 models ran with `REASONING_EFFORT=high`, which enables native extended
+thinking. Opus 4.8 ran without it — a LiteLLM/BeeAI provider ID mismatch caused
+the model to reject the thinking parameter, so we disabled it entirely
+([tracked here](https://github.com/i-am-bee/beeai-framework/issues/1463)). The harness runs all five issues concurrently and captures per-issue
+metrics from the agent framework: wall-clock duration, tool call count, and
+total token usage.
+
+This is the first time we've done this kind of evaluation, and it was quite a
+learning experience. This blog post contains data from a single run per model.
+
+I already have ideas for improving a future run. We need to extend the scope
+to the backporting agent as well and cover at least ten cases. Doing multiple
+runs per model with aggregation would also give us more grounded results.
+
+<!-- truncate -->
+
+## Analysis
+
+### Resolutions
+
+All three models reached identical conclusions on all five issues. The triage
+decisions were unambiguous: three backports (RHEL-15216, RHEL-112546,
+RHEL-177992) and two not-affected (RHEL-114607, RHEL-174694). RHEL-114607 is
+worth highlighting — the issue concerns CVE-2025-59375 in expat, which only
+affects versions before 2.7.2. RHEL 10 already ships 2.7.3, so the models
+correctly concluded not-affected rather than recommending a rebase to 2.7.5.
+
+### Speed
+
+Sonnet 4.6 was the fastest model on three of the five issues, sometimes by a
+large margin. RHEL-112546 shows the biggest relative spread: Sonnet resolved it
+in 98s with just 9 tool calls, while Opus 4.6 took 267s and 37 tool calls, and
+Opus 4.8 took 154s and 25 tool calls. Both Opus models invested significantly
+more effort on this libtiff CVE, exploring more patch URLs and performing
+deeper code analysis.
+
+<DurationChart />
+
+RHEL-174694 shows the opposite pattern: Opus 4.8 took 389s with 17 tool calls,
+more than twice Sonnet's 167s, despite making fewer tool calls (Sonnet made 20).
+Extended thinking was disabled for Opus 4.8, so a thinking budget cannot explain
+this; the model averaged roughly 23s per tool call versus Sonnet's 8s.
+
+<ToolCallsChart />
+
+Opus 4.8 was faster than Opus 4.6 on three of the five issues (all except
+RHEL-114607 and RHEL-174694), which may point to more efficient reasoning
+chains in the newer generation, though one run per model cannot confirm that.
+
+The stark differences between the numbers are caused not only by model
+evolution but also by how non-deterministic this triage task is. We are also
+not using Opus 4.8's adaptive thinking.
+
+### Token usage and cost
+
+The token numbers show a mixed picture: Opus 4.8 used more input tokens than
+Opus 4.6 on four of the five issues and fewer output tokens on three. The gap is
+largest on RHEL-112546, where Opus 4.6 produced 11,268 output tokens versus
+Opus 4.8's 5,864 while also reading more input (1.97M versus 1.44M tokens).
+
+<InputTokensChart />
+
+<OutputTokensChart />
+
+Cost differences are significant, but the concrete numbers depend heavily on
+the actual plan.
+
+<CostChart />
+
+### Takeaway
+
+For the triage workload we tested, Sonnet 4.6 offers the best price-performance
+ratio by a wide margin. The Opus models invested more in investigation on hard
+issues but did not produce different conclusions. Opus 4.8 beat Opus 4.6 on
+speed for three of the five issues, but that does not close the cost gap.
+
+That said, this is an evaluation harness, so the real judgement has to come
+from our day-to-day work processing real issues.
+
+None of this analysis would be possible without the incredible work of the
+whole team and especially [Tomas Korbar](https://github.com/TomasKorbar) who
+authored the E2E test suite, [Ondrej Pohorelsky](https://github.com/opohorel)
+who contributed the initial support for Opus 4.8 to
+[ai-workflows](https://github.com/packit/ai-workflows), [Nikola
+Forro](https://github.com/nforro), author of our minimal trace-server, [Laura
+Barcziova](https://github.com/lbarcziova) whose scripts and Claude skills I
+used for this research, [Matej Focko](https://github.com/mfocko) for consulting
+with me all the time, and [Maja Massarini](https://github.com/majamassarini)
+for the polished Makefile & compose setup that carried the test runs.
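Review note (not part of the diff): the per-issue speed statements in the post can be recomputed from the duration figures this PR adds as `DURATION_DATA`. The snippet is a minimal plain-JavaScript sketch. The helper names `lowestPerIssue`, `countWins`, and `totalsByModel` are hypothetical and introduced only for illustration; the numbers are copied from the component's data array.

```javascript
// Durations in seconds, copied from DURATION_DATA in this PR.
const DURATION_DATA = [
  { issue: "RHEL-15216", "Sonnet 4.6": 208, "Opus 4.6": 140, "Opus 4.8": 130 },
  { issue: "RHEL-112546", "Sonnet 4.6": 98, "Opus 4.6": 267, "Opus 4.8": 154 },
  { issue: "RHEL-114607", "Sonnet 4.6": 104, "Opus 4.6": 48, "Opus 4.8": 124 },
  { issue: "RHEL-177992", "Sonnet 4.6": 109, "Opus 4.6": 127, "Opus 4.8": 115 },
  { issue: "RHEL-174694", "Sonnet 4.6": 167, "Opus 4.6": 186, "Opus 4.8": 389 },
];

// Hypothetical helper: for each row, pick the model with the smallest value.
// Ties go to the model listed first in the row.
function lowestPerIssue(rows) {
  return rows.map(({ issue, ...byModel }) => {
    const [model, value] = Object.entries(byModel).reduce((best, cur) =>
      cur[1] < best[1] ? cur : best
    );
    return { issue, model, value };
  });
}

// Hypothetical helper: count how many issues each model came out lowest on.
function countWins(winners) {
  const wins = {};
  for (const { model } of winners) {
    wins[model] = (wins[model] ?? 0) + 1;
  }
  return wins;
}

// Hypothetical helper: sum each model's values across all issues.
function totalsByModel(rows) {
  const totals = {};
  for (const { issue, ...byModel } of rows) {
    for (const [model, value] of Object.entries(byModel)) {
      totals[model] = (totals[model] ?? 0) + value;
    }
  }
  return totals;
}

const fastest = lowestPerIssue(DURATION_DATA);
const wins = countWins(fastest);
const totals = totalsByModel(DURATION_DATA);

console.log(fastest); // one { issue, model, value } entry per issue
console.log(wins);    // Sonnet 4.6: 3, Opus 4.6: 1, Opus 4.8: 1
console.log(totals);  // Sonnet 4.6: 686, Opus 4.6: 768, Opus 4.8: 912
```

The same helpers should work on `TOOL_CALLS_DATA` and the token arrays, since they share the same row shape. With one run per model, these counts describe this run only.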
diff --git a/src/components/ModelEvalCharts/index.jsx b/src/components/ModelEvalCharts/index.jsx
@@ -0,0 +1,262 @@
+import React, { useState, useEffect } from "react";
+import {
+  BarChart,
+  Bar,
+  XAxis,
+  YAxis,
+  CartesianGrid,
+  Tooltip,
+  Legend,
+  ResponsiveContainer,
+} from "recharts";
+
+const COST_DATA = [
+  { model: "Sonnet 4.6", "Cost (USD)": 12 },
+  { model: "Opus 4.6", "Cost (USD)": 26 },
+  { model: "Opus 4.8", "Cost (USD)": 96 },
+];
+
+const INPUT_TOKENS_DATA = [
+  {
+    issue: "RHEL-15216",
+    "Sonnet 4.6": 1236132,
+    "Opus 4.6": 1099444,
+    "Opus 4.8": 1421854,
+  },
+  {
+    issue: "RHEL-112546",
+    "Sonnet 4.6": 395277,
+    "Opus 4.6": 1974683,
+    "Opus 4.8": 1440652,
+  },
+  {
+    issue: "RHEL-114607",
+    "Sonnet 4.6": 421587,
+    "Opus 4.6": 239754,
+    "Opus 4.8": 1045918,
+  },
+  {
+    issue: "RHEL-177992",
+    "Sonnet 4.6": 514430,
+    "Opus 4.6": 933522,
+    "Opus 4.8": 1113364,
+  },
+  {
+    issue: "RHEL-174694",
+    "Sonnet 4.6": 1307129,
+    "Opus 4.6": 805407,
+    "Opus 4.8": 1244698,
+  },
+];
+
+const OUTPUT_TOKENS_DATA = [
+  {
+    issue: "RHEL-15216",
+    "Sonnet 4.6": 10376,
+    "Opus 4.6": 4820,
+    "Opus 4.8": 4376,
+  },
+  {
+    issue: "RHEL-112546",
+    "Sonnet 4.6": 4646,
+    "Opus 4.6": 11268,
+    "Opus 4.8": 5864,
+  },
+  {
+    issue: "RHEL-114607",
+    "Sonnet 4.6": 5159,
+    "Opus 4.6": 1680,
+    "Opus 4.8": 4630,
+  },
+  {
+    issue: "RHEL-177992",
+    "Sonnet 4.6": 4829,
+    "Opus 4.6": 4603,
+    "Opus 4.8": 4488,
+  },
+  {
+    issue: "RHEL-174694",
+    "Sonnet 4.6": 8553,
+    "Opus 4.6": 3635,
+    "Opus 4.8": 4507,
+  },
+];
+
+const TOOL_CALLS_DATA = [
+  { issue: "RHEL-15216", "Sonnet 4.6": 26, "Opus 4.6": 24, "Opus 4.8": 26 },
+  { issue: "RHEL-112546", "Sonnet 4.6": 9, "Opus 4.6": 37, "Opus 4.8": 25 },
+  { issue: "RHEL-114607", "Sonnet 4.6": 10, "Opus 4.6": 6, "Opus 4.8": 20 },
+  { issue: "RHEL-177992", "Sonnet 4.6": 12, "Opus 4.6": 21, "Opus 4.8": 21 },
+  { issue: "RHEL-174694", "Sonnet 4.6": 20, "Opus 4.6": 13, "Opus 4.8": 17 },
+];
+
+const DURATION_DATA = [
+  { issue: "RHEL-15216", "Sonnet 4.6": 208, "Opus 4.6": 140, "Opus 4.8": 130 },
+  { issue: "RHEL-112546", "Sonnet 4.6": 98, "Opus 4.6": 267, "Opus 4.8": 154 },
+  { issue: "RHEL-114607", "Sonnet 4.6": 104, "Opus 4.6": 48, "Opus 4.8": 124 },
+  { issue: "RHEL-177992", "Sonnet 4.6": 109, "Opus 4.6": 127, "Opus 4.8": 115 },
+  { issue: "RHEL-174694", "Sonnet 4.6": 167, "Opus 4.6": 186, "Opus 4.8": 389 },
+];
+
+const fmtM = (v) =>
+  v >= 1e6
+    ? `${(v / 1e6).toFixed(1)}M`
+    : v >= 1e3
+      ? `${(v / 1e3).toFixed(0)}k`
+      : v;
+
+export function InputTokensChart() {
+  const [isClient, setIsClient] = useState(false);
+  useEffect(() => {
+    setIsClient(true);
+  }, []);
+
+  if (!isClient) return null;
+
+  return (
+    <ResponsiveContainer width="100%" height={350}>
+      <BarChart
+        data={INPUT_TOKENS_DATA}
+        margin={{ top: 5, right: 30, left: 40, bottom: 5 }}
+      >
+        <CartesianGrid strokeDasharray="3 3" />
+        <XAxis dataKey="issue" tick={{ fontSize: 12 }} />
+        <YAxis
+          tickFormatter={fmtM}
+          label={{
+            value: "input tokens",
+            angle: -90,
+            position: "insideLeft",
+            offset: -5,
+          }}
+        />
+        <Tooltip formatter={(value) => [value.toLocaleString()]} />
+        <Legend />
+        <Bar dataKey="Sonnet 4.6" fill="#4e79a7" />
+        <Bar dataKey="Opus 4.6" fill="#f28e2b" />
+        <Bar dataKey="Opus 4.8" fill="#e15759" />
+      </BarChart>
+    </ResponsiveContainer>
+  );
+}
+
+export function OutputTokensChart() {
+  const [isClient, setIsClient] = useState(false);
+  useEffect(() => {
+    setIsClient(true);
+  }, []);
+
+  if (!isClient) return null;
+
+  return (
+    <ResponsiveContainer width="100%" height={350}>
+      <BarChart
+        data={OUTPUT_TOKENS_DATA}
+        margin={{ top: 5, right: 30, left: 40, bottom: 5 }}
+      >
+        <CartesianGrid strokeDasharray="3 3" />
+        <XAxis dataKey="issue" tick={{ fontSize: 12 }} />
+        <YAxis
+          tickFormatter={fmtM}
+          label={{
+            value: "output tokens",
+            angle: -90,
+            position: "insideLeft",
+            offset: -5,
+          }}
+        />
+        <Tooltip formatter={(value) => [value.toLocaleString()]} />
+        <Legend />
+        <Bar dataKey="Sonnet 4.6" fill="#4e79a7" />
+        <Bar dataKey="Opus 4.6" fill="#f28e2b" />
+        <Bar dataKey="Opus 4.8" fill="#e15759" />
+      </BarChart>
+    </ResponsiveContainer>
+  );
+}
+
+export function ToolCallsChart() {
+  const [isClient, setIsClient] = useState(false);
+  useEffect(() => {
+    setIsClient(true);
+  }, []);
+
+  if (!isClient) return null;
+
+  return (
+    <ResponsiveContainer width="100%" height={350}>
+      <BarChart
+        data={TOOL_CALLS_DATA}
+        margin={{ top: 5, right: 30, left: 20, bottom: 5 }}
+      >
+        <CartesianGrid strokeDasharray="3 3" />
+        <XAxis dataKey="issue" tick={{ fontSize: 12 }} />
+        <YAxis
+          label={{
+            value: "tool calls",
+            angle: -90,
+            position: "insideLeft",
+            offset: -5,
+          }}
+        />
+        <Tooltip />
+        <Legend />
+        <Bar dataKey="Sonnet 4.6" fill="#4e79a7" />
+        <Bar dataKey="Opus 4.6" fill="#f28e2b" />
+        <Bar dataKey="Opus 4.8" fill="#e15759" />
+      </BarChart>
+    </ResponsiveContainer>
+  );
+}
+
+export function CostChart() {
+  const [isClient, setIsClient] = useState(false);
+  useEffect(() => {
+    setIsClient(true);
+  }, []);
+
+  if (!isClient) return null;
+
+  return (
+    <ResponsiveContainer width="100%" height={220}>
+      <BarChart
+        data={COST_DATA}
+        layout="vertical"
+        margin={{ top: 5, right: 40, left: 10, bottom: 5 }}
+      >
+        <CartesianGrid strokeDasharray="3 3" />
+        <XAxis type="number" tickFormatter={(v) => `$${v}`} />
+        <YAxis type="category" dataKey="model" width={80} />
+        <Tooltip formatter={(value) => [`$${value}`]} />
+        <Bar dataKey="Cost (USD)" fill="#4e79a7" />
+      </BarChart>
+    </ResponsiveContainer>
+  );
+}
+
+export function DurationChart() {
+  const [isClient, setIsClient] = useState(false);
+  useEffect(() => {
+    setIsClient(true);
+  }, []);
+
+  if (!isClient) return null;
+
+  return (
+    <ResponsiveContainer width="100%" height={350}>
+      <BarChart
+        data={DURATION_DATA}
+        margin={{ top: 5, right: 30, left: 20, bottom: 5 }}
+      >
+        <CartesianGrid strokeDasharray="3 3" />
+        <XAxis dataKey="issue" tick={{ fontSize: 12 }} />
+        <YAxis unit="s" />
+        <Tooltip formatter={(value) => [`${value}s`]} />
+        <Legend />
+        <Bar dataKey="Sonnet 4.6" fill="#4e79a7" />
+        <Bar dataKey="Opus 4.6" fill="#f28e2b" />
+        <Bar dataKey="Opus 4.8" fill="#e15759" />
+      </BarChart>
+    </ResponsiveContainer>
+  );
+}
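Review note (not part of the diff): the `fmtM` tick formatter used by the token charts can be exercised outside React. The sketch below copies `fmtM` from the component and runs it on a few values. Four samples come from the data arrays above; `999999` is a synthetic boundary value added here as an assumption, to show how rounding behaves just below one million.

```javascript
// Copied from src/components/ModelEvalCharts/index.jsx in this PR.
const fmtM = (v) =>
  v >= 1e6
    ? `${(v / 1e6).toFixed(1)}M`
    : v >= 1e3
      ? `${(v / 1e3).toFixed(0)}k`
      : v;

// 9, 4646, 395277 and 1236132 appear in the PR's data arrays;
// 999999 is a synthetic value chosen to probe the k/M boundary.
const samples = [9, 4646, 395277, 999999, 1236132];
const labels = samples.map(fmtM);

console.log(labels);
// → [ 9, '5k', '395k', '1000k', '1.2M' ]
```

Two properties are visible here: values below 1,000 are returned unchanged as numbers rather than strings, and values just under one million round up to `1000k` instead of `1.0M`. Neither should affect the charts in this PR if the axis ticks land on round values, but it may matter if the formatter is reused for tooltips or other data.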
diff --git a/yarn.lock b/yarn.lock