feat: evaluation results for 10 models (Claude/GPT/Mistral families) by raifdmueller · Pull Request #361 · LLM-Coding/Semantic-Anchors

raifdmueller · 2026-03-26T12:41:28Z

Summary

Extended evaluation from 6 to 10 models. 193 questions × 10 models.

Leaderboard

| Model | Score* | Fails | Type |
|---|---|---|---|
| claude-opus-4-6 | 99% | 0 | Anthropic flagship |
| gpt-5.4-2026-03-05 | 99% | 0 | OpenAI flagship |
| claude-sonnet-4-20250514 | 99% | 2 | Anthropic mid-tier |
| claude-haiku-4-5-20251001 | 98% | 9 | Anthropic small |
| gpt-4o | 97% | 13 | OpenAI (older gen) |
| gpt-5.4-mini-2026-03-17 | 97% | 13 | OpenAI small |
| mistral-large-2512 | 96% | 18 | Mistral flagship |
| devstral-2512 | 96% | 18 | Mistral code-specialized |
| mistral-medium-2508 | 85% | 77 | Mistral mid-tier |
| mistral-small-2603 | 74% | 116 | Mistral small |

\* Score excludes the sanity check.

Key Findings

Opus and GPT-5.4 both achieve 0 fails, even though Claude generated the questions. The results show no sign of a bias toward Claude.
Claude's tier drop is minimal (Opus 99% → Haiku 98%), while Mistral's is large (Large 96% → Small 74%).
Devstral (code model) matches Mistral Large, which suggests that code training includes SE methodology knowledge.
39 questions score 100% across all 10 models (Clean Architecture, Cynefin, DDD, Five Whys, etc.).
Problematic anchors: PRD (65%), LASR (75%), EARS (80%). These are likely niche topics that are underrepresented in training data.
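
The tier-drop comparison can be reproduced from the leaderboard numbers. The following is only a small arithmetic sketch; the `SCORES` dictionary and the `tier_drop` helper are hypothetical names introduced for illustration and are not part of the repository.

```python
# Scores in percent (excluding the sanity check), copied from the leaderboard.
# SCORES and tier_drop are illustrative names, not code from this PR.
SCORES = {
    "claude-opus-4-6": 99,
    "claude-haiku-4-5-20251001": 98,
    "mistral-large-2512": 96,
    "mistral-small-2603": 74,
}


def tier_drop(flagship: str, small: str, scores: dict = SCORES) -> int:
    """Return the flagship-to-small score drop in percentage points."""
    return scores[flagship] - scores[small]


claude_drop = tier_drop("claude-opus-4-6", "claude-haiku-4-5-20251001")
mistral_drop = tier_drop("mistral-large-2512", "mistral-small-2603")
print(f"Claude drop: {claude_drop} points, Mistral drop: {mistral_drop} points")
```

With the published scores this gives a drop of 1 percentage point for Claude and 22 for Mistral.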

New in this PR

4 new model summaries (Opus, Haiku, GPT-5.4, GPT-5.4 Mini)
--claude-model parameter for flexible Claude model selection
Report with 10-model heatmap
Exact model IDs throughout (no more -latest aliases)
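
For readers unfamiliar with how such a flag is typically wired up, here is a minimal sketch of a `--claude-model` option using `argparse`. It is not the code from `evaluations/pilot.py`: the `build_parser` function, the help text, and the default value are assumptions made for illustration.

```python
import argparse

# Assumed default for this sketch only; the PR does not state which
# model ID is used when the flag is omitted.
DEFAULT_CLAUDE_MODEL = "claude-sonnet-4-20250514"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing --claude-model (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Evaluation pilot (sketch)")
    parser.add_argument(
        "--claude-model",
        default=DEFAULT_CLAUDE_MODEL,
        help="Exact Claude model ID to evaluate, e.g. claude-opus-4-6",
    )
    return parser


# argparse converts the dash in --claude-model to an underscore.
args = build_parser().parse_args(["--claude-model", "claude-opus-4-6"])
print(args.claude_model)
```

Passing an exact ID rather than an alias keeps the stored summaries attributable to one specific model version.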

Part of EPIC #329.

Test plan

All 10 models complete (193 questions each)
Report generated with 10 models
Summaries stored (17KB each, not full results)

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Added support for configuring Claude models via the command line
- Improved model management for more flexible model selection in the Claude backend
Miscellaneous
- Updated evaluation results for various model configurations

…d GPT-5.4

New models tested (193 questions each):
- claude-opus-4-6: 99%, 0 fails (tied #1 with GPT-5.4)
- claude-haiku-4-5-20251001: 98%, 9 fails
- gpt-5.4-2026-03-05: 99%, 0 fails (tied #1 with Opus)
- gpt-5.4-mini-2026-03-17: 97%, 13 fails

Full leaderboard (10 models):
99% claude-opus-4-6, gpt-5.4-2026-03-05 (0 fails each)
99% claude-sonnet-4-20250514 (2 fails)
98% claude-haiku-4-5-20251001 (9 fails)
97% gpt-4o, gpt-5.4-mini-2026-03-17 (13 fails each)
96% mistral-large-2512, devstral-2512 (18 fails each)
85% mistral-medium-2508 (77 fails)
74% mistral-small-2603 (116 fails)

Key finding: Claude Opus and GPT-5.4 both achieve 0 fails, despite Claude having generated the questions and distractors. No Claude bias.

Added --claude-model parameter for flexible Claude model selection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-03-26T12:41:41Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 764f9d69-6178-424f-aa4b-b81d34d49034

📥 Commits

Reviewing files that changed from the base of the PR and between 0e0a749 and 1fd7632.

📒 Files selected for processing (11)

evaluations/generate-report.py
evaluations/pilot.py
evaluations/report.html
evaluations/summaries/pilot-20260326-093311_gpt-5.4-2026-03-05.json
evaluations/summaries/pilot-20260326-093311_gpt-5.4-mini-2026-03-17.json
evaluations/summaries/pilot-20260326-093514_gpt-5.4-2026-03-05.json
evaluations/summaries/pilot-20260326-095239_gpt-5.4-mini-2026-03-17.json
evaluations/summaries/pilot-20260326-100007_claude-opus-4-6.json
evaluations/summaries/pilot-20260326-104417_claude-haiku-4-5-20251001.json
evaluations/summaries/pilot-20260326-110102_gpt-5.4-2026-03-05.json
website/public/evaluation-report.html

Walkthrough

This pull request updates the evaluation infrastructure to support custom Claude model IDs. It introduces a factory function for the Claude API calls, adds a CLI parameter for the Claude model, and updates the model display logic to prefer configured model IDs. In addition, new evaluation results are documented as JSON files.
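
The factory pattern mentioned here can be sketched as a closure that binds the model ID once and returns a caller. This is only an illustration of the idea, not the implementation in `evaluations/pilot.py`: the `send` parameter and the `fake_send` stub are hypothetical stand-ins for the real API request, which the PR page does not show.

```python
from typing import Callable


def fake_send(model: str, prompt: str) -> str:
    # Stub transport (assumption) so the sketch runs without network
    # access or credentials; a real caller would contact the Claude API.
    return f"[{model}] {prompt}"


def make_claude_api_caller(
    claude_model: str,
    send: Callable[[str, str], str] = fake_send,
) -> Callable[[str], str]:
    """Return a caller with claude_model bound at creation time."""

    def call_claude(prompt: str) -> str:
        # claude_model is captured from the enclosing scope.
        return send(claude_model, prompt)

    return call_claude


opus = make_claude_api_caller("claude-opus-4-6")
haiku = make_claude_api_caller("claude-haiku-4-5-20251001")
print(opus("What is EARS?"))
print(haiku("What is EARS?"))
```

Because each returned caller keeps its own model ID, the dispatch code can treat every backend as a plain function that takes a prompt.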

Changes

| Cohort / File(s) | Summary |
|---|---|
| Claude API configuration `evaluations/pilot.py` | Introduces the factory function `make_claude_api_caller(claude_model)` to bind Claude models at runtime; extends `run_pilot(...)` with a `claude_model` parameter; adds CLI support for `--claude-model`; updates model dispatch to use the new factory. |
| Model display logic `evaluations/generate-report.py` | Updates `get_model_display(backend, config)` to prefer configured Claude model IDs (`config["claude_model"]`) over default values. |
| Evaluation results `evaluations/summaries/pilot-20260326-*.json` | Adds eight new JSON summary files with evaluation results for various models (OpenAI gpt-5.4 variants, Claude Opus and Haiku), each with configuration, scores, and runtime metrics. |
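
The display rule described for `get_model_display(backend, config)` could look roughly like the sketch below. Only the function name, its two parameters, and the `config["claude_model"]` key come from the review summary; the `DEFAULT_DISPLAY` mapping and its values are assumptions, since the real defaults are not shown on this page.

```python
# Hypothetical fallback display names per backend (assumption for this sketch).
DEFAULT_DISPLAY = {
    "claude": "claude (default)",
    "openai": "openai (default)",
}


def get_model_display(backend: str, config: dict) -> str:
    """Prefer a configured Claude model ID; otherwise use a default name."""
    if backend == "claude" and config.get("claude_model"):
        return config["claude_model"]
    # Unknown backends fall back to the backend name itself.
    return DEFAULT_DISPLAY.get(backend, backend)


print(get_model_display("claude", {"claude_model": "claude-opus-4-6"}))
print(get_model_display("claude", {}))
```

Using `config.get` keeps older summary files, which may lack the `claude_model` key, readable by the report generator.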

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

feat: evaluate 6 models including Mistral Small/Medium/Devstral #353: Modifies the same evaluation code paths. It updates the get_model_display logic in generate-report.py and pilot.py's handling and storage of exact Claude model IDs, including file name and summary behavior.

rdmueller merged commit 5c3c065 into LLM-Coding:main Mar 26, 2026
4 of 6 checks passed

raifdmueller mentioned this pull request Mar 26, 2026

Run full evaluation and publish results #337

Closed

11 tasks

rdmueller merged 1 commit into LLM-Coding:main from raifdmueller:feat/mistral-models-evaluation
