feat: evaluation results for 10 models (Claude/GPT/Mistral families) #361

Merged
rdmueller merged 1 commit into LLM-Coding:main from raifdmueller:feat/mistral-models-evaluation
Mar 26, 2026

Conversation

Contributor

@raifdmueller raifdmueller commented Mar 26, 2026

Summary

Extended evaluation from 6 to 10 models. 193 questions × 10 models.

Leaderboard

| Model | Score* | Fails | Type |
|---|---|---|---|
| claude-opus-4-6 | 99% | 0 | Anthropic flagship |
| gpt-5.4-2026-03-05 | 99% | 0 | OpenAI flagship |
| claude-sonnet-4-20250514 | 99% | 2 | Anthropic mid-tier |
| claude-haiku-4-5-20251001 | 98% | 9 | Anthropic small |
| gpt-4o | 97% | 13 | OpenAI (older gen) |
| gpt-5.4-mini-2026-03-17 | 97% | 13 | OpenAI small |
| mistral-large-2512 | 96% | 18 | Mistral flagship |
| devstral-2512 | 96% | 18 | Mistral code-specialized |
| mistral-medium-2508 | 85% | 77 | Mistral mid-tier |
| mistral-small-2603 | 74% | 116 | Mistral small |

* Excluding sanity check.

Key Findings

  • Opus and GPT-5.4 both achieve 0 fails, even though Claude generated the questions, so the results show no sign of Claude bias.
  • Claude's tier drop is minimal (Opus 99% → Haiku 98%), while Mistral's is large (Large 96% → Small 74%).
  • Devstral (code model) matches Mistral Large, which suggests that code training includes SE methodology knowledge.
  • 39 questions score 100% across all 10 models (Clean Architecture, Cynefin, DDD, Five Whys, etc.).
  • Problematic anchors: PRD (65%), LASR (75%), EARS (80%). These are likely niche topics that are underrepresented in training data.
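
The tier-drop comparison above is plain arithmetic on the leaderboard scores. A minimal sketch, using only the percentages quoted in this PR (the `tier_drop` helper is a hypothetical name for illustration, not part of the repository):

```python
# Scores in percent, copied from the leaderboard in this PR description.
scores = {
    "claude-opus-4-6": 99,
    "claude-haiku-4-5-20251001": 98,
    "mistral-large-2512": 96,
    "mistral-small-2603": 74,
}


def tier_drop(flagship: str, small: str) -> int:
    """Return the score drop from flagship to small model, in percentage points."""
    return scores[flagship] - scores[small]


claude_drop = tier_drop("claude-opus-4-6", "claude-haiku-4-5-20251001")
mistral_drop = tier_drop("mistral-large-2512", "mistral-small-2603")
print(f"Claude drop: {claude_drop} points, Mistral drop: {mistral_drop} points")
```

On these numbers the Claude family loses 1 point between flagship and small model, while the Mistral family loses 22.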

New in this PR

  • 4 new model summaries (Opus, Haiku, GPT-5.4, GPT-5.4 Mini)
  • --claude-model parameter for flexible Claude model selection
  • Report with 10-model heatmap
  • Exact model IDs throughout (no more -latest aliases)
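
For readers who have not opened the diff, a `--claude-model` option could be wired up roughly as follows. This is only a sketch under assumptions: the actual argument handling in `evaluations/pilot.py` is not shown on this page, and `DEFAULT_CLAUDE_MODEL` and `build_parser` are hypothetical names.

```python
import argparse

# Assumption: the real default in evaluations/pilot.py is not visible in this PR text.
DEFAULT_CLAUDE_MODEL = "claude-sonnet-4-20250514"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts an exact Claude model ID."""
    parser = argparse.ArgumentParser(description="Evaluation pilot (sketch)")
    parser.add_argument(
        "--claude-model",
        default=DEFAULT_CLAUDE_MODEL,
        help="Exact Claude model ID to evaluate (no -latest aliases)",
    )
    return parser


# argparse exposes --claude-model as the attribute claude_model.
args = build_parser().parse_args(["--claude-model", "claude-opus-4-6"])
print(args.claude_model)
```

Passing an exact ID such as `claude-opus-4-6` keeps each stored summary tied to one specific model version.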

Part of EPIC #329.

Test plan

  • All 10 models complete (193 questions each)
  • Report generated with 10 models
  • Summaries stored (17KB each, not full results)

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for configuring Claude models via the command line
    • Improved model management for more flexible model selection in the Claude backend
  • Other

    • Updated evaluation results for various model configurations

…d GPT-5.4

New models tested (193 questions each):
- claude-opus-4-6: 99%, 0 fails (tied #1 with GPT-5.4)
- claude-haiku-4-5-20251001: 98%, 9 fails
- gpt-5.4-2026-03-05: 99%, 0 fails (tied #1 with Opus)
- gpt-5.4-mini-2026-03-17: 97%, 13 fails

Full leaderboard (10 models):
  99% claude-opus-4-6, gpt-5.4-2026-03-05 (0 fails each)
  99% claude-sonnet-4-20250514 (2 fails)
  98% claude-haiku-4-5-20251001 (9 fails)
  97% gpt-4o, gpt-5.4-mini-2026-03-17 (13 fails each)
  96% mistral-large-2512, devstral-2512 (18 fails each)
  85% mistral-medium-2508 (77 fails)
  74% mistral-small-2603 (116 fails)

Key finding: Claude Opus and GPT-5.4 both achieve 0 fails — despite
Claude having generated the questions and distractors. No Claude bias.

Added --claude-model parameter for flexible Claude model selection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Mar 26, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 764f9d69-6178-424f-aa4b-b81d34d49034

📥 Commits

Reviewing files that changed from the base of the PR and between 0e0a749 and 1fd7632.

📒 Files selected for processing (11)
  • evaluations/generate-report.py
  • evaluations/pilot.py
  • evaluations/report.html
  • evaluations/summaries/pilot-20260326-093311_gpt-5.4-2026-03-05.json
  • evaluations/summaries/pilot-20260326-093311_gpt-5.4-mini-2026-03-17.json
  • evaluations/summaries/pilot-20260326-093514_gpt-5.4-2026-03-05.json
  • evaluations/summaries/pilot-20260326-095239_gpt-5.4-mini-2026-03-17.json
  • evaluations/summaries/pilot-20260326-100007_claude-opus-4-6.json
  • evaluations/summaries/pilot-20260326-104417_claude-haiku-4-5-20251001.json
  • evaluations/summaries/pilot-20260326-110102_gpt-5.4-2026-03-05.json
  • website/public/evaluation-report.html

Walkthrough

This pull request updates the evaluation infrastructure to support user-defined Claude model IDs. It introduces a factory function for Claude API calls, adds a CLI parameter for the Claude model, and updates the model display logic to prefer configured model IDs. In addition, new evaluation results are documented as JSON files.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Claude API configuration: evaluations/pilot.py | Introduced the factory function make_claude_api_caller(claude_model) to bind Claude models at runtime; extended run_pilot(...) with a claude_model parameter; added CLI support for --claude-model; updated model dispatch to use the new factory. |
| Model display logic: evaluations/generate-report.py | Updated get_model_display(backend, config) to prefer configured Claude model IDs (config["claude_model"]) over default values. |
| Evaluation results: evaluations/summaries/pilot-20260326-*.json | Added eight new JSON summary files with evaluation results for various models (OpenAI gpt-5.4 variants, Claude Opus and Haiku), each with configuration, scores, and runtime metrics. |
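
The two code changes summarized above follow a common pattern: a closure that binds the model ID once, and a display helper that prefers the configured ID. The sketch below illustrates that pattern only. The function bodies, the `DEFAULT_DISPLAY` table, and the returned payload are assumptions, since the actual implementations in `evaluations/pilot.py` and `evaluations/generate-report.py` are not shown on this page.

```python
from typing import Callable

# Hypothetical fallback names; the real defaults are not shown in this PR text.
DEFAULT_DISPLAY = {"claude": "claude (default)", "openai": "openai (default)"}


def make_claude_api_caller(claude_model: str) -> Callable[[str], dict]:
    """Return a caller with the Claude model ID bound at creation time."""

    def call(prompt: str) -> dict:
        # Stand-in for the real API request: return the payload that would be sent.
        return {"model": claude_model, "prompt": prompt}

    return call


def get_model_display(backend: str, config: dict) -> str:
    """Prefer the configured Claude model ID over the default display name."""
    if backend == "claude" and config.get("claude_model"):
        return config["claude_model"]
    return DEFAULT_DISPLAY.get(backend, backend)


caller = make_claude_api_caller("claude-haiku-4-5-20251001")
payload = caller("What is the Cynefin framework?")
label = get_model_display("claude", {"claude_model": "claude-haiku-4-5-20251001"})
print(payload["model"], label)
```

Binding the model in a factory means the dispatch code can treat every backend as a plain callable, and the report shows the exact ID that was evaluated.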

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


@rdmueller rdmueller merged commit 5c3c065 into LLM-Coding:main Mar 26, 2026
4 of 6 checks passed