feat: evaluation results for 10 models (Claude/GPT/Mistral families)#361
Conversation
…d GPT-5.4 New models tested (193 questions each): - claude-opus-4-6: 99%, 0 fails (tied #1 with GPT-5.4) - claude-haiku-4-5-20251001: 98%, 9 fails - gpt-5.4-2026-03-05: 99%, 0 fails (tied #1 with Opus) - gpt-5.4-mini-2026-03-17: 97%, 13 fails Full leaderboard (10 models): 99% claude-opus-4-6, gpt-5.4-2026-03-05 (0 fails each) 99% claude-sonnet-4-20250514 (2 fails) 98% claude-haiku-4-5-20251001 (9 fails) 97% gpt-4o, gpt-5.4-mini-2026-03-17 (13 fails each) 96% mistral-large-2512, devstral-2512 (18 fails each) 85% mistral-medium-2508 (77 fails) 74% mistral-small-2603 (116 fails) Key finding: Claude Opus and GPT-5.4 both achieve 0 fails — despite Claude having generated the questions and distractors. No Claude bias. Added --claude-model parameter for flexible Claude model selection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (11)
WalkthroughDie Pull Request aktualisiert die Evaluations-Infrastruktur, um benutzerdefinierte Claude-Modell-IDs zu unterstützen. Es werden eine Factory-Funktion für die Claude-API-Aufrufe eingeführt, ein CLI-Parameter für das Claude-Modell hinzugefügt, und die Modell-Anzeigelogik wird aktualisiert, um konfigurierte Modell-IDs zu bevorzugen. Zusätzlich werden neue Evaluationsergebnisse als JSON-Dateien dokumentiert. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary
Extended evaluation from 6 to 10 models. 193 questions × 10 models.
Leaderboard
excluding sanity check
Key Findings
New in this PR
--claude-modelparameter for flexible Claude model selection-latestaliases)Part of EPIC #329.
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
Neue Funktionen
Sonstiges