feat: evaluate 6 models including Mistral Small/Medium/Devstral #353
Replace vague model names with exact API identifiers:
- mistral-large-2512 (not "Mistral Large")
- claude-sonnet-4-20250514 (not "Claude Sonnet")
- gpt-4o (not "GPT")

Add Mistral Small 4 (mistral-small-2603), Mistral Medium 3.1 (mistral-medium-2508), and Devstral 2 (devstral-2512) to model list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Before: pilot-20260326-064026.json (parallel runs overwrite each other)
After: pilot-20260326-064026_mistral-small-2603.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Report now displays e.g. "mistral-large-2512" instead of "Mistral Large", reading the actual model identifier from the result config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
193 questions × 6 models (L1 Recognition + L2 Application):

| Model                      | Score |
|----------------------------|-------|
| claude-sonnet-4-20250514   | 99%   |
| gpt-4o                     | 97%   |
| mistral-large-2512         | 96%   |
| devstral-2512              | 96%   |
| mistral-medium-2508        | 85%   |
| mistral-small-2603         | 74%   |

Key findings:
- Devstral 2 (code-specialized) matches Large for SE anchors
- Mistral Medium 3.1 surprisingly weak (85%) for a "frontier" model
- Mistral Small 4 shows heavy position bias (74%)
- Filenames now include model ID to prevent race conditions
- Report shows exact model identifiers instead of aliases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Walkthrough
The PR enforces exact, date-suffixed model IDs in the docs and, for pilot runs, writes additional summary JSON files to a new …
Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Pilot as Pilot Runner
participant FS as Filesystem
participant Report as Report Generator
participant HTML as HTML Renderer
Pilot->>FS: write raw results `results/pilot-{ts}_{model}.json`
Pilot->>FS: write summary `summaries/pilot-{ts}_{model}.json`
FS--)Report: store/read results + embedded configs
Report->>Report: get_model_display(backend, config) -> exact_id
Report->>HTML: generate_html(display_names = results.keys())
HTML->>FS: write final report HTML
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/anchor-evaluations.adoc`:
- Around line 410-412: The table in docs/anchor-evaluations.adoc uses
non-reproducible model alias entries (`gpt-4o`, `gpt-5`, `TBD`) directly after
the requirement "exact model identifier with date suffix" (lines around 410–412
and 423–445); replace these alias/placeholder IDs either with the concrete
snapshot IDs with a date suffix (e.g. `mistral-large-2512` style) or mark the
affected rows unambiguously as "not yet evaluated / snapshot pending" or
"TBD (no snapshot yet)", so that every table row contains either an exact,
reproducible ID or a clear placeholder status.
In `@evaluations/generate-report.py`:
- Around line 36-52: The function get_model_display uses hard-coded, outdated
Claude strings; instead, run_pilot should write the Claude IDs actually used
into the run configuration/result JSON at startup (e.g. by setting keys such as
"claude_model_used", "claude_cli_model_used", "claude_haiku_model_used"), and
get_model_display must read these first (check result/config for these
"…_model_used" keys) before falling back to hard-coded strings or
MODEL_DISPLAY_FALLBACK; adjust the logic in get_model_display and the write in
run_pilot so that the actual identifiers are persisted in the results and used
consistently later.
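A minimal sketch of the lookup order requested above, not the repository's actual implementation. The "…_model_used" key names come from the comment; the mapping from backend name to key (hyphens to underscores), the intermediate "…_model" lookup, and the contents of MODEL_DISPLAY_FALLBACK are assumptions made for illustration.

```python
# Hypothetical fallback table; the real MODEL_DISPLAY_FALLBACK may differ.
MODEL_DISPLAY_FALLBACK = {"claude": "claude (unknown snapshot)"}


def get_model_display(backend: str, config: dict) -> str:
    """Return the most specific model identifier recorded for a backend."""
    key_base = backend.replace("-", "_")  # assumed naming: "claude-cli" -> "claude_cli"

    # 1. Prefer the identifier the run itself persisted,
    #    e.g. "claude_model_used" or "claude_cli_model_used".
    used = config.get(f"{key_base}_model_used")
    if used:
        return used

    # 2. Next, a configured model such as "mistral_model".
    configured = config.get(f"{key_base}_model")
    if configured:
        return configured

    # 3. Last resort: the static fallback table, then the backend name itself.
    return MODEL_DISPLAY_FALLBACK.get(backend, backend)
```

With this ordering, a run that recorded `"claude_cli_model_used"` is labeled with that exact snapshot even if the fallback table later changes.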
In `@evaluations/pilot.py`:
- Around line 315-328: The filename suffix construction currently mutates a
cumulative string via repeated model_suffix.replace() which causes
cross-replacements when one model name appears inside another; instead, build
the suffix from individual components: iterate the models list and for each
model append the exact replacement token (use openai_model, mistral_model,
deepseek_model, f"ollama-{ollama_model}" for "ollama") into a new list, then
join that list with "_" to form model_suffix, then perform the sanitize step and
create out_file as before (reference symbols: models, model_suffix,
openai_model, mistral_model, deepseek_model, ollama_model, out_file,
RESULTS_DIR).
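The component-wise construction described above could look roughly like the following sketch. The backend names "openai", "mistral", and "deepseek" and the sanitizing rule are assumptions; only the variable names and the "ollama" prefix case are taken from the comment.

```python
import re


def build_model_suffix(models, openai_model, mistral_model, deepseek_model, ollama_model):
    """Build a filename suffix from exact model IDs, one component per backend."""
    # Each backend is looked up independently, so a replacement for one
    # component can never rewrite part of another component.
    exact_ids = {
        "openai": openai_model,
        "mistral": mistral_model,
        "deepseek": deepseek_model,
        "ollama": f"ollama-{ollama_model}",
    }
    # Backends without a configurable model ID keep their own name.
    model_ids = [exact_ids.get(m, m) for m in models]
    suffix = "_".join(model_ids)
    # Sanitize (assumed rule): keep only characters that are safe in filenames.
    return re.sub(r"[^A-Za-z0-9._-]", "-", suffix)
```

The result would then be used as before, for example `out_file = RESULTS_DIR / f"pilot-{timestamp}_{model_suffix}.json"`, where `timestamp` stands for whatever timestamp string pilot.py already computes.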
In `@evaluations/results/pilot-20260324-190600.json`:
- Line 8: The file contains the updated model identifier
"mistral-large-2512" under the key "mistral_model", but the filename does not
reflect it; please rename the file so that the exact model identifier is part
of the name (e.g. extend
pilot-20260324-190600.json → pilot-20260324-190600-mistral-large-2512.json) and,
where necessary, update all references/metadata in the repo that point to the
old filename token, so that report metadata and history remain unambiguous.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: fc01c588-e413-4e7e-a2bf-eab195594639
📒 Files selected for processing (9)
- docs/anchor-evaluations.adoc
- evaluations/generate-report.py
- evaluations/pilot.py
- evaluations/report.html
- evaluations/results/pilot-20260324-190600.json
- evaluations/results/pilot-20260326-070127_mistral-medium-2508.json
- evaluations/results/pilot-20260326-073241_devstral-2512.json
- evaluations/results/pilot-20260326-074132_mistral-small-2603.json
- website/public/evaluation-report.html
…model

- New: evaluations/summaries/ with scores only, no raw responses or per-permutation details. 17KB vs 220KB per model.
- evaluations/results/ added to .gitignore (full results reproducible via pilot.py)
- Report generator reads from summaries/
- pilot.py auto-generates summary after each run
- Fixed filename suffix to use exact model IDs (CodeRabbit feedback)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
evaluations/generate-report.py (1)
184-190: ⚠️ Potential issue | 🟠 Major
Escape configuration values and file names here before HTML rendering.
display_names, info['file'] and info['timestamp'] now come from JSON/CLI configuration instead of static constants. Interpolating them directly into div, th, h3 and dt/dd opens the report to HTML/script injection.
💡 Possible adjustment
 def generate_html(results, output_path):
+    def esc(value):
+        return h(str(value))
+
     # Keys are already exact model IDs (e.g. "mistral-large-2512")
     display_names = {m: m for m in results}
@@
-        display = display_names.get(m, m)
+        display = esc(display_names.get(m, m))
         n = len([1 for l in anchor_questions if anchor_questions[l].get(m) is not None])
         info = results[m]
         html += f"""
     <div class="summary-card">
         <div class="model-name">{display}</div>
         <div class="score" style="color: {score_color(avg)}">{avg:.0%}</div>
-        <div class="detail">{n} questions · {info['file']}</div>
+        <div class="detail">{n} questions · {esc(info['file'])}</div>
     </div>
 """
@@
-        html += f" <th style='text-align:center'>{display_names.get(m, m)}</th>\n"
+        html += f" <th style='text-align:center'>{esc(display_names.get(m, m))}</th>\n"
@@
-        html += f"<th style='text-align:center'>{display_names.get(m, m)}</th>"
+        html += f"<th style='text-align:center'>{esc(display_names.get(m, m))}</th>"
@@
-            html += f"<h3>{display_names.get(m, m)}: no failures</h3>\n"
+            html += f"<h3>{esc(display_names.get(m, m))}: no failures</h3>\n"
         else:
-            html += f'<h3>{display_names.get(m, m)}: {len(fails)} failures</h3>\n<div class="fail-list">\n'
+            html += f'<h3>{esc(display_names.get(m, m))}: {len(fails)} failures</h3>\n<div class="fail-list">\n'
@@
-        html += f"<dt>{display_names.get(m, m)}:</dt><dd>{info['file']} · {int(dur//60)}m {int(dur%60)}s · {info['timestamp'][:19]}</dd><br>"
+        html += f"<dt>{esc(display_names.get(m, m))}:</dt><dd>{esc(info['file'])} · {int(dur//60)}m {int(dur%60)}s · {esc(info['timestamp'][:19])}</dd><br>"
Also applies to: 203-203, 246-246, 268-270, 284-284
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/generate-report.py` around lines 184 - 190, Escape any user-controlled strings before embedding into HTML: apply an HTML-escaping function (e.g. html.escape) to display_names entries, info['file'], info['timestamp'] and any other values derived from JSON/CLI (used in the model-name div, detail div, table th/h3/dt/dd outputs) so the template interpolations in generate-report.py use the escaped strings rather than raw values; update locations that build HTML (references: display_names, info['file'], info['timestamp'], and places rendering anchor_questions results) to call the escape helper prior to f-string insertion.
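As a self-contained illustration of the escaping step, the sketch below assumes that Python's standard `html.escape` is the intended helper. The `summary_card` function is hypothetical and only mirrors the shape of the summary-card markup in the report generator.

```python
import html


def esc(value) -> str:
    """Escape any value for safe interpolation into HTML text or attributes."""
    return html.escape(str(value), quote=True)


def summary_card(display_name, n_questions, file_name) -> str:
    """Render one summary card with every dynamic value escaped."""
    # Hypothetical helper: the real report builds this markup inline.
    return (
        '<div class="summary-card">'
        f'<div class="model-name">{esc(display_name)}</div>'
        f'<div class="detail">{esc(n_questions)} questions · {esc(file_name)}</div>'
        "</div>"
    )
```

A model name such as `<script>alert(1)</script>` arriving from a result file would then be rendered as inert text rather than executed.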
♻️ Duplicate comments (1)
evaluations/pilot.py (1)
461-470: ⚠️ Potential issue | 🟠 Major
The new summary still does not serialize any Claude identifiers.
generate-report.py now reads summaries/, but this block copies all_results["config"] unchanged. That config still lacks fields for claude, claude-cli and claude-haiku, so the report has to keep hard-coding or guessing these runs. This can mislabel or merge different snapshots. Please persist the Claude IDs actually used before the summary is written, and later read from them first.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/pilot.py` around lines 461 - 470, The summary write currently deep-copies all_results but does not persist the actual Claude identifiers used, so generate-report.py can't reliably map runs; before writing summary (in the block that creates summary from all_results and writes summary_file) add the actual Claude ID fields into summary["config"] (specifically ensure keys "claude", "claude-cli", and "claude-haiku" are populated) by extracting the IDs used in all_results (e.g., from all_results["config"] or from the model/run entries inside all_results["models"]) so the final summary (variable summary written to summary_file) contains the concrete Claude identifiers for later deterministic reading by generate-report.py.
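One way to persist the identifiers before the summary is written is sketched below. The function name `write_summary` and the `claude_ids` parameter are hypothetical; the config keys "claude", "claude-cli", and "claude-haiku" are the ones named in the comment, and how pilot.py actually learns the IDs it used is not shown in this PR.

```python
import copy
import json
from pathlib import Path


def write_summary(all_results: dict, claude_ids: dict, summary_file: Path) -> dict:
    """Write a summary JSON whose config records the Claude IDs actually used."""
    # Deep-copy so the in-memory results are left untouched.
    summary = copy.deepcopy(all_results)
    config = summary.setdefault("config", {})

    # Record only the identifiers that were really used in this run.
    for key in ("claude", "claude-cli", "claude-haiku"):
        if key in claude_ids:
            config[key] = claude_ids[key]

    summary_file.parent.mkdir(parents=True, exist_ok=True)
    summary_file.write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return summary
```

The report generator can then read `summary["config"]` first and fall back to static names only when a key is absent.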
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evaluations/generate-report.py`:
- Line 22: The script currently points RESULTS_DIR to "summaries" and
load_best_results() ignores legacy "evaluations/results/", causing silent
continuation and empty HTML when no summaries exist; update load_best_results()
to first look in the summaries directory (RESULTS_DIR) and if empty/found-none
fall back to the legacy "evaluations/results/" path, and if both yield 0 models
raise a hard error (raise/exit with non-zero) instead of writing an empty
report; also apply the same check/behavior to the other loading code paths
around the block referenced (lines ~57-75) so no empty-report HTML is produced.
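The fallback-then-fail behavior could be sketched as follows. This is an assumption-laden outline: the "pilot-*.json" glob, the top-level "models" key, and the returned dictionary shape are illustrative, and the sketch simply keeps the last file seen per model rather than reproducing the generator's real "best result" selection.

```python
import json
import sys
from pathlib import Path


def load_best_results(summaries_dir: Path, legacy_dir: Path) -> dict:
    """Load results from summaries/, falling back to the legacy results/ directory."""
    for directory in (summaries_dir, legacy_dir):
        results = {}
        if directory.is_dir():
            for path in sorted(directory.glob("pilot-*.json")):
                data = json.loads(path.read_text(encoding="utf-8"))
                # Assumed layout: {"models": {model_id: scores, ...}}
                for model_id, scores in data.get("models", {}).items():
                    results[model_id] = {"file": path.name, "scores": scores}
        if results:
            return results
    # Neither directory yielded a model: abort with a non-zero exit status
    # instead of silently writing an empty report.
    sys.exit("No evaluation results found in summaries/ or results/")
```

Passing a string to `sys.exit` prints it to stderr and exits with status 1, so a CI step that generates the report would fail visibly.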
In `@evaluations/pilot.py`:
- Around line 317-323: The one-line if/elif/else statements inside the loop over
models violate Ruff E701; expand each branch into a multi-line block so each
conditional and its body are on separate lines. In the loop that uses models and
appends to model_ids (referencing openai_model, mistral_model, deepseek_model,
ollama_model and the f"ollama-{ollama_model}" case), convert the single-line "if
m == ...: model_ids.append(...)" and corresponding elif/else into properly
indented multi-line if / elif / else blocks, preserving the same append logic
and leaving the final model_suffix creation unchanged.
---
Outside diff comments:
In `@evaluations/generate-report.py`:
- Around line 184-190: Escape any user-controlled strings before embedding into
HTML: apply an HTML-escaping function (e.g. html.escape) to display_names
entries, info['file'], info['timestamp'] and any other values derived from
JSON/CLI (used in the model-name div, detail div, table th/h3/dt/dd outputs) so
the template interpolations in generate-report.py use the escaped strings rather
than raw values; update locations that build HTML (references: display_names,
info['file'], info['timestamp'], and places rendering anchor_questions results)
to call the escape helper prior to f-string insertion.
---
Duplicate comments:
In `@evaluations/pilot.py`:
- Around line 461-470: The summary write currently deep-copies all_results but
does not persist the actual Claude identifiers used, so generate-report.py can't
reliably map runs; before writing summary (in the block that creates summary
from all_results and writes summary_file) add the actual Claude ID fields into
summary["config"] (specifically ensure keys "claude", "claude-cli", and
"claude-haiku" are populated) by extracting the IDs used in all_results (e.g.,
from all_results["config"] or from the model/run entries inside
all_results["models"]) so the final summary (variable summary written to
summary_file) contains the concrete Claude identifiers for later deterministic
reading by generate-report.py.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 0c639659-e0ef-447f-8cb9-9335a09ae7b7
📒 Files selected for processing (12)
- evaluations/.gitignore
- evaluations/generate-report.py
- evaluations/pilot.py
- evaluations/results/pilot-20260324-174404.json
- evaluations/results/pilot-20260324-190600.json
- evaluations/results/pilot-20260324-192413.json
- evaluations/summaries/pilot-20260324-174404.json
- evaluations/summaries/pilot-20260324-190600.json
- evaluations/summaries/pilot-20260324-192413.json
- evaluations/summaries/pilot-20260326-070127_mistral-medium-2508.json
- evaluations/summaries/pilot-20260326-073241_devstral-2512.json
- evaluations/summaries/pilot-20260326-074132_mistral-small-2603.json
✅ Files skipped from review due to trivial changes (7)
- evaluations/.gitignore
- evaluations/summaries/pilot-20260324-192413.json
- evaluations/summaries/pilot-20260326-073241_devstral-2512.json
- evaluations/summaries/pilot-20260326-070127_mistral-medium-2508.json
- evaluations/summaries/pilot-20260324-190600.json
- evaluations/summaries/pilot-20260326-074132_mistral-small-2603.json
- evaluations/summaries/pilot-20260324-174404.json
Summary
Extended evaluation from 3 to 6 models. 193 questions × 6 models (L1 + L2).
Results
Key Findings
Infrastructure improvements
Part of EPIC #329.
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
Documentation
Improvements
Chores