feat: evaluate 6 models including Mistral Small/Medium/Devstral #353

Merged: rdmueller merged 5 commits into LLM-Coding:main from raifdmueller:feat/mistral-models-evaluation on Mar 26, 2026

@raifdmueller raifdmueller commented Mar 26, 2026

Summary

Extended the evaluation from 3 to 6 models: 193 questions × 6 models (L1 Recognition + L2 Application).

Results

| Model                      | Score* | Type                 |
|----------------------------|--------|----------------------|
| claude-sonnet-4-20250514   |    99% | Commercial flagship  |
| gpt-4o                     |    97% | Commercial flagship  |
| mistral-large-2512         |    96% | Open-weight flagship |
| devstral-2512              |    96% | Code-specialized     |
| mistral-medium-2508        |    85% | Commercial mid-tier  |
| mistral-small-2603         |    74% | Open-weight small    |

Key Findings

  • Devstral 2 matches Mistral Large: the code-specialized model knows SE anchors as well as the generalist flagship (both 96%)
  • Mistral Medium is surprisingly weak (85%) for a "frontier-class" model, scoring below both Large and Devstral
  • Mistral Small shows heavy position bias: most of its failed questions still score 75% (3 of 4 permutations correct)
  • Model naming fixed: result filenames now include the exact model ID, and the report shows exact identifiers
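
The 75% figure in the position-bias finding can be illustrated with a small scoring sketch. It assumes each question is asked in four permutations and scored as the fraction answered correctly; `permutation_score` and the sample data are hypothetical and not taken from pilot.py.

```python
from collections import Counter


def permutation_score(correct_flags):
    """Return the fraction of permutations that were answered correctly."""
    if not correct_flags:
        raise ValueError("at least one permutation result is required")
    return sum(1 for ok in correct_flags if ok) / len(correct_flags)


# Hypothetical per-question results: one boolean per permutation.
results = {
    "q1": [True, True, True, True],
    "q2": [True, False, True, True],  # fails for exactly one ordering
    "q3": [True, True, False, True],
}

scores = {qid: permutation_score(flags) for qid, flags in results.items()}

# Many questions sitting at exactly 0.75 hint at sensitivity to ordering
# rather than missing knowledge.
histogram = Counter(scores.values())
print(scores)
print(histogram)
```

A question the model does not know at all would tend to score well below 0.75, so a cluster at that value is what separates position bias from a knowledge gap in this sketch.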

Infrastructure improvements

  • Result filenames include model ID (prevents race conditions on parallel runs)
  • Report generator uses exact model IDs from config
  • Concept document updated with exact API identifiers and date suffixes
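
A minimal sketch of the filename scheme, assuming the suffix is built per model from the configured IDs and then sanitized. `build_model_suffix`, `result_path`, the `model_ids` mapping and the sanitizing regex are illustrative assumptions, not the actual pilot.py code.

```python
import re
from datetime import datetime
from pathlib import Path

RESULTS_DIR = Path("results")  # assumed output directory


def build_model_suffix(models, model_ids):
    """Join one exact model ID per backend, then sanitize for filenames."""
    parts = []
    for backend in models:
        if backend == "ollama":
            parts.append(f"ollama-{model_ids[backend]}")
        elif backend in model_ids:
            parts.append(model_ids[backend])
        else:
            parts.append(backend)  # no configured ID: keep the backend name
    suffix = "_".join(parts)
    # Replace anything that is unsafe in a filename (assumed rule).
    return re.sub(r"[^A-Za-z0-9._-]", "-", suffix)


def result_path(models, model_ids, now=None):
    """Build results/pilot-{ts}_{model_suffix}.json for one run."""
    ts = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return RESULTS_DIR / f"pilot-{ts}_{build_model_suffix(models, model_ids)}.json"


path = result_path(
    ["mistral"],
    {"mistral": "mistral-small-2603"},
    now=datetime(2026, 3, 26, 6, 40, 26),
)
print(path.name)
```

Because the suffix differs per model, two runs started in the same second no longer target the same file.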

Part of EPIC #329.

Test plan

  • All 6 models complete (193 questions each)
  • Report generated with 6 models
  • Exact model IDs in filenames and report

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation

    • Model selection requirements tightened: exact provider model identifiers with a date are now required; table layouts and local-vs-API notes revised.
  • Improvements

    • The display now uses exact model identifiers instead of alias names.
    • Test-run filenames are separated per model, which prevents overwrites.
    • Additional condensed evaluation summaries are saved.
  • Chores

    • Raw result files are now ignored by default.

raifdmueller and others added 4 commits March 26, 2026 06:43
Replace vague model names with exact API identifiers:
- mistral-large-2512 (not "Mistral Large")
- claude-sonnet-4-20250514 (not "Claude Sonnet")
- gpt-4o (not "GPT")

Add Mistral Small 4 (mistral-small-2603), Mistral Medium 3.1
(mistral-medium-2508), and Devstral 2 (devstral-2512) to model list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Before: pilot-20260326-064026.json (parallel runs overwrite each other)
After:  pilot-20260326-064026_mistral-small-2603.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Report now displays e.g. "mistral-large-2512" instead of "Mistral Large",
reading the actual model identifier from the result config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
193 questions × 6 models (L1 Recognition + L2 Application):

| Model                      | Score |
|----------------------------|-------|
| claude-sonnet-4-20250514   |   99% |
| gpt-4o                     |   97% |
| mistral-large-2512         |   96% |
| devstral-2512              |   96% |
| mistral-medium-2508        |   85% |
| mistral-small-2603         |   74% |

Key findings:
- Devstral 2 (code-specialized) matches Large for SE anchors
- Mistral Medium 3.1 surprisingly weak (85%) for a "frontier" model
- Mistral Small 4 shows heavy position bias (74%)
- Filenames now include model ID to prevent race conditions
- Report shows exact model identifiers instead of aliases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Mar 26, 2026

Walkthrough

The PR enforces exact, date-suffixed model IDs in the docs, writes additional condensed JSONs to a new summaries/ directory during pilot runs, and changes report generation so that display names are derived from stored configurations instead of a static mapping.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| Documentation & model catalog: `docs/anchor-evaluations.adoc` | "Model Selection" now requires exact, date-suffixed provider model IDs; commercial model table expanded (explicit API IDs, GPT‑4o/GPT‑5, several Mistral/Devstral variants); open-weight table reduced to models running locally via Ollama, with columns and order adjusted. |
| Report generation & display logic: `evaluations/generate-report.py` | Static `MODEL_DISPLAY` renamed to `MODEL_DISPLAY_FALLBACK`; new `get_model_display(backend, config)` extracts exact display IDs from stored configurations; `load_best_results()` indexes results by this exact identifier; `generate_html()` builds the model list from the new keys and uses `display_names` throughout. |
| Pilot run / filename convention & summaries: `evaluations/pilot.py`, `evaluations/.gitignore` | Pilot output filenames now carry a model-dependent suffix (`pilot-{ts}_{model_suffix}.json`) built from sanitized exact model IDs; non-dry-run runs additionally write a reduced summary (without per-question results) to `evaluations/summaries/`; `results/` added to `.gitignore`. |
| Results / new summary artifacts: `evaluations/summaries/*` (several `pilot-*.json`), `evaluations/results/pilot-20260324-190600.json` | Many new timestamped summary JSONs added; one existing result-config entry changed from `mistral-large-latest` to `mistral-large-2512`. Changes are data/artifact-driven, with no API signature changes. |
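
The display-name lookup described for `get_model_display(backend, config)` might look roughly like the sketch below. The key pattern `<backend>_model_used` follows the naming suggested later in this review, `mistral_model` appears in a stored result config, and the fallback entries are purely hypothetical; the real function may differ.

```python
# Hypothetical fallback labels, used only when a run stored no exact ID.
MODEL_DISPLAY_FALLBACK = {
    "mistral": "mistral (unknown snapshot)",
    "claude": "claude (unknown snapshot)",
}


def get_model_display(backend, config):
    """Prefer the exact model ID persisted with the run over static labels."""
    prefix = backend.replace("-", "_")  # e.g. "claude-cli" -> "claude_cli"
    for key in (f"{prefix}_model_used", f"{prefix}_model"):
        value = config.get(key)
        if value:
            return value
    return MODEL_DISPLAY_FALLBACK.get(backend, backend)


print(get_model_display("mistral", {"mistral_model": "mistral-large-2512"}))
print(get_model_display("claude-cli", {"claude_cli_model_used": "claude-sonnet-4-20250514"}))
print(get_model_display("deepseek", {}))
```

Reading the identifier from the stored config keeps the report reproducible even after a provider alias moves to a newer snapshot.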

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Pilot as Pilot Runner
    participant FS as Filesystem
    participant Report as Report Generator
    participant HTML as HTML Renderer

    Pilot->>FS: write raw results `results/pilot-{ts}_{model}.json`
    Pilot->>FS: write summary `summaries/pilot-{ts}_{model}.json`
    FS--)Report: store/read results + embedded configs
    Report->>Report: get_model_display(backend, config) -> exact_id
    Report->>HTML: generate_html(display_names = results.keys())
    HTML->>FS: write final report HTML

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title precisely describes the main change: evaluating 6 models including several Mistral variants, which matches the core content of the pull request (new model evaluations, Mistral Small/Medium/Devstral tests). |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/anchor-evaluations.adoc`:
- Around line 410-412: The table in docs/anchor-evaluations.adoc uses
non-reproducible model alias entries (`gpt-4o`, `gpt-5`, `TBD`) directly
after the requirement "exact model identifier with date suffix" (lines around
410–412 and 423–445); either replace these alias/placeholder IDs with the
concrete snapshot IDs with date suffix (e.g. `mistral-large-2512` style) or
clearly mark the affected rows as "not yet evaluated / snapshot pending" or
"TBD (no snapshot yet)", so that every table row contains either an exact
reproducible ID or a clear placeholder status.

In `@evaluations/generate-report.py`:
- Around line 36-52: The function get_model_display uses hard-coded, outdated
Claude strings; instead, run_pilot should write the Claude IDs actually used
into the run configuration/result JSON at startup (e.g. by setting keys such as
"claude_model_used", "claude_cli_model_used", "claude_haiku_model_used"), and
get_model_display must read these with priority (check result/config for these
"…_model_used" keys) before falling back to hard-coded strings or
MODEL_DISPLAY_FALLBACK; adjust the logic in get_model_display and the write in
run_pilot so that the actual identifiers are persisted in the results and used
consistently later.

In `@evaluations/pilot.py`:
- Around line 315-328: The filename suffix construction currently mutates a
cumulative string via repeated model_suffix.replace() which causes
cross-replacements when one model name appears inside another; instead, build
the suffix from individual components: iterate the models list and for each
model append the exact replacement token (use openai_model, mistral_model,
deepseek_model, f"ollama-{ollama_model}" for "ollama") into a new list, then
join that list with "_" to form model_suffix, then perform the sanitize step and
create out_file as before (reference symbols: models, model_suffix,
openai_model, mistral_model, deepseek_model, ollama_model, out_file,
RESULTS_DIR).

In `@evaluations/results/pilot-20260324-190600.json`:
- Line 8: The file contains the updated model identifier
"mistral-large-2512" under the key "mistral_model", but the filename does not
reflect it; please rename the file so that the exact model identifier is part
of the name (e.g. extend
pilot-20260324-190600.json → pilot-20260324-190600-mistral-large-2512.json) and,
where needed, update all references/metadata in the repo that point to the old
filename token, so that report metadata and history stay unambiguous.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: fc01c588-e413-4e7e-a2bf-eab195594639

📥 Commits

Reviewing files that changed from the base of the PR and between ceee4d0 and 4602bbe.

📒 Files selected for processing (9)
  • docs/anchor-evaluations.adoc
  • evaluations/generate-report.py
  • evaluations/pilot.py
  • evaluations/report.html
  • evaluations/results/pilot-20260324-190600.json
  • evaluations/results/pilot-20260326-070127_mistral-medium-2508.json
  • evaluations/results/pilot-20260326-073241_devstral-2512.json
  • evaluations/results/pilot-20260326-074132_mistral-small-2603.json
  • website/public/evaluation-report.html

Comment thread docs/anchor-evaluations.adoc
Comment thread evaluations/generate-report.py
Comment thread evaluations/pilot.py Outdated
Comment thread evaluations/results/pilot-20260324-190600.json Outdated
…model

- New: evaluations/summaries/ — scores only, no raw responses or
  per-permutation details. 17KB vs 220KB per model.
- evaluations/results/ added to .gitignore (full results reproducible
  via pilot.py)
- Report generator reads from summaries/
- pilot.py auto-generates summary after each run
- Fixed filename suffix to use exact model IDs (CodeRabbit feedback)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
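
A rough sketch of how such a scores-only summary could be derived from a full result. The nesting under `models` / `results` and the field names in `DETAIL_KEYS` are assumptions for illustration; only the idea of deep-copying `all_results` and dropping the bulky details comes from this PR.

```python
import copy
import json
from pathlib import Path

# Hypothetical names for the bulky per-question fields that get dropped.
DETAIL_KEYS = ("raw_response", "permutations")


def make_summary(all_results):
    """Return a deep copy of the results with raw details removed."""
    summary = copy.deepcopy(all_results)
    for model_data in summary.get("models", {}).values():
        for question in model_data.get("results", []):
            for key in DETAIL_KEYS:
                question.pop(key, None)
    return summary


def write_summary(all_results, summary_file):
    """Write the reduced summary next to (not instead of) the full results."""
    summary_file = Path(summary_file)
    summary_file.parent.mkdir(parents=True, exist_ok=True)
    summary_file.write_text(
        json.dumps(make_summary(all_results), indent=2), encoding="utf-8"
    )


full = {
    "config": {"mistral_model": "mistral-small-2603"},
    "models": {
        "mistral": {
            "results": [
                {"id": "q1", "score": 0.75, "raw_response": "B",
                 "permutations": [True, True, False, True]},
            ]
        }
    },
}
print(make_summary(full))
```

Working on a deep copy leaves the full result untouched, so the raw file can still be written to the ignored results/ directory.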

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
evaluations/generate-report.py (1)

184-190: ⚠️ Potential issue | 🟠 Major

Escape configuration values and filenames here before HTML rendering.

display_names, info['file'] and info['timestamp'] now come from JSON/CLI configuration instead of static constants. Direct interpolation into div, th, h3 and dt/dd opens the report to HTML/script injection.

💡 Possible adjustment
 def generate_html(results, output_path):
+    def esc(value):
+        return h(str(value))
+
     # Keys are already exact model IDs (e.g. "mistral-large-2512")
     display_names = {m: m for m in results}
@@
-        display = display_names.get(m, m)
+        display = esc(display_names.get(m, m))
         n = len([1 for l in anchor_questions if anchor_questions[l].get(m) is not None])
         info = results[m]
         html += f"""  <div class="summary-card">
     <div class="model-name">{display}</div>
     <div class="score" style="color: {score_color(avg)}">{avg:.0%}</div>
-    <div class="detail">{n} questions · {info['file']}</div>
+    <div class="detail">{n} questions · {esc(info['file'])}</div>
   </div>
 """
@@
-        html += f"  <th style='text-align:center'>{display_names.get(m, m)}</th>\n"
+        html += f"  <th style='text-align:center'>{esc(display_names.get(m, m))}</th>\n"
@@
-            html += f"<th style='text-align:center'>{display_names.get(m, m)}</th>"
+            html += f"<th style='text-align:center'>{esc(display_names.get(m, m))}</th>"
@@
-            html += f"<h3>{display_names.get(m, m)}: no failures</h3>\n"
+            html += f"<h3>{esc(display_names.get(m, m))}: no failures</h3>\n"
         else:
-            html += f'<h3>{display_names.get(m, m)}: {len(fails)} failures</h3>\n<div class="fail-list">\n'
+            html += f'<h3>{esc(display_names.get(m, m))}: {len(fails)} failures</h3>\n<div class="fail-list">\n'
@@
-        html += f"<dt>{display_names.get(m, m)}:</dt><dd>{info['file']} · {int(dur//60)}m {int(dur%60)}s · {info['timestamp'][:19]}</dd><br>"
+        html += f"<dt>{esc(display_names.get(m, m))}:</dt><dd>{esc(info['file'])} · {int(dur//60)}m {int(dur%60)}s · {esc(info['timestamp'][:19])}</dd><br>"

Also applies to: 203-203, 246-246, 268-270, 284-284
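
As a standalone illustration of the escaping suggested above, the sketch below uses the standard library's `html.escape`. `render_summary_card` is a hypothetical helper that mirrors the summary-card markup from the diff; it is not a function in generate-report.py.

```python
from html import escape


def render_summary_card(display_name, score, file_name):
    """Render one summary card with all externally sourced strings escaped."""
    return (
        '<div class="summary-card">'
        f'<div class="model-name">{escape(str(display_name))}</div>'
        f'<div class="score">{score:.0%}</div>'
        f'<div class="detail">{escape(str(file_name))}</div>'
        "</div>"
    )


# A model ID or filename taken from JSON could contain markup.
card = render_summary_card("<script>alert(1)</script>", 0.96, "a&b.json")
print(card)
```

The numeric score is formatted rather than escaped because it never passes through as a raw string.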

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/generate-report.py` around lines 184 - 190, Escape any
user-controlled strings before embedding into HTML: apply an HTML-escaping
function (e.g. html.escape) to display_names entries, info['file'],
info['timestamp'] and any other values derived from JSON/CLI (used in the
model-name div, detail div, table th/h3/dt/dd outputs) so the template
interpolations in generate-report.py use the escaped strings rather than raw
values; update locations that build HTML (references: display_names,
info['file'], info['timestamp'], and places rendering anchor_questions results)
to call the escape helper prior to f-string insertion.
♻️ Duplicate comments (1)
evaluations/pilot.py (1)

461-470: ⚠️ Potential issue | 🟠 Major

The new summary still does not serialize Claude identifiers.

generate-report.py now reads summaries/, but this block copies all_results["config"] unchanged. Fields for claude, claude-cli and claude-haiku are still missing there, so the report has to keep hard-coding or guessing these runs. This can mislabel or merge different snapshots. Please persist the Claude IDs actually used before the summary write, and read from them with priority later.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/pilot.py` around lines 461 - 470, The summary write currently
deep-copies all_results but does not persist the actual Claude identifiers used,
so generate-report.py can't reliably map runs; before writing summary (in the
block that creates summary from all_results and writes summary_file) add the
actual Claude ID fields into summary["config"] (specifically ensure keys
"claude", "claude-cli", and "claude-haiku" are populated) by extracting the IDs
used in all_results (e.g., from all_results["config"] or from the model/run
entries inside all_results["models"]) so the final summary (variable summary
written to summary_file) contains the concrete Claude identifiers for later
deterministic reading by generate-report.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evaluations/generate-report.py`:
- Line 22: The script currently points RESULTS_DIR to "summaries" and
load_best_results() ignores legacy "evaluations/results/", causing silent
continuation and empty HTML when no summaries exist; update load_best_results()
to first look in the summaries directory (RESULTS_DIR) and if empty/found-none
fall back to the legacy "evaluations/results/" path, and if both yield 0 models
raise a hard error (raise/exit with non-zero) instead of writing an empty
report; also apply the same check/behavior to the other loading code paths
around the block referenced (lines ~57-75) so no empty-report HTML is produced.

In `@evaluations/pilot.py`:
- Around line 317-323: The one-line if/elif/else statements inside the loop over
models violate Ruff E701; expand each branch into a multi-line block so each
conditional and its body are on separate lines. In the loop that uses models and
appends to model_ids (referencing openai_model, mistral_model, deepseek_model,
ollama_model and the f"ollama-{ollama_model}" case), convert the single-line "if
m == ...: model_ids.append(...)" and corresponding elif/else into properly
indented multi-line if / elif / else blocks, preserving the same append logic
and leaving the final model_suffix creation unchanged.
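
One possible shape for the fallback-and-fail behavior requested for generate-report.py, as a hedged sketch: `load_result_files` and its parameters are hypothetical names, and the `pilot-*.json` glob is assumed from the filenames listed in this PR.

```python
import json
from pathlib import Path


def load_result_files(summaries_dir, legacy_dir):
    """Load pilot JSONs from summaries/, else from the legacy results dir.

    Exits with a non-zero status if neither directory yields a file, so that
    no empty report is written.
    """
    for directory in (Path(summaries_dir), Path(legacy_dir)):
        files = sorted(directory.glob("pilot-*.json")) if directory.is_dir() else []
        if files:
            return [
                (f.name, json.loads(f.read_text(encoding="utf-8")))
                for f in files
            ]
    raise SystemExit(f"No result files found in {summaries_dir} or {legacy_dir}")
```

Passing a string to `SystemExit` prints it to stderr and yields exit status 1, which is enough for a CI job to notice the missing data.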


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 0c639659-e0ef-447f-8cb9-9335a09ae7b7

📥 Commits

Reviewing files that changed from the base of the PR and between 4602bbe and b2bf619.

📒 Files selected for processing (12)
  • evaluations/.gitignore
  • evaluations/generate-report.py
  • evaluations/pilot.py
  • evaluations/results/pilot-20260324-174404.json
  • evaluations/results/pilot-20260324-190600.json
  • evaluations/results/pilot-20260324-192413.json
  • evaluations/summaries/pilot-20260324-174404.json
  • evaluations/summaries/pilot-20260324-190600.json
  • evaluations/summaries/pilot-20260324-192413.json
  • evaluations/summaries/pilot-20260326-070127_mistral-medium-2508.json
  • evaluations/summaries/pilot-20260326-073241_devstral-2512.json
  • evaluations/summaries/pilot-20260326-074132_mistral-small-2603.json
✅ Files skipped from review due to trivial changes (7)
  • evaluations/.gitignore
  • evaluations/summaries/pilot-20260324-192413.json
  • evaluations/summaries/pilot-20260326-073241_devstral-2512.json
  • evaluations/summaries/pilot-20260326-070127_mistral-medium-2508.json
  • evaluations/summaries/pilot-20260324-190600.json
  • evaluations/summaries/pilot-20260326-074132_mistral-small-2603.json
  • evaluations/summaries/pilot-20260324-174404.json

Comment thread evaluations/generate-report.py
Comment thread evaluations/pilot.py
@rdmueller rdmueller merged commit 0e0a749 into LLM-Coding:main Mar 26, 2026
6 of 7 checks passed
2 participants