feat: evaluate 6 models including Mistral Small/Medium/Devstral by raifdmueller · Pull Request #353 · LLM-Coding/Semantic-Anchors

raifdmueller · 2026-03-26T08:04:15Z

Summary

Extended evaluation from 3 to 6 models. 193 questions × 6 models (L1 + L2).

Results

Model	Score*	Type
claude-sonnet-4-20250514	99%	Commercial flagship
gpt-4o	97%	Commercial flagship
mistral-large-2512	96%	Open-weight flagship
devstral-2512	96%	Code-specialized
mistral-medium-2508	85%	Commercial mid-tier
mistral-small-2603	74%	Open-weight small

Key Findings

Devstral 2 matches Mistral Large — a code-specialized model knows SE anchors as well as the generalist flagship
Mistral Medium is surprisingly weak (85%) for a "frontier-class" model — worse than Large and Devstral
Mistral Small shows heavy position bias: most of its failed questions still score 75% (3 of 4 permutations correct)
Model naming fixed — result filenames now include exact model ID, report shows exact identifiers
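
To make the "3/4 permutations correct" figure concrete, here is a minimal sketch of per-question permutation scoring. It assumes each question is asked once per ordering of its answer options and that the question score is the fraction of orderings answered correctly. The cyclic rotations, the `permutation_score` helper, and the `biased_model` stand-in are illustrative assumptions, not the harness used in this PR.

```python
def rotations(options):
    """Return every cyclic rotation of the option list (assumed permutation scheme)."""
    return [options[i:] + options[:i] for i in range(len(options))]


def permutation_score(options, correct, answer_fn):
    """Fraction of orderings in which answer_fn picks the correct option."""
    variants = rotations(options)
    hits = 0
    for variant in variants:
        picked_index = answer_fn(variant)
        if variant[picked_index] == correct:
            hits += 1
    return hits / len(variants)


def biased_model(variant, correct="B"):
    """Hypothetical position-biased model: it knows the answer, but when the
    correct option sits in the last slot it picks the first slot instead."""
    idx = variant.index(correct)
    return 0 if idx == len(variant) - 1 else idx


score = permutation_score(["A", "B", "C", "D"], "B", biased_model)
print(f"{score:.0%}")
```

A model that fails only when the correct option lands in one particular slot ends up at 75% on that question, which is the pattern described for Mistral Small.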

Infrastructure improvements

Result filenames include model ID (prevents race conditions on parallel runs)
Report generator uses exact model IDs from config
Concept document updated with exact API identifiers and date suffixes

Part of EPIC #329.

Test plan

All 6 models complete (193 questions each)
Report generated with 6 models
Exact model IDs in filenames and report

🤖 Generated with Claude Code

Summary by CodeRabbit

Documentation
- Model selection requirements tightened: exact provider model identifiers with a date suffix are now required; table layouts and the local-vs-API notes were revised.
Improvements
- The display now uses exact model identifiers instead of alias names.
- Test-run filenames are now separated per model, which prevents overwrites.
- Additional condensed evaluation summaries are now saved.
Chores
- Raw result files are now ignored by default.

Replace vague model names with exact API identifiers:
- mistral-large-2512 (not "Mistral Large")
- claude-sonnet-4-20250514 (not "Claude Sonnet")
- gpt-4o (not "GPT")

Add Mistral Small 4 (mistral-small-2603), Mistral Medium 3.1 (mistral-medium-2508), and Devstral 2 (devstral-2512) to model list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Before: pilot-20260326-064026.json (parallel runs overwrite each other)
After: pilot-20260326-064026_mistral-small-2603.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Report now displays e.g. "mistral-large-2512" instead of "Mistral Large", reading the actual model identifier from the result config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

193 questions × 6 models (L1 Recognition + L2 Application):

| Model                    | Score |
|--------------------------|-------|
| claude-sonnet-4-20250514 | 99%   |
| gpt-4o                   | 97%   |
| mistral-large-2512       | 96%   |
| devstral-2512            | 96%   |
| mistral-medium-2508      | 85%   |
| mistral-small-2603       | 74%   |

Key findings:
- Devstral 2 (code-specialized) matches Large for SE anchors
- Mistral Medium 3.1 surprisingly weak (85%) for a "frontier" model
- Mistral Small 4 shows heavy position bias (74%)
- Filenames now include model ID to prevent race conditions
- Report shows exact model identifiers instead of aliases

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-03-26T08:04:28Z

Walkthrough

The PR enforces exact, date-suffixed model IDs in the docs, writes additional condensed JSON summaries to a new summaries/ directory during pilot runs, and changes report generation so that display names are derived from stored configurations instead of a static mapping.

Changes

Cohort / File(s)	Summary
Documentation & model catalog `docs/anchor-evaluations.adoc`	"Model Selection" now requires exact, date-suffixed provider model IDs; commercial model table extended (explicit API IDs, GPT-4o/GPT-5, several Mistral/Devstral variants); open-weight table reduced to models that run locally via Ollama, with columns and ordering adjusted.
Report generation & display logic `evaluations/generate-report.py`	Static `MODEL_DISPLAY` renamed to `MODEL_DISPLAY_FALLBACK`; new `get_model_display(backend, config)` extracts exact display IDs from stored configurations; `load_best_results()` indexes results by this exact identifier; `generate_html()` builds the model list from the new keys and uses `display_names` throughout.
Pilot run / filename convention & summaries `evaluations/pilot.py`, `evaluations/.gitignore`	Pilot output filenames now carry a model-dependent suffix (`pilot-{ts}_{model_suffix}.json`) built from sanitized exact model IDs; in addition, non-dry-run executions write a reduced summary (without per-question `results`) to `evaluations/summaries/`; `results/` added to `.gitignore`.
Results / new summary artifacts `evaluations/summaries/` (several `pilot-.json`), `evaluations/results/pilot-20260324-190600.json`	Many new timestamped summary JSONs added; one existing result-config entry changed: `mistral-large-latest` → `mistral-large-2512`. The changes are data/artifact-driven, with no API signature changes.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Pilot as Pilot Runner
    participant FS as Filesystem
    participant Report as Report Generator
    participant HTML as HTML Renderer

    Pilot->>FS: write raw results `results/pilot-{ts}_{model}.json`
    Pilot->>FS: write summary `summaries/pilot-{ts}_{model}.json`
    FS--)Report: store/read results + embedded configs
    Report->>Report: get_model_display(backend, config) -> exact_id
    Report->>HTML: generate_html(display_names = results.keys())
    HTML->>FS: write final report HTML

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

docs: add semantic anchor evaluation concept #328: Both PRs modify docs/anchor-evaluations.adoc; this PR updates model identifiers and tables, while #328 introduced the file.
feat: publish evaluation report on the website #344: Both PRs touch the evaluation report pipeline; changes to generate-report.py/pilot.py can interact directly with the HTML publishing flow in #344.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: evaluating 6 models including several Mistral variants, which matches the core content of the pull request (new model evaluations, Mistral Small/Medium/Devstral tests).

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/anchor-evaluations.adoc`:
- Around line 410-412: The table in docs/anchor-evaluations.adoc uses
non-reproducible model alias entries (`gpt-4o`, `gpt-5`, `TBD`) directly
after the requirement "exact model identifier with date suffix" (lines around
410-412 and 423-445); either replace these alias/placeholder IDs with the
concrete snapshot IDs with a date suffix (e.g. `mistral-large-2512` style) or
clearly mark the affected rows as "not yet evaluated / snapshot pending" or
"TBD (no snapshot yet)", so that every table row contains either an exact
reproducible ID or a clear placeholder status.

In `@evaluations/generate-report.py`:
- Around line 36-52: The function get_model_display uses hard-coded, outdated
Claude strings; instead, run_pilot should write the Claude IDs actually used
into the run configuration/result JSON at startup (e.g. by setting keys such as
"claude_model_used", "claude_cli_model_used", "claude_haiku_model_used"), and
get_model_display must read these first (check result/config for these
"…_model_used" keys) before falling back to hard-coded strings or
MODEL_DISPLAY_FALLBACK; adjust the logic in get_model_display and the write in
run_pilot so that the actual identifiers are persisted in the results and used
consistently later.
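
The lookup order requested in this comment could look roughly like the sketch below. The "…_model_used" key names and the `mistral_model` key appear in this review; the backend names passed in, the contents of `MODEL_DISPLAY_FALLBACK`, and the exact key derivation are assumptions made for illustration and are not taken from generate-report.py.

```python
# Hypothetical fallback contents; the real mapping is not shown in this PR page.
MODEL_DISPLAY_FALLBACK = {"claude": "Claude (snapshot unknown)"}


def get_model_display(backend, config):
    """Return the most exact model identifier available for a backend."""
    # 1. Prefer the identifier recorded at run time, e.g. "claude_cli_model_used".
    used_key = f"{backend.replace('-', '_')}_model_used"
    if config.get(used_key):
        return config[used_key]
    # 2. Otherwise use the configured model ID, e.g. "mistral_model".
    configured_key = f"{backend.replace('-', '_')}_model"
    if config.get(configured_key):
        return config[configured_key]
    # 3. Last resort: static fallback, or the backend alias itself.
    return MODEL_DISPLAY_FALLBACK.get(backend, backend)


print(get_model_display("claude-cli", {"claude_cli_model_used": "claude-sonnet-4-20250514"}))
print(get_model_display("mistral", {"mistral_model": "mistral-large-2512"}))
```

Reading the persisted identifier first means two runs against different snapshots stay distinguishable in the report, even when both were started through the same backend alias.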

In `@evaluations/pilot.py`:
- Around line 315-328: The filename suffix construction currently mutates a
cumulative string via repeated model_suffix.replace() which causes
cross-replacements when one model name appears inside another; instead, build
the suffix from individual components: iterate the models list and for each
model append the exact replacement token (use openai_model, mistral_model,
deepseek_model, f"ollama-{ollama_model}" for "ollama") into a new list, then
join that list with "_" to form model_suffix, then perform the sanitize step and
create out_file as before (reference symbols: models, model_suffix,
openai_model, mistral_model, deepseek_model, ollama_model, out_file,
RESULTS_DIR).
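
A minimal sketch of the component-wise construction described above, under stated assumptions: the backend names other than "ollama", the helper name `build_model_suffix`, and the sanitization regex are hypothetical, since the actual code in pilot.py is not shown here.

```python
import re


def build_model_suffix(models, openai_model, mistral_model, deepseek_model, ollama_model):
    """Map each backend to its exact model ID, then join and sanitize once."""
    model_ids = []
    for m in models:
        if m == "openai":
            model_ids.append(openai_model)
        elif m == "mistral":
            model_ids.append(mistral_model)
        elif m == "deepseek":
            model_ids.append(deepseek_model)
        elif m == "ollama":
            model_ids.append(f"ollama-{ollama_model}")
        else:
            # Backends without a configurable ID keep their own name.
            model_ids.append(m)
    suffix = "_".join(model_ids)
    # Assumed sanitization: anything outside [A-Za-z0-9._-] becomes "-".
    return re.sub(r"[^A-Za-z0-9._-]", "-", suffix)


suffix = build_model_suffix(["mistral"], "gpt-4o", "mistral-small-2603", "deepseek-chat", "llama3:8b")
print(f"pilot-20260326-064026_{suffix}.json")
```

Because every backend is translated exactly once before joining, a backend name that happens to occur inside another model ID can no longer be replaced a second time.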

In `@evaluations/results/pilot-20260324-190600.json`:
- Line 8: The file contains the updated model identifier
"mistral-large-2512" under the key "mistral_model", but the filename does not
reflect it; please rename the file so that the exact model identifier is part
of the name (e.g. extend
pilot-20260324-190600.json → pilot-20260324-190600-mistral-large-2512.json) and
update any references/metadata in the repo that point to the old filename
token, so that report metadata and history remain unambiguous.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: fc01c588-e413-4e7e-a2bf-eab195594639

📥 Commits

Reviewing files that changed from the base of the PR and between ceee4d0 and 4602bbe.

📒 Files selected for processing (9)

docs/anchor-evaluations.adoc
evaluations/generate-report.py
evaluations/pilot.py
evaluations/report.html
evaluations/results/pilot-20260324-190600.json
evaluations/results/pilot-20260326-070127_mistral-medium-2508.json
evaluations/results/pilot-20260326-073241_devstral-2512.json
evaluations/results/pilot-20260326-074132_mistral-small-2603.json
website/public/evaluation-report.html

…model
- New: evaluations/summaries/ with scores only, no raw responses or per-permutation details. 17KB vs 220KB per model.
- evaluations/results/ added to .gitignore (full results reproducible via pilot.py)
- Report generator reads from summaries/
- pilot.py auto-generates summary after each run
- Fixed filename suffix to use exact model IDs (CodeRabbit feedback)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

evaluations/generate-report.py (1)

184-190: ⚠️ Potential issue | 🟠 Major

Escape configuration values and filenames here before HTML rendering.

display_names, info['file'] and info['timestamp'] now come from JSON/CLI configuration instead of static constants. Interpolating them directly into div, th, h3 and dt/dd opens the report to HTML/script injection.

💡 Possible adjustment

 def generate_html(results, output_path):
+    def esc(value):
+        return h(str(value))
+
     # Keys are already exact model IDs (e.g. "mistral-large-2512")
     display_names = {m: m for m in results}
@@
-        display = display_names.get(m, m)
+        display = esc(display_names.get(m, m))
         n = len([1 for l in anchor_questions if anchor_questions[l].get(m) is not None])
         info = results[m]
         html += f"""  <div class="summary-card">
     <div class="model-name">{display}</div>
     <div class="score" style="color: {score_color(avg)}">{avg:.0%}</div>
-    <div class="detail">{n} questions · {info['file']}</div>
+    <div class="detail">{n} questions · {esc(info['file'])}</div>
   </div>
 """
@@
-        html += f"  <th style='text-align:center'>{display_names.get(m, m)}</th>\n"
+        html += f"  <th style='text-align:center'>{esc(display_names.get(m, m))}</th>\n"
@@
-            html += f"<th style='text-align:center'>{display_names.get(m, m)}</th>"
+            html += f"<th style='text-align:center'>{esc(display_names.get(m, m))}</th>"
@@
-            html += f"<h3>{display_names.get(m, m)}: no failures</h3>\n"
+            html += f"<h3>{esc(display_names.get(m, m))}: no failures</h3>\n"
         else:
-            html += f'<h3>{display_names.get(m, m)}: {len(fails)} failures</h3>\n<div class="fail-list">\n'
+            html += f'<h3>{esc(display_names.get(m, m))}: {len(fails)} failures</h3>\n<div class="fail-list">\n'
@@
-        html += f"<dt>{display_names.get(m, m)}:</dt><dd>{info['file']} · {int(dur//60)}m {int(dur%60)}s · {info['timestamp'][:19]}</dd><br>"
+        html += f"<dt>{esc(display_names.get(m, m))}:</dt><dd>{esc(info['file'])} · {int(dur//60)}m {int(dur%60)}s · {esc(info['timestamp'][:19])}</dd><br>"

Also applies to: 203-203, 246-246, 268-270, 284-284

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@evaluations/generate-report.py` around lines 184 - 190, Escape any
user-controlled strings before embedding into HTML: apply an HTML-escaping
function (e.g. html.escape) to display_names entries, info['file'],
info['timestamp'] and any other values derived from JSON/CLI (used in the
model-name div, detail div, table th/h3/dt/dd outputs) so the template
interpolations in generate-report.py use the escaped strings rather than raw
values; update locations that build HTML (references: display_names,
info['file'], info['timestamp'], and places rendering anchor_questions results)
to call the escape helper prior to f-string insertion.
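
As a small, self-contained illustration of the escaping step, the sketch below renders one summary card with `html.escape` from the Python standard library. The `summary_card` helper is hypothetical; only the class names and the general markup shape follow the snippet quoted in this review.

```python
from html import escape


def summary_card(display, n_questions, file_name):
    """Render one summary card, escaping every value that comes from JSON/CLI."""
    return (
        '<div class="summary-card">'
        f'<div class="model-name">{escape(str(display))}</div>'
        f'<div class="detail">{int(n_questions)} questions · {escape(str(file_name))}</div>'
        "</div>"
    )


card = summary_card("<script>alert(1)</script>", 193, "pilot-20260326-074132_mistral-small-2603.json")
print(card)
```

With escaping in place, a model ID or filename containing markup is rendered as literal text instead of being interpreted by the browser.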

♻️ Duplicate comments (1)

evaluations/pilot.py (1)
461-470: ⚠️ Potential issue | 🟠 Major

The new summary still does not serialize any Claude identifiers.

generate-report.py now reads summaries/, but this block copies all_results["config"] unchanged. It still lacks fields for claude, claude-cli and claude-haiku, so the report has to keep hard-coding or guessing these runs. That can mislabel or merge different snapshots. Please persist the Claude IDs actually used before the summary write, and read from them preferentially later.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/pilot.py` around lines 461 - 470, The summary write currently
deep-copies all_results but does not persist the actual Claude identifiers used,
so generate-report.py can't reliably map runs; before writing summary (in the
block that creates summary from all_results and writes summary_file) add the
actual Claude ID fields into summary["config"] (specifically ensure keys
"claude", "claude-cli", and "claude-haiku" are populated) by extracting the IDs
used in all_results (e.g., from all_results["config"] or from the model/run
entries inside all_results["models"]) so the final summary (variable summary
written to summary_file) contains the concrete Claude identifiers for later
deterministic reading by generate-report.py.
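
One way the summary step could persist those identifiers is sketched below. The layout of `all_results` (a "config" dict plus a "models" dict whose entries carry a per-question "results" list) and the `claude_ids` argument are assumptions for illustration; the real structure in pilot.py may differ.

```python
import copy


def build_summary(all_results, claude_ids):
    """Copy the run data, drop per-question results, and record Claude IDs."""
    summary = copy.deepcopy(all_results)  # never mutate the full results
    for model_data in summary.get("models", {}).values():
        model_data.pop("results", None)  # keep scores, drop raw per-question data
    config = summary.setdefault("config", {})
    for key in ("claude", "claude-cli", "claude-haiku"):
        if key in claude_ids:
            config[key] = claude_ids[key]
    return summary


full = {
    "config": {"mistral_model": "mistral-large-2512"},
    "models": {"claude": {"score": 0.99, "results": [{"question": 1, "correct": True}]}},
}
summary = build_summary(full, {"claude": "claude-sonnet-4-20250514"})
print(summary)
```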

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evaluations/generate-report.py`:
- Line 22: The script currently points RESULTS_DIR to "summaries" and
load_best_results() ignores legacy "evaluations/results/", causing silent
continuation and empty HTML when no summaries exist; update load_best_results()
to first look in the summaries directory (RESULTS_DIR) and if empty/found-none
fall back to the legacy "evaluations/results/" path, and if both yield 0 models
raise a hard error (raise/exit with non-zero) instead of writing an empty
report; also apply the same check/behavior to the other loading code paths
around the block referenced (lines ~57-75) so no empty-report HTML is produced.
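
The fallback-then-fail behaviour requested here might be sketched as follows. The helper name `find_result_files` and the `pilot-*.json` glob pattern are assumptions; the directory arguments stand in for the summaries directory and the legacy results directory.

```python
import sys
from pathlib import Path


def find_result_files(summaries_dir, legacy_dir):
    """Return result files from summaries/, else from the legacy results dir."""
    for directory in (Path(summaries_dir), Path(legacy_dir)):
        files = sorted(directory.glob("pilot-*.json")) if directory.is_dir() else []
        if files:
            return files
    # Nothing to report on: stop with a non-zero exit code instead of
    # silently writing an empty HTML report.
    sys.exit("No pilot result files found in summaries/ or results/; refusing to write an empty report.")
```

Passing a string to `sys.exit` prints it to stderr and exits with status 1, so a CI job that runs the report generator fails visibly.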

In `@evaluations/pilot.py`:
- Around line 317-323: The one-line if/elif/else statements inside the loop over
models violate Ruff E701; expand each branch into a multi-line block so each
conditional and its body are on separate lines. In the loop that uses models and
appends to model_ids (referencing openai_model, mistral_model, deepseek_model,
ollama_model and the f"ollama-{ollama_model}" case), convert the single-line "if
m == ...: model_ids.append(...)" and corresponding elif/else into properly
indented multi-line if / elif / else blocks, preserving the same append logic
and leaving the final model_suffix creation unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 0c639659-e0ef-447f-8cb9-9335a09ae7b7

📥 Commits

Reviewing files that changed from the base of the PR and between 4602bbe and b2bf619.

📒 Files selected for processing (12)

evaluations/.gitignore
evaluations/generate-report.py
evaluations/pilot.py
evaluations/results/pilot-20260324-174404.json
evaluations/results/pilot-20260324-190600.json
evaluations/results/pilot-20260324-192413.json
evaluations/summaries/pilot-20260324-174404.json
evaluations/summaries/pilot-20260324-190600.json
evaluations/summaries/pilot-20260324-192413.json
evaluations/summaries/pilot-20260326-070127_mistral-medium-2508.json
evaluations/summaries/pilot-20260326-073241_devstral-2512.json
evaluations/summaries/pilot-20260326-074132_mistral-small-2603.json

✅ Files skipped from review due to trivial changes (7)

evaluations/.gitignore
evaluations/summaries/pilot-20260324-192413.json
evaluations/summaries/pilot-20260326-073241_devstral-2512.json
evaluations/summaries/pilot-20260326-070127_mistral-medium-2508.json
evaluations/summaries/pilot-20260324-190600.json
evaluations/summaries/pilot-20260326-074132_mistral-small-2603.json
evaluations/summaries/pilot-20260324-174404.json

raifdmueller and others added 4 commits March 26, 2026 06:43

fix: include model name in result filename to prevent race conditions

53e8f70

feat: show exact model IDs in report instead of backend aliases

3f3b487

coderabbitai Bot reviewed Mar 26, 2026

Comment thread docs/anchor-evaluations.adoc

Comment thread evaluations/generate-report.py

Comment thread evaluations/pilot.py Outdated

Comment thread evaluations/results/pilot-20260324-190600.json Outdated

coderabbitai Bot reviewed Mar 26, 2026

Comment thread evaluations/generate-report.py

Comment thread evaluations/pilot.py

rdmueller merged commit 0e0a749 into LLM-Coding:main Mar 26, 2026
6 of 7 checks passed

This was referenced Mar 26, 2026

feat: evaluation results for 10 models (Claude/GPT/Mistral families) #361

Merged

feat: evaluation specs + results for PERT, GRASP, VSA #363

Merged

raifdmueller mentioned this pull request Mar 26, 2026

Run full evaluation and publish results #337

Closed

11 tasks

rdmueller merged 5 commits into LLM-Coding:main from raifdmueller:feat/mistral-models-evaluation
