
feat: evaluation framework with 63 anchor specs and pilot results #343

Merged
rdmueller merged 9 commits into LLM-Coding:main from raifdmueller:feat/eval-l1-questions on Mar 24, 2026

@raifdmueller raifdmueller commented Mar 24, 2026

Contributor

Summary

Complete evaluation framework for testing semantic anchors across LLMs — concept, tooling, question specs, and results.

Results (193 questions × 3 models, L1 Recognition + L2 Application)

Model | Score*
Claude Sonnet 4.6 | 99%
GPT-4o | 97%
Mistral Large | 96%

* Excluding the sanity check.

What's included

  • 63 question specs (L1 Recognition for all, L2 Application for 59 anchors)
  • 3 result files (Claude, GPT-4o, Mistral Large — 193 questions each)
  • HTML report (evaluations/report.html) — heatmap, summary cards, failure detail
  • 4 scripts: pilot.py (runner), generate-l1-specs.py, generate-l2-specs.py, fill-distractors.py
  • README.adoc with quick start guide
  • 7 backends: claude, claude-cli, claude-haiku, openai, mistral, deepseek, ollama

Key findings

  • Claude is nearly perfect (99%); its only misses are the Feynman paraphrase (a known spec issue) and the GitHub Flow paraphrase
  • GPT-4o has position bias on TDD London/Chicago School recognition
  • Mistral struggles with niche anchors (PRD 0%, LASR 25%)
  • Paraphrase scores are generally lower than anchor scores, which suggests that anchors add value beyond shorthand
  • Auto-generated specs need human review (faulty specs were caught when Claude scored 0% on them)
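
A position-bias finding like the one for GPT-4o can be checked by grouping the permuted runs by where the correct option ended up. The sketch below is only an illustration: the `(correct_letter, chosen_letter)` record format and the function name `accuracy_by_position` are assumptions, not the schema that pilot.py actually writes.

```python
from collections import defaultdict


def accuracy_by_position(runs):
    """Return the hit rate per position of the correct option.

    `runs` is an iterable of (correct_letter, chosen_letter) pairs,
    one pair per permuted run (hypothetical record format).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for correct, chosen in runs:
        totals[correct] += 1
        if chosen == correct:
            hits[correct] += 1
    return {letter: hits[letter] / totals[letter] for letter in sorted(totals)}


# A model that almost always answers "A" looks perfect when the correct
# option sits at A and poor everywhere else.
example_runs = [("A", "A"), ("B", "A"), ("C", "A"), ("D", "A"), ("A", "A"), ("B", "B")]
position_scores = accuracy_by_position(example_runs)
```

A large spread between positions on the same question points to position bias rather than missing knowledge of the anchor.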

Status: Preview

  • L4 Consistency tests exist for only 2 anchors (arc42, TDD London School)
  • Feynman paraphrase spec fixed but not re-run
  • Report published as work-in-progress

Part of EPIC #329. Closes #330, #331, #332, #335.

Test plan

  • Full run on 3 models with 193 questions each
  • Sanity check scores 0% on all models
  • HTML report generated and verified
  • Progress counter working

🤖 Generated with Claude Code

…ding#331)

Auto-generated MC recognition questions from .adoc metadata:
- generate-l1-specs.py: extracts core concepts → correct answer
- fill-distractors.py: uses Claude API to generate 3 plausible distractors
- 56 new specs + 7 hand-crafted from pilot = 63 total

Includes pilot runner, results, and README from the closed PR LLM-Coding#342.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot commented Mar 24, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a complete semantic-anchor evaluation framework: documentation, CLI scripts for generating and completing specs and distractors, a pilot runner for multiple LLM backends, a report generator, 60+ new YAML specs, and a static HTML report.

Changes

Cohort | File(s) | Summary
Configuration & README | evaluations/.gitignore, evaluations/README.adoc | .gitignore: ignores *.pyc. README.adoc: describes the semantic anchor evaluation framework, quick start, YAML spec schema, and pilot results.
Spec generators & distractors | evaluations/generate-l1-specs.py, evaluations/generate-l2-specs.py, evaluations/fill-distractors.py | New CLI scripts: generate Level 1 specs from docs/anchors/*.adoc, generate Level 2 (Application) specs via the Claude API, and fill in missing distractors with Claude. CLI flags: --dry-run, --anchor, --force.
Pilot & report | evaluations/pilot.py, evaluations/generate-report.py, evaluations/report.html | pilot.py: multiple-choice pilot runner with permutations across several backends (Anthropic/Claude, OpenAI, Ollama, Claude CLI); saves timestamped JSON results. generate-report.py produces a static HTML report; report.html is added as a sample report.
Evaluation specs | evaluations/specs/*.yaml | More than 60 new YAML spec files (various anchors, mostly Tier 3) with questions.recognition and, in many cases, questions.application (prompts, options, correct), including negative controls and a sanity check.
Helper files | evaluations/results/* (newly created during runs) | Format and location for stored pilot results (JSON); the scripts write timestamped runs there.

Sequence Diagram

sequenceDiagram
    actor User
    participant GenL1 as generate-l1-specs.py
    participant ADOC as docs/anchors/*.adoc
    participant Spec as evaluations/specs/*.yaml
    participant GenL2 as generate-l2-specs.py
    participant Fill as fill-distractors.py
    participant Pilot as pilot.py
    participant Model as LLM Backend
    participant Results as evaluations/results/*.json
    participant ReportGen as generate-report.py
    participant HTML as evaluations/report.html

    User->>GenL1: --dry-run / --force
    GenL1->>ADOC: parse anchor metadata
    ADOC-->>GenL1: anchor data
    GenL1->>Spec: create/update Level‑1 spec

    User->>GenL2: --dry-run / --anchor
    GenL2->>ADOC: load anchor context
    ADOC-->>GenL2: context
    GenL2->>Model: request Level‑2 JSON (Anthropic)
    Model-->>GenL2: JSON response
    GenL2->>Spec: write application question

    User->>Fill: --anchor / --dry-run
    Fill->>Spec: find placeholders
    Fill->>Model: request distractors
    Model-->>Fill: JSON distractors
    Fill->>Spec: write options A/C/D

    User->>Pilot: run (--model, --dry-run)
    Pilot->>Spec: load specs
    loop per question, 4 randomized permutations
        Pilot->>Model: send permuted prompt
        Model-->>Pilot: response
        Pilot->>Pilot: parse first A–D, score
    end
    Pilot->>Results: append timestamped JSON run

    User->>ReportGen: --output
    ReportGen->>Results: load pilot-*.json
    ReportGen->>ReportGen: compute best runs, heatmap, failures
    ReportGen->>HTML: write report.html
    HTML-->>User: report ready
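
The pilot loop in the diagram (four randomized permutations per question, parse the first A–D, score) can be sketched roughly as follows. This is a minimal illustration and not the code in pilot.py: the function names, the prompt wording, the spec dictionary layout, and the `ask` callable that stands in for an LLM backend are all assumptions.

```python
import random
import re

LETTERS = "ABCD"


def permute_options(options, correct, rng):
    """Shuffle the option texts and return (shuffled options, new correct letter).

    Assumes the four option texts are distinct.
    """
    texts = [options[letter] for letter in LETTERS]
    correct_text = options[correct]
    rng.shuffle(texts)
    return dict(zip(LETTERS, texts)), LETTERS[texts.index(correct_text)]


def parse_answer(response):
    """Return the first standalone letter A-D in the response, or None."""
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None


def score_question(question, ask, runs=4, seed=0):
    """Ask the same question `runs` times with shuffled options; return the hit rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        options, correct = permute_options(question["options"], question["correct"], rng)
        lines = [question["question"].strip()]
        lines += [f"{letter}: {text}" for letter, text in options.items()]
        lines.append("Answer with a single letter.")
        # `ask` is a hypothetical callable: prompt string in, model reply out.
        if parse_answer(ask("\n".join(lines))) == correct:
            hits += 1
    return hits / runs
```

Shuffling the options on every run means a model that always picks the same letter cannot score well by accident, which is what makes position bias visible in the results.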

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

  • #329 – Pilot: Manual evaluation with 5 anchors and 2 models: the PR delivers the pilot runner, specs, and results, and covers running and documenting the pilot.
  • #332 – Level 2 (application) question generation: generate-l2-specs.py and the many questions.application entries implement the tasks described in #332.
  • #331 – Auto-generation of Level 1 specs: generate-l1-specs.py automates Level 1 generation from .adoc files, which matches the goal of #331.

Possibly related PRs

  • PR #209: corresponds to the added anchors and tier metadata; the spec generators and YAML files build on the anchor metadata.
  • PR #328: both PRs implement the semantic-anchor evaluation concept; this PR delivers scripts, specs, and reporting, while PR #328 adds the accompanying documentation and website.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 59.46%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The PR title accurately describes the main change: introducing an evaluation framework with 63 anchor specs and pilot results.
Linked Issues check | ✅ Passed | The PR meets the requirements of issue #330: 5 anchors were selected, 5 Level 1 questions were created, 2 Level 2 questions were added for strong anchors, several models were tested, 4× randomized runs were performed, and the results were documented.
Out of Scope Changes check | ✅ Passed | All changes relate to the evaluation framework: spec files, scripts for generating and running tests, reports, and documentation.
Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 20

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (5)
evaluations/specs/bem-methodology.yaml-12-14 (1)

12-14: ⚠️ Potential issue | 🟡 Minor

The correct answer is too incomplete in substance for BEM.

Option B currently focuses mainly on "Block" and on benefits, and does not make Block/Element/Modifier clear enough as the core of the anchor. This weakens the measurement quality of the recognition question.

Suggested refinement of Option B
-      B: Solve CSS specificity wars, naming conflicts, and stylesheet maintainability
-        issues in large codebases; standalone component that is meaningful on its
-        own (e.g., `menu`, `button`, `header`)
+      B: CSS naming convention based on Block, Element, and Modifier (BEM) that gives
+        components an unambiguous structure, reduces specificity conflicts, and keeps
+        stylesheets maintainable in large codebases

Based on learnings: Validate semantic anchors meet quality criteria before adding: Precise (specific body of knowledge with clear boundaries), Rich (activates multiple interconnected concepts), Consistent, Attributable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/bem-methodology.yaml` around lines 12 - 14, Option B is too
focused on the Block concept and benefit; update the BEM anchor text (the option
labeled "B") to explicitly state the Block/Element/Modifier triad as the core
concept, include brief examples for each (e.g., block: "menu", element:
"menu__item", modifier: "menu--active"), and ensure the wording meets the
semantic-anchor quality criteria (Precise, Rich, Consistent, Attributable) by
making boundaries clear, activating interconnected concepts (naming,
specificity, maintainability), and keeping attribution to BEM terminology.
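
The Block/Element/Modifier triad named in the prompt can be made concrete with a small helper. This is only an illustrative sketch: the function `bem` is hypothetical, and it assumes the common `__` element and `--` modifier separators used in the examples above.

```python
def bem(block, element=None, modifier=None):
    """Build a BEM class name from its three parts (hypothetical helper)."""
    name = block
    if element:
        name += f"__{element}"   # element belongs to the block
    if modifier:
        name += f"--{modifier}"  # modifier marks a variant or state
    return name


block_name = bem("menu")
element_name = bem("menu", "item")
modifier_name = bem("menu", modifier="active")
```

All three parts show up in the resulting names, which is why a recognition option that mentions only the block undersells the anchor.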
evaluations/specs/problem-space-nvc.yaml-13-13 (1)

13-13: ⚠️ Potential issue | 🟡 Minor

Minor typo in the example statement.

Line 13 should read "I feel frustrated" instead of "i feel frustrated" so that the option stays clean and professional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/problem-space-nvc.yaml` at line 13, Replace the lowercase
example text "i feel frustrated" with the grammatically correct "I feel
frustrated" in the YAML entry (search for the exact string "i feel frustrated"
in problem-space-nvc.yaml) so the option reads with proper capitalization and
professional tone.
evaluations/specs/todotxt-flavoured-markdown.yaml-13-14 (1)

13-14: ⚠️ Potential issue | 🟡 Minor

Please make the priority notation precise.

On lines 13-14: todo.txt conventionally uses the uppercase notation (A), (B), (C). Using it makes the correct option more accurate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/todotxt-flavoured-markdown.yaml` around lines 13 - 14, the
description "uses todo.txt priority notation `(a)`, `(b)`, `(c)` where `(a)` is
highest priority" is technically incorrect; change the notation to the customary
uppercase form `(A)`, `(B)`, `(C)`, i.e. replace all occurrences of `(a)`,
`(b)`, `(c)` (and the explanatory clause, e.g. "where `(a)` is highest") with
`(A)`, `(B)`, `(C)` and "where `(A)` is highest" in the spec, so that the
priority notation is documented correctly.
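
To illustrate why the case matters: in plain todo.txt, a priority is an uppercase letter in parentheses at the very start of the line, followed by a space. The sketch below follows that plain todo.txt rule; whether the todo.txt-flavoured Markdown anchor uses exactly the same rule is an assumption, and `priority_of` is a hypothetical helper.

```python
import re

# Uppercase letter in parentheses at the start of the line, then a space.
PRIORITY = re.compile(r"^\(([A-Z])\) ")


def priority_of(task_line):
    """Return the priority letter of a todo.txt task line, or None."""
    match = PRIORITY.match(task_line)
    return match.group(1) if match else None


high = priority_of("(A) Call the supplier")
lowercase = priority_of("(a) Call the supplier")
```

Under this rule a lowercase `(a)` is not treated as a priority at all, so the lowercase spelling in the correct option describes something a todo.txt parser would ignore.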
evaluations/specs/c4-diagrams.yaml-11-13 (1)

11-13: ⚠️ Potential issue | 🟡 Minor

The correct answer is worded too imprecisely.

The wording on lines 11–13 ("Four levels of abstraction; : system in its environment …") contains a typo and covers the concept only partially. This weakens the item even though correct: B is set.

Based on learnings: Validate semantic anchors meet quality criteria before adding: Precise, Rich, Consistent, Attributable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/c4-diagrams.yaml` around lines 11 - 13, Item B currently
has a typo and an imprecise definition for the C4 model; update the YAML entry
for symbol "B" so it reads a precise enumeration like "Four levels of
abstraction: System Context (system in its environment), Container
(apps/services), Component (high‑level modules), Code (classes/implementation)"
(ensure the stray ":" is removed), keep `correct: B` unchanged, and make the
wording rich and consistent with the C4 terminology so the semantic anchor meets
precision, richness, and consistency criteria.
evaluations/specs/ears-requirements.yaml-11-11 (1)

11-11: ⚠️ Potential issue | 🟡 Minor

Option B contains unnecessary extra quotation marks.

Line 11 wraps the entire content in an additional pair of ". This is not a syntax error, but it is an unnecessary visual cue in exactly the correct option.

Suggestion
-      B: '"The <system> shall <requirement>"; "when <trigger> the <system> shall <requirement>"'
+      B: 'The <system> shall <requirement>; when <trigger> the <system> shall <requirement>'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/ears-requirements.yaml` at line 11, the option for key B
contains doubled outer quotation marks; open the definition of B and remove the
outer `"` so that only the inner, intended phrases remain (e.g. change the value
assigned to B from '"The <system> shall <requirement>"; "when <trigger> the
<system> shall <requirement>"' to the version without the additional
surrounding quotation marks), check the string on the YAML line with key B, and
make sure that only the intended quotes inside the phrases remain.
🧹 Nitpick comments (20)
evaluations/specs/lasr.yaml (1)

1-22: Optional: additionally document the anchor quality for LASR up front.

Not a blocker in the YAML itself. Optionally, add a short reference to a pre-validation (Recognition/Accuracy/Depth/Specificity) so that the later evaluation is better justified.
Based on learnings: Test semantic anchors before adding by evaluating LLM response to 'What concepts do you associate with <semantic anchor name>?' considering Recognition, Accuracy, Depth, and Specificity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/lasr.yaml` around lines 1 - 22, Add an optional
pre-validation note for the semantic anchor "lasr" by extending the YAML around
anchor: lasr (and related questions.recognition) with a small anchor_quality or
validation field that records the recommended pre-check: evaluate LLM responses
to "What concepts do you associate with <semantic anchor name>?" and capture
Recognition, Accuracy, Depth, and Specificity; include the test prompt, the four
quality criteria, and a short accepted/threshold guideline so later evaluation
logic can reference this metadata during scoring.
evaluations/specs/cqrs.yaml (1)

13-16: Make the CQRS description more precise (name the read/write models explicitly).

The current correct option is close, but it focuses heavily on CQS. For better discrimination it should state clearly that reads and writes go through separate models/interfaces.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/cqrs.yaml` around lines 13 - 16, the description under "B:
Bertrand Meyer's principle" / the CQS entry is too focused on CQS; update it to
explicitly describe CQRS by stating that reads and writes use separate
models/interfaces (a Read model/Query-side and a Write model/Command-side), that
reads do not perform state changes and writes encapsulate intent in immutable
command objects, and note that the two sides can have different representations,
validation and storage; replace or augment the current CQS wording to mention
"separate Read and Write models/interfaces" and ensure the term CQRS appears so
readers understand it's distinct from simple CQS.
evaluations/specs/socratic-method.yaml (1)

5-7: Switch the question scalar to a clean block format.

The current form on lines 5–7 produces unnecessary blank lines/whitespace in the prompt. Please switch to an unambiguous block scalar so the evaluation prompts stay stable.

Suggested diff
-    question: 'Which of the following best describes "Socratic Method"?
-
-      '
+    question: |
+      Which of the following best describes "Socratic Method"?

Based on learnings: Validate semantic anchors meet quality criteria before adding: Precise, Rich, Consistent, Attributable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/socratic-method.yaml` around lines 5 - 7, the YAML scalar
for the "question" key currently uses inline quotes with additional line
breaks; switch to an unambiguous block scalar (pipe |) for the "question" field
and remove the superfluous blank lines/trailing whitespace so the content reads
exactly as intended (e.g. | followed by the question line "Which of the
following best describes \"Socratic Method\"?"), make sure the block scalar is
indented correctly, and keep the exact prompt text without additional blank
lines.
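
Independent of which scalar style a spec uses, the runner could also normalize question text when it loads a spec, so that stray blank lines never reach the model. A minimal sketch, assuming a hypothetical helper `normalize_question` that is not part of pilot.py; note that it also drops intentional blank lines inside a question, which may not be wanted for every spec.

```python
def normalize_question(text):
    """Strip each line and drop empty ones so formatting noise cannot alter the prompt."""
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


noisy = 'Which of the following best describes "Socratic Method"?\n\n'
clean = normalize_question(noisy)
```

Normalizing at load time keeps prompts stable even when auto-generated specs are written with different quoting styles.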
evaluations/specs/semantic-versioning.yaml (1)

5-7: Remove the unnecessary blank line in the question text.

The question text contains a superfluous blank line (lines 6-7).

✨ Suggested fix
-    question: 'Which of the following best describes "Semantic Versioning (SemVer)"?
-
-      '
+    question: 'Which of the following best describes "Semantic Versioning (SemVer)"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/semantic-versioning.yaml` around lines 5 - 7, Remove the
extraneous blank line inside the YAML value for the question key so the question
string is a single continuous line; edit the question field (question: 'Which of
the following best describes "Semantic Versioning (SemVer)"?') to eliminate the
empty line between lines 6–7 while preserving the surrounding quotes and
indentation.
evaluations/specs/linddun.yaml (1)

5-7: Remove the unnecessary blank line in the question text.

The question text contains a superfluous trailing blank line (lines 6-7).

✨ Suggested fix
-    question: 'Which of the following best describes "LINDDUN"?
-
-      '
+    question: 'Which of the following best describes "LINDDUN"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/linddun.yaml` around lines 5 - 7, the question string under
the YAML key question contains a superfluous trailing blank line; remove the
empty line so that the value of question no longer has a trailing blank line
(edit the question value in evaluations/specs/linddun.yaml and delete the empty
line after 'Which of the following best describes "LINDDUN"?'), and make sure
the quotation marks and indentation stay unchanged.
evaluations/specs/spc.yaml (2)

11-12: Option B is grammatically incomplete.

The correct answer (Option B) uses fragmented structures without verbs: "Systematic statistical monitoring of running processes; inherent, random fluctuation — stable and predictable". This makes it harder to understand and looks unprofessional.

📝 Suggested improvement for readability
-      B: Systematic statistical monitoring of running processes; inherent, random
-        fluctuation — stable and predictable
+      B: A methodology for systematic statistical monitoring of processes to distinguish
+        between inherent random variation (common cause) and special cause variation,
+        ensuring processes remain stable and predictable
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/spc.yaml` around lines 11 - 12, Option B (the value under
key "B") is a fragmented, verb-less phrase; rewrite it as a complete,
grammatically correct sentence or concise noun phrase to improve clarity and
professionalism — for example convert "Systematic statistical monitoring of
running processes; inherent, random fluctuation — stable and predictable" into a
full phrase like "Systematic statistical monitoring of running processes,
showing inherent random fluctuations that are stable and predictable" (update
the string value for key B in spc.yaml accordingly).

5-7: Remove the unnecessary blank line in the question text.

The question text contains a superfluous blank line (lines 6-7).

✨ Suggested fix
-    question: 'Which of the following best describes "SPC (Statistical Process Control)"?
-
-      '
+    question: 'Which of the following best describes "SPC (Statistical Process Control)"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/spc.yaml` around lines 5 - 7, The YAML question value for
the key question contains an unnecessary blank line; edit the question string
(the question: field) to remove the extra empty line and trailing newline so the
text is a single contiguous paragraph—i.e., trim internal/leading/trailing blank
lines from the question value and ensure the question: string reads "Which of
the following best describes \"SPC (Statistical Process Control)\"?" without the
blank line.
evaluations/specs/tdd-chicago-school.yaml (1)

5-7: Remove the unnecessary blank line in the question text.

The question text contains a superfluous blank line (lines 6-7).

✨ Suggested fix
-    question: 'Which of the following best describes "TDD, Chicago School"?
-
-      '
+    question: 'Which of the following best describes "TDD, Chicago School"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/tdd-chicago-school.yaml` around lines 5 - 7, the question
line under the YAML key question contains an unnecessary blank line; remove the
superfluous blank line inside the question string (in the entry question:
'Which of the following best describes "TDD, Chicago School"') so the text is
one continuous line, and remove any superfluous spaces or line breaks at the
end of the string.
evaluations/specs/bluf.yaml (1)

5-7: Remove the unnecessary blank line in the question text.

The question text contains a superfluous blank line (lines 6-7).

✨ Suggested fix
-    question: 'Which of the following best describes "BLUF (Bottom Line Up Front)"?
-
-      '
+    question: 'Which of the following best describes "BLUF (Bottom Line Up Front)"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/bluf.yaml` around lines 5 - 7, the question line under the
YAML key question contains a superfluous blank line; open the YAML entry with
the key question in evaluations/specs/bluf.yaml and remove the empty line
between "Which of the following best describes \"BLUF (Bottom Line Up
Front)\"?" and the closing string delimiter, so that the question value is one
continuous line (key: question, search term: "BLUF (Bottom Line Up Front)").
evaluations/specs/fowler-patterns.yaml (2)

13-14: Inconsistent capitalization of the pattern names in Option B.

The pattern names in Option B are capitalized inconsistently. The first four patterns are correctly capitalized (Transaction Script, Domain Model, etc.), but the following patterns are lowercase ("table data gateway" instead of "Table Data Gateway"). In Fowler's PEAA, pattern names are treated as proper nouns and capitalized.

✨ Suggested change for consistency
-      B: Transaction Script, Domain Model, Table Module, Service Layer; table data
-        gateway, row data gateway, active record, data mapper
+      B: Transaction Script, Domain Model, Table Module, Service Layer; Table Data
+        Gateway, Row Data Gateway, Active Record, Data Mapper
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/fowler-patterns.yaml` around lines 13 - 14, Option B lists
pattern names inconsistently: the terms after "Service Layer" are lowercase;
change the entries in the Option B line so that all pattern names are
capitalized and appear consistently as proper nouns (e.g. "Table Data Gateway",
"Row Data Gateway", "Active Record", "Data Mapper") and keep the existing
separators/commas unchanged, so that the line reads consistently as "B:
Transaction Script, Domain Model, Table Module, Service Layer; Table Data
Gateway, Row Data Gateway, Active Record, Data Mapper".

5-8: Remove the unnecessary blank line in the question text.

The question text contains a superfluous blank line (lines 7-8).

✨ Suggested fix
-    question: 'Which of the following best describes "Patterns of Enterprise Application
-      Architecture (PEAA)"?
-
-      '
+    question: 'Which of the following best describes "Patterns of Enterprise Application Architecture (PEAA)"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/fowler-patterns.yaml` around lines 5 - 8, the question
under the YAML key question contains a superfluous blank line; edit the value
of question in evaluations/specs/fowler-patterns.yaml (the single "question"
entry) and remove the intermediate line so that the text becomes one
continuous line/block without additional empty lines or leading spaces;
afterwards check the YAML indentation level so the formatting stays
unchanged.
evaluations/specs/domain-driven-design.yaml (1)

5-8: Remove the unnecessary blank line in the question text.

The question text contains a superfluous trailing blank line (lines 7-8). This reduces readability and is inconsistent.

✨ Suggested fix for the formatting problem
-    question: 'Which of the following best describes "Domain-Driven Design according
-      to Evans"?
-
-      '
+    question: 'Which of the following best describes "Domain-Driven Design according to Evans"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/domain-driven-design.yaml` around lines 5 - 8, the question
line under the "question" key contains a superfluous trailing blank line;
remove the additional newline inside the string value for "question" so the
text ends directly after the phrase "According to Evans" (or rather after the
closing quotation mark) without the blank line, e.g. by collapsing the
multi-line YAML scalar so there is no trailing blank line in the "question"
value.
evaluations/specs/cynefin-framework.yaml (1)

5-7: The question string contains unnecessary blank lines/whitespace.

Lines 5-7 produce a prompt with additional line noise. Please store it as a clean single line or as a | block without an empty line, so that all specs are consistent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/cynefin-framework.yaml` around lines 5 - 7, The "question"
YAML value currently contains an extra blank line and trailing whitespace; edit
the question key to remove the empty line so the prompt is a clean single-line
string or convert it to a block scalar (use |) without any leading empty line,
ensuring there are no embedded blank lines or trailing spaces in the 'question'
value.
evaluations/specs/user-story-mapping.yaml (1)

11-12: Please make the correct Option B somewhat more precise.

Lines 11-12 are rather terse as ground truth. Add the core of story mapping (a backbone of activities, with tasks/stories below it, often prioritized by release) so that the recognition question discriminates more reliably.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/user-story-mapping.yaml` around lines 11 - 12, extend the
description for option "B:" so that it precisely captures the core of story
mapping: mention that a story map forms a backbone of higher-level activities
(user activities), that the concrete tasks/stories are arranged below it, that
these are often organized horizontally by sequence/workflow and vertically by
level of detail/priority, and that items are typically prioritized for
releases or iterations (MVP/release slices); replace the existing terse wording
("Horizontal arrangement of user activities; high-level tasks users perform")
with this extended, clearer wording so that the recognition question
discriminates more reliably.
evaluations/specs/chain-of-thought.yaml (1)

11-12: Please word Option B more neutrally so that CoT is not narrowed to visible chains of thought.

The current wording ties CoT closely to explicit disclosure of intermediate steps. For anchor recognition, a prompt- or method-oriented description would be more robust.

Suggested wording
-      B: Explicitly show intermediate reasoning steps before reaching a conclusion;
-        make the thought process visible, not just the final answer
+      B: Encourage step-by-step reasoning by decomposing the problem into intermediate
+        steps before producing the final answer
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/chain-of-thought.yaml` around lines 11 - 12, the wording of
the entry "B: Explicitly show intermediate reasoning steps before reaching a
conclusion; make the thought process visible, not just the final answer" is too
narrow and forces CoT into explicitly visible intermediate steps; change the
text so it is neutral about the form of presentation (e.g. encourage
chain-of-thought-style reasoning or suitable method/prompt-level descriptions
for anchor recognition) without prescribing visible chains of thought; update
the YAML entry "B" accordingly (the replacement wording should explain purpose
and flexibility, e.g. "Encourage chain-of-thought style reasoning or
method-level description of intermediate steps when helpful for
verification/anchor detection").
evaluations/generate-l1-specs.py (2)

60-61: Missing error handling in tier parsing.

int() can raise a ValueError if the tier value is invalid or empty.

🛡️ More robust variant
         elif line.startswith(":tier:"):
-            result["tier"] = int(line.split(":tier:")[1].strip())
+            tier_str = line.split(":tier:")[1].strip()
+            result["tier"] = int(tier_str) if tier_str.isdigit() else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/generate-l1-specs.py` around lines 60 - 61, The tier parsing
currently does result["tier"] = int(line.split(":tier:")[1].strip()) and can
raise ValueError for empty/invalid values; update the block (the branch checking
line.startswith(":tier:") that sets result["tier"]) to catch ValueError (and
optionally TypeError), validate the extracted token before int() (e.g. check
non-empty and numeric), and on failure set a safe default (None or skip setting)
and/or log a clear warning so parsing doesn't crash the script.
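
A minimal sketch of the defensive variant described above, assuming the parser receives one line at a time. The helper name `parse_tier` and the logger are hypothetical and not taken from the PR. It uses try/except rather than `str.isdigit()`, because `isdigit()` also accepts characters such as `²` that `int()` rejects.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def parse_tier(line: str) -> Optional[int]:
    """Return the integer after ':tier:', or None if it is missing or invalid."""
    # Hypothetical helper: takes everything after the first ':tier:' marker.
    token = line.split(":tier:", 1)[1].strip()
    try:
        return int(token)
    except ValueError:
        # Empty or non-numeric values end up here instead of crashing the script.
        logger.warning("Ignoring invalid :tier: value %r", token)
        return None
```

With such a helper, the branch would reduce to `result["tier"] = parse_tier(line)`.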

149-156: Redundant condition in should_skip.

The check anchor_id not in SKIP_EXACT on line 154 is redundant, because lines 151-152 already return True when anchor_id is in SKIP_EXACT.

♻️ Suggested simplification
 def should_skip(anchor_id):
     """Check if anchor should be skipped."""
     if anchor_id in SKIP_EXACT:
         return True
     for prefix in SKIP_PREFIXES:
-        if anchor_id.startswith(prefix) and anchor_id not in SKIP_EXACT:
+        if anchor_id.startswith(prefix):
             return True
     return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/generate-l1-specs.py` around lines 149 - 156, the function
should_skip contains a redundant condition; remove the check "and anchor_id not
in SKIP_EXACT" from the loop over SKIP_PREFIXES and keep the logic as follows:
first check whether anchor_id is in SKIP_EXACT (as it does now), and otherwise,
inside the for loop, only check anchor_id.startswith(prefix) for the entries in
SKIP_PREFIXES; the behavior stays the same, but the redundant condition is
eliminated.
evaluations/pilot.py (2)

409-411: Style improvement: use iterable unpacking for lists.

♻️ More modern syntax
                 if lang:
-                    variants = variants + [lang]
+                    variants = [*variants, lang]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/pilot.py` around lines 409 - 411, The code currently builds up
the variants list using an explicit concatenation (variants = variants + [lang])
after obtaining lang from cons.get("language_variant"); replace that with a more
idiomatic operation (e.g., append or in-place extend/spread) to avoid creating a
new list each time—use variants.append(lang) or variants += [lang] (or variants
= [*variants, lang] if you prefer spread semantics) to update the variants list
in place where lang is truthy.
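
One thing worth checking before picking a variant: `variants.append(lang)` and `variants += [lang]` mutate the existing list, while `variants + [lang]` and `[*variants, lang]` build a new one. The sketch below illustrates the difference with hypothetical helper names; whether `variants` is shared with other code in `pilot.py` is not visible from this comment, so treat it as an illustration only.

```python
def add_variant_copy(variants, lang):
    """Non-mutating: builds a new list, the caller's list stays untouched."""
    if lang:
        variants = [*variants, lang]
    return variants


def add_variant_in_place(variants, lang):
    """Mutating: appends to the very list object the caller passed in."""
    if lang:
        variants.append(lang)
    return variants


shared = ["en"]
copied = add_variant_copy(shared, "de")
shared_after_copy = list(shared)  # snapshot: still only "en"

mutated = add_variant_in_place(shared, "de")  # now `shared` itself has changed
```

If the list is reused across iterations or questions, the non-mutating form keeps the current behavior, and the in-place form does not.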

256-262: Unused variable model_id.

The variable model_id from the tuple unpacking is not used.

♻️ Suppress the linter warning
         try:
-            response_text, model_id = call_fn(prompt)
+            response_text, _model_id = call_fn(prompt)
         except Exception as e:
             response_text = f"ERROR: {e}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/pilot.py` around lines 256 - 262, The tuple-unpacked variable
model_id returned from call_fn is unused and triggers a linter warning; change
the unpacking in the try block to either discard model_id (e.g., assign to _ or
_model_id) or adjust call_fn to return only response_text, and make the same
change in the except/handling flow so the variable is consistently ignored
(reference the call to call_fn and the local name model_id to locate the
change).
evaluations/fill-distractors.py (1)

74-82: JSON parsing could fail on unexpected responses.

The extraction from Markdown code blocks (lines 76-80) is fragile: if the response contains several ``` blocks or the JSON is malformed, json.loads (line 82) can raise an error, which is then caught by the outer exception handling in main().

For more robust processing, an explicit try/except with a specific error message could be added.

♻️ More robust JSON extraction
     text = response.content[0].text.strip()
     # Parse JSON from response (might be wrapped in ```json ... ```)
     if "```" in text:
-        text = text.split("```")[1]
-        if text.startswith("json"):
-            text = text[4:]
-        text = text.strip()
+        parts = text.split("```")
+        for part in parts[1::2]:  # odd indices are code blocks
+            # removeprefix (Python 3.9+) drops the tag; lstrip("json") would strip a character set
+            cleaned = part.strip().removeprefix("json").strip()
+            if cleaned.startswith("{"):
+                text = cleaned
+                break
 
-    return json.loads(text)
+    try:
+        return json.loads(text)
+    except json.JSONDecodeError as e:
+        raise ValueError(f"Could not parse JSON: {e}\nResponse: {text[:200]}") from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/fill-distractors.py` around lines 74 - 82, The current extraction
of JSON from response.content[0].text is brittle (splitting on "```" and taking
the first block) and then calls json.loads directly; update the extraction in
the function handling the response to iterate over code-block parts (e.g., parts
= text.split("```") and inspect odd-index parts), strip any leading "json" from
a block and select the first block that looks like JSON (e.g., startswith("{")
or "["), then wrap the final json.loads(text) call in a try/except catching
json.JSONDecodeError and raise a clearer ValueError that includes a short
preview of the problematic text; reference the response.content[0].text
processing and json.loads call so the main() exception handler will surface a
more informative error.
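
The extraction described above can be sketched as a standalone helper. This is a sketch under assumptions rather than the code in `fill-distractors.py`: the function name `extract_json` is hypothetical, and it takes the already-extracted response text as a plain string.

```python
import json


def extract_json(text: str):
    """Return the first parseable JSON value found in a model response."""
    text = text.strip()
    candidates = []
    if "```" in text:
        # Odd-indexed parts of the split lie between a pair of fences.
        candidates.extend(text.split("```")[1::2])
    candidates.append(text)  # fall back to the unfenced response
    for candidate in candidates:
        cleaned = candidate.strip()
        if cleaned.startswith("json"):
            # Drop the language tag as a prefix, not as a character set.
            cleaned = cleaned[len("json"):].strip()
        if cleaned.startswith(("{", "[")):
            try:
                return json.loads(cleaned)
            except json.JSONDecodeError:
                continue  # try the next block
    raise ValueError(f"No parseable JSON found in response: {text[:200]!r}")
```

Raising `ValueError` with a short preview keeps the failure visible to whatever outer handler wraps the call.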
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evaluations/results/pilot-20260324-122900.json`:
- Around line 14-63: The results JSON is treating transport errors
(raw_response: "ERROR: HTTP Error 404: Not Found") as normal zero-score runs;
update the result-generation logic to detect transport/HTTP failures (e.g., when
raw_response contains HTTP error codes or the client throws a transport
exception) and mark those entries with run_status: "failed" and error_type:
"transport/http" (or similar); ensure such failed entries are excluded from
computing the top-level score and from ranking/aggregation so that the "score"
field and any aggregations only include runs with run_status: "ok" (inspect and
change the code that populates the "results" array and the score computation to
skip items with run_status != "ok").

In `@evaluations/results/pilot-20260324-132552.json`:
- Around line 1-1046: The output file shows empty raw_response fields, meaning
API errors weren't captured; update the evaluation flow where the OpenAI call is
made (the code that sets the "raw_response" field for models/openai and uses
config.openai_model) to wrap the API request in a try/catch, populate
raw_response with the caught error message/stack on failure, and ensure the
error is persisted to the results object before writing the JSON file (e.g., in
the function that constructs/persists each model result such as the API-call
handler or result-aggregation routine that currently assigns raw_response).
Ensure the catch also marks the individual result as failed so downstream
scoring sees the error.

In `@evaluations/specs/adr-according-to-nygard.yaml`:
- Around line 10-12: The current correct option for the "ADR according to
Nygard" anchor is too general (option B); update the option text for the ADR
anchor so it explicitly lists the Nygard ADR sections — e.g., "Status, Context,
Decision, Consequences" — and make that option the clearly correct choice for
the anchor; locate the anchor "ADR according to Nygard" and replace the generic
B description with the explicit Nygard-format wording (mentioning Status,
Context, Decision, Consequences) so the spec matches the review comment.

In `@evaluations/specs/atam.yaml`:
- Around line 13-26: The answer key is wrong: update the YAML so the `correct`
field references option C (the ATAM description) rather than B; locate the block
containing options B/C/D and the `correct: B` entry and change it to `correct:
C` so scoring aligns with the ATAM description.

In `@evaluations/specs/clean-architecture.yaml`:
- Around line 11-12: The option labeled "B" contains a semantic contradiction:
the phrase "Dependencies only point inward" conflicts with the arrow direction
"entities → use cases → interface adapters → frameworks & drivers" (which points
outward). Update the text for the "B" option so the arrow direction matches the
rule (either reverse the arrow to "frameworks & drivers → interface adapters →
use cases → entities" or reword the sentence to match the current arrow), and
ensure the semantic anchor for this item meets quality criteria (precise, rich,
consistent, attributable) before committing the change; locate and edit the
exact string starting with "B: Dependencies only point inward; entities → use
cases → interface adapters → frameworks & drivers".

In `@evaluations/specs/control-chart-shewhart.yaml`:
- Line 10: Update the Shewhart control chart answer text to be more precise:
replace or revise the option currently reading "Measured value plotted over
time; process mean" to explicitly mention plotted sample/statistic over time
with center line (process mean) and the statistical control limits (Upper
Control Limit and Lower Control Limit, UCL/LCL) used to detect out-of-control
signals; ensure the wording in the control-chart-shewhart.yaml entry for option
B references the center line and UCL/LCL (and that UCL/LCL are described as
derived from the process standard error) so the semantic anchor is precise and
testable.

In `@evaluations/specs/conventional-commits.yaml`:
- Around line 11-12: Option B's text ("B: '<type>[!][(optional scope)]:
<description> + optional body/footer; common types'") is incomplete and not
self-contained; update this YAML entry so it fully specifies the conventional
commit format and enumerates typical types (e.g., feat, fix, docs, style,
refactor, perf, test, chore) and explains optional scope, breaking-change marker
(!), and optional body/footer in one sentence so the evaluation can verify both
format and examples.

In `@evaluations/specs/cynefin-framework.yaml`:
- Line 11: Option B currently truncates and misformats the Cynefin description
("B: 'Five domains; : best practices apply, sense-categorize-respond'"); remove
the stray "; :" artifact and replace the value for key B with a concise, correct
formulation that lists the five Cynefin domains and the recommended decision
approach per domain (e.g., "B: 'Five domains — Obvious (best practice),
Complicated (expert analysis), Complex (probe-sense-respond), Chaotic
(act-sense-respond), Disorder (context undetermined)'"), ensuring the text is
precise, complete and consistently formatted for use as a ground-truth item.

In `@evaluations/specs/diataxis-framework.yaml`:
- Line 11: The mapping value for key "B" contains a syntax/text error ("; :")
and a vague description of Diátaxis; update the string for "B" to a clean,
precise description that names the four Diátaxis types and clarifies the
"learning-oriented" anchor (e.g., "Four documentation types: tutorials
(learning-oriented), how‑tos (task‑oriented), reference, and explanations") so
the semantic anchor is precise, rich, and consistent with the Diátaxis
framework.

In `@evaluations/specs/gherkin.yaml`:
- Around line 9-21: The answer key is wrong: the Gherkin definition is option A
(Given-When-Then), not B; update the `correct` key in the file so that it reads
correct: A instead of correct: B, check the block with the option definitions
(A/B/C/D) and the `correct` field to make sure the right option is referenced,
and confirm that existing tests/validations still work with the changed
mapping.

In `@evaluations/specs/github-flow.yaml`:
- Line 11: The Ground Truth entry labeled "B: Workflow steps" is too
vague—replace it with a precise, discriminative list of the central GitHub Flow
steps (e.g., "1. Create a feature branch; 2. Commit changes locally; 3. Open a
pull request; 4. Request review and address feedback; 5. Merge into main; 6.
Deploy from main"), ensuring the YAML value for the key "B" contains those
explicit steps, is consistent in phrasing, and attributable to GitHub Flow so
the item becomes evaluable.

In `@evaluations/specs/iec-61508-sil-levels.yaml`:
- Line 11: The option value under key "B" currently reads "Four Safety Integrity
Levels; : 10^-2^ ≤ pfd < 10^-1^ (tolerable risk reduction)" which is
inconsistent (claims four levels but shows only one); update the "B" value to be
semantically consistent by either (a) enumerating all four SIL ranges for
SIL1–SIL4 with their correct pfd intervals, or (b) change the wording to
indicate this is specifically SIL1 and keep the pfd range (10^-2 ≤ pfd < 10^-1)
and adjust the label to "SIL1" and remove the "Four Safety Integrity Levels"
phrase so the anchor is precise, rich, and consistent.

In `@evaluations/specs/invest.yaml`:
- Around line 13-16: Option B currently only documents Independent and
Negotiable; update the Option B text so it lists and briefly describes all six
INVEST criteria (Independent, Negotiable, Valuable, Estimable, Small, Testable)
as the correct choice, and ensure the wording follows the project's
semantic-anchor rules (precise boundaries, rich interconnections, consistency,
and attribution) when adding each criterion.

In `@evaluations/specs/linddun.yaml`:
- Around line 12-14: Option B currently only covers Linkability and
Identifiability; update the Option B description in linddun.yaml (the "Option B"
block) to list and briefly define all seven LINDDUN categories: Linkability,
Identifiability, Non-repudiation, Detectability, Disclosure of information,
Unawareness, and Non-compliance, using concise one-line definitions for each so
the threat model fully reflects the LINDDUN acronym.

In `@evaluations/specs/madr.yaml`:
- Around line 11-12: The current entry for the MADR anchor is too generic
(option "B"); update the MADR anchor definition to list MADR-specific,
attributable characteristics: mention the markdown-based ADR template and
required sections (e.g., Title/Status/Context/Decision/Consequences), use of
front-matter metadata (id, date, author, status), repository-based tracking
(file per ADR with consistent naming), and adherence to the MADR
template/templating tool; replace the vague text in option "B" with this precise
set of attributes referencing the MADR anchor so scores reflect those unique
traits.

In `@evaluations/specs/moscow.yaml`:
- Around line 12-15: The MoSCoW description in moscow.yaml is incomplete: the
categories "Could have" and "Won’t have" are missing (currently only the
entries under "B:" and "C:" are worded). Add the two missing categories
explicitly to the YAML with short definitions for "Could have" (desirable,
non-critical features) and "Won’t have" (deliberately excluded items) so that
all four MoSCoW categories (Must, Should, Could, Won’t) are clearly covered;
keep the existing labels ("B:", "C:") or their equivalent keys consistent and
word the new text in line with the existing descriptions.

In `@evaluations/specs/owasp-top-10.yaml`:
- Around line 12-15: The current "B:" and "C:" entries in owasp-top-10.yaml
lump together two concrete categories; instead, replace the description there
with a concise definition of the OWASP Top 10 as a periodically updated list of
top risks for web applications (e.g. "OWASP Top 10: a regularly updated list of
the ten most critical security risks for web applications"), remove the
mixed-in risk details from the "B:"/"C:" fields, and move concrete risk or
encryption descriptions into separate entries for the respective risk
categories if needed.

In `@evaluations/specs/semantic-versioning.yaml`:
- Line 11: Option B currently has a syntax error and an incomplete description:
replace the line "B: 'Version format; : incompatible api changes (breaking
changes)'" with a corrected, complete description that fixes the stray
colon/semicolon, capitalizes "API", and explains SemVer parts (MAJOR for
incompatible/ breaking changes, MINOR for backwards-compatible feature
additions, PATCH for backwards-compatible bug fixes); update the value for the
symbol B in semantic-versioning.yaml accordingly so it reads like a single,
clear sentence describing MAJOR.MINOR.PATCH and their meanings.

In `@evaluations/specs/stride.yaml`:
- Around line 12-15: Option B only covers Spoofing and Tampering; update the
entry labeled "B:" in the STRIDE spec to name all six STRIDE threats (Spoofing,
Tampering, Repudiation, Information disclosure, Denial of service, Elevation of
privilege), and for each provide a one-line mitigation mapping (e.g., strong
authentication for Spoofing, integrity controls/digital signatures for
Tampering, non-repudiation/audit logging for Repudiation, encryption/access
controls for Information disclosure, rate-limiting/fault tolerance for Denial of
service, and least privilege/privilege separation for Elevation of privilege) so
the anchor is precise, complete, and actionable.

In `@evaluations/specs/swot.yaml`:
- Around line 12-15: The definition under "B:" describes only internal factors
(Strengths/Weaknesses) and does not mention external factors
(Opportunities/Threats); update the "B:" entry so that it is either clearly
labeled as "internal strengths/weaknesses" or, better, contains the full SWOT
definition (Strengths, Weaknesses, Opportunities, Threats), or move/extend the
content so that all four components (S, W, O, T) are covered correctly; also
check neighboring entries such as "C:" to make sure the term boundaries stay
consistent.
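
The first two inline comments (failed runs counted as zero scores, empty `raw_response`) could be handled along these lines. This is a sketch, not the actual `pilot.py` logic: `run_status` and `error_type` come from the suggestion above, while `is_correct`, `classify_run`, `score`, and the `empty_response` label are hypothetical names. It relies on the `ERROR:` prefix seen in the reported `raw_response`, and labeling every such entry `transport/http` is a simplifying assumption.

```python
from typing import Optional


def classify_run(raw_response: str) -> dict:
    """Label a response as ok or failed so failures stay out of the score."""
    if raw_response.startswith("ERROR:"):
        return {"run_status": "failed", "error_type": "transport/http"}
    if not raw_response.strip():
        # An empty response means nothing was evaluated.
        return {"run_status": "failed", "error_type": "empty_response"}
    return {"run_status": "ok", "error_type": None}


def score(results: list) -> Optional[float]:
    """Share of correct answers among runs with run_status == 'ok'."""
    ok_runs = [r for r in results if r.get("run_status") == "ok"]
    if not ok_runs:
        return None  # no valid runs: report "no score", not 0%
    return sum(1 for r in ok_runs if r["is_correct"]) / len(ok_runs)


raw_results = [
    {"raw_response": "B", "is_correct": True},
    {"raw_response": "ERROR: HTTP Error 404: Not Found", "is_correct": False},
    {"raw_response": "A", "is_correct": False},
]
annotated = [{**r, **classify_run(r["raw_response"])} for r in raw_results]
overall = score(annotated)  # computed over the two valid runs only
```

Returning `None` when every run failed keeps a broken backend from showing up as a model that scored 0%.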

---

Minor comments:
In `@evaluations/specs/bem-methodology.yaml`:
- Around line 12-14: Option B is too focused on the Block concept and benefit;
update the BEM anchor text (the option labeled "B") to explicitly state the
Block/Element/Modifier triad as the core concept, include brief examples for
each (e.g., block: "menu", element: "menu__item", modifier: "menu--active"), and
ensure the wording meets the semantic-anchor quality criteria (Precise, Rich,
Consistent, Attributable) by making boundaries clear, activating interconnected
concepts (naming, specificity, maintainability), and keeping attribution to BEM
terminology.

In `@evaluations/specs/c4-diagrams.yaml`:
- Around line 11-13: Item B currently has a typo and an imprecise definition for
the C4 model; update the YAML entry for symbol "B" so it reads a precise
enumeration like "Four levels of abstraction: System Context (system in its
environment), Container (apps/services), Component (high‑level modules), Code
(classes/implementation)" (ensure the stray ":" is removed), keep `correct: B`
unchanged, and make the wording rich and consistent with the C4 terminology so
the semantic anchor meets precision, richness, and consistency criteria.

In `@evaluations/specs/ears-requirements.yaml`:
- Line 11: The option for key B contains doubled outer quotation marks; open
the definition for B and remove the outer `"` so that only the inner, intended
phrases remain (e.g. change the value assigned to B from '"The <system> shall
<requirement>"; "when <trigger> the <system> shall <requirement>"' to the
version without the additional surrounding quotation marks), check the string
on the YAML line with key B, and make sure only the intended quotes inside the
phrases remain.

In `@evaluations/specs/problem-space-nvc.yaml`:
- Line 13: Replace the lowercase example text "i feel frustrated" with the
grammatically correct "I feel frustrated" in the YAML entry (search for the
exact string "i feel frustrated" in problem-space-nvc.yaml) so the option reads
with proper capitalization and professional tone.

In `@evaluations/specs/todotxt-flavoured-markdown.yaml`:
- Around line 13-14: The description "uses todo.txt priority notation `(a)`,
`(b)`, `(c)` where `(a)` is highest priority" is technically incorrect; change
the notation to the usual uppercase form `(A)`, `(B)`, `(C)`, i.e. replace all
occurrences of `(a)`, `(b)`, `(c)` (and the explanatory clause, e.g. "where
`(a)` is highest") with `(A)`, `(B)`, `(C)` and "where `(A)` is highest" in the
spec, so that the priority notation is documented correctly.

---

Nitpick comments:
In `@evaluations/fill-distractors.py`:
- Around line 74-82: The current extraction of JSON from
response.content[0].text is brittle (splitting on "```" and taking the first
block) and then calls json.loads directly; update the extraction in the function
handling the response to iterate over code-block parts (e.g., parts =
text.split("```") and inspect odd-index parts), strip any leading "json" from a
block and select the first block that looks like JSON (e.g., startswith("{") or
"["), then wrap the final json.loads(text) call in a try/except catching
json.JSONDecodeError and raise a clearer ValueError that includes a short
preview of the problematic text; reference the response.content[0].text
processing and json.loads call so the main() exception handler will surface a
more informative error.

In `@evaluations/generate-l1-specs.py`:
- Around line 60-61: The tier parsing currently does result["tier"] =
int(line.split(":tier:")[1].strip()) and can raise ValueError for empty/invalid
values; update the block (the branch checking line.startswith(":tier:") that
sets result["tier"]) to catch ValueError (and optionally TypeError), validate
the extracted token before int() (e.g. check non-empty and numeric), and on
failure set a safe default (None or skip setting) and/or log a clear warning so
parsing doesn't crash the script.
- Around line 149-156: The function should_skip contains a redundant
condition; remove the check "and anchor_id not in SKIP_EXACT" from the loop
over SKIP_PREFIXES and keep the logic as follows: first check whether anchor_id
is in SKIP_EXACT (as it does now), and otherwise, inside the for loop, only
check anchor_id.startswith(prefix) for the entries in SKIP_PREFIXES; the
behavior stays the same, but the redundant condition is eliminated.

In `@evaluations/pilot.py`:
- Around line 409-411: The code currently builds up the variants list using an
explicit concatenation (variants = variants + [lang]) after obtaining lang from
cons.get("language_variant"); replace that with a more idiomatic operation
(e.g., append or in-place extend/spread) to avoid creating a new list each
time—use variants.append(lang) or variants += [lang] (or variants = [*variants,
lang] if you prefer spread semantics) to update the variants list in place where
lang is truthy.
- Around line 256-262: The tuple-unpacked variable model_id returned from
call_fn is unused and triggers a linter warning; change the unpacking in the try
block to either discard model_id (e.g., assign to _ or _model_id) or adjust
call_fn to return only response_text, and make the same change in the
except/handling flow so the variable is consistently ignored (reference the call
to call_fn and the local name model_id to locate the change).

In `@evaluations/specs/bluf.yaml`:
- Around line 5-7: The question line under the YAML key question contains a
superfluous blank line; open the YAML entry with the key question in
evaluations/specs/bluf.yaml and remove the blank line between "Which of the
following best describes \"BLUF (Bottom Line Up Front)\"?" and the closing
string delimiter, so that the question value sits on one continuous line (key:
question, search term: "BLUF (Bottom Line Up Front)").

In `@evaluations/specs/chain-of-thought.yaml`:
- Around line 11-12: The wording of the entry "B: Explicitly show intermediate
reasoning steps before reaching a conclusion; make the thought process visible,
not just the final answer" is too narrow and ties CoT to explicitly visible
intermediate steps; change the text so that it is neutral about the form of
presentation (e.g. encourage chain-of-thought-style reasoning, or use a
suitable method- or prompt-level description for anchor recognition) without
requiring visible chains of thought; update the YAML entry "B" accordingly (the
replacement wording should explain purpose and flexibility, e.g. "Encourage
chain-of-thought style reasoning or method-level description of intermediate
steps when helpful for verification/anchor detection").

In `@evaluations/specs/cqrs.yaml`:
- Around line 13-16: The description under "B: Bertrand Meyer's principle" /
the CQS entry is too focused on CQS; update it to explicitly describe CQRS by
stating that reads and writes use separate models/interfaces (a Read
model/Query-side and a Write model/Command-side), that reads do not perform
state changes and writes encapsulate intent in immutable command objects, and
note that the two sides can have different representations, validation and
storage; replace or augment the current CQS wording to mention "separate Read
and Write models/interfaces" and ensure the term CQRS appears so readers
understand it's distinct from simple CQS.

In `@evaluations/specs/cynefin-framework.yaml`:
- Around line 5-7: The "question" YAML value currently contains an extra blank
line and trailing whitespace; edit the question key to remove the empty line so
the prompt is a clean single-line string or convert it to a block scalar (use |)
without any leading empty line, ensuring there are no embedded blank lines or
trailing spaces in the 'question' value.

In `@evaluations/specs/domain-driven-design.yaml`:
- Around line 5-8: The question line under the key "question" contains a
superfluous blank line at the end; remove the extra newline inside the string
value for "question" so that the text ends directly after the sentence
"According to Evans" (or after the closing quotation mark) without the blank
line, e.g. by collapsing the multi-line YAML scalar so there is no trailing
blank line in the "question" value.

In `@evaluations/specs/fowler-patterns.yaml`:
- Around line 13-14: Option B lists pattern names inconsistently: after
"Service Layer" the terms are lowercase; change the entries in the option B
line so that all pattern names are capitalized and appear uniformly as proper
names (e.g. "Table Data Gateway", "Row Data Gateway", "Active Record", "Data
Mapper") and keep the existing separators/commas unchanged, so that the line
"B: Transaction Script, Domain Model, Table Module, Service Layer; Table Data
Gateway, Row Data Gateway, Active Record, Data Mapper" is consistent.
- Around line 5-8: The question under the YAML key question contains a
superfluous blank line; edit the value of question in
evaluations/specs/fowler-patterns.yaml (the single "question" entry) and remove
the intermediate line so that the text becomes one continuous line/block with
no additional blank lines or leading spaces; afterwards check the YAML
indentation level so that the formatting stays unchanged.

In `@evaluations/specs/lasr.yaml`:
- Around line 1-22: Add an optional pre-validation note for the semantic anchor
"lasr" by extending the YAML around anchor: lasr (and related
questions.recognition) with a small anchor_quality or validation field that
records the recommended pre-check: evaluate LLM responses to "What concepts do
you associate with <semantic anchor name>?" and capture Recognition, Accuracy,
Depth, and Specificity; include the test prompt, the four quality criteria, and
a short accepted/threshold guideline so later evaluation logic can reference
this metadata during scoring.

In `@evaluations/specs/linddun.yaml`:
- Around line 5-7: The question string under the YAML key question contains a
superfluous blank line at the end; remove the blank line so that the value of
question no longer has a trailing blank line (edit the question value in
evaluations/specs/linddun.yaml and delete the blank line after 'Which of the
following best describes "LINDDUN"?'), and make sure the quotation marks and
indentation stay unchanged.

In `@evaluations/specs/semantic-versioning.yaml`:
- Around line 5-7: Remove the extraneous blank line inside the YAML value for
the question key so the question string is a single continuous line; edit the
question field (question: 'Which of the following best describes "Semantic
Versioning (SemVer)"?') to eliminate the empty line between lines 6–7 while
preserving the surrounding quotes and indentation.

In `@evaluations/specs/socratic-method.yaml`:
- Around line 5-7: The YAML scalar for the key "question" currently uses inline
quotes with extra line breaks; switch to an explicit block scalar (pipe |) for
the "question" field and remove the superfluous blank lines/trailing whitespace
so that the content reads exactly as intended (e.g. | followed by the question
line "Which of the following best describes \"Socratic Method\"?"), make sure
the block scalar is indented correctly, and keep the exact prompt text without
additional blank lines.

In `@evaluations/specs/spc.yaml`:
- Around line 11-12: Option B (the value under key "B") is a fragmented,
verb-less phrase; rewrite it as a complete, grammatically correct sentence or
concise noun phrase to improve clarity and professionalism — for example convert
"Systematic statistical monitoring of running processes; inherent, random
fluctuation — stable and predictable" into a full phrase like "Systematic
statistical monitoring of running processes, showing inherent random
fluctuations that are stable and predictable" (update the string value for key B
in spc.yaml accordingly).
- Around line 5-7: The YAML question value for the key question contains an
unnecessary blank line; edit the question string (the question: field) to remove
the extra empty line and trailing newline so the text is a single contiguous
paragraph—i.e., trim internal/leading/trailing blank lines from the question
value and ensure the question: string reads "Which of the following best
describes \"SPC (Statistical Process Control)\"?" without the blank line.

In `@evaluations/specs/tdd-chicago-school.yaml`:
- Around line 5-7: The question line under the YAML key question contains an
unnecessary blank line; remove the superfluous blank line inside the question
string (in the entry question: 'Which of the following best describes "TDD,
Chicago School"') so that the text sits on one continuous line, and remove any
superfluous spaces or line breaks at the end of the string.

In `@evaluations/specs/user-story-mapping.yaml`:
- Around line 11-12: Extend the description for option "B:" so that it captures
the core of story mapping precisely: mention that a story map forms a backbone
of high-level activities (user activities), that the concrete tasks/stories are
arranged beneath it, that these are often organized horizontally by
flow/workflow and vertically by level of detail/priority, and that items are
typically prioritized for releases or iterations (MVP/release slices); replace
the existing terse wording ("Horizontal arrangement of user activities;
high-level tasks users perform") with this extended, clearer wording, so that
the recognition question discriminates more reliably.
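
Several of the findings above are mechanical (a `correct` key that names no option, the `; :` truncation artifact, blank lines in `question`) and could be caught before a run. The sketch below assumes a question has already been loaded into a dict with the keys `question`, `options`, and `correct`; that layout and the name `lint_question` are assumptions, since the actual spec schema is not shown in this review.

```python
def lint_question(q: dict) -> list:
    """Return human-readable problems found in one parsed question."""
    problems = []
    options = q.get("options", {})
    # The answer key must at least point at an option that exists.
    if q.get("correct") not in options:
        problems.append("correct does not name an existing option")
    for key, text in options.items():
        # '; :' is the truncation artifact flagged in several specs above.
        if "; :" in text:
            problems.append(f"option {key} contains a '; :' artifact")
    question = q.get("question", "")
    if question != question.strip() or "\n\n" in question:
        problems.append("question has blank lines or surrounding whitespace")
    return problems


broken = {
    "question": 'Which of the following best describes "Cynefin Framework"?\n',
    "options": {"A": "placeholder", "B": "Five domains; : best practices apply"},
    "correct": "E",
}
clean = {
    "question": 'Which of the following best describes "Cynefin Framework"?',
    "options": {"A": "placeholder", "B": "Five domains with a response pattern each"},
    "correct": "B",
}
```

A check like this cannot tell whether an existing option is the wrong answer, as reported for atam.yaml and gherkin.yaml; those cases still need human review.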

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 487eff58-252c-4849-a278-20a7a39827e6

📥 Commits

Reviewing files that changed from the base of the PR and between 73cf01a and e315226.

📒 Files selected for processing (78)
  • evaluations/.gitignore
  • evaluations/README.adoc
  • evaluations/fill-distractors.py
  • evaluations/generate-l1-specs.py
  • evaluations/pilot.py
  • evaluations/results/pilot-20260324-122900.json
  • evaluations/results/pilot-20260324-123126.json
  • evaluations/results/pilot-20260324-131427.json
  • evaluations/results/pilot-20260324-131452.json
  • evaluations/results/pilot-20260324-132301.json
  • evaluations/results/pilot-20260324-132431.json
  • evaluations/results/pilot-20260324-132552.json
  • evaluations/results/pilot-20260324-135717.json
  • evaluations/results/pilot-20260324-140232.json
  • evaluations/results/pilot-20260324-161744.json
  • evaluations/specs/adr-according-to-nygard.yaml
  • evaluations/specs/arc42.yaml
  • evaluations/specs/atam.yaml
  • evaluations/specs/bdd-given-when-then.yaml
  • evaluations/specs/bem-methodology.yaml
  • evaluations/specs/bluf.yaml
  • evaluations/specs/c4-diagrams.yaml
  • evaluations/specs/chain-of-thought.yaml
  • evaluations/specs/clean-architecture.yaml
  • evaluations/specs/control-chart-shewhart.yaml
  • evaluations/specs/conventional-commits.yaml
  • evaluations/specs/cqrs.yaml
  • evaluations/specs/cynefin-framework.yaml
  • evaluations/specs/definition-of-done.yaml
  • evaluations/specs/devils-advocate.yaml
  • evaluations/specs/diataxis-framework.yaml
  • evaluations/specs/docs-as-code.yaml
  • evaluations/specs/domain-driven-design.yaml
  • evaluations/specs/ears-requirements.yaml
  • evaluations/specs/event-driven-architecture.yaml
  • evaluations/specs/fagan-inspection.yaml
  • evaluations/specs/feynman-technique.yaml
  • evaluations/specs/five-whys.yaml
  • evaluations/specs/fowler-patterns.yaml
  • evaluations/specs/gherkin.yaml
  • evaluations/specs/github-flow.yaml
  • evaluations/specs/gutes-deutsch-wolf-schneider.yaml
  • evaluations/specs/hexagonal-architecture.yaml
  • evaluations/specs/iec-61508-sil-levels.yaml
  • evaluations/specs/impact-mapping.yaml
  • evaluations/specs/invest.yaml
  • evaluations/specs/iso-25010.yaml
  • evaluations/specs/jobs-to-be-done.yaml
  • evaluations/specs/lasr.yaml
  • evaluations/specs/linddun.yaml
  • evaluations/specs/llm-evaluations.yaml
  • evaluations/specs/madr.yaml
  • evaluations/specs/mece.yaml
  • evaluations/specs/morphological-box.yaml
  • evaluations/specs/moscow.yaml
  • evaluations/specs/mutation-testing.yaml
  • evaluations/specs/negative-control.yaml
  • evaluations/specs/nelson-rules.yaml
  • evaluations/specs/owasp-top-10.yaml
  • evaluations/specs/plain-english-strunk-white.yaml
  • evaluations/specs/prd.yaml
  • evaluations/specs/problem-space-nvc.yaml
  • evaluations/specs/property-based-testing.yaml
  • evaluations/specs/pyramid-principle.yaml
  • evaluations/specs/sanity-check.yaml
  • evaluations/specs/semantic-versioning.yaml
  • evaluations/specs/socratic-method.yaml
  • evaluations/specs/sota.yaml
  • evaluations/specs/spc.yaml
  • evaluations/specs/stride.yaml
  • evaluations/specs/swot.yaml
  • evaluations/specs/tdd-chicago-school.yaml
  • evaluations/specs/tdd-london-school.yaml
  • evaluations/specs/testing-pyramid.yaml
  • evaluations/specs/timtowtdi.yaml
  • evaluations/specs/todotxt-flavoured-markdown.yaml
  • evaluations/specs/user-story-mapping.yaml
  • evaluations/specs/wardley-mapping.yaml

Comment thread evaluations/results/pilot-20260324-122900.json Outdated
Comment thread evaluations/results/pilot-20260324-132552.json Outdated
Comment thread evaluations/specs/adr-according-to-nygard.yaml Outdated
Comment thread evaluations/specs/atam.yaml Outdated
Comment thread evaluations/specs/clean-architecture.yaml
Comment thread evaluations/specs/moscow.yaml
Comment thread evaluations/specs/owasp-top-10.yaml Outdated
Comment thread evaluations/specs/semantic-versioning.yaml Outdated
Comment on lines +12 to +15
B: Impersonating another user, process, or system to gain unauthorized access;
mitigated by strong authentication; unauthorized modification of data in transit
or at rest; mitigated by integrity controls, digital signatures, and access
controls

@coderabbitai coderabbitai Bot Mar 24, 2026

Contributor

⚠️ Potential issue | 🟠 Major

“STRIDE” is only partially described as the correct answer.

Lines 12–15 cover practically only Spoofing and Tampering. For a valid anchor evaluation, the correct option should name the complete STRIDE model (S, T, R, I, D, E).

Suggestion for a complete option B
-      B: Impersonating another user, process, or system to gain unauthorized access;
-        mitigated by strong authentication; unauthorized modification of data in transit
-        or at rest; mitigated by integrity controls, digital signatures, and access
-        controls
+      B: Threat-modeling approach with six categories: Spoofing, Tampering, Repudiation,
+        Information Disclosure, Denial of Service, and Elevation of Privilege; used
+        to systematically identify threats and suitable countermeasures

Based on learnings: Validate semantic anchors meet quality criteria before adding: Precise (specific body of knowledge with clear boundaries), Rich (activates multiple interconnected concepts), Consistent, Attributable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/stride.yaml` around lines 12 - 15, Option B only covers
Spoofing and Tampering; update the entry labeled "B:" in the STRIDE spec to name
all six STRIDE threats (Spoofing, Tampering, Repudiation, Information
disclosure, Denial of service, Elevation of privilege), and for each provide a
one-line mitigation mapping (e.g., strong authentication for Spoofing, integrity
controls/digital signatures for Tampering, non-repudiation/audit logging for
Repudiation, encryption/access controls for Information disclosure,
rate-limiting/fault tolerance for Denial of service, and least
privilege/privilege separation for Elevation of privilege) so the anchor is
precise, complete, and actionable.
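
The completeness requirement in this finding can be checked mechanically before a spec is committed. The following is a minimal sketch, not part of the PR: `STRIDE_CATEGORIES` and `missing_stride_categories` are hypothetical names, and the check is a plain substring match on the category names, which is an assumption about how the option text is worded.

```python
# The six STRIDE category names, lower-cased for matching.
STRIDE_CATEGORIES = (
    "spoofing",
    "tampering",
    "repudiation",
    "information disclosure",
    "denial of service",
    "elevation of privilege",
)


def missing_stride_categories(option_text: str) -> list[str]:
    """Return the STRIDE categories that the option text does not name."""
    # Collapse whitespace so names wrapped across YAML lines still match.
    normalized = " ".join(option_text.lower().split())
    return [name for name in STRIDE_CATEGORIES if name not in normalized]
```

An empty return value means all six categories are named. The original option B describes Spoofing and Tampering without naming them, so a name-based check like this one would flag all six as missing.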

Contributor Author

Acknowledged: only Spoofing/Tampering are covered. This will be completed.

Comment thread evaluations/specs/swot.yaml
raifdmueller and others added 3 commits March 24, 2026 17:22
Fixed 5 specs where auto-generated correct answers were wrong
(ATAM, Gherkin, ADR Nygard, LINDDUN, OWASP Top 10).

Results (75 questions each, excluding sanity check):
- Claude Sonnet: 100% (perfect across all anchors)
- GPT-4o: 96% (LASR 75%, PRD 75%, TDD position bias)
- Mistral Large: 94% (PRD 0%, LASR 25%, Control Chart 50%)

Key finding: auto-generated specs need human review — Claude scoring
0% on ATAM/Gherkin proved the specs were wrong, not Claude.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Produces a self-contained HTML report with:
- Model summary cards with average scores
- Color-coded heatmap (anchor × model)
- Expandable detail rows for multi-question anchors
- Separate control questions section
- Failure detail per model
- Run metadata

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kept only the 3 final runs (post spec-fix):
- Claude Sonnet: 100% (170530)
- Mistral Large: 94% (171539)
- GPT-4o: 96% (171546)

Fixed report generator to prefer latest run when question count is equal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evaluations/generate-report.py`:
- Around line 167-171: The HTML templates render unescaped dynamic values (e.g.,
display, info['file'], label, anchor_id) causing possible HTML injection/XSS;
fix by adding and using a central escape helper (e.g., html_escape) and apply it
wherever strings are interpolated into templates — replace direct uses of
display, info['file'], label, anchor_id (and any other user-driven values at the
noted sites including where avg/n are shown) with escaped versions before
concatenation, and ensure score_color remains safe or validated separately.
- Around line 25-32: The report's MODEL_DISPLAY mapping omits "deepseek",
causing DeepSeek runs to be hidden; add an entry "deepseek": "DeepSeek" to the
MODEL_DISPLAY dict and update any rendering usage to use a safe lookup (e.g.,
MODEL_DISPLAY.get(model_key, model_key or model_key.title())) so unknown models
fall back to a readable label; search for usages of MODEL_DISPLAY and the code
that builds the model list (references to MODEL_DISPLAY and the
report-generation function) and replace direct indexing with a .get fallback.
- Around line 40-42: Reading the files matching "pilot-*.json" (variable f in
the loop over RESULTS_DIR.glob) lacks robustness: a malformed JSON in
json.load(open(...)) aborts the entire report generation. To fix this, open each
file with with open(f, encoding="utf-8") as fh and wrap json.load(fh) in a
try/except block (catching at least JSONDecodeError and OSError), log a
meaningful error including the file name and exception, and skip the broken file
with continue; then continue processing with d["models"].items() (m, r) as
before.
- Around line 43-44: The current comparison selects only by len(r) and, on a
tie, keeps the run loaded first (the older one); change the condition for
setting results[m] so that on a tie the newer run is preferred based on a
timestamp field (e.g. r.get("timestamp")). Concretely: extend the if condition
(involving results, m, r, results[m]["data"]) with an additional tie-break:
(len(r) == len(results[m]["data"]) and r.get("timestamp", 0) >
results[m].get("timestamp", 0)), and make sure the timestamp field is also
stored when setting results[m], so future comparisons work correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 4047fd8f-94e2-4a96-9e5f-bb3c799aa13d

📥 Commits

Reviewing files that changed from the base of the PR and between 9e34b4e and e8c78d2.

📒 Files selected for processing (2)
  • evaluations/generate-report.py
  • evaluations/report.html

Comment on lines +25 to +32
MODEL_DISPLAY = {
"claude": "Claude Sonnet",
"claude-cli": "Claude Sonnet (CLI)",
"claude-haiku": "Claude Haiku",
"openai": "GPT-4o",
"mistral": "Mistral Large",
"ollama": "Ollama (local)",
}

@coderabbitai coderabbitai Bot Mar 24, 2026

Contributor

⚠️ Potential issue | 🟠 Major

DeepSeek results are currently never displayed.

Lines 78–85 build the model list hard-coded; deepseek is missing there entirely. As a result, an existing run can remain invisible in the report.

🔧 Suggestion: extend the model list + fallback
 MODEL_DISPLAY = {
     "claude": "Claude Sonnet",
     "claude-cli": "Claude Sonnet (CLI)",
     "claude-haiku": "Claude Haiku",
     "openai": "GPT-4o",
     "mistral": "Mistral Large",
+    "deepseek": "DeepSeek",
     "ollama": "Ollama (local)",
 }
@@
-    for m in ["claude", "openai", "mistral"]:
+    for m in ["claude", "openai", "mistral", "deepseek"]:
         if m in results and len(results[m]["data"]) >= 60:
             model_names.append(m)
@@
     for m in ["claude-cli", "claude-haiku", "ollama"]:
         if m in results and m not in model_names:
             model_names.append(m)
+
+    for m in sorted(results.keys()):
+        if m not in model_names:
+            model_names.append(m)

Also applies to: 78-85

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/generate-report.py` around lines 25 - 32, The report's
MODEL_DISPLAY mapping omits "deepseek", causing DeepSeek runs to be hidden; add
an entry "deepseek": "DeepSeek" to the MODEL_DISPLAY dict and update any
rendering usage to use a safe lookup (e.g., MODEL_DISPLAY.get(model_key,
model_key or model_key.title())) so unknown models fall back to a readable
label; search for usages of MODEL_DISPLAY and the code that builds the model
list (references to MODEL_DISPLAY and the report-generation function) and
replace direct indexing with a .get fallback.
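
The `.get` fallback and a model list derived from the actual result keys could look roughly like this. It is a sketch under assumptions: `display_name`, `ordered_models`, and the `preferred` ordering are hypothetical helpers, and the 60-entry filter from the existing code is deliberately left out here.

```python
MODEL_DISPLAY = {
    "claude": "Claude Sonnet",
    "claude-cli": "Claude Sonnet (CLI)",
    "claude-haiku": "Claude Haiku",
    "openai": "GPT-4o",
    "mistral": "Mistral Large",
    "deepseek": "DeepSeek",
    "ollama": "Ollama (local)",
}


def display_name(model_key: str) -> str:
    """Look up a pretty name; fall back to a title-cased key for unknown models."""
    return MODEL_DISPLAY.get(model_key, model_key.replace("-", " ").title())


def ordered_models(results: dict, preferred=("claude", "openai", "mistral", "deepseek")) -> list[str]:
    """List every model that has results: preferred keys first, the rest sorted."""
    names = [m for m in preferred if m in results]
    names += sorted(m for m in results if m not in names)
    return names
```

With this shape, a backend that is added later shows up in the report without touching the mapping; it only gets a generic label until an entry is added.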

Contributor Author

Acknowledged: DeepSeek will be added once an API key is available.

Comment on lines +40 to +42
for f in sorted(RESULTS_DIR.glob("pilot-*.json")):
d = json.load(open(f, encoding="utf-8"))
for m, r in d["models"].items():

@coderabbitai coderabbitai Bot Mar 24, 2026

Contributor

⚠️ Potential issue | 🟠 Major

A corrupted result file stops the entire report generation.

Lines 40–42 lack error handling when loading. A single malformed JSON file prevents the complete report.

🔧 Suggestion: load robustly and skip broken files
-    for f in sorted(RESULTS_DIR.glob("pilot-*.json")):
-        d = json.load(open(f, encoding="utf-8"))
-        for m, r in d["models"].items():
+    for f in sorted(RESULTS_DIR.glob("pilot-*.json")):
+        try:
+            with f.open(encoding="utf-8") as fh:
+                d = json.load(fh)
+            model_results = d.get("models", {})
+        except (OSError, json.JSONDecodeError) as exc:
+            print(f"Skipping {f.name}: {exc}")
+            continue
+
+        for m, r in model_results.items():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/generate-report.py` around lines 40 - 42, Reading the files
matching "pilot-*.json" (variable f in the loop over RESULTS_DIR.glob) lacks
robustness: a malformed JSON in json.load(open(...)) aborts the entire report
generation. To fix this, open each file with with open(f, encoding="utf-8") as
fh and wrap json.load(fh) in a try/except block (catching at least
JSONDecodeError and OSError), log a meaningful error including the file name and
exception, and skip the broken file with continue; then continue processing with
d["models"].items() (m, r) as before.

Contributor Author

Acknowledged: error handling will be added.

Comment thread evaluations/generate-report.py Outdated
Comment on lines +43 to +44
if m not in results or len(r) > len(results[m]["data"]):
results[m] = {

@coderabbitai coderabbitai Bot Mar 24, 2026

Contributor

⚠️ Potential issue | 🟠 Major

On a tie, a stale run is selected.

Lines 43 and 44 compare only len(r). With the same number of questions, the run loaded first (typically the older one) stays active and newer results are discarded.

🔧 Suggested fix (tie-break via timestamp)
-            if m not in results or len(r) > len(results[m]["data"]):
+            prev = results.get(m)
+            prev_len = len(prev["data"]) if prev else -1
+            prev_ts = prev.get("timestamp", "") if prev else ""
+            curr_ts = d.get("timestamp", "")
+            if (
+                prev is None
+                or len(r) > prev_len
+                or (len(r) == prev_len and curr_ts > prev_ts)
+            ):
                 results[m] = {
                     "data": r,
                     "file": f.name,
                     "config": d.get("config", {}),
                     "duration": d.get("duration_seconds", 0),
                     "timestamp": d.get("timestamp", ""),
                 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/generate-report.py` around lines 43 - 44, The current comparison
selects only by len(r) and, on a tie, keeps the run loaded first (the older
one); change the condition for setting results[m] so that on a tie the newer run
is preferred based on a timestamp field (e.g. r.get("timestamp")). Concretely:
extend the if condition (involving results, m, r, results[m]["data"]) with an
additional tie-break: (len(r) == len(results[m]["data"]) and
r.get("timestamp", 0) > results[m].get("timestamp", 0)), and make sure the
timestamp field is also stored when setting results[m], so future comparisons
work correctly.

Contributor Author

Fixed: using >= instead of > ensures that the newest run wins.
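
The `>=` fix relies on the files being visited in ascending name order, which holds for the `pilot-YYYYMMDD-HHMMSS.json` pattern because the loop uses `sorted(...)`. A minimal sketch of that selection rule, with a hypothetical `pick_latest_runs` helper and an in-memory list standing in for the result files:

```python
def pick_latest_runs(runs):
    """Pick one run per model: most questions wins, the later file wins a tie.

    `runs` is an iterable of (file_name, {model: {question_id: score}}) pairs.
    """
    best = {}
    # Ascending file-name order means later timestamps are visited last.
    for file_name, models in sorted(runs, key=lambda run: run[0]):
        for model, data in models.items():
            # ">=" lets a later run with the same question count replace the earlier one.
            if model not in best or len(data) >= len(best[model]["data"]):
                best[model] = {"file": file_name, "data": data}
    return best
```

A later run with fewer questions still does not displace a more complete earlier one, so this behaves like the timestamp tie-break only as long as file names sort chronologically.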

Comment thread evaluations/generate-report.py
raifdmueller and others added 3 commits March 24, 2026 17:44
…g#332)

Auto-generated realistic scenarios with anchor prompt, calibrated
paraphrase, and 4 MC options per anchor using Claude API.

Each question tests whether the model APPLIES the methodology:
- Scenario: concrete software engineering situation
- Anchor variant: "using <anchor name>"
- Paraphrase variant: goal description without naming the method
- 4 options: one correct (reflects methodology), three plausible alternatives

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shows e.g. [42/193] arc42 / recognition... 100%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Old paraphrase "validate and deepen understanding" was too vague —
any learning method would score. New paraphrase "identify and fill
knowledge gaps" describes the goal without naming the method.

If models still score low on the paraphrase, that shows the anchor
name adds genuine value beyond the goal description.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 5

♻️ Duplicate comments (3)
evaluations/generate-report.py (3)

167-170: ⚠️ Potential issue | 🟠 Major

Escape dynamic values before rendering.

label, anchor_id, info['file'], and model names are interpolated directly into the HTML. A manipulated result JSON, or even an unusual file name, is enough for HTML injection in the generated report.

Also applies to: 194-194, 209-209, 230-230, 248-252, 264-264

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/generate-report.py` around lines 167 - 170, The HTML template
concatenation injects unescaped dynamic values (display/model names, label,
anchor_id, info['file'], etc.) allowing HTML injection; fix by HTML-escaping
these variables before interpolation (e.g., call an escape utility such as
html.escape on display, any model name variables, label, anchor_id, and
info['file']) and use the escaped versions when building the HTML strings
(ensure the same change is applied to the other template sites noted). Also
validate/normalize anchor_id to safe characters if used as an element id.
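
A central escape helper along the lines of this finding might look like the sketch below. `html.escape` is the standard-library call; `esc`, `safe_id`, and `render_row` are hypothetical names, and the row markup is illustrative only, not the report's actual template.

```python
import html
import re


def esc(value) -> str:
    """Escape a dynamic value for use in HTML text or a quoted attribute."""
    return html.escape(str(value), quote=True)


def safe_id(anchor_id: str) -> str:
    """Reduce an anchor id to characters that are safe inside an element id."""
    return re.sub(r"[^A-Za-z0-9_-]", "-", anchor_id)


def render_row(anchor_id: str, label: str, file_name: str) -> str:
    """Build one table row with every interpolated value escaped."""
    return (
        f'<tr id="row-{safe_id(anchor_id)}">'
        f"<td>{esc(label)}</td><td>{esc(file_name)}</td></tr>"
    )
```

Routing every interpolation through one helper keeps the escaping decision in a single place, which makes the remaining template sites easier to audit.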

40-41: ⚠️ Potential issue | 🟠 Major

A broken result JSON breaks the whole report.

Without error handling on load, one corrupt file under evaluations/results/ is enough to stop the entire report generation. It is better to catch OSError and JSONDecodeError per file and skip only the affected run.

🔧 Suggestion
-    for f in sorted(RESULTS_DIR.glob("pilot-*.json")):
-        d = json.load(open(f, encoding="utf-8"))
-        for m, r in d["models"].items():
+    for f in sorted(RESULTS_DIR.glob("pilot-*.json")):
+        try:
+            with f.open(encoding="utf-8") as fh:
+                d = json.load(fh)
+        except (OSError, json.JSONDecodeError) as exc:
+            print(f"Skipping {f.name}: {exc}")
+            continue
+
+        for m, r in d.get("models", {}).items():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/generate-report.py` around lines 40 - 41, Reading the individual
result JSON files (the loop over sorted(RESULTS_DIR.glob("pilot-*.json")) and
json.load(open(...))) lacks error handling, so a broken JSON breaks the entire
report generation; to fix this, wrap the open and json.load calls in a
try/except, catch OSError and json.JSONDecodeError, and on errors log the
offending file (e.g. with print or an existing logger) and continue to the next
file so that only the affected run is skipped; ideally use with open(...) as a
context manager when reading.

25-32: ⚠️ Potential issue | 🟠 Major

Model list and display names are still hard-wired.

deepseek is never added to model_names, partial claude/openai/mistral runs disappear completely below the threshold of 60, and openai/ollama are shown with static names regardless of the actual run. As a result, the report hides existing results or labels them incorrectly.

Also applies to: 77-85, 162-165, 261-265

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/generate-report.py` around lines 25 - 32, The report currently
uses a hard-coded MODEL_DISPLAY and static model_names which omits "deepseek",
hides models with partial runs under the 60-result threshold, and always labels
"openai"/"ollama" the same regardless of actual run keys; update the logic so
that model_names are derived from the actual result keys present in the data
(include any "deepseek" or other dynamic keys), use MODEL_DISPLAY only as a
fallback for pretty names (default to a sensible generated name like capitalized
key if no mapping exists), and change the threshold/aggregation code that
filters models (the block referencing the 60 threshold and the places where
model lists are built/filtered around model_names and MODEL_DISPLAY) to always
include models with any results (show partial-run counts/percentages instead of
removing them) so labels and presence reflect actual runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evaluations/generate-l2-specs.py`:
- Around line 96-137: The script writes whatever generate_application(spec)
returns directly into spec["questions"]["application"], which can produce broken
YAML if required fields are missing; before assigning and persisting, validate
the returned app object (from generate_application) contains the required keys
like "scenario", "anchor_prompt", "paraphrase_prompt" and an "options" mapping
with entries "A","B","C","D" (or whatever option keys your spec expects); if
validation fails, log an error and skip writing (or raise) so the file is not
corrupted. Add a small validator function (e.g., validate_application(app)) and
call it in main() right after app = generate_application(spec) and before
spec["questions"]["application"] = app; if validation returns False,
print/processLogger.error with the anchor id and continue to next spec.
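
A validator of the kind described for `generate-l2-specs.py` could be sketched as follows. The field names come from the finding above; returning a list of problems instead of a boolean, and the exact option keys A to D, are assumptions made for this example.

```python
REQUIRED_FIELDS = ("scenario", "anchor_prompt", "paraphrase_prompt")
OPTION_KEYS = ("A", "B", "C", "D")


def validate_application(app) -> list[str]:
    """Return a list of problems; an empty list means the object looks complete."""
    if not isinstance(app, dict):
        return ["application is not a mapping"]
    problems = []
    for field in REQUIRED_FIELDS:
        value = app.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    options = app.get("options")
    if not isinstance(options, dict):
        problems.append("options is not a mapping")
    else:
        for key in OPTION_KEYS:
            value = options.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"missing or empty option: {key}")
    return problems
```

In `main()`, a non-empty result would be logged together with the anchor id and the spec skipped, so a partial generation never reaches the YAML file.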

In `@evaluations/pilot.py`:
- Around line 47-56: The prompt generation in build_prompt only emits letters
A–D so a correct label "X" can never be returned (making sanity checks
impossible); update build_prompt to include the full set of possible response
tokens (including "X") when building options — e.g. iterate over the full
LETTERS set or detect when "X" is a valid option and append an "X) ..." line —
and ensure this aligns with parse_response so that parse_response and
build_prompt use the same symbol set (reference: build_prompt and
parse_response).
- Line 257: The results are being stored only under the backend alias, causing
different model variants to overwrite each other; update the storage key
generation where call_fn returns response_text, model_id (and in the analogous
block at lines 332-365) to include the concrete model_id (e.g., combine backend
alias + model_id or use model_id alone) so each run is saved under a unique key,
and ensure generate-report.py reads that composite key (backend+model_id) when
aggregating runs.
- Around line 71-90: The current parse_response() fallback is too permissive and
picks up any capital A-D inside words (e.g., "ERROR"); update parse_response to
only match freestanding option letters by replacing the character-based fallback
with a regex that finds standalone option tokens (case-insensitive) such as
boundaries or common option formats like "A", "A)", "A.", or wrapped in markup
(e.g., "**A**"); keep the existing stripping of <think> blocks and the
strongest-line check, but change the final search in parse_response() to use a
regex like a word-boundary or negative-word-character lookaround for A-D so only
isolated option letters are returned (use line/token matching on the cleaned
text and return uppercased letter).
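
For the `parse_response()` finding, one way to restrict matching to freestanding option letters is sketched here. It is not the code from `pilot.py`: `parse_option_letter` and both patterns are hypothetical, only uppercase A to D is matched (an assumption, since a case-insensitive fallback would also pick up the article "a"), and the sanity-check label "X" is not handled.

```python
import re

# Reasoning blocks emitted by some models are dropped before parsing.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
# A line that consists only of an option letter plus markup or punctuation.
LINE_ONLY = re.compile(r"^[\s*_(\[]*([A-D])[\s*_).\]:]*$")
# An option letter that is not part of a longer word or number.
STANDALONE = re.compile(r"(?<![A-Za-z0-9])([A-D])(?![A-Za-z0-9])")


def parse_option_letter(response_text: str):
    """Return 'A'..'D' if a freestanding option letter is found, else None."""
    cleaned = THINK_BLOCK.sub("", response_text).strip()
    # Strongest signal first: a line such as "B", "B)", "B." or "**B**".
    for line in cleaned.splitlines():
        match = LINE_ONLY.match(line)
        if match:
            return match.group(1)
    # Fallback: the first isolated letter anywhere in the text.
    match = STANDALONE.search(cleaned)
    return match.group(1) if match else None
```

This remains a heuristic: a sentence that starts with the word "A" would still be read as option A, so constraining the output format in the prompt is the more reliable lever.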

In `@evaluations/specs/bluf.yaml`:
- Around line 22-24: The `application-paraphrase` variant is too leading: adjust
the value of paraphrase_prompt so that it describes BLUF neutrally as
"communication for time-constrained, decision-relevant situations" without
prescribing the order of the information; concretely, replace the wording
"die wichtigsten Informationen zuerst" ("the most important information first")
with wording that points to time pressure and the need for a decision (e.g.
"für zeitlich begrenzte Situationen, in denen schnelle Entscheidungen getroffen
werden müssen", i.e. "for time-constrained situations in which quick decisions
must be made"), and update the YAML entry for paraphrase_prompt (next to
anchor_prompt) accordingly.

---

Duplicate comments:
In `@evaluations/generate-report.py`:
- Around line 167-170: The HTML template concatenation injects unescaped dynamic
values (display/model names, label, anchor_id, info['file'], etc.) allowing HTML
injection; fix by HTML-escaping these variables before interpolation (e.g., call
an escape utility such as html.escape on display, any model name variables,
label, anchor_id, and info['file']) and use the escaped versions when building
the HTML strings (ensure the same change is applied to the other template sites
noted). Also validate/normalize anchor_id to safe characters if used as an
element id.
- Around line 40-41: Reading the individual result JSON files (the loop over
sorted(RESULTS_DIR.glob("pilot-*.json")) and json.load(open(...))) lacks error
handling, so a broken JSON breaks the entire report generation; to fix this,
wrap the open and json.load calls in a try/except, catch OSError and
json.JSONDecodeError, and on errors log the offending file (e.g. with print or
an existing logger) and continue to the next file so that only the affected run
is skipped; ideally use with open(...) as a context manager when reading.
- Around line 25-32: The report currently uses a hard-coded MODEL_DISPLAY and
static model_names which omits "deepseek", hides models with partial runs under
the 60-result threshold, and always labels "openai"/"ollama" the same regardless of
actual run keys; update the logic so that model_names are derived from the
actual result keys present in the data (include any "deepseek" or other dynamic
keys), use MODEL_DISPLAY only as a fallback for pretty names (default to a
sensible generated name like capitalized key if no mapping exists), and change
the threshold/aggregation code that filters models (the block referencing the 60
threshold and the places where model lists are built/filtered around model_names
and MODEL_DISPLAY) to always include models with any results (show partial-run
counts/percentages instead of removing them) so labels and presence reflect
actual runs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 1150b2b3-9aac-4ec4-86c8-2db8f8919779

📥 Commits

Reviewing files that changed from the base of the PR and between e8c78d2 and ae3d4bf.

📒 Files selected for processing (63)
  • evaluations/generate-l2-specs.py
  • evaluations/generate-report.py
  • evaluations/pilot.py
  • evaluations/report.html
  • evaluations/specs/adr-according-to-nygard.yaml
  • evaluations/specs/atam.yaml
  • evaluations/specs/bdd-given-when-then.yaml
  • evaluations/specs/bem-methodology.yaml
  • evaluations/specs/bluf.yaml
  • evaluations/specs/c4-diagrams.yaml
  • evaluations/specs/chain-of-thought.yaml
  • evaluations/specs/clean-architecture.yaml
  • evaluations/specs/control-chart-shewhart.yaml
  • evaluations/specs/conventional-commits.yaml
  • evaluations/specs/cqrs.yaml
  • evaluations/specs/cynefin-framework.yaml
  • evaluations/specs/definition-of-done.yaml
  • evaluations/specs/devils-advocate.yaml
  • evaluations/specs/diataxis-framework.yaml
  • evaluations/specs/docs-as-code.yaml
  • evaluations/specs/domain-driven-design.yaml
  • evaluations/specs/ears-requirements.yaml
  • evaluations/specs/event-driven-architecture.yaml
  • evaluations/specs/fagan-inspection.yaml
  • evaluations/specs/feynman-technique.yaml
  • evaluations/specs/five-whys.yaml
  • evaluations/specs/fowler-patterns.yaml
  • evaluations/specs/gherkin.yaml
  • evaluations/specs/github-flow.yaml
  • evaluations/specs/gutes-deutsch-wolf-schneider.yaml
  • evaluations/specs/hexagonal-architecture.yaml
  • evaluations/specs/iec-61508-sil-levels.yaml
  • evaluations/specs/impact-mapping.yaml
  • evaluations/specs/invest.yaml
  • evaluations/specs/iso-25010.yaml
  • evaluations/specs/jobs-to-be-done.yaml
  • evaluations/specs/lasr.yaml
  • evaluations/specs/linddun.yaml
  • evaluations/specs/llm-evaluations.yaml
  • evaluations/specs/madr.yaml
  • evaluations/specs/mece.yaml
  • evaluations/specs/morphological-box.yaml
  • evaluations/specs/moscow.yaml
  • evaluations/specs/mutation-testing.yaml
  • evaluations/specs/nelson-rules.yaml
  • evaluations/specs/owasp-top-10.yaml
  • evaluations/specs/plain-english-strunk-white.yaml
  • evaluations/specs/prd.yaml
  • evaluations/specs/problem-space-nvc.yaml
  • evaluations/specs/property-based-testing.yaml
  • evaluations/specs/pyramid-principle.yaml
  • evaluations/specs/semantic-versioning.yaml
  • evaluations/specs/socratic-method.yaml
  • evaluations/specs/sota.yaml
  • evaluations/specs/spc.yaml
  • evaluations/specs/stride.yaml
  • evaluations/specs/swot.yaml
  • evaluations/specs/tdd-chicago-school.yaml
  • evaluations/specs/testing-pyramid.yaml
  • evaluations/specs/timtowtdi.yaml
  • evaluations/specs/todotxt-flavoured-markdown.yaml
  • evaluations/specs/user-story-mapping.yaml
  • evaluations/specs/wardley-mapping.yaml
✅ Files skipped from review due to trivial changes (53)
  • evaluations/specs/invest.yaml
  • evaluations/specs/c4-diagrams.yaml
  • evaluations/specs/clean-architecture.yaml
  • evaluations/specs/gherkin.yaml
  • evaluations/specs/bdd-given-when-then.yaml
  • evaluations/specs/definition-of-done.yaml
  • evaluations/specs/stride.yaml
  • evaluations/specs/cqrs.yaml
  • evaluations/specs/hexagonal-architecture.yaml
  • evaluations/specs/iso-25010.yaml
  • evaluations/specs/atam.yaml
  • evaluations/specs/event-driven-architecture.yaml
  • evaluations/specs/devils-advocate.yaml
  • evaluations/specs/bem-methodology.yaml
  • evaluations/specs/mutation-testing.yaml
  • evaluations/specs/semantic-versioning.yaml
  • evaluations/specs/swot.yaml
  • evaluations/specs/lasr.yaml
  • evaluations/specs/linddun.yaml
  • evaluations/specs/github-flow.yaml
  • evaluations/specs/adr-according-to-nygard.yaml
  • evaluations/specs/todotxt-flavoured-markdown.yaml
  • evaluations/specs/problem-space-nvc.yaml
  • evaluations/specs/owasp-top-10.yaml
  • evaluations/specs/sota.yaml
  • evaluations/specs/prd.yaml
  • evaluations/specs/moscow.yaml
  • evaluations/specs/impact-mapping.yaml
  • evaluations/specs/ears-requirements.yaml
  • evaluations/specs/user-story-mapping.yaml
  • evaluations/specs/property-based-testing.yaml
  • evaluations/specs/domain-driven-design.yaml
  • evaluations/specs/pyramid-principle.yaml
  • evaluations/specs/socratic-method.yaml
  • evaluations/specs/mece.yaml
  • evaluations/specs/tdd-chicago-school.yaml
  • evaluations/specs/conventional-commits.yaml
  • evaluations/specs/timtowtdi.yaml
  • evaluations/specs/testing-pyramid.yaml
  • evaluations/specs/five-whys.yaml
  • evaluations/specs/cynefin-framework.yaml
  • evaluations/report.html
  • evaluations/specs/diataxis-framework.yaml
  • evaluations/specs/spc.yaml
  • evaluations/specs/wardley-mapping.yaml
  • evaluations/specs/gutes-deutsch-wolf-schneider.yaml
  • evaluations/specs/control-chart-shewhart.yaml
  • evaluations/specs/feynman-technique.yaml
  • evaluations/specs/nelson-rules.yaml
  • evaluations/specs/fagan-inspection.yaml
  • evaluations/specs/fowler-patterns.yaml
  • evaluations/specs/jobs-to-be-done.yaml
  • evaluations/specs/iec-61508-sil-levels.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
  • evaluations/specs/plain-english-strunk-white.yaml
  • evaluations/specs/chain-of-thought.yaml
  • evaluations/specs/morphological-box.yaml
  • evaluations/specs/llm-evaluations.yaml
  • evaluations/specs/madr.yaml

Comment on lines +96 to +137
    text = response.content[0].text.strip()
    if "```" in text:
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[4:]
        text = text.strip()

    return json.loads(text)


def main():
    parser = argparse.ArgumentParser(description="Generate L2 Application questions using Claude API")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--anchor", help="Process single anchor")
    args = parser.parse_args()

    specs_to_fill = []
    for f in sorted(SPECS_DIR.glob("*.yaml")):
        spec = yaml.safe_load(f.read_text(encoding="utf-8"))
        if spec["anchor"] in SKIP_ANCHORS:
            continue
        if args.anchor and spec["anchor"] != args.anchor:
            continue
        if needs_application(spec):
            specs_to_fill.append((f, spec))

    print(f"Found {len(specs_to_fill)} specs needing L2 Application questions")

    for filepath, spec in specs_to_fill:
        anchor_id = spec["anchor"]
        print(f" {anchor_id}...", end=" ", flush=True)

        if args.dry_run:
            print("(dry run)")
            continue

        try:
            app = generate_application(spec)
            spec["questions"]["application"] = app

            with open(filepath, "w", encoding="utf-8") as fh:
                yaml.dump(spec, fh, default_flow_style=False, allow_unicode=True, sort_keys=False)

@coderabbitai coderabbitai Bot Mar 24, 2026

⚠️ Potential issue | 🟠 Major

Validate the Claude response before persisting it.

Arbitrary JSON ends up directly in the YAML here. If scenario/anchor_prompt/paraphrase_prompt or the options A-D are missing, the script writes a broken spec to disk and evaluations/pilot.py later crashes on direct index access.

🔧 Suggestion
+def validate_application(app):
+    if not isinstance(app, dict):
+        raise ValueError("application must be a JSON object")
+
+    required = {"scenario", "anchor_prompt", "paraphrase_prompt", "options", "correct"}
+    missing = required - app.keys()
+    if missing:
+        raise ValueError(f"missing application keys: {sorted(missing)}")
+
+    if not isinstance(app["options"], dict) or set(app["options"]) != {"A", "B", "C", "D"}:
+        raise ValueError("options must contain exactly A, B, C and D")
+
+    if app["correct"] not in {"A", "B", "C", "D"}:
+        raise ValueError("correct must be one of A-D")
+
+    return app
+
@@
-    return json.loads(text)
+    return validate_application(json.loads(text))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/generate-l2-specs.py` around lines 96 - 137, The script writes
whatever generate_application(spec) returns directly into
spec["questions"]["application"], which can produce broken YAML if required
fields are missing; before assigning and persisting, validate the returned app
object (from generate_application) contains the required keys like "scenario",
"anchor_prompt", "paraphrase_prompt" and an "options" mapping with entries
"A","B","C","D" (or whatever option keys your spec expects); if validation
fails, log an error and skip writing (or raise) so the file is not corrupted.
Add a small validator function (e.g., validate_application(app)) and call it in
main() right after app = generate_application(spec) and before
spec["questions"]["application"] = app; if validation returns False,
print/processLogger.error with the anchor id and continue to next spec.
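
The skip-on-failure flow described in the prompt above can be sketched as follows. This is a hedged variant of the suggested validator that returns a list of problems instead of raising. The names `application_problems` and `apply_if_valid` are hypothetical, and the required keys are taken from the review suggestion rather than from the actual spec schema.

```python
import logging

logger = logging.getLogger(__name__)

# Keys taken from the review suggestion; the real schema may differ.
REQUIRED_KEYS = {"scenario", "anchor_prompt", "paraphrase_prompt", "options", "correct"}
OPTION_LETTERS = ("A", "B", "C", "D")


def application_problems(app):
    """Return a list of human-readable problems; empty means valid."""
    if not isinstance(app, dict):
        return ["application must be a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - app.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    options = app.get("options")
    if not isinstance(options, dict) or set(options) != set(OPTION_LETTERS):
        problems.append("options must contain exactly A, B, C and D")
    if app.get("correct") not in OPTION_LETTERS:
        problems.append("correct must be one of A-D")
    return problems


def apply_if_valid(spec, app):
    """Attach app to spec only when it validates; report whether it did."""
    problems = application_problems(app)
    if problems:
        logger.error("%s: skipping, %s", spec.get("anchor"), "; ".join(problems))
        return False
    spec.setdefault("questions", {})["application"] = app
    return True
```

In the generation loop, a False return value would be the signal to continue with the next spec without touching the YAML file on disk.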

Contributor Author

Acknowledged. Validation will be added.

Comment thread evaluations/pilot.py
Comment on lines +47 to +56
def build_prompt(question_text, options, permutation):
    """Build a prompt with options in the given permutation order."""
    lines = [question_text.strip(), ""]
    for i, perm_idx in enumerate(permutation):
        letter = LETTERS[i]
        option_text = options[LETTERS[perm_idx]]
        lines.append(f"{letter}) {option_text}")
    lines.append("")
    lines.append("Answer with the letter only.")
    return "\n".join(lines)

@coderabbitai coderabbitai Bot Mar 24, 2026

⚠️ Potential issue | 🟠 Major

correct: X is unreachable with this prompting.

For sanity checks you expect X, but the prompt only shows A-D and parse_response() can never return X. Such controls therefore inevitably end up at 0% and do not really validate the system.

Also applies to: 59-68, 71-90

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/pilot.py` around lines 47 - 56, The prompt generation in
build_prompt only emits letters A–D so a correct label "X" can never be returned
(making sanity checks impossible); update build_prompt to include the full set
of possible response tokens (including "X") when building options — e.g. iterate
over the full LETTERS set or detect when "X" is a valid option and append an "X)
..." line — and ensure this aligns with parse_response so that parse_response
and build_prompt use the same symbol set (reference: build_prompt and
parse_response).

Contributor Author

By design. The sanity check MUST score 0%. correct=X never matches A-D, and that is the purpose.
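
Why a control with correct=X is pinned at 0% can be illustrated with a small scoring sketch. The body of correct_letter_for_permutation is not shown in this thread, so `expected_letter` below is a hypothetical stand-in that follows the indexing used in build_prompt, and `score` is an assumed helper rather than the runner's actual code.

```python
LETTERS = ["A", "B", "C", "D"]


def expected_letter(original_correct, permutation):
    """Map the spec's correct letter to its displayed position.

    permutation[i] is the index of the original option shown at position i.
    A value outside A-D (the sanity check's "X") is returned unchanged.
    """
    if original_correct not in LETTERS:
        return original_correct
    return LETTERS[permutation.index(LETTERS.index(original_correct))]


def score(original_correct, runs):
    """Fraction of runs whose parsed answer equals the expected letter.

    runs is a list of (permutation, parsed_answer) pairs.
    """
    hits = sum(
        1 for perm, answer in runs
        if answer == expected_letter(original_correct, perm)
    )
    return hits / len(runs)
```

Because a parser limited to A-D (or None) can never produce "X", every run of such a control counts as a miss, so any non-zero score would point to a bug in the parsing or scoring path.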

Comment thread evaluations/pilot.py
Comment on lines +71 to +90
def parse_response(text):
    """Extract the first capital letter A-D from the response.
    Strips <think>...</think> blocks (used by reasoning models like qwen3)."""
    import re
    # Remove thinking blocks (qwen3, DeepSeek R1, etc.)
    cleaned = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()
    # If nothing left after stripping, fall back to original
    if not cleaned:
        cleaned = text.strip()
    # Try to find a standalone answer letter (e.g., "B", "B)", "**B**", "b")
    # First: look for a line that is just a letter (strongest signal)
    for line in cleaned.split('\n'):
        line = line.strip().strip('*').strip('.').strip(')').strip()
        if line.upper() in ("A", "B", "C", "D"):
            return line.upper()
    # Fallback: first capital A-D in the text
    for char in cleaned:
        if char in "ABCD":
            return char
    return None

@coderabbitai coderabbitai Bot Mar 24, 2026

⚠️ Potential issue | 🟠 Major

The answer parser produces false positives.

The fallback takes the first capital A-D anywhere in the text. As a result, ordinary explanations or error messages such as ERROR: Bad Request are counted as a valid answer. Match only free-standing option letters here.

🔧 Suggestion
-    # Fallback: first capital A-D in the text
-    for char in cleaned:
-        if char in "ABCD":
-            return char
+    # Fallback: first standalone option letter
+    match = re.search(r"\b([ABCD])\b", cleaned, flags=re.IGNORECASE)
+    if match:
+        return match.group(1).upper()
     return None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def parse_response(text):
    """Extract the first capital letter A-D from the response.
    Strips <think>...</think> blocks (used by reasoning models like qwen3)."""
    import re
    # Remove thinking blocks (qwen3, DeepSeek R1, etc.)
    cleaned = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()
    # If nothing left after stripping, fall back to original
    if not cleaned:
        cleaned = text.strip()
    # Try to find a standalone answer letter (e.g., "B", "B)", "**B**", "b")
    # First: look for a line that is just a letter (strongest signal)
    for line in cleaned.split('\n'):
        line = line.strip().strip('*').strip('.').strip(')').strip()
        if line.upper() in ("A", "B", "C", "D"):
            return line.upper()
    # Fallback: first standalone option letter
    match = re.search(r"\b([ABCD])\b", cleaned, flags=re.IGNORECASE)
    if match:
        return match.group(1).upper()
    return None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/pilot.py` around lines 71 - 90, The current parse_response()
fallback is too permissive and picks up any capital A-D inside words (e.g.,
"ERROR"); update parse_response to only match freestanding option letters by
replacing the character-based fallback with a regex that finds standalone option
tokens (case-insensitive) such as boundaries or common option formats like "A",
"A)", "A.", or wrapped in markup (e.g., "**A**"); keep the existing stripping of
<think> blocks and the strongest-line check, but change the final search in
parse_response() to use a regex like a word-boundary or negative-word-character
lookaround for A-D so only isolated option letters are returned (use line/token
matching on the cleaned text and return uppercased letter).

Contributor Author

Accepted trade-off. The standalone-letter-first strategy minimizes false positives. 4 runs average out errors.
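
The trade-off under discussion can be made concrete by comparing the two fallbacks on their own. This is a sketch for illustration only: `first_capital_fallback` mirrors the fallback loop quoted in this thread, while `standalone_fallback` and its case-sensitive lookaround pattern are an assumed variant, not the reviewer's exact suggestion and not code from pilot.py.

```python
import re


def first_capital_fallback(text):
    """Fallback as quoted in the review: first capital A-D anywhere."""
    for char in text:
        if char in "ABCD":
            return char
    return None


# Assumed stricter pattern: a capital A-D with no letter or digit on either side.
STANDALONE = re.compile(r"(?<![A-Za-z0-9])([A-D])(?![A-Za-z0-9])")


def standalone_fallback(text):
    """Return the first free-standing capital A-D, or None."""
    match = STANDALONE.search(text)
    return match.group(1) if match else None
```

On "ERROR: Bad Request" the first function returns "B" while the second returns None, and on "Answer: D" the first returns "A" while the second returns "D". A sentence that starts with the article "A" would still fool the stricter variant, so it narrows the problem rather than removing it.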

Comment thread evaluations/pilot.py
expected = correct_letter_for_permutation(original_correct, perm)

try:
    response_text, model_id = call_fn(prompt)

@coderabbitai coderabbitai Bot Mar 24, 2026

⚠️ Potential issue | 🟠 Major

Multiple model variants collide under the same key.

Results are stored only under the backend alias (openai, ollama, …), although the callers already know the concrete model ID. A run with gpt-5 therefore later overwrites gpt-4o-mini, and generate-report.py can only display one run per backend.

Also applies to: 332-365

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 257-257: Unpacked variable model_id is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/pilot.py` at line 257, The results are being stored only under
the backend alias, causing different model variants to overwrite each other;
update the storage key generation where call_fn returns response_text, model_id
(and in the analogous block at lines 332-365) to include the concrete model_id
(e.g., combine backend alias + model_id or use model_id alone) so each run is
saved under a unique key, and ensure generate-report.py reads that composite key
(backend+model_id) when aggregating runs.

Contributor Author

Acknowledged. The model key should contain the concrete model name.
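
One possible shape for such a key is sketched below. The `result_key` and `store_run` helpers and the "backend:model_id" format are assumptions made for illustration; the thread does not specify how pilot.py will build the key.

```python
def result_key(backend, model_id):
    """Build a storage key that keeps model variants apart."""
    return f"{backend}:{model_id}"


def store_run(results, backend, model_id, run):
    """Save one run under its composite key and return that key."""
    key = result_key(backend, model_id)
    results[key] = run
    return key
```

With this scheme, two runs through the same backend but with different model IDs land under two distinct keys instead of the later one replacing the earlier one. The report generator would then need to read the same composite key when aggregating.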

Comment on lines +22 to +24
anchor_prompt: using BLUF (Bottom Line Up Front)
paraphrase_prompt: to communicate the most critical information first for time-pressed
  executives who need to make immediate decisions

@coderabbitai coderabbitai Bot Mar 24, 2026

⚠️ Potential issue | 🟠 Major

The paraphrase is too leading.

"the most critical information first" paraphrases BLUF almost verbatim. The application-paraphrase variant therefore measures definition recognition rather than application transfer and will probably inflate the scores artificially. Phrase the goal more neutrally in terms of time pressure and the need for a decision, without prescribing the order of the information.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/specs/bluf.yaml` around lines 22 - 24, the
`application-paraphrase` variant is too leading: adjust the value of
paraphrase_prompt so that it describes BLUF neutrally as "communication for
time-constrained, decision-relevant situations" without prescribing the order of
the information; concretely, replace the wording "the most critical information
first" with wording that points to time pressure and the need for a decision
(e.g. "for time-constrained situations in which quick decisions must be made"),
and update the YAML entry for paraphrase_prompt (next to anchor_prompt)
accordingly.

Contributor Author

Acknowledged. The BLUF paraphrase is too close to the answer. It will be calibrated.

raifdmueller and others added 2 commits March 24, 2026 19:40
Final results (193 questions each, excluding sanity check):
- Claude Sonnet: 99% (2 fails: Feynman para 0%, GitHub Flow para 75%)
- GPT-4o: 97% (13 fails: Feynman para 0%, TDD position bias, niche anchors)
- Mistral Large: 96% (18 fails: PRD 0%, LASR 25%, niche anchors)

Key findings from L2 Application tests:
- Paraphrase scores are generally lower than anchor scores — anchors
  add value beyond shorthand for most models
- Feynman Technique paraphrase fails across all models (0%) — the
  paraphrase is too vague, not the models. Fixed in spec but not yet re-run.
- Niche anchors (LASR, PRD, todo.txt, EARS) show model-specific gaps

Includes updated HTML report and progress counter in runner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fixed semantic-versioning.yaml: replaced broken "Version format; :"
  with proper MAJOR.MINOR.PATCH description
- Added html.escape() for all dynamic values in report generator
- Regenerated report

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@raifdmueller

Contributor Author

Response to CodeRabbit Review

Fixed in latest commit

| Finding | Status |
| --- | --- |
| Semantic Versioning (Critical): Syntax error in Option B | ✅ Fixed: replaced with proper MAJOR.MINOR.PATCH description |
| HTML Escaping (Major): Dynamic values unescaped in report | ✅ Fixed: added html.escape() for all dynamic values |
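
As an illustration of the escaping fix, a report row could be built like this. It is a minimal sketch using the standard library's html.escape; the `render_row` helper and its three columns are hypothetical and do not reflect the actual structure of the report generator.

```python
import html


def render_row(anchor, model, answer):
    """Build one table row, escaping every dynamic value."""
    cells = (anchor, model, answer)
    return "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in cells) + "</tr>"
```

Escaping at the point where values are interpolated keeps model output that happens to contain markup from being rendered as HTML.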

Already fixed in earlier commits (CodeRabbit reviewed stale code)

| Finding | Status |
| --- | --- |
| ATAM (Critical): Wrong correct answer | ✅ Was already fixed: correct=B now describes scenario-based evaluation |
| Gherkin (Critical): Wrong correct answer | ✅ Was already fixed: correct=A (Given/When/Then) |

By design / accepted

| Finding | Explanation |
| --- | --- |
| correct: X unreachable | Intentional. The sanity check MUST score 0%: any answer A-D is wrong. X will never match, which is the point. |
| 404 errors in old result files | Old pilot runs from before the API fixes. Only the 3 final result files matter; the old ones were deleted. |
| Model key collisions | Accepted for the pilot. Each run now tests one model at a time (parallel runs caused race conditions). |
| Parser false positives | Mitigated by the standalone-letter-first strategy. The fallback to "first A-D in text" is a known trade-off: verbose models may trigger it, but position randomization (4 runs) averages out errors. |

Known limitations (documented in concept)

| Finding | Plan |
| --- | --- |
| Auto-generated correct answers too vague (ADR Nygard, SWOT, MoSCoW, INVEST, Conventional Commits, etc.) | This is the core limitation of auto-generation from .adoc Core Concepts. The build_correct_answer function takes only the first 2 concepts. Human review is needed and is tracked as an ongoing quality improvement. Claude scoring 0% on wrong specs is the validation mechanism. |
| Paraphrases too leading (BLUF) or too vague (Feynman, GitHub Flow) | Paraphrase calibration is hard by design; the concept document discusses this. Feynman was already fixed. |
| DeepSeek not in report | No DeepSeek API key configured yet. Will be added when available. |
| No validation of Claude-generated L2 specs | Accepted for the pilot phase. Full validation comes with human review in Phase 2. |

Development

Successfully merging this pull request may close these issues.

Pilot: Manual evaluation with 5 anchors and 2 models

2 participants