feat: evaluation framework with 63 anchor specs and pilot results #343
Conversation
…ding#331)
Auto-generated MC recognition questions from .adoc metadata:
- generate-l1-specs.py: extracts core concepts → correct answer
- fill-distractors.py: uses Claude API to generate 3 plausible distractors
- 56 new specs + 7 hand-crafted from pilot = 63 total
Includes pilot runner, results, and README from the closed PR LLM-Coding#342.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Walkthrough
Adds a complete semantic-anchor evaluation framework: documentation, CLI scripts for generating and completing specs and distractors, a pilot runner for multiple LLM backends, a report generator, as well as 60+ new YAML specifications and a static HTML report.
Sequence Diagram
sequenceDiagram
actor User
participant GenL1 as generate-l1-specs.py
participant ADOC as docs/anchors/*.adoc
participant Spec as evaluations/specs/*.yaml
participant GenL2 as generate-l2-specs.py
participant Fill as fill-distractors.py
participant Pilot as pilot.py
participant Model as LLM Backend
participant Results as evaluations/results/*.json
participant ReportGen as generate-report.py
participant HTML as evaluations/report.html
User->>GenL1: --dry-run / --force
GenL1->>ADOC: parse anchor metadata
ADOC-->>GenL1: anchor data
GenL1->>Spec: create/update Level‑1 spec
User->>GenL2: --dry-run / --anchor
GenL2->>ADOC: load anchor context
ADOC-->>GenL2: context
GenL2->>Model: request Level‑2 JSON (Anthropic)
Model-->>GenL2: JSON response
GenL2->>Spec: write application question
User->>Fill: --anchor / --dry-run
Fill->>Spec: find placeholders
Fill->>Model: request distractors
Model-->>Fill: JSON distractors
Fill->>Spec: write options A/C/D
User->>Pilot: run (--model, --dry-run)
Pilot->>Spec: load specs
loop per question, 4 randomized permutations
Pilot->>Model: send permuted prompt
Model-->>Pilot: response
Pilot->>Pilot: parse first A–D, score
end
Pilot->>Results: append timestamped JSON run
User->>ReportGen: --output
ReportGen->>Results: load pilot-*.json
ReportGen->>ReportGen: compute best runs, heatmap, failures
ReportGen->>HTML: write report.html
HTML-->>User: report ready
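The pilot loop in the diagram (four randomized permutations per question, parse the first A–D letter, score) could look roughly like the sketch below. This is not the code in `pilot.py`; the function names `permute_options`, `parse_first_letter`, and `score_question`, the prompt layout, and the `ask` callback are all hypothetical and only illustrate the idea.

```python
import random
import re

LETTERS = ["A", "B", "C", "D"]


def permute_options(options, correct, rng):
    """Shuffle the option texts; return the new mapping and the new correct letter."""
    texts = [options[letter] for letter in LETTERS]
    correct_text = options[correct]
    rng.shuffle(texts)
    permuted = dict(zip(LETTERS, texts))
    # Assumes option texts are distinct, so the correct text maps to one letter.
    new_correct = next(l for l, t in permuted.items() if t == correct_text)
    return permuted, new_correct


def parse_first_letter(response_text):
    """Return the first standalone capital A-D in the response, or None."""
    match = re.search(r"\b([A-D])\b", response_text)
    return match.group(1) if match else None


def score_question(options, correct, ask, permutations=4, seed=0):
    """Ask the same question under several option orders; return the hit rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        permuted, new_correct = permute_options(options, correct, rng)
        prompt = "\n".join(f"{l}: {t}" for l, t in permuted.items())
        if parse_first_letter(ask(prompt)) == new_correct:
            hits += 1
    return hits / permutations
```

Permuting the options separates real recognition from position bias: a backend that always answers with the same letter cannot score well across all orders. One known weakness of the simple regex is that a reply beginning with the English article "A" would be read as answer A.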
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 20
Note
Due to the large number of review comments, comments of Critical and Major severity were prioritized as inline comments.
🟡 Minor comments (5)
evaluations/specs/bem-methodology.yaml-12-14 (1)
12-14: ⚠️ Potential issue | 🟡 Minor
The correct answer is too incomplete in substance for BEM.
Option B currently focuses mainly on "Block" and the benefits, and does not state clearly enough that Block/Element/Modifier is the core of the anchor. This weakens the measurement quality of the recognition question.
Suggested refinement of Option B
- B: Solve CSS specificity wars, naming conflicts, and stylesheet maintainability
-   issues in large codebases; standalone component that is meaningful on its
-   own (e.g., `menu`, `button`, `header`)
+ B: CSS naming convention with Block, Element, and Modifier (BEM) for structuring
+   components unambiguously, reducing specificity conflicts, and keeping stylesheets
+   maintainable in large codebases
Based on learnings: Validate semantic anchors meet quality criteria before adding: Precise (specific body of knowledge with clear boundaries), Rich (activates multiple interconnected concepts), Consistent, Attributable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/bem-methodology.yaml` around lines 12 - 14, Option B is too focused on the Block concept and benefit; update the BEM anchor text (the option labeled "B") to explicitly state the Block/Element/Modifier triad as the core concept, include brief examples for each (e.g., block: "menu", element: "menu__item", modifier: "menu--active"), and ensure the wording meets the semantic-anchor quality criteria (Precise, Rich, Consistent, Attributable) by making boundaries clear, activating interconnected concepts (naming, specificity, maintainability), and keeping attribution to BEM terminology.
evaluations/specs/problem-space-nvc.yaml-13-13 (1)
13-13: ⚠️ Potential issue | 🟡 Minor
Small text error in the example statement.
On line 13 it should read "I feel frustrated" instead of "i feel frustrated", so that the option stays linguistically clean and professional.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/problem-space-nvc.yaml` at line 13, Replace the lowercase example text "i feel frustrated" with the grammatically correct "I feel frustrated" in the YAML entry (search for the exact string "i feel frustrated" in problem-space-nvc.yaml) so the option reads with proper capitalization and professional tone.
evaluations/specs/todotxt-flavoured-markdown.yaml-13-14 (1)
13-14: ⚠️ Potential issue | 🟡 Minor
Please make the priority notation more precise.
On lines 13-14: todo.txt customarily uses the uppercase notation `(A)`, `(B)`, `(C)`. Using it increases the technical accuracy of the correct option.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/todotxt-flavoured-markdown.yaml` around lines 13 - 14, the description "uses todo.txt priority notation `(a)`, `(b)`, `(c)` where `(a)` is highest priority" is technically incorrect; change the notation to the customary uppercase form `(A)`, `(B)`, `(C)`, i.e. replace all occurrences of `(a)`, `(b)`, `(c)` (and the explanatory clause, e.g. "where `(a)` is highest") with `(A)`, `(B)`, `(C)` and "where `(A)` is highest" in the specification, so that the priority notation is documented correctly.
evaluations/specs/c4-diagrams.yaml-11-13 (1)
11-13: ⚠️ Potential issue | 🟡 Minor
The correct answer is worded too imprecisely.
The wording in lines 11-13 ("Four levels of abstraction; : system in its environment …") contains a typo and covers the concept only partially. This weakens the validity of the item, even though `correct: B` is set.
Based on learnings: Validate semantic anchors meet quality criteria before adding: Precise, Rich, Consistent, Attributable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/c4-diagrams.yaml` around lines 11 - 13, Item B currently has a typo and an imprecise definition for the C4 model; update the YAML entry for symbol "B" so it reads a precise enumeration like "Four levels of abstraction: System Context (system in its environment), Container (apps/services), Component (high‑level modules), Code (classes/implementation)" (ensure the stray ":" is removed), keep `correct: B` unchanged, and make the wording rich and consistent with the C4 terminology so the semantic anchor meets precision, richness, and consistency criteria.
evaluations/specs/ears-requirements.yaml-11-11 (1)
11-11: ⚠️ Potential issue | 🟡 Minor
Option B contains unnecessary extra quotation marks.
Line 11 wraps the entire content in an additional pair of `"`. This is not a syntax error, but it is an unnecessary visual cue in exactly the correct option.
Suggestion
- B: '"The <system> shall <requirement>"; "when <trigger> the <system> shall <requirement>"'
+ B: 'The <system> shall <requirement>; when <trigger> the <system> shall <requirement>'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/ears-requirements.yaml` at line 11, the option for key B contains doubled outer quotation marks; open the definition of B and remove the outer `"` so that only the inner, intended phrases remain (e.g. change the value assigned to B from '"The <system> shall <requirement>"; "when <trigger> the <system> shall <requirement>"' to the version without the additional surrounding quotation marks), check the string in the YAML line with key B, and make sure that only the intended quotes within the phrases remain.
🧹 Nitpick comments (20)
evaluations/specs/lasr.yaml (1)
1-22: Optional: additionally document the anchor quality for LASR up front.
Not a blocker in the YAML itself. Optionally, a short reference to a pre-validation (Recognition/Accuracy/Depth/Specificity) would give the later evaluation a more robust justification.
Based on learnings: Test semantic anchors before adding by evaluating LLM response to 'What concepts do you associate with <semantic anchor name>?' considering Recognition, Accuracy, Depth, and Specificity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/lasr.yaml` around lines 1 - 22, Add an optional pre-validation note for the semantic anchor "lasr" by extending the YAML around anchor: lasr (and related questions.recognition) with a small anchor_quality or validation field that records the recommended pre-check: evaluate LLM responses to "What concepts do you associate with <semantic anchor name>?" and capture Recognition, Accuracy, Depth, and Specificity; include the test prompt, the four quality criteria, and a short accepted/threshold guideline so later evaluation logic can reference this metadata during scoring.
evaluations/specs/cqrs.yaml (1)
13-16: Make the CQRS description more precise (name the read/write models explicitly).
The current correct option is close, but focuses heavily on CQS. For better discriminating power, it should state clearly that reads and writes go through separate models/interfaces.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/cqrs.yaml` around lines 13 - 16, the description under "B: Bertrand Meyer's principle" / the CQS entry is too focused on CQS; update it to explicitly describe CQRS by stating that reads and writes use separate models/interfaces (a Read model/Query-side and a Write model/Command-side), that reads do not perform state changes and writes encapsulate intent in immutable command objects, and note that the two sides can have different representations, validation and storage; replace or augment the current CQS wording to mention "separate Read and Write models/interfaces" and ensure the term CQRS appears so readers understand it's distinct from simple CQS.
evaluations/specs/socratic-method.yaml (1)
5-7: Switch the question scalar to a clean block format.
The current notation in lines 5-7 produces unnecessary blank lines/whitespace in the prompt. Please switch to an unambiguous block scalar so that the evaluation prompts stay stable.
Suggested diff
- question: 'Which of the following best describes "Socratic Method"?
-
-   '
+ question: |
+   Which of the following best describes "Socratic Method"?
Based on learnings: Validate semantic anchors meet quality criteria before adding: Precise, Rich, Consistent, Attributable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/socratic-method.yaml` around lines 5 - 7, the YAML question scalar for the key "question" currently uses inline quotes with additional line breaks; switch to an unambiguous block scalar (pipe |) for the "question" field and remove the superfluous blank lines/trailing whitespace so that the content reads exactly (e.g. | followed by the question line "Which of the following best describes \"Socratic Method\"?"), pay attention to correct indentation of the block scalar, and keep the exact prompt text without additional blank lines.
evaluations/specs/semantic-versioning.yaml (1)
5-7: Remove the unnecessary blank line in the question text.
The question text contains a superfluous blank line (lines 6-7).
✨ Suggested fix
- question: 'Which of the following best describes "Semantic Versioning (SemVer)"?
-
-   '
+ question: 'Which of the following best describes "Semantic Versioning (SemVer)"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/semantic-versioning.yaml` around lines 5 - 7, Remove the extraneous blank line inside the YAML value for the question key so the question string is a single continuous line; edit the question field (question: 'Which of the following best describes "Semantic Versioning (SemVer)"?') to eliminate the empty line between lines 6–7 while preserving the surrounding quotes and indentation.
evaluations/specs/linddun.yaml (1)
5-7: Remove the unnecessary blank line in the question text.
The question text contains a superfluous blank line at the end (lines 6-7).
✨ Suggested fix
- question: 'Which of the following best describes "LINDDUN"?
-
-   '
+ question: 'Which of the following best describes "LINDDUN"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/linddun.yaml` around lines 5 - 7, the question string under the YAML key question contains a superfluous blank line at the end; remove the blank line so that the value of question no longer has a trailing blank line (edit the question value in evaluations/specs/linddun.yaml and delete the blank line after 'Which of the following best describes "LINDDUN"?'), and make sure the quotation marks and indentation stay unchanged.
evaluations/specs/spc.yaml (2)
11-12: Option B is grammatically incomplete.
The correct answer (Option B) uses fragmented sentence structures without verbs: "Systematic statistical monitoring of running processes; inherent, random fluctuation — stable and predictable". This makes it harder to understand and looks unprofessional.
📝 Suggested improvement for better readability
- B: Systematic statistical monitoring of running processes; inherent, random
-   fluctuation — stable and predictable
+ B: A methodology for systematic statistical monitoring of processes to distinguish
+   between inherent random variation (common cause) and special cause variation,
+   ensuring processes remain stable and predictable
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/spc.yaml` around lines 11 - 12, Option B (the value under key "B") is a fragmented, verb-less phrase; rewrite it as a complete, grammatically correct sentence or concise noun phrase to improve clarity and professionalism — for example convert "Systematic statistical monitoring of running processes; inherent, random fluctuation — stable and predictable" into a full phrase like "Systematic statistical monitoring of running processes, showing inherent random fluctuations that are stable and predictable" (update the string value for key B in spc.yaml accordingly).
5-7: Remove the unnecessary blank line in the question text.
The question text contains a superfluous blank line (lines 6-7).
✨ Suggested fix
- question: 'Which of the following best describes "SPC (Statistical Process Control)"?
-
-   '
+ question: 'Which of the following best describes "SPC (Statistical Process Control)"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/spc.yaml` around lines 5 - 7, The YAML question value for the key question contains an unnecessary blank line; edit the question string (the question: field) to remove the extra empty line and trailing newline so the text is a single contiguous paragraph; i.e., trim internal/leading/trailing blank lines from the question value and ensure the question: string reads "Which of the following best describes \"SPC (Statistical Process Control)\"?" without the blank line.
evaluations/specs/tdd-chicago-school.yaml (1)
5-7: Remove the unnecessary blank line in the question text.
The question text contains a superfluous blank line (lines 6-7).
✨ Suggested fix
- question: 'Which of the following best describes "TDD, Chicago School"?
-
-   '
+ question: 'Which of the following best describes "TDD, Chicago School"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/tdd-chicago-school.yaml` around lines 5 - 7, the question line under the YAML key question contains an unnecessary blank line; remove the superfluous blank line inside the question string (in the entry question: 'Which of the following best describes "TDD, Chicago School"') so that the text sits on one continuous line, and remove any superfluous spaces or line breaks at the end of the string.
evaluations/specs/bluf.yaml (1)
5-7: Remove the unnecessary blank line in the question text.
The question text contains a superfluous blank line (lines 6-7).
✨ Suggested fix
- question: 'Which of the following best describes "BLUF (Bottom Line Up Front)"?
-
-   '
+ question: 'Which of the following best describes "BLUF (Bottom Line Up Front)"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/bluf.yaml` around lines 5 - 7, the question line under the YAML key question contains a superfluous blank line; open the YAML entry with the key question in evaluations/specs/bluf.yaml and remove the blank line between "Which of the following best describes \"BLUF (Bottom Line Up Front)\"?" and the closing string delimiter, so that the question value sits on one continuous line (key: question, search term: "BLUF (Bottom Line Up Front)").
evaluations/specs/fowler-patterns.yaml (2)
13-14: Inconsistent capitalization of the pattern names in Option B.
The pattern names in Option B use inconsistent capitalization. The first four patterns are capitalized correctly (Transaction Script, Domain Model, etc.), but the patterns that follow are lowercase ("table data gateway" instead of "Table Data Gateway"). In Fowler's PEAA, pattern names are treated as proper nouns and capitalized.
✨ Suggested change for consistency
- B: Transaction Script, Domain Model, Table Module, Service Layer; table data
-   gateway, row data gateway, active record, data mapper
+ B: Transaction Script, Domain Model, Table Module, Service Layer; Table Data
+   Gateway, Row Data Gateway, Active Record, Data Mapper
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/fowler-patterns.yaml` around lines 13 - 14, Option B lists pattern names inconsistently: the terms after "Service Layer" are lowercase; change the entries in the Option B line so that all pattern names are capitalized and appear uniformly as proper nouns (e.g. "Table Data Gateway", "Row Data Gateway", "Active Record", "Data Mapper") and keep the existing separators/commas unchanged, so that the line is consistent with "B: Transaction Script, Domain Model, Table Module, Service Layer; Table Data Gateway, Row Data Gateway, Active Record, Data Mapper".
5-8: Remove the unnecessary blank line in the question text.
The question text contains a superfluous blank line (lines 7-8).
✨ Suggested fix
- question: 'Which of the following best describes "Patterns of Enterprise Application
-   Architecture (PEAA)"?
-
-   '
+ question: 'Which of the following best describes "Patterns of Enterprise Application Architecture (PEAA)"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/fowler-patterns.yaml` around lines 5 - 8, the question under the YAML key question contains a superfluous blank line; edit the value of question in evaluations/specs/fowler-patterns.yaml (the single "question" entry) and remove the intermediate line so that the text becomes one continuous line/block, without additional blank lines or leading spaces; afterwards check the YAML indent level so that the formatting stays unchanged.
evaluations/specs/domain-driven-design.yaml (1)
5-8: Remove the unnecessary blank line in the question text.
The question text contains a superfluous blank line at the end (lines 7-8). This reduces readability and is inconsistent.
✨ Suggested fix for the formatting problem
- question: 'Which of the following best describes "Domain-Driven Design according
-   to Evans"?
-
-   '
+ question: 'Which of the following best describes "Domain-Driven Design according to Evans"?'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/domain-driven-design.yaml` around lines 5 - 8, the question line under the key "question" contains a superfluous blank line at the end; remove the additional newline inside the string value for "question" so that the text ends directly after the sentence "According to Evans" (or after the closing quotation mark) without the blank line, e.g. by collapsing the multi-line YAML scalar so there is no trailing blank line in the "question" value.
evaluations/specs/cynefin-framework.yaml (1)
5-7: The question string contains unnecessary blank lines/whitespace.
Lines 5-7 produce a prompt with additional line noise. Please store it as a clean single-line string or as a `|-` block without a blank line, so that all specs are consistent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/cynefin-framework.yaml` around lines 5 - 7, The "question" YAML value currently contains an extra blank line and trailing whitespace; edit the question key to remove the empty line so the prompt is a clean single-line string or convert it to a block scalar (use |) without any leading empty line, ensuring there are no embedded blank lines or trailing spaces in the 'question' value.
evaluations/specs/user-story-mapping.yaml (1)
11-12: Please make the correct Option B somewhat more precise.
Lines 11-12 are rather terse as ground truth. Add the core of story mapping (a backbone of activities, with tasks/stories beneath it, frequently prioritized by release) so that the recognition question discriminates more reliably.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/user-story-mapping.yaml` around lines 11 - 12, extend the description for option "B:" so that it precisely captures the core of story mapping: mention that a story map forms a backbone of higher-level activities (user activities), that the concrete tasks/stories are arranged beneath it, that these are often organized horizontally by sequence/workflow and vertically by level of detail/priority, and that items are typically prioritized for releases or iterations (MVP/release slices); replace the existing terse wording ("Horizontal arrangement of user activities; high-level tasks users perform") with this extended, clearer wording, so that the recognition question discriminates more reliably.
evaluations/specs/chain-of-thought.yaml (1)
11-12: Please word Option B more neutrally, so that CoT is not narrowed to visible chains of thought.
The current wording ties CoT strongly to explicit disclosure of intermediate steps. For anchor recognition, a prompt- or method-oriented description would be more robust.
Suggested wording
- B: Explicitly show intermediate reasoning steps before reaching a conclusion;
-   make the thought process visible, not just the final answer
+ B: Encourage step-by-step reasoning by decomposing the problem into intermediate
+   steps before producing the final answer
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/specs/chain-of-thought.yaml` around lines 11 - 12, the wording of the entry "B: Explicitly show intermediate reasoning steps before reaching a conclusion; make the thought process visible, not just the final answer" is too narrow and forces CoT onto explicit, visible intermediate steps; change the text so that it is neutral about the form of presentation (e.g. encourage chain-of-thought-like reasoning or suitable method/prompt descriptions for anchor recognition) without prescribing visible chains of thought; update the YAML entry "B" accordingly (the replacement wording should explain purpose and flexibility, e.g. "Encourage chain-of-thought style reasoning or method-level description of intermediate steps when helpful for verification/anchor detection").
evaluations/generate-l1-specs.py (2)
60-61: Missing error handling in tier parsing.
`int()` can raise a `ValueError` if the tier value is invalid or empty.
🛡️ More robust variant
  elif line.startswith(":tier:"):
-     result["tier"] = int(line.split(":tier:")[1].strip())
+     tier_str = line.split(":tier:")[1].strip()
+     result["tier"] = int(tier_str) if tier_str.isdigit() else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/generate-l1-specs.py` around lines 60 - 61, The tier parsing currently does result["tier"] = int(line.split(":tier:")[1].strip()) and can raise ValueError for empty/invalid values; update the block (the branch checking line.startswith(":tier:") that sets result["tier"]) to catch ValueError (and optionally TypeError), validate the extracted token before int() (e.g. check non-empty and numeric), and on failure set a safe default (None or skip setting) and/or log a clear warning so parsing doesn't crash the script.
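As an alternative to the `isdigit()` guard, the tier parsing could catch the `ValueError` directly. The sketch below is a standalone illustration, not the code in `generate-l1-specs.py`; the helper name `parse_tier` and the choice of `None` as the fallback are assumptions.

```python
def parse_tier(line):
    """Parse an AsciiDoc attribute line such as ':tier: 2'.

    Returns the tier as an int, or None if the line is not a tier
    attribute or the value is empty or not an integer.
    """
    prefix = ":tier:"
    if not line.startswith(prefix):
        return None
    value = line[len(prefix):].strip()
    try:
        return int(value)
    except ValueError:
        # Empty or malformed tier: fall back instead of crashing the run.
        return None
```

Catching the exception avoids a corner case of the guard: `str.isdigit()` also returns `True` for characters such as `"²"`, which `int()` still rejects with a `ValueError`.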
149-156: Redundant condition in `should_skip`.
The check `anchor_id not in SKIP_EXACT` on line 154 is superfluous, because lines 151-152 already return `True` when `anchor_id` is contained in `SKIP_EXACT`.
♻️ Suggested simplification
  def should_skip(anchor_id):
      """Check if anchor should be skipped."""
      if anchor_id in SKIP_EXACT:
          return True
      for prefix in SKIP_PREFIXES:
-         if anchor_id.startswith(prefix) and anchor_id not in SKIP_EXACT:
+         if anchor_id.startswith(prefix):
              return True
      return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/generate-l1-specs.py` around lines 149 - 156, the function should_skip contains a superfluous condition; remove the check "and anchor_id not in SKIP_EXACT" from the loop over SKIP_PREFIXES and leave the logic as follows: first check whether anchor_id is in SKIP_EXACT (as it does now), and otherwise, in the for loop, only check whether anchor_id.startswith(prefix) for the entries in SKIP_PREFIXES; this leaves the behavior unchanged while eliminating the redundant condition.
evaluations/pilot.py (2)
409-411: Style improvement: unpacking ("spread") syntax for lists.
♻️ More modern syntax
  if lang:
-     variants = variants + [lang]
+     variants = [*variants, lang]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/pilot.py` around lines 409 - 411, The code currently builds up the variants list using an explicit concatenation (variants = variants + [lang]) after obtaining lang from cons.get("language_variant"); replace that with a more idiomatic operation (e.g., append or in-place extend/spread) to avoid creating a new list each time—use variants.append(lang) or variants += [lang] (or variants = [*variants, lang] if you prefer spread semantics) to update the variants list in place where lang is truthy.
256-262: Unused variable `model_id`.
The variable `model_id` from the tuple unpacking is not used.
♻️ Suppress the linter warning
  try:
-     response_text, model_id = call_fn(prompt)
+     response_text, _model_id = call_fn(prompt)
  except Exception as e:
      response_text = f"ERROR: {e}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/pilot.py` around lines 256 - 262, The tuple-unpacked variable model_id returned from call_fn is unused and triggers a linter warning; change the unpacking in the try block to either discard model_id (e.g., assign to _ or _model_id) or adjust call_fn to return only response_text, and make the same change in the except/handling flow so the variable is consistently ignored (reference the call to call_fn and the local name model_id to locate the change).
evaluations/fill-distractors.py (1)
74-82: JSON parsing could fail on unexpected responses.
The extraction from Markdown code blocks (lines 76-80) is fragile: if the response contains several ```-delimited blocks or the JSON is not well-formed, `json.loads` (line 82) can raise an error, which is caught by the outer exception handling in `main()`.
For more robust processing, an explicit try/except with a specific error message could be added.
♻️ More robust JSON extraction
  text = response.content[0].text.strip()
  # Parse JSON from response (might be wrapped in ```json ... ```)
  if "```" in text:
-     text = text.split("```")[1]
-     if text.startswith("json"):
-         text = text[4:]
-     text = text.strip()
+     parts = text.split("```")
+     for part in parts[1::2]:  # odd indices are code blocks
+         cleaned = part.lstrip("json").strip()
+         if cleaned.startswith("{"):
+             text = cleaned
+             break
- return json.loads(text)
+ try:
+     return json.loads(text)
+ except json.JSONDecodeError as e:
+     raise ValueError(f"Could not parse JSON: {e}\nResponse: {text[:200]}") from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/fill-distractors.py` around lines 74 - 82, The current extraction of JSON from response.content[0].text is brittle (splitting on "```" and taking the first block) and then calls json.loads directly; update the extraction in the function handling the response to iterate over code-block parts (e.g., parts = text.split("```") and inspect odd-index parts), strip any leading "json" from a block and select the first block that looks like JSON (e.g., startswith("{") or "["), then wrap the final json.loads(text) call in a try/except catching json.JSONDecodeError and raise a clearer ValueError that includes a short preview of the problematic text; reference the response.content[0].text processing and json.loads call so the main() exception handler will surface a more informative error.
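A self-contained sketch of the extraction described in this comment follows. It is an illustration under assumptions, not the code in `fill-distractors.py`: the function name `extract_json` is hypothetical, and it tries every fenced block in order and then the raw text, instead of only checking for a leading `{`. It uses an explicit prefix check for the `json` language tag because `str.lstrip("json")` treats its argument as a set of characters rather than a prefix.

```python
import json

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable


def extract_json(text):
    """Return the first parseable JSON value found in a model reply.

    Fenced blocks are tried first, in order; the whole reply is the
    last candidate. Raises ValueError with a short preview otherwise.
    """
    candidates = []
    # After splitting on the fence, odd-indexed parts lie inside fences.
    for block in text.split(FENCE)[1::2]:
        block = block.strip()
        if block.startswith("json"):
            block = block[len("json"):].strip()
        candidates.append(block)
    candidates.append(text.strip())

    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    raise ValueError(f"Could not parse JSON from reply: {text[:200]!r}")
```

Trying each candidate with `json.loads` keeps the function tolerant of replies that contain an unrelated code block before the JSON payload, while the final `ValueError` gives the caller a readable preview of what was received.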
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evaluations/results/pilot-20260324-122900.json`:
- Around line 14-63: The results JSON is treating transport errors
(raw_response: "ERROR: HTTP Error 404: Not Found") as normal zero-score runs;
update the result-generation logic to detect transport/HTTP failures (e.g., when
raw_response contains HTTP error codes or the client throws a transport
exception) and mark those entries with run_status: "failed" and error_type:
"transport/http" (or similar); ensure such failed entries are excluded from
computing the top-level score and from ranking/aggregation so that the "score"
field and any aggregations only include runs with run_status: "ok" (inspect and
change the code that populates the "results" array and the score computation to
skip items with run_status != "ok").
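One way to picture the requested behaviour is the sketch below. It assumes that a transport failure is recognizable by the `ERROR:` prefix that `pilot.py` writes into `raw_response`, and that each result carries a boolean `correct` field; the function names `tag_run` and `aggregate_score` and the `correct` field are hypothetical, while `run_status` and `error_type` follow the wording of the comment.

```python
def tag_run(result):
    """Return a copy of the result with run_status (and error_type) set."""
    raw = result.get("raw_response") or ""
    if raw.startswith("ERROR:"):
        return {**result, "run_status": "failed", "error_type": "transport/http"}
    return {**result, "run_status": "ok"}


def aggregate_score(results):
    """Score only the runs that actually reached the model.

    Returns (score, tagged_results); score is None when no run succeeded,
    so a fully failed backend is not reported as 0 % accuracy.
    """
    tagged = [tag_run(r) for r in results]
    ok_runs = [r for r in tagged if r["run_status"] == "ok"]
    if not ok_runs:
        return None, tagged
    score = sum(1 for r in ok_runs if r.get("correct")) / len(ok_runs)
    return score, tagged
```

Returning `None` rather than `0.0` for a backend with no successful runs lets the report distinguish "model answered wrongly" from "model was never reached".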
In `@evaluations/results/pilot-20260324-132552.json`:
- Around line 1-1046: The output file shows empty raw_response fields, meaning
API errors weren't captured; update the evaluation flow where the OpenAI call is
made (the code that sets the "raw_response" field for models/openai and uses
config.openai_model) to wrap the API request in a try/except, populate
raw_response with the caught error message/stack on failure, and ensure the
error is persisted to the results object before writing the JSON file (e.g., in
the function that constructs/persists each model result such as the API-call
handler or result-aggregation routine that currently assigns raw_response).
Ensure the except handler also marks the individual result as failed so downstream
scoring sees the error.
In `@evaluations/specs/adr-according-to-nygard.yaml`:
- Around line 10-12: The current correct option for the "ADR according to
Nygard" anchor is too general (option B); update the option text for the ADR
anchor so it explicitly lists the Nygard ADR sections — e.g., "Status, Context,
Decision, Consequences" — and make that option the clearly correct choice for
the anchor; locate the anchor "ADR according to Nygard" and replace the generic
B description with the explicit Nygard-format wording (mentioning Status,
Context, Decision, Consequences) so the spec matches the review comment.
In `@evaluations/specs/atam.yaml`:
- Around line 13-26: The answer key is wrong: update the YAML so the `correct`
field references option C (the ATAM description) rather than B; locate the block
containing options B/C/D and the `correct: B` entry and change it to `correct:
C` so scoring aligns with the ATAM description.
In `@evaluations/specs/clean-architecture.yaml`:
- Around line 11-12: The option labeled "B" contains a semantic contradiction:
the phrase "Dependencies only point inward" conflicts with the arrow direction
"entities → use cases → interface adapters → frameworks & drivers" (which points
outward). Update the text for the "B" option so the arrow direction matches the
rule (either reverse the arrow to "frameworks & drivers → interface adapters →
use cases → entities" or reword the sentence to match the current arrow), and
ensure the semantic anchor for this item meets quality criteria (precise, rich,
consistent, attributable) before committing the change; locate and edit the
exact string starting with "B: Dependencies only point inward; entities → use
cases → interface adapters → frameworks & drivers".
In `@evaluations/specs/control-chart-shewhart.yaml`:
- Line 10: Update the Shewhart control chart answer text to be more precise:
replace or revise the option currently reading "Measured value plotted over
time; process mean" to explicitly mention plotted sample/statistic over time
with center line (process mean) and the statistical control limits (Upper
Control Limit and Lower Control Limit, UCL/LCL) used to detect out-of-control
signals; ensure the wording in the control-chart-shewhart.yaml entry for option
B references the center line and UCL/LCL (and that UCL/LCL are described as
derived from the process standard error) so the semantic anchor is precise and
testable.
In `@evaluations/specs/conventional-commits.yaml`:
- Around line 11-12: Option B's text ("B: '<type>[!][(optional scope)]:
<description> + optional body/footer; common types'") is incomplete and not
self-contained; update this YAML entry so it fully specifies the conventional
commit format and enumerates typical types (e.g., feat, fix, docs, style,
refactor, perf, test, chore) and explains optional scope, breaking-change marker
(!), and optional body/footer in one sentence so the evaluation can verify both
format and examples.
In `@evaluations/specs/cynefin-framework.yaml`:
- Line 11: Option B currently truncates and misformats the Cynefin description
("B: 'Five domains; : best practices apply, sense-categorize-respond'"); remove
the stray "; :" artifact and replace the value for key B with a concise, correct
formulation that lists the five Cynefin domains and the recommended decision
approach per domain (e.g., "B: 'Five domains — Obvious (best practice),
Complicated (expert analysis), Complex (probe-sense-respond), Chaotic
(act-sense-respond), Disorder (context undetermined)'"), ensuring the text is
precise, complete and consistently formatted for use as a ground-truth item.
In `@evaluations/specs/diataxis-framework.yaml`:
- Line 11: The mapping value for key "B" contains a syntax/text error ("; :")
and a vague description of Diátaxis; update the string for "B" to a clean,
precise description that names the four Diátaxis types and clarifies the
"learning-oriented" anchor (e.g., "Four documentation types: tutorials
(learning-oriented), how‑tos (task‑oriented), reference, and explanations") so
the semantic anchor is precise, rich, and consistent with the Diátaxis
framework.
In `@evaluations/specs/gherkin.yaml`:
- Around line 9-21: The answer key is wrong: the Gherkin definition is option A
(Given-When-Then), not B; update the "correct" key in the file so that it reads
correct: A instead of correct: B, check the block with the option definitions
(A/B/C/D) and the "correct" variable to make sure the right option is
referenced, and confirm that existing tests/validations still pass with the
changed mapping.
In `@evaluations/specs/github-flow.yaml`:
- Line 11: The Ground Truth entry labeled "B: Workflow steps" is too
vague—replace it with a precise, discriminative list of the central GitHub Flow
steps (e.g., "1. Create a feature branch; 2. Commit changes locally; 3. Open a
pull request; 4. Request review and address feedback; 5. Merge into main; 6.
Deploy from main"), ensuring the YAML value for the key "B" contains those
explicit steps, is consistent in phrasing, and attributable to GitHub Flow so
the item becomes evaluable.
In `@evaluations/specs/iec-61508-sil-levels.yaml`:
- Line 11: The option value under key "B" currently reads "Four Safety Integrity
Levels; : 10^-2^ ≤ pfd < 10^-1^ (tolerable risk reduction)" which is
inconsistent (claims four levels but shows only one); update the "B" value to be
semantically consistent by either (a) enumerating all four SIL ranges for
SIL1–SIL4 with their correct pfd intervals, or (b) change the wording to
indicate this is specifically SIL1 and keep the pfd range (10^-2 ≤ pfd < 10^-1)
and adjust the label to "SIL1" and remove the "Four Safety Integrity Levels"
phrase so the anchor is precise, rich, and consistent.
In `@evaluations/specs/invest.yaml`:
- Around line 13-16: Option B currently only documents Independent and
Negotiable; update the Option B text so it lists and briefly describes all six
INVEST criteria (Independent, Negotiable, Valuable, Estimable, Small, Testable)
as the correct choice, and ensure the wording follows the project's
semantic-anchor rules (precise boundaries, rich interconnections, consistency,
and attribution) when adding each criterion.
In `@evaluations/specs/linddun.yaml`:
- Around line 12-14: Option B currently only covers Linkability and
Identifiability; update the Option B description in linddun.yaml (the "Option B"
block) to list and briefly define all seven LINDDUN categories: Linkability,
Identifiability, Non-repudiation, Detectability, Disclosure of information,
Unawareness, and Non-compliance, using concise one-line definitions for each so
the threat model fully reflects the LINDDUN acronym.
In `@evaluations/specs/madr.yaml`:
- Around line 11-12: The current entry for the MADR anchor is too generic
(option "B"); update the MADR anchor definition to list MADR-specific,
attributable characteristics: mention the markdown-based ADR template and
required sections (e.g., Title/Status/Context/Decision/Consequences), use of
front-matter metadata (id, date, author, status), repository-based tracking
(file per ADR with consistent naming), and adherence to the MADR
template/templating tool; replace the vague text in option "B" with this precise
set of attributes referencing the MADR anchor so scores reflect those unique
traits.
In `@evaluations/specs/moscow.yaml`:
- Around line 12-15: The MoSCoW description in moscow.yaml is incomplete: the
categories "Could have" and "Won't have" are missing (currently only the
entries under "B:" and "C:" are worded). Explicitly add the two missing
categories to the YAML with short definitions for "Could have" (desirable,
non-critical features) and "Won't have" (deliberately excluded items) so that
all four MoSCoW categories (Must, Should, Could, Won't) are clearly covered;
keep the existing labels ("B:", "C:") or their equivalent keys consistent and
word the new text in line with the existing descriptions.
In `@evaluations/specs/owasp-top-10.yaml`:
- Around line 12-15: The current entries "B:" and "C:" in owasp-top-10.yaml
summarize two specific categories; instead, replace the description there with
a concise definition of the OWASP Top 10 as a periodically updated list of top
risks for web applications (e.g., "OWASP Top 10: a regularly updated list of
the ten most critical security risks for web applications"), remove the
mixed-in risk details from the "B:"/"C:" fields, and move specific risk or
encryption descriptions into separate entries for the respective risk
categories if needed.
In `@evaluations/specs/semantic-versioning.yaml`:
- Line 11: Option B currently has a syntax error and an incomplete description:
replace the line "B: 'Version format; : incompatible api changes (breaking
changes)'" with a corrected, complete description that fixes the stray
colon/semicolon, capitalizes "API", and explains SemVer parts (MAJOR for
incompatible/ breaking changes, MINOR for backwards-compatible feature
additions, PATCH for backwards-compatible bug fixes); update the value for the
symbol B in semantic-versioning.yaml accordingly so it reads like a single,
clear sentence describing MAJOR.MINOR.PATCH and their meanings.
In `@evaluations/specs/stride.yaml`:
- Around line 12-15: Option B only covers Spoofing and Tampering; update the
entry labeled "B:" in the STRIDE spec to name all six STRIDE threats (Spoofing,
Tampering, Repudiation, Information disclosure, Denial of service, Elevation of
privilege), and for each provide a one-line mitigation mapping (e.g., strong
authentication for Spoofing, integrity controls/digital signatures for
Tampering, non-repudiation/audit logging for Repudiation, encryption/access
controls for Information disclosure, rate-limiting/fault tolerance for Denial of
service, and least privilege/privilege separation for Elevation of privilege) so
the anchor is precise, complete, and actionable.
In `@evaluations/specs/swot.yaml`:
- Around line 12-15: The definition under "B:" describes only internal factors
(Strengths/Weaknesses) and omits external factors (Opportunities/Threats);
update the "B:" entry so that it is either clearly labeled as "internal
strengths/weaknesses" or, better, contains the full SWOT definition (Strengths,
Weaknesses, Opportunities, Threats), or move/extend the content so that all
four components (S, W, O, T) are covered correctly; also check neighboring
entries such as "C:" to make sure the term boundaries stay consistent.
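The two pilot-results findings at the top of this list could be addressed along the following lines. This is only a sketch: `classify_run` and `score` are hypothetical helpers, the `correct` field is an assumed per-result boolean, and the `ERROR:` prefix check is based solely on the `raw_response` value quoted in the first finding.

```python
from typing import Optional


def classify_run(raw_response: str) -> dict:
    """Tag a result as failed when the response is a recorded transport error."""
    # Assumption: failed calls are stored as "ERROR: ..." in raw_response.
    if raw_response.startswith("ERROR:"):
        return {"run_status": "failed", "error_type": "transport/http"}
    return {"run_status": "ok", "error_type": None}


def score(results: list) -> Optional[float]:
    """Fraction of correct answers, counting only runs with run_status == 'ok'."""
    ok_runs = [r for r in results if r.get("run_status") == "ok"]
    if not ok_runs:
        return None  # nothing to score; avoids reporting a misleading 0%
    return sum(1 for r in ok_runs if r.get("correct")) / len(ok_runs)
```

Returning `None` when no run succeeded keeps a fully failed model out of rankings instead of showing it as a zero score.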
---
Minor comments:
In `@evaluations/specs/bem-methodology.yaml`:
- Around line 12-14: Option B is too focused on the Block concept and benefit;
update the BEM anchor text (the option labeled "B") to explicitly state the
Block/Element/Modifier triad as the core concept, include brief examples for
each (e.g., block: "menu", element: "menu__item", modifier: "menu--active"), and
ensure the wording meets the semantic-anchor quality criteria (Precise, Rich,
Consistent, Attributable) by making boundaries clear, activating interconnected
concepts (naming, specificity, maintainability), and keeping attribution to BEM
terminology.
In `@evaluations/specs/c4-diagrams.yaml`:
- Around line 11-13: Item B currently has a typo and an imprecise definition for
the C4 model; update the YAML entry for symbol "B" so it reads a precise
enumeration like "Four levels of abstraction: System Context (system in its
environment), Container (apps/services), Component (high‑level modules), Code
(classes/implementation)" (ensure the stray ":" is removed), keep `correct: B`
unchanged, and make the wording rich and consistent with the C4 terminology so
the semantic anchor meets precision, richness, and consistency criteria.
In `@evaluations/specs/ears-requirements.yaml`:
- Line 11: The option for key B contains duplicated outer quotation marks; open
the definition for B and remove the outer `"` so that only the inner, intended
phrases remain (e.g., change the value assigned to B from '"The <system> shall
<requirement>"; "when <trigger> the <system> shall <requirement>"' to the
version without the additional surrounding quotation marks), check the string
on the YAML line with key B, and make sure that only the intended quotes within
the phrases remain.
In `@evaluations/specs/problem-space-nvc.yaml`:
- Line 13: Replace the lowercase example text "i feel frustrated" with the
grammatically correct "I feel frustrated" in the YAML entry (search for the
exact string "i feel frustrated" in problem-space-nvc.yaml) so the option reads
with proper capitalization and professional tone.
In `@evaluations/specs/todotxt-flavoured-markdown.yaml`:
- Around line 13-14: The description "uses todo.txt priority notation `(a)`,
`(b)`, `(c)` where `(a)` is highest priority" is factually incorrect; change
the notation to the usual uppercase form `(A)`, `(B)`, `(C)`, i.e., replace all
occurrences of `(a)`, `(b)`, `(c)` (and the explanatory clause, e.g., "where
`(a)` is highest") with `(A)`, `(B)`, `(C)` and "where `(A)` is highest" in the
spec, so that the priority notation is documented correctly.
---
Nitpick comments:
In `@evaluations/fill-distractors.py`:
- Around line 74-82: The current extraction of JSON from
response.content[0].text is brittle (splitting on "```" and taking the first
block) and then calls json.loads directly; update the extraction in the function
handling the response to iterate over code-block parts (e.g., parts =
text.split("```") and inspect odd-index parts), strip any leading "json" from a
block and select the first block that looks like JSON (e.g., startswith("{") or
"["), then wrap the final json.loads(text) call in a try/except catching
json.JSONDecodeError and raise a clearer ValueError that includes a short
preview of the problematic text; reference the response.content[0].text
processing and json.loads call so the main() exception handler will surface a
more informative error.
In `@evaluations/generate-l1-specs.py`:
- Around line 60-61: The tier parsing currently does result["tier"] =
int(line.split(":tier:")[1].strip()) and can raise ValueError for empty/invalid
values; update the block (the branch checking line.startswith(":tier:") that
sets result["tier"]) to catch ValueError (and optionally TypeError), validate
the extracted token before int() (e.g. check non-empty and numeric), and on
failure set a safe default (None or skip setting) and/or log a clear warning so
parsing doesn't crash the script.
- Around line 149-156: The function should_skip contains a redundant condition;
remove the check "and anchor_id not in SKIP_EXACT" from the loop over
SKIP_PREFIXES and keep the logic as follows: first check whether anchor_id is
in SKIP_EXACT (as now), and otherwise, in the for loop, only check
anchor_id.startswith(prefix) for the entries in SKIP_PREFIXES; the behavior
stays the same, but the redundant condition is eliminated.
In `@evaluations/pilot.py`:
- Around line 409-411: The code currently builds up the variants list using an
explicit concatenation (variants = variants + [lang]) after obtaining lang from
cons.get("language_variant"); replace that with a more idiomatic operation
(e.g., append or in-place extend/spread) to avoid creating a new list each
time—use variants.append(lang) or variants += [lang] (or variants = [*variants,
lang] if you prefer spread semantics) to update the variants list in place where
lang is truthy.
- Around line 256-262: The tuple-unpacked variable model_id returned from
call_fn is unused and triggers a linter warning; change the unpacking in the try
block to either discard model_id (e.g., assign to _ or _model_id) or adjust
call_fn to return only response_text, and make the same change in the
except/handling flow so the variable is consistently ignored (reference the call
to call_fn and the local name model_id to locate the change).
In `@evaluations/specs/bluf.yaml`:
- Around line 5-7: The question line under the YAML key question contains a
superfluous blank line; open the YAML entry with the key question in
evaluations/specs/bluf.yaml and remove the blank line between "Which of the
following best describes \"BLUF (Bottom Line Up Front)\"?" and the closing
quote of the string, so that the question value sits on one continuous line
(key: question, search term: "BLUF (Bottom Line Up Front)").
In `@evaluations/specs/chain-of-thought.yaml`:
- Around line 11-12: The wording of the entry "B: Explicitly show intermediate
reasoning steps before reaching a conclusion; make the thought process visible,
not just the final answer" is too narrow and forces CoT into explicit, visible
intermediate steps; change the text so that it is neutral about the form of
presentation (e.g., encourage chain-of-thought-like reasoning or suitable
method/prompt descriptions for anchor detection) without prescribing visible
chains of thought; update the YAML entry "B" accordingly (the replacement
wording should explain purpose and flexibility, e.g., "Encourage
chain-of-thought style reasoning or method-level description of intermediate
steps when helpful for verification/anchor detection").
In `@evaluations/specs/cqrs.yaml`:
- Around line 13-16: The description under "B: Bertrand Meyer's principle" /
the CQS entry is too focused on CQS; update it to explicitly describe CQRS by
stating that reads and writes use separate models/interfaces (a Read
model/Query-side and a Write model/Command-side), that reads do not perform
state changes and writes encapsulate intent in immutable command objects, and
note that the two sides can have different representations, validation and
storage; replace or augment the current CQS wording to mention "separate Read
and Write models/interfaces" and ensure the term CQRS appears so readers
understand it's distinct from simple CQS.
In `@evaluations/specs/cynefin-framework.yaml`:
- Around line 5-7: The "question" YAML value currently contains an extra blank
line and trailing whitespace; edit the question key to remove the empty line so
the prompt is a clean single-line string or convert it to a block scalar (use |)
without any leading empty line, ensuring there are no embedded blank lines or
trailing spaces in the 'question' value.
In `@evaluations/specs/domain-driven-design.yaml`:
- Around line 5-8: The question line under the key "question" contains a
superfluous trailing blank line; remove the extra newline inside the string
value for "question" so that the text ends directly after the sentence
"According to Evans" (or after the closing quotation mark) without the blank
line, e.g., by collapsing the multi-line YAML scalar so there is no trailing
blank line in the "question" value.
In `@evaluations/specs/fowler-patterns.yaml`:
- Around line 13-14: Option B lists pattern names inconsistently: after
"Service Layer" the terms are lowercase; change the entries on the option B
line so that all pattern names are capitalized and appear uniformly as proper
names (e.g., "Table Data Gateway", "Row Data Gateway", "Active Record", "Data
Mapper") and leave the existing separators/commas unchanged, so that the line
reads consistently as "B: Transaction Script, Domain Model, Table Module,
Service Layer; Table Data Gateway, Row Data Gateway, Active Record, Data
Mapper".
- Around line 5-8: The question under the YAML key question contains a
superfluous blank line; edit the value of question in
evaluations/specs/fowler-patterns.yaml (the single "question" entry) and remove
the blank line in between so that the text forms one continuous line/block
without extra blank lines or leading spaces; afterwards check the YAML
indentation level so that the formatting stays unchanged.
In `@evaluations/specs/lasr.yaml`:
- Around line 1-22: Add an optional pre-validation note for the semantic anchor
"lasr" by extending the YAML around anchor: lasr (and related
questions.recognition) with a small anchor_quality or validation field that
records the recommended pre-check: evaluate LLM responses to "What concepts do
you associate with <semantic anchor name>?" and capture Recognition, Accuracy,
Depth, and Specificity; include the test prompt, the four quality criteria, and
a short accepted/threshold guideline so later evaluation logic can reference
this metadata during scoring.
In `@evaluations/specs/linddun.yaml`:
- Around line 5-7: The question string under the YAML key question contains a
superfluous trailing blank line; remove the blank line so that the value of
question no longer has a trailing blank line (edit the question value in
evaluations/specs/linddun.yaml and delete the blank line after 'Which of the
following best describes "LINDDUN"?'), and make sure the quotation marks and
indentation stay unchanged.
In `@evaluations/specs/semantic-versioning.yaml`:
- Around line 5-7: Remove the extraneous blank line inside the YAML value for
the question key so the question string is a single continuous line; edit the
question field (question: 'Which of the following best describes "Semantic
Versioning (SemVer)"?') to eliminate the empty line between lines 6–7 while
preserving the surrounding quotes and indentation.
In `@evaluations/specs/socratic-method.yaml`:
- Around line 5-7: The YAML scalar for the key "question" currently uses inline
quotes with additional line breaks; switch to an unambiguous block scalar (pipe
|) for the "question" field and remove the superfluous blank lines/trailing
whitespace so that the content reads exactly as intended (e.g., | followed by
the question line "Which of the following best describes \"Socratic
Method\"?"), make sure the block scalar is indented correctly, and preserve the
exact prompt text without additional blank lines.
In `@evaluations/specs/spc.yaml`:
- Around line 11-12: Option B (the value under key "B") is a fragmented,
verb-less phrase; rewrite it as a complete, grammatically correct sentence or
concise noun phrase to improve clarity and professionalism — for example convert
"Systematic statistical monitoring of running processes; inherent, random
fluctuation — stable and predictable" into a full phrase like "Systematic
statistical monitoring of running processes, showing inherent random
fluctuations that are stable and predictable" (update the string value for key B
in spc.yaml accordingly).
- Around line 5-7: The YAML question value for the key question contains an
unnecessary blank line; edit the question string (the question: field) to remove
the extra empty line and trailing newline so the text is a single contiguous
paragraph—i.e., trim internal/leading/trailing blank lines from the question
value and ensure the question: string reads "Which of the following best
describes \"SPC (Statistical Process Control)\"?" without the blank line.
In `@evaluations/specs/tdd-chicago-school.yaml`:
- Around line 5-7: The question line under the YAML key question contains an
unnecessary blank line; remove the superfluous blank line inside the question
string (in the entry question: 'Which of the following best describes "TDD,
Chicago School"') so that the text sits on one continuous line, and remove any
superfluous spaces or line breaks at the end of the string.
In `@evaluations/specs/user-story-mapping.yaml`:
- Around line 11-12: Extend the description for option "B:" so that it
precisely captures the core of story mapping: mention that a story map forms a
backbone of high-level activities (user activities), that the concrete
tasks/stories are arranged beneath it, that these are often organized
horizontally by flow/workflow and vertically by level of detail/priority, and
that items are typically prioritized for releases or iterations (MVP/release
slices); replace the existing terse wording ("Horizontal arrangement of user
activities; high-level tasks users perform") with this extended, clearer
wording so that the recognition question discriminates more reliably.
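For the `:tier:` parsing nitpick in the list above, one possible shape is sketched below. `parse_tier` is a hypothetical helper, and it assumes the caller has already checked `line.startswith(":tier:")`; returning `None` is one of the safe defaults the comment allows.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def parse_tier(line: str) -> Optional[int]:
    """Return the integer after ':tier:', or None if it is missing or invalid."""
    token = line.split(":tier:", 1)[1].strip()
    try:
        return int(token)
    except ValueError:
        # Empty or non-numeric value: warn and carry on instead of crashing.
        logger.warning("Ignoring invalid :tier: value %r", token)
        return None
```

The caller can then set `result["tier"]` only when the returned value is not `None`.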
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 487eff58-252c-4849-a278-20a7a39827e6
📒 Files selected for processing (78)
evaluations/.gitignore
evaluations/README.adoc
evaluations/fill-distractors.py
evaluations/generate-l1-specs.py
evaluations/pilot.py
evaluations/results/pilot-20260324-122900.json
evaluations/results/pilot-20260324-123126.json
evaluations/results/pilot-20260324-131427.json
evaluations/results/pilot-20260324-131452.json
evaluations/results/pilot-20260324-132301.json
evaluations/results/pilot-20260324-132431.json
evaluations/results/pilot-20260324-132552.json
evaluations/results/pilot-20260324-135717.json
evaluations/results/pilot-20260324-140232.json
evaluations/results/pilot-20260324-161744.json
evaluations/specs/adr-according-to-nygard.yaml
evaluations/specs/arc42.yaml
evaluations/specs/atam.yaml
evaluations/specs/bdd-given-when-then.yaml
evaluations/specs/bem-methodology.yaml
evaluations/specs/bluf.yaml
evaluations/specs/c4-diagrams.yaml
evaluations/specs/chain-of-thought.yaml
evaluations/specs/clean-architecture.yaml
evaluations/specs/control-chart-shewhart.yaml
evaluations/specs/conventional-commits.yaml
evaluations/specs/cqrs.yaml
evaluations/specs/cynefin-framework.yaml
evaluations/specs/definition-of-done.yaml
evaluations/specs/devils-advocate.yaml
evaluations/specs/diataxis-framework.yaml
evaluations/specs/docs-as-code.yaml
evaluations/specs/domain-driven-design.yaml
evaluations/specs/ears-requirements.yaml
evaluations/specs/event-driven-architecture.yaml
evaluations/specs/fagan-inspection.yaml
evaluations/specs/feynman-technique.yaml
evaluations/specs/five-whys.yaml
evaluations/specs/fowler-patterns.yaml
evaluations/specs/gherkin.yaml
evaluations/specs/github-flow.yaml
evaluations/specs/gutes-deutsch-wolf-schneider.yaml
evaluations/specs/hexagonal-architecture.yaml
evaluations/specs/iec-61508-sil-levels.yaml
evaluations/specs/impact-mapping.yaml
evaluations/specs/invest.yaml
evaluations/specs/iso-25010.yaml
evaluations/specs/jobs-to-be-done.yaml
evaluations/specs/lasr.yaml
evaluations/specs/linddun.yaml
evaluations/specs/llm-evaluations.yaml
evaluations/specs/madr.yaml
evaluations/specs/mece.yaml
evaluations/specs/morphological-box.yaml
evaluations/specs/moscow.yaml
evaluations/specs/mutation-testing.yaml
evaluations/specs/negative-control.yaml
evaluations/specs/nelson-rules.yaml
evaluations/specs/owasp-top-10.yaml
evaluations/specs/plain-english-strunk-white.yaml
evaluations/specs/prd.yaml
evaluations/specs/problem-space-nvc.yaml
evaluations/specs/property-based-testing.yaml
evaluations/specs/pyramid-principle.yaml
evaluations/specs/sanity-check.yaml
evaluations/specs/semantic-versioning.yaml
evaluations/specs/socratic-method.yaml
evaluations/specs/sota.yaml
evaluations/specs/spc.yaml
evaluations/specs/stride.yaml
evaluations/specs/swot.yaml
evaluations/specs/tdd-chicago-school.yaml
evaluations/specs/tdd-london-school.yaml
evaluations/specs/testing-pyramid.yaml
evaluations/specs/timtowtdi.yaml
evaluations/specs/todotxt-flavoured-markdown.yaml
evaluations/specs/user-story-mapping.yaml
evaluations/specs/wardley-mapping.yaml
B: Impersonating another user, process, or system to gain unauthorized access;
mitigated by strong authentication; unauthorized modification of data in transit
or at rest; mitigated by integrity controls, digital signatures, and access
controls
"STRIDE" is only partially described as the correct answer.
Lines 12–15 cover essentially only Spoofing and Tampering. For a valid anchor evaluation, the correct option should name the full STRIDE model (S, T, R, I, D, E).
Suggested complete option B
- B: Impersonating another user, process, or system to gain unauthorized access;
- mitigated by strong authentication; unauthorized modification of data in transit
- or at rest; mitigated by integrity controls, digital signatures, and access
- controls
+ B: Threat-modeling approach with six categories: Spoofing, Tampering, Repudiation,
+ Information Disclosure, Denial of Service, and Elevation of Privilege; used to
+ systematically identify threats and suitable countermeasures
Based on learnings: Validate semantic anchors meet quality criteria before adding: Precise (specific body of knowledge with clear boundaries), Rich (activates multiple interconnected concepts), Consistent, Attributable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evaluations/specs/stride.yaml` around lines 12 - 15, Option B only covers
Spoofing and Tampering; update the entry labeled "B:" in the STRIDE spec to name
all six STRIDE threats (Spoofing, Tampering, Repudiation, Information
disclosure, Denial of service, Elevation of privilege), and for each provide a
one-line mitigation mapping (e.g., strong authentication for Spoofing, integrity
controls/digital signatures for Tampering, non-repudiation/audit logging for
Repudiation, encryption/access controls for Information disclosure,
rate-limiting/fault tolerance for Denial of service, and least
privilege/privilege separation for Elevation of privilege) so the anchor is
precise, complete, and actionable.
Acknowledged: only Spoofing/Tampering are covered. Will be completed.
Rate Limit Exceeded
@raifdmueller have exceeded the limit for the number of chat messages per hour. Please wait 1 minutes and 44 seconds before sending another message.
Fixed 5 specs where auto-generated correct answers were wrong (ATAM, Gherkin, ADR Nygard, LINDDUN, OWASP Top 10).
Results (75 questions each, excluding sanity check):
- Claude Sonnet: 100% (perfect across all anchors)
- GPT-4o: 96% (LASR 75%, PRD 75%, TDD position bias)
- Mistral Large: 94% (PRD 0%, LASR 25%, Control Chart 50%)
Key finding: auto-generated specs need human review. Claude scoring 0% on ATAM/Gherkin proved the specs were wrong, not Claude.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Produces a self-contained HTML report with:
- Model summary cards with average scores
- Color-coded heatmap (anchor × model)
- Expandable detail rows for multi-question anchors
- Separate control questions section
- Failure detail per model
- Run metadata
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Kept only the 3 final runs (post spec-fix):
- Claude Sonnet: 100% (170530)
- Mistral Large: 94% (171539)
- GPT-4o: 96% (171546)
Fixed report generator to prefer latest run when question count is equal.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
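A sketch of the tie-breaking rule this commit describes, assuming result files follow the `pilot-YYYYMMDD-HHMMSS.json` naming seen in this PR, so that sorting by file name is chronological. `pick_best_runs` and the tuple layout are hypothetical and not taken from `generate-report.py`.

```python
def pick_best_runs(runs):
    """Pick one run per model: most questions wins, ties go to the latest file.

    `runs` is an iterable of (filename, model, results) tuples (assumed layout).
    """
    best = {}
    # Timestamped names sort chronologically, so later files are visited last.
    for filename, model, results in sorted(runs, key=lambda run: run[0]):
        current = best.get(model)
        # ">=" rather than ">" lets a later run replace an equally sized one.
        if current is None or len(results) >= len(current["results"]):
            best[model] = {"file": filename, "results": results}
    return best
```

With a strict `>` comparison the first loaded run would win ties, which is the behavior the commit message says was fixed.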
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evaluations/generate-report.py`:
- Around line 167-171: The HTML templates render unescaped dynamic values (e.g.,
display, info['file'], label, anchor_id) causing possible HTML injection/XSS;
fix by adding and using a central escape helper (e.g., html_escape) and apply it
wherever strings are interpolated into templates — replace direct uses of
display, info['file'], label, anchor_id (and any other user-driven values at the
noted sites including where avg/n are shown) with escaped versions before
concatenation, and ensure score_color remains safe or validated separately.
- Around line 25-32: The report's MODEL_DISPLAY mapping omits "deepseek",
causing DeepSeek runs to be hidden; add an entry "deepseek": "DeepSeek" to the
MODEL_DISPLAY dict and update any rendering usage to use a safe lookup (e.g.,
MODEL_DISPLAY.get(model_key, model_key or model_key.title())) so unknown models
fall back to a readable label; search for usages of MODEL_DISPLAY and the code
that builds the model list (references to MODEL_DISPLAY and the
report-generation function) and replace direct indexing with a .get fallback.
- Around line 40-42: When reading the files matching "pilot-*.json" (variable f
in the loop over RESULTS_DIR.glob), robustness is missing: a malformed JSON in
json.load(open(...)) aborts the entire report generation. To fix this, open
each file with with open(f, encoding="utf-8") as fh and wrap json.load(fh) in a
try/except block (catching at least JSONDecodeError and OSError), log a
meaningful error including the file name and exception, and skip the broken
file with continue; then continue processing d["models"].items() (m, r) as
before.
- Around line 43-44: The current comparison selects only by len(r) and, on a
tie, keeps the run that was loaded first (the older one); change the condition
for setting results[m] so that on a tie the newer run is preferred based on a
timestamp field (e.g. r.get("timestamp")). Specifically: extend the if
condition (affecting results, m, r, results[m]["data"]) with an additional
tie-break: (len(r) == len(results[m]["data"]) and r.get("timestamp", 0) >
results[m].get("timestamp", 0)), and make sure the timestamp field is also
stored when setting results[m], so that future comparisons work correctly.
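For the escaping issue raised for lines 167-171 above, a minimal sketch of a central helper built on the standard library's `html.escape` might look like this. The function `render_cell`, its parameters, and the colour allow-list are assumptions made for illustration; the real template code in `generate-report.py` is not shown in this thread.

```python
import html


def html_escape(value):
    """Escape a dynamic value for use in HTML text or quoted attribute values."""
    return html.escape(str(value), quote=True)


def render_cell(label, anchor_id, score_color):
    """Hypothetical table cell renderer, not the repository's actual template."""
    # score_color is assumed to come from a fixed palette, so it is checked
    # against an allow-list instead of being escaped.
    allowed = {"#2e7d32", "#f9a825", "#c62828"}
    color = score_color if score_color in allowed else "#999999"
    return (
        f'<td id="{html_escape(anchor_id)}" style="background:{color}">'
        f"{html_escape(label)}</td>"
    )


print(render_cell("<script>alert(1)</script>", 'stride"x', "#2e7d32"))
```

Keeping the escaping in one helper makes it easier to audit every interpolation site than escaping ad hoc at each call.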
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 4047fd8f-94e2-4a96-9e5f-bb3c799aa13d
📒 Files selected for processing (2)
- evaluations/generate-report.py
- evaluations/report.html
MODEL_DISPLAY = {
    "claude": "Claude Sonnet",
    "claude-cli": "Claude Sonnet (CLI)",
    "claude-haiku": "Claude Haiku",
    "openai": "GPT-4o",
    "mistral": "Mistral Large",
    "ollama": "Ollama (local)",
}
DeepSeek results are currently never displayed.
Lines 78–85 build the model list from hard-coded names; deepseek is missing entirely. As a result, an existing run can remain invisible in the report.
🔧 Suggestion: extend the model list and add a fallback
 MODEL_DISPLAY = {
     "claude": "Claude Sonnet",
     "claude-cli": "Claude Sonnet (CLI)",
     "claude-haiku": "Claude Haiku",
     "openai": "GPT-4o",
     "mistral": "Mistral Large",
+    "deepseek": "DeepSeek",
     "ollama": "Ollama (local)",
 }
@@
-    for m in ["claude", "openai", "mistral"]:
+    for m in ["claude", "openai", "mistral", "deepseek"]:
         if m in results and len(results[m]["data"]) >= 60:
             model_names.append(m)
@@
     for m in ["claude-cli", "claude-haiku", "ollama"]:
         if m in results and m not in model_names:
             model_names.append(m)
+
+    for m in sorted(results.keys()):
+        if m not in model_names:
+            model_names.append(m)
Also applies to: 78-85
Acknowledged: DeepSeek will be added once an API key is available.
for f in sorted(RESULTS_DIR.glob("pilot-*.json")):
    d = json.load(open(f, encoding="utf-8"))
    for m, r in d["models"].items():
A corrupted result file stops the entire report generation.
Lines 40–42 have no error handling when loading. A single malformed JSON file prevents the whole report from being generated.
🔧 Suggestion: load robustly and skip broken files
-    for f in sorted(RESULTS_DIR.glob("pilot-*.json")):
-        d = json.load(open(f, encoding="utf-8"))
-        for m, r in d["models"].items():
+    for f in sorted(RESULTS_DIR.glob("pilot-*.json")):
+        try:
+            with f.open(encoding="utf-8") as fh:
+                d = json.load(fh)
+            model_results = d.get("models", {})
+        except (OSError, json.JSONDecodeError) as exc:
+            print(f"Skipping {f.name}: {exc}")
+            continue
+
+        for m, r in model_results.items():
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
for f in sorted(RESULTS_DIR.glob("pilot-*.json")):
    try:
        with f.open(encoding="utf-8") as fh:
            d = json.load(fh)
        model_results = d.get("models", {})
    except (OSError, json.JSONDecodeError) as exc:
        print(f"Skipping {f.name}: {exc}")
        continue
    for m, r in model_results.items():
Acknowledged: error handling will be added.
if m not in results or len(r) > len(results[m]["data"]):
    results[m] = {
On a tie, an outdated run is selected.
Lines 43 and 44 compare only len(r). When the question count is equal, the run loaded first (typically the older one) stays active and newer results are discarded.
🔧 Suggested fix (tie-break via timestamp)
-            if m not in results or len(r) > len(results[m]["data"]):
+            prev = results.get(m)
+            prev_len = len(prev["data"]) if prev else -1
+            prev_ts = prev.get("timestamp", "") if prev else ""
+            curr_ts = d.get("timestamp", "")
+            if (
+                prev is None
+                or len(r) > prev_len
+                or (len(r) == prev_len and curr_ts > prev_ts)
+            ):
                 results[m] = {
                     "data": r,
                     "file": f.name,
                     "config": d.get("config", {}),
                     "duration": d.get("duration_seconds", 0),
                     "timestamp": d.get("timestamp", ""),
                 }
Fixed: using >= instead of > ensures that the newest run wins.
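The effect of comparing with `>=` can be sketched as follows. This is a simplified stand-in for the loop in `generate-report.py`, assuming the runs are visited in chronological order (the original iterates over `sorted(RESULTS_DIR.glob("pilot-*.json"))`, so this holds only if file names sort by run time). The function `pick_runs` and the file names are made up for the example.

```python
def pick_runs(runs):
    """Select one run per model from (file_name, {model: answers}) pairs.

    Assumes `runs` is ordered oldest first.
    """
    results = {}
    for file_name, models in runs:
        for model, answers in models.items():
            prev = results.get(model)
            # ">=" lets a later run replace an earlier run of equal size;
            # with ">" the first-loaded (older) run would be kept on a tie.
            if prev is None or len(answers) >= len(prev["data"]):
                results[model] = {"data": answers, "file": file_name}
    return results


runs = [
    ("pilot-170530.json", {"claude": ["A", "B"]}),
    ("pilot-171539.json", {"claude": ["A", "C"]}),
    ("pilot-171546.json", {"claude": ["A"]}),  # shorter run, does not win
]
print(pick_runs(runs)["claude"]["file"])
```

Compared with the timestamp tie-break suggested in the review, this variant is shorter but depends on the iteration order rather than on data stored in the result files.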
…g#332) Auto-generated realistic scenarios with anchor prompt, calibrated paraphrase, and 4 MC options per anchor using Claude API.
Each question tests whether the model APPLIES the methodology:
- Scenario: concrete software engineering situation
- Anchor variant: "using <anchor name>"
- Paraphrase variant: goal description without naming the method
- 4 options: one correct (reflects methodology), three plausible alternatives
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Shows e.g. [42/193] arc42 / recognition... 100%
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Old paraphrase "validate and deepen understanding" was too vague; any learning method would score. New paraphrase "identify and fill knowledge gaps" describes the goal without naming the method. If models still score low on the paraphrase, that shows the anchor name adds genuine value beyond the goal description.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 5
♻️ Duplicate comments (3)
evaluations/generate-report.py (3)
167-170: ⚠️ Potential issue | 🟠 Major
Escape dynamic values before rendering.
label, anchor_id, info['file'] and model names are interpolated directly into the HTML. A manipulated result JSON, or even just an unusual file name, is enough for HTML injection in the generated report.
Also applies to: 194-194, 209-209, 230-230, 248-252, 264-264
40-41: ⚠️ Potential issue | 🟠 Major
A broken result JSON breaks the whole report.
Without error handling on load, a single corrupt file under evaluations/results/ is enough to stop the entire report generation. It is better to catch OSError and JSONDecodeError per file and skip only the affected run.
🔧 Suggestion
-    for f in sorted(RESULTS_DIR.glob("pilot-*.json")):
-        d = json.load(open(f, encoding="utf-8"))
-        for m, r in d["models"].items():
+    for f in sorted(RESULTS_DIR.glob("pilot-*.json")):
+        try:
+            with f.open(encoding="utf-8") as fh:
+                d = json.load(fh)
+        except (OSError, json.JSONDecodeError) as exc:
+            print(f"Skipping {f.name}: {exc}")
+            continue
+
+        for m, r in d.get("models", {}).items():
25-32: ⚠️ Potential issue | 🟠 Major
The model list and display names are still hard-wired.
deepseek is never added to model_names, partial claude/openai/mistral runs disappear completely below the threshold of 60, and openai/ollama are shown under static names regardless of the actual run. As a result, the report hides existing results or labels them incorrectly.
Also applies to: 77-85, 162-165, 261-265
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evaluations/generate-l2-specs.py`:
- Around line 96-137: The script writes whatever generate_application(spec)
returns directly into spec["questions"]["application"], which can produce broken
YAML if required fields are missing; before assigning and persisting, validate
the returned app object (from generate_application) contains the required keys
like "scenario", "anchor_prompt", "paraphrase_prompt" and an "options" mapping
with entries "A","B","C","D" (or whatever option keys your spec expects); if
validation fails, log an error and skip writing (or raise) so the file is not
corrupted. Add a small validator function (e.g., validate_application(app)) and
call it in main() right after app = generate_application(spec) and before
spec["questions"]["application"] = app; if validation returns False,
print/processLogger.error with the anchor id and continue to next spec.
In `@evaluations/pilot.py`:
- Around line 47-56: The prompt generation in build_prompt only emits letters
A–D so a correct label "X" can never be returned (making sanity checks
impossible); update build_prompt to include the full set of possible response
tokens (including "X") when building options — e.g. iterate over the full
LETTERS set or detect when "X" is a valid option and append an "X) ..." line —
and ensure this aligns with parse_response so that parse_response and
build_prompt use the same symbol set (reference: build_prompt and
parse_response).
- Line 257: The results are being stored only under the backend alias, causing
different model variants to overwrite each other; update the storage key
generation where call_fn returns response_text, model_id (and in the analogous
block at lines 332-365) to include the concrete model_id (e.g., combine backend
alias + model_id or use model_id alone) so each run is saved under a unique key,
and ensure generate-report.py reads that composite key (backend+model_id) when
aggregating runs.
- Around line 71-90: The current parse_response() fallback is too permissive and
picks up any capital A-D inside words (e.g., "ERROR"); update parse_response to
only match freestanding option letters by replacing the character-based fallback
with a regex that finds standalone option tokens (case-insensitive) such as
boundaries or common option formats like "A", "A)", "A.", or wrapped in markup
(e.g., "**A**"); keep the existing stripping of <think> blocks and the
strongest-line check, but change the final search in parse_response() to use a
regex like a word-boundary or negative-word-character lookaround for A-D so only
isolated option letters are returned (use line/token matching on the cleaned
text and return uppercased letter).
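The composite storage key proposed for `pilot.py` above could look roughly like this. It is only a sketch: `result_key`, `split_key`, the `:` separator, and the model ids used below are assumptions for illustration, not names from the repository.

```python
def result_key(backend, model_id):
    """Combine backend alias and concrete model id into one storage key."""
    return f"{backend}:{model_id}" if model_id else backend


def split_key(key):
    """Inverse of result_key; the model id is empty if none was recorded."""
    backend, _, model_id = key.partition(":")
    return backend, model_id


results = {}
# Two variants served by the same backend no longer overwrite each other.
results[result_key("ollama", "qwen3:8b")] = {"stride/recognition": 1.0}
results[result_key("ollama", "llama3")] = {"stride/recognition": 0.5}

for key in results:
    print(split_key(key))
```

Splitting on the first `:` only keeps model ids that themselves contain a colon intact; the report generator would need to apply the same convention when aggregating runs.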
In `@evaluations/specs/bluf.yaml`:
- Around line 22-24: Die `application-paraphrase`-Variante ist zu führend: passe
den Wert von paraphrase_prompt so an, dass er BLUF neutral beschreibt als
„Kommunikation für zeitlich begrenzte, entscheidungsrelevante Situationen“ ohne
die Reihenfolge der Informationen vorzuschreiben; konkret, ersetzen Sie die
Formulierung "die wichtigsten Informationen zuerst" durch eine Formulierung, die
auf Zeitdruck und Entscheidungsbedarf hinweist (z. B. "für zeitlich begrenzte
Situationen, in denen schnelle Entscheidungen getroffen werden müssen"), und
aktualisiere die YAML-Eintragung für paraphrase_prompt (neben anchor_prompt)
entsprechend.
---
Duplicate comments:
In `@evaluations/generate-report.py`:
- Around line 167-170: The HTML template concatenation injects unescaped dynamic
values (display/model names, label, anchor_id, info['file'], etc.) allowing HTML
injection; fix by HTML-escaping these variables before interpolation (e.g., call
an escape utility such as html.escape on display, any model name variables,
label, anchor_id, and info['file']) and use the escaped versions when building
the HTML strings (ensure the same change is applied to the other template sites
noted). Also validate/normalize anchor_id to safe characters if used as an
element id.
- Around line 40-41: When reading the individual result JSONs (loop over
sorted(RESULTS_DIR.glob("pilot-*.json")) and json.load(open(...))), error
handling is missing, so a broken JSON file breaks the entire report generation;
to fix this, wrap the open and json.load calls in a try/except, catch OSError
and json.JSONDecodeError, and on error log the faulty file (e.g. with print or
an existing logger) and continue to the next file, so that only the affected
run is skipped; ideally use with open(...) as a context manager when reading.
- Around line 25-32: The report currently uses a hard-coded MODEL_DISPLAY and
static model_names which omits "deepseek", hides models with partial runs under
the 60% threshold, and always labels "openai"/"ollama" the same regardless of
actual run keys; update the logic so that model_names are derived from the
actual result keys present in the data (include any "deepseek" or other dynamic
keys), use MODEL_DISPLAY only as a fallback for pretty names (default to a
sensible generated name like capitalized key if no mapping exists), and change
the threshold/aggregation code that filters models (the block referencing the 60
threshold and the places where model lists are built/filtered around model_names
and MODEL_DISPLAY) to always include models with any results (show partial-run
counts/percentages instead of removing them) so labels and presence reflect
actual runs.
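One way to derive the model list from the result keys, as the last item above asks, is sketched below. `PREFERRED_ORDER`, `display_name`, and `ordered_models` are hypothetical names, and the shortened `MODEL_DISPLAY` dict only stands in for the real mapping.

```python
# Shortened stand-in for the real mapping in generate-report.py.
MODEL_DISPLAY = {
    "claude": "Claude Sonnet",
    "openai": "GPT-4o",
    "mistral": "Mistral Large",
}
# Assumed column order for well-known backends; everything else follows sorted.
PREFERRED_ORDER = ["claude", "openai", "mistral"]


def display_name(key):
    """Pretty name if known, otherwise a readable label derived from the key."""
    return MODEL_DISPLAY.get(key, key.replace("-", " ").title())


def ordered_models(results):
    """Every model with any results, known backends first."""
    known = [m for m in PREFERRED_ORDER if m in results]
    extra = sorted(m for m in results if m not in PREFERRED_ORDER)
    return known + extra


results = {"deepseek": {}, "openai": {}, "claude": {}}
print([display_name(m) for m in ordered_models(results)])
```

Because no model is filtered out by question count, partial runs stay visible; the report would then need to show how many questions each run covered.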
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 1150b2b3-9aac-4ec4-86c8-2db8f8919779
📒 Files selected for processing (63)
- evaluations/generate-l2-specs.py
- evaluations/generate-report.py
- evaluations/pilot.py
- evaluations/report.html
- evaluations/specs/adr-according-to-nygard.yaml
- evaluations/specs/atam.yaml
- evaluations/specs/bdd-given-when-then.yaml
- evaluations/specs/bem-methodology.yaml
- evaluations/specs/bluf.yaml
- evaluations/specs/c4-diagrams.yaml
- evaluations/specs/chain-of-thought.yaml
- evaluations/specs/clean-architecture.yaml
- evaluations/specs/control-chart-shewhart.yaml
- evaluations/specs/conventional-commits.yaml
- evaluations/specs/cqrs.yaml
- evaluations/specs/cynefin-framework.yaml
- evaluations/specs/definition-of-done.yaml
- evaluations/specs/devils-advocate.yaml
- evaluations/specs/diataxis-framework.yaml
- evaluations/specs/docs-as-code.yaml
- evaluations/specs/domain-driven-design.yaml
- evaluations/specs/ears-requirements.yaml
- evaluations/specs/event-driven-architecture.yaml
- evaluations/specs/fagan-inspection.yaml
- evaluations/specs/feynman-technique.yaml
- evaluations/specs/five-whys.yaml
- evaluations/specs/fowler-patterns.yaml
- evaluations/specs/gherkin.yaml
- evaluations/specs/github-flow.yaml
- evaluations/specs/gutes-deutsch-wolf-schneider.yaml
- evaluations/specs/hexagonal-architecture.yaml
- evaluations/specs/iec-61508-sil-levels.yaml
- evaluations/specs/impact-mapping.yaml
- evaluations/specs/invest.yaml
- evaluations/specs/iso-25010.yaml
- evaluations/specs/jobs-to-be-done.yaml
- evaluations/specs/lasr.yaml
- evaluations/specs/linddun.yaml
- evaluations/specs/llm-evaluations.yaml
- evaluations/specs/madr.yaml
- evaluations/specs/mece.yaml
- evaluations/specs/morphological-box.yaml
- evaluations/specs/moscow.yaml
- evaluations/specs/mutation-testing.yaml
- evaluations/specs/nelson-rules.yaml
- evaluations/specs/owasp-top-10.yaml
- evaluations/specs/plain-english-strunk-white.yaml
- evaluations/specs/prd.yaml
- evaluations/specs/problem-space-nvc.yaml
- evaluations/specs/property-based-testing.yaml
- evaluations/specs/pyramid-principle.yaml
- evaluations/specs/semantic-versioning.yaml
- evaluations/specs/socratic-method.yaml
- evaluations/specs/sota.yaml
- evaluations/specs/spc.yaml
- evaluations/specs/stride.yaml
- evaluations/specs/swot.yaml
- evaluations/specs/tdd-chicago-school.yaml
- evaluations/specs/testing-pyramid.yaml
- evaluations/specs/timtowtdi.yaml
- evaluations/specs/todotxt-flavoured-markdown.yaml
- evaluations/specs/user-story-mapping.yaml
- evaluations/specs/wardley-mapping.yaml
✅ Files skipped from review due to trivial changes (53)
- evaluations/specs/invest.yaml
- evaluations/specs/c4-diagrams.yaml
- evaluations/specs/clean-architecture.yaml
- evaluations/specs/gherkin.yaml
- evaluations/specs/bdd-given-when-then.yaml
- evaluations/specs/definition-of-done.yaml
- evaluations/specs/stride.yaml
- evaluations/specs/cqrs.yaml
- evaluations/specs/hexagonal-architecture.yaml
- evaluations/specs/iso-25010.yaml
- evaluations/specs/atam.yaml
- evaluations/specs/event-driven-architecture.yaml
- evaluations/specs/devils-advocate.yaml
- evaluations/specs/bem-methodology.yaml
- evaluations/specs/mutation-testing.yaml
- evaluations/specs/semantic-versioning.yaml
- evaluations/specs/swot.yaml
- evaluations/specs/lasr.yaml
- evaluations/specs/linddun.yaml
- evaluations/specs/github-flow.yaml
- evaluations/specs/adr-according-to-nygard.yaml
- evaluations/specs/todotxt-flavoured-markdown.yaml
- evaluations/specs/problem-space-nvc.yaml
- evaluations/specs/owasp-top-10.yaml
- evaluations/specs/sota.yaml
- evaluations/specs/prd.yaml
- evaluations/specs/moscow.yaml
- evaluations/specs/impact-mapping.yaml
- evaluations/specs/ears-requirements.yaml
- evaluations/specs/user-story-mapping.yaml
- evaluations/specs/property-based-testing.yaml
- evaluations/specs/domain-driven-design.yaml
- evaluations/specs/pyramid-principle.yaml
- evaluations/specs/socratic-method.yaml
- evaluations/specs/mece.yaml
- evaluations/specs/tdd-chicago-school.yaml
- evaluations/specs/conventional-commits.yaml
- evaluations/specs/timtowtdi.yaml
- evaluations/specs/testing-pyramid.yaml
- evaluations/specs/five-whys.yaml
- evaluations/specs/cynefin-framework.yaml
- evaluations/report.html
- evaluations/specs/diataxis-framework.yaml
- evaluations/specs/spc.yaml
- evaluations/specs/wardley-mapping.yaml
- evaluations/specs/gutes-deutsch-wolf-schneider.yaml
- evaluations/specs/control-chart-shewhart.yaml
- evaluations/specs/feynman-technique.yaml
- evaluations/specs/nelson-rules.yaml
- evaluations/specs/fagan-inspection.yaml
- evaluations/specs/fowler-patterns.yaml
- evaluations/specs/jobs-to-be-done.yaml
- evaluations/specs/iec-61508-sil-levels.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
- evaluations/specs/plain-english-strunk-white.yaml
- evaluations/specs/chain-of-thought.yaml
- evaluations/specs/morphological-box.yaml
- evaluations/specs/llm-evaluations.yaml
- evaluations/specs/madr.yaml
    text = response.content[0].text.strip()
    if "```" in text:
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[4:]
    text = text.strip()

    return json.loads(text)


def main():
    parser = argparse.ArgumentParser(description="Generate L2 Application questions using Claude API")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--anchor", help="Process single anchor")
    args = parser.parse_args()

    specs_to_fill = []
    for f in sorted(SPECS_DIR.glob("*.yaml")):
        spec = yaml.safe_load(f.read_text(encoding="utf-8"))
        if spec["anchor"] in SKIP_ANCHORS:
            continue
        if args.anchor and spec["anchor"] != args.anchor:
            continue
        if needs_application(spec):
            specs_to_fill.append((f, spec))

    print(f"Found {len(specs_to_fill)} specs needing L2 Application questions")

    for filepath, spec in specs_to_fill:
        anchor_id = spec["anchor"]
        print(f" {anchor_id}...", end=" ", flush=True)

        if args.dry_run:
            print("(dry run)")
            continue

        try:
            app = generate_application(spec)
            spec["questions"]["application"] = app

            with open(filepath, "w", encoding="utf-8") as fh:
                yaml.dump(spec, fh, default_flow_style=False, allow_unicode=True, sort_keys=False)
Validate the Claude response before persisting it.
Here arbitrary JSON ends up directly in the YAML. If scenario/anchor_prompt/paraphrase_prompt or the options A-D are missing, the script writes a broken spec to disk and evaluations/pilot.py later fails on direct index access.
🔧 Suggestion
+def validate_application(app):
+    if not isinstance(app, dict):
+        raise ValueError("application must be a JSON object")
+
+    required = {"scenario", "anchor_prompt", "paraphrase_prompt", "options", "correct"}
+    missing = required - app.keys()
+    if missing:
+        raise ValueError(f"missing application keys: {sorted(missing)}")
+
+    if not isinstance(app["options"], dict) or set(app["options"]) != {"A", "B", "C", "D"}:
+        raise ValueError("options must contain exactly A, B, C and D")
+
+    if app["correct"] not in {"A", "B", "C", "D"}:
+        raise ValueError("correct must be one of A-D")
+
+    return app
+
@@
-    return json.loads(text)
+    return validate_application(json.loads(text))
Acknowledged: validation will be added.
def build_prompt(question_text, options, permutation):
    """Build a prompt with options in the given permutation order."""
    lines = [question_text.strip(), ""]
    for i, perm_idx in enumerate(permutation):
        letter = LETTERS[i]
        option_text = options[LETTERS[perm_idx]]
        lines.append(f"{letter}) {option_text}")
    lines.append("")
    lines.append("Answer with the letter only.")
    return "\n".join(lines)
correct: X is unreachable with this prompting.
For sanity checks you expect X, but the prompt only shows A-D and parse_response() can never return X. Such controls therefore inevitably end up at 0% and do not really validate the system.
Also applies to: 59-68, 71-90
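The behaviour under discussion can be reproduced with a few lines. This is a simplified sketch that assumes scoring is a plain equality check between the parsed letter and the spec's `correct` field; `is_correct` and `control_score` are hypothetical names, and the actual scoring code in `pilot.py` is not shown in this thread.

```python
VALID_LETTERS = ("A", "B", "C", "D")


def is_correct(parsed, correct):
    """Assumed scoring rule: the parsed letter must equal the expected one."""
    return parsed is not None and parsed == correct


def control_score(parsed_answers, correct="X"):
    """Share of answers counted as correct for a control question."""
    hits = sum(is_correct(p, correct) for p in parsed_answers)
    return hits / len(parsed_answers)


# Whatever the parser returns (A-D or None), a control expecting "X" scores 0.
print(control_score([*VALID_LETTERS, None]))
```

Read this way, a control score above zero would point to a fault in parsing or scoring rather than to model knowledge.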
By design. The sanity check MUST score 0%. correct=X never matches A-D; that is its purpose.
def parse_response(text):
    """Extract the first capital letter A-D from the response.
    Strips <think>...</think> blocks (used by reasoning models like qwen3)."""
    import re
    # Remove thinking blocks (qwen3, DeepSeek R1, etc.)
    cleaned = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()
    # If nothing left after stripping, fall back to original
    if not cleaned:
        cleaned = text.strip()
    # Try to find a standalone answer letter (e.g., "B", "B)", "**B**", "b")
    # First: look for a line that is just a letter (strongest signal)
    for line in cleaned.split('\n'):
        line = line.strip().strip('*').strip('.').strip(')').strip()
        if line.upper() in ("A", "B", "C", "D"):
            return line.upper()
    # Fallback: first capital A-D in the text
    for char in cleaned:
        if char in "ABCD":
            return char
    return None
The answer parser produces false positives.
The fallback takes the first capital A-D anywhere in the text. As a result, ordinary explanations or error messages such as ERROR: Bad Request are counted as valid answers. Match only free-standing option letters here.
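The false positive described above is easy to reproduce in isolation. The sketch below contrasts the character-based fallback with a word-boundary regex; the function names and the `ignore_case` switch are introduced here for illustration only. Note that with `re.IGNORECASE` the English article "a" also counts as a free-standing letter, so a case-insensitive match may need extra care on prose answers.

```python
import re


def first_capital_fallback(text):
    """Character-based fallback: first capital A-D anywhere, even inside a word."""
    for char in text:
        if char in "ABCD":
            return char
    return None


def standalone_fallback(text, ignore_case=False):
    """Accept A-D only when it stands alone as a token (word boundaries)."""
    flags = re.IGNORECASE if ignore_case else 0
    match = re.search(r"\b([ABCD])\b", text, flags=flags)
    return match.group(1).upper() if match else None


print(first_capital_fallback("ERROR: Bad Request"))                # B (false positive)
print(standalone_fallback("ERROR: Bad Request"))                   # None
print(standalone_fallback("This is a guess: C"))                   # C
print(standalone_fallback("This is a guess: C", ignore_case=True)) # A (the article "a")
```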
🔧 Suggestion
-    # Fallback: first capital A-D in the text
-    for char in cleaned:
-        if char in "ABCD":
-            return char
+    # Fallback: first standalone option letter
+    match = re.search(r"\b([ABCD])\b", cleaned, flags=re.IGNORECASE)
+    if match:
+        return match.group(1).upper()
     return None
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def parse_response(text): | |
| """Extract the first capital letter A-D from the response. | |
| Strips <think>...</think> blocks (used by reasoning models like qwen3).""" | |
| import re | |
| # Remove thinking blocks (qwen3, DeepSeek R1, etc.) | |
| cleaned = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip() | |
| # If nothing left after stripping, fall back to original | |
| if not cleaned: | |
| cleaned = text.strip() | |
| # Try to find a standalone answer letter (e.g., "B", "B)", "**B**", "b") | |
| # First: look for a line that is just a letter (strongest signal) | |
| for line in cleaned.split('\n'): | |
| line = line.strip().strip('*').strip('.').strip(')').strip() | |
| if line.upper() in ("A", "B", "C", "D"): | |
| return line.upper() | |
| # Fallback: first standalone option letter | |
| match = re.search(r"\b([ABCD])\b", cleaned, flags=re.IGNORECASE) | |
| if match: | |
| return match.group(1).upper() | |
| return None |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evaluations/pilot.py` around lines 71 - 90, The current parse_response()
fallback is too permissive and picks up any capital A-D inside words (e.g.,
"ERROR"); update parse_response to only match freestanding option letters by
replacing the character-based fallback with a regex that finds standalone option
tokens (case-insensitive) such as boundaries or common option formats like "A",
"A)", "A.", or wrapped in markup (e.g., "**A**"); keep the existing stripping of
<think> blocks and the strongest-line check, but change the final search in
parse_response() to use a regex like a word-boundary or negative-word-character
lookaround for A-D so only isolated option letters are returned (use line/token
matching on the cleaned text and return uppercased letter).
Accepted trade-off. Checking for a standalone letter first minimizes false positives, and four runs average out the remaining errors.
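For reference, a stricter fallback along the lines discussed in this thread could look like the sketch below. It is not the code in pilot.py: the name `parse_response_strict` is hypothetical, and keeping the fallback uppercase-only is an assumption made here so that the lowercase English article "a" is not read as option A. An uppercase article at the start of a sentence would still match, so this reduces false positives without eliminating them.

```python
import re

# Reasoning blocks emitted by some models, as in the original parser.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
# Uppercase A-D that is not adjacent to another letter or digit.
_STANDALONE = re.compile(r"(?<![A-Za-z0-9])([A-D])(?![A-Za-z0-9])")


def parse_response_strict(text):
    """Return a standalone option letter A-D, or None if there is none."""
    # Fall back to the raw text if stripping the think block leaves nothing.
    cleaned = _THINK_BLOCK.sub("", text).strip() or text.strip()
    # Strongest signal: a line that consists only of an option letter.
    for line in cleaned.splitlines():
        token = line.strip().strip("*.)( ").upper()
        if token in ("A", "B", "C", "D"):
            return token
    # Fallback: first isolated uppercase letter, never a letter inside a word.
    match = _STANDALONE.search(cleaned)
    return match.group(1) if match else None
```

With this variant, a response such as `ERROR: Bad Request` yields None instead of being counted as an answer.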
| expected = correct_letter_for_permutation(original_correct, perm) | ||
| try: | ||
| response_text, model_id = call_fn(prompt) |
Multiple model variants collide under the same key.
Results are stored only under the backend alias (openai, ollama, …), even though the callers already know the concrete model ID. A run with gpt-5 therefore later overwrites gpt-4o-mini, and generate-report.py can show only one run per backend.
Also applies to: 332-365
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 257-257: Unpacked variable model_id is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evaluations/pilot.py` at line 257, The results are being stored only under
the backend alias, causing different model variants to overwrite each other;
update the storage key generation where call_fn returns response_text, model_id
(and in the analogous block at lines 332-365) to include the concrete model_id
(e.g., combine backend alias + model_id or use model_id alone) so each run is
saved under a unique key, and ensure generate-report.py reads that composite key
(backend+model_id) when aggregating runs.
Acknowledged. The model key should contain the concrete model name.
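One possible shape for such a key is sketched below. The `<backend>/<model_id>` format and the helper names `result_key` and `store_run` are assumptions for illustration; the actual result layout in pilot.py and generate-report.py may differ.

```python
def result_key(backend, model_id=None):
    """Build a storage key; the '<backend>/<model_id>' format is an assumption."""
    return f"{backend}/{model_id}" if model_id else backend


def store_run(results, backend, model_id, run):
    """Append a run under its composite key instead of overwriting the backend alias."""
    results.setdefault(result_key(backend, model_id), []).append(run)
    return results


# Two model variants on the same backend no longer collide.
results = {}
store_run(results, "openai", "gpt-4o-mini", {"run": 1})
store_run(results, "openai", "gpt-5", {"run": 1})
```

A report generator reading these results would then need to group by the same composite key rather than by backend alone.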
| anchor_prompt: using BLUF (Bottom Line Up Front) | ||
| paraphrase_prompt: to communicate the most critical information first for time-pressed | ||
| executives who need to make immediate decisions |
The paraphrase is too leading.
"The most critical information first" paraphrases BLUF almost verbatim. The application-paraphrase variant therefore measures definition recognition rather than application transfer and will probably inflate the scores artificially. Phrase the goal more neutrally in terms of time pressure and the need for a decision, without prescribing the order of the information.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@evaluations/specs/bluf.yaml` around lines 22 - 24, the
`application-paraphrase` variant is too leading: change the value of
paraphrase_prompt so that it describes BLUF neutrally as "communication for
time-limited, decision-relevant situations" without prescribing the order of
the information; concretely, replace the wording "the most critical information
first" with wording that points to time pressure and the need for a decision
(e.g. "for time-limited situations in which quick decisions must be made"), and
update the YAML entry for paraphrase_prompt (next to anchor_prompt) accordingly.
Acknowledged. The BLUF paraphrase is too close to the answer. It will be calibrated.
Final results (193 questions each, excluding sanity check):
- Claude Sonnet: 99% (2 fails: Feynman para 0%, GitHub Flow para 75%)
- GPT-4o: 97% (13 fails: Feynman para 0%, TDD position bias, niche anchors)
- Mistral Large: 96% (18 fails: PRD 0%, LASR 25%, niche anchors)

Key findings from L2 Application tests:
- Paraphrase scores are generally lower than anchor scores: anchors add value beyond shorthand for most models
- Feynman Technique paraphrase fails across all models (0%): the paraphrase is too vague, not the models. Fixed in spec but not yet re-run.
- Niche anchors (LASR, PRD, todo.txt, EARS) show model-specific gaps

Includes updated HTML report and progress counter in runner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fixed semantic-versioning.yaml: replaced broken "Version format; :" with proper MAJOR.MINOR.PATCH description
- Added html.escape() for all dynamic values in report generator
- Regenerated report

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
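As a rough illustration of the escaping change named in this commit, the sketch below wraps a dynamic value in a table cell using the standard-library `html.escape`. The helper `table_cell` is hypothetical and is not taken from generate-report.py.

```python
import html


def table_cell(value):
    """Render one report cell; `table_cell` is a hypothetical helper name."""
    # html.escape replaces &, <, > and (by default) both quote characters,
    # so model output or spec text cannot inject markup into the report.
    return f"<td>{html.escape(str(value))}</td>"


cell = table_cell("<b>A & B</b>")
```

Escaping at the point where the string is written into HTML keeps the stored results unmodified and confines the change to the report generator.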
Response to CodeRabbit Review
Fixed in latest commit
Already fixed in earlier commits (CodeRabbit reviewed stale code)
By design / accepted
Known limitations (documented in concept)
Summary
Complete evaluation framework for testing semantic anchors across LLMs — concept, tooling, question specs, and results.
Results (193 questions × 3 models, L1 Recognition + L2 Application)
excluding sanity check
What's included
evaluations/report.html: heatmap, summary cards, failure detail
pilot.py (runner), generate-l1-specs.py, generate-l2-specs.py, fill-distractors.py
Key findings
Status: Preview
Part of EPIC #329. Closes #330, #331, #332, #335.
Test plan
🤖 Generated with Claude Code