feat: evaluation specs + results for PERT, GRASP, VSA by raifdmueller · Pull Request #363 · LLM-Coding/Semantic-Anchors

raifdmueller · 2026-03-26T13:39:12Z

Summary

- Add L1+L2 evaluation specs for 3 newly merged anchors
- All score 100% on Claude Sonnet, GPT-4o, and Mistral Large
- Add --anchor filter to runner (avoids re-running all 66 specs)

Results

Anchor	Claude	GPT-4o	Mistral Large
PERT	100%	100%	100%
GRASP	100%	100%	100%
VSA	100%	100%	100%

New feature

python3 pilot.py --model claude --anchor pert grasp vsa

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- New CLI option --anchor to filter evaluations by anchor
New Evaluation Specifications
- Added several new quiz/evaluation specifications (e.g. GRASP, PERT, Vertical Slice Architecture, KISS, PARA, Explicit Contract Surface, Spec-Driven Development)
Documentation / Website
- Updated evaluation overview and results report (visible score and failure entries)
Chores
- Added numerous new evaluation/summary files (JSON)

L1 Recognition + L2 Application questions for the 3 newly merged anchors:
- PERT (Tier 2): three-point estimation, network diagrams
- GRASP (Tier 3): 9 OO responsibility assignment patterns
- Vertical Slice Architecture (Tier 3): feature cohesion, CQRS alignment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Results: all 3 new anchors score 100% on Claude, GPT-4o, and Mistral Large (L1 Recognition + L2 Application, 9 questions × 3 models).

Added --anchor filter to pilot.py to avoid re-running all specs:
python3 pilot.py --model claude --anchor pert grasp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-03-26T13:41:32Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: aa9d7a95-deb5-4cba-ad2d-b409dadc8fec

📥 Commits

Reviewing files that changed from the base of the PR and between eedd91b and 5ae0803.

📒 Files selected for processing (17)

evaluations/report.html
evaluations/specs/explicit-contract-surface.yaml
evaluations/specs/kiss-principle.yaml
evaluations/specs/para-method.yaml
evaluations/specs/spec-driven-development.yaml
evaluations/summaries/pilot-20260326-123346_gpt-5.4-mini-2026-03-17.json
evaluations/summaries/pilot-20260326-133606_claude-sonnet-4-20250514.json
evaluations/summaries/pilot-20260326-133720_gpt-4o.json
evaluations/summaries/pilot-20260326-133810_mistral-large-2512.json
evaluations/summaries/pilot-20260326-134613_claude-opus-4-6.json
evaluations/summaries/pilot-20260326-134803_claude-haiku-4-5-20251001.json
evaluations/summaries/pilot-20260326-134856_gpt-5.4-2026-03-05.json
evaluations/summaries/pilot-20260326-134952_gpt-5.4-mini-2026-03-17.json
evaluations/summaries/pilot-20260326-135041_mistral-medium-2508.json
evaluations/summaries/pilot-20260326-135130_mistral-small-2603.json
evaluations/summaries/pilot-20260326-135220_devstral-2512.json
website/public/evaluation-report.html

Walkthrough

The PR adds optional anchor filtering to the pilot function, together with a CLI argument for filtering by anchor, and adds several new evaluation specifications as well as numerous generated run-summary and report files.

Changes

Cohort / File(s)	Summary
Pilot function & CLI `evaluations/pilot.py`	Signature of `run_pilot(...)` extended with `anchor_filter=None`; after the YAML specifications are loaded, they are optionally filtered by `anchor_filter`. The CLI gains `--anchor` (`nargs="+"`) and passes the values to `run_pilot`.
New evaluation specifications `evaluations/specs/grasp.yaml`, `evaluations/specs/pert.yaml`, `evaluations/specs/vertical-slice-architecture.yaml`, `evaluations/specs/explicit-contract-surface.yaml`, `evaluations/specs/kiss-principle.yaml`, `evaluations/specs/para-method.yaml`, `evaluations/specs/spec-driven-development.yaml`	Seven new YAML specification files added; each defines `anchor`, `tier`, and multiple-choice questions (Recognition/Application) with the correct answers marked.
Summary / run output artifacts `evaluations/summaries/pilot-20260326-.json` (multiple files)	Many new JSON run-summary files added for various models (timestamp, model configuration, per-label scores, duration_seconds).
Report / website report `evaluations/report.html`, `website/public/evaluation-report.html`	HTML reports updated: gpt-5.4-mini-2026-03-17 aggregate score changed (98% → 97%), referenced run artifact updated, heatmap and failures table values and run metadata adjusted.
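
To illustrate the `--anchor` change summarized in the table, here is a minimal sketch of how such a filter could be wired up. It is not the code from `evaluations/pilot.py`: the helper names `filter_specs` and `build_parser`, the `--model` default, and the placeholder specs are assumptions; only `anchor_filter`, the `"anchor"` key, and `nargs="+"` come from the review.

```python
import argparse


def filter_specs(specs, anchor_filter=None):
    """Keep only specs whose "anchor" value is listed in anchor_filter.

    With no filter (None or an empty list), every spec is kept.
    """
    if not anchor_filter:
        return list(specs)
    wanted = set(anchor_filter)
    return [s for s in specs if s["anchor"] in wanted]


def build_parser():
    """Build a CLI parser with a multi-value --anchor option (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="pilot.py")
    parser.add_argument("--model", nargs="+", default=["claude"])  # default is an assumption
    parser.add_argument("--anchor", nargs="+", default=None,
                        help="only evaluate the listed anchors")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is reproducible.
args = build_parser().parse_args(["--model", "claude", "--anchor", "pert", "grasp"])

# Placeholder specs; real specs would be loaded from the YAML files.
specs = [{"anchor": "pert"}, {"anchor": "grasp"}, {"anchor": "kiss"}]
selected = filter_specs(specs, args.anchor)
print(f"Loaded {len(selected)} anchor specs")  # Loaded 2 anchor specs
```

Because `nargs="+"` collects all values following `--anchor` into one list, `--anchor pert grasp` arrives as `["pert", "grasp"]`, and omitting the option leaves the filter at `None`, which keeps all specs.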

Sequence Diagram(s)

(not generated)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

feat: evaluate 6 models including Mistral Small/Medium/Devstral #353: Also modifies evaluations/pilot.py; the changes affect run_pilot behavior (result files / summary writing), so the code is strongly related.
feat: evaluation framework with 63 anchor specs and pilot results #343: Touches the same file, evaluations/pilot.py, and is directly related to the anchor filter and CLI change introduced here.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
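
For context on the warning above, the following sketch shows one way a docstring coverage percentage could be computed with Python's `ast` module. How CodeRabbit actually calculates its figure is not stated in this review; counting only functions and classes is an assumption, and the sample source (including `load_specs`) is made up for illustration.

```python
import ast


def docstring_coverage(source: str) -> float:
    """Return the percentage of functions and classes in `source` that have a docstring."""
    tree = ast.parse(source)
    nodes = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    if not nodes:
        return 100.0  # nothing to document
    documented = sum(1 for node in nodes if ast.get_docstring(node) is not None)
    return 100.0 * documented / len(nodes)


# Hypothetical sample: one undocumented and one documented function.
sample = '''
def run_pilot(models, anchor_filter=None):
    return []

def load_specs(path):
    """Load YAML specs from path."""
    return []
'''
print(docstring_coverage(sample))  # 50.0
```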

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title accurately summarizes the main changes: adding evaluation specifications and results for three new anchors (PERT, GRASP, VSA). The title is specific, concise, and easy to understand for developers searching the Git history.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai

🧹 Nitpick comments (2)

evaluations/pilot.py (2)

301-303: Missing validation for invalid anchor names.

If the user passes --anchor foo bar with anchors that do not exist, specs ends up empty and the output Loaded 0 anchor specs appears without further explanation. A warning for anchors that were not found would be helpful.

♻️ Suggested improvement with validation

     if anchor_filter:
+        available_anchors = {s["anchor"] for s in specs}
+        unknown = set(anchor_filter) - available_anchors
+        if unknown:
+            print(f"Warning: Unknown anchor(s) ignored: {', '.join(sorted(unknown))}")
         specs = [s for s in specs if s["anchor"] in anchor_filter]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@evaluations/pilot.py` around lines 301 - 303, When filtering specs by
anchor_filter, validate anchors and warn about any names that didn't match:
capture the set of requested anchors (anchor_filter) and the set of found
anchors (from specs' "anchor" values before or after filtering), compute missing
= requested - found, and if missing is non-empty call processLogger.warn or
print a warning listing the missing anchor names before the existing
print(f"Loaded {len(specs)} anchor specs"); ensure you still proceed with the
matched specs but surface the missing anchors to the user.
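
The suggested diff can be tried out in isolation. The sketch below wraps it in a standalone function; the name `filter_with_warning`, the returned tuple, and the sample specs are assumptions made for illustration, while the warning text and the filter expression follow the suggestion.

```python
def filter_with_warning(specs, anchor_filter):
    """Filter specs by anchor and warn about requested anchors that do not exist.

    Returns (matched_specs, unknown_anchor_names).
    """
    available_anchors = {s["anchor"] for s in specs}
    unknown = sorted(set(anchor_filter) - available_anchors)
    if unknown:
        print(f"Warning: Unknown anchor(s) ignored: {', '.join(unknown)}")
    matched = [s for s in specs if s["anchor"] in anchor_filter]
    return matched, unknown


# Placeholder specs; "foo" is deliberately not among them.
specs = [{"anchor": "pert"}, {"anchor": "grasp"}]
matched, unknown = filter_with_warning(specs, ["pert", "foo"])
print(f"Loaded {len(matched)} anchor specs")
```

The set of available anchors has to be computed before the list is filtered; taken afterwards, it would only contain names that matched.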

6-11: Docstring update for --anchor is missing.

The new --anchor option should be added to the usage documentation at the top of the file so that users can find it via --help or when reading the code.

📝 Suggested addition

 Usage:
   python3 pilot.py --model claude      # Claude Sonnet via Anthropic API
   python3 pilot.py --model ollama      # Local model via Ollama (OpenAI-compatible)
   python3 pilot.py --model claude ollama  # Both
+  python3 pilot.py --anchor pert grasp  # Only evaluate specific anchors
   python3 pilot.py --dry-run           # Show prompts without sending

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@evaluations/pilot.py` around lines 6 - 11, The top-of-file Usage docstring in
pilot.py is missing the new --anchor option; update the initial module docstring
(the triple-quoted Usage block) to include an example and short note for
--anchor (e.g., how to pass an anchor value or flag) so it appears in --help and
when reading the file; modify the Usage block near the top of pilot.py (the
module-level docstring/Usage section) to add a line like "python3 pilot.py
--anchor <value>  # Use anchor for ..." that matches the style of the existing
examples.
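
One caveat on the `--help` part of this finding: a module docstring only shows up in `--help` if the parser is given it as its description. Whether `pilot.py` does this is not visible in the review, so the sketch below is an assumption about one possible wiring. `USAGE` stands in for the module docstring (a real script would pass `__doc__`), and `build_parser` is a hypothetical helper.

```python
import argparse

# Stand-in for the module docstring; in a real script this would be __doc__.
USAGE = """Pilot runner (sketch).

Usage:
  python3 pilot.py --model claude        # Claude Sonnet via Anthropic API
  python3 pilot.py --anchor pert grasp   # Only evaluate specific anchors
  python3 pilot.py --dry-run             # Show prompts without sending
"""


def build_parser(description):
    """Build a parser that prints `description` verbatim in --help."""
    parser = argparse.ArgumentParser(
        prog="pilot.py",
        description=description,
        # Keep the line breaks of the usage examples instead of re-wrapping them.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument("--anchor", nargs="+", metavar="ANCHOR",
                        help="only evaluate the listed anchors")
    return parser


help_text = build_parser(USAGE).format_help()
print(help_text)
```

Without the `description` argument, editing the docstring would only help readers of the source, while the `help=` string on `--anchor` is what argparse lists under its options either way.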

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: d736a79e-ab86-49e4-8be3-56f129c120d5

📥 Commits

Reviewing files that changed from the base of the PR and between 9589f07 and eedd91b.

📒 Files selected for processing (4)

evaluations/pilot.py
evaluations/specs/grasp.yaml
evaluations/specs/pert.yaml
evaluations/specs/vertical-slice-architecture.yaml

…tests

New anchors tested on 10 models (L1+L2, 9 questions each):
- PERT: 100% on 9/10 models (GPT-5.4-mini 75% on application)
- GRASP: 100% on all 10 models
- VSA: 100% on 9/10 models (Mistral Medium 0% on application!)

Unmerged anchor tests (L1 only, Claude/GPT-4o/Mistral Large):
- KISS, P.A.R.A., Spec-Driven Dev, Explicit Contract Surface: all 100%

Added --anchor filter to avoid re-running all specs:
python3 pilot.py --model claude --anchor pert grasp

Finding: Mistral Medium doesn't know Vertical Slice Architecture despite Large and Small knowing it, which points to a model-specific blind spot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
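
A finding like the Mistral Medium blind spot could be surfaced automatically by scanning scores for anything below 100%. The sketch below assumes a flat mapping of (model, anchor, level) to a percentage; that layout, the short model labels, and the function name `find_blind_spots` are hypothetical and do not reflect the actual summary JSON schema. The numbers are the ones reported in the commit message.

```python
def find_blind_spots(scores, threshold=100):
    """Return (key, score) pairs whose score is below `threshold`, sorted by key."""
    return sorted((key, value) for key, value in scores.items() if value < threshold)


# Scores taken from the commit message; the data layout is an assumption.
scores = {
    ("gpt-5.4-mini", "PERT", "application"): 75,
    ("mistral-medium", "VSA", "application"): 0,
    ("mistral-large", "VSA", "application"): 100,
    ("mistral-small", "VSA", "application"): 100,
}

for (model, anchor, level), score in find_blind_spots(scores):
    print(f"{model}: {anchor} {level} = {score}%")
```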

raifdmueller and others added 2 commits March 26, 2026 13:26

coderabbitai Bot reviewed Mar 26, 2026

View reviewed changes

rdmueller merged commit 77361ee into LLM-Coding:main Mar 26, 2026
5 of 7 checks passed

raifdmueller mentioned this pull request Mar 26, 2026

Write Level 1 (Recognition) questions for Tier 3 anchors #331

Closed


rdmueller merged 3 commits into LLM-Coding:main from raifdmueller:feat/new-anchor-evals


2 participants
