
feat: evaluation specs + results for PERT, GRASP, VSA #363

Merged: rdmueller merged 3 commits into LLM-Coding:main from raifdmueller:feat/new-anchor-evals on Mar 26, 2026


@raifdmueller raifdmueller commented Mar 26, 2026

Summary

  • Add L1+L2 evaluation specs for 3 newly merged anchors
  • All score 100% on Claude Sonnet, GPT-4o, and Mistral Large
  • Add --anchor filter to runner (avoids re-running all 66 specs)

Results

Anchor | Claude | GPT-4o | Mistral Large
PERT   | 100%   | 100%   | 100%
GRASP  | 100%   | 100%   | 100%
VSA    | 100%   | 100%   | 100%

New feature

python3 pilot.py --model claude --anchor pert grasp vsa

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New features
    • New CLI option --anchor to filter evaluations by anchor
  • New evaluation specs
    • Several new quiz/evaluation specs added (e.g. GRASP, PERT, Vertical Slice Architecture, KISS, PARA, Explicit Contract Surface, Spec‑Driven Development)
  • Documentation / Website
    • Updated evaluation overview and results report (visible score and failure entries)
  • Chores
    • Numerous new evaluation/summary files (JSON) added

raifdmueller and others added 2 commits March 26, 2026 13:26
L1 Recognition + L2 Application questions for the 3 newly merged anchors:
- PERT (Tier 2): three-point estimation, network diagrams
- GRASP (Tier 3): 9 OO responsibility assignment patterns
- Vertical Slice Architecture (Tier 3): feature cohesion, CQRS alignment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Results: all 3 new anchors score 100% on Claude, GPT-4o, and Mistral Large
(L1 Recognition + L2 Application, 9 questions × 3 models).

Added --anchor filter to pilot.py to avoid re-running all specs:
  python3 pilot.py --model claude --anchor pert grasp

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Mar 26, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: aa9d7a95-deb5-4cba-ad2d-b409dadc8fec

📥 Commits

Reviewing files that changed from the base of the PR and between eedd91b and 5ae0803.

📒 Files selected for processing (17)
  • evaluations/report.html
  • evaluations/specs/explicit-contract-surface.yaml
  • evaluations/specs/kiss-principle.yaml
  • evaluations/specs/para-method.yaml
  • evaluations/specs/spec-driven-development.yaml
  • evaluations/summaries/pilot-20260326-123346_gpt-5.4-mini-2026-03-17.json
  • evaluations/summaries/pilot-20260326-133606_claude-sonnet-4-20250514.json
  • evaluations/summaries/pilot-20260326-133720_gpt-4o.json
  • evaluations/summaries/pilot-20260326-133810_mistral-large-2512.json
  • evaluations/summaries/pilot-20260326-134613_claude-opus-4-6.json
  • evaluations/summaries/pilot-20260326-134803_claude-haiku-4-5-20251001.json
  • evaluations/summaries/pilot-20260326-134856_gpt-5.4-2026-03-05.json
  • evaluations/summaries/pilot-20260326-134952_gpt-5.4-mini-2026-03-17.json
  • evaluations/summaries/pilot-20260326-135041_mistral-medium-2508.json
  • evaluations/summaries/pilot-20260326-135130_mistral-small-2603.json
  • evaluations/summaries/pilot-20260326-135220_devstral-2512.json
  • website/public/evaluation-report.html

Walkthrough

The PR adds optional anchor filtering to the pilot function, plus a CLI argument for filtering by anchor, and adds several new evaluation specs along with numerous generated run-summary and report files.

Changes

Cohort / File(s) | Summary
Pilot function & CLI
evaluations/pilot.py
Signature of run_pilot(...) extended with anchor_filter=None; after the YAML specs are loaded, they are optionally filtered by anchor_filter. The CLI gains --anchor (nargs="+") and passes the values to run_pilot.
New evaluation specs
evaluations/specs/grasp.yaml, evaluations/specs/pert.yaml, evaluations/specs/vertical-slice-architecture.yaml, evaluations/specs/explicit-contract-surface.yaml, evaluations/specs/kiss-principle.yaml, evaluations/specs/para-method.yaml, evaluations/specs/spec-driven-development.yaml
Seven new YAML spec files added; each defines anchor, tier, and multiple-choice questions (Recognition/Application) with the correct answers marked.
Summary / run-output artifacts
evaluations/summaries/*pilot-20260326-*.json (multiple files)
Many new JSON run-summary files added for various models (timestamp, model configuration, per-label scores, duration_seconds).
Report / website report
evaluations/report.html, website/public/evaluation-report.html
HTML reports updated: gpt-5.4-mini-2026-03-17 aggregate score changed (98% → 97%), referenced run artifact updated, heatmap/failures table values and run metadata adjusted.
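
The pilot.py change summarized in the table above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `SAMPLE_SPECS`, `build_parser`, and the lowercase anchor values are assumptions made for the example, and the real runner loads its specs from YAML files. Only `run_pilot`, `anchor_filter=None`, and `--anchor` with `nargs="+"` are taken from the walkthrough.

```python
import argparse

# Hypothetical stand-in for the specs the real runner loads from YAML.
SAMPLE_SPECS = [
    {"anchor": "pert", "tier": 2},
    {"anchor": "grasp", "tier": 3},
    {"anchor": "vsa", "tier": 3},
]


def run_pilot(specs, anchor_filter=None):
    """Return the specs to evaluate, optionally restricted to anchor_filter."""
    if anchor_filter:
        # Keep only specs whose anchor was requested on the command line.
        specs = [s for s in specs if s["anchor"] in anchor_filter]
    print(f"Loaded {len(specs)} anchor specs")
    return specs


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of the --anchor option")
    # nargs="+" collects one or more values into a list; if the flag is
    # omitted, the attribute stays None and no filtering happens.
    parser.add_argument("--anchor", nargs="+", default=None)
    return parser


args = build_parser().parse_args(["--anchor", "pert", "grasp"])
selected = run_pilot(SAMPLE_SPECS, anchor_filter=args.anchor)
```

Because the filter defaults to `None`, existing invocations without `--anchor` keep running every spec.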

Sequence Diagram(s)

(not generated)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The pull request title accurately summarizes the main changes: adding evaluation specs and results for three new anchors (PERT, GRASP, VSA). The title is specific, concise, and understandable for developers searching the Git history.


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
evaluations/pilot.py (2)

301-303: Missing validation for invalid anchor names.

If the user passes --anchor foo bar with anchors that do not exist, specs ends up empty and the output Loaded 0 anchor specs appears without further explanation. A warning for anchors that were not found would be helpful.

♻️ Suggested improvement with validation
     if anchor_filter:
+        available_anchors = {s["anchor"] for s in specs}
+        unknown = set(anchor_filter) - available_anchors
+        if unknown:
+            print(f"Warning: Unknown anchor(s) ignored: {', '.join(sorted(unknown))}")
         specs = [s for s in specs if s["anchor"] in anchor_filter]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/pilot.py` around lines 301 - 303, When filtering specs by
anchor_filter, validate anchors and warn about any names that didn't match:
capture the set of requested anchors (anchor_filter) and the set of found
anchors (from specs' "anchor" values before or after filtering), compute missing
= requested - found, and if missing is non-empty print a warning (or emit one
through Python's logging module) listing the missing anchor names before the
existing print(f"Loaded {len(specs)} anchor specs"); ensure you still proceed
with the matched specs but surface the missing anchors to the user.
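
As a self-contained illustration of the suggestion above, the validation could be factored into a small helper. This is only a sketch: `filter_specs` and its return shape are hypothetical and not part of pilot.py, and the warning text simply mirrors the one in the suggested diff.

```python
def filter_specs(specs, anchor_filter=None):
    """Filter specs by anchor and report requested anchors that do not exist.

    Returns a (matched_specs, unknown_anchor_names) tuple.
    """
    if not anchor_filter:
        # No filter given: keep everything, nothing is unknown.
        return specs, []

    # Collect the anchors that exist BEFORE filtering, so typos can be detected.
    available = {s["anchor"] for s in specs}
    unknown = sorted(set(anchor_filter) - available)
    if unknown:
        print(f"Warning: Unknown anchor(s) ignored: {', '.join(unknown)}")

    matched = [s for s in specs if s["anchor"] in anchor_filter]
    return matched, unknown


# Example data is made up for the sketch.
specs = [{"anchor": "pert"}, {"anchor": "grasp"}]
matched, unknown = filter_specs(specs, ["pert", "foo"])
```

Returning the unknown names alongside the matches lets the caller decide whether a typo should only warn or abort the run.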

6-11: Docstring update for --anchor is missing.

The new --anchor option should be added to the usage documentation at the top of the file so that users find it via --help or when reading the code.

📝 Suggested addition
 Usage:
   python3 pilot.py --model claude      # Claude Sonnet via Anthropic API
   python3 pilot.py --model ollama      # Local model via Ollama (OpenAI-compatible)
   python3 pilot.py --model claude ollama  # Both
+  python3 pilot.py --anchor pert grasp  # Only evaluate specific anchors
   python3 pilot.py --dry-run           # Show prompts without sending
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@evaluations/pilot.py` around lines 6 - 11, The top-of-file Usage docstring in
pilot.py is missing the new --anchor option; update the initial module docstring
(the triple-quoted Usage block) to include an example and short note for
--anchor (e.g., how to pass an anchor value or flag) so it appears in --help and
when reading the file; modify the Usage block near the top of pilot.py (the
module-level docstring/Usage section) to add a line like "python3 pilot.py
--anchor <value>  # Use anchor for ..." that matches the style of the existing
examples.
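
One caveat on the "appears in --help" part: argparse only prints usage text it is explicitly given as `description` (or `epilog`), so a module docstring shows up in `--help` only if the parser is wired to it. The sketch below shows that wiring under stated assumptions: `USAGE` is a hypothetical constant standing in for the module docstring (a real script would typically pass `description=__doc__`), and whether pilot.py already does this is not visible in the review.

```python
import argparse

# Stand-in for the module docstring of the script; hypothetical for this sketch.
USAGE = """Evaluation pilot runner (sketch).

Usage:
  python3 pilot.py --model claude       # Claude Sonnet via Anthropic API
  python3 pilot.py --anchor pert grasp  # Only evaluate specific anchors
  python3 pilot.py --dry-run            # Show prompts without sending
"""

parser = argparse.ArgumentParser(
    prog="pilot.py",
    description=USAGE,
    # Keep the line breaks of the usage block instead of re-wrapping it.
    formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument("--anchor", nargs="+", help="only evaluate the given anchors")

# format_help() returns the same text that --help would print.
help_text = parser.format_help()
```

With this wiring the usage examples are maintained in one place and stay visible both in the source and in the command-line help.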

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: d736a79e-ab86-49e4-8be3-56f129c120d5

📥 Commits

Reviewing files that changed from the base of the PR and between 9589f07 and eedd91b.

📒 Files selected for processing (4)
  • evaluations/pilot.py
  • evaluations/specs/grasp.yaml
  • evaluations/specs/pert.yaml
  • evaluations/specs/vertical-slice-architecture.yaml

…tests

New anchors tested on 10 models (L1+L2, 9 questions each):
- PERT: 100% on 9/10 models (GPT-5.4-mini 75% on application)
- GRASP: 100% on all 10 models
- VSA: 100% on 9/10 models (Mistral Medium 0% on application!)

Unmerged anchor tests (L1 only, Claude/GPT-4o/Mistral Large):
- KISS, P.A.R.A., Spec-Driven Dev, Explicit Contract Surface: all 100%

Added --anchor filter to avoid re-running all specs:
  python3 pilot.py --model claude --anchor pert grasp

Finding: Mistral Medium doesn't know Vertical Slice Architecture
despite Large and Small knowing it — a model-specific blind spot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
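
The per-model gaps reported in the commit message above can be surfaced mechanically. The following is a rough sketch, not part of the PR: `L2_SCORES` is a hand-written subset of the application-level results quoted in that message, with shortened model labels, and `below_threshold` is a hypothetical helper; the real summary JSON layout is not shown in this conversation.

```python
# Application-level (L2) scores as fractions, copied from the commit message.
# Only an illustrative subset of the 10 models is listed.
L2_SCORES = {
    "pert": {"gpt-5.4": 1.0, "gpt-5.4-mini": 0.75},
    "vsa": {"mistral-large": 1.0, "mistral-medium": 0.0, "mistral-small": 1.0},
}


def below_threshold(scores, threshold=1.0):
    """Return (anchor, model, score) for every score under the threshold."""
    flagged = []
    for anchor, by_model in sorted(scores.items()):
        for model, score in sorted(by_model.items()):
            if score < threshold:
                flagged.append((anchor, model, score))
    return flagged


flagged = below_threshold(L2_SCORES)
for anchor, model, score in flagged:
    print(f"{anchor}: {model} scored {score:.0%} on application questions")
```

A check like this makes a blind spot such as Mistral Medium on VSA stand out even when the sibling models in the same family score 100%.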
@rdmueller rdmueller merged commit 77361ee into LLM-Coding:main Mar 26, 2026
5 of 7 checks passed