feat: evaluation specs + results for PERT, GRASP, VSA#363
Conversation
L1 Recognition + L2 Application questions for the 3 newly merged anchors: - PERT (Tier 2): three-point estimation, network diagrams - GRASP (Tier 3): 9 OO responsibility assignment patterns - Vertical Slice Architecture (Tier 3): feature cohesion, CQRS alignment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Results: all 3 new anchors score 100% on Claude, GPT-4o, and Mistral Large (L1 Recognition + L2 Application, 9 questions × 3 models). Added --anchor filter to pilot.py to avoid re-running all specs: python3 pilot.py --model claude --anchor pert grasp Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (17)
WalkthroughDie PR fügt der Pilot-Funktion eine optionale Anchor-Filterung und ein CLI-Argument zum Filtern nach Ankern hinzu und ergänzt mehrere neue Evaluierungs-Spezifikationen sowie zahlreiche erzeugte Run-Summary/Report-Dateien. Changes
Sequence Diagram(s)(nicht erzeugt) Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
evaluations/pilot.py (2)
301-303: Fehlende Validierung bei ungültigen Anchor-Namen.Wenn der Benutzer
--anchor foo barmit nicht existierenden Anchors angibt, wirdspecsleer und die AusgabeLoaded 0 anchor specserscheint ohne weitere Erklärung. Eine Warnung für nicht gefundene Anchors wäre hilfreich.♻️ Vorgeschlagene Verbesserung mit Validierung
if anchor_filter: + available_anchors = {s["anchor"] for s in specs} + unknown = set(anchor_filter) - available_anchors + if unknown: + print(f"Warning: Unknown anchor(s) ignored: {', '.join(sorted(unknown))}") specs = [s for s in specs if s["anchor"] in anchor_filter]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/pilot.py` around lines 301 - 303, When filtering specs by anchor_filter, validate anchors and warn about any names that didn't match: capture the set of requested anchors (anchor_filter) and the set of found anchors (from specs' "anchor" values before or after filtering), compute missing = requested - found, and if missing is non-empty call processLogger.warn or print a warning listing the missing anchor names before the existing print(f"Loaded {len(specs)} anchor specs"); ensure you still proceed with the matched specs but surface the missing anchors to the user.
6-11: Docstring-Aktualisierung für--anchorfehlt.Die neue
--anchor-Option sollte in der Usage-Dokumentation am Dateianfang ergänzt werden, damit Benutzer sie bei--helpoder beim Lesen des Codes finden.📝 Vorgeschlagene Ergänzung
Usage: python3 pilot.py --model claude # Claude Sonnet via Anthropic API python3 pilot.py --model ollama # Local model via Ollama (OpenAI-compatible) python3 pilot.py --model claude ollama # Both + python3 pilot.py --anchor pert grasp # Only evaluate specific anchors python3 pilot.py --dry-run # Show prompts without sending🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@evaluations/pilot.py` around lines 6 - 11, The top-of-file Usage docstring in pilot.py is missing the new --anchor option; update the initial module docstring (the triple-quoted Usage block) to include an example and short note for --anchor (e.g., how to pass an anchor value or flag) so it appears in --help and when reading the file; modify the Usage block near the top of pilot.py (the module-level docstring/Usage section) to add a line like "python3 pilot.py --anchor <value> # Use anchor for ..." that matches the style of the existing examples.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@evaluations/pilot.py`:
- Around line 301-303: When filtering specs by anchor_filter, validate anchors
and warn about any names that didn't match: capture the set of requested anchors
(anchor_filter) and the set of found anchors (from specs' "anchor" values before
or after filtering), compute missing = requested - found, and if missing is
non-empty call processLogger.warn or print a warning listing the missing anchor
names before the existing print(f"Loaded {len(specs)} anchor specs"); ensure you
still proceed with the matched specs but surface the missing anchors to the
user.
- Around line 6-11: The top-of-file Usage docstring in pilot.py is missing the
new --anchor option; update the initial module docstring (the triple-quoted
Usage block) to include an example and short note for --anchor (e.g., how to
pass an anchor value or flag) so it appears in --help and when reading the file;
modify the Usage block near the top of pilot.py (the module-level
docstring/Usage section) to add a line like "python3 pilot.py --anchor <value>
# Use anchor for ..." that matches the style of the existing examples.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: d736a79e-ab86-49e4-8be3-56f129c120d5
📒 Files selected for processing (4)
evaluations/pilot.pyevaluations/specs/grasp.yamlevaluations/specs/pert.yamlevaluations/specs/vertical-slice-architecture.yaml
…tests New anchors tested on 10 models (L1+L2, 9 questions each): - PERT: 100% on 9/10 models (GPT-5.4-mini 75% on application) - GRASP: 100% on all 10 models - VSA: 100% on 9/10 models (Mistral Medium 0% on application!) Unmerged anchor tests (L1 only, Claude/GPT-4o/Mistral Large): - KISS, P.A.R.A., Spec-Driven Dev, Explicit Contract Surface: all 100% Added --anchor filter to avoid re-running all specs: python3 pilot.py --model claude --anchor pert grasp Finding: Mistral Medium doesn't know Vertical Slice Architecture despite Large and Small knowing it — a model-specific blind spot. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
--anchorfilter to runner (avoids re-running all 66 specs)Results
New feature
python3 pilot.py --model claude --anchor pert grasp vsa🤖 Generated with Claude Code
Summary by CodeRabbit
--anchorzum Filtern von Evaluierungen nach Ankern