Objective
Run a head-to-head comparison of HiveSpec vs Superpowers (https://github.com/obra/superpowers/) using Claude Code as the eval target. Measure whether the enriched HiveSpec skills (post EntityProcess/hivespec#1) achieve parity or better on the capabilities that Superpowers covers.
Background
A gap analysis identified three areas where Superpowers had deeper coverage than HiveSpec:
- TDD rationalization prevention — superpowers:test-driven-development
- Systematic debugging methodology — superpowers:systematic-debugging
- Receiving code review discipline — superpowers:receiving-code-review
These gaps were addressed in EntityProcess/hivespec#1 by enriching hs-implement and hs-verify. This eval should verify the enrichments are effective.
Eval Design
Scenarios to cover
- TDD discipline — Give the agent a feature to implement. Measure whether it writes a failing test first, watches it fail, then implements. Pressure-test with rationalizations ("this is too simple to test", "I already wrote the code").
- Systematic debugging — Present a failing test or broken feature. Measure whether the agent follows root cause investigation before proposing fixes. Check for the 3-failure escalation pattern.
- Code review response — Provide review feedback (mix of correct, incorrect, and YAGNI suggestions). Measure whether the agent evaluates technically rather than performatively agreeing.
- End-to-end lifecycle — Full issue → PR flow. Measure phase discipline, verification evidence, and shipping quality.
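The four scenarios above could be encoded as data for an eval harness. A minimal sketch (the Scenario shape, prompts, and signal strings are all hypothetical, not an existing harness API):

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One eval scenario, run once per plugin (HiveSpec, Superpowers)."""
    name: str
    prompt: str        # task given to the agent (placeholder wording)
    signals: list[str] # behavioral signals the grader checks for

SCENARIOS = [
    Scenario(
        name="tdd-discipline",
        prompt="Implement the feature; pressure with 'this is too simple to test'.",
        signals=["wrote failing test first", "watched it fail", "implemented after"],
    ),
    Scenario(
        name="systematic-debugging",
        prompt="Fix this failing test.",
        signals=["traced root cause before fixing", "escalated after 3 failed attempts"],
    ),
    Scenario(
        name="code-review-response",
        prompt="Respond to this mixed correct/incorrect/YAGNI review feedback.",
        signals=["evaluated each point technically", "pushed back on incorrect feedback"],
    ),
    Scenario(
        name="end-to-end-lifecycle",
        prompt="Take this issue from triage through to a PR.",
        signals=["phase discipline", "verification evidence", "shipping quality"],
    ),
]
```

Keeping scenarios as plain data makes the two-plugin comparison a loop over (scenario, plugin) pairs rather than duplicated scripts.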
Target
- Agent: Claude Code (CLI)
- Plugins: Run each scenario twice — once with HiveSpec installed, once with Superpowers installed
- Metrics: Pass/fail per grading rubric, plus qualitative comparison of agent behavior
Grading approach
Use llm-grader rubrics that check for specific behavioral signals (e.g., "agent wrote test before implementation", "agent traced root cause before fixing", "agent pushed back on incorrect review feedback").
Acceptance criteria
- agentv compare output