Objective
Run a head-to-head comparison of HiveSpec vs Superpowers (https://github.com/obra/superpowers/) using Claude Code as the eval target. Measure whether the enriched HiveSpec skills (post EntityProcess/hivespec#1) achieve parity or better on the capabilities that Superpowers covers.
Background
A gap analysis identified three areas where Superpowers had deeper coverage than HiveSpec:
- TDD rationalization prevention — superpowers:test-driven-development
- Systematic debugging methodology — superpowers:systematic-debugging
- Receiving code review discipline — superpowers:receiving-code-review
These gaps were addressed in EntityProcess/hivespec#1 by enriching hs-implement and hs-verify. This eval should verify the enrichments are effective.
Eval Design
Scenarios to cover
- TDD discipline — Give the agent a feature to implement. Measure whether it writes a failing test first, watches it fail, then implements. Pressure-test with rationalizations ("this is too simple to test", "I already wrote the code").
- Systematic debugging — Present a failing test or broken feature. Measure whether the agent follows root cause investigation before proposing fixes. Check for the 3-failure escalation pattern.
- Code review response — Provide review feedback (mix of correct, incorrect, and YAGNI suggestions). Measure whether the agent evaluates technically rather than performatively agreeing.
- End-to-end lifecycle — Full issue → PR flow. Measure phase discipline, verification evidence, and shipping quality.
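The four scenarios above could be encoded as data for an eval harness. A minimal sketch (the Scenario shape, prompts, and signal strings are all hypothetical, not an existing harness API):

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One eval scenario, run once per plugin (HiveSpec, Superpowers)."""
    name: str
    prompt: str        # task given to the agent (placeholder wording)
    signals: list[str] # behavioral signals the grader checks for

SCENARIOS = [
    Scenario(
        name="tdd-discipline",
        prompt="Implement the feature; pressure with 'this is too simple to test'.",
        signals=["wrote failing test first", "watched it fail", "implemented after"],
    ),
    Scenario(
        name="systematic-debugging",
        prompt="Fix this failing test.",
        signals=["traced root cause before fixing", "escalated after 3 failed attempts"],
    ),
    Scenario(
        name="code-review-response",
        prompt="Respond to this mixed correct/incorrect/YAGNI review feedback.",
        signals=["evaluated each point technically", "pushed back on incorrect feedback"],
    ),
    Scenario(
        name="end-to-end-lifecycle",
        prompt="Take this issue from triage through to a PR.",
        signals=["phase discipline", "verification evidence", "shipping quality"],
    ),
]
```

Keeping scenarios as plain data makes the two-plugin comparison a loop over (scenario, plugin) pairs rather than duplicated scripts.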
Target
- Agent: Claude Code (CLI)
- Plugins: Run each scenario twice — once with HiveSpec installed, once with Superpowers installed
- Metrics: Pass/fail per grading rubric, plus qualitative comparison of agent behavior
Grading approach
Use llm-grader rubrics that check for specific behavioral signals (e.g., "agent wrote test before implementation", "agent traced root cause before fixing", "agent pushed back on incorrect review feedback").
Acceptance criteria
- agentv compare output