tool(eval): harness compare — A/B dashboard chat vs Claude Code on same play_id

## Why

We're running two AI surfaces over the same data:
- **Dashboard chat panel** (#497) — fast, constrained-tool, embedded at the chart
- **Claude Code CLI** — full reasoning, shell access, unconstrained tools

Each has wins on different question types but there's no rigorous way to evaluate them side-by-side. Today comparison is "open both, eyeball, judge" — slow, lossy, no archive, no trend.

Goal: a CLI subcommand that asks the same question of both backends in parallel and produces a structured comparison file you can annotate + archive. Builds a corpus over time to inform decisions like \"does the dashboard chat actually add value vs Claude Code?\" or \"on which question types does each win?\"

## What

`harness compare <play_id> \"<question>\" [--profile P --model M] [--out DIR]`

Runs in parallel:
1. **Dashboard chat side** — POST to `/api/v2/chat` with auto-built scope (player_id resolved via `get_play_summary(play_id)`), selected profile/model, the question. Capture the SSE stream end-to-end (assistant text + tool calls + tool results + usage + cost). Get the persisted chat handoff path.
2. **Claude Code side** — invoke `claude --print` non-interactively with a templated prompt: \"Investigate play `<play_id>` using your tools and answer: <question>\". COLD — no handoff context, independent run.

Output: one markdown file at `.claude/comparisons/<play_id>-<UTC_TS>.md`:

```markdown
# Comparison — play 0a9c4308 — 2026-05-25 19:32 PDT (02:32 UTC)

## Question
<question>

## Scope (auto-derived from play_id)
- player_id: …
- play_id: …
- play span: started_at → last_seen_at

## Dashboard chat (<profile>/<model>)
**Time:** 47.2s · **Cost:** $0.018 · **Tool calls:** 4
**Handoff:** cat ~/test-dev/.claude/chats/<chat_id>.md

<full assistant response, including tool-call summaries>

---

## Claude Code (claude-opus-4-7)
**Time:** 89.4s · **Cost:** ~$0.41

<full claude response from --print mode>

---

## Operator verdict (fill in)
- [ ] Dashboard chat is clearly better
- [ ] Claude Code is clearly better
- [ ] Roughly equivalent
- [ ] Both wrong; new answer needed
- Notes:
```

## Implementation notes

- Reuses existing `/api/v2/chat` SSE handler — no server changes needed.
- Claude Code side uses `claude --print` (non-interactive flag) — already shipped in the CLI.
- Parallel execution via goroutines / shell `&`.
- Cost capture: dashboard side from the SSE `usage` event; Claude Code side from the CLI's own usage report (or rough estimate from token counts + Opus rates).
- `.claude/comparisons/` is a new directory — add to standards alongside `findings/`, `chats/`, etc.

## Out of scope (separate issues if useful)

- **Web UI** for comparison browsing. CLI + markdown files are sufficient for the initial \"is this useful\" question; UI later if the corpus grows large enough to want filtering/search.
- **Automated batch evaluation** — running 10 standard questions across 10 reference plays nightly. Premature until we've done some manual comparisons.
- **Handoff-aware Claude side** — option to have Claude Code read the dashboard's chat handoff first then answer. The user explicitly chose COLD-only for v1 to keep the comparison rigorous (independent).

## Acceptance

- `harness compare <play_id> \"<question>\"` produces a markdown file at `.claude/comparisons/`.
- File contains: question, scope, both responses with timing+cost+tool-call metadata, blank operator-verdict rubric.
- 10+ comparisons across different question types give a clear-enough signal to decide \"keep / refine / deprecate\" for the dashboard chat surface in 2-4 weeks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tool(eval): harness compare — A/B dashboard chat vs Claude Code on same play_id #513

Why

What

Implementation notes

Out of scope (separate issues if useful)

Acceptance

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

tool(eval): harness compare — A/B dashboard chat vs Claude Code on same play_id #513

Description

Why

What

Implementation notes

Out of scope (separate issues if useful)

Acceptance

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions