Why
We're running two AI surfaces over the same data:
Each has wins on different question types but there's no rigorous way to evaluate them side-by-side. Today comparison is "open both, eyeball, judge" — slow, lossy, no archive, no trend.
Goal: a CLI subcommand that asks the same question of both backends in parallel and produces a structured comparison file you can annotate + archive. Builds a corpus over time to inform decisions like "does the dashboard chat actually add value vs Claude Code?" or "on which question types does each win?"
What
harness compare <play_id> \"<question>\" [--profile P --model M] [--out DIR]
Runs in parallel:
- Dashboard chat side — POST to
/api/v2/chat with auto-built scope (player_id resolved via get_play_summary(play_id)), selected profile/model, the question. Capture the SSE stream end-to-end (assistant text + tool calls + tool results + usage + cost). Get the persisted chat handoff path.
- Claude Code side — invoke
claude --print non-interactively with a templated prompt: "Investigate play <play_id> using your tools and answer: ". COLD — no handoff context, independent run.
Output: one markdown file at .claude/comparisons/<play_id>-<UTC_TS>.md:
# Comparison — play 0a9c4308 — 2026-05-25 19:32 PDT (02:32 UTC)
## Question
<question>
## Scope (auto-derived from play_id)
- player_id: …
- play_id: …
- play span: started_at → last_seen_at
## Dashboard chat (<profile>/<model>)
**Time:** 47.2s · **Cost:** $0.018 · **Tool calls:** 4
**Handoff:** cat ~/test-dev/.claude/chats/<chat_id>.md
<full assistant response, including tool-call summaries>
---
## Claude Code (claude-opus-4-7)
**Time:** 89.4s · **Cost:** ~$0.41
<full claude response from --print mode>
---
## Operator verdict (fill in)
- [ ] Dashboard chat is clearly better
- [ ] Claude Code is clearly better
- [ ] Roughly equivalent
- [ ] Both wrong; new answer needed
- Notes:
Implementation notes
- Reuses existing
/api/v2/chat SSE handler — no server changes needed.
- Claude Code side uses
claude --print (non-interactive flag) — already shipped in the CLI.
- Parallel execution via goroutines / shell
&.
- Cost capture: dashboard side from the SSE
usage event; Claude Code side from the CLI's own usage report (or rough estimate from token counts + Opus rates).
.claude/comparisons/ is a new directory — add to standards alongside findings/, chats/, etc.
Out of scope (separate issues if useful)
- Web UI for comparison browsing. CLI + markdown files are sufficient for the initial "is this useful" question; UI later if the corpus grows large enough to want filtering/search.
- Automated batch evaluation — running 10 standard questions across 10 reference plays nightly. Premature until we've done some manual comparisons.
- Handoff-aware Claude side — option to have Claude Code read the dashboard's chat handoff first then answer. The user explicitly chose COLD-only for v1 to keep the comparison rigorous (independent).
Acceptance
harness compare <play_id> \"<question>\" produces a markdown file at .claude/comparisons/.
- File contains: question, scope, both responses with timing+cost+tool-call metadata, blank operator-verdict rubric.
- 10+ comparisons across different question types give a clear-enough signal to decide "keep / refine / deprecate" for the dashboard chat surface in 2-4 weeks.
Why
We're running two AI surfaces over the same data:
Each has wins on different question types but there's no rigorous way to evaluate them side-by-side. Today comparison is "open both, eyeball, judge" — slow, lossy, no archive, no trend.
Goal: a CLI subcommand that asks the same question of both backends in parallel and produces a structured comparison file you can annotate + archive. Builds a corpus over time to inform decisions like "does the dashboard chat actually add value vs Claude Code?" or "on which question types does each win?"
What
harness compare <play_id> \"<question>\" [--profile P --model M] [--out DIR]Runs in parallel:
/api/v2/chatwith auto-built scope (player_id resolved viaget_play_summary(play_id)), selected profile/model, the question. Capture the SSE stream end-to-end (assistant text + tool calls + tool results + usage + cost). Get the persisted chat handoff path.claude --printnon-interactively with a templated prompt: "Investigate play<play_id>using your tools and answer: ". COLD — no handoff context, independent run.Output: one markdown file at
.claude/comparisons/<play_id>-<UTC_TS>.md:Implementation notes
/api/v2/chatSSE handler — no server changes needed.claude --print(non-interactive flag) — already shipped in the CLI.&.usageevent; Claude Code side from the CLI's own usage report (or rough estimate from token counts + Opus rates)..claude/comparisons/is a new directory — add to standards alongsidefindings/,chats/, etc.Out of scope (separate issues if useful)
Acceptance
harness compare <play_id> \"<question>\"produces a markdown file at.claude/comparisons/.