Skip to content

tool(eval): harness compare — A/B dashboard chat vs Claude Code on same play_id #513

@jonathaneoliver

Description

@jonathaneoliver

Why

We're running two AI surfaces over the same data:

Each has wins on different question types but there's no rigorous way to evaluate them side-by-side. Today comparison is "open both, eyeball, judge" — slow, lossy, no archive, no trend.

Goal: a CLI subcommand that asks the same question of both backends in parallel and produces a structured comparison file you can annotate + archive. Builds a corpus over time to inform decisions like "does the dashboard chat actually add value vs Claude Code?" or "on which question types does each win?"

What

harness compare <play_id> \"<question>\" [--profile P --model M] [--out DIR]

Runs in parallel:

  1. Dashboard chat side — POST to /api/v2/chat with auto-built scope (player_id resolved via get_play_summary(play_id)), selected profile/model, the question. Capture the SSE stream end-to-end (assistant text + tool calls + tool results + usage + cost). Get the persisted chat handoff path.
  2. Claude Code side — invoke claude --print non-interactively with a templated prompt: "Investigate play <play_id> using your tools and answer: ". COLD — no handoff context, independent run.

Output: one markdown file at .claude/comparisons/<play_id>-<UTC_TS>.md:

# Comparison — play 0a9c4308 — 2026-05-25 19:32 PDT (02:32 UTC)

## Question
<question>

## Scope (auto-derived from play_id)
- player_id: …
- play_id: …
- play span: started_at → last_seen_at

## Dashboard chat (<profile>/<model>)
**Time:** 47.2s · **Cost:** $0.018 · **Tool calls:** 4
**Handoff:** cat ~/test-dev/.claude/chats/<chat_id>.md

<full assistant response, including tool-call summaries>

---

## Claude Code (claude-opus-4-7)
**Time:** 89.4s · **Cost:** ~$0.41

<full claude response from --print mode>

---

## Operator verdict (fill in)
- [ ] Dashboard chat is clearly better
- [ ] Claude Code is clearly better
- [ ] Roughly equivalent
- [ ] Both wrong; new answer needed
- Notes:

Implementation notes

  • Reuses existing /api/v2/chat SSE handler — no server changes needed.
  • Claude Code side uses claude --print (non-interactive flag) — already shipped in the CLI.
  • Parallel execution via goroutines / shell &.
  • Cost capture: dashboard side from the SSE usage event; Claude Code side from the CLI's own usage report (or rough estimate from token counts + Opus rates).
  • .claude/comparisons/ is a new directory — add to standards alongside findings/, chats/, etc.

Out of scope (separate issues if useful)

  • Web UI for comparison browsing. CLI + markdown files are sufficient for the initial "is this useful" question; UI later if the corpus grows large enough to want filtering/search.
  • Automated batch evaluation — running 10 standard questions across 10 reference plays nightly. Premature until we've done some manual comparisons.
  • Handoff-aware Claude side — option to have Claude Code read the dashboard's chat handoff first then answer. The user explicitly chose COLD-only for v1 to keep the comparison rigorous (independent).

Acceptance

  • harness compare <play_id> \"<question>\" produces a markdown file at .claude/comparisons/.
  • File contains: question, scope, both responses with timing+cost+tool-call metadata, blank operator-verdict rubric.
  • 10+ comparisons across different question types give a clear-enough signal to decide "keep / refine / deprecate" for the dashboard chat surface in 2-4 weeks.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions