compare-trace skill: A/B compare two MC agent conversations #93

Draft
santiagoaguiar wants to merge 3 commits into main from compare-traces

@santiagoaguiar

Contributor

Summary

Adds /compare-trace, a trace-driven backport of monte-carlo-data/ai-agent#1236 (Johanna's Agent A/B Evaluation Framework). The original ran the agent itself against fixed scenarios; this skill operates on already-captured conversations fetched via the Monte Carlo MCP server, so it works for any traced agent without re-running anything.

What it does

Takes two conversation_ids (UUIDv7 per the OTel GenAI gen_ai.conversation.id convention), walks each to its main OTel trace, normalizes structural data, runs five evaluators, and renders an HTML report.

The 5 signals

| Signal | Type | Source |
|---|---|---|
| Graph Path | deterministic | Jaccard on node sets + LCS/max ordering; ported verbatim from PR #1236 |
| Latency & Tokens | deterministic | Per-metric ratios (exec time, LLM calls, tokens, tool calls); flag if candidate > 1.5× baseline |
| Tool Call Sequence + Args | deterministic | Levenshtein on tool-name sequences; matched calls get added/removed/changed arg-key diff with truncated inline values + expand-toggle for full values |
| Semantic Diff | LLM (inline) | Claude runs the prompt over both final-completion texts |
| Entity Overlap | LLM (inline) | Extracts 8 entity types from final completions, computes per-type Jaccard |
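
For reviewers unfamiliar with the Graph Path signal, here is a minimal sketch of the two measures it names. It is not the code ported from PR #1236. It assumes "LCS/max" means the longest-common-subsequence length divided by the longer path's length, and that two empty paths score 1.0 (consistent with the empty-trace smoke test in the test plan). The function names and node names are hypothetical.

```python
from typing import Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity on the sets of node names visited by each trace."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty paths count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def ordering_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer path length (assumed meaning of LCS/max)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty paths count as identical
    return lcs_length(a, b) / longest


# Hypothetical node paths for a baseline and a candidate run.
baseline_path = ["plan", "search", "summarize"]
candidate_path = ["plan", "summarize", "search", "respond"]
print(jaccard(baseline_path, candidate_path))              # 0.75
print(ordering_similarity(baseline_path, candidate_path))  # 0.5
```

Keeping the set measure and the ordering measure separate makes it visible when two runs touch the same nodes but in a different order.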

Fallback path: when get_agent_trace returns "Incomplete trace", the skill reconstructs structural data from cached conversation edges and marks the result with source: "conversation_fallback" so the user knows the graph view is shallower.
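
The Latency & Tokens signal reduces to per-metric ratios with a 1.5× flag threshold. A minimal sketch of that rule follows. The metric keys and the `ratio_flags` helper are hypothetical, and a zero baseline is treated as "no ratio" here, which is an assumption rather than the skill's documented behavior.

```python
from typing import Dict, Optional


def ratio_flags(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    threshold: float = 1.5,
) -> Dict[str, dict]:
    """Compute candidate/baseline per metric and flag ratios above the threshold."""
    report = {}
    # Only compare metrics that both sides actually recorded.
    for metric in sorted(baseline.keys() & candidate.keys()):
        base, cand = baseline[metric], candidate[metric]
        # Assumption: a zero baseline yields no ratio instead of dividing by zero.
        ratio: Optional[float] = cand / base if base else None
        report[metric] = {
            "ratio": ratio,
            "flagged": ratio is not None and ratio > threshold,
        }
    return report


# Hypothetical metric names and values.
baseline_metrics = {"exec_time_s": 10.0, "llm_calls": 4, "tool_calls": 0}
candidate_metrics = {"exec_time_s": 18.0, "llm_calls": 5, "tool_calls": 2}
for name, row in ratio_flags(baseline_metrics, candidate_metrics).items():
    print(name, row)
```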

Layout

  • Canonical skill: skills/compare-trace/ (SKILL.md, README, scripts/compare_traces.py, evaluators ported from PR #1236, references/PR1236_MAPPING.md)
  • Editor plugins: symlinks under plugins/{claude-code,cursor,codex,copilot,opencode}/skills/
  • Skill tables: added an "Evaluate" section to root README.md and a row in skills/README.md

Validation

End-to-end tested against two real ai-agent conversations from June 3/4:

  • Phase 2 conversation walk discovers OTel trace_ids, drops turn_errors retries, picks the largest non-error trace.
  • Phase 2 step 4 parses tool-call args from assistant completions (LangChain/Bedrock arguments: str JSON shape, with isinstance(arguments, dict) fallback for OpenAI / Anthropic-native shapes).
  • Phase 3 fell back to conversation-edge reconstruction for one of the two trace_ids ("Incomplete trace") — fallback path works.
  • Phase 4c "not actually an A/B?" sanity check fires when overall_jaccard ≈ 0 AND exec-time ratio > 5× (caught the test pair, which were different scenarios).
  • Arg-diff produced 7 verified matches with verbatim values (start_time: 2026-05-20T00:00:00Z → 2026-05-21T00:00:00Z etc.).
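
The argument handling validated above can be sketched roughly as follows: accept either a JSON string (the LangChain/Bedrock shape) or an already-decoded dict (the OpenAI / Anthropic-native shape), then diff the keys of two matched calls. This is an illustration, not the skill's code. `parse_arguments` and `diff_args` are hypothetical names, and returning an empty dict for unparseable input is an assumption. Only the `start_time` values come from the validation run; `limit` and `table` are made up.

```python
import json
from typing import Any, Dict


def parse_arguments(arguments: Any) -> Dict[str, Any]:
    """Normalize tool-call arguments to a dict, whatever shape they arrived in."""
    if isinstance(arguments, dict):
        return arguments  # already decoded (dict fallback)
    if isinstance(arguments, str):
        try:
            parsed = json.loads(arguments)
        except json.JSONDecodeError:
            return {}  # assumption: unparseable arguments are treated as empty
        return parsed if isinstance(parsed, dict) else {}
    return {}


def diff_args(baseline: Dict[str, Any], candidate: Dict[str, Any]) -> Dict[str, Any]:
    """Report added, removed, and changed argument keys between two matched calls."""
    shared = baseline.keys() & candidate.keys()
    return {
        "added": sorted(candidate.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - candidate.keys()),
        "changed": {
            key: (baseline[key], candidate[key])
            for key in sorted(shared)
            if baseline[key] != candidate[key]
        },
    }


# One side arrives as a JSON string, the other as a dict.
a = parse_arguments('{"start_time": "2026-05-20T00:00:00Z", "limit": 10}')
b = parse_arguments({"start_time": "2026-05-21T00:00:00Z", "table": "orders"})
print(diff_args(a, b))
```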

Test plan

  • Smoke-test driver with short-arg fixtures — values render inline, no truncation toggle
  • Smoke-test driver with long-SQL fixtures — values truncated at 80 chars with working ▸ show full values expand
  • Smoke-test the empty-trace path (both sides) — 1.00 similarity, "Identical paths" fallback message
  • Smoke-test the identical-trace path — shared lists populate correctly
  • Subagent end-to-end run against real production conversations — all 5 signals render, sanity check fires
  • Subagent re-run after the v0.2 → v0.3 patches — confirmed first=100 cap, snake_case fields, turn_errors exclusion, JSON completion parsing, Incomplete-trace fallback, arg-diff with values

Notes

  • No .work/ directory in this PR — the work was conversational with Claude Code, not from a Linear ticket.
  • plugins/mc-agent-toolkit/ left untracked. That's pre-existing local WIP unrelated to this PR; the codex variant of compare-trace there will land with whatever PR ships the broader plugin.

🤖 Generated with Claude Code

santiagoaguiar and others added 3 commits June 4, 2026 16:57
Adds /compare-trace, a trace-driven backport of monte-carlo-data/ai-agent#1236
that fetches two MC agent conversations via MCP and produces an HTML report
with five evaluators: graph-path diff, latency/token diff, tool-call sequence
+ argument diff, and (when final completions are reachable) inline LLM-driven
semantic and entity-overlap diffs.

The skill takes conversation_ids as primary input (per the OTel GenAI
`gen_ai.conversation.id` convention), walks each conversation to its main OTel
trace, normalizes structural data, and renders the report. Falls back to
reconstructing structural data from conversation edges when get_agent_trace
returns "Incomplete trace".

Tool-call argument diff (v0.3) shows added/removed/changed arg keys with
inline truncated values and a per-row expand toggle for the full value, with
greedy proximity matching ported from PR #1236.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Running plugins/codex/scripts/install.sh with this repo as target
creates .agents/, .codex/hooks.json (with absolute home paths!), and
a self-copy of the codex plugin under plugins/mc-agent-toolkit/.
Those are install state for downstream consumers, not source for this
repo — gitignore them so they don't accidentally land in commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds an alternative trace-acquisition path to the skill: collect spans
locally from any OTel-instrumented agent instead of pulling MC-stored
conversations. Both paths feed the same normalized shape and the same
comparator/HTML report.

New scripts (framework-neutral):
- scripts/local_otlp_receiver.py — OTLP/HTTP protobuf receiver writing
  spans as JSONL. Works with any GenAI-semconv-emitting stack.
- scripts/sources/otel_spans.py — JSONL → normalized-trace converter.
  Reads OTel GenAI semconv attributes + LangGraph node-span naming.
  Dedupes tool calls by id across spans — Traceloop's Bedrock
  instrumentor only emits them under gen_ai.prompt.*.tool_calls.* on
  the next LLM call, never on the completion that produced them.

Docs:
- references/local-otel-collection.md — 4-step pipeline (receiver →
  agent → normalize → diff), dialect coverage, ai-agent integration
  appendix with MCP_SERVER_URL / slowapi / monkeypatch gotchas.
- SKILL.md gains a Trace sources section pointing to both ingestion
  paths; version bumped to 0.4.0.

Tests + CI:
- skills/compare-trace/tests/test_otel_spans.py — subprocess-driven
  smoke test against two fixture JSONLs (Bedrock and OpenAI/Anthropic
  dialects). 16 assertions, stdlib only.
- .github/workflows/compare-trace-tests.yml — paths-scoped CI.

Trigger evals:
- plugins/claude-code/evals/compare-trace/trigger-evals.json — 11
  trigger + 9 no-trigger cases distinguishing this skill from
  analyze-root-cause, monitoring-advisor, and generic comparison.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>