compare-trace skill: A/B compare two MC agent conversations by santiagoaguiar · Pull Request #93 · monte-carlo-data/mc-agent-toolkit

santiagoaguiar · 2026-06-04T19:59:28Z

Summary

Adds /compare-trace, a trace-driven backport of monte-carlo-data/ai-agent#1236 (Johanna's Agent A/B Evaluation Framework). The original ran the agent itself against fixed scenarios; this skill operates on already-captured conversations fetched via the Monte Carlo MCP server, so it works for any traced agent without re-running anything.

What it does

Takes two conversation_ids (UUIDv7 per the OTel GenAI gen_ai.conversation.id convention), walks each to its main OTel trace, normalizes structural data, runs five evaluators, and renders an HTML report.

The 5 signals

Signal	Type	Source
Graph Path	deterministic	Jaccard on node sets + LCS/max ordering — ported verbatim from PR #1236
Latency & Tokens	deterministic	Per-metric ratios (exec time, LLM calls, tokens, tool calls); flag if candidate > 1.5× baseline
Tool Call Sequence + Args	deterministic	Levenshtein on tool-name sequences; matched calls get added/removed/changed arg-key diff with truncated inline values + expand-toggle for full values
Semantic Diff	LLM (inline)	Claude runs the prompt over both final-completion texts
Entity Overlap	LLM (inline)	Extracts 8 entity types from final completions, computes per-type Jaccard
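
For orientation, the three deterministic signals can be sketched as below. This is a minimal sketch and not the code ported from PR #1236: the function names, the "two empty sequences are identical" convention, and the handling of a zero baseline are assumptions made for illustration.

```python
from typing import Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity of two node sets (assumed: two empty sets score 1.0)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ordering_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer path length ("LCS/max")."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two tool-name sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def is_flagged(baseline: float, candidate: float, threshold: float = 1.5) -> bool:
    """Flag a metric when the candidate exceeds threshold x baseline."""
    # Multiplying instead of dividing avoids a ZeroDivisionError when baseline is 0.
    return candidate > threshold * baseline
```

For example, `jaccard(["plan", "search", "answer"], ["search", "answer", "chart"])` shares two of four distinct nodes, and `is_flagged(10, 16)` trips the 1.5× threshold while `is_flagged(10, 15)` does not.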

Fallback path: when get_agent_trace returns "Incomplete trace", the skill reconstructs structural data from cached conversation edges and marks the result with source: "conversation_fallback", so the user knows the graph view is shallower than a full trace would give.

Layout

Canonical skill: skills/compare-trace/ (SKILL.md, README, scripts/compare_traces.py, evaluators ported from PR #1236, references/PR1236_MAPPING.md)
Editor plugins: symlinks under plugins/{claude-code,cursor,codex,copilot,opencode}/skills/
Skill tables: added an "Evaluate" section to root README.md and a row in skills/README.md

Validation

End-to-end tested against two real ai-agent conversations from June 3/4:

Phase 2 conversation walk discovers OTel trace_ids, drops turn_errors retries, picks the largest non-error trace.
Phase 2 step 4 parses tool-call args from assistant completions (LangChain/Bedrock arguments: str JSON shape, with isinstance(arguments, dict) fallback for OpenAI / Anthropic-native shapes).
Phase 3 fell back to conversation-edge reconstruction for one of the two trace_ids ("Incomplete trace") — fallback path works.
Phase 4c "not actually an A/B?" sanity check fires when overall_jaccard ≈ 0 AND exec-time ratio > 5× (caught the test pair, which were different scenarios).
Arg-diff produced 7 verified matches with verbatim values (start_time: 2026-05-20T00:00:00Z → 2026-05-21T00:00:00Z etc.).
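
Two of the checks above are small enough to sketch. The code below is illustrative only, not the skill's implementation: `parse_tool_args`, `looks_like_different_scenarios`, and the `jaccard_eps` tolerance used for "≈ 0" are hypothetical names and values.

```python
import json
from typing import Any


def parse_tool_args(arguments: Any) -> dict:
    """Return tool-call arguments as a dict, whichever shape they arrived in."""
    # OpenAI / Anthropic-native shape: arguments are already a dict.
    if isinstance(arguments, dict):
        return arguments
    # LangChain/Bedrock shape: arguments are a JSON-encoded string.
    if isinstance(arguments, str):
        try:
            parsed = json.loads(arguments)
        except json.JSONDecodeError:
            return {}  # assumption: unparseable arguments are treated as empty
        return parsed if isinstance(parsed, dict) else {}
    return {}


def looks_like_different_scenarios(
    overall_jaccard: float,
    exec_time_ratio: float,
    jaccard_eps: float = 0.05,  # assumed tolerance for "approximately zero"
    ratio_limit: float = 5.0,
) -> bool:
    """True when the pair is probably not an A/B of the same scenario."""
    return overall_jaccard <= jaccard_eps and exec_time_ratio > ratio_limit
```

Both conditions must hold for the sanity check to fire, so a pair with unrelated graphs but similar run time is not reported.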

Test plan

Smoke-test driver with short-arg fixtures — values render inline, no truncation toggle
Smoke-test driver with long-SQL fixtures — values truncated at 80 chars with working ▸ show full values expand
Smoke-test the empty-trace path (both sides) — 1.00 similarity, "Identical paths" fallback message
Smoke-test the identical-trace path — shared lists populate correctly
Subagent end-to-end run against real production conversations — all 5 signals render, sanity check fires
Subagent re-run after the v0.2 → v0.3 patches — confirmed first=100 cap, snake_case fields, turn_errors exclusion, JSON completion parsing, Incomplete-trace fallback, arg-diff with values
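
The arg-diff and truncation behaviour exercised by the first two smoke tests can be approximated as follows. This is a sketch under assumptions: the helper names are hypothetical, and the real report may stringify and truncate values differently.

```python
from typing import Any, Tuple

MAX_INLINE = 80  # inline values longer than this get an expand toggle


def truncate(value: Any, limit: int = MAX_INLINE) -> Tuple[str, bool]:
    """Return (display text, was_truncated) for one argument value."""
    text = value if isinstance(value, str) else repr(value)
    if len(text) <= limit:
        return text, False
    return text[:limit] + "…", True


def diff_args(baseline: dict, candidate: dict) -> dict:
    """Classify arg keys of a matched tool call as added, removed, or changed."""
    shared = baseline.keys() & candidate.keys()
    return {
        "added": sorted(candidate.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - candidate.keys()),
        "changed": sorted(k for k in shared if baseline[k] != candidate[k]),
    }
```

With short-arg fixtures `truncate` returns `False` for every value, so no toggle is needed; a long SQL string comes back cut to 80 characters with the flag set.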

Notes

No .work/ directory in this PR — the work was conversational with Claude Code, not from a Linear ticket.
plugins/mc-agent-toolkit/ left untracked. That's pre-existing local WIP unrelated to this PR; the codex variant of compare-trace there will land with whatever PR ships the broader plugin.

🤖 Generated with Claude Code

Adds /compare-trace, a trace-driven backport of monte-carlo-data/ai-agent#1236 that fetches two MC agent conversations via MCP and produces an HTML report with five evaluators: graph-path diff, latency/token diff, tool-call sequence + argument diff, and (when final completions are reachable) inline LLM-driven semantic and entity-overlap diffs.

The skill takes conversation_ids as primary input (per the OTel GenAI `gen_ai.conversation.id` convention), walks each conversation to its main OTel trace, normalizes structural data, and renders the report. Falls back to reconstructing structural data from conversation edges when get_agent_trace returns "Incomplete trace".

Tool-call argument diff (v0.3) shows added/removed/changed arg keys with inline truncated values and a per-row expand toggle for the full value, with greedy proximity matching ported from PR #1236.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Running plugins/codex/scripts/install.sh with this repo as target creates .agents/, .codex/hooks.json (with absolute home paths!), and a self-copy of the codex plugin under plugins/mc-agent-toolkit/. Those are install state for downstream consumers, not source for this repo, so gitignore them to keep them from accidentally landing in commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds an alternative trace-acquisition path to the skill: collect spans locally from any OTel-instrumented agent instead of pulling MC-stored conversations. Both paths feed the same normalized shape and the same comparator/HTML report.

New scripts (framework-neutral):
- scripts/local_otlp_receiver.py: OTLP/HTTP protobuf receiver writing spans as JSONL. Works with any GenAI-semconv-emitting stack.
- scripts/sources/otel_spans.py: JSONL → normalized-trace converter. Reads OTel GenAI semconv attributes + LangGraph node-span naming. Dedupes tool calls by id across spans, because Traceloop's Bedrock instrumentor only emits them under gen_ai.prompt.*.tool_calls.* on the next LLM call, never on the completion that produced them.

Docs:
- references/local-otel-collection.md: 4-step pipeline (receiver → agent → normalize → diff), dialect coverage, ai-agent integration appendix with MCP_SERVER_URL / slowapi / monkeypatch gotchas.
- SKILL.md gains a Trace sources section pointing to both ingestion paths; version bumped to 0.4.0.

Tests + CI:
- skills/compare-trace/tests/test_otel_spans.py: subprocess-driven smoke test against two fixture JSONLs (Bedrock and OpenAI/Anthropic dialects). 16 assertions, stdlib only.
- .github/workflows/compare-trace-tests.yml: paths-scoped CI.

Trigger evals:
- plugins/claude-code/evals/compare-trace/trigger-evals.json: 11 trigger + 9 no-trigger cases distinguishing this skill from analyze-root-cause, monitoring-advisor, and generic comparison.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
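
The id-based dedupe mentioned in this commit can be pictured roughly as follows. This is a hedged sketch, not the contents of scripts/sources/otel_spans.py: it assumes each tool call has already been extracted into a dict with an `id` key, and it keeps calls without an id because they cannot be matched.

```python
from typing import Iterable, List


def dedupe_tool_calls(calls: Iterable[dict]) -> List[dict]:
    """Keep the first occurrence of each tool call id, preserving order."""
    seen = set()
    unique = []
    for call in calls:
        call_id = call.get("id")
        if call_id is None:
            # Assumption: calls without an id cannot be deduped, so keep them.
            unique.append(call)
            continue
        if call_id in seen:
            continue  # same call replayed in a later span's prompt history
        seen.add(call_id)
        unique.append(call)
    return unique
```

Deduping matters here because a call that only appears in prompt history is repeated on every subsequent LLM span, which would otherwise inflate the tool-call count and distort the sequence comparison.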

santiagoaguiar and others added 3 commits June 4, 2026 16:57

santiagoaguiar wants to merge 3 commits into main from compare-traces
