compare-trace skill: A/B compare two MC agent conversations #93
Draft
santiagoaguiar wants to merge 3 commits into
Adds /compare-trace, a trace-driven backport of monte-carlo-data/ai-agent#1236 that fetches two MC agent conversations via MCP and produces an HTML report with five evaluators: graph-path diff, latency/token diff, tool-call sequence + argument diff, and (when final completions are reachable) inline LLM-driven semantic and entity-overlap diffs.

The skill takes conversation_ids as primary input (per the OTel GenAI `gen_ai.conversation.id` convention), walks each conversation to its main OTel trace, normalizes structural data, and renders the report. Falls back to reconstructing structural data from conversation edges when get_agent_trace returns "Incomplete trace".

Tool-call argument diff (v0.3) shows added/removed/changed arg keys with inline truncated values and a per-row expand toggle for the full value, with greedy proximity matching ported from PR #1236.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
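The argument diff and call matching described in this commit could look roughly like the sketch below. It is an illustration only, not the code ported from PR #1236: the names `truncate`, `diff_args`, `match_calls` and the 40-character inline width are hypothetical, and "proximity" is assumed to mean distance between positions in the two call sequences.

```python
import json
from typing import Any

MAX_INLINE = 40  # hypothetical width for inline (truncated) values


def truncate(value: Any, limit: int = MAX_INLINE) -> str:
    """Render a value as compact JSON, shortened for inline display."""
    text = json.dumps(value, sort_keys=True, default=str)
    return text if len(text) <= limit else text[: limit - 1] + "…"


def diff_args(a: dict[str, Any], b: dict[str, Any]) -> dict[str, dict]:
    """Classify argument keys as added, removed, or changed between A and B."""
    added = {k: truncate(b[k]) for k in b.keys() - a.keys()}
    removed = {k: truncate(a[k]) for k in a.keys() - b.keys()}
    changed = {
        k: {"a": truncate(a[k]), "b": truncate(b[k])}
        for k in a.keys() & b.keys()
        if a[k] != b[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


def match_calls(a_names: list[str], b_names: list[str]) -> list[tuple[int, int]]:
    """Greedily pair each A call with the nearest unmatched B call of the same tool.

    Assumption: "nearest" is the smallest difference in sequence position.
    """
    used: set[int] = set()
    pairs: list[tuple[int, int]] = []
    for i, name in enumerate(a_names):
        candidates = [
            j for j, other in enumerate(b_names) if other == name and j not in used
        ]
        if candidates:
            j = min(candidates, key=lambda idx: abs(idx - i))
            used.add(j)
            pairs.append((i, j))
    return pairs
```

A greedy pass keeps the matching simple and deterministic, at the cost of occasionally pairing calls that a global assignment would pair differently. Calls left unmatched on either side would presumably be reported as added or removed rows.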
Running plugins/codex/scripts/install.sh with this repo as target creates .agents/, .codex/hooks.json (with absolute home paths!), and a self-copy of the codex plugin under plugins/mc-agent-toolkit/. Those are install state for downstream consumers, not source for this repo; gitignore them so they don't accidentally land in commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an alternative trace-acquisition path to the skill: collect spans locally from any OTel-instrumented agent instead of pulling MC-stored conversations. Both paths feed the same normalized shape and the same comparator/HTML report.

New scripts (framework-neutral):
- scripts/local_otlp_receiver.py: OTLP/HTTP protobuf receiver writing spans as JSONL. Works with any GenAI-semconv-emitting stack.
- scripts/sources/otel_spans.py: JSONL → normalized-trace converter. Reads OTel GenAI semconv attributes + LangGraph node-span naming. Dedupes tool calls by id across spans, because Traceloop's Bedrock instrumentor only emits them under gen_ai.prompt.*.tool_calls.* on the next LLM call, never on the completion that produced them.

Docs:
- references/local-otel-collection.md: 4-step pipeline (receiver → agent → normalize → diff), dialect coverage, ai-agent integration appendix with MCP_SERVER_URL / slowapi / monkeypatch gotchas.
- SKILL.md gains a Trace sources section pointing to both ingestion paths; version bumped to 0.4.0.

Tests + CI:
- skills/compare-trace/tests/test_otel_spans.py: subprocess-driven smoke test against two fixture JSONLs (Bedrock and OpenAI/Anthropic dialects). 16 assertions, stdlib only.
- .github/workflows/compare-trace-tests.yml: paths-scoped CI.

Trigger evals:
- plugins/claude-code/evals/compare-trace/trigger-evals.json: 11 trigger + 9 no-trigger cases distinguishing this skill from analyze-root-cause, monitoring-advisor, and generic comparison.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
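As a rough illustration of the dedupe-by-id step mentioned for scripts/sources/otel_spans.py, the sketch below collects tool calls from flattened span attributes and keeps one entry per tool-call id. It is not the script's actual code: `collect_tool_calls`, the `attributes` key on each span dict, and the exact `gen_ai.{prompt|completion}.<i>.tool_calls.<j>.{id|name|arguments}` layout are assumptions made for the example.

```python
import re
from typing import Any, Iterable

# Assumed attribute layout (hypothetical for this sketch):
#   gen_ai.{prompt|completion}.<i>.tool_calls.<j>.{id|name|arguments}
TOOL_CALL_ATTR = re.compile(
    r"^gen_ai\.(prompt|completion)\.(\d+)\.tool_calls\.(\d+)\.(id|name|arguments)$"
)


def collect_tool_calls(spans: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return tool calls in first-seen order, one entry per tool-call id."""
    seen: dict[str, dict[str, Any]] = {}
    for span in spans:
        # Group the flattened attributes of this span back into per-call dicts.
        grouped: dict[tuple[str, str, str], dict[str, Any]] = {}
        for key, value in span.get("attributes", {}).items():
            m = TOOL_CALL_ATTR.match(key)
            if m:
                kind, msg_idx, call_idx, field = m.groups()
                grouped.setdefault((kind, msg_idx, call_idx), {})[field] = value
        # The same call is repeated in the prompt of every later LLM span,
        # so only the first occurrence of each id is kept.
        for call in grouped.values():
            call_id = call.get("id")
            if call_id and call_id not in seen:
                seen[call_id] = call
    return list(seen.values())
```

Keying on the call id rather than on span position matters here because, per the commit message, the Bedrock instrumentor reports a call only on the following LLM span, where it then reappears in each later prompt history.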
Summary
Adds /compare-trace, a trace-driven backport of monte-carlo-data/ai-agent#1236 (Johanna's Agent A/B Evaluation Framework). The original ran the agent itself against fixed scenarios; this skill operates on already-captured conversations fetched via the Monte Carlo MCP server, so it works for any traced agent without re-running anything.
What it does
Takes two conversation_ids (UUIDv7 per the OTel GenAI `gen_ai.conversation.id` convention), walks each to its main OTel trace, normalizes structural data, runs five evaluators, and renders an HTML report.
The 5 signals
Fall-back path: when get_agent_trace returns "Incomplete trace", the skill reconstructs structural data from cached conversation edges and marks `source: "conversation_fallback"` so the user knows the graph view is shallower.
Layout
- skills/compare-trace/ (SKILL.md, README, scripts/compare_traces.py, evaluators ported from PR #1236, references/PR1236_MAPPING.md)
- plugins/{claude-code,cursor,codex,copilot,opencode}/skills/README.md and a row in skills/README.md
Validation
End-to-end tested against two real ai-agent conversations from June 3/4:
- trace_ids, drops turn_errors retries, picks the largest non-error trace.
- arguments: str JSON shape, with isinstance(arguments, dict) fallback for OpenAI / Anthropic-native shapes).
- overall_jaccard ≈ 0 AND exec-time ratio > 5× (caught the test pair, which were different scenarios).
- start_time: 2026-05-20T00:00:00Z → 2026-05-21T00:00:00Z etc.).
Test plan
- ▸ show full values expand
- first=100 cap, snake_case fields, turn_errors exclusion, JSON completion parsing, Incomplete-trace fallback, arg-diff with values
Notes
- No .work/ directory in this PR: the work was conversational with Claude Code, not from a Linear ticket.
- plugins/mc-agent-toolkit/ left untracked. That's pre-existing local WIP unrelated to this PR; the codex variant of compare-trace there will land with whatever PR ships the broader plugin.

🤖 Generated with Claude Code
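For reference, the different-scenario condition mentioned under Validation (overall_jaccard ≈ 0 AND exec-time ratio > 5×) could be checked along these lines. This is a sketch under assumptions, not the skill's code: it assumes overall_jaccard is a Jaccard similarity over the entities each conversation mentions, reads "≈ 0" as a hypothetical cutoff of 0.05, and uses made-up names (`jaccard`, `looks_like_different_scenarios`).

```python
from typing import Iterable


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def looks_like_different_scenarios(
    entities_a: Iterable[str],
    entities_b: Iterable[str],
    exec_seconds_a: float,
    exec_seconds_b: float,
    jaccard_eps: float = 0.05,  # hypothetical reading of "≈ 0"
    ratio_limit: float = 5.0,   # "> 5×" from the Validation notes
) -> bool:
    """Flag a pair whose entities barely overlap and whose runtimes differ widely."""
    overlap = jaccard(set(entities_a), set(entities_b))
    slow = max(exec_seconds_a, exec_seconds_b)
    fast = min(exec_seconds_a, exec_seconds_b)
    if slow == 0:
        ratio = 1.0  # both runs report zero time; treat as equal
    elif fast == 0:
        ratio = float("inf")
    else:
        ratio = slow / fast
    # Both conditions must hold, mirroring the AND in the description.
    return overlap <= jaccard_eps and ratio > ratio_limit
```

Requiring both conditions keeps the check conservative: a slow run over the same entities, or a different entity set at similar cost, would not be flagged on its own.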