monte-carlo-data · santiagoaguiar · Jun 4, 2026 · Jun 4, 2026 · Jun 8, 2026
diff --git a/.github/workflows/compare-trace-tests.yml b/.github/workflows/compare-trace-tests.yml
@@ -0,0 +1,30 @@
+name: Compare Trace Skill Tests
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - 'skills/compare-trace/**'
+      - '.github/workflows/compare-trace-tests.yml'
+  pull_request:
+    paths:
+      - 'skills/compare-trace/**'
+      - '.github/workflows/compare-trace-tests.yml'
+
+permissions:
+  contents: read
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+
+      # test_otel_spans.py exercises sources/otel_spans.py via subprocess.
+      # Stdlib only — no pip install needed.
+      - name: Run otel_spans normalizer tests
+        run: python3 skills/compare-trace/tests/test_otel_spans.py
diff --git a/.gitignore b/.gitignore
@@ -14,3 +14,9 @@ temp/
 __pycache__/
 *.pyc
 .work/
+
+# Local install artifacts from running plugins/codex/scripts/install.sh
+# against this repo. Intended for downstream target repos, not this one.
+.agents/
+.codex/
+plugins/mc-agent-toolkit/
diff --git a/README.md b/README.md
@@ -48,6 +48,12 @@ Skills are grouped by the job they help you do. Orchestrated workflows sequence
 | **Storage Cost Analysis** | Identifies storage waste (unread, zombie, dead-end tables); uses lineage to verify cleanup is safe and estimates savings. | [README](skills/storage-cost-analysis/README.md) |
 | **Performance Diagnosis** | Diagnoses slow pipelines and expensive queries across Airflow, dbt, Databricks, and other platforms. | [README](skills/performance-diagnosis/README.md) |
 
+### Evaluate — compare agent runs
+
+| Skill | Description | Details |
+|---|---|---|
+| **Compare Trace** | A/B compares two existing agent traces (by ID) — graph path, latency/tokens, tool-call sequence, plus LLM-based semantic and entity-overlap diffs over the final answers. Emits an HTML report. | [README](skills/compare-trace/README.md) |
+
 ### Setup — ingestion and connections
 
 | Skill | Description | Details |

diff --git a/plugins/claude-code/evals/compare-trace/trigger-evals.json b/plugins/claude-code/evals/compare-trace/trigger-evals.json
@@ -0,0 +1,126 @@
+{
+  "skill": "monte-carlo-compare-trace",
+  "description": "Trigger accuracy evals for the monte-carlo-compare-trace skill. Each case specifies whether the skill SHOULD or SHOULD NOT be triggered by the given prompt.",
+  "cases": [
+    {
+      "id": "should-01",
+      "prompt": "Compare these two agent conversations: 019e8f2a-24ae-7880-8901-cbc79aca43ed and 019e9319-e88e-7080-bc78-2aff46543849",
+      "expected": "trigger",
+      "rationale": "Direct A/B compare with two conversation IDs -- core skill use case"
+    },
+    {
+      "id": "should-02",
+      "prompt": "Diff these two agent runs and tell me what changed.",
+      "expected": "trigger",
+      "rationale": "Explicit 'diff two agent runs' phrasing -- listed trigger"
+    },
+    {
+      "id": "should-03",
+      "prompt": "I tweaked the system prompt for the coverage agent and re-ran it on the same conversation. Did the change cause a regression? Here are the two IDs.",
+      "expected": "trigger",
+      "rationale": "Prompt change regression check between two runs -- explicit use case"
+    },
+    {
+      "id": "should-04",
+      "prompt": "/compare-trace 019e8f2a-24ae-7880-8901-cbc79aca43ed 019e9319-e88e-7080-bc78-2aff46543849",
+      "expected": "trigger",
+      "rationale": "Explicit slash-command invocation -- skill's own command"
+    },
+    {
+      "id": "should-05",
+      "prompt": "Here are two conversation IDs from the chat agent — show me how the tool sequences differ.",
+      "expected": "trigger",
+      "rationale": "Tool-sequence diff between two conversations -- tool_call_diff evaluator territory"
+    },
+    {
+      "id": "should-06",
+      "prompt": "We swapped the agent model from claude-3-5 to claude-sonnet-4 and re-ran a fixed scenario. Compare the two runs.",
+      "expected": "trigger",
+      "rationale": "Model-swap A/B with two runs to compare -- explicit use case"
+    },
+    {
+      "id": "should-07",
+      "prompt": "Compare the OTel traces from these two agent runs and produce a side-by-side report.",
+      "expected": "trigger",
+      "rationale": "Trace-level comparison with HTML-report intent -- matches skill output"
+    },
+    {
+      "id": "should-08",
+      "prompt": "I have two agent traces I want to look at side-by-side. Conversation IDs are X and Y.",
+      "expected": "trigger",
+      "rationale": "Side-by-side trace comparison with both IDs supplied"
+    },
+    {
+      "id": "should-09",
+      "prompt": "Show me the difference in graph path between baseline and candidate runs of the coverage agent.",
+      "expected": "trigger",
+      "rationale": "Graph-path diff between two runs -- graph_path_diff evaluator territory"
+    },
+    {
+      "id": "should-10",
+      "prompt": "Did removing the get_use_cases tool change how the coverage agent handles 'what's my coverage gap?' Compare a before and after run.",
+      "expected": "trigger",
+      "rationale": "Tool-loadout change A/B between two runs -- explicit use case"
+    },
+    {
+      "id": "should-11",
+      "prompt": "Compare the latency and token usage between these two agent conversations.",
+      "expected": "trigger",
+      "rationale": "Latency/token diff between two runs -- latency_diff evaluator territory"
+    },
+    {
+      "id": "should-not-01",
+      "prompt": "What went wrong with this agent run? Here's the conversation ID.",
+      "expected": "no-trigger",
+      "rationale": "Single-trace troubleshooting -- not a comparison; routes to analyze-root-cause / incident-response"
+    },
+    {
+      "id": "should-not-02",
+      "prompt": "Investigate why this trace failed. The conversation ID is 019e8f2a-24ae-7880-8901-cbc79aca43ed.",
+      "expected": "no-trigger",
+      "rationale": "Single-trace failure investigation -- not an A/B comparison"
+    },
+    {
+      "id": "should-not-03",
+      "prompt": "Compare row counts between our staging and production orders tables.",
+      "expected": "no-trigger",
+      "rationale": "Cross-table data comparison -- routes to monitoring-advisor (comparison monitor); not agent A/B"
+    },
+    {
+      "id": "should-not-04",
+      "prompt": "How does my chat agent perform overall? Show me aggregate metrics.",
+      "expected": "no-trigger",
+      "rationale": "Aggregate performance question with no two specific conversation IDs to compare"
+    },
+    {
+      "id": "should-not-05",
+      "prompt": "Set up an evaluation monitor for my chat agent to track response quality over time.",
+      "expected": "no-trigger",
+      "rationale": "Agent eval monitor creation -- routes to monitoring-advisor"
+    },
+    {
+      "id": "should-not-06",
+      "prompt": "Diff these two SQL queries and tell me which one is more efficient.",
+      "expected": "no-trigger",
+      "rationale": "SQL comparison -- wrong domain (not agent traces)"
+    },
+    {
+      "id": "should-not-07",
+      "prompt": "Show me the trace for conversation 019e8f2a-24ae-7880-8901-cbc79aca43ed.",
+      "expected": "no-trigger",
+      "rationale": "Single-trace inspection -- not a comparison"
+    },
+    {
+      "id": "should-not-08",
+      "prompt": "Help me build a prompt eval framework for my LangGraph agent.",
+      "expected": "no-trigger",
+      "rationale": "Generic eval-framework engineering -- not an A/B compare on existing runs"
+    },
+    {
+      "id": "should-not-09",
+      "prompt": "Compare these two dbt models and tell me which one has more downstream tables.",
+      "expected": "no-trigger",
+      "rationale": "dbt model comparison -- wrong domain"
+    }
+  ]
+}
diff --git a/plugins/claude-code/skills/compare-trace b/plugins/claude-code/skills/compare-trace
@@ -0,0 +1 @@
+../../../skills/compare-trace
diff --git a/plugins/codex/skills/compare-trace b/plugins/codex/skills/compare-trace
@@ -0,0 +1 @@
+../../../skills/compare-trace
diff --git a/plugins/copilot/skills/compare-trace b/plugins/copilot/skills/compare-trace
@@ -0,0 +1 @@
+../../../skills/compare-trace
diff --git a/plugins/cursor/skills/compare-trace b/plugins/cursor/skills/compare-trace
@@ -0,0 +1 @@
+../../../skills/compare-trace
diff --git a/plugins/opencode/skills/compare-trace b/plugins/opencode/skills/compare-trace
@@ -0,0 +1 @@
+../../../skills/compare-trace
diff --git a/skills/README.md b/skills/README.md
@@ -22,6 +22,7 @@ Skills are platform-agnostic instruction sets that tell an AI coding agent what
 | **[Tune Monitor](tune-monitor/)** | Analyzes a Monte Carlo metric monitor's alert history and recommends configuration changes to reduce noise — sensitivity, WHERE conditions, segment exclusions, schedule, and aggregation. |
 | **[Connection Auth Rules](connection-auth-rules/)** | Build a Connection Auth Rules configuration for a Monte Carlo connection type. Fetches live connector schemas and transform steps from the apollo-agent repo. |
 | **[Instrument Agent](instrument-agent/)** | Instruments a Python AI agent for Monte Carlo Agent Observability — detects AI libraries, installs the Monte Carlo OpenTelemetry SDK, sets up tracing, and verifies traces in Monte Carlo. Asks before editing. |
+| **[Compare Trace](compare-trace/)** | A/B compares two Monte Carlo agent traces by ID — runs graph-path, latency/token, and tool-call diffs plus LLM-based semantic and entity-overlap evals over the final answers, and opens an HTML report. |
 
 ## Standalone Installation
 

diff --git a/skills/compare-trace/README.md b/skills/compare-trace/README.md
@@ -0,0 +1,44 @@
+# compare-trace
+
+A/B compare two Monte Carlo agent conversations by ID and produce an HTML report.
+
+Trace-driven backport of the [Agent A/B Evaluation Framework](https://github.com/monte-carlo-data/ai-agent/pull/1236) (PR #1236 in `ai-agent`). The original ran the agent itself against fixed scenarios; this skill operates on already-captured conversations fetched via the Monte Carlo MCP server.
+
+## Invocation
+
+```
+/compare-trace <conv_id_a> <conv_id_b>
+```
+
+Optional flags: `--mcon`, `--agent`, `--trace-ids a,b` (force specific OTel trace_ids when a conversation has multiple), `--labels A,B`, `--output path.html`.
+
+## ID model
+
+`conversation_id` is the user-facing identifier (per the OTel GenAI `gen_ai.conversation.id` semantic convention). It's stored as a span attribute, **not** as the OTel `trace_id`. One conversation can contain multiple OTel traces (retries, fan-outs, multi-turn).
+
+The skill resolves `conversation_id → trace_id` via `get_agent_conversation`. By default it picks the trace with the most spans (= the "main" execution); override with `--trace-ids` to compare specific sub-traces.
+
+## Signals
+
+| Signal | Type | Notes |
+|---|---|---|
+| Graph Path | deterministic | Jaccard similarity on node sets, plus an LCS/max ordering score |
+| Latency & Tokens | deterministic | Per-metric ratios; flag if candidate > 1.5x baseline |
+| Tool Call Sequence + Args | deterministic | Levenshtein on tool-name sequences; matched calls also get a top-level arg-key diff (added / removed / changed) |
+| Semantic Diff | LLM (inline) | Claude runs a prompt over both final-completion texts |
+| Entity Overlap | LLM (inline) | Extracts 8 entity types, computes per-type Jaccard |
+
+The two LLM signals require non-empty `final_output_text` for both sides (pulled from the last completion span in each conversation). If either side is empty, the report ships with only the three deterministic signals.
+
+## Files
+
+- `SKILL.md` — full workflow Claude follows
+- `scripts/compare_traces.py` — driver that consumes normalized trace JSON + optional LLM-eval JSON and writes HTML
+- `scripts/evaluators/{graph_path_diff,latency_diff,tool_call_diff}.py` — pure-Python evaluators ported from PR #1236
+- `references/PR1236_MAPPING.md` — fields-and-signals mapping from PR #1236 to the trace API
+
+## Known limitations (v0.3)
+
+- Picks one trace per conversation (the largest non-error one by edge count). In multi-trace conversations (retries, fan-outs) the other traces are currently dropped; pass `--trace-ids` to override.
+- Arg-diff matches calls by name + nearest position (greedy). When a tool's count differs between A and B, the surplus calls go unmatched. v0.4 plan: stable-ID fallback using `tool_use_id` when present.
+- No "structured fields" diff (the 6th evaluator in PR #1236) — only meaningful when you control the agent's output schema, which we don't from trace-land.
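The three deterministic signals listed in the `skills/compare-trace/README.md` Signals table could be sketched roughly as follows. This is a minimal illustration and not the code in `scripts/evaluators/`: every function name here is hypothetical, "LCS/max" is assumed to mean the LCS length divided by the longer sequence's length, and the handling of a zero baseline is an assumption as well.

```python
from typing import Hashable, Sequence


def jaccard(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Jaccard similarity of the two node sets; 1.0 when both are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def lcs_ratio(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """LCS length over the longer sequence length (assumed reading of 'LCS/max')."""
    if not a and not b:
        return 1.0
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b))


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Edit distance between two tool-name sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_flagged(baseline: float, candidate: float, threshold: float = 1.5) -> bool:
    """Flag when candidate exceeds threshold x baseline.

    Assumption: a non-positive baseline makes the ratio undefined, so it is not flagged.
    """
    if baseline <= 0:
        return False
    return candidate / baseline > threshold


# Hypothetical node and tool sequences for a baseline (A) and candidate (B) run.
path_a = ["plan", "search", "answer"]
path_b = ["plan", "answer"]
print(jaccard(path_a, path_b), lcs_ratio(path_a, path_b))
print(levenshtein(path_a, path_b), is_flagged(100.0, 160.0))
```

Keeping each metric a pure function over plain sequences fits the README's description of the evaluators as pure Python, and makes them straightforward to exercise with stdlib-only tests like the one wired into the workflow.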