30 changes: 30 additions & 0 deletions .github/workflows/compare-trace-tests.yml
@@ -0,0 +1,30 @@
name: Compare Trace Skill Tests

on:
  push:
    branches: [main]
    paths:
      - 'skills/compare-trace/**'
      - '.github/workflows/compare-trace-tests.yml'
  pull_request:
    paths:
      - 'skills/compare-trace/**'
      - '.github/workflows/compare-trace-tests.yml'

permissions:
  contents: read

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      # test_otel_spans.py exercises sources/otel_spans.py via subprocess.
      # Stdlib only; no pip install needed.
      - name: Run otel_spans normalizer tests
        run: python3 skills/compare-trace/tests/test_otel_spans.py
6 changes: 6 additions & 0 deletions .gitignore
@@ -14,3 +14,9 @@ temp/
__pycache__/
*.pyc
.work/

# Local install artifacts from running plugins/codex/scripts/install.sh
# against this repo. Intended for downstream target repos, not this one.
.agents/
.codex/
plugins/mc-agent-toolkit/
6 changes: 6 additions & 0 deletions README.md
@@ -48,6 +48,12 @@ Skills are grouped by the job they help you do. Orchestrated workflows sequence
| **Storage Cost Analysis** | Identifies storage waste (unread, zombie, dead-end tables); uses lineage to verify cleanup is safe and estimates savings. | [README](skills/storage-cost-analysis/README.md) |
| **Performance Diagnosis** | Diagnoses slow pipelines and expensive queries across Airflow, dbt, Databricks, and other platforms. | [README](skills/performance-diagnosis/README.md) |

### Evaluate — compare agent runs

| Skill | Description | Details |
|---|---|---|
| **Compare Trace** | A/B compares two existing agent traces (by ID) — graph path, latency/tokens, tool-call sequence, plus LLM-based semantic and entity-overlap diffs over the final answers. Emits an HTML report. | [README](skills/compare-trace/README.md) |

### Setup — ingestion and connections

| Skill | Description | Details |
126 changes: 126 additions & 0 deletions plugins/claude-code/evals/compare-trace/trigger-evals.json
@@ -0,0 +1,126 @@
{
"skill": "monte-carlo-compare-trace",
"description": "Trigger accuracy evals for the monte-carlo-compare-trace skill. Each case specifies whether the skill SHOULD or SHOULD NOT be triggered by the given prompt.",
"cases": [
{
"id": "should-01",
"prompt": "Compare these two agent conversations: 019e8f2a-24ae-7880-8901-cbc79aca43ed and 019e9319-e88e-7080-bc78-2aff46543849",
"expected": "trigger",
"rationale": "Direct A/B compare with two conversation IDs -- core skill use case"
},
{
"id": "should-02",
"prompt": "Diff these two agent runs and tell me what changed.",
"expected": "trigger",
"rationale": "Explicit 'diff two agent runs' phrasing -- listed trigger"
},
{
"id": "should-03",
"prompt": "I tweaked the system prompt for the coverage agent and re-ran it on the same conversation. Did the change cause a regression? Here are the two IDs.",
"expected": "trigger",
"rationale": "Prompt change regression check between two runs -- explicit use case"
},
{
"id": "should-04",
"prompt": "/compare-trace 019e8f2a-24ae-7880-8901-cbc79aca43ed 019e9319-e88e-7080-bc78-2aff46543849",
"expected": "trigger",
"rationale": "Explicit slash-command invocation -- skill's own command"
},
{
"id": "should-05",
"prompt": "Here are two conversation IDs from the chat agent — show me how the tool sequences differ.",
"expected": "trigger",
"rationale": "Tool-sequence diff between two conversations -- tool_call_diff evaluator territory"
},
{
"id": "should-06",
"prompt": "We swapped the agent model from claude-3-5 to claude-sonnet-4 and re-ran a fixed scenario. Compare the two runs.",
"expected": "trigger",
"rationale": "Model-swap A/B with two runs to compare -- explicit use case"
},
{
"id": "should-07",
"prompt": "Compare the OTel traces from these two agent runs and produce a side-by-side report.",
"expected": "trigger",
"rationale": "Trace-level comparison with HTML-report intent -- matches skill output"
},
{
"id": "should-08",
"prompt": "I have two agent traces I want to look at side-by-side. Conversation IDs are X and Y.",
"expected": "trigger",
"rationale": "Side-by-side trace comparison with both IDs supplied"
},
{
"id": "should-09",
"prompt": "Show me the difference in graph path between baseline and candidate runs of the coverage agent.",
"expected": "trigger",
"rationale": "Graph-path diff between two runs -- graph_path_diff evaluator territory"
},
{
"id": "should-10",
"prompt": "Did removing the get_use_cases tool change how the coverage agent handles 'what's my coverage gap?' Compare a before and after run.",
"expected": "trigger",
"rationale": "Tool-loadout change A/B between two runs -- explicit use case"
},
{
"id": "should-11",
"prompt": "Compare the latency and token usage between these two agent conversations.",
"expected": "trigger",
"rationale": "Latency/token diff between two runs -- latency_diff evaluator territory"
},
{
"id": "should-not-01",
"prompt": "What went wrong with this agent run? Here's the conversation ID.",
"expected": "no-trigger",
"rationale": "Single-trace troubleshooting -- not a comparison; routes to analyze-root-cause / incident-response"
},
{
"id": "should-not-02",
"prompt": "Investigate why this trace failed. The conversation ID is 019e8f2a-24ae-7880-8901-cbc79aca43ed.",
"expected": "no-trigger",
"rationale": "Single-trace failure investigation -- not an A/B comparison"
},
{
"id": "should-not-03",
"prompt": "Compare row counts between our staging and production orders tables.",
"expected": "no-trigger",
"rationale": "Cross-table data comparison -- routes to monitoring-advisor (comparison monitor); not agent A/B"
},
{
"id": "should-not-04",
"prompt": "How does my chat agent perform overall? Show me aggregate metrics.",
"expected": "no-trigger",
"rationale": "Aggregate performance question with no two specific conversation IDs to compare"
},
{
"id": "should-not-05",
"prompt": "Set up an evaluation monitor for my chat agent to track response quality over time.",
"expected": "no-trigger",
"rationale": "Agent eval monitor creation -- routes to monitoring-advisor"
},
{
"id": "should-not-06",
"prompt": "Diff these two SQL queries and tell me which one is more efficient.",
"expected": "no-trigger",
"rationale": "SQL comparison -- wrong domain (not agent traces)"
},
{
"id": "should-not-07",
"prompt": "Show me the trace for conversation 019e8f2a-24ae-7880-8901-cbc79aca43ed.",
"expected": "no-trigger",
"rationale": "Single-trace inspection -- not a comparison"
},
{
"id": "should-not-08",
"prompt": "Help me build a prompt eval framework for my LangGraph agent.",
"expected": "no-trigger",
"rationale": "Generic eval-framework engineering -- not an A/B compare on existing runs"
},
{
"id": "should-not-09",
"prompt": "Compare these two dbt models and tell me which one has more downstream tables.",
"expected": "no-trigger",
"rationale": "dbt model comparison -- wrong domain"
}
]
}
1 change: 1 addition & 0 deletions plugins/claude-code/skills/compare-trace
1 change: 1 addition & 0 deletions plugins/codex/skills/compare-trace
1 change: 1 addition & 0 deletions plugins/copilot/skills/compare-trace
1 change: 1 addition & 0 deletions plugins/cursor/skills/compare-trace
1 change: 1 addition & 0 deletions plugins/opencode/skills/compare-trace
1 change: 1 addition & 0 deletions skills/README.md
@@ -22,6 +22,7 @@ Skills are platform-agnostic instruction sets that tell an AI coding agent what
| **[Tune Monitor](tune-monitor/)** | Analyzes a Monte Carlo metric monitor's alert history and recommends configuration changes to reduce noise — sensitivity, WHERE conditions, segment exclusions, schedule, and aggregation. |
| **[Connection Auth Rules](connection-auth-rules/)** | Build a Connection Auth Rules configuration for a Monte Carlo connection type. Fetches live connector schemas and transform steps from the apollo-agent repo. |
| **[Instrument Agent](instrument-agent/)** | Instruments a Python AI agent for Monte Carlo Agent Observability — detects AI libraries, installs the Monte Carlo OpenTelemetry SDK, sets up tracing, and verifies traces in Monte Carlo. Asks before editing. |
| **[Compare Trace](compare-trace/)** | A/B compares two Monte Carlo agent traces by ID — runs graph-path, latency/token, and tool-call diffs plus LLM-based semantic and entity-overlap evals over the final answers, and opens an HTML report. |

## Standalone Installation

44 changes: 44 additions & 0 deletions skills/compare-trace/README.md
@@ -0,0 +1,44 @@
# compare-trace

A/B compare two Monte Carlo agent conversations by ID and produce an HTML report.

Trace-driven backport of the [Agent A/B Evaluation Framework](https://github.com/monte-carlo-data/ai-agent/pull/1236) (PR #1236 in `ai-agent`). The original ran the agent itself against fixed scenarios; this skill operates on already-captured conversations fetched via the Monte Carlo MCP server.

## Invocation

```
/compare-trace <conv_id_a> <conv_id_b>
```

Optional flags: `--mcon`, `--agent`, `--trace-ids a,b` (force specific OTel trace_ids when a conversation has multiple), `--labels A,B`, `--output path.html`.

## ID model

`conversation_id` is the user-facing identifier (per the OTel GenAI `gen_ai.conversation.id` semantic convention). It's stored as a span attribute, **not** as the OTel `trace_id`. One conversation can contain multiple OTel traces (retries, fan-outs, multi-turn).

The skill resolves `conversation_id → trace_id` via `get_agent_conversation`. By default it picks the trace with the most spans (= the "main" execution); override with `--trace-ids` to compare specific sub-traces.
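
As a rough illustration of the default selection rule, here is a minimal sketch that picks the trace with the most spans. The span dictionaries and the `pick_main_trace` helper are hypothetical and do not reflect the actual response shape of `get_agent_conversation`; the real skill may apply extra filters, such as skipping error traces.

```python
from collections import Counter


def pick_main_trace(spans, override=None):
    """Return the trace_id to compare for one conversation.

    `spans` is an assumed shape: a list of dicts, each with a "trace_id" key.
    `override` stands in for a value passed via --trace-ids.
    """
    if override is not None:
        return override
    counts = Counter(span["trace_id"] for span in spans)
    if not counts:
        raise ValueError("conversation has no spans")
    # Most spans wins; ties break on trace_id so the choice is deterministic.
    return min(counts, key=lambda t: (-counts[t], t))


spans = [
    {"trace_id": "t-retry", "name": "llm.call"},
    {"trace_id": "t-main", "name": "agent.run"},
    {"trace_id": "t-main", "name": "llm.call"},
    {"trace_id": "t-main", "name": "tool.call"},
]
print(pick_main_trace(spans))                      # t-main
print(pick_main_trace(spans, override="t-retry"))  # t-retry
```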

## Signals

| Signal | Type | Notes |
|---|---|---|
| Graph Path | deterministic | Jaccard similarity on node sets, plus ordering similarity (LCS length / max path length) |
| Latency & Tokens | deterministic | Per-metric ratios; flag if candidate > 1.5x baseline |
| Tool Call Sequence + Args | deterministic | Levenshtein on tool-name sequences; matched calls also get a top-level arg-key diff (added / removed / changed) |
| Semantic Diff | LLM (inline) | Claude runs a prompt over both final-completion texts |
| Entity Overlap | LLM (inline) | Extracts 8 entity types, computes per-type Jaccard |

The two LLM signals require non-empty `final_output_text` for both sides (pulled from the last completion span in each conversation). If either side is empty, the report ships with only the three deterministic signals.
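
The deterministic signals in the table can be sketched in a few lines of stdlib Python. This is an illustrative sketch, not the code in `scripts/evaluators/`: the function names, the metric keys, and the reading of "LCS/max" as LCS length divided by the longer path length are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two node collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ordering_similarity(a, b):
    """LCS length divided by the longer sequence length (assumed meaning of LCS/max)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


def levenshtein(a, b):
    """Edit distance between two tool-name sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def flag_regressions(baseline, candidate, threshold=1.5):
    """Return {metric: ratio} for metrics where candidate > threshold x baseline."""
    flagged = {}
    for metric, base in baseline.items():
        cand = candidate.get(metric)
        if cand is None or base <= 0:
            continue  # ratio is undefined without a positive baseline
        ratio = cand / base
        if ratio > threshold:
            flagged[metric] = ratio
    return flagged


path_a = ["plan", "retrieve", "answer"]
path_b = ["plan", "search", "retrieve", "answer"]
print(jaccard(path_a, path_b))              # 0.75
print(ordering_similarity(path_a, path_b))  # 0.75
print(levenshtein(["search", "get_table"], ["search", "get_lineage", "get_table"]))  # 1
print(flag_regressions({"latency_ms": 1000, "total_tokens": 2000},
                       {"latency_ms": 1800, "total_tokens": 2100}))  # {'latency_ms': 1.8}
```

Set similarity and ordering similarity are kept separate on purpose: two runs can visit the same nodes in a different order, and only the ordering score reflects that.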

## Files

- `SKILL.md` — full workflow Claude follows
- `scripts/compare_traces.py` — driver that consumes normalized trace JSON + optional LLM-eval JSON and writes HTML
- `scripts/evaluators/{graph_path_diff,latency_diff,tool_call_diff}.py` — pure-Python evaluators ported from PR #1236
- `references/PR1236_MAPPING.md` — fields-and-signals mapping from PR #1236 to the trace API

## Known limitations (v0.3)

- Picks one trace per conversation (the largest non-error one by edge count). Multi-trace conversations (retries, fan-outs) currently get their other traces dropped — pass `--trace-ids` to override.
- Arg-diff matches calls by name + nearest position (greedy). When a tool's count differs between A and B, the surplus calls go unmatched. v0.4 plan: stable-ID fallback using `tool_use_id` when present.
- No "structured fields" diff (the 6th evaluator in PR #1236) — only meaningful when you control the agent's output schema, which we don't from trace-land.
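
To make the arg-diff limitation concrete, here is a minimal sketch of greedy name-plus-nearest-position matching followed by a top-level arg-key diff. The call shape (`name`, `args`) and both helper names are hypothetical and not taken from `tool_call_diff.py`.

```python
def match_calls(calls_a, calls_b):
    """Greedily pair each call in A with the nearest unused same-name call in B."""
    used_b = set()
    pairs = []
    for i, call in enumerate(calls_a):
        candidates = [
            j for j, other in enumerate(calls_b)
            if j not in used_b and other["name"] == call["name"]
        ]
        if not candidates:
            continue  # surplus call in A stays unmatched
        j = min(candidates, key=lambda k: (abs(k - i), k))
        used_b.add(j)
        pairs.append((i, j))
    matched_a = {i for i, _ in pairs}
    unmatched_a = [i for i in range(len(calls_a)) if i not in matched_a]
    unmatched_b = [j for j in range(len(calls_b)) if j not in used_b]
    return pairs, unmatched_a, unmatched_b


def arg_key_diff(args_a, args_b):
    """Top-level diff of argument keys: added, removed, and changed values."""
    keys_a, keys_b = set(args_a), set(args_b)
    return {
        "added": sorted(keys_b - keys_a),
        "removed": sorted(keys_a - keys_b),
        "changed": sorted(k for k in keys_a & keys_b if args_a[k] != args_b[k]),
    }


calls_a = [
    {"name": "search", "args": {"q": "orders", "limit": 10}},
    {"name": "search", "args": {"q": "users"}},
    {"name": "get_table", "args": {"mcon": "x"}},
]
calls_b = [
    {"name": "search", "args": {"q": "orders", "limit": 20, "schema": "prod"}},
    {"name": "get_table", "args": {"mcon": "x"}},
]

pairs, unmatched_a, unmatched_b = match_calls(calls_a, calls_b)
print(pairs)        # [(0, 0), (2, 1)]
print(unmatched_a)  # [1]
print(arg_key_diff(calls_a[0]["args"], calls_b[0]["args"]))
# {'added': ['schema'], 'removed': [], 'changed': ['limit']}
```

The second `search` call in A has no partner because B contains only one `search`, which is the surplus-call case described above.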