AI-256: analyze-root-cause runs TSA first when an incident UUID is present (#79)

elor-arieli · claude · web-flow · commit 39c4dd65eb0f · 2026-05-11T18:08:34.000+02:00
## Summary

Updates the `analyze-root-cause` skill so that, when intake yields a Monte Carlo incident UUID, it auto-invokes the Troubleshooting Agent (`run_troubleshooting_agent`) async right after intake and merges its findings into the existing interactive investigation. Today the skill never calls TSA at all — it's purely a manual investigation walker — which means invoking it via MCP does not consume MC credits the same way the UI Troubleshooting Agent does. This PR closes that gap.

Docs-only change to one skill. No code, no new MCP tools — `alert_assessment`, `run_troubleshooting_agent`, and `get_troubleshooting_agent_results` already ship in the default toolset.

## What this PR enables

- **Auto-invoke TSA at a new Step 1.5** — when intake produces an incident UUID and the user hasn't opted out, the skill kicks off `run_troubleshooting_agent(incident_id=..., async_mode=True)` before any manual investigation begins. The tool's built-in idempotency means an existing successful run is reused; `force_rerun=True` is reserved for explicit user request.
- **Three explicit skip conditions** — TSA is intentionally not invoked when (1) intake produced no incident UUID (the no-incident reference path), (2) the user is asking a narrow scoped question like "is X stale right now?", or (3) the user explicitly opts out ("skip TSA", "manual only").
- **Parallel manual + TSA flow** — Step 2 onwards continues the interactive investigation while TSA runs in the background. Two poll points (Step 4 ~30s in, Step 7 ~60–90s after) gather TSA results without blocking. If TSA hasn't returned by Step 7, the skill presents the manual findings and tells the user TSA is still working.
- **Findings-merge guidance in Step 7** — covers four cases: TSA agrees, TSA contradicts, TSA returns low-signal, TSA failed. Each case has a presentation pattern so the agent doesn't have to improvise.
- **Credit-cost note in the MCP Tools table** — calls out that `alert_assessment` and `run_troubleshooting_agent` consume MC credits the same way the UI TSA does. This is the question that originally motivated the ticket (Slack thread linked on AI-256).
- **Two new Important rules** — never invoke TSA without an incident UUID; honor explicit user opt-outs.
- **Reference + README touch-ups** — `intake-no-incident.md` now explicitly notes TSA is skipped on that path (and how to rejoin the main flow if an alert is found). `README.md` flow diagram shows the TSA branch with poll points and the skip conditions.

## Key Decisions

See [AI-256](https://linear.app/montecarlodata/issue/AI-256/update-analyze-root-cause-skill-to-run-ta-first) for the originating ask and [the parent Slack thread](https://montecarloai.slack.com/archives/C0AM84B7F0D/p1778022753347149) for the credit-consumption question that motivated it.

- **Skip `alert_assessment` in the auto-flow.** The `automated-triage` skill uses `alert_assessment` as a cheap gate on every alert before deciding whether to escalate to TSA. `analyze-root-cause` is a different shape — it's invoked on a single incident the user already cares about, so the user-facing latency cost of the extra ~2-min scoring step outweighs the savings of skipping TSA on LOW-confidence alerts. Surfaced `alert_assessment` in the tools table as available, but the auto-flow goes straight to TSA. Revisit if cost/latency feedback says otherwise.
- **Async + parallel over sync.** TSA takes 4–8 minutes. Sync (`async_mode=False`) would block the conversation for that long; "async + wait" would block silently. Async-with-parallel-manual-investigation gives the user findings either way and lets TSA's deeper analysis fold in when it lands.
- **No structured "narrow check" classifier.** The skill relies on agent judgment with example user phrasings ("is X stale right now?", "what's the row count of Y?"). If this proves too fuzzy in practice, follow-up work could tighten it (e.g. require an explicit "investigate" verb to gate TSA), but a classifier is overkill for v1.
- **Idempotency via tool default, not new flag.** `run_troubleshooting_agent` already returns existing results when status is `success` or `running`. The skill leans on that and explicitly instructs the agent not to set `force_rerun=True` unless the user asks — protecting against accidental billable re-runs.
- **No version bump.** Patterned after PR #76 (docs-only changes to a single skill don't bump the plugin version). If you'd prefer a version bump for visibility, easy to add.

## Test plan

- [x] Diff between the marketplace-installed copy (`~/.claude/plugins/marketplaces/mc-marketplace/skills/analyze-root-cause/`) and the updated source shows only intentional additions — no drift, no accidental edits.
- [x] Walked through both the incident-UUID path and the no-incident path mentally — flow is consistent in both directions.
- [ ] Smoke-test the skill in Claude Code against a real Monte Carlo incident UUID — confirm the agent kicks off TSA at Step 1.5 and merges findings in Step 7.
- [ ] Smoke-test the no-incident path — confirm TSA is never invoked.
- [ ] Smoke-test explicit opt-out ("skip TSA, just investigate manually") — confirm TSA is not invoked.

## Checklist

- [x] Docs updated
- [na] Tests added (markdown-only change, no code)
- [na] Version bumped (docs-only single-skill change; matches PR #76 pattern)
- [na] Migration notes
- [x] No secrets or credentials in changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
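
For reviewers, the Step 1.5 gate and the handling of the first `run_troubleshooting_agent` status described above can be sketched in Python. This is an illustration only, not code from this PR (the change is docs-only). `Intake`, `tsa_skip_reason`, and `after_kickoff` are hypothetical names, the fallback branch for an unrecognized status is an assumption, and the status strings are the ones the skill text mentions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Intake:
    """Hypothetical summary of what Step 1 (intake) learned."""
    incident_id: Optional[str]  # Monte Carlo incident UUID, or None on the no-incident path
    narrow_check: bool          # user wants a single fact, e.g. "is X stale right now?"
    opted_out: bool             # user said "skip TSA", "manual only", or similar


def tsa_skip_reason(intake: Intake) -> Optional[str]:
    """Return why TSA is skipped at Step 1.5, or None if it should be started."""
    if not intake.incident_id:
        return "no incident UUID"
    if intake.narrow_check:
        return "narrow scoped check"
    if intake.opted_out:
        return "explicit opt-out"
    return None


def after_kickoff(status: str) -> str:
    """Map the status of the first run_troubleshooting_agent call to the next move."""
    if status == "success":
        return "hold results for Step 7; continue Steps 2-6 to corroborate"
    if status in ("queued", "running"):
        return "continue to Step 2; poll at Step 4 and Step 7"
    if status == "failed":
        return "note the error; manual investigation only, no automatic re-run"
    # Assumption: any status the skill text does not name falls back to manual work.
    return "unrecognized status; continue manually"


if __name__ == "__main__":
    intake = Intake(incident_id="<uuid>", narrow_check=False, opted_out=False)
    print(tsa_skip_reason(intake))   # None means TSA would be started
    print(after_kickoff("running"))
```

The checks follow the same order as the three skip conditions, and none of them calls `run_troubleshooting_agent`, so a skipped path never starts a billable run.
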
diff --git a/skills/analyze-root-cause/README.md b/skills/analyze-root-cause/README.md
@@ -32,6 +32,11 @@ Connect to Monte Carlo's MCP server (`integrations.getmontecarlo.com/mcp`). The
 | `get_etl_jobs` | Find ETL jobs writing to tables (Airflow, dbt, Databricks) — pass `platform` param |
 | `get_github_prs` | Recent GitHub PRs (via MC's GitHub integration) |
 | `get_jobs_performance` | Job runtime stats, failure rates, trends |
+| `alert_assessment` | Optional ~2-min triage of an incident (HIGH/MEDIUM/LOW confidence + impact) |
+| `run_troubleshooting_agent` | Starts the Troubleshooting Agent (TSA) on an incident; auto-invoked when an incident UUID is present |
+| `get_troubleshooting_agent_results` | Polls TSA results for an incident |
+
+> **Credits:** `alert_assessment` and `run_troubleshooting_agent` consume Monte Carlo credits the same way the Troubleshooting Agent does when launched from the Monte Carlo UI.
 
 **Optional:** A database MCP server (Snowflake, BigQuery, Redshift) for direct SQL queries.
 
@@ -48,19 +53,24 @@ Connect to Monte Carlo's MCP server (`integrations.getmontecarlo.com/mcp`). The
 ```
 Intake (alert ID or user description)
     ↓
-Map blast radius (upstream + downstream lineage)
-    ↓
-Investigate by issue type (freshness / volume / schema / ETL / query / field)
-    ↓
-Check upstream causes (walk lineage chain)
-    ↓
-Profile data (if DB connector available)
-    ↓
-Check code changes (GitHub MCP or MC query changes)
-    ↓
-Synthesize: root cause + evidence + impact + fix
+Auto-invoke TSA (if incident UUID + not opt-out + not narrow check)  ─┐
+    ↓                                                                  │
+Map blast radius (upstream + downstream lineage)                       │ TSA runs
+    ↓                                                                  │ async in
+Investigate by issue type (freshness / volume / schema / ETL / query)  │ parallel
+    ↓                                                                  │
+Check upstream causes (walk lineage chain)  ── poll TSA #1 ────────────┤
+    ↓                                                                  │
+Profile data (if DB connector available)                               │
+    ↓                                                                  │
+Check code changes (GitHub MCP or MC query changes)                    │
+    ↓                                                                  │
+Synthesize: root cause + evidence + impact + fix  ── poll TSA #2 ─────┘
+                                                    + merge findings
 ```
 
+When intake has no incident UUID, when the user explicitly opts out, or when the request is a narrow scoped check (e.g. "is X stale right now?"), TSA is skipped and the manual flow runs alone.
+
 ## Reference files
 
 | File | Description |
diff --git a/skills/analyze-root-cause/SKILL.md b/skills/analyze-root-cause/SKILL.md
@@ -71,6 +71,11 @@ Do not activate when the user is:
 | `get_jobs_performance` | Job runtime stats, failure rates, 7-day trends |
 | `get_change_timeline` | Unified timeline: query changes + volume + ETL failures |
 | `get_current_time` | Current timestamp for relative time ranges |
+| `alert_assessment` | Optional ~2-min triage of an incident — returns HIGH/MEDIUM/LOW confidence and impact. Useful when you want a quick read before deciding to escalate to TSA. |
+| `run_troubleshooting_agent` | Starts the Troubleshooting Agent (TSA) on an incident. Async by default; idempotent (returns existing results unless `force_rerun=True`). Auto-invoked at Step 1.5 when an incident UUID is present. |
+| `get_troubleshooting_agent_results` | Polls TSA results for an incident (`status` is `not_found` / `running` / `success` / `failed`). Use to check on the async run started at Step 1.5. |
+
+> **Credits:** `alert_assessment` and `run_troubleshooting_agent` consume Monte Carlo credits the same way the Troubleshooting Agent does when launched from the Monte Carlo UI. Each fresh `run_troubleshooting_agent` call is a billable run; reuse via the built-in idempotency (don't pass `force_rerun=True` unless the user explicitly asks for a fresh analysis).
 
 ### Optional external MCP tools
 
@@ -98,8 +103,33 @@ Read `references/intake-no-incident.md` for the full intake flow. In short:
 4. Check table health: `get_table_freshness`, `get_table_size_history`
 5. Narrow down the issue type and proceed to Step 2.
 
+### Step 1.5: Auto-invoke TSA (when applicable)
+
+When intake produces a Monte Carlo **incident UUID**, kick off the Troubleshooting Agent (TSA) **before** continuing to Step 2. TSA runs the same root-cause analysis the Monte Carlo UI uses; running it here in parallel with the manual investigation usually beats running either path alone.
+
+**Skip TSA when any of these is true:**
+
+1. **No incident UUID.** `run_troubleshooting_agent` requires a UUID. The no-incident intake path (`references/intake-no-incident.md`) does not feed TSA. If that path later identifies a matching alert, return to Step 1 with the alert's incident UUID — Step 1.5 then applies normally.
+2. **Narrow scoped check.** The user wants a single fact, not an investigation. Examples: "is `analytics.orders` stale right now?", "what's the row count of X?", "show me the schema of Y", "did this query run today?". Answer the question with the relevant tool and stop. TSA is overkill for these.
+3. **Explicit user opt-out.** The user says "skip TSA", "don't run TSA", "manual only", "just do it yourself", or similar. Honor the opt-out and proceed to Step 2 without invoking TSA.
+
+**Default invocation (async, parallel):**
+
+```
+run_troubleshooting_agent(incident_id="<uuid>", async_mode=True)
+```
+
+- The tool is **idempotent** by default: if a previous successful TSA run exists for this incident, it returns those results immediately. Do **not** pass `force_rerun=True` unless the user explicitly asks for a fresh analysis (each fresh run is a billable Monte Carlo credit consumption).
+- If status is `success` on the first call, you have results — fold them straight into Step 7's synthesis and continue Steps 2–6 to corroborate.
+- If status is `queued` or `running`, continue to Step 2 immediately. TSA typically completes in 4–8 minutes; you'll poll for results via `get_troubleshooting_agent_results` later in the flow (see Step 4 and Step 7).
+- If status is `failed`, note the error and continue with the manual investigation only — do not re-run automatically.
+
+Tell the user what you started: "I've kicked off the Troubleshooting Agent on this incident — it usually finishes in 4–8 minutes. While it runs, I'll continue investigating manually so we have findings either way."
+
 ### Step 2: Map the blast radius
 
+> **TSA in parallel:** if you started TSA at Step 1.5, it is running in the background while you do this step. Do not block on it.
+
 1. Call `get_asset_lineage(mcons=[table_mcon], direction="UPSTREAM")` — what feeds this table?
 2. Call `get_asset_lineage(mcons=[table_mcon], direction="DOWNSTREAM")` — what does this table feed?
 3. If the issue involves specific fields, call `get_field_lineage` to trace which upstream fields feed the affected columns.
@@ -132,6 +162,8 @@ Data issues often originate upstream. Walk the lineage chain:
 2. Use `get_field_lineage` to trace the specific field that has bad data back to its source.
 3. Check what upstream field values correlate with the anomaly (if DB connector is available — see Step 5).
 
+**TSA poll #1.** If you started TSA at Step 1.5 and it has not yet returned `success`, call `get_troubleshooting_agent_results(incident_id=...)` once here (~30s after Step 1.5). If status is `success`, hold the result for Step 7. If still `running`, keep going — you'll poll again before Step 7. Don't block on it.
+
 ### Step 5: Profile data (if database MCP is available)
 
 If the user has a database MCP server connected (Snowflake, BigQuery, Redshift, Databricks, etc.), read `references/data-exploration.md` for SQL investigation patterns including:
@@ -152,6 +184,8 @@ Also call `get_query_changes` with the affected table MCONs to detect SQL text m
 
 ### Step 7: Synthesize and present
 
+**TSA poll #2.** If you started TSA at Step 1.5 and don't yet have results, call `get_troubleshooting_agent_results(incident_id=...)` one more time (~60–90s after poll #1). Stop on `success` or `failed`; if still `running` after this poll, present the manual findings now and tell the user TSA is still working ("TSA is still running on this incident — I'll fold its findings in once it completes if you'd like, or you can ask me to check back in a minute").
+
 Read `references/common-root-causes.md` to match findings against known patterns. Present:
 
 1. **Root cause** — what happened and when, with evidence from tools
@@ -160,6 +194,13 @@ Read `references/common-root-causes.md` to match findings against known patterns
 4. **Recommended fix** — specific action to resolve the issue
 5. **Prevention** — suggest monitoring to catch this earlier next time
 
+**Merging TSA findings:**
+
+- **TSA succeeded and agrees with the manual investigation** — lead with the unified root cause; cite both TSA's evidence chain and the corroborating manual findings.
+- **TSA succeeded and contradicts the manual investigation** — surface both. Show TSA's verdict, show what the manual investigation found, and explain the disagreement (e.g. "TSA blames the upstream Airflow job, but `get_table_freshness` on that table is healthy"). Ask the user which thread they want to pull on.
+- **TSA succeeded with low-signal output** (e.g. "no clear root cause") — present the manual findings as primary; cite TSA as a corroborating null result.
+- **TSA failed or timed out** — present the manual findings only; mention TSA's failure briefly so the user knows it was tried.
+
 ---
 
 ## Important rules
@@ -170,3 +211,5 @@ Read `references/common-root-causes.md` to match findings against known patterns
 - **Be specific about what you can't check.** If no DB connector is available, explain what additional investigation would be possible with one.
 - **Never expose MCONs, UUIDs, or internal identifiers** to the user. Use human-readable table names.
 - **Cross-platform awareness.** ETL issues can come from Airflow, dbt, or Databricks. Check all platforms that are relevant.
+- **Do not invoke TSA without an incident UUID.** `run_troubleshooting_agent` requires one. If intake is on the no-incident path, skip TSA entirely until/unless an alert is identified.
+- **Honor explicit user opt-outs.** If the user says "skip TSA", "manual only", or similar, do not call `run_troubleshooting_agent` or `alert_assessment` — proceed with the manual investigation only.
diff --git a/skills/analyze-root-cause/references/intake-no-incident.md b/skills/analyze-root-cause/references/intake-no-incident.md
@@ -57,6 +57,8 @@ Based on the evidence gathered, determine the issue type:
 
 Once you've identified the table, issue type, and approximate timeline, continue with Step 2 (Map the blast radius) from the main SKILL.md workflow.
 
+> **TSA note.** This intake path intentionally does **not** invoke the Troubleshooting Agent (TSA), because `run_troubleshooting_agent` requires a Monte Carlo incident UUID and this path starts without one. If Step 3 above identifies a matching alert, treat the user as having provided that alert's incident ID and re-enter the main `SKILL.md` flow at Step 1 — Step 1.5 there will auto-invoke TSA. If no matching alert is found, run the manual investigation only.
+
 ## Tips
 
 - **Users often know the symptom but not the cause.** "The dashboard shows yesterday's numbers" = freshness issue. "Revenue is way too high" = volume or field anomaly.