Merged
6 changes: 3 additions & 3 deletions skills/automated-triage/SKILL.md
@@ -47,7 +47,7 @@ All tools are available via the `monte-carlo` MCP server.
| Tool | Toolset | Purpose |
| -------------------------------- | -------- | --------------------------------------------------------------- |
| `get_alerts` | default | Fetch recent alerts for a time window |
-| `alert_assessment` | default | Score an alert by confidence and impact (HIGH/MEDIUM/LOW each) |
+| `alert_assessment` | default | Score an alert by incident likelihood and potential impact (HIGH/MEDIUM/LOW each) |
| `run_troubleshooting_agent` | default | Run the Monte Carlo Troubleshooting Agent on a single alert; async by default — returns immediately, reuses existing results when available |
| `get_troubleshooting_agent_results` | default | Poll an async troubleshooting run by `incident_id`; returns status (`not_found`/`running`/`success`/`failed`) and results when complete |
| `update_alert` | default | Update an alert's status and/or declare an incident by setting severity |
@@ -62,7 +62,7 @@ All tools are available via the `monte-carlo` MCP server.
Read `references/triage-stages.md` for a full description of each stage and how to customise it. The high-level flow is:

1. **Fetch alerts** — decide which alerts to triage and over what time window
-2. **Initial investigation** — score every alert by confidence and impact using `alert_assessment`
+2. **Initial investigation** — score every alert by incident likelihood and potential impact using `alert_assessment`
3. **Deep troubleshooting** — run `run_troubleshooting_agent` on high-signal alerts to get root cause analysis
4. **Classify** — use the troubleshooting output to classify each alert
5. **Take actions** — post comments, update statuses, message Slack, create tickets
@@ -105,7 +105,7 @@ The user wants to look at specific alerts now. Use the triage tools directly to

1. Clarify the scope (Ask about the time window and whether the user is interested in a specific domain, audience or alert type).
2. Fetch alerts with `get_alerts` (applying any domain or audience filter from step 1), run `alert_assessment` in parallel on all of them, and report the results clearly.
-3. For any alert where both confidence and impact are MEDIUM or higher, offer to run `run_troubleshooting_agent` for a deeper root cause analysis. Wait for confirmation before running it.
+3. For any alert where both incident likelihood and potential impact are MEDIUM or higher, offer to run `run_troubleshooting_agent` for a deeper root cause analysis. Wait for confirmation before running it.
4. Summarise findings. Do not prompt to save a workflow file or set up automation unless the user brings it up.

**Write tools in interactive triage:** After findings are clear, proactively offer relevant actions — updating status, declaring a severity, assigning an owner, posting a comment, or marking events as normal (for alerts that are natural data variation). Ask before executing.
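
Reviewer note: the "MEDIUM or higher" check in step 3 above could be sketched roughly as below. This is only an illustration under assumptions: the `LEVELS` ordering and the function name `should_offer_troubleshooting` are hypothetical and do not appear in the skill; only the HIGH/MEDIUM/LOW values come from the text.

```python
# Hypothetical helper, not part of the skill. Only the HIGH/MEDIUM/LOW
# values are taken from the documentation; the numeric ordering is assumed.
LEVELS = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


def should_offer_troubleshooting(incident_likelihood: str, potential_impact: str) -> bool:
    """Return True when both scores are MEDIUM or higher."""
    threshold = LEVELS["MEDIUM"]
    return (
        LEVELS[incident_likelihood] >= threshold
        and LEVELS[potential_impact] >= threshold
    )
```

A `True` result only means the deeper analysis should be offered; per step 3, the run still waits for user confirmation.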
12 changes: 6 additions & 6 deletions skills/automated-triage/references/triage-example.md
@@ -13,7 +13,7 @@ Run in recommendation mode first. Once the classifications and recommendations m

1. Asks whether to run in recommendation or action mode
2. Fetches all alerts from the last 3 hours
-3. Scores every alert by confidence and impact (in parallel)
+3. Scores every alert by incident likelihood and potential impact (in parallel)
4. Fires deep troubleshooting on all high-signal alerts simultaneously, classifying each as results arrive
5. In **action mode**: posts a triage comment on every alert and updates statuses
In **recommendation mode**: outputs what it would comment and what status it would set — no writes
@@ -38,11 +38,11 @@ If no alerts are returned, report: "No alerts in the last 3 hours." and stop.

### Step 3: Score each alert

-Call `alert_assessment` in parallel for every alert from step 2, in batches of up to 10 at a time. Each result includes `alert_confidence`, `alert_impact` (HIGH/MEDIUM/LOW each), `alert_description` (plain-language description of what happened in the incident), and `triage_summary` (the key reasoning behind the confidence and impact scores). Use `alert_description` and `triage_summary` to inform the triage comment in step 4 for alerts that don't go through troubleshooting.
+Call `alert_assessment` in parallel for every alert from step 2, in batches of up to 10 at a time. Each result includes `incident_likelihood`, `alert_impact` (HIGH/MEDIUM/LOW each), `alert_description` (plain-language description of what happened in the incident), and `triage_summary` (the key reasoning behind the incident likelihood and potential impact scores). Use `alert_description` and `triage_summary` to inform the triage comment in step 4 for alerts that don't go through troubleshooting.

### Step 4: Troubleshoot and classify high-signal alerts

-For each alert where BOTH `alert_confidence` AND `alert_impact` are MEDIUM or HIGH, call `run_troubleshooting_agent` (default `async_mode=True`). Fire all eligible alerts simultaneously — each call returns immediately with one of: `success` (previous results available immediately), `queued` (accepted, not started yet), or `running` (in progress).
+For each alert where BOTH `incident_likelihood` AND `alert_impact` are MEDIUM or HIGH, call `run_troubleshooting_agent` (default `async_mode=True`). Fire all eligible alerts simultaneously — each call returns immediately with one of: `success` (previous results available immediately), `queued` (accepted, not started yet), or `running` (in progress).

Skip any alert where either value is LOW — troubleshooting is expensive and not warranted for low-signal alerts.

@@ -66,7 +66,7 @@ Alerts that did not go through troubleshooting are left unclassified.
**Action mode:**

Call `create_or_update_alert_comment` for each alert:
-- **Untroubleshot alerts**: one sentence describing the anomaly and the confidence/impact scores. Do not explain why it wasn't troubleshot. No recommendations.
+- **Untroubleshot alerts**: one sentence describing the anomaly and the incident likelihood/potential impact scores. Do not explain why it wasn't troubleshot. No recommendations.
- **Troubleshot alerts**: 2–4 sentences covering classification, reasoning from the troubleshooting output, any action taken, and a recommendation.

Then call `update_alert` for each classified alert:
@@ -97,8 +97,8 @@ Do not call any write tools. Instead, for each alert output:

After completing all steps, produce a summary table:

-| Alert ID | Type | Confidence | Impact | Classification | Action Taken |
-|----------|------|------------|--------|----------------|--------------|
+| Alert ID | Type | Incident Likelihood | Potential Impact | Classification | Action Taken |
+|----------|------|---------------------|------------------|----------------|--------------|

Include every alert from step 1. For untroubleshot alerts, leave Classification blank and set Action Taken to "Comment only".
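
Reviewer note: one way to produce rows for the renamed summary table is sketched below. The dict keys `id`, `type`, `classification`, and `action_taken` are hypothetical names chosen for illustration; only `incident_likelihood`, `alert_impact`, and the "Comment only" rule for untroubleshot alerts come from the document.

```python
def summary_row(alert: dict) -> str:
    """Format one alert as a Markdown row for the summary table.

    Assumes (hypothetically) that untroubleshot alerts have no
    `classification` key, or have it set to None.
    """
    troubleshot = alert.get("classification") is not None
    cells = [
        alert["id"],
        alert["type"],
        alert["incident_likelihood"],
        alert["alert_impact"],
        # Untroubleshot alerts leave Classification blank.
        alert["classification"] if troubleshot else "",
        # Untroubleshot alerts always report "Comment only".
        alert.get("action_taken", "") if troubleshot else "Comment only",
    ]
    return "| " + " | ".join(cells) + " |"
```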

8 changes: 4 additions & 4 deletions skills/automated-triage/references/triage-stages.md
@@ -39,10 +39,10 @@ Take care to avoid triaging too many alerts in one batch — where required, spl
This stage replicates what a knowledgeable engineer does when scanning the alert feed — quickly assessing what's fired and how serious it looks. `alert_assessment` is lightweight enough to run on every alert.

It returns:
-- **`alert_confidence`** (HIGH/MEDIUM/LOW) — how likely the alert represents a real issue. Affected by: number of events, presence of concerning root causes (query changes, failures), how much thresholds were exceeded, and how noisy the monitor typically is.
+- **`incident_likelihood`** (HIGH/MEDIUM/LOW) — how likely the alert represents a real issue. Affected by: number of events, presence of concerning root causes (query changes, failures), how much thresholds were exceeded, and how noisy the monitor typically is.
- **`alert_impact`** (HIGH/MEDIUM/LOW) — how significant the potential downstream impact is. Use cases impacted. Dashboards affected etc.
- **`alert_description`** — plain-language description of what happened in the incident.
-- **`triage_summary`** — the key reasoning behind the confidence and impact scores.
+- **`triage_summary`** — the key reasoning behind the incident likelihood and potential impact scores.

**Run `alert_assessment` in parallel**, in batches of up to 10 at a time.

@@ -64,7 +64,7 @@ Start with the defaults and tune `user_instructions` once you've seen real outpu

`run_troubleshooting_agent` runs the Monte Carlo Troubleshooting Agent on a single alert. This is substantially more expensive than `alert_assessment` — it tracks the issue upstream through lineage, analyses all queries involved, examines relevant PRs, and samples affected tables to identify root cause.

-**Only run `run_troubleshooting_agent` on alerts that warrant it.** A common filter: run troubleshooting only when BOTH `alert_confidence` AND `alert_impact` are MEDIUM or HIGH. Skip any alert where either is LOW.
+**Only run `run_troubleshooting_agent` on alerts that warrant it.** A common filter: run troubleshooting only when BOTH `incident_likelihood` AND `alert_impact` are MEDIUM or HIGH. Skip any alert where either is LOW.

You can adjust this threshold based on your environment — for example, also running troubleshooting when either score is HIGH (even if the other is LOW), while still requiring MEDIUM/MEDIUM as the baseline.
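
Reviewer note: the baseline filter and the relaxed variant described above might be expressed as in this sketch. The function name and the `allow_single_high` flag are hypothetical; the field names and the two rules are taken from the text.

```python
def warrants_troubleshooting(
    incident_likelihood: str,
    alert_impact: str,
    allow_single_high: bool = False,  # hypothetical switch for the relaxed rule
) -> bool:
    """Decide whether an alert should go to `run_troubleshooting_agent`."""
    scores = (incident_likelihood, alert_impact)
    # Baseline: both scores must be MEDIUM or HIGH.
    baseline = all(score in ("MEDIUM", "HIGH") for score in scores)
    # Relaxed: a single HIGH is enough, even if the other score is LOW.
    relaxed = allow_single_high and "HIGH" in scores
    return baseline or relaxed
```

With the flag off, HIGH/LOW is skipped; with it on, HIGH/LOW is troubleshot while MEDIUM/LOW is still skipped.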

@@ -103,7 +103,7 @@ What you do after triage depends on your integrations, your team's workflow, and
`create_or_update_alert_comment` — always a good starting point. Comments provide a record of what the agent found and recommended, without taking any irreversible action. Useful at every stage, regardless of whether you automate anything else.

Suggested comment content:
-- **Scored but not troubleshot**: one sentence describing the anomaly and the confidence/impact scores. Do not explain why it wasn't troubleshot. No recommendations.
+- **Scored but not troubleshot**: one sentence describing the anomaly and the incident likelihood/potential impact scores. Do not explain why it wasn't troubleshot. No recommendations.
- **Troubleshot alerts**: 2–4 sentences — classification, reasoning, action taken or recommended

### Updating alert status
6 changes: 3 additions & 3 deletions skills/remediation/SKILL.md
@@ -93,10 +93,10 @@ alert_assessment(
)
```

-This returns `triage_confidence` (HIGH/MEDIUM/LOW), `alert_impact` (HIGH/MEDIUM/LOW), and a summary. Use this to decide urgency:
+This returns `incident_likelihood` (HIGH/MEDIUM/LOW), `alert_impact` (HIGH/MEDIUM/LOW), and a summary. Use this to decide urgency:

-- **HIGH impact + HIGH confidence** → proceed immediately to Troubleshooting Agent (TSA) analysis
-- **LOW impact or LOW confidence** → still run TSA, but note to the user that this may not warrant immediate remediation
+- **HIGH impact + HIGH incident likelihood** → proceed immediately to Troubleshooting Agent (TSA) analysis
+- **LOW impact or LOW incident likelihood** → still run TSA, but note to the user that this may not warrant immediate remediation
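
Reviewer note: the two urgency rules above could be sketched as below. The function name and the returned labels are hypothetical. The text does not say what to do for the remaining combinations (for example MEDIUM/MEDIUM), so the sketch reports them as unspecified instead of guessing.

```python
def remediation_urgency(incident_likelihood: str, alert_impact: str) -> str:
    """Map the two assessment scores to an urgency label (labels are illustrative)."""
    if incident_likelihood == "HIGH" and alert_impact == "HIGH":
        # Proceed immediately to Troubleshooting Agent (TSA) analysis.
        return "immediate"
    if "LOW" in (incident_likelihood, alert_impact):
        # Still run TSA, but tell the user remediation may not be urgent.
        return "run_tsa_with_caveat"
    # Combinations the skill text does not cover.
    return "unspecified"
```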

#### Step 3: Root cause analysis (TSA)
