| 1 | +--- |
| 2 | +name: incident-postmortem |
| 3 | +description: 'Use when an outage, production incident, or significant service degradation has occurred and the team needs to write a structured blameless post-mortem. Triggers on phrases like "write a post-mortem", "incident review", "what went wrong", "outage report", "root cause analysis", or "RCA". Covers timeline reconstruction, contributing factor analysis, impact quantification, and action item generation with owners.' |
| 4 | +--- |
| 5 | + |
| 6 | +# Incident Post-Mortem |
| 7 | + |
| 8 | +Guide a team through writing a structured, blameless post-mortem after a production incident. The output is a document that builds shared understanding, identifies root causes without blame, and produces concrete action items to prevent recurrence. |
| 9 | + |
| 10 | +## Blameless Principle |
| 11 | + |
| 12 | +Systems fail, not people. The goal is to understand HOW the incident happened — not WHO caused it. Avoid language like "X forgot to", "Y should have known". Use "the system did not", "the process lacked", "the alert did not fire". |
| 13 | + |
| 14 | +## When to Use |
| 15 | + |
| 16 | +- Production outage or service degradation has been resolved |
| 17 | +- A significant near-miss occurred (would have been an incident if caught later) |
| 18 | +- User-facing errors, data loss, or SLA breach happened |
| 19 | +- Team wants to capture learnings before context fades |
| 20 | + |
| 21 | +**Not for:** Minor bugs caught in staging, planned maintenance windows, or incidents with no learning value. |
| 22 | + |
| 23 | +## Input Requirements |
| 24 | + |
| 25 | +Gather these details before writing the post-mortem. Ask for anything missing: |
| 26 | + |
| 27 | +### Incident Metadata |
| 28 | +- Incident title (short, descriptive) |
| 29 | +- Date and time of detection (with timezone) |
| 30 | +- Date and time of resolution |
| 31 | +- Severity / impact level (P1–P4 or equivalent) |
| 32 | +- Incident commander / on-call owner |
| 33 | + |
| 34 | +### Impact |
| 35 | +- Affected services and systems |
| 36 | +- User-facing impact (errors, slowness, full outage) |
| 37 | +- Estimated number of users affected |
| 38 | +- Data loss or corruption (yes/no, scope) |
| 39 | +- SLA/SLO breach (yes/no, by how much) |
| 40 | + |
| 41 | +### Timeline Events |
| 42 | +Key moments to reconstruct: |
| 43 | +- First symptom occurred |
| 44 | +- Alert fired (or was noticed manually) |
| 45 | +- On-call paged / incident declared |
| 46 | +- Investigation started |
| 47 | +- Root cause identified |
| 48 | +- Mitigation applied |
| 49 | +- Full resolution confirmed |
| 50 | +- Customer communication sent (if any) |
| 51 | + |
| 52 | +### Contributing Factors |
| 53 | +Ask the team: "What made this worse than it needed to be?" — not "who failed". Examples: |
| 54 | +- Alert threshold too high / alert didn't fire |
| 55 | +- Runbook was missing or outdated |
| 56 | +- Deploy lacked a feature flag for rollback |
| 57 | +- Monitoring didn't cover this failure mode |
| 58 | +- On-call handoff missed context |
| 59 | + |
| 60 | +## Process |
| 61 | + |
| 62 | +### Step 1 — Gather Metadata |
| 63 | +If the user has not provided full incident details, ask for them section by section. Don't proceed to writing until you have: title, times, severity, affected services, and at least a rough timeline. |
| 64 | + |
| 65 | +### Step 2 — Reconstruct Timeline |
| 66 | +Work with the user to build a precise chronological timeline. For each event: |
| 67 | +- Exact time (UTC preferred) |
| 68 | +- What happened (system event or human action) |
| 69 | +- Who observed it or took the action |
| 70 | +- Link to log / alert / Slack message if available |
| 71 | + |
| 72 | +Flag gaps: "We don't know what happened between 14:32 and 14:47 — worth checking logs." |
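The gap check above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function name `find_timeline_gaps`, the 10-minute threshold, and the sample events are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def find_timeline_gaps(events, max_gap=timedelta(minutes=10)):
    """Return (start, end) pairs where consecutive events are more than max_gap apart.

    `events` is a list of (timestamp, description) tuples in any order.
    The 10-minute default is an arbitrary illustrative threshold.
    """
    ordered = sorted(events, key=lambda event: event[0])
    gaps = []
    for (prev_time, _), (next_time, _) in zip(ordered, ordered[1:]):
        if next_time - prev_time > max_gap:
            gaps.append((prev_time, next_time))
    return gaps


# Hypothetical timeline, all times UTC.
events = [
    (datetime(2026, 6, 16, 14, 25), "First 500 errors observed"),
    (datetime(2026, 6, 16, 14, 32), "Alert fired"),
    (datetime(2026, 6, 16, 14, 47), "On-call paged"),
    (datetime(2026, 6, 16, 14, 50), "Investigation started"),
]

gaps = find_timeline_gaps(events)
for start, end in gaps:
    print(f"We don't know what happened between {start:%H:%M} and {end:%H:%M}")
```

Each reported gap is a prompt to go back to logs, alerts, or chat history rather than a finding in itself.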
| 73 | + |
| 74 | +### Step 3 — Root Cause Analysis |
| 75 | +Use the **5 Whys** iteratively: |
| 76 | + |
| 77 | +``` |
| 78 | +Why did users see 500 errors? |
| 79 | +→ The API pods were crash-looping. |
| 80 | + |
| 81 | +Why were they crash-looping? |
| 82 | +→ Memory limit was exceeded. |
| 83 | + |
| 84 | +Why was the limit exceeded? |
| 85 | +→ A new query was loading full result sets into memory. |
| 86 | + |
| 87 | +Why wasn't this caught before deploy? |
| 88 | +→ Load tests only covered the p50 case, not high-cardinality accounts. |
| 89 | + |
| 90 | +Why did load tests only cover p50? |
| 91 | +→ We had no test fixtures for large accounts. |
| 92 | +``` |
| 93 | + |
| 94 | +Stop when you reach a system/process gap you can fix. The last "why" should point to an action item. |
| 95 | + |
| 96 | +Distinguish: |
| 97 | +- **Root cause** — the deepest systemic gap (one or two) |
| 98 | +- **Contributing factors** — conditions that made it worse but aren't the root cause |
| 99 | + |
| 100 | +### Step 4 — Impact Quantification |
| 101 | +Help the user be precise: |
| 102 | +- Duration: report detection to resolution, and separately report first symptom to resolution (the difference between the two is the time to detect) |
| 103 | +- Error rate at peak vs. normal baseline |
| 104 | +- Percentage of traffic affected |
| 105 | +- Revenue / business impact if known |
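To keep the durations distinct, the arithmetic can be sketched as below. The function names, the sample timestamps, and the request counts are hypothetical; only the distinction between detection-to-resolution and symptom-to-resolution comes from the list above.

```python
from datetime import datetime, timedelta, timezone


def impact_durations(first_symptom, detected, resolved):
    """Split the incident window into the intervals the post-mortem reports."""
    return {
        "time_to_detect": detected - first_symptom,
        "detection_to_resolution": resolved - detected,
        "symptom_to_resolution": resolved - first_symptom,
    }


def format_duration(delta):
    """Render a timedelta in the 'X hours Y minutes' style used by the template."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


def percent_affected(failed_requests, total_requests):
    """Percentage of traffic affected; returns 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return 100.0 * failed_requests / total_requests


# Hypothetical incident, all times UTC.
durations = impact_durations(
    first_symptom=datetime(2026, 6, 16, 14, 20, tzinfo=timezone.utc),
    detected=datetime(2026, 6, 16, 14, 32, tzinfo=timezone.utc),
    resolved=datetime(2026, 6, 16, 16, 5, tzinfo=timezone.utc),
)
for name, delta in durations.items():
    print(f"{name}: {format_duration(delta)}")
print(f"traffic affected: {percent_affected(1200, 48000):.1f}%")
```

Using timezone-aware timestamps avoids silent errors when sources (alerts, logs, chat) record times in different zones.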
| 106 | + |
| 107 | +### Step 5 — Action Items |
| 108 | +For each root cause and contributing factor, generate at least one action item: |
| 109 | + |
| 110 | +| # | Action | Owner | Due Date | Priority | |
| 111 | +|---|--------|-------|----------|----------| |
| 112 | +| 1 | Add load test fixtures for accounts > 10k records | @alice | 2026-07-01 | High | |
| 113 | +| 2 | Lower memory alert threshold from 90% to 75% | @bob | 2026-06-23 | High | |
| 114 | +| 3 | Add runbook for memory OOM pods | @carol | 2026-06-30 | Medium | |
| 115 | + |
| 116 | +Action items must have an owner (a person, not a team) and a due date. Vague actions like "improve monitoring" are not acceptable — break them into specific deliverables. |
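These rules can be checked mechanically before the document is written. The sketch below is one possible approach under stated assumptions: the dictionary keys, the `TEAM_HANDLES` set, and the `VAGUE_ACTIONS` list are all hypothetical, and a real team would supply its own lists.

```python
from datetime import date

# Assumed examples of handles that name a team rather than a person.
TEAM_HANDLES = {"@eng-team", "@platform", "@on-call-rotation"}
# Assumed examples of actions too vague to be a deliverable.
VAGUE_ACTIONS = {"improve monitoring", "be more careful", "add more tests"}


def validate_action_item(item):
    """Return a list of problems found in one action item (empty if it passes)."""
    problems = []

    owner = item.get("owner", "").strip()
    if not owner:
        problems.append("missing owner")
    elif owner in TEAM_HANDLES:
        problems.append("owner is a team, not a person")

    try:
        date.fromisoformat(item.get("due", "").strip())
    except ValueError:
        problems.append("missing or invalid due date")

    action = item.get("action", "").strip().lower().rstrip(".")
    if not action or action in VAGUE_ACTIONS:
        problems.append("action is missing or too vague")

    return problems


good = {"action": "Lower memory alert threshold from 90% to 75%",
        "owner": "@alice", "due": "2026-06-23"}
bad = {"action": "Improve monitoring", "owner": "@platform", "due": ""}

print(validate_action_item(good))
print(validate_action_item(bad))
```

A check like this only catches the mechanical failures; whether an action is specific enough still needs a human read.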
| 117 | + |
| 118 | +### Step 6 — Write the Document |
| 119 | +Produce the full post-mortem using the template below. Save to `docs/postmortems/YYYY-MM-DD-<slug>.md`. |
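The `<slug>` part of the path is not defined further here. A minimal sketch, assuming the slug is the incident title lowercased with non-alphanumeric runs collapsed to hyphens (the helper name `postmortem_path` and the sample title are hypothetical):

```python
import re
from datetime import date
from pathlib import PurePosixPath


def postmortem_path(incident_date, title):
    """Build docs/postmortems/YYYY-MM-DD-<slug>.md from a date and a title."""
    # Assumed slug rule: lowercase, replace runs of non-alphanumerics with "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    filename = f"{incident_date.isoformat()}-{slug}.md"
    return PurePosixPath("docs/postmortems") / filename


path = postmortem_path(date(2026, 6, 16), "API pods crash-looping (OOM)")
print(path)
```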
| 120 | + |
| 121 | +## Output Template |
| 122 | + |
| 123 | +```markdown |
| 124 | +# Post-Mortem: [Incident Title] |
| 125 | + |
| 126 | +**Date:** YYYY-MM-DD |
| 127 | +**Severity:** P[1-4] |
| 128 | +**Duration:** X hours Y minutes (HH:MM UTC – HH:MM UTC) |
| 129 | +**Incident Commander:** @name |
| 130 | +**Status:** Resolved |
| 131 | + |
| 132 | +--- |
| 133 | + |
| 134 | +## Summary |
| 135 | + |
| 136 | +[2–3 sentences. What happened, what was the user impact, how was it resolved. Written for someone who wasn't involved.] |
| 137 | + |
| 138 | +## Impact |
| 139 | + |
| 140 | +| Dimension | Value | |
| 141 | +|-----------|-------| |
| 142 | +| Affected services | [list] | |
| 143 | +| User-facing impact | [errors / degraded / full outage] | |
| 144 | +| Users affected | [estimated number or %] | |
| 145 | +| Peak error rate | [X% vs Y% baseline] | |
| 146 | +| Data loss | [none / describe scope] | |
| 147 | +| SLA breach | [yes/no — by how much] | |
| 148 | + |
| 149 | +## Timeline |
| 150 | + |
| 151 | +All times UTC. |
| 152 | + |
| 153 | +| Time | Event | |
| 154 | +|------|-------| |
| 155 | +| HH:MM | [First symptom / alert fired] | |
| 156 | +| HH:MM | [On-call paged] | |
| 157 | +| HH:MM | [Incident declared] | |
| 158 | +| HH:MM | [Root cause identified] | |
| 159 | +| HH:MM | [Mitigation applied] | |
| 160 | +| HH:MM | [Full resolution confirmed] | |
| 161 | +| HH:MM | [Customer communication sent] | |
| 162 | + |
| 163 | +## Root Cause |
| 164 | + |
| 165 | +[1–2 paragraphs. The deepest systemic gap that, if fixed, would have prevented the incident. Written in blameless language. Reference the 5 Whys chain if helpful.] |
| 166 | + |
| 167 | +## Contributing Factors |
| 168 | + |
| 169 | +- [Factor 1 — condition that made the incident worse] |
| 170 | +- [Factor 2] |
| 171 | +- [Factor 3] |
| 172 | + |
| 173 | +## What Went Well |
| 174 | + |
| 175 | +- [Thing that worked — good alert, fast response, clear runbook] |
| 176 | +- [Another positive] |
| 177 | + |
| 178 | +## What Could Have Gone Better |
| 179 | + |
| 180 | +- [Gap in process, tooling, or coverage — no blame language] |
| 181 | +- [Another gap] |
| 182 | + |
| 183 | +## Action Items |
| 184 | + |
| 185 | +| # | Action | Owner | Due Date | Priority | |
| 186 | +|---|--------|-------|----------|----------| |
| 187 | +| 1 | [Specific deliverable] | @person | YYYY-MM-DD | High/Medium/Low | |
| 188 | +| 2 | | | | | |
| 189 | + |
| 190 | +## Lessons Learned |
| 191 | + |
| 192 | +[Optional. 2–4 bullet points capturing non-obvious insights worth sharing with the broader team.] |
| 193 | +``` |
| 194 | + |
| 195 | +## Common Mistakes |
| 196 | + |
| 197 | +| Mistake | Fix | |
| 198 | +|---------|-----| |
| 199 | +| "Bob forgot to check the config" | "The deploy checklist did not include config validation" | |
| 200 | +| Root cause is "human error" | Keep asking Why — human error is always a symptom | |
| 201 | +| Action items without owners | Every item needs a named individual, not a team | |
| 202 | +| Timeline reconstructed from memory | Check logs, alerts, Slack, PagerDuty before writing | |
| 203 | +| "Improve monitoring" as an action | Specify: which service, which metric, what threshold, by when | |
| 204 | +| Post-mortem written weeks later | Write within 48–72 hours while context is fresh | |