Commit 37cfa33

Incident postmortem (#2019)
* Base file
* Update documentation
1 parent 7667bfe commit 37cfa33

2 files changed

Lines changed: 205 additions & 0 deletions

File tree

docs/README.skills.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -209,6 +209,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
209209
| [image-manipulation-image-magick](../skills/image-manipulation-image-magick/SKILL.md)<br />`gh skills install github/awesome-copilot image-manipulation-image-magick` | Process and manipulate images using ImageMagick. Supports resizing, format conversion, batch processing, and retrieving image metadata. Use when working with images, creating thumbnails, resizing wallpapers, or performing batch image operations. | None |
210210
| [impediment-prioritization](../skills/impediment-prioritization/SKILL.md)<br />`gh skills install github/awesome-copilot impediment-prioritization` | Ranks any list of impediments and their countermeasures using a value-stream scoring model (ROI, Cost to Implement, Ease of Deployment, Risk Factor) and a fixed prioritization formula. Use when someone asks to prioritize, rank, sequence, or triage impediments, countermeasures, remediation items, risks, findings, gaps, action items, or backlog entries; or mentions value-stream prioritization, A3 / lean countermeasure ranking, ROI vs. effort scoring, or building a remediation / improvement backlog. Works with GHQR findings, audit results, retrospective action items, risk registers, architecture review gaps, or any free-form `{impediment, countermeasure}` list. | `references/scoring-rubric.md` |
211211
| [import-infrastructure-as-code](../skills/import-infrastructure-as-code/SKILL.md)<br />`gh skills install github/awesome-copilot import-infrastructure-as-code` | Import existing Azure resources into Terraform using Azure CLI discovery and Azure Verified Modules (AVM). Use when asked to reverse-engineer live Azure infrastructure, generate Infrastructure as Code from existing subscriptions/resource groups/resource IDs, map dependencies, derive exact import addresses from downloaded module source, prevent configuration drift, and produce AVM-based Terraform files ready for validation and planning across any Azure resource type. | None |
212+
| [incident-postmortem](../skills/incident-postmortem/SKILL.md)<br />`gh skills install github/awesome-copilot incident-postmortem` | Use when an outage, production incident, or significant service degradation has occurred and the team needs to write a structured blameless post-mortem. Triggers on phrases like "write a post-mortem", "incident review", "what went wrong", "outage report", "root cause analysis", or "RCA". Covers timeline reconstruction, contributing factor analysis, impact quantification, and action item generation with owners. | None |
212213
| [integrate-context-matic](../skills/integrate-context-matic/SKILL.md)<br />`gh skills install github/awesome-copilot integrate-context-matic` | Discovers and integrates third-party APIs using the context-matic MCP server. Uses `fetch_api` to find available API SDKs, `ask` for integration guidance, `model_search` and `endpoint_search` for SDK details. Use when the user asks to integrate a third-party API, add an API client, implement features with an external API, or work with any third-party API or SDK. | None |
213214
| [issue-fields-migration](../skills/issue-fields-migration/SKILL.md)<br />`gh skills install github/awesome-copilot issue-fields-migration` | Bulk-migrate metadata to GitHub issue fields from two sources: repo labels (e.g. priority labels to a Priority field) and Project V2 fields. Use when users say "migrate my labels to issue fields", "migrate project fields to issue fields", "convert labels to issue fields", "copy project field values to issue fields", or ask about adopting issue fields. Issue fields are org-level typed metadata (single select, text, number, date) that replace label-based workarounds with structured, searchable, cross-repo fields. | `references/issue-fields-api.md`<br />`references/labels-api.md`<br />`references/projects-api.md` |
214215
| [java-add-graalvm-native-image-support](../skills/java-add-graalvm-native-image-support/SKILL.md)<br />`gh skills install github/awesome-copilot java-add-graalvm-native-image-support` | GraalVM Native Image expert that adds native image support to Java applications, builds the project, analyzes build errors, applies fixes, and iterates until successful compilation using Oracle best practices. | None |
skills/incident-postmortem/SKILL.md

Lines changed: 204 additions & 0 deletions
@@ -0,0 +1,204 @@
1+
---
2+
name: incident-postmortem
3+
description: 'Use when an outage, production incident, or significant service degradation has occurred and the team needs to write a structured blameless post-mortem. Triggers on phrases like "write a post-mortem", "incident review", "what went wrong", "outage report", "root cause analysis", or "RCA". Covers timeline reconstruction, contributing factor analysis, impact quantification, and action item generation with owners.'
4+
---
5+
6+
# Incident Post-Mortem
7+
8+
Guide a team through writing a structured, blameless post-mortem after a production incident. The output is a document that builds shared understanding, identifies root causes without blame, and produces concrete action items to prevent recurrence.
9+
10+
## Blameless Principle
11+
12+
Systems fail, not people. The goal is to understand HOW the incident happened, not WHO caused it. Avoid language like "X forgot to" or "Y should have known". Instead, use "the system did not", "the process lacked", or "the alert did not fire".
13+
14+
## When to Use
15+
16+
- Production outage or service degradation has been resolved
17+
- A significant near-miss occurred (it would have become an incident if caught later)
18+
- User-facing errors, data loss, or SLA breach happened
19+
- Team wants to capture learnings before context fades
20+
21+
**Not for:** Minor bugs caught in staging, planned maintenance windows, or incidents with no learning value.
22+
23+
## Input Requirements
24+
25+
Gather these details before writing the post-mortem. Ask for anything missing:
26+
27+
### Incident Metadata
28+
- Incident title (short, descriptive)
29+
- Date and time of detection (with timezone)
30+
- Date and time of resolution
31+
- Severity / impact level (P1–P4 or equivalent)
32+
- Incident commander / on-call owner
33+
34+
### Impact
35+
- Affected services and systems
36+
- User-facing impact (errors, slowness, full outage)
37+
- Estimated number of users affected
38+
- Data loss or corruption (yes/no, scope)
39+
- SLA/SLO breach (yes/no, by how much)
40+
41+
### Timeline Events
42+
Key moments to reconstruct:
43+
- First symptom occurred
44+
- Alert fired (or the problem was noticed manually)
45+
- On-call paged / incident declared
46+
- Investigation started
47+
- Root cause identified
48+
- Mitigation applied
49+
- Full resolution confirmed
50+
- Customer communication sent (if any)
51+
52+
### Contributing Factors
53+
Ask the team: "What made this worse than it needed to be?" — not "who failed". Examples:
54+
- Alert threshold too high / alert didn't fire
55+
- Runbook was missing or outdated
56+
- Deploy lacked a feature flag for rollback
57+
- Monitoring didn't cover this failure mode
58+
- On-call handoff missed context
59+
60+
## Process
61+
62+
### Step 1 — Gather Metadata
63+
If the user has not provided full incident details, ask for them section by section. Don't proceed to writing until you have: title, times, severity, affected services, and at least a rough timeline.
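A minimal sketch of this gate, assuming the incident details are collected in a plain dictionary. The field names (`detected_at`, `resolved_at`, and so on) are hypothetical and not part of this skill:

```python
# Hypothetical field names for the minimum details listed in Step 1.
REQUIRED_FIELDS = (
    "title",
    "detected_at",
    "resolved_at",
    "severity",
    "affected_services",
    "timeline",
)


def missing_fields(incident: dict) -> list:
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not incident.get(name)]


incident = {
    "title": "API pods crash-looping",
    "severity": "P1",
    "affected_services": ["api"],
    "timeline": [],  # an empty timeline counts as missing
}

gaps = missing_fields(incident)
if gaps:
    print("Ask the user for:", ", ".join(gaps))
```

An empty result means there is enough information to start writing.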
64+
65+
### Step 2 — Reconstruct Timeline
66+
Work with the user to build a precise chronological timeline. For each event:
67+
- Exact time (UTC preferred)
68+
- What happened (system event or human action)
69+
- Who observed it or took the action
70+
- Link to log / alert / Slack message if available
71+
72+
Flag gaps: "We don't know what happened between 14:32 and 14:47 — worth checking logs."
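One way to spot such gaps is to compare consecutive timestamps. The sketch below assumes events are `(datetime, description)` pairs and uses an arbitrary 10-minute threshold; the sample events and the threshold are illustrative only:

```python
from datetime import datetime, timedelta


def find_timeline_gaps(events, max_gap=timedelta(minutes=10)):
    """Return (start, end) pairs where consecutive events are more than max_gap apart."""
    ordered = sorted(events, key=lambda event: event[0])
    gaps = []
    for (earlier, _), (later, _) in zip(ordered, ordered[1:]):
        if later - earlier > max_gap:
            gaps.append((earlier, later))
    return gaps


# Illustrative events, not taken from a real incident.
events = [
    (datetime(2026, 6, 16, 14, 30), "Alert fired"),
    (datetime(2026, 6, 16, 14, 32), "On-call paged"),
    (datetime(2026, 6, 16, 14, 47), "Root cause identified"),
]

for start, end in find_timeline_gaps(events):
    print(f"We don't know what happened between {start:%H:%M} and {end:%H:%M}")
```

The threshold is a judgment call: a short incident may warrant a tighter one than a multi-hour outage.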
73+
74+
### Step 3 — Root Cause Analysis
75+
Use the **5 Whys** iteratively:
76+
77+
```
78+
Why did users see 500 errors?
79+
→ The API pods were crash-looping.
80+
81+
Why were they crash-looping?
82+
→ Memory limit was exceeded.
83+
84+
Why was the limit exceeded?
85+
→ A new query was loading full result sets into memory.
86+
87+
Why wasn't this caught before deploy?
88+
→ Load tests only covered the p50 case, not high-cardinality accounts.
89+
90+
Why did load tests only cover p50?
91+
→ We had no test fixtures for large accounts.
92+
```
93+
94+
Stop when you reach a system/process gap you can fix. The last "why" should point to an action item.
95+
96+
Distinguish:
97+
- **Root cause** — the deepest systemic gap (one or two)
98+
- **Contributing factors** — conditions that made it worse but aren't the root cause
99+
100+
### Step 4 — Impact Quantification
101+
Help the user be precise:
102+
- Duration: detection to resolution. Report the time from first symptom to detection as a separate figure rather than folding both into one number
103+
- Error rate at peak vs. normal baseline
104+
- Percentage of traffic affected
105+
- Revenue / business impact if known
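The duration and traffic figures above can be derived from three timestamps and two request counts. This is a sketch under assumed names (`incident_durations`, `traffic_affected_pct`), with made-up sample values:

```python
from datetime import datetime, timedelta, timezone


def incident_durations(first_symptom, detected, resolved):
    """Split the incident into time-to-detect and time-to-resolve."""
    if not first_symptom <= detected <= resolved:
        raise ValueError("expected first_symptom <= detected <= resolved")
    return {
        "time_to_detect": detected - first_symptom,
        "time_to_resolve": resolved - detected,
        "total_user_impact": resolved - first_symptom,
    }


def traffic_affected_pct(failed_requests, total_requests):
    """Percentage of requests that failed during the incident window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * failed_requests / total_requests


# Made-up sample values for illustration.
utc = timezone.utc
durations = incident_durations(
    first_symptom=datetime(2026, 6, 16, 14, 10, tzinfo=utc),
    detected=datetime(2026, 6, 16, 14, 32, tzinfo=utc),
    resolved=datetime(2026, 6, 16, 15, 47, tzinfo=utc),
)
for name, value in durations.items():
    print(f"{name}: {value}")
print(f"traffic affected: {traffic_affected_pct(1200, 48000):.1f}%")
```

Keeping time-to-detect separate makes it visible when slow detection, rather than slow repair, accounts for most of the user impact.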
106+
107+
### Step 5 — Action Items
108+
For each root cause and contributing factor, generate at least one action item:
109+
110+
| # | Action | Owner | Due Date | Priority |
111+
|---|--------|-------|----------|----------|
112+
| 1 | Add load test fixtures for accounts > 10k records | @jsmith | 2026-07-01 | High |
113+
| 2 | Lower memory alert threshold from 90% to 75% | @mlee | 2026-06-23 | High |
114+
| 3 | Add runbook for memory OOM pods | @rpatel | 2026-06-30 | Medium |
115+
116+
Action items must have an owner (a person, not a team) and a due date. Vague actions like "improve monitoring" are not acceptable — break them into specific deliverables.
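These rules can be checked mechanically before the document is published. The sketch below is illustrative: the `ActionItem` shape, the list of team handles, and the list of vague phrases are all assumptions that a team would replace with its own:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical values; a real team would maintain its own lists.
TEAM_HANDLES = {"@eng-team", "@platform", "@on-call-rotation"}
VAGUE_ACTIONS = {"improve monitoring", "improve alerting", "be more careful"}


@dataclass
class ActionItem:
    action: str
    owner: str
    due: str  # expected as YYYY-MM-DD
    priority: str


def validate(item: ActionItem) -> list:
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    if not item.owner.strip():
        problems.append("missing owner")
    elif item.owner in TEAM_HANDLES:
        problems.append("owner is a team, not a person")
    try:
        date.fromisoformat(item.due)
    except ValueError:
        problems.append("missing or invalid due date")
    if item.action.strip().lower() in VAGUE_ACTIONS:
        problems.append("action is too vague")
    return problems


item = ActionItem("Improve monitoring", "@platform", "", "High")
print(validate(item))
```

A check like this cannot tell whether an action is genuinely specific; it only catches the most common shortcuts.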
117+
118+
### Step 6 — Write the Document
119+
Produce the full post-mortem using the template below. Save to `docs/postmortems/YYYY-MM-DD-<slug>.md`.
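The file name can be derived from the incident date and title. The slug rule used here (lowercase, runs of non-alphanumeric characters collapsed to a single hyphen) is an assumption, since the skill does not define one:

```python
import re
from datetime import date
from pathlib import PurePosixPath


def postmortem_path(incident_date: date, title: str) -> PurePosixPath:
    """Build docs/postmortems/YYYY-MM-DD-<slug>.md from a date and a title."""
    # Assumed slug rule: lowercase, non-alphanumeric runs become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return PurePosixPath("docs/postmortems") / f"{incident_date.isoformat()}-{slug}.md"


print(postmortem_path(date(2026, 6, 16), "API pods crash-looping (OOM)"))
# prints docs/postmortems/2026-06-16-api-pods-crash-looping-oom.md
```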
120+
121+
## Output Template
122+
123+
```markdown
124+
# Post-Mortem: [Incident Title]
125+
126+
**Date:** YYYY-MM-DD
127+
**Severity:** P[1-4]
128+
**Duration:** X hours Y minutes (HH:MM UTC – HH:MM UTC)
129+
**Incident Commander:** @name
130+
**Status:** Resolved
131+
132+
---
133+
134+
## Summary
135+
136+
[2–3 sentences. What happened, what was the user impact, how was it resolved. Written for someone who wasn't involved.]
137+
138+
## Impact
139+
140+
| Dimension | Value |
141+
|-----------|-------|
142+
| Affected services | [list] |
143+
| User-facing impact | [errors / degraded / full outage] |
144+
| Users affected | [estimated number or %] |
145+
| Peak error rate | [X% vs Y% baseline] |
146+
| Data loss | [none / describe scope] |
147+
| SLA breach | [yes/no — by how much] |
148+
149+
## Timeline
150+
151+
All times UTC.
152+
153+
| Time | Event |
154+
|------|-------|
155+
| HH:MM | [First symptom / alert fired] |
156+
| HH:MM | [On-call paged] |
157+
| HH:MM | [Incident declared] |
158+
| HH:MM | [Root cause identified] |
159+
| HH:MM | [Mitigation applied] |
160+
| HH:MM | [Full resolution confirmed] |
161+
| HH:MM | [Customer communication sent] |
162+
163+
## Root Cause
164+
165+
[1–2 paragraphs. The deepest systemic gap that, if fixed, would have prevented the incident. Written in blameless language. Reference the 5 Whys chain if helpful.]
166+
167+
## Contributing Factors
168+
169+
- [Factor 1 — condition that made the incident worse]
170+
- [Factor 2]
171+
- [Factor 3]
172+
173+
## What Went Well
174+
175+
- [Thing that worked — good alert, fast response, clear runbook]
176+
- [Another positive]
177+
178+
## What Could Have Gone Better
179+
180+
- [Gap in process, tooling, or coverage — no blame language]
181+
- [Another gap]
182+
183+
## Action Items
184+
185+
| # | Action | Owner | Due Date | Priority |
186+
|---|--------|-------|----------|----------|
187+
| 1 | [Specific deliverable] | @person | YYYY-MM-DD | High/Medium/Low |
188+
| 2 | | | | |
189+
190+
## Lessons Learned
191+
192+
[Optional. 2–4 bullet points capturing non-obvious insights worth sharing with the broader team.]
193+
```
194+
195+
## Common Mistakes
196+
197+
| Mistake | Fix |
198+
|---------|-----|
199+
| "Bob forgot to check the config" | "The deploy checklist did not include config validation" |
200+
| Root cause is "human error" | Keep asking Why — human error is always a symptom |
201+
| Action items without owners | Every item needs a named individual, not a team |
202+
| Timeline reconstructed from memory | Check logs, alerts, Slack, PagerDuty before writing |
203+
| "Improve monitoring" as an action | Specify: which service, which metric, what threshold, by when |
204+
| Post-mortem written weeks later | Write within 48–72 hours while context is fresh |
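A rough lint for the blame-language rows above could scan a draft for the phrases this skill warns against. The pattern list below contains only those phrases and is a starting point, not a complete detector:

```python
import re

# Phrases named in this skill as blame language; extend as needed.
BLAME_PATTERNS = [
    r"\bforgot to\b",
    r"\bshould have known\b",
    r"\bhuman error\b",
]


def find_blame_language(text: str) -> list:
    """Return (line_number, line) pairs for lines containing a flagged phrase."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(re.search(pattern, line, re.IGNORECASE) for pattern in BLAME_PATTERNS):
            hits.append((number, line.strip()))
    return hits


draft = """Bob forgot to check the config.
The deploy checklist did not include config validation."""

for number, line in find_blame_language(draft):
    print(f"line {number}: consider rephrasing: {line}")
```

A match is a prompt to rephrase in terms of the system or process, not proof that the sentence assigns blame.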
