| 1 | +# Incident Rehearsal Cadence |
| 2 | + |
| 3 | +Last Updated: 2026-03-29 |
| 4 | +Issue: `#150` OPS-19 incident rehearsal and recovery evidence program |
| 5 | + |
| 6 | +## Purpose |
| 7 | + |
| 8 | +Rehearsals validate that the team can diagnose and recover from production-realistic failures using real tooling and documented procedures. They also surface gaps in observability, runbooks, and recovery automation before real incidents expose them. |
| 9 | + |
| 10 | +## Monthly Lightweight Rehearsal |
| 11 | + |
| 12 | +| Field | Detail | |
| 13 | +| --- | --- | |
| 14 | +| Cadence | First working Thursday of each month | |
| 15 | +| Duration | ~30 minutes | |
| 16 | +| Scope | Single scenario from `docs/ops/rehearsal-scenarios/` | |
| 17 | +| Lead | Rotating (see assignment model below) | |
| 18 | +| Participants | Rehearsal lead + one observer minimum | |
| 19 | +| Artifacts | Evidence package filed in `docs/ops/rehearsals/` | |
| 20 | + |
| 21 | +Steps: |
| 22 | +1. Lead selects a scenario from the scenario library, preferring scenarios that have not yet been exercised or that failed recently. |
| 23 | +2. Announce the rehearsal in the team channel at least 24 hours in advance. |
| 24 | +3. Execute the scenario using the template's injection method and diagnosis path. |
| 25 | +4. Record an evidence package using `docs/ops/EVIDENCE_TEMPLATE.md`. |
| 26 | +5. File any discovered issues per `docs/ops/REHEARSAL_BACKOFF_RULES.md`. |
| 27 | + |
| 28 | +## Quarterly Deep Drill |
| 29 | + |
| 30 | +| Field | Detail | |
| 31 | +| --- | --- | |
| 32 | +| Cadence | Second week of the first month of each quarter (January, April, July, October) | |
| 33 | +| Duration | ~2 hours | |
| 34 | +| Scope | Combined or cascading scenario (e.g., degraded health + deployment failure) | |
| 35 | +| Lead | Rotating (same rotation, offset from monthly) | |
| 36 | +| Participants | All active contributors | |
| 37 | +| Artifacts | Evidence package + retrospective summary | |
| 38 | + |
| 39 | +Steps: |
| 40 | +1. Lead designs a combined scenario at least one week before the drill date. |
| 41 | +2. Distribute the scenario brief (pre-conditions, scope, goals) to all participants 48 hours in advance. |
| 42 | +3. Execute the drill with explicit role assignments: incident commander, investigator, communicator. |
| 43 | +4. Record the evidence package and a retrospective summary covering what went well, what was slow, and what tooling or documentation was missing. |
| 44 | +5. File findings and retrospective actions per `docs/ops/REHEARSAL_BACKOFF_RULES.md`. |
| 45 | + |
| 46 | +## Rotation and Assignment Model |
| 47 | + |
| 48 | +Rehearsal lead rotates alphabetically by GitHub username among active contributors. |
| 49 | + |
| 50 | +| Month | Lead selection | |
| 51 | +| --- | --- | |
| 52 | +| Month N | First contributor alphabetically who has not led in the current quarter | |
| 53 | +| Fallback | If the assigned lead is unavailable, the next person in rotation picks up | |
| 54 | + |
| 55 | +The rotation resets each quarter. Deep drills use the same rotation but are offset (the deep-drill lead should not be the same person who led the preceding monthly rehearsal). |
| 56 | + |
| 57 | +To check the current rotation state, see the most recent evidence file in `docs/ops/rehearsals/` -- the lead is recorded in the metadata section. |
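
The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not tooling that exists in the repository: the function name `next_lead` and its parameters are hypothetical, usernames are compared case-insensitively as an assumption, and returning `None` when everyone has already led is a placeholder for a case the rotation rules do not cover.

```python
from typing import Iterable, Optional


def next_lead(
    contributors: Iterable[str],
    led_this_quarter: Iterable[str],
    unavailable: Iterable[str] = (),
) -> Optional[str]:
    """Return the next rehearsal lead, or None if nobody is eligible.

    Hypothetical helper: picks the first contributor, alphabetically by
    GitHub username, who has not led in the current quarter and is not
    marked unavailable (the fallback case).
    """
    already_led = {name.lower() for name in led_this_quarter}
    skipped = {name.lower() for name in unavailable}
    # Assumption: alphabetical order ignores letter case.
    for name in sorted(contributors, key=str.lower):
        if name.lower() in already_led or name.lower() in skipped:
            continue
        return name
    return None
```

For a deep drill, the same function could be called with the preceding monthly lead added to `unavailable`, which keeps that person from leading both sessions back to back.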
| 58 | + |
| 59 | +## Calendar Integration |
| 60 | + |
| 61 | +Add rehearsal dates to the team calendar: |
| 62 | + |
| 63 | +- **Monthly**: recurring event on the first Thursday of each month (moved to the next Thursday if that day is not a working day), 30 minutes, titled `[Taskdeck] Monthly Incident Rehearsal` |
| 64 | +- **Quarterly**: recurring event in the second week of Jan/Apr/Jul/Oct, 2 hours, titled `[Taskdeck] Quarterly Deep Drill` |
| 65 | + |
| 66 | +Include the following in the calendar event description: |
| 67 | + |
| 68 | +``` |
| 69 | +Scenario library: docs/ops/rehearsal-scenarios/ |
| 70 | +Evidence template: docs/ops/EVIDENCE_TEMPLATE.md |
| 71 | +Backlog rules: docs/ops/REHEARSAL_BACKOFF_RULES.md |
| 72 | +``` |
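
When a calendar tool cannot express "first working Thursday" directly, the monthly date can be computed instead. The sketch below uses only the Python standard library; the function names and the `holidays` parameter are hypothetical, and the holiday set must be supplied by the caller because this document does not define one.

```python
import calendar
from datetime import date, timedelta
from typing import AbstractSet


def first_thursday(year: int, month: int) -> date:
    """Return the first Thursday of the given month."""
    first = date(year, month, 1)
    # Days to add to the 1st to reach a Thursday (0 if the 1st is one).
    offset = (calendar.THURSDAY - first.weekday()) % 7
    return first + timedelta(days=offset)


def first_working_thursday(
    year: int, month: int, holidays: AbstractSet[date] = frozenset()
) -> date:
    """Return the first Thursday of the month that is not in `holidays`."""
    day = first_thursday(year, month)
    while day in holidays:
        day += timedelta(days=7)
    if day.month != month:
        raise ValueError("no working Thursday left in this month")
    return day
```

For example, `first_thursday(2026, 4)` returns `date(2026, 4, 2)`, and passing `{date(2026, 1, 1)}` as `holidays` moves the January 2026 rehearsal to `date(2026, 1, 8)`.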
| 73 | + |
| 74 | +## Scenario Library |
| 75 | + |
| 76 | +Available scenarios in `docs/ops/rehearsal-scenarios/`: |
| 77 | + |
| 78 | +- `degraded-api-health.md` -- API health endpoint returns degraded/unhealthy status |
| 79 | +- `missing-telemetry-signal.md` -- Correlation ID missing from OpenTelemetry traces |
| 80 | +- `mcp-server-startup-regression.md` -- Optional MCP server fails at boot |
| 81 | +- `deployment-readiness-failure.md` -- Docker Compose startup fails readiness checks |
| 82 | + |
| 83 | +New scenarios should follow the same template structure (pre-conditions, injection, diagnosis, recovery, evidence checklist). File them in the `rehearsal-scenarios/` directory with a descriptive kebab-case filename. |
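
The filename convention can be checked mechanically, for instance in a pre-commit hook. The following is a small sketch under assumptions: the helper name is hypothetical, and "kebab-case" is taken to mean lowercase letters and digits separated by single hyphens, with an `.md` extension, as in the existing scenario files.

```python
import re

# Assumed convention: lowercase alphanumeric words joined by single hyphens.
KEBAB_MD = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*\.md")


def is_valid_scenario_filename(name: str) -> bool:
    """Return True if `name` is a kebab-case Markdown filename."""
    return KEBAB_MD.fullmatch(name) is not None
```

All four scenario filenames listed above pass this check, while names such as `Degraded_API.md` do not.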
| 84 | + |
| 85 | +## Related Documents |
| 86 | + |
| 87 | +- `docs/ops/EVIDENCE_TEMPLATE.md` -- evidence package format |
| 88 | +- `docs/ops/REHEARSAL_BACKOFF_RULES.md` -- issue filing and SLA rules for findings |
| 89 | +- `docs/ops/FAILURE_INJECTION_DRILLS.md` -- automated drill scripts (complementary to manual rehearsals) |
| 90 | +- `docs/ops/OBSERVABILITY_BASELINE.md` -- telemetry and dashboard contract |