
Commit 71a834c

lilyz-ai and claude committed
docs: add SEV1 post-mortem for model-engine 500s incident (MLI-6574)
Documents the Apr 24-25 outage caused by deploying an ORM model referencing a missing DB column, including timeline, root cause, impact, and follow-up action items.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 9fcc3c9 commit 71a834c

1 file changed

Lines changed: 78 additions & 0 deletions

@@ -0,0 +1,78 @@
1+
# SEV1 Post-Mortem: model-engine Deployment Caused 500s on All Endpoint Operations
2+
3+
**Incident ID:** MLI-6574
4+
**Severity:** SEV1
5+
**Date:** Apr 24–25, 2026
6+
**Duration:** ~58 min (23:50Z deployment → 00:48Z rollback)
7+
**Customer-facing impact:** ~58 min of 500s (23:50Z – 00:48Z rollback); PagerDuty alert fired at 00:37Z, ~47 min after the bad deployment
8+
**Status:** Resolved
9+
10+
---
11+
12+
## Summary
13+
14+
A `kubectl set image` at 23:50Z on Apr 24 deployed a model-engine image that referenced a new ORM field (`endpoints.temporal_task_queue`) which did not exist in the production database. Every `list_model_endpoints` and `get_model_endpoint` call returned 500, blocking all endpoint deployments and operations for all users. Five patch attempts over ~34 minutes failed to address the root cause before a rollback to the previous image resolved the incident. At least one user missed a project delivery deadline.
15+
16+
---
17+
18+
## Timeline
19+
20+
| Time (UTC) | Event |
21+
|---|---|
22+
| Feb 27, 05:54Z | Stable image `f395ffa6…` deployed; ran cleanly for ~57 days |
23+
| **Apr 24, 23:50:25Z** | **`kubectl set image` → `04729cef…` (rev 293); ORM references missing `temporal_task_queue` column; all endpoint list/get → 500** |
24+
| 23:55:43Z | `-internal` tag rolled (rev 294); same bug, same 500s |
25+
| Apr 25, 00:09:32Z | `-patch` (rev 296) — does not fix missing column |
26+
| 00:15:29Z | `-patch2` (rev 297) — does not fix missing column |
27+
| 00:20:27Z | `-patch3` (rev 298) — does not fix missing column |
28+
| 00:24:11Z | `-patch4` (rev 299) — does not fix missing column |
29+
| **00:37Z** | **PagerDuty fires; Envoy 5xx ratio = 0.052** |
30+
| **00:48Z** | **Rollback to `f395ffa6…` (rev 300); errors clear; alert auto-resolves** |
31+
| 01:22Z | 0 errors in last 2 min; incident closed |
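The intervals quoted in this document can be re-derived from the timeline. The sketch below is only a cross-check: the helper names are hypothetical, and the alert and rollback rows carry no seconds in the timeline, so `:00` is assumed for both.

```python
from datetime import datetime, timezone

def utc(day: int, hms: str) -> datetime:
    """Build a UTC timestamp in April 2026 from a day and an HH:MM:SS string."""
    hour, minute, second = (int(part) for part in hms.split(":"))
    return datetime(2026, 4, day, hour, minute, second, tzinfo=timezone.utc)

def minutes_between(start: datetime, end: datetime) -> int:
    """Elapsed time rounded to the nearest whole minute."""
    return round((end - start).total_seconds() / 60)

bad_deploy = utc(24, "23:50:25")  # rev 293
last_patch = utc(25, "00:24:11")  # rev 299 (-patch4)
alert = utc(25, "00:37:00")       # PagerDuty fires (seconds assumed)
rollback = utc(25, "00:48:00")    # rev 300 (seconds assumed)

deploy_to_last_patch = minutes_between(bad_deploy, last_patch)
deploy_to_alert = minutes_between(bad_deploy, alert)
deploy_to_rollback = minutes_between(bad_deploy, rollback)

print(deploy_to_last_patch, deploy_to_alert, deploy_to_rollback)
```

Under those assumptions the three values round to 34, 47, and 58 minutes, matching the patching window, the alert delay, and the total duration.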
32+
33+
---
34+
35+
## Root Cause
36+
37+
The ORM model for the `endpoints` table was updated to declare a new column `temporal_task_queue` (added as part of the temporal endpoint type feature, MLI-6425). The corresponding Alembic database migration **was never applied to production** before the new image was rolled out.
38+
39+
When model-engine started, SQLAlchemy attempted to reference `endpoints.temporal_task_queue` in every endpoint query. Because the column did not exist in the live DB, PostgreSQL returned an error on every `list_model_endpoints` and `get_model_endpoint` call, causing universal 500s.
40+
41+
The five patch images deployed during the incident did not address this — they lacked the migration and the ORM field continued to reference the non-existent column.
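The query-time failure described above can be reproduced in miniature. This is an illustrative sketch only: it uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, and a hand-written SELECT in place of the SQL that SQLAlchemy would generate. The table layout beyond the `temporal_task_queue` column is an assumption.

```python
import sqlite3

# Stand-in for the production database: the table exists, the new column does not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE endpoints (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO endpoints (name) VALUES ('demo-endpoint')")

# Hypothetical query mirroring what an ORM emits once the model declares the column.
LIST_ENDPOINTS_SQL = "SELECT id, name, temporal_task_queue FROM endpoints"

def list_model_endpoints(connection):
    """Fetch all endpoints, selecting every column the model declares."""
    return connection.execute(LIST_ENDPOINTS_SQL).fetchall()

def try_list(connection):
    """Return (rows, error) so the caller can see which path was taken."""
    try:
        return list_model_endpoints(connection), None
    except sqlite3.OperationalError as exc:
        return None, str(exc)

# Before the migration: the process is up, but every call fails.
rows_before, error_before = try_list(conn)

# Applying the missing migration (here a bare ALTER TABLE) makes the same query succeed.
conn.execute("ALTER TABLE endpoints ADD COLUMN temporal_task_queue TEXT")
rows_after, error_after = try_list(conn)
```

The process starts and connects without complaint; the failure only appears when the first query names the missing column. This is also why rebuilding the image without applying the migration could not help.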
42+
43+
**Contributing factors:**
44+
- No migration-before-rollout enforcement: the deployment pipeline does not block a rollout if pending Alembic migrations exist.
45+
- No startup schema validation: model-engine does not fail fast on ORM/DB schema drift; errors only surfaced at query time.
46+
- No rollback SLA: 5 patch attempts were made over ~34 minutes before rollback was chosen. The correct fix (rollback) was not prioritized early enough.
47+
- Delayed alerting: PagerDuty did not fire until 00:37Z, ~47 min after the bad deployment.
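The startup schema validation that was missing could look roughly like the following. It is a minimal sketch, not model-engine's code: `EXPECTED_COLUMNS`, `missing_columns`, and `check_schema_or_raise` are hypothetical names, and SQLite's `PRAGMA table_info` stands in for what would be a query against `information_schema.columns` (or SQLAlchemy's inspector) on PostgreSQL.

```python
import sqlite3

# Hypothetical: the columns the ORM model declares, keyed by table name.
EXPECTED_COLUMNS = {"endpoints": {"id", "name", "temporal_task_queue"}}

def missing_columns(connection, expected):
    """Return {table: columns} for every expected column absent from the live DB."""
    drift = {}
    for table, columns in expected.items():
        # Table names come from a trusted constant, so interpolation is safe here.
        live = {row[1] for row in connection.execute(f"PRAGMA table_info({table})")}
        absent = columns - live
        if absent:
            drift[table] = absent
    return drift

def check_schema_or_raise(connection, expected):
    """Fail fast at startup instead of failing on every request."""
    drift = missing_columns(connection, expected)
    if drift:
        raise RuntimeError(f"schema drift detected: {drift}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE endpoints (id INTEGER PRIMARY KEY, name TEXT)")

drift_before = missing_columns(conn, EXPECTED_COLUMNS)

conn.execute("ALTER TABLE endpoints ADD COLUMN temporal_task_queue TEXT")
drift_after = missing_columns(conn, EXPECTED_COLUMNS)
check_schema_or_raise(conn, EXPECTED_COLUMNS)  # passes once the column exists
```

If a check like this ran in a readiness probe, a pod built from the bad image would never become ready, and the rollout would stall rather than replace healthy pods.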
48+
49+
---
50+
51+
## Impact
52+
53+
| Dimension | Detail |
54+
|---|---|
55+
| Duration of 500s | ~58 min (23:50Z – 00:48Z rollback); alert fired at 00:37Z, ~47 min in |
56+
| Affected operations | All endpoint list, get, create, update, delete |
57+
| Affected users | All model-engine users |
58+
| Known downstream impact | At least one user missed a project delivery deadline |
59+
60+
---
61+
62+
## Action Items
63+
64+
| # | Action | Owner | Priority |
65+
|---|---|---|---|
66+
| 1 | **Enforce migration-first deployment**: CI/CD pipeline must verify all pending Alembic migrations are applied before new image goes live (or gate rollout on a migration job completing). | Infra/model-engine | High |
67+
| 2 | **Add migration drift detection**: Startup health check that fails fast if ORM schema diverges from live DB schema, preventing silent 500s. | model-engine | High |
68+
| 3 | **Define rollback SLA**: If ≥2 patch attempts fail within 15 minutes, initiate rollback immediately. Document this in the on-call runbook. | On-call/Infra | Medium |
69+
| 4 | **Improve alerting latency**: Envoy 5xx alert threshold and evaluation window should catch this class of incident in <10 min, not 47 min. | Infra | Medium |
70+
| 5 | **Proactive incident communication**: Users mid-deployment should receive Slack/status-page notification during active SEV1s. | On-call/Eng | Medium |
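Action item 1 amounts to comparing the revision recorded in the database with the head of the migration chain before allowing a rollout. The sketch below shows only that comparison, under stated assumptions: the function names and revision IDs are hypothetical, and the chain is assumed to be linear. In a real pipeline the two inputs would likely come from Alembic itself, for example the output of `alembic current` and `alembic heads`.

```python
from typing import List, Optional

def pending_migrations(ordered_revisions: List[str], db_revision: Optional[str]) -> List[str]:
    """Return revisions not yet applied, given a linear chain ordered oldest to newest."""
    if db_revision is None:
        # Nothing has been applied yet, so every revision is pending.
        return list(ordered_revisions)
    if db_revision not in ordered_revisions:
        raise ValueError(f"database revision {db_revision!r} is not in the migration chain")
    applied_index = ordered_revisions.index(db_revision)
    return ordered_revisions[applied_index + 1:]

def rollout_allowed(ordered_revisions: List[str], db_revision: Optional[str]) -> bool:
    """Gate: the new image may go live only when no migrations are pending."""
    return not pending_migrations(ordered_revisions, db_revision)

# Hypothetical revision IDs; "c3" stands for the migration adding the new column.
chain = ["a1", "b2", "c3"]

blocked = rollout_allowed(chain, "b2")  # database is one revision behind
allowed = rollout_allowed(chain, "c3")  # database is at head
```

A gate of this shape blocks the rollout when the database is behind the code. It does not cover the opposite case (a database ahead of the code after a rollback), which would need its own policy.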
71+
72+
---
73+
74+
## Lessons Learned
75+
76+
- **Schema changes must be deployed atomically with or ahead of the code that depends on them.** A two-phase deploy (migration first, code second) or a migration-gating CI step would have prevented this entirely.
77+
- **Fail fast on startup beats silently failing on every request.** A startup check that validates ORM↔DB schema alignment would have contained the blast radius to a failed rollout rather than a live outage.
78+
- **Five patches in 34 minutes is a sign to stop and roll back.** When the root cause is unclear, reverting to a known-good state is faster and safer than iterating forward.
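The rollback rule proposed in the action items (two or more failed patch attempts within 15 minutes) can be written as a small check. This sketch makes two assumptions that the document does not state: each patch counts as failed at its deploy time, and "within 15 minutes" means any two failures no more than 15 minutes apart. The function name is hypothetical; the timestamps are the `-patch` through `-patch4` rows from the timeline.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence

def rollback_trigger(
    failures: Sequence[datetime],
    min_failures: int = 2,
    window: timedelta = timedelta(minutes=15),
) -> Optional[datetime]:
    """Return the time of the failure that first satisfies the rule, or None."""
    ordered = sorted(failures)
    for i in range(min_failures - 1, len(ordered)):
        # Compare failure i with the failure (min_failures - 1) positions earlier.
        if ordered[i] - ordered[i - min_failures + 1] <= window:
            return ordered[i]
    return None

def apr25(hour: int, minute: int, second: int) -> datetime:
    return datetime(2026, 4, 25, hour, minute, second, tzinfo=timezone.utc)

patch_failures = [
    apr25(0, 9, 32),   # -patch  (rev 296)
    apr25(0, 15, 29),  # -patch2 (rev 297)
    apr25(0, 20, 27),  # -patch3 (rev 298)
    apr25(0, 24, 11),  # -patch4 (rev 299)
]

trigger = rollback_trigger(patch_failures)
print(trigger)
```

Under these assumptions the rule would have fired at the `-patch2` deploy (00:15:29Z), well before the 00:48Z rollback.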
