Command: /postmortem Date: 2026-03-19 Pipeline span: /create-issue → /deploy-check (same day)
Observed at: /review (S1 flagged), /peer-review (escalated to REQUIRED), fixed at /peer-review
Root cause: backend-architect-agent never identified cost-surface exposure as a risk category. No prompt instruction required it to specify a rate limiting strategy for unauthenticated endpoints calling paid external APIs.
Preventative Rule: Any architecture spec that includes an unauthenticated endpoint calling a paid external API must include a rate limiting strategy. This is a blocking architecture requirement.
Observed at: /peer-review (RR1 MUST-FIX)
Root cause: Architecture spec specified the bundle_sessions table and sessionId field but gave no ordering constraint. Engineer generated sessionId after DB insert. If Neon was unavailable, sessionId = "unknown" — poisoning PostHog and causing 400s on PATCH.
Preventative Rule: When a sessionId is used across analytics, API routes, and DB, the architecture spec must state: "Generate sessionId before all downstream operations so it is stable regardless of DB or service failures."
Observed at: /peer-review (RR2 MUST-FIX)
Root cause: backend-architect-agent mentioned "API latency" as a risk but did not mandate a concrete timeout for Vercel's serverless environment. Vercel hard-kills functions at 10s returning HTML, which the client parsed as JSON and threw as "Network error."
Preventative Rule: All architecture specs with external AI API calls on Vercel must include: "Wrap in Promise.race with AbortController at 9s. Return JSON 504 on timeout — never let Vercel's HTML error page reach the client."
Observed at: /qa-test (QA1 HIGH — required fix)
Root cause: No frontend standard required a fallback + error state for clipboard operations. Engineer implemented happy path only.
Preventative Rule: Any clipboard copy interaction must implement: (1) navigator.clipboard.writeText() primary, (2) document.execCommand('copy') fallback, (3) visible inline error state if both fail. Silent catch blocks on user-facing actions are never acceptable.
Observed at: /metric-plan (gaps identified), /deploy-check (still missing, added)
Root cause: The CLAUDE.md rule ("Implement telemetry during feature development, not post-QA") was applied only to success-path events. Error-path events were treated as analytics concerns rather than implementation requirements. Additionally, PostHogProvider had capture_pageview: false (intentional) but no manual landing_page_viewed capture existed — metric plan assumed auto-capture.
Preventative Rule: During /execute-plan, every API route error path (catch blocks, timeout branches, rate-limit branches) must include a PostHog event. The metric plan verifies events — it does not define which ones exist.
| Improvement | Target | Priority |
|---|---|---|
| Add Vercel timeout constraint to architecture spec | backend-architect-agent.md | Critical |
| Add sessionId ordering rule to architecture spec | backend-architect-agent.md | Critical |
| Add rate limiting requirement for paid API endpoints | backend-architect-agent.md | Critical |
| Add clipboard fallback standard | coding-standards.md | High |
| Add error-path telemetry requirement | execute-plan command | High |
| Add pageview capture verification step | metric-plan command | Medium |
knowledge/engineering-lessons.md— 3 new rules (PM-1, PM-2, PM-3)knowledge/coding-standards.md— 1 new rule (PM-4)agents/backend-architect-agent.md— Mandatory Pre-Approval Checklist (Prompt Autopsy)agents/peer-review-agent.md— Step 4 exactness requirement (Prompt Autopsy)
What it missed: Three architecture-level requirements caught only at peer-review (2–3 stages late):
- No rate limiting for unauthenticated endpoint calling Gemini
- No ordering constraint on sessionId generation relative to DB
- No Vercel timeout constraint on Gemini call
Root cause in prompt: ## 6 Technical Risks uses vague examples. The Anti-Sycophancy 10x traffic question doesn't surface cost-abuse (100 bot requests ≠ 10x load). No checklist exists for serverless + paid-API constraints.
Proposed fix — add to agents/backend-architect-agent.md under ## 6 Technical Risks:
## Mandatory Pre-Approval Checklist (Serverless + AI)
Before finalizing the architecture, answer all of the following. Any gap must
be fixed in the spec before outputting.
1. Paid API exposure: Does any unauthenticated endpoint call a paid external API
(Gemini, OpenAI, Twilio, Stripe, etc.)?
→ If yes: specify rate limiting strategy (e.g., 5 req/60s per IP, in-memory Map).
→ This is a blocking architecture requirement, not a post-review improvement.
2. Vercel timeout: Does any API route call an external AI model?
→ If yes: specify "Wrap in Promise.race with AbortController at ≤ 9s.
Return JSON 504 on timeout — never expose Vercel's HTML error page to client."
3. SessionId / correlation ID ordering: Is there a sessionId used across
analytics, API routes, and DB?
→ If yes: specify "Generate sessionId (crypto.randomUUID()) before all
downstream operations. Never derive it from DB return values."
What it missed: Challenge Mode Step 4 (Prompt Autopsy Check) produced directional suggestions ("add: AbortController timeout rule") but not exact text additions to specific files. Learning only flows through if /learning correctly translates vague directions into concrete changes.
Root cause in prompt: Step 4 says "Report these as: Agent Prompt Improvements" but doesn't require exact file + section + text. The output is directional, not actionable.
Proposed fix — replace Step 4 in agents/peer-review-agent.md Challenge Mode:
**Step 4 — Prompt Autopsy Check**
For each agent prompt gap identified:
- Name the exact agent file (e.g., agents/backend-architect-agent.md)
- Name the exact section to modify (e.g., ## 6 Technical Risks)
- Write the exact text to add — not a direction, but the actual sentence or rule
Format as:
File: agents/[agent-name]-agent.md
Section: [section name]
Add: "[exact text]"
This output is consumed directly by /learning to update agent files.
Vague directions ("add a timeout rule") are not acceptable outputs.
What it missed: Error-path PostHog events were not wired during implementation. Happy-path events (bundle_generated, pitch_copied) were implemented. Error catch blocks were written for reliability only, not observability.
Root cause in prompt: Execute-plan instructs engineers to implement telemetry but doesn't distinguish happy-path from error-path. Implicit assumption: telemetry = success events.
Proposed fix — add to execute-plan command instructions under telemetry section:
## Telemetry Completeness Requirement
For every API route calling an external AI service, implement PostHog events
in ALL branches:
- Success path: event with latency_ms + key response properties
- Timeout branch: event with timeout_ms property
- Parse/AI failure branch: event with error_type property
- Rate limit branch: event (no PII)
An API route with partial telemetry is equivalent to no telemetry for funnel
analysis. The metric plan verifies event schema — it does not define which
events exist.
| Issue | Caught At | Should Have Been Caught | Stages Late |
|---|---|---|---|
| No rate limiting on paid API | /review | /create-plan (architecture) | 2 |
| SessionId tied to DB success | /peer-review | /create-plan (architecture) | 3 |
| No Gemini timeout | /peer-review | /create-plan (architecture) | 3 |
| Silent clipboard failure | /qa-test | /execute-plan | 1 |
| Error-path telemetry missing | /metric-plan → /deploy-check | /execute-plan | 2–3 |
All 5 systemic issues share a root cause: architecture under-specification propagated downstream until caught by adversarial review. The backend-architect-agent is the highest-leverage fix point — updating its mandatory checklist eliminates 3 of 5 issues at source.