Engineering Lessons

This file stores durable engineering rules extracted from postmortems.

The Learning Agent appends to this file after every completed pipeline cycle.

Agents must read this file before generating architecture, implementation, or review outputs.

Format

Each lesson follows this structure:

date: YYYY-MM-DD project: <project_name> issue: root_cause: rule: improvement:

Lessons

date: 2026-03-07 project: Gmail Summary to WhatsApp Notifier (issue-002) issue: Unbounded Gmail pagination loop caused Vercel 504 timeouts during QA root_cause: Architecture plan did not specify a lookback window or maximum page limit for external API fetching loops. The assumption that default pagination would self-terminate was incorrect. rule: Every external API fetch loop must implement two hard constraints before any other logic — a maximum page count (e.g., 5 pages) and a temporal bound (e.g., newer_than:30d). These are non-negotiable even in MVP. improvement: Backend Architect agent must include explicit pagination limits and temporal bounds in all architecture plans that involve syncing external data. Code Review agent must flag any while/for loop over a network call that lacks both a page cap and a date bound.
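
A minimal TypeScript sketch of the two constraints in the rule above. The `Page` shape, the `fetchPage` callback, and the constant names are hypothetical stand-ins for whatever API client a project uses; only the 5-page cap and the `newer_than:30d` bound come from the lesson itself.

```typescript
// Hypothetical page shape; real APIs name these fields differently.
interface Page<T> {
  items: T[];
  nextPageToken?: string;
}

// Hypothetical fetcher signature: the caller supplies the actual network call.
type FetchPage<T> = (query: string, pageToken?: string) => Promise<Page<T>>;

const MAX_PAGES = 5; // hard page cap from the rule
const LOOKBACK_QUERY = "newer_than:30d"; // temporal bound from the rule

async function fetchBounded<T>(
  fetchPage: FetchPage<T>,
  maxPages: number = MAX_PAGES,
): Promise<T[]> {
  const results: T[] = [];
  let token: string | undefined;
  // The loop is bounded by maxPages even if the API keeps returning tokens.
  for (let page = 0; page < maxPages; page++) {
    const res = await fetchPage(LOOKBACK_QUERY, token);
    results.push(...res.items);
    token = res.nextPageToken;
    if (!token) break; // source exhausted before the cap was reached
  }
  return results;
}
```

Both bounds are applied before any per-item logic runs, so a misbehaving upstream can cost at most `maxPages` requests per invocation.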

date: 2026-03-07 project: Gmail Summary to WhatsApp Notifier (issue-002) issue: AI summaries were low quality because only the Gmail snippet (200 chars) was passed to the LLM instead of the full email body root_cause: The plan assumed the default messages.list snippet would provide sufficient context for generative summarization. This was not validated against how LLMs actually require full text payloads. rule: Any AI summarization integration must explicitly fetch and pass the full text/plain or text/html payload of each item. Metadata snippets or previews must never be used as primary LLM input. improvement: Backend Architect agent must call out the specific payload fields required for each AI integration in the plan. Execute Plan agent must not default to snippet fields when full body fields exist.

date: 2026-03-07 project: Gmail Summary to WhatsApp Notifier (issue-002) issue: Master cron job processed all users synchronously inside a single Vercel function, coupling total runtime to user count and hitting the 5-minute timeout ceiling root_cause: The initial cron architecture treated user processing as a batch operation rather than a fan-out trigger. The architecture did not account for serverless execution time limits. rule: Any scheduled worker that triggers per-user or per-entity operations must use a fan-out architecture — the master cron fires N independent async invocations, one per entity. The master cron must not contain heavy processing logic itself. improvement: Backend Architect agent must flag any architecture that processes a list of users inside a single cron function and require a decoupled worker pattern instead. QStash, background jobs, or parallel HTTP triggers are the acceptable patterns.

date: 2026-03-07 project: Gmail Summary to WhatsApp Notifier (issue-002) issue: Transient Twilio errors permanently suspended user accounts because all third-party errors were treated identically root_cause: Error handling treated all non-2xx responses from messaging APIs as permanent failures, triggering is_active = false without checking whether the error was recoverable. rule: Third-party API error handling must distinguish between permanent errors (e.g., invalid number, 404) and transient errors (e.g., rate limit, sandbox restriction, 503). Account-level consequences like suspension must only trigger on permanent, confirmed errors. improvement: Backend Engineer agent must implement error classification for all third-party integrations before writing the error handling block. Code Review agent must verify that account suspension logic is guarded by error type, not error existence.
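
One way to express the error classification described above, as a sketch. The status allowlist and the function names are assumptions for illustration; a real integration would probably also inspect provider-specific error codes, which this example does not model.

```typescript
type ErrorClass = "permanent" | "transient";

// Hypothetical allowlist: only statuses known to be unrecoverable.
const PERMANENT_STATUSES = new Set<number>([404, 410]);

function classifyStatus(status: number): ErrorClass {
  // Anything not explicitly known to be permanent (429, 503, unknown codes)
  // is treated as transient, so the safe default is "retry later".
  return PERMANENT_STATUSES.has(status) ? "permanent" : "transient";
}

// Account-level consequences are guarded by error type, not error existence.
function shouldSuspendAccount(status: number): boolean {
  return classifyStatus(status) === "permanent";
}
```

Defaulting unknown errors to transient means a new or unexpected provider failure cannot suspend an account by accident.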

date: 2026-03-07 project: Gmail Summary to WhatsApp Notifier (issue-002) issue: Supabase schema was never pushed to the remote instance before deploy-check, causing the deploy gate to fail on an entirely empty database root_cause: The architecture plan assumed schema.sql would auto-sync. Supabase requires explicit execution via CLI link or manual SQL editor. This was not listed as a deploy prerequisite in the deploy-check command. rule: Database schema initialization against the production or staging URI must be a mandatory, explicit checklist item in deploy-check. The deploy gate cannot pass unless table existence is verified, not assumed. improvement: Deploy Agent must include a schema verification step — confirming all expected tables exist in the remote DB — as the first item in the deploy-check sequence, before any compilation or build checks.

date: 2026-03-07 project: Gmail Summary to WhatsApp Notifier (issue-002) issue: Failed AI Summaries created an infinite retry loop for specific unprocessable emails root_cause: The pipeline correctly skipped marking emails as processed on AI failure, but didn't track the number of failed attempts per email. This caused it to continually attempt to process the exact same "poison pill" content every cron run if the LLM constantly choked on it. rule: Data processing queues that iterate over specific external entities must implement a dead-letter queue (DLQ) or a permanent skip/failure counter per item to prevent infinite retry loops on unprocessable structures. improvement: Backend Engineer agent must implement per-item retry limits and failure tracking for any automated processing pipeline that operates on external payloads.
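
The per-item failure counter could look roughly like the sketch below. `FailureTracker` and `MAX_ATTEMPTS` are hypothetical names, and the in-memory `Map` is only for illustration: in a serverless pipeline the counter would need to live in durable storage (for example, a column on the item row) to survive between cron runs.

```typescript
const MAX_ATTEMPTS = 3; // hypothetical threshold

class FailureTracker {
  private attempts = new Map<string, number>();

  // Call this whenever processing an item fails.
  recordFailure(itemId: string): void {
    this.attempts.set(itemId, (this.attempts.get(itemId) ?? 0) + 1);
  }

  // Items at or past the threshold are skipped permanently (dead-lettered).
  isDeadLettered(itemId: string): boolean {
    return (this.attempts.get(itemId) ?? 0) >= MAX_ATTEMPTS;
  }
}
```

A cron run would check `isDeadLettered` before attempting an item, so a poison-pill email is retried at most `MAX_ATTEMPTS` times.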

date: 2026-03-08 project: AI Personal Finance Advisor (issue-003) issue: WhatsApp replies were terminating abruptly because the sendWhatsAppMessage promise was fired without being awaited inside the serverless webhook, so the Next.js API route returned 200 before the send completed. root_cause: Serverless environments (like Vercel) suspend background execution immediately after the HTTP response is sent, silently killing unawaited network requests. rule: Every async API call or background operation made inside a serverless API route or edge function must be explicitly awaited or handed to waitUntil before returning the HTTP response. Do not use fire-and-forget patterns. improvement: Code Review Agent must enforce explicit await on all external SDK/fetch calls inside API routes, especially for notification services.

date: 2026-03-08 project: AI Personal Finance Advisor (issue-003) issue: The cron endpoints executed N+1 database queries, iterating users one-by-one and awaiting individual sequential queries and API calls inside loops. root_cause: Defaulting to single-user CRUD operations resulted in extreme execution durations that violated serverless timeout limits. rule: Cron jobs that process collections of entities MUST batch database reads and utilize concurrent Promise arrays (e.g. Promise.allSettled) for external dispatches to compress execution time. improvement: Backend Architect Agent must explicitly mandate batching patterns (like IN queries) and concurrent processing for cron jobs dealing with varying user numbers. Code Review Agent must reject await statements wrapped inside simple loops interacting with databases or third-party APIs.
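
A sketch of the concurrent dispatch half of this rule, using `Promise.allSettled`. The `dispatchAll` helper and the `send` callback are hypothetical; batching the database read (for example, one `IN` query) is assumed to have happened before this function is called.

```typescript
interface DispatchSummary {
  succeeded: number;
  failed: number;
}

async function dispatchAll<T>(
  items: T[],
  send: (item: T) => Promise<void>,
): Promise<DispatchSummary> {
  // All sends start concurrently; the async wrapper turns synchronous throws
  // into rejections so one bad item cannot abort the whole batch.
  const results = await Promise.allSettled(items.map(async (item) => send(item)));
  const failed = results.filter((r) => r.status === "rejected").length;
  return { succeeded: results.length - failed, failed };
}
```

Total runtime is then roughly the slowest single send rather than the sum of all sends, which is the point of replacing `await` inside a loop.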

date: 2026-03-11 project: Project Clarity (issue-004) issue: Gemini JSON output was wrapped in Markdown codeblocks, crashing JSON.parse() and causing task save failures. root_cause: Code assumed AI structured outputs were raw stringified JSON, ignoring common LLM behaviors where responses are wrapped in markdown formatting. rule: All JSON parsing from unstructured or semi-structured LLM outputs must be preceded by a sanitization/stripping step (e.g. regex replace of ```json markdown blocks). improvement: Code Review Agent must reject naive JSON.parse(ai_response.text) without a preceding regex clean or try/catch fallback block.
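
The sanitization step might be sketched as follows. `parseLlmJson` is a hypothetical helper name, and returning `null` on failure is one possible fallback choice, not something the lesson prescribes. The fence marker is built with `repeat` only so that the example does not contain a literal run of three backticks.

```typescript
// Three backticks, built indirectly so this example stays readable.
const FENCE = "`".repeat(3);

// Matches a whole response wrapped in a fenced block, with or without "json".
const FENCED_BLOCK = new RegExp(
  "^" + FENCE + "(?:json)?\\s*([\\s\\S]*?)\\s*" + FENCE + "$",
  "i",
);

function parseLlmJson<T>(raw: string): T | null {
  const trimmed = raw.trim();
  const match = FENCED_BLOCK.exec(trimmed);
  // Use the fenced body if present, otherwise assume the text is raw JSON.
  const candidate = match ? match[1] : trimmed;
  try {
    return JSON.parse(candidate) as T;
  } catch {
    return null; // caller decides how to handle an unparseable response
  }
}
```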

date: 2026-03-11 project: Project Clarity (issue-004) issue: Unbounded database queries (SELECT * FROM tasks without LIMIT) created a risk of infinitely growing initial load payloads. root_cause: Convenience of MVP implementation omitted basic database pagination and limits for list queries. rule: Every GET or list query on a database MUST enforce a hard .limit() or pagination strategy, even if the dataset is currently small. improvement: Backend Architect and Code Review Agents must actively check for .limit() or pagination clauses on any list-fetching endpoint.

date: 2026-03-11 project: Project Clarity (issue-004) issue: Missing state persistence for marked 'done' tasks - frontend updated optimistically but reloads brought tasks back. root_cause: MVP scope prioritized creation flow and skipped the mutation API endpoint, leading to a broken core loop. rule: No optimistic UI mutation can be shipped without a corresponding backend persistence endpoint hooked up and tested. improvement: Peer Review and QA Agents must explicitly verify that any state changes represented visually in the UI are persisted successfully to the database.

date: 2026-03-11 project: Project Clarity (issue-004) issue: Telemetry events defined in the Metric Plan were absent in the codebase during Deploy Check. root_cause: The pipeline executed /metric-plan after all implementation and QA stages, disconnecting analytics definition from the build cycle. rule: Telemetry instrumentation (e.g. PostHog client) must be bundled into the feature implementation phase rather than treated as a post-QA checklist item. improvement: Execute Plan agent must mandate integration of telemetry trackers during the build. The Metric Plan stage should ideally move earlier in the pipeline (shift left).

date: 2026-03-19 project: SMB Feature Bundling Engine (issue-005) issue: Rate limiting on unauthenticated endpoint calling Gemini was deferred until /peer-review (3 stages late) root_cause: backend-architect-agent had no prompt instruction requiring a rate limiting strategy for unauthenticated endpoints calling paid external APIs. The Anti-Sycophancy 10x traffic check does not surface cost-abuse (bot requests ≠ load spikes). rule: Any architecture spec that includes an unauthenticated endpoint calling a paid external API must include a rate limiting strategy. This is a blocking architecture requirement, not a post-review improvement. improvement: backend-architect-agent Mandatory Pre-Approval Checklist now requires specifying rate limiting for all unauthenticated paid-API endpoints before outputting the spec.

date: 2026-03-19 project: SMB Feature Bundling Engine (issue-005) issue: SessionId was derived from DB return value, causing it to equal "unknown" on DB failure — poisoning PostHog analytics and causing 400 errors on PATCH endpoint root_cause: Architecture spec defined the sessionId field but gave no ordering constraint. Engineer naturally generated the ID after the DB insert returned. rule: When a sessionId or correlation ID is used across analytics, API routes, and DB, the architecture spec must state: "Generate sessionId (crypto.randomUUID()) before all downstream operations so it is stable regardless of DB or service failures." improvement: backend-architect-agent Mandatory Pre-Approval Checklist now requires explicit sessionId ordering constraint whenever a session ID spans analytics + API + DB.
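
A small sketch of the ordering constraint. `startSession` and the `insertRow` callback are hypothetical; the example imports `randomUUID` from `node:crypto`, which is equivalent to the `crypto.randomUUID()` call named in the rule.

```typescript
import { randomUUID } from "node:crypto";

interface SessionResult {
  sessionId: string;
  persisted: boolean;
}

async function startSession(
  insertRow: (sessionId: string) => Promise<void>,
): Promise<SessionResult> {
  // Generated first, so it is never derived from a DB return value.
  const sessionId = randomUUID();
  try {
    await insertRow(sessionId);
    return { sessionId, persisted: true };
  } catch {
    // The ID stays stable for analytics and API routes even if the insert fails.
    return { sessionId, persisted: false };
  }
}
```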

date: 2026-03-19 project: SMB Feature Bundling Engine (issue-005) issue: No Gemini timeout specified — Vercel hard-kills functions at 10s returning HTML, which the client parsed as JSON and threw as "Network error" root_cause: backend-architect-agent mentioned "API latency" as a risk but did not mandate a concrete timeout. Vercel's 10s limit returns an HTML error page, not JSON, causing misleading client errors. rule: All architecture specs with external AI API calls on Vercel must include: "Wrap in Promise.race with AbortController at ≤ 9s. Return JSON 504 on timeout — never let Vercel's HTML error page reach the client." improvement: backend-architect-agent Mandatory Pre-Approval Checklist now requires specifying AbortController timeout for every API route that calls an external AI model.
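
The timeout wrapper in this rule could be sketched like this. `callWithTimeout`, the `JsonResponse` shape, and the error strings are hypothetical; mapping a thrown upstream error to 502 is an added assumption, since the lesson only specifies the 504 timeout case.

```typescript
interface JsonResponse {
  status: number;
  body: unknown;
}

const TIMEOUT = Symbol("timeout");

async function callWithTimeout<T>(
  call: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number = 9_000, // stays under the 10s platform limit in the lesson
): Promise<JsonResponse> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof TIMEOUT>((resolve) => {
    timer = setTimeout(() => {
      controller.abort(); // ask the upstream call to stop
      resolve(TIMEOUT);
    }, timeoutMs);
  });
  try {
    const result = await Promise.race([call(controller.signal), timeout]);
    if (result === TIMEOUT) {
      return { status: 504, body: { error: "upstream_timeout" } };
    }
    return { status: 200, body: result };
  } catch {
    // Assumption: any upstream exception is also reported as JSON.
    return { status: 502, body: { error: "upstream_failure" } };
  } finally {
    clearTimeout(timer);
  }
}
```

Every path returns a JSON-serializable object, so the client never has to parse a platform-generated HTML error page.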

date: 2026-03-19 project: SMB Feature Bundling Engine (issue-005) issue: Clipboard copy failure was silent — empty catch block gave PM zero feedback during a live sales call root_cause: No frontend standard required a fallback + error state for clipboard operations. Engineer implemented the happy path only. rule: Any clipboard copy interaction must implement: (1) navigator.clipboard.writeText() primary, (2) document.execCommand('copy') fallback, (3) visible inline error state if both fail. Silent catch blocks on user-facing copy actions are never acceptable. improvement: coding-standards.md now includes a Clipboard Operations section mandating fallback + inline error state for all copy-to-clipboard interactions.

date: 2026-03-21 project: Ozi Reorder Experiment (issue-006) issue: Internal worker endpoint accepted unauthenticated POST requests — experiment data could be corrupted by any caller root_cause: Architecture spec described the fan-out worker pattern but did not mandate an auth mechanism. "Internal" was treated as an implicit trust boundary. No checklist item required auth on worker-style routes. rule: Any API route that writes to experiment tables (cohorts, reminders, events, cron state) must specify its auth mechanism by name in the architecture spec. "Internal" is not an auth mechanism. All POST routes must be treated as externally reachable regardless of their intended caller. improvement: backend-architect-agent Mandatory Pre-Approval Checklist now requires specifying auth mechanism for every route that writes to experiment data tables. commands/execute-plan.md requires confirming auth header requirement before wiring any POST route.

date: 2026-03-21 project: Ozi Reorder Experiment (issue-006) issue: order_placed PostHog event fired from two sources (server API + client useEffect) — North Star double-counted on every reorder root_cause: Architecture plan defined the event but not its canonical emission point. No rule existed prohibiting dual-emission for a single PostHog event. Engineer wired both API-side and page-side tracking independently. rule: Each PostHog event that contributes to the North Star metric must have exactly one authoritative emission point — either client OR server, never both. If the server fires the event on API confirmation, all client-side re-firings of the same event name must be removed. Document the single source in an inline comment. improvement: commands/execute-plan.md Single Emission Source Rule added. code-review-agent.md now checks for PostHog event name appearing in both server-side routes and client-side components.

date: 2026-03-21 project: Ozi Reorder Experiment (issue-006) issue: orderId in deep link was decorative — reorder page fetched last-essential by userId, showing wrong product if user had a newer order root_cause: Architecture spec defined orderId as the URL parameter but did not specify the exact DB query. Engineer defaulted to the already-available getLastEssentialByUserId() helper, which was semantically wrong for an experiment attribution flow. rule: When a URL parameter names a specific entity (orderId, reminderId, sessionId), the page or API handler must fetch that exact entity by that ID. Fallback-to-owner lookups (e.g., fetching by userId when orderId is in the URL) corrupt experiment attribution and are never acceptable for experiment-instrumented flows. improvement: backend-architect-agent now requires specifying exact DB query (table, WHERE clause, column) for every URL containing an entity ID parameter. peer-review-agent now verifies URL ID → DB lookup fidelity on experiment deep links.

date: 2026-03-21 project: Ozi Reorder Experiment (issue-006) issue: ControlGroupSimulator reset to idle on page refresh — control conversions could be fired multiple times, corrupting North Star comparison root_cause: Simulator was implemented with React component state only. Full page reload reinstantiates the component as idle. The localStorage deduplication pattern used elsewhere in the same codebase was not applied to the new component. rule: Any simulation or conversion tool that fires write-once PostHog events must be idempotent across page refreshes. React component state is insufficient. Apply localStorage keying (check on mount → disable if key exists) AND a DB uniqueness constraint (ON CONFLICT DO NOTHING) for every write-once event emitter. improvement: peer-review-agent Step 5 now includes a demo simulation tool idempotency check. backend-architect-agent Dashboard & Reporting section now requires specifying both localStorage key and DB deduplication for any simulation tool.

date: 2026-03-21 project: Ozi Reorder Experiment (issue-006) issue: PostHog Promise.all in worker threw on PostHog failure — worker returned 500, trigger undercounted remindersSent even though DB state was correct root_cause: PostHog calls were passed to Promise.all without individual try/catch. A PostHog SDK exception propagated to the route handler. The pattern from issue-003 established concurrent processing but did not require per-call telemetry isolation. rule: All PostHog server-side calls in worker routes must be individually wrapped in try/catch before being passed to Promise.allSettled. A PostHog failure must never cause a worker to return 500. Worker HTTP status must reflect DB write state, not telemetry write state. Pattern: Promise.allSettled([trackA(data).catch(e => console.error(e)), trackB(data).catch(e => console.error(e))]). improvement: commands/execute-plan.md Telemetry Resilience Requirement updated to require Promise.allSettled with per-call .catch() for all PostHog worker calls. qa-agent now includes a Telemetry Unavailability test in Failure Simulation.

date: 2026-03-21 project: Ozi Reorder Experiment (issue-006) issue: README.md and .env.local.example were missing at deploy-check — third consecutive cycle with this blocker (issue-004, issue-005, issue-006) root_cause: execute-plan command listed README creation as implied but not as an explicit deliverable. Environment variables added during fix cycles were not tracked against .env.local.example. The blocker was caught twice before but no upstream instruction was hardened enough to prevent recurrence. rule: README.md and .env.local.example are mandatory deliverables of /execute-plan, not polish for /deploy-check. Every env var introduced at any pipeline stage (including peer-review fix cycles) must be added to .env.local.example in the same commit that introduces it. A /deploy-check README failure is always an execute-plan prompt failure. improvement: commands/execute-plan.md now has an explicit final checklist requiring README.md with 9 sections and .env.local.example listing every process.env.* reference before execute-plan can be marked complete.

date: 2026-03-21 project: Ozi Reorder Experiment (issue-006) issue: Error-path telemetry events absent for third consecutive cycle — per-user failure, cron_run_completed, and experiment_ended events not wired root_cause: The Telemetry Completeness Requirement added after issue-005 covered success-path and AI-branch events but did not explicitly require error-path events in catch blocks or lifecycle events at guard evaluations. rule: Telemetry completeness means happy-path AND error-path events. For every cron worker: (1) wire a per-user failure event in the catch block, (2) wire a cron_run_completed event after Promise.allSettled, (3) wire experiment lifecycle events at every guard evaluation (EXPERIMENT_END_DATE, opt-out threshold). These are blocking requirements, not production-only enhancements. improvement: commands/execute-plan.md Telemetry Completeness Requirement expanded to explicitly require error-path and lifecycle events. qa-agent now includes a failure telemetry verification test in Failure Simulation.

date: 2026-03-28 project: Nykaa Hyper-Personalized Style Concierge (issue-008) issue: Telemetry latency in the critical path caused API slowness and false client-side aborts. root_cause: Backend API routes (shelf, rerank) awaited PostHog telemetry flushes, injecting 200-500 ms of external network latency into the hot path. rule: Telemetry calls (e.g., PostHog captureServerEvent) in user-facing API routes must be fire-and-forget (.catch(() => {})) instead of awaited to prevent external latency from corrupting performance SLAs and experiment data. improvement: backend-engineer-agent now mandates the fire-and-forget pattern for telemetry in hot paths.

date: 2026-03-28 project: Nykaa Hyper-Personalized Style Concierge (issue-008) issue: Unprotected JSON.parse on sessionStorage crashed the client, and a race condition in the Search Payload caused overlapping network requests. root_cause: Frontend implementation prioritized functional completion over defensive programming. rule: All localStorage and sessionStorage reads must be wrapped in try/catch, and all search/filter network requests triggered by user input must utilize an AbortController. improvement: frontend-engineer-agent now enforces try/catch on storage reads and AbortController on async fetch.
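
A sketch of the guarded storage read only (the AbortController half of the rule is not shown here). `readJson` and `KeyValueReader` are hypothetical names; the interface mirrors the `getItem` method of browser storage so that `sessionStorage` or `localStorage` could be passed in directly.

```typescript
// Minimal slice of the browser Storage interface needed for reads.
interface KeyValueReader {
  getItem(key: string): string | null;
}

function readJson<T>(storage: KeyValueReader, key: string, fallback: T): T {
  try {
    const raw = storage.getItem(key);
    // A missing key is not an error; corrupted JSON falls through to catch.
    return raw === null ? fallback : (JSON.parse(raw) as T);
  } catch {
    return fallback;
  }
}
```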

date: 2026-03-28 project: Nykaa Hyper-Personalized Style Concierge (issue-008) issue: A/B experiment salt exposed to client via NEXT_PUBLIC prefix, control cohort label returned in API response — enabling cohort self-selection root_cause: Engineer defaulted to NEXT_PUBLIC for shared config without verifying whether the value must be cryptographically hidden. API response included raw cohort string "control" without masking. rule: Cryptographic salts for A/B experiments must be server-only env vars (no NEXT_PUBLIC prefix). API responses to clients must never expose the true cohort label for control groups — return a neutral value like "default". Server-side PostHog events are the correct place to record the real cohort. improvement: backend-engineer-agent now mandates server-only salts and masked cohort labels. backend-architect-agent Mandatory Pre-Approval Checklist item 8 now covers metric verifiability including experiment integrity constraints.
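
A sketch of salted cohort assignment with a masked client label. The function names, the SHA-256 hashing scheme, and the 50/50 split are illustrative assumptions; the lesson only requires that the salt stays server-side and that the control label is replaced by a neutral value such as "default".

```typescript
import { createHash } from "node:crypto";

type Cohort = "control" | "treatment";

// The salt is passed in by server code that reads it from a server-only env
// var (no NEXT_PUBLIC prefix); it must never be bundled into client code.
function assignCohort(userId: string, salt: string): Cohort {
  const digest = createHash("sha256").update(`${salt}:${userId}`).digest();
  return digest[0] % 2 === 0 ? "control" : "treatment";
}

// Label that is safe to return in an API response.
function clientCohortLabel(cohort: Cohort): string {
  return cohort === "control" ? "default" : cohort;
}
```

The real cohort would still be recorded in server-side analytics events, as the rule states.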

date: 2026-03-28 project: Nykaa Hyper-Personalized Style Concierge (issue-008) issue: Missing North Star metric flow. "Add-to-Cart" lift was defined as the success metric, but no Add-to-Cart UI or button was ever built. root_cause: The architecture agents mapped API states to existing features but did not verify whether the metrics defined could actually be measured by the requested UI. rule: No product or architecture plan can be approved unless every single success metric has a corresponding, explicitly designed user flow and telemetry trigger in the specification. improvement: backend-architect-agent now requires explicit confirmation that every success metric can be measured through a specified user flow.

date: 2026-04-03 project: MoneyMirror (issue-009) issue: Dashboard transient state — GET /api/dashboard rehydration path absent from execute-plan output; refresh and email deep links dropped users to blank upload screen root_cause: Architecture spec described the dashboard only in terms of post-mutation result. The separate first-load read path (for refresh, direct URL, email CTA) was not specified. Backend engineer satisfied "what does the dashboard show" without satisfying "how does it load on any entry point." rule: Every page that is linked from an email, push notification, or external URL must have its full load path specified in the architecture: which API route is called, what query it runs, and what state it returns. Implementing only the post-mutation result path is never sufficient. improvement: backend-architect-agent Mandatory Pre-Approval Checklist item 10: every results/dashboard/report page linked from navigation, email, or external URL must specify the authenticated read path for first-load rehydration. Client-memory-only post-mutation flows are blocked. commands/execute-plan.md: add final verification — for every page in the plan, confirm both the write path and the read path are implemented.

date: 2026-04-03 project: MoneyMirror (issue-009) issue: Partial write accepted as success — transaction insert failure did not block processed state; downstream reads operated on a corrupted statement root_cause: Architecture spec defined the parse flow as a sequence of DB writes but did not specify an atomicity strategy for the parent/child pair. Backend engineer treated transaction insert failure as non-critical because there was no explicit instruction that the child write must succeed before the parent enters a success state. Second consecutive cycle with this failure (issue-006 had similar partial-write gap). rule: Any architecture spec that includes a parent record + child records written in the same user action must explicitly declare atomicity: if the child write fails, the parent must be rolled back or marked failed. Partial success is never acceptable as a terminal state for a user-facing data pipeline. improvement: backend-architect-agent Mandatory Pre-Approval Checklist item 11: for every user action that writes a parent record + one or more child records, specify the atomicity strategy. Child failure → parent rollback or failed state + error telemetry. "Partial success" terminal states are blocked. code-review-agent: flag CRITICAL if parent status is set to processed/success before child writes complete.
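
The "child failure marks the parent failed" strategy could be sketched as below. `StatementStore` and its methods are hypothetical placeholders for the project's data layer; where the database supports it, a single transaction that rolls back both writes is the other option the rule allows.

```typescript
type StatementStatus = "processing" | "processed" | "failed";

// Hypothetical data-layer interface.
interface StatementStore {
  insertParent(): Promise<string>; // creates the parent in "processing" state
  insertChildren(parentId: string): Promise<void>;
  setStatus(parentId: string, status: StatementStatus): Promise<void>;
}

async function saveStatement(store: StatementStore): Promise<StatementStatus> {
  const parentId = await store.insertParent();
  try {
    await store.insertChildren(parentId);
  } catch {
    // Child write failed: the parent must not reach a success state.
    await store.setStatus(parentId, "failed");
    return "failed";
  }
  // Only after every child write succeeds is the parent marked processed.
  await store.setStatus(parentId, "processed");
  return "processed";
}
```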

date: 2026-04-03 project: MoneyMirror (issue-009) issue: Fan-out worker returned HTTP 200 with { ok: false } on email failure — master cron counted it as success; weekly_recap_completed reported inflated succeeded counts with failed: 0 root_cause: Fan-out worker HTTP contract was defined at invocation level but success/failure propagation contract was not specified. Backend engineer returned 200 with a JSON error body — a common REST convention — without verifying that the master's counting logic would interpret it correctly. rule: Fan-out worker HTTP contracts must be explicitly specified in the architecture: the worker must return a non-2xx status on any failure that should be counted as failed by the master. JSON error bodies alone are insufficient — the master must not need to inspect payloads to distinguish success from failure. improvement: backend-architect-agent fan-out architecture section must state: "Worker returns non-2xx (e.g., 502) on any failure the master should count as failed. Master uses HTTP status only — never inspects JSON body — for success/failure accounting."
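
A compact sketch of the status-only contract. `workerStatus` and `tallyByStatus` are hypothetical names; 502 is the example failure status given in the lesson.

```typescript
// Worker side: any failure the master should count maps to a non-2xx status.
function workerStatus(emailSent: boolean): number {
  return emailSent ? 200 : 502;
}

// Master side: success/failure accounting uses HTTP status only, never the body.
function tallyByStatus(statuses: number[]): { succeeded: number; failed: number } {
  const succeeded = statuses.filter((s) => s >= 200 && s < 300).length;
  return { succeeded, failed: statuses.length - succeeded };
}
```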

date: 2026-04-03 project: MoneyMirror (issue-009) issue: Advisory feed fetch missing auth header — coaching never rendered in core flow; dashboard called GET /api/dashboard/advisories without Authorization header, returning 401 silently root_cause: Auth was added to the advisory route during a code-review fix cycle. The dashboard component was written before that fix using a bare fetch(). The two halves were never cross-verified. rule: After adding or enforcing auth on any API route, all client-side callers of that route must be updated in the same change. A route auth fix without updating all callers is an incomplete fix. improvement: code-review-agent: for every API route confirmed to require auth, search all fetch(), axios, and useSWR calls in client components targeting that route path. If any caller omits the Authorization header, flag as CRITICAL. commands/execute-plan.md: after wiring any authenticated route, verify all client-side callers send auth headers.

date: 2026-04-03 project: MoneyMirror (issue-009) issue: PostHog env var name mismatch — .env.local.example declared NEXT_PUBLIC_POSTHOG_KEY but posthog.ts read POSTHOG_KEY; server-side telemetry would be silently dead in any production deployment root_cause: .env.local.example was written from memory during execute-plan and never mechanically verified against actual process.env.* calls in the code. Var names diverged silently. rule: .env.local.example must be generated from the actual process.env.* calls in the code — not from memory. Every key must exactly match the string used in source. A mismatch between the example file and the actual code reference is a deploy blocker. improvement: commands/execute-plan.md: add mandatory final step — grep all process.env.* references in src/, extract variable names, and verify every name appears in .env.local.example. Any discrepancy is a blocking gap before execute-plan can be marked done. qa-agent: promote env var key name cross-check to a standalone QA dimension with explicit grep-based verification.
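
The grep-based cross-check could be approximated in code as in this sketch. `findMissingEnvKeys` is a hypothetical helper that takes file contents as strings (reading files from `src/` is left to the caller), and it only detects the `process.env.NAME` form, not bracket access such as `process.env["NAME"]`.

```typescript
function findMissingEnvKeys(sources: string[], exampleFile: string): string[] {
  // Collect every variable name referenced as process.env.NAME in the sources.
  const referenced = new Set<string>();
  for (const source of sources) {
    for (const match of source.matchAll(/process\.env\.([A-Za-z_][A-Za-z0-9_]*)/g)) {
      referenced.add(match[1]);
    }
  }
  // Collect keys declared in the example file, ignoring blanks and comments.
  const declared = new Set(
    exampleFile
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line !== "" && !line.startsWith("#"))
      .map((line) => line.split("=")[0].trim()),
  );
  // Any referenced key that is not declared is a blocking gap.
  return [...referenced].filter((key) => !declared.has(key)).sort();
}
```

Applied to the mismatch in this lesson, a source reading `process.env.POSTHOG_KEY` checked against an example file declaring only `NEXT_PUBLIC_POSTHOG_KEY` reports `POSTHOG_KEY` as missing.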

date: 2026-04-03 project: MoneyMirror (issue-009) issue: File size violations at deploy-check — parse/route.ts (345 lines) and dashboard/page.tsx (562 lines) exceeded 300-line limit; extraction required 3 stages after implementation root_cause: 300-line file limit is enforced mechanically at commit time but is not an active constraint during code generation. Large files are written without budgeting for size. rule: The 300-line file limit must be applied during code generation, not at commit time. Any route or page expected to contain multi-phase logic must be designed with extraction points upfront. Files projected to exceed 250 lines must be split before writing. improvement: commands/execute-plan.md: for any API route handling more than 2 logical phases, the route handler must delegate to helpers at generation time. Target: route files under 200 lines, page files under 250 lines. backend-engineer-agent + frontend-engineer-agent: if a file is projected to exceed 250 lines during generation, extract into a helper or sub-component before writing past that limit.

date: 2026-04-03 project: MoneyMirror (issue-009) issue: pdf-parse wrong result property — pdf-parser.ts called result.pages?.length; library exposes result.total, not result.pages.length; pageCount resolved to 1 for all documents root_cause: execute-plan agent generated code against training knowledge of the pdf-parse API without verifying the installed package version's exported interface. The library API changed between versions. rule: When generating code against a third-party package whose API has changed between major versions, verify the installed version's exported types or index against the generated call pattern. Training knowledge of library APIs is not sufficient for version-sensitive properties. improvement: commands/execute-plan.md: after wiring any third-party library for the first time, check the installed version in package.json and verify the exported API matches the generated usage pattern.

date: 2026-04-04 project: MoneyMirror Phase 2 (issue-009) issue: Three label columns (nickname, account_purpose, card_network) added to statements table mid-phase via ALTER TABLE (VIJ-24); production schema lagged behind deployed code root_cause: Architecture spec for Epic G1 (multi-account labelling) specified the UI feature (upload form fields) but did not enumerate the DB columns those fields persist to. The gap between UI design and persistence design was not caught until the execute-phase. rule: Any feature that adds user-facing input fields must enumerate all required DB columns in the architecture schema before execute-plan begins. Nullable column additions are treated as schema migrations requiring a production ALTER and must be included in schema.sql before the first deploy of the feature. improvement: backend-architect-agent.md Mandatory Pre-Approval Checklist: for every new user input field in the spec, require a corresponding column in the schema with type, nullability, and constraint. Any feature with new form inputs but no corresponding schema column is a blocking gap.

date: 2026-04-04 project: MoneyMirror Phase 2 (issue-009) issue: UploadPanel sends free-text metadata (nickname, account_purpose, card_network) to the parse API; server sanitizes invalid values to null silently; user believes label was saved but it was dropped root_cause: Architecture defined server-side sanitization helpers but did not define the client-server validation contract. No explicit rule required client-side enum pickers or server-side 4xx rejection for invalid enum values. rule: Any new input field that stores an enum value must use a client-side picker or select (not free text) AND the server must return a 4xx response on invalid enum input, not silently sanitize. Silent sanitization masks data quality failures and gives the user false confidence their input was saved. improvement: commands/execute-plan.md: for any new input field persisting an enum column, require (1) client picker/select enumeration, (2) server-side explicit validation with 4xx on invalid value, (3) schema CHECK constraint on the column. backend-architect-agent.md: when designing input fields, classify each field as free-text or enum and specify the validation contract for each.
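
Server-side rejection of invalid enum input might look like the sketch below. The `CARD_NETWORKS` values are invented for illustration, since the lesson does not list the real ones, and treating an omitted field as `null` is an assumption based on the columns being nullable.

```typescript
// Hypothetical enum values; the lesson does not list the real ones.
const CARD_NETWORKS = ["visa", "mastercard", "rupay"] as const;
type CardNetwork = (typeof CARD_NETWORKS)[number];

type ValidationResult =
  | { status: 200; value: CardNetwork | null }
  | { status: 400; error: string };

function validateCardNetwork(input: unknown): ValidationResult {
  // Assumption: an omitted field is allowed and stored as null.
  if (input === undefined || input === null) {
    return { status: 200, value: null };
  }
  if (typeof input === "string" && (CARD_NETWORKS as readonly string[]).includes(input)) {
    return { status: 200, value: input as CardNetwork };
  }
  // Invalid values are rejected loudly instead of being sanitized to null.
  return { status: 400, error: `card_network must be one of: ${CARD_NETWORKS.join(", ")}` };
}
```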

date: 2026-04-05 project: MoneyMirror Phase 3 (issue-010) issue: Dashboard headline totals and advisory inputs were derived from row-limited transaction queries, so they diverged from the true values for large scopes root_cause: Architecture did not state that monetary rollups for insights and advisories must use full-scope SQL aggregates; implementation reused list-query shapes that applied LIMIT. rule: For any finance dashboard, totals, category sums, and inputs to rules or AI facts must be computed with database aggregates (SUM/COUNT) over the full active user scope, never by summing a LIMIT-capped transaction scan. improvement: backend-architect-agent.md Mandatory Pre-Approval Checklist: financial headline metrics must name their aggregate strategy. backend-engineer-agent.md: never use LIMIT on the query path whose rows are summed into headline totals or advisory inputs.
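
To make the divergence concrete, here is a small sketch using an in-memory SQLite table as a stand-in for the real database. The table layout, the `amount_minor` column, and both function names are assumptions for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create 250 transactions of 100 minor units each for one user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (user_id TEXT, amount_minor INTEGER)")
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?)", [("u1", 100)] * 250
    )
    return conn


def total_from_list_query(conn: sqlite3.Connection, user_id: str, limit: int = 100) -> int:
    """Anti-pattern: sums only the rows a LIMIT-capped list query returns."""
    rows = conn.execute(
        "SELECT amount_minor FROM transactions WHERE user_id = ? LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return sum(row[0] for row in rows)


def total_from_aggregate(conn: sqlite3.Connection, user_id: str) -> tuple[int, int]:
    """Full-scope aggregate: returns (sum, row count) computed by the database."""
    return conn.execute(
        "SELECT COALESCE(SUM(amount_minor), 0), COUNT(*) "
        "FROM transactions WHERE user_id = ?",
        (user_id,),
    ).fetchone()


conn = build_demo_db()
print(total_from_list_query(conn, "u1"))  # 10000, understated by the LIMIT
print(total_from_aggregate(conn, "u1"))   # (25000, 250), the true total
```

With 250 rows and a list limit of 100, the summed list query reports 10000 while the aggregate reports 25000, which is the kind of gap this lesson describes.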

date: 2026-04-05 project: MoneyMirror Phase 3 (issue-010) issue: Merchant-key backfill loop could run until timeout when normalize returned null for rows still selected as needing keys root_cause: Batch repair assumed every selected row could be updated; no termination proof for permanently unresolvable rows. rule: Any cursor-based batch repair over rows with nullable derived fields must advance the cursor past rows that cannot be normalized in one pass, or mark them skipped, so the loop always terminates. improvement: backend-architect-agent.md: maintenance routes must document cursor monotonicity and poison-row handling. code-review-agent.md: flag while/for repair loops without exit on unprocessable rows.
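
The termination rule can be sketched with an in-memory stand-in for the table. Everything here (`backfill_merchant_keys`, the row shape, the `normalize` callback) is a hypothetical illustration of cursor monotonicity, not the project's actual maintenance route.

```python
from typing import Callable, Optional


def backfill_merchant_keys(
    rows: list[dict],
    normalize: Callable[[str], Optional[str]],
    batch_size: int = 2,
) -> tuple[int, int]:
    """Fill merchant_key where possible; return (updated, skipped) counts."""
    ordered = sorted(rows, key=lambda r: r["id"])
    cursor = 0  # id of the last row examined; it only ever increases
    updated = skipped = 0
    while True:
        # Select the next batch strictly after the cursor, not merely
        # "rows still missing a key", which would reselect poison rows forever.
        batch = [
            r for r in ordered if r["id"] > cursor and r["merchant_key"] is None
        ][:batch_size]
        if not batch:
            break
        for row in batch:
            key = normalize(row["description"])
            if key is None:
                skipped += 1  # unresolvable in this pass; move on
            else:
                row["merchant_key"] = key
                updated += 1
        cursor = batch[-1]["id"]  # advance even if nothing was updated
    return updated, skipped


rows = [
    {"id": 1, "description": "Swiggy", "merchant_key": None},
    {"id": 2, "description": "   ", "merchant_key": None},
    {"id": 3, "description": "Uber", "merchant_key": None},
    {"id": 4, "description": "", "merchant_key": None},
    {"id": 5, "description": "Amazon", "merchant_key": None},
]
print(backfill_merchant_keys(rows, lambda d: d.strip().lower() or None))  # (3, 2)
```

Because the cursor moves past every examined row, the loop runs at most once per row even when `normalize` returns `None` for some of them.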

date: 2026-04-05 project: MoneyMirror Phase 3 (issue-010) issue: Scope editor modal could overwrite active URL scope because local form state was not re-hydrated when canonical scope changed root_cause: Two sources of truth — URL vs local modal state — without sync on scope change. rule: When URL or search params define canonical scope or filters, edit dialogs and local editors must reset from parsed route state whenever the active scope changes. improvement: frontend-engineer-agent.md: re-initialize modal/local state from parsed search params on scope change.

date: 2026-04-05 project: MoneyMirror Phase 3 (issue-010) issue: Advisory strings used monthly language (×12 annualization, “/mo”, “this month”) when the active scope was multi-month or arbitrary root_cause: Copy was authored for a single-month frame without a spec rule tying phrases to scope duration. rule: Money and time phrases in analytics and advisories must match the active date scope. Do not annualize with ×12 unless the scope is exactly one month or the copy explicitly annualizes from a monthly estimate. improvement: product-agent.md / design-agent.md: use period-neutral default copy when the date range is user-configurable. code-review-agent.md: check financial copy against the active scope duration.
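
One way to tie copy to scope duration is to choose the phrase from the date range itself. This is a sketch under two assumptions: "exactly one month" is read as one full calendar month, and amounts are plain integers with no currency symbol. The function names are hypothetical.

```python
import calendar
from datetime import date


def is_one_calendar_month(start: date, end: date) -> bool:
    """True when the inclusive range [start, end] is one full calendar month."""
    last_day = calendar.monthrange(start.year, start.month)[1]
    return (
        start.day == 1
        and (end.year, end.month) == (start.year, start.month)
        and end.day == last_day
    )


def spend_phrase(amount: int, start: date, end: date) -> str:
    """Use monthly wording and x12 annualization only for a one-month scope."""
    if is_one_calendar_month(start, end):
        return f"{amount:,}/mo (about {amount * 12:,}/yr at this rate)"
    days = (end - start).days + 1  # inclusive day count
    return f"{amount:,} over {days} days"


print(spend_phrase(1500, date(2026, 3, 1), date(2026, 3, 31)))
print(spend_phrase(4500, date(2026, 1, 1), date(2026, 3, 31)))
```

For any other range the sketch falls back to period-neutral wording, so a three-month scope never gets a "/mo" label or a ×12 projection.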

date: 2026-04-05 project: MoneyMirror Phase 3 (issue-010) issue: Rapid scope changes could show stale dashboard data when an older fetch resolved after the fetch for a newer scope root_cause: Competing async loads without cancellation or stale-response guard. rule: Any user-triggered reload that can be superseded by a newer action must abort the superseded request with AbortController (or equivalent) and ignore the resulting AbortError. improvement: frontend-engineer-agent.md: explicit abort pattern for dashboard and scope-driven loads.
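
In the browser this rule is met with AbortController. The same "latest request wins" idea is sketched below in Python's asyncio, where task cancellation plays the role of abort and `CancelledError` the role of AbortError. `LatestOnlyLoader` and `fetch` are hypothetical names, and this is an analogy rather than the frontend implementation.

```python
import asyncio


class LatestOnlyLoader:
    """Keeps only the result of the most recently requested scope."""

    def __init__(self, fetch):
        self._fetch = fetch      # async callable: scope -> data
        self._task = None        # the in-flight load, if any
        self.data = None

    async def load(self, scope):
        # Cancel the superseded load (the analogue of controller.abort()).
        if self._task is not None and not self._task.done():
            self._task.cancel()
        task = asyncio.create_task(self._fetch(scope))
        self._task = task
        try:
            result = await task
        except asyncio.CancelledError:
            # Superseded by a newer load; ignore, like ignoring AbortError.
            # Simplification: this also swallows cancellation of the caller.
            return
        # Stale-response guard: only the latest task may publish its result.
        if task is self._task:
            self.data = result


async def demo():
    async def fetch(scope):
        # The older scope is slower, so it would otherwise resolve last.
        await asyncio.sleep(0.05 if scope == "old" else 0.01)
        return scope

    loader = LatestOnlyLoader(fetch)
    await asyncio.gather(loader.load("old"), loader.load("new"))
    return loader.data


print(asyncio.run(demo()))
```

The sketch combines both defenses named in the root cause: cancellation of the older load and an identity check before publishing data.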

date: 2026-04-05 project: MoneyMirror Phase 3 (issue-010) issue: Heavy authenticated read APIs (large scans, GROUP BY rollups) had no stated cost or abuse posture root_cause: Architecture specified auth and ownership but not pagination, rate limits, or an MVP trusted-client assumption for expensive reads. rule: For authenticated heavy-read endpoints, the architecture must document pagination or cursor guarantees, per-user rate limits, or an explicit MVP trusted-client / low-risk assumption. improvement: backend-architect-agent.md Mandatory Pre-Approval Checklist: every heavy-read endpoint must state its read strategy. Peer review cross-checks that the stated strategy exists or is waived explicitly.

date: 2026-04-07 project: MoneyMirror Gen Z clarity loop (issue-012) issue: Aggregate drill-through showed a subset view (single merchant) instead of the tapped cluster scope root_cause: Frontend implementation optimized for quick navigation but did not preserve aggregate semantics from summary row to downstream filter contract. rule: Any drill-through launched from an aggregate UI element (cluster/category/rollup) must pass the full aggregate filter set end-to-end (UI params -> API query -> rendered list). Substituting one representative row is a correctness bug. improvement: frontend-engineer-agent must require aggregate-preserving drill-through contracts; code-review-agent must add an explicit aggregate-to-detail integrity check.
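
A sketch of an aggregate-preserving drill-through contract, assuming the cluster is defined by a set of merchant keys plus a date range. The parameter names (`merchant_key`, `date_from`, `date_to`) and both functions are hypothetical; the point is that the detail list is filtered by the same set that produced the summary row.

```python
from urllib.parse import parse_qs, urlencode


def build_drill_through_query(merchant_keys, date_from, date_to):
    """Serialize every filter that defined the tapped aggregate row."""
    params = [("merchant_key", key) for key in sorted(merchant_keys)]
    params += [("date_from", date_from), ("date_to", date_to)]
    return urlencode(params)


def list_transactions(transactions, query):
    """Apply the full filter set from the query string, as the API would."""
    parsed = parse_qs(query)
    keys = set(parsed.get("merchant_key", []))
    date_from = parsed["date_from"][0]
    date_to = parsed["date_to"][0]
    # ISO dates compare correctly as strings.
    return [
        t for t in transactions
        if t["merchant_key"] in keys and date_from <= t["date"] <= date_to
    ]


TRANSACTIONS = [
    {"merchant_key": "swiggy", "date": "2026-03-02", "amount": 300},
    {"merchant_key": "zomato", "date": "2026-03-10", "amount": 450},
    {"merchant_key": "uber", "date": "2026-03-11", "amount": 200},
    {"merchant_key": "swiggy", "date": "2026-02-20", "amount": 150},
]

query = build_drill_through_query({"swiggy", "zomato"}, "2026-03-01", "2026-03-31")
detail = list_transactions(TRANSACTIONS, query)
print(query)
print(sum(t["amount"] for t in detail))  # should equal the cluster total shown in the summary
```

A useful integrity check for review is that the detail rows sum to the figure on the tapped aggregate row; passing only one representative merchant would break that equality.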

date: 2026-04-07 project: MoneyMirror Gen Z clarity loop (issue-012) issue: Guided review could display success UI on non-2xx API responses root_cause: UI completion state was tied to promise resolution rather than the HTTP success contract (response.ok), so failure responses were treated as completed actions. rule: Completion UIs for mutating actions must transition to success only on explicit success criteria (response.ok or equivalent). Non-2xx responses must keep the user in-flow with retryable error messaging. improvement: frontend-engineer-agent must enforce non-2xx handling in completion flows; code-review-agent/qa-agent should include this as a mandatory unhappy-path check.
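
The success criterion can be expressed as a small state transition keyed on the status code. In the fetch API, `response.ok` is true only for statuses 200 through 299; the sketch below mirrors that check. The state names and messages are hypothetical.

```python
def is_ok(status_code: int) -> bool:
    """Mirror of the fetch API's response.ok: true only for 200-299."""
    return 200 <= status_code <= 299


def next_review_state(status_code: int) -> dict:
    """Decide the completion UI state from the HTTP status, not from the
    mere fact that the request finished."""
    if is_ok(status_code):
        return {"state": "success", "can_retry": False, "message": "Review saved."}
    return {
        "state": "error",
        "can_retry": True,  # keep the user in-flow
        "message": f"Could not save (HTTP {status_code}). Please try again.",
    }


for status in (200, 204, 400, 500):
    print(status, next_review_state(status)["state"])
```

A request that completes with 400 or 500 still "resolves" from the caller's point of view, which is why the transition must inspect the status explicitly.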
