dispatch: May 21 — scan perf, error hardening, gate infrastructure

Douglas Jones · Douglas Jones · commit 33e6af8f2f05 · 2026-05-21T16:07:27.000-04:00
diff --git a/dispatches/2026-05-21-scan-perf-error-hardening.readout.md b/dispatches/2026-05-21-scan-perf-error-hardening.readout.md
@@ -0,0 +1,98 @@
+# Quill Readout — 2026-05-21: Scan Performance, Error Hardening & Gate Infrastructure
+
+## Session Summary
+
+A field test exposed three production bugs. All three were diagnosed, fixed, and deployed in the same session. The team also built out the NFR/KPI gate infrastructure, hardened error handling across the entire codebase, and shipped two new admin dashboards.
+
+---
+
+## Part 1: Field Test — Five Scans, Five Failures
+
+The user walked outside and scanned the sign in front of their house five times. All five scans failed. The backend responded normally: OCR confidence was 0.997, the sign matched, and an inline evaluation came back. The failures all surfaced in the iOS client, although diagnosing them also uncovered two backend bugs (Bugs 2 and 3 below).
+
+### Bug 1: Scan Confidence Gate (iOS Client)
+
+**Root cause:** Every scan that produced a valid `inline_evaluation` also had `selected_match: null` in the server response. The cause: `cacheableMatchSignId` was gated on `canSelectTopCandidate`, which is false whenever `assessCaptureSuggestion` returns `canOpenSuggestedSign: false`, and it always returns that when `hasInlineEvaluation: true`. With no selected match, the iOS client's `compositeConfidence()` computed `dbScore = 0.0`, yielding a composite of ≈ 0.45, below the 0.80 gate, and returned `.lowConfidence` on every scan despite near-perfect OCR.
+
+**Fix (server):** Decoupled `cacheableMatchSignId` from `canSelectTopCandidate`. `selected_match` now reflects "did we identify a sign?" independently of whether the UI should show a suggested sign button.
+
+**Fix (client):** When `inline_evaluation` is present, treat `dbScore` as 1.0. The server has already evaluated the rules, which implies a high-confidence match.
+
+### Bug 2: Rule Engine Permit Exception (Backend)
+
+**Root cause:** An earlier commit had introduced `isPermitOnly` logic that treated any `no_parking` rule with `permit_types` as non-restrictive. This is semantically backwards: `permit_types` on a `no_parking` rule means "No Parking EXCEPT permit holders", so regular users are still restricted. The sign in front of the house has `permit_types: ["I"]` on its evening no-parking rule, and the rule engine was returning "green" instead of "red."
+
+**Fix:** Removed `isPermitOnly` entirely. `isRestricted` is now simply whether the winning rule is `no_parking` or `no_standing`. This also restored 40 parity/rule-engine tests that had been failing unnoticed.
+
+### Bug 3: isComplexSign Over-Detection (Backend)
+
+**Root cause:** The `isComplexSign` detection was flagging any sign with arrows (→, ←), multiple NO PARKING mentions, or permit exceptions as needing Gemini visual arrow detection. This routed every multi-panel sign — including the common 3-panel Raleigh R7-108 — to the 6-10s hybrid Gemini path instead of the ~1s client OCR fast path.
+
+**Fix:** If Vision extracted complete text (length > 80 chars AND has time ranges), trust it regardless of complexity indicators. The arrow symbols in the extracted text are sufficient for panel parsing. Gemini only needs to see the image when the text is short or incomplete.
+
+**Result:** First-scan latency drops from ~10s to ~3-4s on multi-panel signs. Subsequent scans (fast-match cache) remain ~1-2s.
+
+---
+
+## Part 2: Error Handling Audit — 29 Silent Catches Eliminated
+
+A code review revealed 29 instances of `catch { return null; }` across the codebase. This pattern hides failures and leaves nothing to debug from. It is also why the fast-match cache failed on Vercel cold starts without leaving any log evidence while every scan fell back to full OCR.
+
+Every silent catch was classified, then either given logging or explicitly approved to stay silent:
+
+| Category | Count | Fix |
+|----------|-------|-----|
+| JSON.parse on user/API input | 6 | `console.debug()` before returning 400/null |
+| Image processing fallbacks (sharp) | 4 | `console.warn()` before fallback |
+| Fingerprint computation | 2 | `console.warn()` — non-critical deduplication |
+| Edge Config unavailable | 2 | `console.debug()` — infrastructure optional |
+| OCR JSON parsing | 3 | `console.warn()` with raw response preview |
+| Supabase query failures | 4 | `console.error()` — on paths that matter |
+| Sign registry DB | 2 | `console.error()` — critical path |
+| Rate limit KV | 1 | `console.warn()` — documented fallback |
+| Misc (segments, nearby spots, cron) | 4 | `console.warn()` / `console.error()` |
+| **Approved silent** | 1 | `SILENT-CATCH-APPROVED: Winston` — `tryParseJson` retry loop |
+
+The coding standards were updated in both DecodeTheSign and agentic-stage-gate-governance to codify the rule: every catch must log or rethrow. Silent catches require a `SILENT-CATCH-APPROVED` comment with architect name, date, and specific justification.
+
+---
+
+## Part 3: NFR & KPI Gate Infrastructure
+
+### NFR Document Expanded
+`docs/NFRS.md` now has 30 NFRs across 6 categories (Performance, Accuracy, Reliability, Security, Accessibility, Scalability) with measurement methods, gate assignments, and an automated gate assertions table mapping 6 NFRs to passing CI tests.
+
+### Admin Dashboards
+Two new admin dashboards deployed to decodethesign.com:
+
+**`/admin/nfr-dashboard`** — Live pass/fail gate status for all NFRs and KPIs. Automated items (CI-backed) show PASS immediately. Manual items show PENDING with a checklist of what evidence is needed before G4 closes.
+
+**`/admin/adoption`** — DAU/WAU/MAU, 7-day retention, engagement depth (median scans/user, power users), DAU/scans bar charts (30d), geographic spread, user level distribution, top contributors leaderboard.
+
+### Governance Project Updated
+`agentic-stage-gate-governance` received:
+- `templates/NFR-TEMPLATE.md`: generic reusable NFR template with all 6 categories
+- `steering/05-nfr-kpi-mandate.md`: expanded with field-tested guidance on cold/warm latency tiers, parity tests, and graceful degradation specificity
+- `steering/03-coding-standards.md`: error handling rule added (mirrored in DecodeTheSign's `.kiro/steering/03-coding-standards.md`)
+
+---
+
+## Deployment Status
+
+| Surface | Status |
+|---------|--------|
+| decodethesign.com | ✅ Deployed |
+| iPhone 16 Pro Max | ✅ Installed |
+| GitHub (main) | ✅ Pushed — HEAD `12e5275d` |
+
+---
+
+## What Quill Noticed
+
+The session started with a field test failure and ended with a codebase that is measurably more observable. The three bugs were independent (a client confidence gate, a rule engine semantic error, and a latency regression), but they shared a common thread: silent failures. The confidence gate failed silently (no log). The rule engine returned wrong verdicts silently (the tests that covered it had been failing unnoticed). The fast-match cache failed silently (swallowed by `catch { return null; }`).
+
+The error handling audit was the right call. Twenty-nine silent catches is not a small number. It means twenty-nine places where the system could fail and nobody would know. The fix is not just the logging — it's the standard that prevents the next twenty-nine from being written.
+
+The scan latency fix is also worth noting. The `isComplexSign` detection was written with good intent — route complex signs to better visual analysis. But it was too aggressive, and the cost was paid on every scan. The fix is precise: trust Vision when it did a good job, escalate to Gemini only when it didn't. That's the right tradeoff.
+
+TestFlight is next.
diff --git a/dispatches/2026-05-21-scan-perf-error-hardening.yaml b/dispatches/2026-05-21-scan-perf-error-hardening.yaml
@@ -0,0 +1,93 @@
+glyph: quill
+kind: readout
+date: "2026-05-21"
+title: "Scan Performance, Error Hardening & Gate Infrastructure"
+scope:
+  - decodethesign-web
+  - decodethesign-ios
+  - agentic-stage-gate-governance
+tags:
+  - bug-fix
+  - performance
+  - error-handling
+  - nfr
+  - admin-dashboard
+  - testflight-prep
+summary: >
+  Field test exposed three production bugs: iOS confidence gate failing on every scan,
+  rule engine returning wrong verdicts for permit-exception signs, and isComplexSign
+  over-detection routing all multi-panel signs to 10s Gemini path. All three fixed and
+  deployed. 29 silent catch blocks eliminated across 17 files — every catch now logs
+  or rethrows. NFR/KPI gate infrastructure built: 30 NFRs documented, two admin
+  dashboards deployed (/admin/nfr-dashboard, /admin/adoption), governance project
+  updated with NFR template and error handling standard.
+artifacts:
+  - path: decodethesign/app/api/consumer/live/[liveId]/capture/route.ts
+    action: modified
+    notes: "Decouple selected_match from canSelectTopCandidate; fix isComplexSign; log silent catches"
+  - path: decodethesign/ios-native/DecodeTheSignPackage/Sources/DecodeTheSignFeature/Services/ScanResultParser.swift
+    action: modified
+    notes: "inline_evaluation present implies dbScore=1.0"
+  - path: decodethesign/lib/rule-engine.ts
+    action: modified
+    notes: "Remove isPermitOnly — no_parking with permit_types is still red"
+  - path: decodethesign/lib/consumer-capture-recent-match.ts
+    action: modified
+    notes: "Log fast-match query failures instead of swallowing"
+  - path: decodethesign/lib/consumer-crowdsource-intake.ts
+    action: modified
+    notes: "Log sharp and fingerprint failures"
+  - path: decodethesign/lib/sign-registry.ts
+    action: modified
+    notes: "Log DB query failures on critical path"
+  - path: decodethesign/lib/rate-limit.ts
+    action: modified
+    notes: "Log KV unavailability"
+  - path: decodethesign/lib/ocr/arrow-detect.ts
+    action: modified
+    notes: "Log Gemini JSON parse failures with raw preview"
+  - path: decodethesign/lib/ocr/extract.ts
+    action: modified
+    notes: "Log tiebreaker failures; SILENT-CATCH-APPROVED for tryParseJson"
+  - path: decodethesign/lib/ocr/prior-sign-hashes.ts
+    action: modified
+    notes: "Log Supabase query failure"
+  - path: decodethesign/lib/ocr/reconsensus.ts
+    action: modified
+    notes: "Log hash computation failure"
+  - path: decodethesign/lib/edge-config.ts
+    action: modified
+    notes: "Log Edge Config unavailability at debug level"
+  - path: decodethesign/app/admin/nfr-dashboard/page.tsx
+    action: created
+    notes: "Live NFR & KPI gate dashboard"
+  - path: decodethesign/app/admin/adoption/page.tsx
+    action: created
+    notes: "DAU/WAU/MAU, retention, engagement, geo spread"
+  - path: decodethesign/docs/NFRS.md
+    action: expanded
+    notes: "30 NFRs, automated gate assertions table"
+  - path: decodethesign/.kiro/steering/03-coding-standards.md
+    action: modified
+    notes: "Error handling rule: no silent catches"
+  - path: agentic-stage-gate-governance/templates/NFR-TEMPLATE.md
+    action: created
+    notes: "Generic reusable NFR template"
+  - path: agentic-stage-gate-governance/steering/03-coding-standards.md
+    action: modified
+    notes: "Error handling rule added"
+  - path: agentic-stage-gate-governance/steering/05-nfr-kpi-mandate.md
+    action: expanded
+    notes: "Field-tested guidance on latency tiers, parity tests, graceful degradation"
+bugs_fixed:
+  - id: scan-confidence-gate
+    severity: critical
+    description: "selected_match was null whenever inline_evaluation was present, so dbScore=0.0 and composite 0.45 < 0.80 gate"
+  - id: rule-engine-permit-types
+    severity: critical
+    description: "no_parking with permit_types returned green instead of red — 40 tests broken"
+  - id: isComplexSign-over-detection
+    severity: high
+    description: "All multi-panel signs routed to 10s Gemini path instead of 1s client OCR"
+silent_catches_fixed: 29
+tests_passing: 1007
diff --git a/dispatches/2026-05-21-session-close.readout.md b/dispatches/2026-05-21-session-close.readout.md
@@ -0,0 +1,35 @@
+# Quill Session Close — 2026-05-21
+
+## Session Outcome
+
+Three production bugs diagnosed and fixed from a field test. 29 silent catch blocks eliminated. NFR/KPI gate infrastructure built. Two admin dashboards deployed. Error handling standard codified in both projects. All 1007 tests passing. Session state saved.
+
+## Repositories Pushed
+
+| Repo | HEAD | Status |
+|------|------|--------|
+| decodethesign | `12e5275d` | ✅ Pushed to origin/main + deployed to decodethesign.com |
+| agentic-stage-gate-governance | `7dd6074` | ✅ Pushed to origin/trunk |
+
+## Commits This Session (decodethesign)
+
+| Hash | Description |
+|------|-------------|
+| `518edaa4` | fix: scan confidence gate — inline_evaluation implies dbScore=1.0 |
+| `225be514` | fix(rule-engine): no_parking with permit_types is still red |
+| `8f1b3899` | feat(ios): share location via Messages, tell-a-friend unified, profile polish |
+| `225be514` | fix(rule-engine): permit_types |
+| `6b5d4548` | docs: expand NFRs with TestFlight gate |
+| `b2c4936f` | feat(admin): NFR & KPI dashboard |
+| `d1bfc789` | feat(admin): adoption dashboard |
+| `a1221304` | perf: fix isComplexSign over-detection |
+| `ddb34267` | standards: no silent catch |
+| `47673651` | fix: eliminate all silent catch blocks |
+| `12e5275d` | session: save state |
+
+## Open Items for Next Session
+
+1. **TestFlight submission** — Xcode Archive → App Store Connect → TestFlight
+2. **Manual NFR evidence** — 14 items pending on `/admin/nfr-dashboard`
+3. **More screenshots** — drop into `assets/`
+4. **Supabase type regeneration** — remove `ignoreBuildErrors: true`
diff --git a/dispatches/2026-05-21-session-close.yaml b/dispatches/2026-05-21-session-close.yaml
@@ -0,0 +1,13 @@
+glyph: quill
+kind: session-close
+date: "2026-05-21"
+title: "Session Close — Scan Perf, Error Hardening & Gate Infrastructure"
+outcome: success
+repos_pushed:
+  - decodethesign
+  - agentic-stage-gate-governance
+next_session:
+  - TestFlight submission (Xcode Archive → App Store Connect)
+  - Manual NFR evidence (14 items pending on /admin/nfr-dashboard)
+  - More app screenshots
+  - Supabase type regeneration