| 1 | +# Quill Readout — 2026-05-21: Scan Performance, Error Hardening & Gate Infrastructure |
| 2 | + |
| 3 | +## Session Summary |
| 4 | + |
| 5 | +A field test exposed three production bugs. All three were diagnosed, fixed, and deployed in the same session. The team also built out the NFR/KPI gate infrastructure, hardened error handling across the entire codebase, and shipped two new admin dashboards. |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## Part 1: Field Test — Five Scans, Five Failures |
| 10 | + |
| 11 | +The user walked outside and scanned the sign in front of their house five times. All five failed. The backend was healthy — OCR confidence 0.997, sign matched, rules evaluated correctly. The failures were entirely client-side. |
| 12 | + |
| 13 | +### Bug 1: Scan Confidence Gate (iOS Client) |
| 14 | + |
| 15 | +**Root cause:** Every scan that produced a valid `inline_evaluation` also carried `selected_match: null` in the server response. The server gated `cacheableMatchSignId` on `canSelectTopCandidate`, which is false whenever `assessCaptureSuggestion` returns `canOpenSuggestedSign: false`, and it always returns that when `hasInlineEvaluation: true`. With no selected match, the iOS client's `compositeConfidence()` computed `dbScore = 0.0`, yielding a composite of ≈ 0.45, below the 0.80 gate, so every scan returned `.lowConfidence` despite near-perfect OCR.
| 16 | + |
| 17 | +**Fix (server):** Decoupled `cacheableMatchSignId` from `canSelectTopCandidate`. `selected_match` now reflects "did we identify a sign?" independently of whether the UI should show a suggested sign button. |
| 18 | + |
| 19 | +**Fix (client):** When `inline_evaluation` is present, treat `dbScore = 1.0`. The server already evaluated the rules — that is an implicit high-confidence match. |
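The client-side change can be sketched as follows. This is a TypeScript illustration only, since the real client is Swift. The weights, the `ScanResponse` shape, and the `dbScore`/`passesGate` helpers are assumptions, chosen so that an OCR confidence of 0.997 with `dbScore = 0.0` lands near the 0.45 composite described above.

```typescript
// Hypothetical weights; the real compositeConfidence() may combine scores differently.
const OCR_WEIGHT = 0.45;
const DB_WEIGHT = 0.55;
const CONFIDENCE_GATE = 0.8; // gate value taken from the readout

// Assumed response shape, reduced to the fields discussed in the readout.
interface ScanResponse {
  ocr_confidence: number;
  selected_match: { sign_id: string } | null;
  inline_evaluation: object | null;
}

function dbScore(res: ScanResponse): number {
  // The server already evaluated the rules, so treat that as a confident match.
  if (res.inline_evaluation !== null) return 1.0;
  // Simplification: any selected match counts as a full score.
  return res.selected_match !== null ? 1.0 : 0.0;
}

function compositeConfidence(res: ScanResponse): number {
  return OCR_WEIGHT * res.ocr_confidence + DB_WEIGHT * dbScore(res);
}

function passesGate(res: ScanResponse): boolean {
  return compositeConfidence(res) >= CONFIDENCE_GATE;
}
```

Under these assumed weights, a response with `inline_evaluation` set and `selected_match: null` now clears the gate, while the same response without `inline_evaluation` still does not.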
| 20 | + |
| 21 | +### Bug 2: Rule Engine Permit Exception (Backend) |
| 22 | + |
| 23 | +**Root cause:** An earlier commit introduced `isPermitOnly` logic that treated any `no_parking` rule with `permit_types` as non-restrictive. This is semantically backwards: `permit_types` on a `no_parking` rule means "No Parking EXCEPT permit holders", so users without a permit are still restricted. The sign in front of the house has `permit_types: ["I"]` on its evening no-parking rule, and the rule engine was returning "green" instead of "red."
| 24 | + |
| 25 | +**Fix:** Removed `isPermitOnly` entirely. `isRestricted` is simply whether the winning rule is `no_parking` or `no_standing`. This fixed 40 failing parity/rule-engine tests that had been silently broken. |
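A minimal sketch of the simplified check, assuming a `Rule` shape with a `rule_type` field (the field name and the interface are hypothetical; only `no_parking`, `no_standing`, and `permit_types` come from the readout):

```typescript
// Assumed rule shape for illustration.
interface Rule {
  rule_type: string;        // e.g. "no_parking", "no_standing"
  permit_types?: string[];  // exempts permit holders; does not lift the restriction for anyone else
}

const RESTRICTIVE_TYPES = new Set(["no_parking", "no_standing"]);

function isRestricted(winningRule: Rule | null): boolean {
  if (winningRule === null) return false;
  // permit_types is deliberately ignored here: a regular user is still restricted.
  return RESTRICTIVE_TYPES.has(winningRule.rule_type);
}
```

With this shape, a `no_parking` rule carrying `permit_types: ["I"]` evaluates as restricted, which matches the expected "red" verdict for the sign in the field test.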
| 26 | + |
| 27 | +### Bug 3: isComplexSign Over-Detection (Backend) |
| 28 | + |
| 29 | +**Root cause:** The `isComplexSign` detection was flagging any sign with arrows (→, ←), multiple NO PARKING mentions, or permit exceptions as needing Gemini visual arrow detection. This routed every multi-panel sign — including the common 3-panel Raleigh R7-108 — to the 6-10s hybrid Gemini path instead of the ~1s client OCR fast path. |
| 30 | + |
| 31 | +**Fix:** If Vision extracted complete text (length > 80 chars AND has time ranges), trust it regardless of complexity indicators. The arrow symbols in the extracted text are sufficient for panel parsing. Gemini only needs to see the image when the text is short or incomplete. |
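The routing decision might look roughly like this. The function names and the time-range pattern are assumptions for illustration; only the 80-character threshold and the "has time ranges" condition come from the fix description.

```typescript
const MIN_COMPLETE_TEXT_LENGTH = 80; // threshold from the readout

// Hypothetical pattern for time ranges such as "8AM-6PM" or "7:30 AM to 9 PM".
const TIME_RANGE =
  /\b\d{1,2}(?::\d{2})?\s*(?:AM|PM)\s*(?:-|–|TO)\s*\d{1,2}(?::\d{2})?\s*(?:AM|PM)\b/i;

function hasCompleteText(ocrText: string): boolean {
  return ocrText.length > MIN_COMPLETE_TEXT_LENGTH && TIME_RANGE.test(ocrText);
}

// looksComplex stands in for the existing complexity indicators
// (arrows, repeated NO PARKING mentions, permit exceptions).
function needsGemini(ocrText: string, looksComplex: boolean): boolean {
  // Complete Vision text wins over complexity indicators.
  if (hasCompleteText(ocrText)) return false;
  return looksComplex;
}
```

A long multi-panel extraction with time ranges stays on the fast path even when it contains arrows, while a short or truncated extraction from a complex sign still escalates.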
| 32 | + |
| 33 | +**Result:** First-scan latency drops from ~10s to ~3-4s on multi-panel signs. Subsequent scans (fast-match cache) remain ~1-2s. |
| 34 | + |
| 35 | +--- |
| 36 | + |
| 37 | +## Part 2: Error Handling Audit — 29 Silent Catches Eliminated |
| 38 | + |
| 39 | +A code review revealed 29 instances of `catch { return null; }` across the codebase. This pattern hides failures, makes debugging impossible, and was the proximate cause of the fast-match cache silently failing on Vercel cold starts — causing every scan to fall back to full OCR with no log evidence. |
| 40 | + |
| 41 | +Every silent catch was classified. Twenty-eight now log before returning or falling back, and one was explicitly approved to remain silent:
| 42 | + |
| 43 | +| Category | Count | Fix | |
| 44 | +|----------|-------|-----| |
| 45 | +| JSON.parse on user/API input | 6 | `console.debug()` before returning 400/null | |
| 46 | +| Image processing fallbacks (sharp) | 4 | `console.warn()` before fallback | |
| 47 | +| Fingerprint computation | 2 | `console.warn()` — non-critical deduplication | |
| 48 | +| Edge Config unavailable | 2 | `console.debug()` — infrastructure optional | |
| 49 | +| OCR JSON parsing | 3 | `console.warn()` with raw response preview | |
| 50 | +| Supabase query failures | 4 | `console.error()` — on paths that matter | |
| 51 | +| Sign registry DB | 2 | `console.error()` — critical path | |
| 52 | +| Rate limit KV | 1 | `console.warn()` — documented fallback | |
| 53 | +| Misc (segments, nearby spots, cron) | 4 | `console.warn()` / `console.error()` | |
| 54 | +| **Approved silent** | 1 | `SILENT-CATCH-APPROVED: Winston` — `tryParseJson` retry loop | |
| 55 | + |
| 56 | +The coding standards were updated in both DecodeTheSign and agentic-stage-gate-governance to codify the rule: every catch must log or rethrow. Silent catches require a `SILENT-CATCH-APPROVED` comment with architect name, date, and specific justification. |
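As an illustration of the "log or rethrow" rule, here is a minimal sketch of the JSON-parse category. The helper name `parseJsonBody` and the 120-character preview length are hypothetical and not taken from the codebase.

```typescript
// Before: catch { return null; } left no trace of the failure.
// After: the same fallback, but the failure is logged first.
function parseJsonBody(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch (err) {
    // Log at debug level: bad user/API input is expected, but should be visible.
    console.debug("parseJsonBody: invalid JSON", {
      error: err instanceof Error ? err.message : String(err),
      preview: raw.slice(0, 120),
    });
    return null;
  }
}
```

The caller still receives `null` and can return a 400, so behavior is unchanged. The difference is that the failure now leaves evidence in the logs.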
| 57 | + |
| 58 | +--- |
| 59 | + |
| 60 | +## Part 3: NFR & KPI Gate Infrastructure |
| 61 | + |
| 62 | +### NFR Document Expanded |
| 63 | +`docs/NFRS.md` now has 30 NFRs across 6 categories (Performance, Accuracy, Reliability, Security, Accessibility, Scalability) with measurement methods, gate assignments, and an automated gate assertions table mapping 6 NFRs to passing CI tests. |
| 64 | + |
| 65 | +### Admin Dashboards |
| 66 | +Two new admin dashboards deployed to decodethesign.com: |
| 67 | + |
| 68 | +**`/admin/nfr-dashboard`** — Live pass/fail gate status for all NFRs and KPIs. Automated items (CI-backed) show PASS immediately. Manual items show PENDING with a checklist of what evidence is needed before G4 closes. |
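One way the dashboard's status logic could be modeled is sketched below. The `GateItem` shape, the FAIL state for a failing CI test, and the rule that a manual item passes once its checklist is empty are all assumptions; the readout only states that automated items show PASS and manual items show PENDING with a checklist.

```typescript
type GateStatus = "PASS" | "FAIL" | "PENDING";

// Hypothetical item shape for illustration.
interface GateItem {
  id: string;
  automated: boolean;       // backed by a CI test
  ciPassing?: boolean;      // only meaningful when automated
  evidenceNeeded: string[]; // outstanding manual evidence before the gate closes
}

function gateStatus(item: GateItem): GateStatus {
  if (item.automated) return item.ciPassing ? "PASS" : "FAIL";
  // Manual items stay PENDING until every piece of evidence is supplied.
  return item.evidenceNeeded.length === 0 ? "PASS" : "PENDING";
}
```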
| 69 | + |
| 70 | +**`/admin/adoption`** — DAU/WAU/MAU, 7-day retention, engagement depth (median scans/user, power users), DAU/scans bar charts (30d), geographic spread, user level distribution, top contributors leaderboard. |
| 71 | + |
| 72 | +### Governance Project Updated |
| 73 | +`agentic-stage-gate-governance` received: |
| 74 | +- `templates/NFR-TEMPLATE.md` — generic reusable NFR template with all 6 categories |
| 75 | +- `steering/05-nfr-kpi-mandate.md` — expanded with field-tested guidance on cold/warm latency tiers, parity tests, graceful degradation specificity |
| 76 | +- `steering/03-coding-standards.md` — error handling rule added to both projects |
| 77 | + |
| 78 | +--- |
| 79 | + |
| 80 | +## Deployment Status |
| 81 | + |
| 82 | +| Surface | Status | |
| 83 | +|---------|--------| |
| 84 | +| decodethesign.com | ✅ Deployed | |
| 85 | +| iPhone 16 Pro Max | ✅ Installed | |
| 86 | +| GitHub (main) | ✅ Pushed — HEAD `12e5275d` | |
| 87 | + |
| 88 | +--- |
| 89 | + |
| 90 | +## What Quill Noticed |
| 91 | + |
| 92 | +The session started with a field test failure and ended with a codebase that is measurably more observable. The three bugs were independent — a client confidence gate, a rule engine semantic error, and a latency regression — but they shared a common thread: silent failures. The confidence gate failed silently (no log). The rule engine returned wrong verdicts silently (no test caught it until today). The fast-match cache failed silently (swallowed by `catch { return null; }`). |
| 93 | + |
| 94 | +The error handling audit was the right call. Twenty-nine silent catches is not a small number. It means twenty-nine places where the system could fail and nobody would know. The fix is not just the logging — it's the standard that prevents the next twenty-nine from being written. |
| 95 | + |
| 96 | +The scan latency fix is also worth noting. The `isComplexSign` detection was written with good intent — route complex signs to better visual analysis. But it was too aggressive, and the cost was paid on every scan. The fix is precise: trust Vision when it did a good job, escalate to Gemini only when it didn't. That's the right tradeoff. |
| 97 | + |
| 98 | +TestFlight is next. |