|
| 1 | +# Drip Hardening Gaps And Roadmap |
| 2 | + |
| 3 | +## Context |
| 4 | + |
| 5 | +This document captures the current missing pieces in drip functionality (queue + web integration), focused on scalability, maintainability, and industry-standard operational behavior. |
| 6 | + |
| 7 | +Date: 2026-03-21 |
| 8 | +Scope: |
| 9 | + |
| 10 | +- `apps/queue/src/domain/process-drip.ts` |
| 11 | +- `apps/queue/src/domain/queries.ts` |
| 12 | +- `apps/web/graphql/courses/logic.ts` |
| 13 | + |
| 14 | +This list reflects remaining open hardening gaps on the current branch. |
| 15 | +Last updated after: |
| 16 | + |
| 17 | +- Domain-scoped membership query contract in queue drip flow. |
| 18 | +- Regression test coverage for domain-scoped membership lookup and multi-group drip email behavior. |
| 19 | + |
| 20 | +## 1) Duplicate Emails In Horizontal Scale (Critical) |
| 21 | + |
| 22 | +Current behavior: |
| 23 | + |
| 24 | +- Multiple queue instances can execute `processDrip()` concurrently. |
| 25 | +- They can compute same unlocked groups and queue duplicate emails. |
| 26 | +- User progress update is idempotent (`$addToSet`), email queueing is not. |
| 27 | + |
| 28 | +Impact: |
| 29 | + |
| 30 | +- Duplicate notifications to learners. |
| 31 | +- Hard-to-debug race conditions. |
| 32 | + |
| 33 | +Recommendation: |
| 34 | + |
| 35 | +- Add distributed lock for drip worker cycle (Redis lock / BullMQ job lock / leader election). |
| 36 | +- Add idempotency key for drip emails (for example: `drip:<domain>:<course>:<user>:<group>`). |
| 37 | + |
| 38 | +Acceptance criteria: |
| 39 | + |
| 40 | +- Running N workers concurrently does not create duplicate drip emails for same user/group release event. |
| 41 | + |
| 42 | +## 2) Non-Atomic Unlock + Notify Flow (High) |
| 43 | + |
| 44 | +Current behavior: |
| 45 | + |
| 46 | +- Unlock state is written first, email queueing happens afterwards. |
| 47 | +- If email enqueue fails, unlock succeeds but notification may be partially/fully lost. |
| 48 | + |
| 49 | +Impact: |
| 50 | + |
| 51 | +- Inconsistent learner communication. |
| 52 | +- No deterministic retry for missed notifications. |
| 53 | + |
| 54 | +Recommendation: |
| 55 | + |
| 56 | +- Introduce outbox pattern: |
| 57 | + - Persist release events atomically with unlock update. |
| 58 | + - Separate reliable sender consumes outbox with retries and idempotency. |
| 59 | + |
| 60 | +Acceptance criteria: |
| 61 | + |
| 62 | +- If queue/email subsystem is down temporarily, unlock notifications are eventually delivered without duplicates. |
| 63 | + |
| 64 | +## 3) Full Polling + In-Memory Expansion (High) |
| 65 | + |
| 66 | +Current behavior: |
| 67 | + |
| 68 | +- Every minute: |
| 69 | + - loads all courses with drip, |
| 70 | + - loads memberships per course, |
| 71 | + - loads users per course, |
| 72 | + - loops in application memory. |
| 73 | + |
| 74 | +Impact: |
| 75 | + |
| 76 | +- High DB load and memory pressure as tenants/courses/users scale. |
| 77 | +- Runtime grows with total data size, not just due work. |
| 78 | + |
| 79 | +Recommendation: |
| 80 | + |
| 81 | +- Move to due-work driven execution: |
| 82 | + - precompute next drip due per user/course purchase, or |
| 83 | + - enqueue per-user drip check jobs at enrollment/release events. |
| 84 | +- Use cursor/batch processing where full scan remains necessary. |
| 85 | + |
| 86 | +Acceptance criteria: |
| 87 | + |
| 88 | +- Runtime and DB pressure scale roughly with due drip events, not total historical volume. |
| 89 | + |
| 90 | +## 4) Missing/Weak Indexing For Drip Query Paths (Medium) |
| 91 | + |
| 92 | +Current behavior: |
| 93 | + |
| 94 | +- Frequent query shapes in drip path do not have explicit compound indexes aligned to usage. |
| 95 | + |
| 96 | +Primary candidates: |
| 97 | + |
| 98 | +- `Membership`: `(domain, entityType, entityId, status, userId)` |
| 99 | +- `User`: `(domain, userId)` (for bulk lookup by userIds in a domain) |
| 100 | +- `Course`: if keeping scan approach, index around drip-enabled groups and domain as feasible. |
| 101 | + |
| 102 | +Impact: |
| 103 | + |
| 104 | +- Degraded throughput and elevated DB CPU at scale. |
| 105 | + |
| 106 | +Recommendation: |
| 107 | + |
| 108 | +- Add and validate compound indexes for hot predicates. |
| 109 | +- Run explain plans before/after. |
| 110 | + |
| 111 | +Acceptance criteria: |
| 112 | + |
| 113 | +- Explain plans avoid broad collection scans for hot drip queries. |
| 114 | + |
| 115 | +## 5) Rank Reorder Semantics For Relative Drip (Medium) |
| 116 | + |
| 117 | +Current behavior: |
| 118 | + |
| 119 | +- Relative drip is rank-ordered each run. |
| 120 | +- If groups are reordered after enrollments, in-flight learner release path changes. |
| 121 | + |
| 122 | +Impact: |
| 123 | + |
| 124 | +- Learner-facing release schedule may shift unexpectedly. |
| 125 | +- Support burden and potential trust issues. |
| 126 | + |
| 127 | +Recommendation (choose one explicit product policy): |
| 128 | + |
| 129 | +- Policy A (simple): lock relative rank editing once enrollments exist. |
| 130 | +- Policy B (robust): persist per-user drip cursor/snapshot so future rank changes affect only new learners (or only after explicit migration). |
| 131 | + |
| 132 | +Acceptance criteria: |
| 133 | + |
| 134 | +- Reordering behavior is deterministic and documented. |
| 135 | + |
| 136 | +## 6) Multi-Email Burst For Same-Run Unlocks (Medium) |
| 137 | + |
| 138 | +Current behavior: |
| 139 | + |
| 140 | +- If multiple groups unlock in one run and each has drip email configured, user gets multiple emails. |
| 141 | + |
| 142 | +Impact: |
| 143 | + |
| 144 | +- Notification fatigue. |
| 145 | + |
| 146 | +Recommendation: |
| 147 | + |
| 148 | +- Add configurable notification policy: |
| 149 | + - `per_group` (current), |
| 150 | + - `digest_per_run` (recommended default for larger schools). |
| 151 | +- If digest enabled, provide editable digest template with localization strategy. |
| 152 | + |
| 153 | +Acceptance criteria: |
| 154 | + |
| 155 | +- Multi-unlock runs follow configured policy and are test-covered. |
| 156 | + |
| 157 | +## 7) Data Validation Hardening For Drip Inputs (Medium) |
| 158 | + |
| 159 | +Current behavior: |
| 160 | + |
| 161 | +- Some server-side constraints are implicit/incomplete (for example, exact-date vs relative-date required fields). |
| 162 | + |
| 163 | +Impact: |
| 164 | + |
| 165 | +- Invalid drip configs can be stored and silently skipped. |
| 166 | + |
| 167 | +Recommendation: |
| 168 | + |
| 169 | +- Enforce stricter validation in `updateGroup`: |
| 170 | + - `relative-date` requires numeric `delayInMillis >= 0`, |
| 171 | + - `exact-date` requires valid numeric `dateInUTC`, |
| 172 | + - optional sanity checks for email schema consistency. |
| 173 | + |
| 174 | +Acceptance criteria: |
| 175 | + |
| 176 | +- Invalid drip payloads are rejected with explicit errors. |
| 177 | + |
| 178 | +## 8) Observability Gaps For Release Lifecycle (Medium) |
| 179 | + |
| 180 | +Current behavior: |
| 181 | + |
| 182 | +- Error capture exists, but release lifecycle visibility is limited. |
| 183 | + |
| 184 | +Recommendation: |
| 185 | + |
| 186 | +- Add structured counters/events: |
| 187 | + - `drip_courses_scanned` |
| 188 | + - `drip_users_evaluated` |
| 189 | + - `drip_groups_unlocked` |
| 190 | + - `drip_emails_queued` |
| 191 | + - `drip_emails_failed` |
| 192 | + - `drip_loop_duration_ms` |
| 193 | +- Add per-domain dimensions where safe. |
| 194 | + |
| 195 | +Acceptance criteria: |
| 196 | + |
| 197 | +- Can answer: "How many unlocks and drip emails happened per domain/day and failure rate?" |
| 198 | + |
| 199 | +## 9) Test Coverage Still Missing For Some Hard Cases (Medium) |
| 200 | + |
| 201 | +Current behavior: |
| 202 | + |
| 203 | +- Unit-level coverage has improved significantly, but concurrency/idempotency/outbox/reorder-policy and digest policy are not covered yet. |
| 204 | + |
| 205 | +Recommendation: |
| 206 | + |
| 207 | +- Add tests for: |
| 208 | + - concurrent worker execution idempotency, |
| 209 | + - outbox retry semantics, |
| 210 | + - rank reorder policy behavior, |
| 211 | + - digest mode behavior. |
| 212 | + |
| 213 | +Acceptance criteria: |
| 214 | + |
| 215 | +- Regression suite catches duplicate-email races and policy regressions. |
| 216 | + |
| 217 | +## Proposed Implementation Plan |
| 218 | + |
| 219 | +## Phase 1 (P0 - Safety) |
| 220 | + |
| 221 | +- Add strict drip input validation in `updateGroup`. |
| 222 | +- Add key indexes for hot query paths. |
| 223 | +- Add release metrics counters. |
| 224 | + |
| 225 | +## Phase 2 (P1 - Correctness Under Scale) |
| 226 | + |
| 227 | +- Introduce distributed locking for drip worker cycle. |
| 228 | +- Add email idempotency key to avoid duplicate sends. |
| 229 | + |
| 230 | +## Phase 3 (P1/P2 - Reliability) |
| 231 | + |
| 232 | +- Implement outbox for unlock-notification events. |
| 233 | +- Add sender worker retry + dead-letter handling. |
| 234 | + |
| 235 | +## Phase 4 (P2 - Product Policy) |
| 236 | + |
| 237 | +- Decide and implement rank-reorder semantics for relative drip. |
| 238 | +- Add digest mode and editable template/localization design. |
| 239 | + |
| 240 | +## Suggested Ownership Split |
| 241 | + |
| 242 | +- Queue worker correctness/scalability: `apps/queue` |
| 243 | +- GraphQL validation + admin constraints: `apps/web/graphql` |
| 244 | +- Schema/index migrations: `packages/common-logic`, `packages/orm-models`, migration scripts |
| 245 | +- Metrics/dashboarding: observability owner |
| 246 | + |
| 247 | +## Decision Log Needed |
| 248 | + |
| 249 | +Before implementation, confirm: |
| 250 | + |
| 251 | +- Reorder policy for in-flight learners (lock vs cursor snapshot). |
| 252 | +- Email policy for multi-unlock runs (per-group vs digest). |
| 253 | +- Preferred reliability model (direct enqueue with idempotency vs outbox). |
| 254 | + |
| 255 | +## References |
| 256 | + |
| 257 | +- `apps/queue/src/domain/process-drip.ts` |
| 258 | +- `apps/queue/src/domain/queries.ts` |
| 259 | +- `apps/web/graphql/courses/logic.ts` |
| 260 | +- `apps/queue/src/domain/__tests__/process-drip.test.ts` |
| 261 | +- `apps/web/graphql/courses/__tests__/update-group-drip.test.ts` |
0 commit comments