Skip to content

Commit 1ceaf9a

Browse files
rajat1saxenaRajat
andauthored
Queue observability via Posthog (#745)
* bug fix: drip settings cannot be switched off * Test suite parallelization fix * posthog observability for queue; drip refactoring; * Changed messaging * Codex drip code review fix --------- Co-authored-by: Rajat <hi@rajatsaxena.dev>
1 parent ec366d5 commit 1ceaf9a

File tree

100 files changed

+4818
-968
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

100 files changed

+4818
-968
lines changed

apps/docs/src/pages/en/courses/section.md

Lines changed: 8 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -55,22 +55,26 @@ If drip configuration is enabled for a section, a student won't be able to acces
5555
### Drip by Date
5656

5757
1. If you want a section to be available to users on a specific date, this is the option you should opt for.
58+
2. Exact-date sections unlock only when their chosen date and time arrives.
59+
3. Unlocking an exact-date section does not change the timing of other relative drip sections.
5860

5961
![Drip by Date](/assets/products/drip-by-date.png)
6062

61-
2. Select the date on which this section will be dripped.
62-
3. Click `Continue` to save it.
63+
4. Select the date on which this section will be dripped.
64+
5. Click `Continue` to save it.
6365

6466
### Drip After a Certain Number of Days From Last Dripped Content
6567

6668
1. If you want a section to be available to users after a certain number of days have elapsed since the last dripped content, this is the option you should opt for.
69+
2. Relative-date sections are released in section order. A later relative section waits for the earlier relative section before its own delay begins.
70+
3. The first relative section counts from the student's enrollment date. After that, each newly released relative section becomes the anchor for the next relative section.
6771

6872
> For the first dripped section, the date of enrollment will be considered the last dripped content date.
6973
7074
![Drip After a Certain Number of Days Have Elapsed](/assets/products/drip-by-specific-days.png)
7175

72-
2. Select the number of days.
73-
3. Click `Continue` to save it.
76+
4. Select the number of days.
77+
5. Click `Continue` to save it.
7478

7579
### Notify Users When a Section Has Dripped
7680

apps/queue/.env

Lines changed: 26 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,36 @@
1+
# Port for the queue service to listen on
12
PORT=4000
3+
4+
# email configuration for sending notifications (e.g., course updates, reminders)
25
EMAIL_USER=email_user
36
EMAIL_PASS=email_pass
47
EMAIL_HOST=email_host
58
EMAIL_FROM=no-reply@example.com
9+
10+
# MongoDB connection string for storing queue data and job information
611
DB_CONNECTION_STRING=mongodb://db.string
12+
13+
# Redis configuration for caching and job queue management
714
REDIS_HOST=localhost
815
REDIS_PORT=6379
16+
17+
# Maximum number of times to retry a failed job before marking it as failed
918
SEQUENCE_BOUNCE_LIMIT=3
19+
20+
# Domain for constructing URLs in notifications and links
1021
DOMAIN=courselit.app
11-
PIXEL_SIGNING_SECRET=super_secret_string
22+
23+
# Secret key for signing tracking pixels (used in email notifications)
24+
PIXEL_SIGNING_SECRET=super_secret_string
25+
26+
# Optional: enables PostHog exception tracking when set.
27+
POSTHOG_API_KEY=
28+
29+
# Optional: PostHog host URL (default shown here).
30+
POSTHOG_HOST=https://us.i.posthog.com
31+
32+
# Optional: per-source exception cap per minute (default: 100).
33+
POSTHOG_ERROR_CAP_PER_SOURCE_PER_MINUTE=100
34+
35+
# Optional: deployment environment label sent in telemetry (default: unknown).
36+
DEPLOY_ENV=local

apps/queue/README.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,10 @@ The following environment variables are used by the queue service:
2121
- `EMAIL_PORT` - SMTP server port (default: `587`)
2222
- `PORT` - HTTP server port (default: `80`)
2323
- `NODE_ENV` - Environment mode. When set to `production`, emails are actually sent; otherwise they are only logged
24+
- `POSTHOG_API_KEY` - Enables PostHog error tracking when set
25+
- `POSTHOG_HOST` - PostHog host URL (default: `https://us.i.posthog.com`)
26+
- `POSTHOG_ERROR_CAP_PER_SOURCE_PER_MINUTE` - Per-source exception cap (default: `100`)
27+
- `DEPLOY_ENV` - Deployment environment label used in telemetry (default: `unknown`)
2428
- `SEQUENCE_BOUNCE_LIMIT` - Maximum number of bounces allowed for email sequences (default: `3`)
2529
- `PROTOCOL` - Protocol used for generating site URLs (default: `https`)
2630
- `DOMAIN` - Base domain name for generating site URLs
Lines changed: 261 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,261 @@
1+
# Drip Hardening Gaps And Roadmap
2+
3+
## Context
4+
5+
This document captures the current missing pieces in drip functionality (queue + web integration), focused on scalability, maintainability, and industry-standard operational behavior.
6+
7+
Date: 2026-03-21
8+
Scope:
9+
10+
- `apps/queue/src/domain/process-drip.ts`
11+
- `apps/queue/src/domain/queries.ts`
12+
- `apps/web/graphql/courses/logic.ts`
13+
14+
This list reflects remaining open hardening gaps on the current branch.
15+
Last updated after:
16+
17+
- Domain-scoped membership query contract in queue drip flow.
18+
- Regression test coverage for domain-scoped membership lookup and multi-group drip email behavior.
19+
20+
## 1) Duplicate Emails In Horizontal Scale (Critical)
21+
22+
Current behavior:
23+
24+
- Multiple queue instances can execute `processDrip()` concurrently.
25+
- They can compute same unlocked groups and queue duplicate emails.
26+
- User progress update is idempotent (`$addToSet`), email queueing is not.
27+
28+
Impact:
29+
30+
- Duplicate notifications to learners.
31+
- Hard-to-debug race conditions.
32+
33+
Recommendation:
34+
35+
- Add distributed lock for drip worker cycle (Redis lock / BullMQ job lock / leader election).
36+
- Add idempotency key for drip emails (for example: `drip:<domain>:<course>:<user>:<group>`).
37+
38+
Acceptance criteria:
39+
40+
- Running N workers concurrently does not create duplicate drip emails for same user/group release event.
41+
42+
## 2) Non-Atomic Unlock + Notify Flow (High)
43+
44+
Current behavior:
45+
46+
- Unlock state is written first, email queueing happens afterwards.
47+
- If email enqueue fails, unlock succeeds but notification may be partially/fully lost.
48+
49+
Impact:
50+
51+
- Inconsistent learner communication.
52+
- No deterministic retry for missed notifications.
53+
54+
Recommendation:
55+
56+
- Introduce outbox pattern:
57+
- Persist release events atomically with unlock update.
58+
- Separate reliable sender consumes outbox with retries and idempotency.
59+
60+
Acceptance criteria:
61+
62+
- If queue/email subsystem is down temporarily, unlock notifications are eventually delivered without duplicates.
63+
64+
## 3) Full Polling + In-Memory Expansion (High)
65+
66+
Current behavior:
67+
68+
- Every minute:
69+
- loads all courses with drip,
70+
- loads memberships per course,
71+
- loads users per course,
72+
- loops in application memory.
73+
74+
Impact:
75+
76+
- High DB load and memory pressure as tenants/courses/users scale.
77+
- Runtime grows with total data size, not just due work.
78+
79+
Recommendation:
80+
81+
- Move to due-work driven execution:
82+
- precompute next drip due per user/course purchase, or
83+
- enqueue per-user drip check jobs at enrollment/release events.
84+
- Use cursor/batch processing where full scan remains necessary.
85+
86+
Acceptance criteria:
87+
88+
- Runtime and DB pressure scale roughly with due drip events, not total historical volume.
89+
90+
## 4) Missing/Weak Indexing For Drip Query Paths (Medium)
91+
92+
Current behavior:
93+
94+
- Frequent query shapes in drip path do not have explicit compound indexes aligned to usage.
95+
96+
Primary candidates:
97+
98+
- `Membership`: `(domain, entityType, entityId, status, userId)`
99+
- `User`: `(domain, userId)` (for bulk lookup by userIds in a domain)
100+
- `Course`: if keeping scan approach, index around drip-enabled groups and domain as feasible.
101+
102+
Impact:
103+
104+
- Degraded throughput and elevated DB CPU at scale.
105+
106+
Recommendation:
107+
108+
- Add and validate compound indexes for hot predicates.
109+
- Run explain plans before/after.
110+
111+
Acceptance criteria:
112+
113+
- Explain plans avoid broad collection scans for hot drip queries.
114+
115+
## 5) Rank Reorder Semantics For Relative Drip (Medium)
116+
117+
Current behavior:
118+
119+
- Relative drip is rank-ordered each run.
120+
- If groups are reordered after enrollments, in-flight learner release path changes.
121+
122+
Impact:
123+
124+
- Learner-facing release schedule may shift unexpectedly.
125+
- Support burden and potential trust issues.
126+
127+
Recommendation (choose one explicit product policy):
128+
129+
- Policy A (simple): lock relative rank editing once enrollments exist.
130+
- Policy B (robust): persist per-user drip cursor/snapshot so future rank changes affect only new learners (or only after explicit migration).
131+
132+
Acceptance criteria:
133+
134+
- Reordering behavior is deterministic and documented.
135+
136+
## 6) Multi-Email Burst For Same-Run Unlocks (Medium)
137+
138+
Current behavior:
139+
140+
- If multiple groups unlock in one run and each has drip email configured, user gets multiple emails.
141+
142+
Impact:
143+
144+
- Notification fatigue.
145+
146+
Recommendation:
147+
148+
- Add configurable notification policy:
149+
- `per_group` (current),
150+
- `digest_per_run` (recommended default for larger schools).
151+
- If digest enabled, provide editable digest template with localization strategy.
152+
153+
Acceptance criteria:
154+
155+
- Multi-unlock runs follow configured policy and are test-covered.
156+
157+
## 7) Data Validation Hardening For Drip Inputs (Medium)
158+
159+
Current behavior:
160+
161+
- Some server-side constraints are implicit/incomplete (for example, exact-date vs relative-date required fields).
162+
163+
Impact:
164+
165+
- Invalid drip configs can be stored and silently skipped.
166+
167+
Recommendation:
168+
169+
- Enforce stricter validation in `updateGroup`:
170+
- `relative-date` requires numeric `delayInMillis >= 0`,
171+
- `exact-date` requires valid numeric `dateInUTC`,
172+
- optional sanity checks for email schema consistency.
173+
174+
Acceptance criteria:
175+
176+
- Invalid drip payloads are rejected with explicit errors.
177+
178+
## 8) Observability Gaps For Release Lifecycle (Medium)
179+
180+
Current behavior:
181+
182+
- Error capture exists, but release lifecycle visibility is limited.
183+
184+
Recommendation:
185+
186+
- Add structured counters/events:
187+
- `drip_courses_scanned`
188+
- `drip_users_evaluated`
189+
- `drip_groups_unlocked`
190+
- `drip_emails_queued`
191+
- `drip_emails_failed`
192+
- `drip_loop_duration_ms`
193+
- Add per-domain dimensions where safe.
194+
195+
Acceptance criteria:
196+
197+
- Can answer: "How many unlocks and drip emails happened per domain/day and failure rate?"
198+
199+
## 9) Test Coverage Still Missing For Some Hard Cases (Medium)
200+
201+
Current behavior:
202+
203+
- Unit-level coverage has improved significantly, but concurrency/idempotency/outbox/reorder-policy and digest policy are not covered yet.
204+
205+
Recommendation:
206+
207+
- Add tests for:
208+
- concurrent worker execution idempotency,
209+
- outbox retry semantics,
210+
- rank reorder policy behavior,
211+
- digest mode behavior.
212+
213+
Acceptance criteria:
214+
215+
- Regression suite catches duplicate-email races and policy regressions.
216+
217+
## Proposed Implementation Plan
218+
219+
## Phase 1 (P0 - Safety)
220+
221+
- Add strict drip input validation in `updateGroup`.
222+
- Add key indexes for hot query paths.
223+
- Add release metrics counters.
224+
225+
## Phase 2 (P1 - Correctness Under Scale)
226+
227+
- Introduce distributed locking for drip worker cycle.
228+
- Add email idempotency key to avoid duplicate sends.
229+
230+
## Phase 3 (P1/P2 - Reliability)
231+
232+
- Implement outbox for unlock-notification events.
233+
- Add sender worker retry + dead-letter handling.
234+
235+
## Phase 4 (P2 - Product Policy)
236+
237+
- Decide and implement rank-reorder semantics for relative drip.
238+
- Add digest mode and editable template/localization design.
239+
240+
## Suggested Ownership Split
241+
242+
- Queue worker correctness/scalability: `apps/queue`
243+
- GraphQL validation + admin constraints: `apps/web/graphql`
244+
- Schema/index migrations: `packages/common-logic`, `packages/orm-models`, migration scripts
245+
- Metrics/dashboarding: observability owner
246+
247+
## Decision Log Needed
248+
249+
Before implementation, confirm:
250+
251+
- Reorder policy for in-flight learners (lock vs cursor snapshot).
252+
- Email policy for multi-unlock runs (per-group vs digest).
253+
- Preferred reliability model (direct enqueue with idempotency vs outbox).
254+
255+
## References
256+
257+
- `apps/queue/src/domain/process-drip.ts`
258+
- `apps/queue/src/domain/queries.ts`
259+
- `apps/web/graphql/courses/logic.ts`
260+
- `apps/queue/src/domain/__tests__/process-drip.test.ts`
261+
- `apps/web/graphql/courses/__tests__/update-group-drip.test.ts`

apps/queue/jest.config.ts

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,4 @@
11
const config = {
2-
preset: "@shelf/jest-mongodb",
32
setupFilesAfterEnv: ["<rootDir>/setupTests.ts"],
43
watchPathIgnorePatterns: ["globalConfig"],
54
moduleNameMapper: {

apps/queue/package.json

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,11 @@
1818
"@courselit/email-editor": "workspace:^",
1919
"@courselit/orm-models": "workspace:^",
2020
"@courselit/utils": "workspace:^",
21+
"@opentelemetry/api-logs": "^0.204.0",
22+
"@opentelemetry/exporter-logs-otlp-http": "^0.204.0",
23+
"@opentelemetry/resources": "^2.1.0",
24+
"@opentelemetry/sdk-logs": "^0.204.0",
25+
"@opentelemetry/sdk-node": "^0.204.0",
2126
"@types/jsdom": "^21.1.7",
2227
"bullmq": "^4.14.0",
2328
"express": "^4.18.2",
@@ -26,6 +31,7 @@
2631
"mongodb": "^6.15.0",
2732
"mongoose": "^8.13.1",
2833
"nodemailer": "^6.9.2",
34+
"posthog-node": "^5.9.1",
2935
"pino": "^8.14.1",
3036
"pino-mongodb": "^4.3.0",
3137
"zod": "^3.22.4"

0 commit comments

Comments
 (0)