Skip to content

Commit fa360ab

Browse files
authored
[Refactor] Change Retry Logic in Email Sending (#1191)
### Context Some of our users' emails were getting stuck in sending. The long delays in processing the retries caused a vercel function timeout. ### Summary of Changes We refactor the low level email sending functions to remove the retry logic there. We kick it up to the email queue step. Additionally, we flag emails to be retried when they encounter issues but leave it for a future iteration to actually perform the retry. We perform an exponential backoff with a random component to decide when they have to be retried. We also make some small adjustments to the queuing function to not queue skipped emails. When an email fails to send during the sending function, we check to see if it is a retryable error or not. Some errors are transient and trying again may succeed while others indicate deeper issues. If it is retryable, and the max number of retry attempts hasn't been reached, we set `nextSendRetryAt` to a time determined by an exponential backoff calculation function. When the queuing function looks for emails to queue, it doesn't just pick up the `SCHEDULED`. emails whose `scheduledAt` time <= `NOW()`, but also those emails whose `nextSendRetryAt` time <= `NOW()`. What this means in practice is that one iteration of the `email-queue-step` will mark emails as retryable while another iteration will perform the retry. This should be cleaner and prevent long delays in the `email-queue-step` process due to retries. This also makes it easier to scale up the number of retries if need be.
1 parent b255c5e commit fa360ab

15 files changed

Lines changed: 631 additions & 122 deletions

File tree

AGENTS.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -101,6 +101,7 @@ To see all development ports, refer to the index.html of `apps/dev-launchpad/pub
101101
- Fail early, fail loud. Fail fast with an error instead of silently continuing.
102102
- Do NOT use `as`/`any`/type casts or anything else like that to bypass the type system unless you specifically asked the user about it. Most of the time a place where you would use type casts is not one where you actually need them. Avoid wherever possible.
103103
- When writing database migration files, assume that we have >1,000,000 rows in every table (unless otherwise specified). This means you may have to use CONDITIONALLY_REPEAT_MIGRATION_SENTINEL to avoid running the migration and things like concurrent index builds; see the existing migrations for examples.
104+
- Each migration file runs in its own transaction with a relatively short timeout. Split long-running operations into separate migration files to avoid timeouts. For example, when adding CHECK constraints, use `NOT VALID` in one migration, then `VALIDATE CONSTRAINT` in a separate migration file.
104105
- **When building frontend code, always carefully deal with loading and error states.** Be very explicit with these; some components make this easy, eg. the button onClick already takes an async callback for loading state, but make sure this is done everywhere, and make sure errors are NEVER just silently swallowed.
105106
- Unless very clearly equivalent from types, prefer explicit null/undefinedness checks over boolean checks, eg. `foo == null` instead of `!foo`.
106107

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
-- Add deferred retry fields for email sending
2+
-- These fields allow the email queue to schedule retries for later iterations
3+
-- instead of blocking the current iteration with inline retries.
4+
5+
ALTER TABLE "EmailOutbox"
6+
ADD COLUMN "sendRetries" INTEGER NOT NULL DEFAULT 0,
7+
ADD COLUMN "nextSendRetryAt" TIMESTAMP(3),
8+
ADD COLUMN "sendAttemptErrors" JSONB;
9+
10+
-- Constraint: nextSendRetryAt can only be set after at least one failed attempt
11+
-- (if sendRetries is 0, no attempt has failed, so there's nothing to retry)
12+
-- Use NOT VALID to avoid holding ACCESS EXCLUSIVE lock during full-table validation.
13+
-- Validation happens in a separate migration to avoid transaction timeout.
14+
ALTER TABLE "EmailOutbox"
15+
ADD CONSTRAINT "EmailOutbox_nextSendRetryAt_requires_failure"
16+
CHECK ("nextSendRetryAt" IS NULL OR "sendRetries" > 0) NOT VALID;
17+
18+
-- Constraint: sendAttemptErrors can only be set after at least one failed attempt
19+
ALTER TABLE "EmailOutbox"
20+
ADD CONSTRAINT "EmailOutbox_sendAttemptErrors_requires_failure"
21+
CHECK ("sendAttemptErrors" IS NULL OR "sendRetries" > 0) NOT VALID;
22+
23+
-- Constraint: nextSendRetryAt must be null when email has finished sending
24+
-- (if finishedSendingAt is set, there's nothing more to retry)
25+
ALTER TABLE "EmailOutbox"
26+
ADD CONSTRAINT "EmailOutbox_no_retry_after_finished"
27+
CHECK ("finishedSendingAt" IS NULL OR "nextSendRetryAt" IS NULL) NOT VALID;
Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
-- Validate the deferred retry constraints added in the previous migration.
2+
-- This runs in a separate transaction to avoid timeout, and only takes
3+
-- SHARE UPDATE EXCLUSIVE lock (allows concurrent reads/writes).
4+
5+
ALTER TABLE "EmailOutbox" VALIDATE CONSTRAINT "EmailOutbox_nextSendRetryAt_requires_failure";
6+
ALTER TABLE "EmailOutbox" VALIDATE CONSTRAINT "EmailOutbox_sendAttemptErrors_requires_failure";
7+
ALTER TABLE "EmailOutbox" VALIDATE CONSTRAINT "EmailOutbox_no_retry_after_finished";
Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
-- SPLIT_STATEMENT_SENTINEL
2+
-- SINGLE_STATEMENT_SENTINEL
3+
-- RUN_OUTSIDE_TRANSACTION_SENTINEL
4+
-- Index on isQueued for efficient queueReadyEmails() queries.
5+
-- Most emails have isQueued=TRUE (already processed), so filtering for FALSE is highly selective.
6+
CREATE INDEX CONCURRENTLY IF NOT EXISTS "EmailOutbox_isQueued_idx" ON /* SCHEMA_NAME_SENTINEL */."EmailOutbox" ("isQueued");

apps/backend/prisma/schema.prisma

Lines changed: 13 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -99,8 +99,8 @@ model ExternalDbSyncMetadata {
9999
100100
singleton BooleanTrue @unique @default(TRUE)
101101
102-
sequencerEnabled Boolean @default(true)
103-
pollerEnabled Boolean @default(true)
102+
sequencerEnabled Boolean @default(true)
103+
pollerEnabled Boolean @default(true)
104104
105105
createdAt DateTime @default(now())
106106
updatedAt DateTime @updatedAt
@@ -833,7 +833,7 @@ model EmailOutbox {
833833
// The scheduled time of when the email should be added to the queue. Can be edited, but only if the email has not yet started sending. Doing so should set isQueued to false.
834834
scheduledAt DateTime
835835
836-
// The scheduled time of the email if it is in the future.
836+
// Whether the email has been queued for sending. Once queued, this stays true unless the email is retried or rescheduled.
837837
isQueued Boolean @default(false)
838838
839839
// A generated column that is equal to scheduledAt if isQueued is false, otherwise null. See the note above on EmailOutboxStatus.status for more details on dbgenerated values.
@@ -844,6 +844,14 @@ model EmailOutbox {
844844
// if startedSendingAt is not set, then finishedSendingAt is also not set
845845
finishedSendingAt DateTime?
846846
847+
// Deferred retry fields for email sending
848+
// Number of send retries attempted (starts at 0, incremented on each failure). Reset when email content is edited.
849+
sendRetries Int @default(0)
850+
// When to retry sending (null = not waiting for retry). Set when a retryable error occurs. Reset when email content is edited. Must be null if sendRetries is 0 (enforced by EmailOutbox_nextSendRetryAt_requires_failure). Must be null when finishedSendingAt is set (enforced by EmailOutbox_no_retry_after_finished).
851+
nextSendRetryAt DateTime?
852+
// JSON array of errors from each failed send attempt. Each entry has: { attemptNumber, timestamp, externalMessage, externalDetails, internalMessage, internalDetails }. Reset when email content is edited. Must be null if sendRetries is 0 (enforced by EmailOutbox_sendAttemptErrors_requires_failure).
853+
sendAttemptErrors Json?
854+
847855
// A generated column that is equal to finishedSendingAt if canHaveDeliveryInfo is false, otherwise deliveredAt.
848856
sentAt DateTime? @default(dbgenerated("\nCASE\n WHEN (\"canHaveDeliveryInfo\" IS TRUE) THEN \"deliveredAt\"\n WHEN (\"canHaveDeliveryInfo\" IS FALSE) THEN \"finishedSendingAt\"\n ELSE NULL::timestamp without time zone\nEND"))
849857
@@ -878,6 +886,7 @@ model EmailOutbox {
878886
@@index([tenancyId, finishedSendingAt(sort: Desc), scheduledAtIfNotYetQueued(sort: Desc), priority, id], map: "EmailOutbox_ordering_idx")
879887
@@index([tenancyId, simpleStatus], map: "EmailOutbox_simple_status_tenancy_idx")
880888
@@index([tenancyId, status], map: "EmailOutbox_status_tenancy_idx")
889+
@@index([isQueued], map: "EmailOutbox_isQueued_idx")
881890
}
882891

883892
model EmailOutboxProcessingMetadata {
@@ -1088,7 +1097,7 @@ model OutgoingRequest {
10881097
10891098
qstashOptions Json
10901099
startedFulfillingAt DateTime?
1091-
deduplicationKey String?
1100+
deduplicationKey String?
10921101
10931102
// Partial unique index on deduplicationKey WHERE startedFulfillingAt IS NULL
10941103
// is created in a custom migration (not expressible in Prisma schema)

apps/backend/src/app/api/latest/emails/outbox/crud.tsx

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@ import { KnownErrors } from "@stackframe/stack-shared";
77
import { emailOutboxCrud, EmailOutboxCrud } from "@stackframe/stack-shared/dist/interface/crud/email-outbox";
88
import { yupObject, yupString } from "@stackframe/stack-shared/dist/schema-fields";
99
import { StackAssertionError, StatusError, throwErr } from "@stackframe/stack-shared/dist/utils/errors";
10+
import { Json } from "@stackframe/stack-shared/dist/utils/json";
1011
import { createLazyProxy } from "@stackframe/stack-shared/dist/utils/proxies";
1112

1213
/**
@@ -57,6 +58,25 @@ function prismaModelToCrud(prismaModel: EmailOutbox): EmailOutboxCrud["Server"][
5758
to = { type: "custom-emails", emails: recipient?.emails ?? [] };
5859
}
5960

61+
// Convert sendAttemptErrors from DB format (camelCase) to API format (snake_case)
62+
const sendAttemptErrors = prismaModel.sendAttemptErrors
63+
? (prismaModel.sendAttemptErrors as Array<{
64+
attemptNumber: number,
65+
timestamp: string,
66+
externalMessage: string,
67+
externalDetails: Record<string, Json>,
68+
internalMessage: string,
69+
internalDetails: Record<string, Json>,
70+
}>).map(e => ({
71+
attempt_number: e.attemptNumber,
72+
timestamp: e.timestamp,
73+
external_message: e.externalMessage,
74+
external_details: e.externalDetails,
75+
internal_message: e.internalMessage,
76+
internal_details: e.internalDetails,
77+
}))
78+
: null;
79+
6080
// Base fields present on all emails
6181
const base = {
6282
id: prismaModel.id,
@@ -68,6 +88,9 @@ function prismaModelToCrud(prismaModel: EmailOutbox): EmailOutboxCrud["Server"][
6888
variables: (prismaModel.extraRenderVariables ?? {}) as Record<string, any>,
6989
skip_deliverability_check: prismaModel.shouldSkipDeliverabilityCheck,
7090
scheduled_at_millis: prismaModel.scheduledAt.getTime(),
91+
send_retries: prismaModel.sendRetries,
92+
next_send_retry_at_millis: prismaModel.nextSendRetryAt?.getTime() ?? null,
93+
send_attempt_errors: sendAttemptErrors,
7194
// Default flags (overridden in specific statuses)
7295
is_paused: false,
7396
has_rendered: false,
@@ -358,6 +381,7 @@ export const emailOutboxCrudHandlers = createLazyProxy(() => createCrudHandlers(
358381
// Cancel action - mark as skipped
359382
set("isPaused", Prisma.sql`false`);
360383
set("isQueued", Prisma.sql`false`);
384+
setNull("nextSendRetryAt"); // Clear any pending retry so it won't be picked up
361385
set("skippedReason", Prisma.sql`'MANUALLY_CANCELLED'::"EmailOutboxSkippedReason"`);
362386
set("skippedDetails", Prisma.sql`'{}'::jsonb`);
363387
} else {
@@ -395,13 +419,16 @@ export const emailOutboxCrudHandlers = createLazyProxy(() => createCrudHandlers(
395419
// If content changed, reset rendering and sending state
396420
if (needsRerenderReset) {
397421
set("isQueued", Prisma.sql`false`);
422+
// Reset retry fields (sendRetries to 0, others to null)
423+
set("sendRetries", Prisma.sql`0`);
398424
setNull(
399425
"renderedByWorkerId", "startedRenderingAt", "finishedRenderingAt",
400426
"renderErrorExternalMessage", "renderErrorExternalDetails",
401427
"renderErrorInternalMessage", "renderErrorInternalDetails",
402428
"renderedHtml", "renderedText", "renderedSubject",
403429
"renderedIsTransactional", "renderedNotificationCategoryId",
404430
"startedSendingAt", "finishedSendingAt",
431+
"nextSendRetryAt", "sendAttemptErrors",
405432
"sendServerErrorExternalMessage", "sendServerErrorExternalDetails",
406433
"sendServerErrorInternalMessage", "sendServerErrorInternalDetails",
407434
"skippedReason", "skippedDetails", "canHaveDeliveryInfo",
@@ -494,6 +521,9 @@ function parseEmailOutboxFromJson(j: Record<string, unknown>): EmailOutbox {
494521
scheduledAtIfNotYetQueued: dateOrNull("scheduledAtIfNotYetQueued"),
495522
startedSendingAt: dateOrNull("startedSendingAt"),
496523
finishedSendingAt: dateOrNull("finishedSendingAt"),
524+
sendRetries: j.sendRetries as number,
525+
nextSendRetryAt: dateOrNull("nextSendRetryAt"),
526+
sendAttemptErrors: j.sendAttemptErrors as Prisma.JsonValue,
497527
sentAt: dateOrNull("sentAt"),
498528
sendServerErrorExternalMessage: j.sendServerErrorExternalMessage as string | null,
499529
sendServerErrorExternalDetails: j.sendServerErrorExternalDetails as Prisma.JsonValue,

0 commit comments

Comments
 (0)