Problem
Preflight job d272041d-f014-4792-a1a1-944e5c0c57ab was incorrectly cancelled on March 31, 2026 due to a database connection pool timeout (PGRST003). The entitlement check in checkProductCodeEntitlements() caught the transient infrastructure error and returned false, causing the system to treat it as "site not entitled" and set the job status to CANCELLED.
Error Flow from Logs
- Database: PGRST003: Timed out acquiring connection from connection pool
- TierClient throws error → caught at src/common/audit-utils.js:38-40
- Returns false → isAuditEnabledForSite() returns false
- AsyncJobRunner sets AsyncJob.Status.CANCELLED (line 91)
- Job skipped with reason: "preflight audits disabled for site"
Expected behavior: Transient errors should trigger Lambda/SQS retry, not job cancellation.
Root Cause
checkProductCodeEntitlements() catches all errors and returns false, including transient infrastructure failures:
- PostgREST errors: PGRST000-003 (connection/timeout errors, 503/504)
- Network errors: ECONNREFUSED, ETIMEDOUT, ENOTFOUND, ECONNRESET
- HTTP server errors: 408, 429, 500, 502, 503, 504
The error handling doesn't distinguish between:
- ✅ Legitimate "site not entitled" → should return false and cancel
- ❌ Transient infrastructure errors → should throw and retry
Solution
Create an isTransientTierClientError() classifier that separates transient errors from permanent ones:
Transient errors (should retry):
- Database: PGRST000, PGRST001, PGRST002, PGRST003
- Network: ECONNREFUSED, ETIMEDOUT, ENOTFOUND, ECONNRESET
- HTTP: 408, 429, 500, 502, 503, 504
Permanent errors (skip audit):
- HTTP: 401, 403, 404
- PostgREST: PGRST100-122 (bad request), PGRST200-204 (not found), PGRST300 (config error)
- Business logic: "not enrolled", "no entitlement"
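The classification above could be sketched roughly as follows. This is a minimal illustration, not the final implementation: it assumes the error thrown by TierClient exposes a code property (PostgREST or Node network code) and a status or statusCode property (HTTP status), which this issue does not confirm. When a PostgREST code is present it decides on its own, since a permanent PostgREST error may still arrive with a 5xx status.

```javascript
// Sketch only. The error fields read here (code, status, statusCode) are
// assumptions about what TierClient surfaces, not a confirmed contract.
const TRANSIENT_POSTGREST_CODES = new Set(['PGRST000', 'PGRST001', 'PGRST002', 'PGRST003']);
const TRANSIENT_NETWORK_CODES = new Set(['ECONNREFUSED', 'ETIMEDOUT', 'ENOTFOUND', 'ECONNRESET']);
const TRANSIENT_HTTP_STATUSES = new Set([408, 429, 500, 502, 503, 504]);

function isTransientTierClientError(error) {
  // Anything that is not an error-like object is treated as permanent.
  if (error === null || typeof error !== 'object') return false;

  const { code } = error;
  // A PostgREST code is more specific than the HTTP status, so it decides alone.
  if (typeof code === 'string' && code.startsWith('PGRST')) {
    return TRANSIENT_POSTGREST_CODES.has(code);
  }
  if (TRANSIENT_NETWORK_CODES.has(code)) return true;

  // Fall back to the HTTP status; a missing status becomes NaN and matches nothing.
  const status = Number(error.status ?? error.statusCode);
  return TRANSIENT_HTTP_STATUSES.has(status);
}
```

Under these assumptions, an error carrying code PGRST003 or status 503 is classified as transient, while status 404, code PGRST300, or a plain "not enrolled" error is classified as permanent.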
Affected Locations
All three TierClient usage locations:
- src/common/audit-utils.js - checkProductCodeEntitlements()
- src/utils/site-validation.js - checkSiteRequiresValidation()
- src/prerender/utils/utils.js - isPaidLLMOCustomer()
Implementation Plan
See detailed plan: docs/plans/fix-database-timeout-entitlement-check.md
Files to Change:
- New: src/common/tier-client-error-classifier.js - Error classification utility
- Update: src/common/audit-utils.js - Rethrow transient errors
- Update: src/utils/site-validation.js - Rethrow transient errors
- Update: src/prerender/utils/utils.js - Rethrow transient errors
- Update: src/common/async-job-runner.js - Add clarifying comment
- New: test/common/tier-client-error-classifier.test.js - Test coverage
- Update: Test files for all affected modules
Impact
- ✅ Preserved: Sites without entitlements still return false, job CANCELLED
- ✅ Preserved: Permanent errors (404, not found) still return false
- ⚠️ Changed: Transient errors now throw instead of returning false
Result: Jobs will no longer be incorrectly cancelled during infrastructure issues. Lambda/SQS will retry until the job succeeds or the message lands in the dead-letter queue (DLQ).
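The intended catch-block behaviour at each affected location might look like the sketch below. The names isEntitled and fetchEntitlement are hypothetical stand-ins, since the real TierClient call and its signature are not shown in this issue, and the classifier is reduced to a stub so the snippet runs on its own.

```javascript
// Stand-in for the classifier, reduced to the PostgREST connection codes.
function isTransientTierClientError(error) {
  return ['PGRST000', 'PGRST001', 'PGRST002', 'PGRST003'].includes(error?.code);
}

// Hypothetical helper: fetchEntitlement stands in for the TierClient call,
// whose real signature is not shown in this issue.
async function isEntitled(fetchEntitlement, log = console) {
  try {
    const entitlement = await fetchEntitlement();
    return Boolean(entitlement);
  } catch (error) {
    if (isTransientTierClientError(error)) {
      // Rethrow so the invocation fails and the message can be retried.
      log.error(`Transient entitlement check failure: ${error.message}`);
      throw error;
    }
    // Permanent failure: keep the existing "not entitled" result.
    log.warn(`Entitlement check failed permanently: ${error.message}`);
    return false;
  }
}
```

The point of the design is that only the transient branch changes: every other failure still resolves to false, so the caller's existing CANCELLED path stays as it is.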
Related
- Job ID: d272041d-f014-4792-a1a1-944e5c0c57ab
- Coralogix trace: 1-69cb9569-e63d7c316679035ff32e6ca0