Jobs incorrectly cancelled on transient database errors during entitlement checks #2318

@bhellema

Problem

Preflight job d272041d-f014-4792-a1a1-944e5c0c57ab was incorrectly cancelled on March 31, 2026 due to a database connection pool timeout (PGRST003). The entitlement check in checkProductCodeEntitlements() caught the transient infrastructure error and returned false, causing the system to treat it as "site not entitled" and set the job status to CANCELLED.

Error Flow from Logs

  1. Database: PGRST003: Timed out acquiring connection from connection pool
  2. TierClient throws error → caught at src/common/audit-utils.js:38-40
  3. Returns false → isAuditEnabledForSite() returns false
  4. AsyncJobRunner sets AsyncJob.Status.CANCELLED (line 91)
  5. Job skipped with reason: "preflight audits disabled for site"

Expected behavior: Transient errors should trigger Lambda/SQS retry, not job cancellation.

Root Cause

checkProductCodeEntitlements() catches all errors and returns false, including transient infrastructure failures:

  • PostgREST errors: PGRST000-003 (connection/timeout errors, 503/504)
  • Network errors: ECONNREFUSED, ETIMEDOUT, ENOTFOUND, ECONNRESET
  • HTTP server errors: 408, 429, 500, 502, 503, 504

The error handling doesn't distinguish between:

  • ✅ Legitimate "site not entitled" → should return false and cancel
  • ❌ Transient infrastructure errors → should throw and retry

Solution

Add an isTransientTierClientError() classifier that separates transient errors from permanent ones:

Transient errors (should retry):

  • Database: PGRST000, PGRST001, PGRST002, PGRST003
  • Network: ECONNREFUSED, ETIMEDOUT, ENOTFOUND, ECONNRESET
  • HTTP: 408, 429, 500, 502, 503, 504

Permanent errors (skip audit):

  • HTTP: 401, 403, 404
  • PostgREST: PGRST100-122 (bad request), PGRST200-204 (not found), PGRST300 (config error)
  • Business logic: "not enrolled", "no entitlement"
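The classification above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: it assumes that errors surfaced by TierClient carry a string `code` (PostgREST or Node network code) and/or a numeric `status` or `statusCode`, possibly nested under `cause`. Those property names are assumptions and would need to be checked against the errors TierClient actually throws.

```javascript
// Hypothetical sketch of the classifier described above.
// ASSUMPTION: errors expose `code`, `status`, `statusCode`, or `cause.code`.
const TRANSIENT_POSTGREST_CODES = new Set(['PGRST000', 'PGRST001', 'PGRST002', 'PGRST003']);
const TRANSIENT_NETWORK_CODES = new Set(['ECONNREFUSED', 'ETIMEDOUT', 'ENOTFOUND', 'ECONNRESET']);
const TRANSIENT_HTTP_STATUSES = new Set([408, 429, 500, 502, 503, 504]);

function isTransientTierClientError(error) {
  if (!error || typeof error !== 'object') {
    return false;
  }

  // PostgREST and Node network failures are identified by a string code.
  const code = error.code ?? error.cause?.code;
  if (typeof code === 'string'
    && (TRANSIENT_POSTGREST_CODES.has(code) || TRANSIENT_NETWORK_CODES.has(code))) {
    return true;
  }

  // HTTP failures are identified by a numeric status.
  const status = Number(error.status ?? error.statusCode);
  return TRANSIENT_HTTP_STATUSES.has(status);
}
```

Anything not explicitly listed as transient (401, 403, 404, PGRST1xx/2xx/3xx, business-logic messages) falls through to `false`, so unknown errors keep today's "skip audit" behavior unless the lists are extended.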

Affected Locations

All three TierClient usage locations:

  1. src/common/audit-utils.js - checkProductCodeEntitlements()
  2. src/utils/site-validation.js - checkSiteRequiresValidation()
  3. src/prerender/utils/utils.js - isPaidLLMOCustomer()

Implementation Plan

See detailed plan: docs/plans/fix-database-timeout-entitlement-check.md

Files to Change:

  • New: src/common/tier-client-error-classifier.js - Error classification utility
  • Update: src/common/audit-utils.js - Rethrow transient errors
  • Update: src/utils/site-validation.js - Rethrow transient errors
  • Update: src/prerender/utils/utils.js - Rethrow transient errors
  • Update: src/common/async-job-runner.js - Add clarifying comment
  • New: test/common/tier-client-error-classifier.test.js - Test coverage
  • Update: Test files for all affected modules
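The "rethrow transient errors" change at each call site might look something like the sketch below. To keep the example self-contained, the entitlement lookup and the classifier are passed in as functions; `checkEntitlementOrThrow`, `lookupEntitlement`, and `isTransient` are hypothetical names for illustration only and do not reflect the real TierClient API or the existing function signatures.

```javascript
// Hypothetical sketch of the call-site pattern.
// ASSUMPTION: `lookupEntitlement` stands in for the real TierClient call and
// `isTransient` for the classifier; `log` is any logger with warn/error.
async function checkEntitlementOrThrow(lookupEntitlement, isTransient, log = console) {
  try {
    const entitlement = await lookupEntitlement();
    return Boolean(entitlement);
  } catch (error) {
    if (isTransient(error)) {
      // Infrastructure problem: propagate so the invocation fails and is retried.
      log.warn(`Transient entitlement check failure, rethrowing: ${error.message}`);
      throw error;
    }
    // Permanent failure: keep the existing behavior and treat as not entitled.
    log.error(`Entitlement check failed, treating site as not entitled: ${error.message}`);
    return false;
  }
}
```

The important design point is that only the transient branch changes behavior; the permanent branch still returns false exactly as before.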

Impact

  • Preserved: Sites without entitlements still return false, job CANCELLED
  • Preserved: Permanent errors (404, not found) still return false
  • ⚠️ Changed: Transient errors now throw instead of returning false

Result: Jobs will no longer be incorrectly cancelled during infrastructure issues. Lambda/SQS will retry the message until it succeeds or is moved to the dead-letter queue (DLQ).
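As a rough illustration of why throwing matters at the runner level, consider the simplified sketch below. It is not the real AsyncJobRunner: `runPreflightJob`, the injected functions, and the 'COMPLETED' status string are assumptions made for the example. When the entitlement check rejects, the error propagates out before any status is written, so the invocation fails and the SQS message can be redelivered instead of the job being marked CANCELLED.

```javascript
// Hypothetical sketch; status strings and injected functions are stand-ins,
// not the real AsyncJobRunner API.
async function runPreflightJob(job, isAuditEnabledForSite, runAudit) {
  // A transient failure rejects here and propagates to the caller,
  // leaving the job status untouched so a retry can pick it up.
  const enabled = await isAuditEnabledForSite(job.siteId);

  if (!enabled) {
    // Only a definitive "not entitled" answer cancels the job.
    return { ...job, status: 'CANCELLED', reason: 'preflight audits disabled for site' };
  }

  await runAudit(job);
  return { ...job, status: 'COMPLETED' };
}
```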

Related

  • Job ID: d272041d-f014-4792-a1a1-944e5c0c57ab
  • Coralogix trace: 1-69cb9569-e63d7c316679035ff32e6ca0

Labels: bug
