fix: use UPSERT to prevent duplicate DLQ entries (#6)
## Summary
Two fixes for worker queue health and DLQ reliability:
### 1. DLQ UPSERT - Prevent duplicate entries
When a job was requeued from the DLQ and failed again, a new DLQ entry
was created instead of the existing one being updated. This caused:
- **Duplicate DLQ entries** for the same logical job (same `job_key`)
- **Stale `failed_at` timestamps** - the original entry's timestamp was
never updated
- **Infinite requeue loops** - auto-requeue cron jobs would pick up old
entries repeatedly
- **Exponential job growth** - duplicates kept multiplying on each
requeue cycle
**Fix:**
- Add a **unique partial index** on `job_key` (`WHERE job_key IS NOT
NULL`)
- Change `INSERT` to `UPSERT` in `add_to_dlq()` and
`process_failed_jobs()`
- On conflict, update `failed_at=NOW()` and increment `failure_count`
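
The index-plus-UPSERT pattern described above can be sketched as follows. This is a minimal illustration, not the project's code: it uses Python's built-in `sqlite3` so it runs without a database server, and the table name `dlq`, the `payload` column, the index name, and the explicit `now` parameter (standing in for `NOW()`) are all assumptions. Only `job_key`, `failed_at`, and `failure_count` come from the description.

```python
import sqlite3


def make_dlq(conn):
    """Create a hypothetical DLQ table plus the unique partial index."""
    conn.execute(
        """
        CREATE TABLE dlq (
            id            INTEGER PRIMARY KEY,
            job_key       TEXT,               -- may be NULL for unkeyed jobs
            payload       TEXT,               -- assumed column, for illustration
            failed_at     TEXT NOT NULL,
            failure_count INTEGER NOT NULL DEFAULT 1
        )
        """
    )
    # Uniqueness is enforced only for rows that actually have a job_key.
    conn.execute(
        """
        CREATE UNIQUE INDEX dlq_job_key_uniq
        ON dlq (job_key) WHERE job_key IS NOT NULL
        """
    )


def add_to_dlq(conn, job_key, payload, now):
    """Illustrative stand-in for the project's add_to_dlq().

    Inserts a DLQ row, or refreshes the existing row for the same job_key.
    """
    conn.execute(
        """
        INSERT INTO dlq (job_key, payload, failed_at, failure_count)
        VALUES (?, ?, ?, 1)
        ON CONFLICT (job_key) WHERE job_key IS NOT NULL
        DO UPDATE SET failed_at     = excluded.failed_at,
                      failure_count = failure_count + 1
        """,
        (job_key, payload, now),
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    make_dlq(conn)
    add_to_dlq(conn, "send-email:42", "{}", "2024-01-01T00:00:00")
    add_to_dlq(conn, "send-email:42", "{}", "2024-01-02T00:00:00")
    print(conn.execute("SELECT job_key, failed_at, failure_count FROM dlq").fetchall())
```

Because the index is partial, the conflict target has to repeat the `WHERE job_key IS NOT NULL` predicate. Rows with a NULL `job_key` never conflict and are inserted as before, which matches the backwards-compatibility note in this description.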
### 2. Startup cleanup - Release stale locks and clean up dead jobs
When workers crash or are killed without graceful shutdown, they can
leave behind:
- **Stale queue locks** - prevent other workers from processing jobs in
affected queues
- **Dead jobs** - jobs with `attempts >= max_attempts` that sit in the
main queue forever
**Fix:**
- Add `startup_cleanup()` that runs automatically when
`run_until_cancelled()` is called
- `release_stale_queue_locks()` - releases queue locks older than 5
minutes (`DEFAULT_STALE_LOCK_TIMEOUT`)
- `cleanup_permanently_failed_jobs()` - deletes jobs that have exhausted
retries
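
A rough sketch of what the cleanup step might look like, again using Python's `sqlite3` for a self-contained example. The `queues` and `jobs` tables, their columns (`locked_by`, `locked_at`, `queue_name`), the epoch-seconds timestamps, and the function signatures are assumptions made for illustration. Only the function names, the 5-minute `DEFAULT_STALE_LOCK_TIMEOUT`, and the `attempts >= max_attempts` condition come from the description.

```python
import sqlite3

DEFAULT_STALE_LOCK_TIMEOUT = 5 * 60  # seconds (5 minutes, per the description)


def create_schema(conn):
    """Hypothetical minimal schema; the real tables will differ."""
    conn.execute("CREATE TABLE queues (name TEXT PRIMARY KEY, locked_by TEXT, locked_at REAL)")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, queue_name TEXT, "
        "attempts INTEGER NOT NULL, max_attempts INTEGER NOT NULL)"
    )


def release_stale_queue_locks(conn, now, timeout=DEFAULT_STALE_LOCK_TIMEOUT):
    """Clear locks older than `timeout` seconds; returns how many were released."""
    cur = conn.execute(
        "UPDATE queues SET locked_by = NULL, locked_at = NULL "
        "WHERE locked_at IS NOT NULL AND locked_at < ?",
        (now - timeout,),
    )
    return cur.rowcount


def cleanup_permanently_failed_jobs(conn):
    """Delete jobs that have exhausted their retries; returns how many were deleted."""
    cur = conn.execute("DELETE FROM jobs WHERE attempts >= max_attempts")
    return cur.rowcount


def startup_cleanup(conn, now):
    """Run both cleanup steps once, as a worker might do before polling for jobs."""
    released = release_stale_queue_locks(conn, now)
    deleted = cleanup_permanently_failed_jobs(conn)
    conn.commit()
    return released, deleted
```

In this sketch `now` is passed in explicitly rather than read from the clock, which keeps the staleness check deterministic under test. A real worker would more likely compare against the database's own clock.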
## Behavior After Fix
- **One DLQ entry per logical job** (identified by `job_key`)
- **`failed_at` always reflects the most recent failure** - cooldown
periods work correctly
- **`failure_count` accumulates** across all failure cycles
- **Stale locks auto-released** on worker startup
- **Dead jobs cleaned up** on worker startup
## Testing
All 55 tests pass, including the admin API DLQ tests.
## Breaking Changes
None - backwards compatible. Existing DLQ entries without `job_key` are
unaffected.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>