docs(retries): mark dead RetryPolicy timing fields as deprecated

ceejbot · ceejbot · commit de4ecd8e36de · 2026-05-06T19:11:29.000-07:00
`RetryPolicy.{initial_delay, max_delay, backoff_multiplier, jitter_factor}` were stored on the struct but never reached graphile_worker: only `max_attempts` is forwarded by `From<JobSpec> for GraphileJobSpec`. graphile_worker computes every retry delay with a hard-coded SQL formula, `exp(min(attempts, 10))` seconds. So `RetryPolicy::fast()` and `RetryPolicy::conservative()` produced identical retry timing in practice even though the docs promised "100ms-30s delays" vs. "1 min - 8 hour delays".

This commit makes the docs match the actual behaviour:

- Marks the unused math helpers (`RetryPolicy::new`, `with_jitter`, `calculate_delay`, `calculate_retry_time`, and `JobSpec::calculate_retry_time`) as `#[deprecated(since = "1.2.0")]` with notes pointing users at `RetryPolicy { max_attempts: n, ..Default::default() }` or the presets.
- Rewrites the rustdoc on `RetryPolicy`, on each preset, on the `with_*_retries` builders, and on the `enqueue_*_with_retries` convenience helpers to describe what actually happens (only `max_attempts` differs across presets; exp-backoff timing is fixed).
- Updates the lib.rs module-level rustdoc.
- Migrates `examples/enqueue_jobs.rs` to the recommended pattern so it doesn't trigger the new deprecation warnings.
- Updates README.md to drop the false delay-range claims and replace the wrong "Pre-configured Fast/Bulk queues" / "Custom(name)" listing with the actual `Queue::Parallel` / `Queue::Serial(name)` enum.
- Updates docs/02-dlq.md to mark the "queue_name shows as default" warning as resolved (PR #9, released in v1.1.1, fixed it).

The struct fields themselves stay public for source compatibility with existing struct-literal construction. Per-job backoff customization needs upstream graphile_worker support and is deferred.
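
To make the fixed schedule concrete, here is a minimal sketch that tabulates the `exp(min(attempts, 10))` formula quoted above. It is not code from this commit: `retry_delay_secs` is a hypothetical helper written for illustration, and the sketch makes no claim about which `attempts` value graphile_worker sees on the first failure. It only evaluates the formula for each value.

```rust
/// Hypothetical helper (not part of backfill or graphile_worker):
/// seconds of delay produced by the formula `exp(min(attempts, 10))`.
fn retry_delay_secs(attempts: u32) -> f64 {
    // Clamp the exponent at 10 so the delay stops growing at e^10 seconds.
    f64::from(attempts.min(10)).exp()
}

fn main() {
    // Print the formula's output for a range of attempt counts, including
    // values past the clamp to show that the delay plateaus.
    for attempts in 0..=12u32 {
        let secs = retry_delay_secs(attempts);
        let hours = secs / 3600.0;
        println!("attempts={attempts:>2}  delay={secs:>9.1}s  ({hours:.2}h)");
    }
}
```

The delay stops growing once `attempts` reaches 10, at e^10 ≈ 22026 seconds (about 6.1 hours), which is the per-retry cap the updated docs refer to. Nothing in the formula depends on the policy, so presets can only change how many rows of this table a job walks through.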
diff --git a/README.md b/README.md
@@ -1,7 +1,6 @@
 # backfill
 
 [![CI](https://github.com/ceejbot/backfill/workflows/CI/badge.svg)](https://github.com/ceejbot/backfill/actions)
-[![Coverage](https://img.shields.io/badge/coverage-64.67%25-yellow)](https://github.com/ceejbot/backfill/actions)
 [![Security](https://github.com/ceejbot/backfill/actions/workflows/security.yml/badge.svg)](https://github.com/ceejbot/backfill/actions/workflows/security.yml)
 
 A boringly-named priority queue system for doing async work. This library and work process wrap the [graphile_worker crate](https://lib.rs/crates/graphile_worker) to do things the way I want to do them. It's unlikely you'll want to do things exactly this way, but perhaps you can learn by reading the code, or get a jumpstart by borrowing open-source code, or heck, maybe this will do what you need.
@@ -10,37 +9,70 @@ A boringly-named priority queue system for doing async work. This library and wo
 
 This is a postgres-backed async work queue library that is a set of conveniences and features on top of the rust port of Graphile Worker. It gives you a library you can integrate with your own project to handle background tasks.
 
-> **Status**: Core features are complete and tested (64.67%% test coverage, 55 tests). The library is suitable for production use for job enqueueing, worker processing, and DLQ management. The Admin API (feature-gated) is experimental. See [CHANGELOG.md](CHANGELOG.md) for details and [Known Limitations](docs/02-dlq.md#known-limitations).
+> **Status**: Core features are complete and covered by an integration test
+> suite. The library is suitable for production use for job enqueueing,
+> worker processing, and DLQ management. The Admin API (feature-gated) is
+> experimental. See [CHANGELOG.md](CHANGELOG.md) for details and
+> [Known Limitations](docs/02-dlq.md#known-limitations).
 
 ### What's New Over graphile_worker
 
-Built on top of `graphile_worker` (v0.8.6), backfill adds these production-ready features:
-
-- 🎯 **Priority System** - Six-level priority queue (EMERGENCY to BULK_LOWEST) with numeric priority values
-- 📦 **Named Queues** - Pre-configured Fast/Bulk queues plus custom queue support
-- 🔄 **Smart Retry Policies** - Exponential backoff with jitter (fast/aggressive/conservative presets)
-- 💀 **Dead Letter Queue (DLQ)** - Automatic failed job handling with query/requeue/deletion APIs
-- 📊 **Comprehensive Metrics** - Prometheus-compatible metrics for jobs, DLQ, and database operations
-- 🛠️ **High-Level Client API** - `BackfillClient` with ergonomic enqueueing helpers
-- 🏃 **Flexible Worker Patterns** - `WorkerRunner` supporting tokio::select!, background tasks, and one-shot processing
-- 🔧 **Admin API** - Optional Axum router for HTTP-based job management (experimental)
-- 📝 **Convenience Functions** - `enqueue_fast()`, `enqueue_bulk()`, `enqueue_critical()`, etc.
-- 🧹 **Stale Lock Cleanup** - Automatic cleanup of orphaned locks from crashed workers (startup + periodic)
+Built on top of `graphile_worker` (v0.11.x), backfill adds these production-ready features:
+
+- 🎯 **Priority System** — Six-level priority enum (EMERGENCY=-20 down through
+  BULK_LOWEST=10), mapped through to graphile_worker's `priority asc` fetch
+  ordering.
+- 📦 **Parallel + Serial queues** — `Queue::Parallel` (default, jobs run
+  concurrently across workers) or `Queue::Serial(name)` (one-job-at-a-time
+  per named queue, for rate limiting or per-entity ordering).
+- 🔄 **Retry policy presets** — `fast` / `aggressive` / `conservative`
+  presets that differ in `max_attempts`. Note: backoff *timing* is fixed by
+  graphile_worker (see [`docs/02-dlq.md`](docs/02-dlq.md)) — only the
+  attempt count is configurable.
+- 💀 **Dead Letter Queue (DLQ)** — Automatic failed-job handling with
+  query/requeue/deletion APIs. Includes a permanent-failure short-circuit
+  plugin: handlers that return non-retryable `WorkerError` variants land in
+  the DLQ on the first failure rather than waiting for `max_attempts` to
+  exhaust.
+- 📊 **Comprehensive Metrics** — Prometheus-compatible metrics for jobs,
+  DLQ, and database operations.
+- 🛠️ **High-Level Client API** — `BackfillClient` with ergonomic enqueueing
+  helpers.
+- 🏃 **Flexible Worker Patterns** — `WorkerRunner` supporting
+  `tokio::select!`, background tasks, and one-shot processing.
+- 🔧 **Admin API** — Optional Axum router for HTTP-based job management
+  (experimental).
+- 📝 **Convenience Functions** — `enqueue_fast()`, `enqueue_bulk()`,
+  `enqueue_critical()`, etc.
+- 🧹 **Stale Lock Cleanup** — Automatic cleanup of orphaned locks from
+  crashed workers (startup + periodic). Ordered correctly with the DLQ
+  scanner so failed jobs aren't lost across restarts.
 
 All built on graphile_worker's rock-solid foundation of PostgreSQL SKIP LOCKED and LISTEN/NOTIFY.
 
 ### Features
 
-- **Priority queues**: EMERGENCY, FAST_HIGH, FAST_DEFAULT, BULK_DEFAULT, BULK_LOW, BULK_LOWEST
-- **Named queues**: Fast, Bulk, DeadLetter, Custom(name)
-- **Scheduling**: Immediate or delayed execution with `run_at`
-- **Idempotency**: Use `job_key` for deduplication
-- **Exponential backoff**: Built-in retry policies with jitter to prevent thundering herds
-- **Dead letter queue**: Handling jobs that experience un-retryable failures or exceed their retry limits
-- **Error handling**: Automatic retry classification
-- **Metrics**: Comprehensive metrics via the `metrics` crate - bring your own exporter (Prometheus, StatsD, etc.)
-- **Monitoring**: Structured logging and tracing throughout
-- **Building blocks for an axum admin api**: via a router you can mount on your own axum api server
+- **Priority queues**: EMERGENCY (-20), FAST_HIGH (-10), FAST_DEFAULT (-5),
+  BULK_DEFAULT (0), BULK_LOW (5), BULK_LOWEST (10) — lower number = higher
+  priority.
+- **Queue types**: `Queue::Parallel` (default), `Queue::Serial(name)` — plus
+  `Queue::serial_for(entity, id)` for per-entity ordering.
+- **Scheduling**: Immediate or delayed execution with `run_at`.
+- **Idempotency**: Use `job_key` for deduplication.
+- **Retries**: Configurable `max_attempts` per job; graphile_worker handles
+  the exponential-backoff schedule (`exp(min(attempts, 10))` seconds, capped
+  at ~6h per retry).
+- **Dead letter queue**: Automatic capture of jobs that exceed their retry
+  limits or return non-retryable errors. Includes a synchronous startup
+  pre-move so DLQ doesn't lose jobs across worker restarts.
+- **Error classification**: `WorkerError` variants split into retryable and
+  non-retryable; non-retryable errors short-circuit retries to DLQ via an
+  auto-registered lifecycle plugin.
+- **Metrics**: Comprehensive metrics via the `metrics` crate — bring your
+  own exporter (Prometheus, StatsD, etc.).
+- **Monitoring**: Structured logging and tracing throughout.
+- **Building blocks for an axum admin api**: via a router you can mount on
+  your own axum api server.
 
 Look at the `examples/` directory and the readme there for practical usage examples.
 
diff --git a/docs/02-dlq.md b/docs/02-dlq.md
@@ -880,16 +880,21 @@ The DLQ system is fully functional for production use, but has a few known limit
 
 ### 1. Queue Name Tracking
 
-**Issue**: DLQ entries may show `queue_name` as `"default"` even if the job ran in a different queue (e.g., "fast" or "bulk").
+**Status**: Fixed in v1.1.1 (PR #9).
 
-**Cause**: The GraphileWorker `Job` struct doesn't expose the queue name field, so when jobs are moved to the DLQ, the queue name defaults to `"default"`.
+DLQ entries now correctly preserve the queue type of the original job:
 
-**Workarounds**:
-- The `task_identifier` field is always accurate and can be used for filtering
-- Job priority is preserved, which often correlates with queue assignment
-- For critical workflows, track queue assignment in your application logs or metrics
+- Parallel jobs (`Queue::Parallel`) are stored with an empty `queue_name`.
+  When requeued, they go back to `Queue::Parallel` — concurrent execution
+  is preserved.
+- Serial jobs (`Queue::Serial(name)`) are stored with their queue name.
+  When requeued, they go back to `Queue::Serial(name)` — single-job-at-a-
+  time semantics are preserved.
 
-**Future**: This will be resolved when GraphileWorker exposes queue_name on the Job struct, or when we implement direct database querying.
+The DDL retains a `DEFAULT 'default'` clause for the `queue_name` column for
+schema compatibility, but it is never used by the production code path —
+`add_to_dlq` and `process_failed_jobs` always pass an explicit value
+(possibly the empty string for parallel jobs).
 
 ### 2. Payload Visibility
 
@@ -929,7 +934,7 @@ These limitations are minor and don't affect the core DLQ functionality:
 - ✅ **Error message capture** - Fully functional
 - ✅ **Requeuing workflows** - Production-ready
 - ✅ **Statistics and monitoring** - Complete
-- ⚠️ **Queue name tracking** - Shows "default" for all queues
+- ✅ **Queue name tracking** - Fixed in v1.1.1 (parallel ↔ serial round-trip preserved)
 - ⚠️ **Payload inspection** - Requires direct DB access
 - ⚠️ **Job cancellation** - Not yet implemented
 
diff --git a/examples/enqueue_jobs.rs b/examples/enqueue_jobs.rs
@@ -248,14 +248,15 @@ async fn main() -> Result<(), Box<dyn std::error::Error>> {
         outcome.expect("outcome should contain a job").id()
     );
 
-    // Enqueue a job with custom retry policy
-    let custom_retry_policy = RetryPolicy::new(
-        6,                                     // 6 attempts
-        std::time::Duration::from_millis(500), // Start with 500ms
-        std::time::Duration::from_secs(60),    // Cap at 1 minute
-        1.8,                                   // 1.8x multiplier
-    )
-    .with_jitter(0.2); // 20% jitter
+    // Enqueue a job with custom retry settings.
+    // Note: only `max_attempts` affects runtime behaviour. graphile_worker
+    // schedules retries via a fixed `exp(min(attempts, 10))` second formula,
+    // so the timing fields on RetryPolicy are not honored. See the docs on
+    // `RetryPolicy` for the full story.
+    let custom_retry_policy = RetryPolicy {
+        max_attempts: 6,
+        ..Default::default()
+    };
 
     let custom_job = GenerateReportJob {
         report_type: "analytics_summary".to_string(),
diff --git a/src/lib.rs b/src/lib.rs
@@ -9,8 +9,14 @@
 //! - **Parallel execution** by default - jobs run concurrently across all
 //!   workers
 //! - **Serial queues** when you need ordering or rate limiting
-//! - **Exponential backoff** with jitter to prevent thundering herds
-//! - **Flexible retry policies** (fast, aggressive, conservative, or custom)
+//! - **Exponential backoff retries** via graphile_worker (timing is
+//!   `exp(min(attempts, 10))` seconds; fixed, not per-job tunable — see
+//!   [`RetryPolicy`])
+//! - **Configurable max-attempts per job** with `fast`, `aggressive`, and
+//!   `conservative` presets
+//! - **Permanent-failure short-circuit** — non-retryable `WorkerError`
+//!   variants land in the DLQ on the first failure instead of waiting for
+//!   `max_attempts` to exhaust
 //! - **Dead letter queue** handling for failed jobs
 //! - **Type-safe job handlers** using Rust's type system
 //! - **Low-latency execution** via PostgreSQL LISTEN/NOTIFY
@@ -362,46 +368,70 @@ impl Default for JobSpec {
 }
 
 impl JobSpec {
-    /// Create a JobSpec with exponential backoff retry policy
+    /// Attach a [`RetryPolicy`] to this JobSpec.
+    ///
+    /// Sets `max_attempts` from the policy. Note that backoff timing fields
+    /// on the policy are stored but not honored at runtime — see
+    /// [`RetryPolicy`].
     pub fn with_retry_policy(mut self, retry_policy: RetryPolicy) -> Self {
         self.max_attempts = Some(retry_policy.max_attempts);
         self.retry_policy = Some(retry_policy);
         self
     }
 
-    /// Create a JobSpec optimized for fast retries
+    /// Configure for the `fast` preset: `max_attempts = 3`.
+    ///
+    /// In practice this differs from [`with_aggressive_retries`] and
+    /// [`with_conservative_retries`] only in the attempt count — see
+    /// [`RetryPolicy`].
     pub fn with_fast_retries(mut self) -> Self {
         let policy = RetryPolicy::fast();
         self.max_attempts = Some(policy.max_attempts);
         self.retry_policy = Some(policy);
         self
     }
 
-    /// Create a JobSpec optimized for aggressive retries
+    /// Configure for the `aggressive` preset: `max_attempts = 12`.
+    ///
+    /// At graphile_worker's fixed exp-backoff schedule, 12 attempts gives
+    /// roughly half a day of cumulative retry coverage before DLQ.
     pub fn with_aggressive_retries(mut self) -> Self {
         let policy = RetryPolicy::aggressive();
         self.max_attempts = Some(policy.max_attempts);
         self.retry_policy = Some(policy);
         self
     }
 
-    /// Create a JobSpec optimized for conservative retries
+    /// Configure for the `conservative` preset: `max_attempts = 5`.
     pub fn with_conservative_retries(mut self) -> Self {
         let policy = RetryPolicy::conservative();
         self.max_attempts = Some(policy.max_attempts);
         self.retry_policy = Some(policy);
         self
     }
 
-    /// Get the effective retry policy (returns default if none specified)
+    /// Get the effective retry policy (returns default if none specified).
+    ///
+    /// **Note:** Only `max_attempts` from the returned policy reaches
+    /// graphile_worker. See [`RetryPolicy`] for details.
     pub fn effective_retry_policy(&self) -> RetryPolicy {
         self.retry_policy.clone().unwrap_or_default()
     }
 
-    /// Calculate the next retry time for a failed job
+    /// Calculate what the next retry time *would* be under this spec's
+    /// policy.
+    ///
+    /// **Not used at runtime.** graphile_worker schedules retries via a
+    /// fixed SQL formula. This method is preserved as a utility but has no
+    /// effect on actual job behaviour.
+    #[deprecated(
+        since = "1.2.0",
+        note = "graphile_worker computes retry timing in SQL and ignores this method. Returns a value but has no runtime effect."
+    )]
     pub fn calculate_retry_time(&self, attempt: i32, failed_at: DateTime<Utc>) -> Option<DateTime<Utc>> {
         let policy = self.effective_retry_policy();
         if policy.should_retry(attempt) {
+            #[allow(deprecated)]
             Some(policy.calculate_retry_time(attempt, failed_at))
         } else {
             None // No more retries
@@ -504,10 +534,12 @@ where
     client.enqueue(task_identifier, payload, spec).await
 }
 
-/// Enqueue a high-priority job with fast exponential backoff retries.
+/// Enqueue a high-priority job configured for a low retry count (3 attempts).
 ///
-/// Best for high-priority jobs that need quick retries (3 attempts, 100ms-30s
-/// delays).
+/// Use for jobs where rapid failure-to-DLQ is preferred over many retries.
+/// graphile_worker's retry timing is fixed at `exp(min(attempts, 10))` seconds
+/// regardless of policy — see [`RetryPolicy`] — so the only difference between
+/// this and other `_with_retries` helpers is the attempt cap.
 pub async fn enqueue_fast_with_retries<T>(
     client: &BackfillClient,
     task_identifier: &str,
@@ -527,9 +559,12 @@ where
     client.enqueue(task_identifier, payload, spec).await
 }
 
-/// Enqueue a critical job with aggressive exponential backoff retries.
+/// Enqueue a critical job with a high retry count (12 attempts).
 ///
-/// Best for critical jobs that must succeed (12 attempts, up to 4 hour delays).
+/// Use for jobs that must eventually succeed if at all possible. graphile_worker
+/// retries on a fixed `exp(min(attempts, 10))` second schedule, capping at
+/// ~6h per retry — so 12 attempts gives roughly half a day of total retry
+/// coverage. See [`RetryPolicy`] for the full timing.
 pub async fn enqueue_critical<T>(
     client: &BackfillClient,
     task_identifier: &str,
@@ -549,10 +584,13 @@ where
     client.enqueue(task_identifier, payload, spec).await
 }
 
-/// Enqueue a bulk job with conservative exponential backoff retries.
+/// Enqueue a bulk job with a moderate retry count (5 attempts via the
+/// `conservative` preset).
 ///
-/// Best for background jobs where consistency matters more than speed
-/// (8 attempts, 1 min - 8 hour delays).
+/// Use for background jobs that should be retried but where you don't want
+/// many attempts. graphile_worker's retry timing is `exp(min(attempts, 10))`
+/// seconds — see [`RetryPolicy`] — so this gives roughly 1s, 3s, 7s, 20s,
+/// 55s before the job lands in DLQ.
 pub async fn enqueue_bulk_with_retries<T>(
     client: &BackfillClient,
     task_identifier: &str,
diff --git a/src/retries.rs b/src/retries.rs