|
| 1 | +# Operations |
| 2 | + |
| 3 | +A field guide for running the queue in production. This is not a tutorial — it's a checklist of things that bite, with the configuration knob or admin-UI lever next to each one. |
| 4 | + |
| 5 | +## Sizing workers |
| 6 | + |
| 7 | +The right number of concurrent workers is bounded by: |
| 8 | + |
| 9 | +- the slowest task you run (long-running tasks block the worker slot they occupy until completion); |
| 10 | +- the database connection pool (each worker holds one connection); |
| 11 | +- the I/O / CPU profile of your tasks (mailer-heavy: I/O-bound, can over-subscribe cores; image processing: CPU-bound, stay at or below `nproc`). |
| 12 | + |
| 13 | +A sensible starting point: one worker per available CPU core for CPU-bound workloads, two to four times that for I/O-bound workloads. Watch the admin dashboard's "queue length over time" graph for the first week and adjust. |
| 14 | + |
| 15 | +### Per-task concurrency limits |
| 16 | + |
| 17 | +If a single task class can saturate workers (e.g. a heavy report generator), cap it with the task's `$rate` / `$timeout` properties or with a per-task `Configure::write('Queue.taskTimeout.MyTask', ...)` override. See [Configuration](/guide/configuration). |
| 18 | + |
| 19 | +## Worker lifetime and restart cadence |
| 20 | + |
| 21 | +Workers should be designed to exit and respawn periodically — this is the simplest defense against memory leaks, accumulated state, and any in-process cache that drifts from the database (e.g. a freshly migrated column that an older worker can't see yet — see the `captureOutput` regression that landed mid-2026). |
| 22 | + |
| 23 | +- **`workerLifetime`** (`Configure::write('Queue.workerLifetime', 3600)`) — exit cleanly after N seconds. Process supervisor brings up a fresh worker. Default 0 means run forever; for production use a bounded value. |
| 24 | +- **`workerRetry`** — number of retries on transient errors before a job is marked failed. Match this to your task idempotency story. |
| 25 | +- **`exitwhennothingtodo`** — when set, the worker exits on its first empty poll. Suitable for cron-managed workers; leave off for long-running supervisor-managed processes. |
| 26 | + |
| 27 | +### Recommended supervisor / systemd unit |
| 28 | + |
| 29 | +A systemd unit shape that survives normal failures: |
| 30 | + |
| 31 | +```ini |
| 32 | +# /etc/systemd/system/cakephp-queue.service |
| 33 | +[Unit] |
| 34 | +Description=CakePHP Queue worker |
| 35 | +After=network.target |
| 36 | + |
| 37 | +[Service] |
| 38 | +User=www-data |
| 39 | +WorkingDirectory=/var/www/app |
| 40 | +ExecStart=/var/www/app/bin/cake queue worker --verbose |
| 41 | +Restart=always |
| 42 | +RestartSec=5 |
| 43 | + |
| 44 | +[Install] |
| 45 | +WantedBy=multi-user.target |
| 46 | +``` |
| 47 | + |
| 48 | +Then `systemctl enable --now cakephp-queue.service`. The `Restart=always` plus a bounded `workerLifetime` means workers cycle on their own schedule and any crash recovers without manual intervention. |
| 49 | + |
| 50 | +For multiple parallel workers, use a `cakephp-queue@.service` template and `systemctl enable --now cakephp-queue@1.service cakephp-queue@2.service ...`. |
| 51 | + |
| 52 | +## Monitoring |
| 53 | + |
| 54 | +The admin dashboard surfaces: |
| 55 | + |
| 56 | +- `queue_processes` table — currently-running workers with last-seen timestamp. |
| 57 | +- `queued_jobs` table — pending + recent jobs grouped by task, with progress, failure-message, and exit status. |
| 58 | +- Per-task counters: queued / in-flight / failed / completed in the last 24h. |
| 59 | + |
| 60 | +For external monitoring (Prometheus, Datadog, etc.), poll the same tables. The cheapest signals worth alerting on: |
| 61 | + |
| 62 | +- **stale workers**: any row in `queue_processes` where `modified` is older than `defaultRequeueTimeout + 60s` is a worker that died without cleanup. The plugin auto-evicts these on next startup, but a persistent count > 0 means workers are crashing faster than they recover. |
| 63 | +- **backlog growth**: `count(*)` of `queued_jobs WHERE completed IS NULL AND fetched IS NULL`. A monotonically rising number means you're under-provisioned. |
| 64 | +- **per-task failure rate**: `count(*) WHERE failure_message IS NOT NULL AND created > now() - interval '1h'`, grouped by `job_task`. Alert when any task crosses your error budget. |
| 65 | + |
| 66 | +The admin dashboard's "tips" sidebar contains the same SQL for quick eyeballing. |
| 67 | + |
| 68 | +## Failure handling |
| 69 | + |
| 70 | +The queue retries failed jobs up to `Queue.workerRetry` times before marking them dead. Dead jobs sit in `queued_jobs` with `failure_message` populated and `failed` set — they aren't auto-purged. |
| 71 | + |
| 72 | +### Investigating a failure |
| 73 | + |
| 74 | +1. **Admin dashboard → Failed jobs view** sorts by most recent. Each row links to its full payload + stack trace. |
| 75 | +2. **`bin/cake queue clean`** moves completed and old failed jobs out; configure retention via `Queue.cleanuptimeout` (default 30 days). |
| 76 | +3. **`bin/cake queue retry <job-id>`** requeues a dead job. Useful for transient failures (network blip, downstream service down) after you've fixed the cause. |
| 77 | + |
| 78 | +### Dead-letter pattern |
| 79 | + |
| 80 | +There's no built-in DLQ table yet — failed jobs stay in `queued_jobs`. The current workaround: |
| 81 | + |
| 82 | +1. Set `Queue.cleanuptimeout` high enough (e.g. 90 days) that failed rows survive long enough to investigate. |
| 83 | +2. Filter the dashboard by `failed IS NOT NULL` for the DLQ view. |
| 84 | +3. Use `queue retry` to requeue, `queue remove` to drop. |
| 85 | + |
| 86 | +Issue tracker has the formal DLQ feature on the roadmap; see `Per-task circuit breaker / dead-letter status` in the strategic items section of the [merged plugin review](https://github.com/dereuromark/cakephp-queue/issues). |
| 87 | + |
| 88 | +## Schema migrations on live workers |
| 89 | + |
| 90 | +Two cases worth knowing: |
| 91 | + |
| 92 | +- **New column on `queued_jobs`** (e.g. when you migrate to a version that adds an `output` column): the `captureOutput()` auto-detect path re-runs the schema check on every job since the mid-2026 fix, so the new column takes effect on the next dispatched job. No worker restart required. |
| 93 | +- **Anything more invasive** (dropped column, renamed table, FK changes): use a rolling restart. Bring workers down via `systemctl stop cakephp-queue@*`, run migrations, bring them back. Jobs in-flight at restart time fall under the standard `defaultRequeueTimeout` (180s default) — they get re-picked-up by a fresh worker if not finished. |
| 94 | + |
| 95 | +## Multi-server deployments |
| 96 | + |
| 97 | +If two app servers run cron-triggered workers against the same queue: |
| 98 | + |
| 99 | +- Workers coordinate via the `queue_processes` table heartbeat + per-row `workerkey` claim, so there's no double-execution risk at the job level. |
| 100 | +- BUT: if you also run `cakephp-queue-scheduler` on multiple hosts, the bundled `FileLock` is single-host by design. See the queue-scheduler operations guide for a multi-host advisory lock recipe. |
| 101 | +- Run `bin/cake queue clean` from exactly **one** host (cron-driven, daily off-peak). Two simultaneous cleans don't corrupt anything but waste DB cycles. |
| 102 | + |
| 103 | +## Common pitfalls |
| 104 | + |
| 105 | +- **Tasks that swallow exceptions**: a task's `run()` method that catches `Throwable` and returns normally counts as success — even when the underlying work failed. If you must catch, re-throw `QueueException` to surface the failure to the worker. |
| 106 | +- **Long sleeps inside `run()`**: anything that blocks longer than `defaultRequeueTimeout` will be considered abandoned and re-queued. Either chunk the work into smaller jobs or raise the per-task `$timeout`. |
| 107 | +- **Worker started in CLI environment without DB access**: `bin/cake queue worker` reads the active CakePHP `Database.default` connection. If your CLI shell has a different DSN than your web shell, workers can't see jobs your web app created. Use the same `app_local.php` everywhere. |
| 108 | +- **`captureOutput`-cached-stale**: fixed mid-2026 by re-checking schema per call. If you observe missing output on a long-running worker, restart it as a fallback. Migration ordering is also worth checking — add the column before deploying the code that writes to it. |
| 109 | + |
| 110 | +## See also |
| 111 | + |
| 112 | +- [Configuration](/guide/configuration) — every Configure key the worker reads at runtime. |
| 113 | +- [Multi-Connection](/guide/multi-connection) — running queues against multiple databases. |
| 114 | +- [Tips](/reference/tips) — debugging notes and one-off SQL queries for the queued_jobs table. |
0 commit comments