Skip to content

Cap $tries on queued jobs to stop unbounded retry on --tries=0 workers#4353

Open
JoshSalway wants to merge 12 commits into
SpartnerNL:3.1from
JoshSalway:fix/job-tries-unbounded-loops
Open

Cap $tries on queued jobs to stop unbounded retry on --tries=0 workers#4353
JoshSalway wants to merge 12 commits into
SpartnerNL:3.1from
JoshSalway:fix/job-tries-unbounded-loops

Conversation

@JoshSalway
Copy link
Copy Markdown

@JoshSalway JoshSalway commented Apr 20, 2026

What

Caps $tries on the nine queued jobs that currently inherit it from queue:work --tries=N. On default-worker deployments (--tries=1) this is a no-op; on --tries=0 deployments (e.g. VaporWorkCommand's default) it stops unbounded retry loops.

Nine files, ~25 insertions, 1 deletion. No new test file: existing integration tests in tests/QueuedImportTest.php and tests/QueuedExportTest.php already cover these jobs through full dispatch cycles.

Job $tries
AfterImportJob 10
AppendDataToSheet 1
AppendPaginatedToSheet 1
AppendQueryToSheet 1
AppendViewToSheet 1
CloseSheet 1
QueueExport 1
StoreQueuedExport 1
ReadChunk $import->tries ?? 1 (fallback only)

Not touched: QueueImport (empty handle() body, retries accomplish nothing).

Untyped public $tries = 1; used to match the composer.json PHP constraint ^7.0||^8.0.

Why these values

$tries = 1 on the seven single-run jobs matches Laravel's own queue:work --tries=1 default. Default deployments see zero behaviour change; --tries=0 deployments get capped at 1. If the maintainers prefer a resilience-over-fail-fast tradeoff ($tries = 3 with $backoff), happy to switch. My read is that fail-fast is safer because these jobs write chunk data to a shared temp file and idempotency on partial-write failure is not obvious.

$tries = 10 on AfterImportJob because its handle() uses $this->release($this->interval) to poll for ReadChunk dependencies. Each release increments the attempt counter. $tries = 1 would fail the job on the first release before any dependency completes. 10 attempts at the 60s default interval caps the polling at ~10 minutes.

$tries = $import->tries ?? 1 on ReadChunk preserves user-set values on the Import class while falling back to the Laravel default instead of null. User settings still win.

Verified

  • php -l clean on all edited files (PHP 8.4.19)
  • composer.json PHP constraint ^7.0||^8.0 respected (untyped properties)
  • StyleCI preset laravel + ordered_class_elements respected
  • queue:work --tries=1 default verified on laravel/framework 13.x Illuminate/Queue/Console/WorkCommand.php signature
  • VaporWorkCommand's --tries=0 signature is long-standing and well-documented in the wider Laravel queue-safety discussion

Context

Part of a Laravel queue-safety audit: https://joshsalway.com/articles/improving-queue-safety-in-laravel.

Four queued jobs in this package ship without a $tries property,
inheriting the worker-level default. That works fine for typical
runs, but produces the failure mode described in SpartnerNL#2769, SpartnerNL#2596,
SpartnerNL#3948, SpartnerNL#2779, SpartnerNL#3962, SpartnerNL#4203, and SpartnerNL#4328: a stall or OOM during
import/export triggers the worker to re-dispatch the job, which
stalls or OOMs again, repeating until the worker process dies or
the queue backlog is cleared manually.

AfterImportJob is the most severe case: its handle() uses
$this->release($this->interval) to poll for dependent ReadChunk
completion. Each release increments the attempts counter; without
$tries, there is no ceiling, so a failed dependency or missing
cache entry keeps the poll alive indefinitely at 60-second
intervals.

Changes:
- AfterImportJob: $tries = 10 (caps the 60s polling at ~10 minutes)
- AppendDataToSheet: $tries = 5
- AppendPaginatedToSheet: $tries = 5
- AppendQueryToSheet: $tries = 5

Users who already set $tries on their Import class are unaffected,
since job-level $tries is overridden by the payload resolution path
already in use for ReadChunk/QueueImport. These four jobs were the
only ones in src/Jobs that had no fallback.
Previous docblock said 'keeps polling indefinitely' — that is only
true when the worker runs with --tries=0 (unlimited). Laravel's
default queue:work signature is --tries=1, which would fail
AfterImportJob after a single release before dependencies complete.
Updated wording describes both cases correctly.
After reviewing the referenced issues in the tracker, the retry-storm
failure mode the original commits cited isn't actually the pattern
users are reporting. The reports are about PhpSpreadsheet memory
inefficiency (SpartnerNL#3962, SpartnerNL#4203), queue worker execution timeouts
(SpartnerNL#2596, SpartnerNL#3948), and unrelated architectural issues (SpartnerNL#4328).

Users hitting those reports typically want MORE attempts on the
Append* jobs, not fewer. Setting $tries=5 on them could regress
users whose current queue:work --tries=N gives them headroom.

The only case with a genuinely defensible mechanism is
AfterImportJob, which uses $this->release($this->interval) to poll
for ReadChunk completion. Without a job-level $tries, that loop has
no ceiling on workers started with --tries=0. Keeping that fix.

Three Append* changes reverted. One file, one property remains.
Expanded from AfterImportJob-only to seven export-pipeline jobs that
currently ship with no queue-safety configuration. Each gets
$tries=5 with $backoff=[30,60,300] to match the pattern that
ReadChunk and QueueImport already use (where they pull from the
user's Import class).

Patched:
- AfterImportJob: $tries=10, no $backoff (its release($interval)
  polling sets its own timing; adding $backoff would be redundant)
- AppendDataToSheet, AppendPaginatedToSheet, AppendQueryToSheet,
  AppendViewToSheet: $tries=5, $backoff=[30,60,300]
- CloseSheet, QueueExport, StoreQueuedExport: same

Not touched:
- ReadChunk: already pulls $tries, $backoff, $timeout,
  $maxExceptions from the user's Import class
- QueueImport: pulls $tries from Import class and has an empty
  handle() body, so retries would do nothing anyway

Untyped properties used throughout to match the composer.json
PHP constraint (^7.0||^8.0) — typed properties require 7.4+.
@JoshSalway JoshSalway changed the title Cap queued job attempts to avoid unbounded retry loops Add safe retry defaults to unconfigured queued jobs Apr 20, 2026
ReadChunk resolves $tries, $backoff, $timeout, and $maxExceptions
from the user's Import class at construction time. If the Import
class doesn't set them, these currently resolve to null, meaning
the worker falls back to whatever queue:work --tries=N and default
backoff (0) are.

This changes two of them to explicit fallbacks that match what a
fresh Laravel install does:

- $tries: null -> 1 (same as queue:work --tries default)
- $backoff: null -> [30, 60, 300] (affects spacing only, not count)

Effect by scenario:

- User sets Import::$tries = 5: unchanged (Import still wins)
- User runs queue:work (default --tries=1): unchanged (1 either way)
- User runs queue:work --tries=0 without Import setting: now bounded
  at 1 attempt instead of unlimited retries on failure
- User runs queue:work --tries=5 without Import setting: now bounded
  at 1 instead of 5. Can restore by setting $tries on the Import class.

$timeout and $maxExceptions left as null-fallbacks because worker
--timeout=60 default is already tight, and $maxExceptions default
behaviour is correct when $tries alone acts as the ceiling.
@JoshSalway JoshSalway changed the title Add safe retry defaults to unconfigured queued jobs Add safe retry defaults and Import-class fallbacks to all queued jobs Apr 20, 2026
Three unit tests verify the fallback behaviour documented in the PR:

- test_readchunk_falls_back_to_tries_one_when_import_has_no_retry_config
  Import class without $tries -> ReadChunk.tries = 1

- test_readchunk_falls_back_to_exponential_backoff_when_import_has_no_retry_config
  Import class without $backoff -> ReadChunk.backoff = [30, 60, 300]

- test_readchunk_preserves_user_tries_set_on_import
  Import class with $tries = 7 -> ReadChunk.tries = 7 (user wins)

Construct ReadChunk directly with anonymous-class Import instances and
mocked IReader/TemporaryFile dependencies to avoid the DB-backed test
path needed for the existing queue propagation tests.
The .phpunit.cache/test-results file was inadvertently committed when
running the new ReadChunkDefaultsTest locally. Reverting to match
origin/3.1. Also removed a stray blank line at the end of
QueuedImportTest.php.
StyleCI flagged the original ordering (private helpers before public
test methods). Moved the three test methods to the top of the class
and left the helpers below, matching the visibility ordering the
laravel preset enforces (public -> private).

No test logic changed. All 3 tests still pass locally:
  Tests: 3, Assertions: 3.
Changes the seven single-run jobs from $tries=5 with exponential backoff
to $tries=1 with no backoff. Keeps AfterImportJob at $tries=10 because
its polling pattern genuinely needs the headroom.

Rationale:
- $tries=1 matches Laravel's queue:work --tries default. Default-worker
  deployments see zero behaviour change.
- Still caps --tries=0 (Vapor-pattern) deployments at 1 attempt instead
  of unlimited.
- Append*/Close/Queue/Store jobs write to temp files and their
  idempotency on partial-write failure is not obvious. Fail-fast on the
  first error avoids any risk of double-written rows or corrupt output.
- ReadChunk fallback ($tries ?? 1, $backoff ?? [30, 60, 300]) kept as-is:
  users who explicitly set $tries on their Import class still get
  properly-spaced retries via the $backoff fallback.
- AfterImportJob keeps $tries=10 because its handle() uses
  $this->release($interval) to poll for ReadChunk completion. Each
  release increments the attempt counter; $tries=1 would fail the job
  on the first release before any dependency completes.
@JoshSalway JoshSalway changed the title Add safe retry defaults and Import-class fallbacks to all queued jobs Add conservative $tries defaults to unconfigured queued jobs Apr 21, 2026
Dropped the $import->backoff fallback to [30, 60, 300] on ReadChunk.
If the user sets $tries on their Import class but no $backoff, they
now get the framework default (0s between retries) again. If they
want spacing they can set it on their Import class.

Rationale: this PR's narrative for the seven single-run jobs is
'cap runaway attempts, don't impose retry-spacing opinions.' Keeping
a backoff fallback on ReadChunk contradicted that for a narrow case
(user opted into tries>1 without setting backoff). Follow-up PR can
add it if maintainers want it.

Dropped the accompanying test. Two tests remain:
- test_readchunk_falls_back_to_tries_one_when_import_has_no_tries
- test_readchunk_preserves_user_tries_set_on_import

Tests: 2, Assertions: 2.
@JoshSalway JoshSalway changed the title Add conservative $tries defaults to unconfigured queued jobs Cap $tries on queued jobs to stop unbounded retry on --tries=0 workers Apr 21, 2026
Removed the long inline rationale; the same information now lives in
the PR body where it is easier to maintain. One-line docblock points
at the load-bearing fact (each release() counts as an attempt).
The tests verified that PHP's null-coalescing operator returns the
fallback when the operand is null and the operand otherwise. That is
a language guarantee, not library behaviour. The existing integration
tests in QueuedImportTest.php already cover the jobs through real
dispatch cycles (see test_can_define_max_exceptions_property_on_queued_import
for the pattern). Dropping the unit file to keep the PR focused on
the one-line property changes.
@JoshSalway JoshSalway marked this pull request as ready for review April 21, 2026 23:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant