Skip to content

Add retry with backoff for failed publish and update cron workers#182

Open
pfefferle wants to merge 4 commits into
trunkfrom
add/publish-retry-backoff
Open

Add retry with backoff for failed publish and update cron workers#182
pfefferle wants to merge 4 commits into
trunkfrom
add/publish-retry-backoff

Conversation

@pfefferle

Copy link
Copy Markdown
Member

Fixes #179
Fixes #181

Proposed changes:

  • Retry with backoff for failed publish/update cron workers (Retry/backoff for failed post-publish cron events #179). Publishing is deferred to one-shot wp_schedule_single_event cron events; a single transient PDS failure (429/5xx/timeout) or missed cron tick previously dropped the post silently — the post showed "Published" in WordPress but never reached Bluesky (reported on wp.org as "inconsistent" publishing). Failed atmosphere_publish_post / atmosphere_update_post workers now re-queue themselves with an escalating 60s/5m/15m ladder, tracked in _atmosphere_publish_retries post meta and cleared on success, permanent rejection, exhaustion, or a fresh save. Mirrors the existing comment parent-defer pattern and the ActivityPub dispatcher's filterable-retry convention.
  • Transient-vs-permanent classification. Retry-by-default with a permanent blocklist: local preconditions (not_connected, needs_reauth, decrypt, did_mismatch, …) and deterministic PDS 4xx are never retried; no-status network failures, 408, 429, and 5xx are. atmosphere_thread_rollback_failed is explicitly permanent — retrying after a failed rollback would mint fresh TIDs next to live orphan records and publish a duplicate copy.
  • New atmosphere_publish_retry_delays filter. One entry per retry, so the ladder's length is also the attempt budget — an empty array disables retries, a longer array raises the cap. Documented in the hook reference.
  • Fix: disconnect/uninstall cron cleanup missed arg-carrying events (Disconnect/uninstall cron cleanup misses events scheduled with args #181). wp_clear_scheduled_hook( $hook ) with no args only removes events scheduled with an empty args array, so every queued per-post event and the revoke event (encrypted token payload in its args) survived clear_scheduled_hooks() — the exact events the cleanup exists to stop from firing against a different connection. Both helpers now use wp_unschedule_hook().

Other information:

  • Have you written new tests for your changes, if applicable?

Reviewed pre-PR by three parallel review passes (correctness/WP-Cron races, conventions/test hygiene, edge cases/abuse); all blocking findings addressed in the last commit (duplicate-post guard, stranded-counter reset, uninstall sweep).

Testing instructions:

  • Start the environment: npm run env-start, connect a Bluesky account, and enable auto-publish.
  • Simulate a transient PDS failure: add_filter( 'atmosphere_pre_apply_writes', fn() => new \WP_Error( 'atmosphere_pds', 'boom', array( 'status' => 500 ) ) ); (e.g. via a mu-plugin), then publish a post.
  • Check wp cron event list (or WP Crontrol): an atmosphere_publish_post event for the post is queued ~60s out; after it fails again, ~5m; then ~15m; then it stops. The post's _atmosphere_publish_retries meta tracks the attempt count and is removed at the end.
  • Remove the failure filter before the next retry fires: the post publishes and the counter is cleared.
  • Repeat with array( 'status' => 400 ): no retry is scheduled.
  • For Disconnect/uninstall cron cleanup misses events scheduled with args #181: publish a post so a per-post event is queued (or schedule one manually), then disconnect the account on the settings page and confirm wp cron event list shows no remaining atmosphere_* events.
  • Run npm run env-test — 848 tests pass, including 10 new ones covering the ladder, the filter, the rollback guard, the counter reset, and the arg-carrying cleanup.

pfefferle added 4 commits July 4, 2026 14:18
A transient PDS failure (429/5xx/timeout) in the one-shot
atmosphere_publish_post / atmosphere_update_post cron events previously
dropped the post silently. Re-queue the same hook with an escalating
60s/5m/15m ladder, tracked in post meta and cleared on success,
permanent rejection, or exhaustion. Mirrors the comment parent-defer
pattern.

Fixes #179
Mirror the ActivityPub dispatcher's filterable retry knobs with a
single atmosphere_publish_retry_delays filter: one entry per retry,
so the ladder's length is also the attempt budget — an empty array
disables retries, a longer one raises the cap. One knob instead of
two keeps the delay schedule and the max-attempts count from
contradicting each other.

Part of #179
wp_clear_scheduled_hook() with no args only removes events whose args
are an empty array, so every queued per-post event ([ post_id ]) and
the revoke event (encrypted token payload in its args) survived
clear_scheduled_hooks() — the exact events the cleanup exists to stop
from firing against a different connection. Switch both helpers to
wp_unschedule_hook(), which clears a hook's events regardless of args.

Fixes #181
- Never retry atmosphere_thread_rollback_failed: live orphans on the
  PDS plus a fresh-TID retry would publish a duplicate copy of the
  post. Also classify the other deterministic no-status failures
  (decrypt, DID mismatch, invalid blob-filter return) as permanent.
- Reset the retry counter on every status transition so a counter
  stranded by a dead retry event cannot shrink the next ladder.
- Sweep _atmosphere_publish_retries in uninstall.php.
- Skip the giving-up breadcrumb when the filter disabled retries.
- Document atmosphere_publish_retry_delays in the hook reference and
  code-style skill; use args-agnostic unschedule in test tear_down.
Copilot AI review requested due to automatic review settings July 4, 2026 12:52
@pfefferle pfefferle self-assigned this Jul 4, 2026
@pfefferle pfefferle requested a review from a team July 4, 2026 12:52
@github-actions github-actions Bot added [Tests] Includes Tests PR includes test changes Docs labels Jul 4, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds resilience to Atmosphere’s WP-Cron publish/update workers by introducing bounded retry-with-backoff for transient failures, and fixes disconnect/deactivation/uninstall cleanup so arg-carrying cron events are reliably removed (preventing queued tasks from firing under a new connection).

Changes:

  • Add transient-vs-permanent failure classification for publish/update worker errors, and reschedule transient failures with a filterable backoff ladder tracked in _atmosphere_publish_retries.
  • Fix scheduled-hook cleanup by switching from wp_clear_scheduled_hook() (arg-sensitive) to wp_unschedule_hook() (args-agnostic), including coverage for revoke events with args.
  • Add PHPUnit coverage for retry scheduling/escalation/exhaustion, rollback non-retry, retry-budget reset, and cron cleanup behavior; update docs and changelog entries.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Show a summary per file
File Description
includes/class-atmosphere.php Implements publish/update retry-with-backoff logic, retry classification, and retry-budget reset on status transition.
includes/functions.php Fixes cron cleanup to unschedule hooks regardless of args; ensures revoke event cleanup is also args-agnostic.
tests/phpunit/tests/class-test-atmosphere.php Adds tests validating retry ladder behavior, filter behavior, non-retryable rollback failure, and counter reset.
tests/phpunit/tests/class-test-functions.php Adds regression tests ensuring cleanup removes both argless and arg-carrying scheduled events.
uninstall.php Ensures new retry-counter meta key is removed on uninstall.
docs/php-coding-standards.md Documents the new atmosphere_publish_retry_delays filter in the hook reference list.
.github/changelog/add-publish-retry-backoff Changelog entry for publish retry/backoff behavior.
.github/changelog/fix-unschedule-all-cron-events Changelog entry for reliable cron cleanup on disconnect/deactivation.
.agents/skills/code-style/SKILL.md Updates internal skill reference list to include the new filter.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Docs [Tests] Includes Tests PR includes test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Disconnect/uninstall cron cleanup misses events scheduled with args Retry/backoff for failed post-publish cron events

2 participants