Add retry with backoff for failed publish and update cron workers#182
Open
pfefferle wants to merge 4 commits into
Open
Add retry with backoff for failed publish and update cron workers#182pfefferle wants to merge 4 commits into
pfefferle wants to merge 4 commits into
Conversation
A transient PDS failure (429/5xx/timeout) in the one-shot atmosphere_publish_post / atmosphere_update_post cron events previously dropped the post silently. Re-queue the same hook with an escalating 60s/5m/15m ladder, tracked in post meta and cleared on success, permanent rejection, or exhaustion. Mirrors the comment parent-defer pattern. Fixes #179
Mirror the ActivityPub dispatcher's filterable retry knobs with a single atmosphere_publish_retry_delays filter: one entry per retry, so the ladder's length is also the attempt budget — an empty array disables retries, a longer one raises the cap. One knob instead of two keeps the delay schedule and the max-attempts count from contradicting each other. Part of #179
wp_clear_scheduled_hook() with no args only removes events whose args are an empty array, so every queued per-post event ([ post_id ]) and the revoke event (encrypted token payload in its args) survived clear_scheduled_hooks() — the exact events the cleanup exists to stop from firing against a different connection. Switch both helpers to wp_unschedule_hook(), which clears a hook's events regardless of args. Fixes #181
- Never retry atmosphere_thread_rollback_failed: live orphans on the PDS plus a fresh-TID retry would publish a duplicate copy of the post. Also classify the other deterministic no-status failures (decrypt, DID mismatch, invalid blob-filter return) as permanent. - Reset the retry counter on every status transition so a counter stranded by a dead retry event cannot shrink the next ladder. - Sweep _atmosphere_publish_retries in uninstall.php. - Skip the giving-up breadcrumb when the filter disabled retries. - Document atmosphere_publish_retry_delays in the hook reference and code-style skill; use args-agnostic unschedule in test tear_down.
Contributor
There was a problem hiding this comment.
Pull request overview
Adds resilience to Atmosphere’s WP-Cron publish/update workers by introducing bounded retry-with-backoff for transient failures, and fixes disconnect/deactivation/uninstall cleanup so arg-carrying cron events are reliably removed (preventing queued tasks from firing under a new connection).
Changes:
- Add transient-vs-permanent failure classification for publish/update worker errors, and reschedule transient failures with a filterable backoff ladder tracked in
_atmosphere_publish_retries. - Fix scheduled-hook cleanup by switching from
wp_clear_scheduled_hook()(arg-sensitive) towp_unschedule_hook()(args-agnostic), including coverage for revoke events with args. - Add PHPUnit coverage for retry scheduling/escalation/exhaustion, rollback non-retry, retry-budget reset, and cron cleanup behavior; update docs and changelog entries.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
includes/class-atmosphere.php |
Implements publish/update retry-with-backoff logic, retry classification, and retry-budget reset on status transition. |
includes/functions.php |
Fixes cron cleanup to unschedule hooks regardless of args; ensures revoke event cleanup is also args-agnostic. |
tests/phpunit/tests/class-test-atmosphere.php |
Adds tests validating retry ladder behavior, filter behavior, non-retryable rollback failure, and counter reset. |
tests/phpunit/tests/class-test-functions.php |
Adds regression tests ensuring cleanup removes both argless and arg-carrying scheduled events. |
uninstall.php |
Ensures new retry-counter meta key is removed on uninstall. |
docs/php-coding-standards.md |
Documents the new atmosphere_publish_retry_delays filter in the hook reference list. |
.github/changelog/add-publish-retry-backoff |
Changelog entry for publish retry/backoff behavior. |
.github/changelog/fix-unschedule-all-cron-events |
Changelog entry for reliable cron cleanup on disconnect/deactivation. |
.agents/skills/code-style/SKILL.md |
Updates internal skill reference list to include the new filter. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #179
Fixes #181
Proposed changes:
wp_schedule_single_eventcron events; a single transient PDS failure (429/5xx/timeout) or missed cron tick previously dropped the post silently — the post showed "Published" in WordPress but never reached Bluesky (reported on wp.org as "inconsistent" publishing). Failedatmosphere_publish_post/atmosphere_update_postworkers now re-queue themselves with an escalating 60s/5m/15m ladder, tracked in_atmosphere_publish_retriespost meta and cleared on success, permanent rejection, exhaustion, or a fresh save. Mirrors the existing comment parent-defer pattern and the ActivityPub dispatcher's filterable-retry convention.not_connected,needs_reauth,decrypt,did_mismatch, …) and deterministic PDS 4xx are never retried; no-status network failures, 408, 429, and 5xx are.atmosphere_thread_rollback_failedis explicitly permanent — retrying after a failed rollback would mint fresh TIDs next to live orphan records and publish a duplicate copy.atmosphere_publish_retry_delaysfilter. One entry per retry, so the ladder's length is also the attempt budget — an empty array disables retries, a longer array raises the cap. Documented in the hook reference.wp_clear_scheduled_hook( $hook )with no args only removes events scheduled with an empty args array, so every queued per-post event and the revoke event (encrypted token payload in its args) survivedclear_scheduled_hooks()— the exact events the cleanup exists to stop from firing against a different connection. Both helpers now usewp_unschedule_hook().Other information:
Reviewed pre-PR by three parallel review passes (correctness/WP-Cron races, conventions/test hygiene, edge cases/abuse); all blocking findings addressed in the last commit (duplicate-post guard, stranded-counter reset, uninstall sweep).
Testing instructions:
npm run env-start, connect a Bluesky account, and enable auto-publish.add_filter( 'atmosphere_pre_apply_writes', fn() => new \WP_Error( 'atmosphere_pds', 'boom', array( 'status' => 500 ) ) );(e.g. via a mu-plugin), then publish a post.wp cron event list(or WP Crontrol): anatmosphere_publish_postevent for the post is queued ~60s out; after it fails again, ~5m; then ~15m; then it stops. The post's_atmosphere_publish_retriesmeta tracks the attempt count and is removed at the end.array( 'status' => 400 ): no retry is scheduled.wp cron event listshows no remainingatmosphere_*events.npm run env-test— 848 tests pass, including 10 new ones covering the ladder, the filter, the rollback guard, the counter reset, and the arg-carrying cleanup.