Fix broker stuck in SYNCHRONIZING on DB error during rollback #4995
Merged
johha
previously approved these changes
Apr 8, 2026
johha
previously approved these changes
Apr 9, 2026
johha
requested changes
Apr 15, 2026
johha
approved these changes
Apr 21, 2026
Service brokers can become permanently stuck in SYNCHRONIZING state when
a database connection failure occurs while a failed job attempts to
revert the broker state. Without intervention, the broker remains
unusable even after the database recovers.
This change implements a multi-layered error handling approach:
1. Immediate rollback: Best-effort state reversion in the job's rescue
block with graceful error handling that doesn't mask the original
failure
2. Failure recovery hook: New recover_from_failure method invoked when
jobs transition to FAILED state after retries are exhausted. This
serves as a safety net to set the broker to SYNCHRONIZATION_FAILED
when the database becomes available again
3. Conditional updates: WHERE clauses ensure only SYNCHRONIZING brokers
are affected, protecting against overwriting newer states
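The conditional update in item 3 can be illustrated with a minimal sketch. It does not reproduce the PR's code: `SketchBrokerTable`, `update_where`, and `mark_failed_if_still_synchronizing` are hypothetical names, and the in-memory hash only stands in for a conditional UPDATE that the real change presumably issues through the ORM. `AVAILABLE` is used here purely as an example of a newer state.

```ruby
# Hypothetical in-memory stand-in for the brokers table (not the PR's code).
module SketchBrokerState
  SYNCHRONIZING = 'SYNCHRONIZING'
  SYNCHRONIZATION_FAILED = 'SYNCHRONIZATION_FAILED'
  AVAILABLE = 'AVAILABLE' # illustrative "newer" state
end

class SketchBrokerTable
  def initialize
    @rows = {}
  end

  def insert(guid, state)
    @rows[guid] = state
  end

  def state_of(guid)
    @rows[guid]
  end

  # Mimics: UPDATE brokers SET state = :new_state WHERE guid = :guid AND state = :state
  # Returns the number of rows changed, as a SQL UPDATE would.
  def update_where(guid:, state:, new_state:)
    return 0 unless @rows[guid] == state

    @rows[guid] = new_state
    1
  end
end

# Only a broker that is still SYNCHRONIZING is moved to SYNCHRONIZATION_FAILED;
# a broker that has already reached another state is left untouched.
def mark_failed_if_still_synchronizing(table, guid)
  table.update_where(
    guid: guid,
    state: SketchBrokerState::SYNCHRONIZING,
    new_state: SketchBrokerState::SYNCHRONIZATION_FAILED
  )
end
```

Because the state check is part of the update itself, calling the helper a second time is a no-op, which is what makes it safe to run the recovery from more than one layer.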
The failure hook infrastructure is implemented in PollableJobWrapper and
WrappingJob, allowing any job to implement recover_from_failure for
cleanup when transitioning to permanent failure.
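A rough sketch of how this delegation could look, assuming the job framework calls `failure(job)` on the outermost wrapper once retries are exhausted. All class names below are hypothetical stand-ins; only `failure`, `recover_from_failure`, and the `respond_to?` check are taken from the description above.

```ruby
# Innermost job: opts in to recovery by defining recover_from_failure.
class SketchBrokerJob
  attr_reader :recovered

  def initialize
    @recovered = false
  end

  def recover_from_failure
    # The real jobs would set the broker to SYNCHRONIZATION_FAILED here.
    @recovered = true
  end
end

# A job that does not implement the hook; it must keep working unchanged.
class SketchPlainJob
end

# Stand-in for WrappingJob: forwards the hook only if the wrapped job has it.
class SketchWrappingJob
  attr_reader :handler

  def initialize(handler)
    @handler = handler
  end

  def recover_from_failure
    handler.recover_from_failure if handler.respond_to?(:recover_from_failure)
  end
end

# Stand-in for PollableJobWrapper: failure(job) is assumed to be invoked by the
# job framework when the job transitions to permanent failure.
class SketchPollableWrapper
  attr_reader :handler

  def initialize(handler)
    @handler = handler
  end

  def failure(_job)
    handler.recover_from_failure if handler.respond_to?(:recover_from_failure)
  end
end
```

With this shape, `SketchPollableWrapper.new(SketchWrappingJob.new(SketchBrokerJob.new)).failure(nil)` reaches the innermost job through both layers, while a wrapped `SketchPlainJob` is silently skipped.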
Changes:
- Add PollableJobWrapper.failure hook that calls recover_from_failure
- Add WrappingJob.recover_from_failure delegation with respond_to? check
- Implement recover_from_failure in UpdateBrokerJob and
SynchronizeBrokerCatalogJob to set brokers to SYNCHRONIZATION_FAILED
- Add graceful error handling to rollback_broker_state
- Add comprehensive test coverage for all new behavior
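The "graceful error handling" for the rollback can be pictured as follows. This is a sketch under assumptions, not the PR's implementation: `perform_with_best_effort_rollback` and `SketchDatabaseDown` are hypothetical names, and the rollback is passed in as a lambda to keep the example self-contained. The point it illustrates is that a rollback failure is logged and swallowed, so the caller still sees the original error.

```ruby
require 'logger'
require 'stringio'

# Hypothetical error standing in for a lost database connection.
class SketchDatabaseDown < StandardError; end

# Runs the job body. If it fails, attempts the rollback; any error raised by
# the rollback is logged instead of propagated, and the original error is
# re-raised so it is not masked.
def perform_with_best_effort_rollback(logger:, rollback:)
  yield
rescue StandardError => e
  begin
    rollback.call
  rescue StandardError => rollback_error
    logger.error("rollback failed: #{rollback_error.message}")
  end
  raise e
end
```

If the rollback cannot run because the database is down, the broker is still SYNCHRONIZING at this point; that is the case the recover_from_failure hook is meant to cover later.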
ari-wg-gitbot added a commit to cloudfoundry/capi-release that referenced this pull request on Apr 22, 2026:
Changes in cloud_controller_ng:
- Fix broker stuck in SYNCHRONIZING on DB error during rollback
PR: cloudfoundry/cloud_controller_ng#4995
Author: Katharina Przybill <30441792+kathap@users.noreply.github.com>
When a service broker update or create job fails, a database connection failure during state rollback could leave the broker permanently stuck in SYNCHRONIZING state with a FAILED job. This PR adds comprehensive error handling and a recovery hook to ensure brokers are marked as SYNCHRONIZATION_FAILED even when database outages occur during job failure.
Changes:
a) app/jobs/v3/services/update_broker_job.rb:
- recover_from_failure hook to set the broker to SYNCHRONIZATION_FAILED when the job transitions to FAILED
- rollback_broker_state and destroy_update_request helper methods with graceful error handling
- Conditional WHERE clause (state: SYNCHRONIZING) to prevent overwriting newer broker states
b) app/jobs/v3/services/synchronize_broker_catalog_job.rb:
- recover_from_failure hook to set the broker to SYNCHRONIZATION_FAILED when the job transitions to FAILED
- Conditional WHERE clause (state: SYNCHRONIZING) to prevent overwriting newer broker states
c) app/jobs/wrapping_job.rb:
- recover_from_failure hook to allow all wrappers to invoke recovery methods
d) app/jobs/pollable_job_wrapper.rb:
- recover_from_failure hook invoked on the handler in the failure method (delegated through WrappingJob)
e) spec/unit/jobs/v3/services/update_broker_job_spec.rb:
- Tests for the recover_from_failure method covering normal recovery, idempotency, state protection, and error handling
f) spec/unit/jobs/pollable_job_wrapper_spec.rb:
g) spec/unit/jobs/wrapping_job_spec.rb:
- Tests for recover_from_failure delegation through wrapper layers
A short explanation of the proposed change:
This PR adds multi-layered error handling for broker state management during job failures:
1. Immediate rollback: best-effort state reversion when the job fails (rollback_broker_state helper for updates, inline for creates)
2. Failure recovery hook: recover_from_failure method called when the job exhausts retries and transitions to FAILED
3. Conditional updates: WHERE clauses ensure only SYNCHRONIZING brokers are affected, protecting against overwriting newer states
An explanation of the use cases your change solves:
This change solves a critical production issue where service brokers become permanently stuck in SYNCHRONIZING state. This occurs when a job fails and a database connection failure during state rollback prevents the broker from being set to SYNCHRONIZATION_FAILED.
Without this fix: The broker remains stuck in SYNCHRONIZING state with a FAILED job, requiring manual intervention.
With this fix: When the job transitions to FAILED, the recover_from_failure hook is invoked by PollableJobWrapper, which attempts to set the broker to SYNCHRONIZATION_FAILED. The conditional WHERE clause prevents overwriting any newer state that may have been set, ensuring safe recovery.
- I have reviewed the contributing guide
- I have viewed, signed, and submitted the Contributor License Agreement
- I have made this pull request to the main branch
- I have run all the unit tests using bundle exec rake
- I have run CF Acceptance Tests