CA-403379: pre-flight cluster_host state before pool-ha-enable#7130
Open
LunfanZhang wants to merge 1 commit into
Open
CA-403379: pre-flight cluster_host state before pool-ha-enable#7130LunfanZhang wants to merge 1 commit into
LunfanZhang wants to merge 1 commit into
Conversation
gthvn1
reviewed
Jun 16, 2026
| ~host:(Helpers.get_localhost ~__context) | ||
| with | ||
| | None -> | ||
| () (* unreachable: covered by the iter above *) |
Contributor
There was a problem hiding this comment.
maybe use failwith ... I'm always a bit stressed with unreachable code that fails silently
Collaborator
Author
There was a problem hiding this comment.
Indeed, I move it to the loop, so there no unreachable code and logic stay the same
When the chosen HA cluster_stack is corosync (i.e. for a gfs2 heartbeat SR) every pool host must have an enabled, joined cluster_host on the matching cluster stack, and this host must currently be quorate. Without this preflight, that failure surfaces much later inside Xha_statefile.check_sr_can_host_statefile with the misleading SR_NO_PBDS error from pool-ha-enable (CA-417077 / TC7509). This change adds a per-host preflight in Xapi_ha.enable that reuses the existing NO_COMPATIBLE_CLUSTER_HOST, CLUSTERING_DISABLED and CLUSTER_HOST_NOT_JOINED errors so the caller can pinpoint exactly which host is the problem. The preflight runs BEFORE the cluster_stack is persisted to the pool DB and localdb, matching the pattern of the existing host_offline check, so a failed precondition does not leak ha_cluster_stack into the pool state. The final assert_cluster_host_quorate call queries xapi-clusterd diagnostics directly rather than reading the Cluster_host.live DB field, which the corosync_notifyd watcher only updates asynchronously and which is reset to false for all hosts on any transient quorum blip. Signed-off-by: Lunfan Zhang[Lunfan.Zhang] <Lunfan.Zhang@cloud.com>
32d2521 to
285d31d
Compare
BengangY
approved these changes
Jun 17, 2026
| Cluster_host.live DB field which the corosync_notifyd | ||
| watcher only updates asynchronously. *) | ||
| if host = localhost then | ||
| Xapi_clustering.assert_cluster_host_quorate ~__context ~self |
Contributor
There was a problem hiding this comment.
This check is only needed once for the master. So could we move it outside the iteration?
minglumlu
reviewed
Jun 17, 2026
| misleading SR_NO_PBDS from check_sr_can_host_statefile. Run this | ||
| before persisting cluster_stack so a failed precondition does not | ||
| leak ha_cluster_stack into the pool DB. *) | ||
| ( if cluster_stack = Constants.Ha_cluster_stack.(to_string Corosync) then |
Member
There was a problem hiding this comment.
I would suggest to move this implementation into xapi_clustering.ml
gthvn1
approved these changes
Jun 17, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When the chosen HA cluster_stack is corosync (i.e. for a gfs2 heartbeat SR) every pool host must have an enabled, joined cluster_host on the matching cluster stack, and this host must currently be quorate. Without this preflight, that failure surfaces much later inside Xha_statefile.check_sr_can_host_statefile with the misleading SR_NO_PBDS error from pool-ha-enable.
This change adds a per-host preflight in
Xapi_ha.enablethat reuses the existingNO_COMPATIBLE_CLUSTER_HOST,CLUSTERING_DISABLEDandCLUSTER_HOST_NOT_JOINEDerrors so the caller can pinpoint exactly which host is the problem.The preflight runs BEFORE the cluster_stack persisted to the pool DB and local db, matching the pattern of the existing host_offline check, so a failed precondition does not leak ha_cluster_stack into the pool state.
The final assert_cluster_host_quorate call queries xapi-clusterd diagnostics directly rather than reading the Cluster_host.live DB field, which the corosync_notifyd watcher only updates asynchronously and which is reset to false for all hosts on any transient quorum blip.