CA-403379: pre-flight cluster_host state before pool-ha-enable #7130

Open
LunfanZhang wants to merge 1 commit into xapi-project:master from LunfanZhang:private/luzhan/CA-403379
@LunfanZhang

Collaborator

When the chosen HA cluster_stack is corosync (i.e. for a gfs2 heartbeat SR) every pool host must have an enabled, joined cluster_host on the matching cluster stack, and this host must currently be quorate. Without this preflight, that failure surfaces much later inside Xha_statefile.check_sr_can_host_statefile with the misleading SR_NO_PBDS error from pool-ha-enable.

This change adds a per-host preflight in Xapi_ha.enable that reuses the existing NO_COMPATIBLE_CLUSTER_HOST, CLUSTERING_DISABLED and CLUSTER_HOST_NOT_JOINED errors so the caller can pinpoint exactly which host is the problem.
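As a rough illustration of the per-host preflight described above, here is a minimal, self-contained sketch. It is a toy model, not the code in this PR: the record type, the variant constructors (named after the three API errors), the `check_host` and `preflight` helpers, and the order in which the conditions are tested are all assumptions made for illustration.

```ocaml
(* Toy model only: none of these types are xapi's real ones. *)
type cluster_host = {stack: string; enabled: bool; joined: bool}

(* Each error carries the name of the offending host. *)
type preflight_error =
  | No_compatible_cluster_host of string
  | Clustering_disabled of string
  | Cluster_host_not_joined of string

exception Preflight_failed of preflight_error

(* Check one host and raise on the first condition that fails.
   The order of the checks is an assumption. *)
let check_host ~cluster_stack (host, cluster_hosts) =
  match List.find_opt (fun ch -> ch.stack = cluster_stack) cluster_hosts with
  | None ->
      raise (Preflight_failed (No_compatible_cluster_host host))
  | Some ch when not ch.enabled ->
      raise (Preflight_failed (Clustering_disabled host))
  | Some ch when not ch.joined ->
      raise (Preflight_failed (Cluster_host_not_joined host))
  | Some _ ->
      ()

(* Run the check over every host in the (hypothetical) pool. *)
let preflight ~cluster_stack pool = List.iter (check_host ~cluster_stack) pool

let healthy = {stack= "corosync"; enabled= true; joined= true}

let pool =
  [ ("host-a", [healthy])
  ; ("host-b", [{healthy with joined= false}])
  ; ("host-c", []) ]

(* Capture the first failure instead of letting the exception escape. *)
let first_failure =
  try
    preflight ~cluster_stack:"corosync" pool ;
    None
  with Preflight_failed e -> Some e
```

In this model `first_failure` names `host-b`, the first host in iteration order that fails a condition, which is the "pinpoint the host" behaviour the description is after.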

The preflight runs BEFORE the cluster_stack is persisted to the pool DB and localdb, matching the pattern of the existing host_offline check, so a failed precondition does not leak ha_cluster_stack into the pool state.
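The ordering argument can be sketched in isolation. In the toy below, a mutable reference stands in for the persisted `ha_cluster_stack` field; the `enable` function, the exception, and the two preflight stubs are hypothetical names, not xapi's API.

```ocaml
(* Hypothetical stand-in for the persisted pool field; not xapi's storage. *)
let ha_cluster_stack : string option ref = ref None

exception Precondition_failed of string

(* Order matters: run the checks first, write state only if they all pass. *)
let enable ~cluster_stack ~preflight =
  preflight cluster_stack ;
  ha_cluster_stack := Some cluster_stack

(* Two stub preflights, one that always fails and one that always passes. *)
let failing_preflight _ = raise (Precondition_failed "host-b is not joined")

let passing_preflight _ = ()

(* A failed preflight aborts before the write, so nothing is persisted. *)
let after_failure =
  ( try enable ~cluster_stack:"corosync" ~preflight:failing_preflight
    with Precondition_failed _ -> () ) ;
  !ha_cluster_stack

(* A passing preflight lets the write go through. *)
let after_success =
  enable ~cluster_stack:"corosync" ~preflight:passing_preflight ;
  !ha_cluster_stack
```

Because the exception is raised before the assignment, `after_failure` is still `None`. Swapping the two statements in `enable` would leave `Some "corosync"` behind after a failed attempt, which is the leak the description wants to avoid.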

The final assert_cluster_host_quorate call queries xapi-clusterd diagnostics directly rather than reading the Cluster_host.live DB field, which the corosync_notifyd watcher only updates asynchronously and which is reset to false for all hosts on any transient quorum blip.

Comment thread ocaml/xapi/xapi_ha.ml Outdated
~host:(Helpers.get_localhost ~__context)
with
| None ->
() (* unreachable: covered by the iter above *)

Contributor

maybe use failwith ... I'm always a bit stressed with unreachable code that fails silently

Collaborator Author

Indeed. I moved it into the loop, so there is no unreachable code and the logic stays the same.
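For readers following the thread, the restructuring can be sketched roughly as follows. This is a toy model, not the actual diff: the association list, `find_cluster_host`, and `assert_quorate` are hypothetical stand-ins, and `failwith` takes the place of the real API error.

```ocaml
(* Hypothetical lookup from host name to a cluster_host reference. *)
let find_cluster_host table host = List.assoc_opt host table

(* Records which cluster_host the quorate check was asked about. *)
let quorate_checked : string list ref = ref []

let assert_quorate self = quorate_checked := self :: !quorate_checked

let preflight ~localhost table hosts =
  List.iter
    (fun host ->
      match find_cluster_host table host with
      | None ->
          failwith ("no cluster_host for " ^ host)
      | Some self ->
          (* The reference is already in scope here, so there is no second
             lookup and no "impossible" None branch to worry about. *)
          if host = localhost then assert_quorate self )
    hosts

let () =
  preflight ~localhost:"host-a"
    [("host-a", "ch-a"); ("host-b", "ch-b")]
    ["host-a"; "host-b"]
```

Every `None` now sits on a path that raises, so nothing can fail silently.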

When the chosen HA cluster_stack is corosync (i.e. for a gfs2 heartbeat
SR) every pool host must have an enabled, joined cluster_host on the
matching cluster stack, and this host must currently be quorate.
Without this preflight, that failure surfaces much later inside
Xha_statefile.check_sr_can_host_statefile with the misleading
SR_NO_PBDS error from pool-ha-enable (CA-417077 / TC7509).

This change adds a per-host preflight in Xapi_ha.enable that reuses
the existing NO_COMPATIBLE_CLUSTER_HOST, CLUSTERING_DISABLED and
CLUSTER_HOST_NOT_JOINED errors so the caller can pinpoint exactly
which host is the problem.

The preflight runs BEFORE the cluster_stack is persisted to the pool
DB and localdb, matching the pattern of the existing host_offline
check, so a failed precondition does not leak ha_cluster_stack into
the pool state.

The final assert_cluster_host_quorate call queries xapi-clusterd
diagnostics directly rather than reading the Cluster_host.live DB
field, which the corosync_notifyd watcher only updates asynchronously
and which is reset to false for all hosts on any transient quorum
blip.

Signed-off-by: Lunfan Zhang[Lunfan.Zhang] <Lunfan.Zhang@cloud.com>
LunfanZhang force-pushed the private/luzhan/CA-403379 branch from 32d2521 to 285d31d on June 17, 2026 01:53
Comment thread ocaml/xapi/xapi_ha.ml
Cluster_host.live DB field which the corosync_notifyd
watcher only updates asynchronously. *)
if host = localhost then
Xapi_clustering.assert_cluster_host_quorate ~__context ~self

Contributor

This check is only needed once for the master. So could we move it outside the iteration?
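One possible shape for that suggestion, sketched under assumptions: the per-host step returns the validated cluster_host, so the single quorate check after the loop can reuse it. All names here (`validate`, `assert_quorate`, the association list) are hypothetical and stand in for the real xapi calls.

```ocaml
(* Counts how often the (hypothetical) quorate check runs. *)
let quorate_calls = ref 0

let assert_quorate (_self : string) = incr quorate_calls

(* Per-host validation returns the cluster_host so later steps can reuse it. *)
let validate table host =
  match List.assoc_opt host table with
  | Some self ->
      (host, self)
  | None ->
      failwith ("no cluster_host for " ^ host)

let preflight ~localhost table hosts =
  let validated = List.map (validate table) hosts in
  (* The quorate check concerns only the local host, so it runs once here. *)
  match List.assoc_opt localhost validated with
  | Some self ->
      assert_quorate self
  | None ->
      failwith "localhost is not in the pool host list"

let () =
  preflight ~localhost:"host-a"
    [("host-a", "ch-a"); ("host-b", "ch-b")]
    ["host-a"; "host-b"]
```

The trade-off against the in-loop version is a second lookup after the loop. Its `None` branch raises, so it stays loud if the assumption that localhost is among the pool hosts ever breaks.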

Comment thread ocaml/xapi/xapi_ha.ml
misleading SR_NO_PBDS from check_sr_can_host_statefile. Run this
before persisting cluster_stack so a failed precondition does not
leak ha_cluster_stack into the pool DB. *)
( if cluster_stack = Constants.Ha_cluster_stack.(to_string Corosync) then

Member

I would suggest to move this implementation into xapi_clustering.ml

4 participants