Commit 0b946c4

infra: switch nomad server update policy to OPPORTUNISTIC (#2356)
* infra: switch nomad server update policy to OPPORTUNISTIC

  Prevents automatic instance replacements on template changes, requiring manual intervention to roll out updates to the control plane. This reduces risk of unintended disruptions to the Nomad server cluster.

* fix(infra): use dev conditional and fix redistribution for server update policy

  - Use var.environment == "dev" ? "PROACTIVE" : "OPPORTUNISTIC" to match the pattern used by all other nodepools (api, clickhouse, loki, worker)
  - Set instance_redistribution_type to NONE for non-dev to prevent GCP zone rebalancing from propagating new templates as a side effect
  - Add reference to hashicorp/nomad#9390 (missed heartbeats during rotation)
  - Update stale comment to reflect OPPORTUNISTIC semantics

* fix(infra): keep PROACTIVE redistribution for server pool

  Server cluster needs even zone distribution for Raft quorum resilience. Redistribution only triggers on zone imbalance, not on template changes, so it doesn't undermine the OPPORTUNISTIC update policy.

* docs(infra): note redistribution interaction with OPPORTUNISTIC updates
1 parent 38b29d4 commit 0b946c4

1 file changed

Lines changed: 8 additions & 4 deletions

iac/provider-gcp/nomad-cluster/nodepool-control-server.tf

@@ -49,13 +49,17 @@ resource "google_compute_region_instance_group_manager" "server_pool" {
     port = var.nomad_port
   }
 
-  # Server is a stateful cluster, so the update strategy used to roll out a new GCE Instance Template must be
-  # a rolling update.
+  # Server is a stateful cluster. In non-dev environments, use OPPORTUNISTIC updates so instance template
+  # changes are only applied when instances are recreated for other reasons (e.g., auto-healing).
+  # Proactive rolling replacements of servers can cause missed client heartbeats and secret revocations:
+  # https://github.com/hashicorp/nomad/issues/9390
   update_policy {
-    type = "PROACTIVE"
+    type = var.environment == "dev" ? "PROACTIVE" : "OPPORTUNISTIC"
     minimal_action = "REPLACE"
 
-    // We want to keep the instance distribution even
+    // Keep PROACTIVE redistribution to maintain even server distribution across zones for Raft quorum resilience.
+    // Note: redistributed instances will pick up the current instance template, which may apply pending template
+    // changes as a side effect of zone rebalancing. This is an acceptable trade-off for server quorum safety.
     instance_redistribution_type = "PROACTIVE"
     max_unavailable_fixed = 0
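
The commit message notes that the same dev conditional is used by the other nodepools (api, clickhouse, loki, worker). As a minimal sketch only, and not something this commit does, the expression can be evaluated in isolation as below. The local and output names are hypothetical; only `var.environment` and the two policy strings come from the diff.

```hcl
# Hypothetical standalone configuration for illustration; not part of the commit.

variable "environment" {
  type        = string
  description = "Deployment environment name, for example \"dev\" or \"prod\"."
}

locals {
  # Same expression as in the diff: dev rolls new instance templates out
  # immediately, every other environment waits until instances are recreated
  # for some other reason.
  server_update_policy_type = var.environment == "dev" ? "PROACTIVE" : "OPPORTUNISTIC"
}

# Exposed only so the resolved value can be inspected from a plan.
output "server_update_policy_type" {
  value = local.server_update_policy_type
}
```

Placed in a scratch directory, `terraform plan -var environment=prod` should report the output as "OPPORTUNISTIC", and `-var environment=dev` as "PROACTIVE".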
