fix(shield): prevent stuck allowlist-waiter Job after hook failure #2619
Open
francesco-furlan wants to merge 5 commits into
Conversation
The allowlist waiter Job and its SA/CR/CRB are gated by gke_autopilot.allowlist_waiter.enabled, but their RBAC creation was coupled to host.rbac.create. With host.rbac.create=false (e.g., externally managed host RBAC), enabling the waiter produced an orphan Job referencing a non-existent SA. Introduce a dedicated gke_autopilot.allowlist_waiter.create_rbac flag (defaulting to true) plus gke_autopilot.allowlist_waiter.rbac_annotations so the waiter's RBAC is managed independently of the host-shield's. Bumps chart to 1.38.0.
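As a sketch, the new values surface described above would look roughly like this (key names are taken from the text; the `enabled` default and comment wording are assumptions, not the chart's actual `values.yaml`):

```yaml
gke_autopilot:
  allowlist_waiter:
    enabled: true
    # New in 1.38.0: waiter RBAC is managed independently of host.rbac.create,
    # so enabling the waiter with externally managed host RBAC no longer
    # produces a Job referencing a non-existent ServiceAccount.
    create_rbac: true
    # New in 1.38.0: annotations applied to the waiter's SA/CR/CRB.
    rbac_annotations: {}
```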
…k failure
Helm 3's hook executor applies the HookSucceeded delete-policy to both
the failed hook and all previously-succeeded hooks in the batch (see
helm/helm pkg/action/hooks.go execHook). When the allowlist-waiter Job
(weight 5) fails, Helm sweeps the SA/CR/CRB (weight -5) because they
carried hook-succeeded — leaving the failed Job referencing a missing
SA, which produces a FailedCreate retry storm bounded only by manual
intervention.
Two surgical changes:
* Drop hook-succeeded from the SA/CR/CRB delete-policy. They now
persist across failures (and across successes — a lingering tiny
SA/CR/CRB that the next install's before-hook-creation reclaims).
* Add hook-failed to the Job's delete-policy so a failed Job is
reaped instead of looping forever against a deleted SA.
The AllowlistSynchronizer is unchanged (already before-hook-creation
only — it must persist after the waiter exits since host-shield pods
rely on Warden seeing it).
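The resulting hook annotations can be sketched as follows (hook weights and delete-policies are taken from the text above; the hook events and template layout are assumptions, not the chart's exact source):

```yaml
# SA/CR/CRB (weight -5): hook-succeeded dropped, so these survive the
# HookSucceeded sweep when a later hook in the batch fails. The next
# install's before-hook-creation reclaims any lingering copies.
metadata:
  annotations:
    helm.sh/hook: pre-install
    helm.sh/hook-weight: "-5"
    helm.sh/hook-delete-policy: before-hook-creation
---
# Job (weight 5): hook-failed added, so a failed Job is reaped instead of
# retrying FailedCreate forever against a deleted ServiceAccount.
metadata:
  annotations:
    helm.sh/hook: pre-install
    helm.sh/hook-weight: "5"
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded,hook-failed
```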
When the waiter's kubectl wait times out, the Pod previously exited with just the bare wait error, leaving no diagnostics by the time the next on-caller arrived (especially after kubelet log eviction). On non-zero exit, the script now emits a kubectl describe and full YAML of the AllowlistSynchronizer to stderr before exiting with the original error code. The existing RBAC (get/list/watch on allowlistsynchronizers) is sufficient; events lookup is intentionally skipped to avoid widening the ClusterRole.
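A minimal sketch of that diagnostics path, assuming the script waits on a named AllowlistSynchronizer (the resource name, wait condition, and variable names here are placeholders, not the chart's actual script):

```shell
kubectl wait --for=condition=Ready \
  allowlistsynchronizer/"${SYNC_NAME}" --timeout="${WAIT_TIMEOUT}" || {
  rc=$?
  # Dump state to stderr so diagnostics survive past kubelet log eviction.
  # The existing get/list/watch on allowlistsynchronizers covers both calls.
  kubectl describe allowlistsynchronizer "${SYNC_NAME}" >&2 || true
  kubectl get allowlistsynchronizer "${SYNC_NAME}" -o yaml >&2 || true
  exit "$rc"
}
```

Capturing `rc=$?` inside the `|| { … }` group preserves the original `kubectl wait` exit code, so the Pod still fails with the wait error rather than the status of the diagnostic commands.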
…econds
The Job's only existing timeout lives inside the container script (kubectl wait --timeout). If the Pod hangs before that line executes (image-pull stall, scheduler delay, admission webhook latency), the Job can outrun Helm's pre-install hook timeout. Add a Job-level activeDeadlineSeconds guard via the new gke_autopilot.allowlist_waiter.active_deadline_seconds value, defaulting to 300 (180s of headroom over the default 120s wait timeout).
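Wired into the Job template, the guard would look roughly like this (a sketch under the value name stated above, not the chart's exact template):

```yaml
apiVersion: batch/v1
kind: Job
spec:
  # Hard cap on total Job runtime, covering hangs that occur before the
  # container script's own kubectl wait --timeout ever runs
  # (image pulls, scheduling, admission webhooks).
  activeDeadlineSeconds: {{ .Values.gke_autopilot.allowlist_waiter.active_deadline_seconds }}
```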
What this PR does / why we need it:
Fixes a stuck-cluster state observed on a customer running
shield `1.37.1` on GKE Autopilot via Flux: when the pre-install `shield-host-allowlist-waiter` Job times out or fails, Helm's hook executor (per helm/helm `pkg/action/hooks.go`) applies the `HookSucceeded` delete-policy to all previously-succeeded hooks in the batch, sweeping the waiter's SA/CR/CRB while the failed Job survives. The Job then loops `FailedCreate` forever against a missing ServiceAccount.

Four logically separable changes, one commit each (kept independently bisectable):
1. Decouple the waiter SA/CR/CRB from `host.rbac.create`. They now have their own `gke_autopilot.allowlist_waiter.create_rbac` and `rbac_annotations` keys, so enabling the waiter without host-shield RBAC no longer produces an orphan Job. Bumps chart to `1.38.0`.
2. The core fix: drop `hook-succeeded` from the SA/CR/CRB `helm.sh/hook-delete-policy` so they survive Helm's `HookSucceeded` sweep on Job failure, and add `hook-failed` to the Job's `helm.sh/hook-delete-policy` so a failed Job is reaped instead of looping `FailedCreate`. (The AllowlistSynchronizer stays `before-hook-creation` only; it must persist past the waiter.)
3. Emit `kubectl describe` and full YAML of the AllowlistSynchronizer on wait failure, so the next on-caller has actionable diagnostics instead of a bare exit (the customer's first failure was unrecoverable because pod logs were already evicted by the time investigation started). No new RBAC needed.
4. Add `gke_autopilot.allowlist_waiter.active_deadline_seconds` (default `300`), wired into `Job.spec.activeDeadlineSeconds`. A belt-and-suspenders Job-level guard against Pod hangs before the inner `kubectl wait` timeout fires (image-pull stalls, scheduler delays, admission webhook hangs).

Customer-side remediation (independent of this fix)
For clusters already stuck:
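The PR body's exact commands are not reproduced here; a hypothetical manual recovery (namespace and release names below are placeholders) would be:

```shell
# Remove the stuck pre-install Job so the FailedCreate retry storm stops:
kubectl -n <shield-namespace> delete job shield-host-allowlist-waiter
# Any leftover waiter RBAC is reclaimed by before-hook-creation on retry.
# Then trigger a fresh reconcile of the release, e.g. via Flux:
flux reconcile helmrelease <shield-release> -n <flux-namespace>
```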
Then watch the install hook phase. With the chart fix shipped, the same failure mode produces a clean retry instead of a 5-day `FailedCreate` storm.

Checklist