fix(shield): prevent stuck allowlist-waiter Job after hook failure #2619

Open
francesco-furlan wants to merge 5 commits into main from shield/fix-allowlist-waiter-hook-cleanup

Conversation


@francesco-furlan francesco-furlan commented May 12, 2026

What this PR does / why we need it:

Fixes a stuck-cluster state observed on a customer running shield-1.37.1 on GKE Autopilot via Flux: when the pre-install shield-host-allowlist-waiter Job times out or fails, Helm's hook executor (per helm/helm pkg/action/hooks.go) applies the HookSucceeded delete-policy to all previously-succeeded hooks in the batch, sweeping the waiter's SA/CR/CRB while the failed Job survives. The Job then loops FailedCreate forever against a missing ServiceAccount.

Four logically separable changes, one commit each (kept independently bisectable):

  1. Decouple the waiter SA/CR/CRB from host.rbac.create. They now have their own gke_autopilot.allowlist_waiter.create_rbac and rbac_annotations keys so enabling the waiter without host-shield RBAC no longer produces an orphan Job. Bumps chart to 1.38.0.

  2. The core fix:

    • Drop hook-succeeded from the SA/CR/CRB helm.sh/hook-delete-policy so they survive Helm's HookSucceeded sweep on Job failure.
    • Add hook-failed to the Job's helm.sh/hook-delete-policy so a failed Job is reaped instead of looping FailedCreate.
    • AllowlistSynchronizer hook policy unchanged (already before-hook-creation only — it must persist past the waiter).
  3. Emit kubectl describe and the full YAML of the AllowlistSynchronizer on wait failure, so the next on-call engineer has actionable diagnostics instead of a bare non-zero exit (the customer's first failure was unrecoverable because pod logs had already been evicted by the time investigation started). No new RBAC needed.

  4. Add gke_autopilot.allowlist_waiter.active_deadline_seconds (default 300) wired into Job.spec.activeDeadlineSeconds. Belt-and-suspenders Job-level guard against pod hangs before the inner kubectl wait timeout fires (image-pull stalls, scheduler delays, admission webhook hangs).
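The hook-annotation changes in item 2 can be sketched as follows. This is an illustrative fragment only: the pre-install hook type and weights match the description above, but exact template contents may differ in the chart.

```yaml
# ServiceAccount / ClusterRole / ClusterRoleBinding (weight -5):
metadata:
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "-5"
    # hook-succeeded dropped: these must survive Helm's HookSucceeded
    # sweep when a later hook in the batch fails.
    "helm.sh/hook-delete-policy": before-hook-creation
---
# allowlist-waiter Job (weight 5):
metadata:
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "5"
    # hook-failed added: a failed Job is reaped rather than looping
    # FailedCreate against a deleted ServiceAccount.
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded,hook-failed
```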

Customer-side remediation (independent of this fix)

For clusters already stuck:

flux suspend helmrelease shield -n sysdig
kubectl delete -n sysdig job/shield-host-allowlist-waiter
kubectl delete allowlistsynchronizer/sysdig-agent-allowlist-synchronizer
helm uninstall shield -n sysdig
flux resume helmrelease shield -n sysdig

Then watch the install hook phase. With the chart fix shipped, the same failure mode produces a clean retry instead of a 5-day FailedCreate storm.

Checklist

  • Title of the PR starts with type and scope
  • Chart Version bumped for the respective charts (1.37.1 → 1.38.0)
  • Variables are documented in the README.md
  • Check GitHub Actions checks (like lint) to avoid merge-check stoppers
  • All test files are added in the tests folder of their respective chart and have a "_test" suffix

The allowlist waiter Job and its SA/CR/CRB are gated by
gke_autopilot.allowlist_waiter.enabled, but their RBAC creation was
coupled to host.rbac.create. With host.rbac.create=false (e.g.,
externally managed host RBAC), enabling the waiter produced an orphan
Job referencing a non-existent SA.

Introduce a dedicated gke_autopilot.allowlist_waiter.create_rbac flag
(defaulting to true) plus gke_autopilot.allowlist_waiter.rbac_annotations
so the waiter's RBAC is managed independently of the host-shield's.

Bumps chart to 1.38.0.
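A values.yaml sketch of the new keys (key names and defaults are taken from the commit message above; the surrounding structure is an assumption):

```yaml
gke_autopilot:
  allowlist_waiter:
    enabled: true
    # New: waiter RBAC managed independently of host.rbac.create
    create_rbac: true
    # New: extra annotations applied to the waiter's SA/CR/CRB
    rbac_annotations: {}
```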
…k failure

Helm 3's hook executor applies the HookSucceeded delete-policy to both
the failed hook and all previously-succeeded hooks in the batch (see
helm/helm pkg/action/hooks.go execHook). When the allowlist-waiter Job
(weight 5) fails, Helm sweeps the SA/CR/CRB (weight -5) because they
carried hook-succeeded — leaving the failed Job referencing a missing
SA, which produces a FailedCreate retry storm bounded only by manual
intervention.

Two surgical changes:

  * Drop hook-succeeded from the SA/CR/CRB delete-policy. They now
    persist across failures (and across successes — a lingering tiny
    SA/CR/CRB that the next install's before-hook-creation reclaims).
  * Add hook-failed to the Job's delete-policy so a failed Job is
    reaped instead of looping forever against a deleted SA.

The AllowlistSynchronizer is unchanged (already before-hook-creation
only — it must persist after the waiter exits since host-shield pods
rely on Warden seeing it).
When the waiter's kubectl wait times out, the Pod previously exited with
just the bare wait error, leaving no diagnostics by the time the next
on-call engineer arrived (especially after kubelet log eviction).

On non-zero exit, the script now emits a kubectl describe and full YAML
of the AllowlistSynchronizer to stderr before exiting with the original
error code. The existing RBAC (get/list/watch on allowlistsynchronizers)
is sufficient; events lookup is intentionally skipped to avoid widening
the ClusterRole.
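A sketch of the failure path inside the waiter's container script. Illustrative only: the condition name, timeout value, and exact kubectl invocations are assumptions; the synchronizer name matches the remediation commands in the PR description.

```yaml
# Fragment of the waiter Job's pod template (illustrative):
command:
  - /bin/sh
  - -c
  - |
    kubectl wait --for=condition=Ready \
      allowlistsynchronizer/sysdig-agent-allowlist-synchronizer \
      --timeout=120s
    rc=$?
    if [ "$rc" -ne 0 ]; then
      # Uses only the existing get/list/watch RBAC; no events lookup.
      kubectl describe allowlistsynchronizer sysdig-agent-allowlist-synchronizer >&2
      kubectl get allowlistsynchronizer sysdig-agent-allowlist-synchronizer -o yaml >&2
    fi
    exit "$rc"
```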
…econds

The Job's only existing timeout lives inside the container script
(kubectl wait --timeout). If the Pod hangs before that line executes
(image-pull stall, scheduler delay, admission webhook latency), the
Job can outrun Helm's pre-install hook timeout.

Add a Job-level activeDeadlineSeconds guard via the new
gke_autopilot.allowlist_waiter.active_deadline_seconds value, defaulting
to 300 (180s of headroom over the default 120s wait timeout).
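Wiring sketch for the new value (the values key is from the commit message; the template path is an assumption):

```yaml
# values.yaml
gke_autopilot:
  allowlist_waiter:
    active_deadline_seconds: 300

# templates/allowlist-waiter-job.yaml (illustrative)
spec:
  activeDeadlineSeconds: {{ .Values.gke_autopilot.allowlist_waiter.active_deadline_seconds }}
```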
@francesco-furlan francesco-furlan requested a review from a team as a code owner May 12, 2026 14:46
