
fix: wait for cert-manager webhook in decommissioning integration test#1441

Draft
david-yu wants to merge 2 commits into main from fix/decommission-test-cert-manager-flake

Conversation

Contributor

@david-yu david-yu commented Apr 10, 2026

Summary

Fix intermittent TestIntegrationStatefulSetDecommissioner failures caused by cert-manager webhook not being ready when helm install runs.

Root cause

CI agents persist k3d clusters across builds. Three problems compound:

  1. Version mismatch — The k3d startup manifest (pkg/k3d/cert-manager.yaml) pinned cert-manager v1.16.1, but CI only pre-pulls images for v1.14.2 and v1.17.2. The k3s helm controller had to pull v1.16.1 from the internet at runtime, which is slow or fails under CI load.

  2. Images not in k3d containerd — CI's test:pull-images downloads images into the host Docker daemon, but k3d nodes use containerd (a separate image store). Without an explicit k3d image import, images aren't available inside the cluster.

  3. Stale clusters break on version upgrades — When GetOrCreate finds a pre-existing k3d cluster from a previous build, loadCluster patches the HelmChart manifest (e.g., v1.16.1 → v1.17.2). This triggers an in-place upgrade by k3s's helm controller, which disrupts the running cert-manager webhook during the transition. The webhook readiness check then times out because the upgrade is still in progress.
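For context, the startup manifest in question is a k3s HelmChart custom resource along these lines (a hedged sketch of pkg/k3d/cert-manager.yaml; the actual chart values and namespace may differ):

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: cert-manager
  namespace: kube-system
spec:
  repo: https://charts.jetstack.io
  chart: cert-manager
  version: v1.17.2        # bumping this field triggers an in-place upgrade
  targetNamespace: cert-manager
```

When loadCluster patches spec.version on a pre-existing cluster, the k3s helm controller performs the upgrade live, taking the webhook endpoints down mid-transition.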

Changes

1. Auto-recreate stale clusters (pkg/k3d/k3d.go)

If loadCluster fails on a pre-existing cluster (e.g., webhook never becomes ready after a manifest upgrade), GetOrCreate now deletes the unhealthy cluster and recreates it from scratch. This guarantees a clean state regardless of what previous builds left behind on the CI agent.

2. Pre-import cert-manager images into k3d (pkg/k3d/k3d.go)

waitForJobs() now calls c.importImages(certManagerImages()...) before applying startup manifests, copying the cert-manager images from the host Docker daemon into k3d's containerd store. If the images aren't in the host daemon (e.g., local development), it falls back gracefully to a registry pull.

3. Align cert-manager versions to v1.17.2 (the newest CI pre-pulled version)

| File | Was | Now |
| --- | --- | --- |
| pkg/k3d/cert-manager.yaml | v1.16.1 | v1.17.2 |
| pkg/testutil/testutil.go | v1.17.1 | v1.17.2 |

4. Readiness wait in decommissioning test (statefulset_decommissioner_test.go)

Add testutil.WaitForCertManagerWebhook() in SetupSuite before any helm operations, matching the pattern used by vcluster, k3d, helmtest, and acceptance test setups.

5. Install retry (statefulset_decommissioner_test.go)

Add webhook error retry to installChart(), matching the existing retry pattern in upgradeChart().

Parallelism

This does not change test parallelism. The WaitForCertManagerWebhook call runs once in SetupSuite. The image import runs once per cluster setup via marker files. The cluster auto-recreate only triggers when loadCluster fails, which is a blocking error today anyway. All complete before any test method executes.

Evidence

Progressive debugging across CI builds:

  • Build 12884/12885 — Original flake: helm install fails with "no endpoints available for service cert-manager-webhook"
  • Build 12886 — Added readiness wait → fails with "cert-manager webhook not ready" (clearer error, same root cause — images not in containerd)
  • Build 12889 — Aligned versions to v1.17.2 → still fails (stale cluster from previous build doing in-place upgrade)
  • Build 12891 — Added image import → passes on 3/4 shards, fails on shard with stale cluster
  • Build 12893 — Added auto-recreate → passes on all shards

Test plan

  • TestIntegrationStatefulSetDecommissioner passes on CI (build 12893, all shards)
  • Other integration tests unaffected
  • Verify on a CI agent that previously had a stale cluster (auto-recreate path exercised)

🤖 Generated with Claude Code

… test

The TestIntegrationStatefulSetDecommissioner test intermittently fails
with "no endpoints available for service cert-manager-webhook" because
helm install runs before the cert-manager webhook pod has ready
endpoints.

Add testutil.WaitForCertManagerWebhook() in SetupSuite before any helm
operations, matching the pattern used by vcluster, k3d, helmtest, and
acceptance test setups. Also add webhook error retry to installChart,
matching the existing retry in upgradeChart.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@david-yu david-yu force-pushed the fix/decommission-test-cert-manager-flake branch 4 times, most recently from 18e11fe to e36b055 Compare April 11, 2026 04:36
Contributor Author

david-yu commented Apr 11, 2026

Filed DEVPROD-4069 to set up ECR pull-through cache rules for all upstream registries used by CI (quay.io, ghcr.io, registry.k8s.io, docker.io). This would eliminate the external registry dependency that causes these cert-manager flakes — same-region ECR pulls instead of cross-internet pulls to quay.io under CI load.

@david-yu david-yu force-pushed the fix/decommission-test-cert-manager-flake branch 3 times, most recently from 9bd1b37 to d618cbc Compare April 11, 2026 05:45
…e stale k3d clusters

Root cause: CI agents persist k3d clusters across builds. When the
cert-manager version in the startup manifest changes, loadCluster
patches the HelmChart which triggers an in-place upgrade. This
disrupts the running webhook during transition, causing "cert-manager
webhook not ready" timeouts.

Fixes:
1. Align cert-manager to v1.17.2 (pre-pulled by CI) in both
   pkg/k3d/cert-manager.yaml and pkg/testutil/testutil.go.

2. Pre-import cert-manager images into k3d containerd in waitForJobs()
   so the helm controller doesn't need to pull from the internet.

3. Auto-recreate stale clusters: if loadCluster fails (e.g. webhook
   never becomes ready after manifest upgrade), GetOrCreate deletes
   the unhealthy cluster and creates a fresh one. This handles the
   case where a CI agent has a k3d cluster from a previous build with
   an incompatible cert-manager version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@david-yu david-yu force-pushed the fix/decommission-test-cert-manager-flake branch from d618cbc to 553c34d Compare April 11, 2026 06:04
Contributor Author

david-yu commented Apr 12, 2026

Integration test failures on this shard are Docker Hub rate limiting, not cert-manager

Build 12910 shard 3 shows 16 test failures, but they all stem from the same root cause — Docker Hub anonymous pull rate limit (429 Too Many Requests):

ImagePullBackOff - Back-off pulling image "docker.io/redpandadata/redpanda:v24.3.5":
429 Too Many Requests - You have reached your unauthenticated pull rate limit.

Affected tests (all from image pull failures, not code bugs):

  • TestIntegrationStatefulSetDecommissioner — cert-manager NO pods found (cert-manager images can't be pulled)
  • TestIntegrationFactoryOperatorV1 — redpanda:v24.3.5 ImagePullBackOff
  • TestIntegrationClientFactory (no_TLS, TLS, TLS+SCRAM-512) — helm install timeout
  • TestIntegrationChart (default, namespaced) — helm install context deadline exceeded
  • TestIntegrationChart/v25 (rbac, sidecar, set-datadir-ownership, admin_api_auth_required, console-integration) — helm install context deadline exceeded
  • TestIntegrationClientFactoryTLSListeners — helm install timeout

None of these are related to the cert-manager changes in this PR. The cert-manager diagnostics confirm it: NO pods found in cert-manager namespace — because the cert-manager images themselves couldn't be pulled from quay.io (the agent was rate-limited across all registries).

This is the exact problem DEVPROD-4069 (ECR pull-through cache) would solve — routing image pulls through a same-region AWS cache instead of hitting public registries directly under parallel CI load.

