fix: wait for cert-manager webhook in decommissioning integration test #1441
Conversation
… test

The `TestIntegrationStatefulSetDecommissioner` test intermittently fails with "no endpoints available for service cert-manager-webhook" because `helm install` runs before the cert-manager webhook pod has ready endpoints.

- Add `testutil.WaitForCertManagerWebhook()` in `SetupSuite` before any helm operations, matching the pattern used by the vcluster, k3d, helmtest, and acceptance test setups.
- Add a webhook error retry to `installChart`, matching the existing retry in `upgradeChart`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 18e11fe to e36b055.
Filed DEVPROD-4069 to set up ECR pull-through cache rules for all upstream registries used by CI (quay.io, ghcr.io, registry.k8s.io, docker.io). This would eliminate the external registry dependency that causes these cert-manager flakes: same-region ECR pulls instead of cross-internet pulls to quay.io under CI load.
Force-pushed from 9bd1b37 to d618cbc.
…e stale k3d clusters

Root cause: CI agents persist k3d clusters across builds. When the cert-manager version in the startup manifest changes, `loadCluster` patches the HelmChart, which triggers an in-place upgrade. This disrupts the running webhook during the transition, causing "cert-manager webhook not ready" timeouts.

Fixes:
1. Align cert-manager to v1.17.2 (pre-pulled by CI) in both `pkg/k3d/cert-manager.yaml` and `pkg/testutil/testutil.go`.
2. Pre-import cert-manager images into k3d containerd in `waitForJobs()` so the helm controller doesn't need to pull from the internet.
3. Auto-recreate stale clusters: if `loadCluster` fails (e.g. the webhook never becomes ready after a manifest upgrade), `GetOrCreate` deletes the unhealthy cluster and creates a fresh one. This handles the case where a CI agent has a k3d cluster from a previous build with an incompatible cert-manager version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
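The auto-recreate fallback (fix 3 above) is a load-then-recreate decision. A minimal sketch with hypothetical `getOrCreate`, `load`, `remove`, and `create` signatures; the real code in `pkg/k3d/k3d.go` differs:

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is a stand-in for the k3d cluster handle; all names here are
// illustrative, not the repo's actual types.
type Cluster struct{ Name string }

// getOrCreate loads an existing cluster when it is healthy; if loading fails
// (e.g. the cert-manager webhook never becomes ready after a manifest
// upgrade), it deletes the stale cluster and creates a fresh one.
func getOrCreate(
	name string,
	load func(string) (*Cluster, error),
	remove func(string) error,
	create func(string) (*Cluster, error),
) (*Cluster, error) {
	if c, err := load(name); err == nil {
		return c, nil
	}
	// Stale or unhealthy cluster left behind by a previous build: recreate.
	if err := remove(name); err != nil {
		return nil, fmt.Errorf("deleting stale cluster %q: %w", name, err)
	}
	return create(name)
}

func main() {
	// Simulate a stale cluster whose webhook never becomes ready.
	load := func(string) (*Cluster, error) {
		return nil, errors.New("cert-manager webhook not ready")
	}
	remove := func(string) error { return nil }
	create := func(n string) (*Cluster, error) { return &Cluster{Name: n}, nil }

	c, err := getOrCreate("ci-agent-cluster", load, remove, create)
	fmt.Println(c.Name, err)
}
```

The key property is that any load failure, whatever the cause, converges on a freshly created cluster, so the CI agent's leftover state can never wedge a build.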
Force-pushed from d618cbc to 553c34d.
Integration test failures on this shard are Docker Hub rate limiting, not cert-manager

Build 12910 shard 3 shows 16 test failures, but they all stem from the same root cause: Docker Hub's anonymous pull rate limit (429 Too Many Requests). Affected tests (all from image pull failures, not code bugs):

None of these are related to the cert-manager changes in this PR, and the cert-manager diagnostics confirm it. This is the exact problem DEVPROD-4069 (ECR pull-through cache) would solve: routing image pulls through a same-region AWS cache instead of hitting public registries directly under parallel CI load.
Summary
Fix intermittent `TestIntegrationStatefulSetDecommissioner` failures caused by the cert-manager webhook not being ready when `helm install` runs.

Root cause
CI agents persist k3d clusters across builds. Three problems compound:
1. Version mismatch: the k3d startup manifest (`pkg/k3d/cert-manager.yaml`) pinned cert-manager v1.16.1, but CI only pre-pulls images for v1.14.2 and v1.17.2. The k3s helm controller had to pull v1.16.1 from the internet at runtime, which is slow or fails under CI load.
2. Images not in k3d containerd: CI's `test:pull-images` downloads images into the host Docker daemon, but k3d nodes use containerd (a separate image store). Without an explicit `k3d image import`, images aren't available inside the cluster.
3. Stale clusters break on version upgrades: when `GetOrCreate` finds a pre-existing k3d cluster from a previous build, `loadCluster` patches the HelmChart manifest (e.g., v1.16.1 → v1.17.2). This triggers an in-place upgrade by k3s's helm controller, which disrupts the running cert-manager webhook during the transition. The webhook readiness check then times out because the upgrade is still in progress.

Changes
1. Auto-recreate stale clusters (`pkg/k3d/k3d.go`)

If `loadCluster` fails on a pre-existing cluster (e.g., the webhook never becomes ready after a manifest upgrade), `GetOrCreate` now deletes the unhealthy cluster and recreates it from scratch. This guarantees a clean state regardless of what previous builds left behind on the CI agent.

2. Pre-import cert-manager images into k3d (`pkg/k3d/k3d.go`)

`waitForJobs()` now calls `c.importImages(certManagerImages()...)` before applying startup manifests, importing the cert-manager images from the host Docker daemon into k3d's containerd. It falls back gracefully to a registry pull if the images aren't in the host daemon (e.g., local dev).

3. Align cert-manager versions to v1.17.2 (the newest CI pre-pulled version)

- `pkg/k3d/cert-manager.yaml`
- `pkg/testutil/testutil.go`

4. Readiness wait in decommissioning test (`statefulset_decommissioner_test.go`)

Add `testutil.WaitForCertManagerWebhook()` in `SetupSuite` before any helm operations, matching the pattern used by the vcluster, k3d, helmtest, and acceptance test setups.

5. Install retry (`statefulset_decommissioner_test.go`)

Add a webhook error retry to `installChart()`, matching the existing retry pattern in `upgradeChart()`.

Parallelism
This does not change test parallelism. The `WaitForCertManagerWebhook` call runs once in `SetupSuite`. The image import runs once per cluster setup via marker files. The cluster auto-recreate only triggers when `loadCluster` fails, which is a blocking error today anyway. All of these complete before any test method executes.

Evidence
Progressive debugging across CI builds:
Test plan
- `TestIntegrationStatefulSetDecommissioner` passes on CI (build 12893, all shards)

🤖 Generated with Claude Code