e2e/operator: Fix GCP single-region and multi-region test infrastructure#635
e2e/operator: Fix GCP single-region and multi-region test infrastructure#635shreyaskm623 wants to merge 1 commit into
Conversation
75edd41 to
dfef36d
Compare
|
Attaching the Screen shot below for the runs of the single and multi region tests which are passing. I have tested them on the Single region :
|
There was a problem hiding this comment.
Pull request overview
This PR updates the operator e2e test harness to stabilize GCP single-region and multi-region infrastructure, primarily around DNS/cluster-domain handling and readiness behavior.
Changes:
- Make cluster/pod readiness polling resilient to transient Kubernetes API errors by retrying instead of failing immediately.
- Normalize cluster-domain handling (defaulting single-region to
cluster.local) and add GCP-specific join-service patching required for pod discovery. - Extend GCP infra setup to support multi-region custom domains (kube-dns stubDomains + node-local-dns restart), allow overriding the node service account, and tweak initial node counts.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/testutil/require.go | Treats certain Kubernetes API errors as transient during readiness polling to reduce test flakiness. |
| tests/e2e/operator/region.go | Defaults single-region domain to cluster.local, adjusts region config generation, patches join service on GCP, and ensures CA artifacts are cleaned before regeneration. |
| tests/e2e/operator/infra/gcp.go | Adds kube-dns/node-local-dns patching for custom domains in multi-region GCP tests, supports node service account override, and adjusts regional cluster autoscaling parameters. |
| tests/e2e/operator/infra/common.go | Updates kubeconfig setup to modify the cluster TLS verification behavior before renaming contexts. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Step 2: Skip TLS verification to avoid x509 errors from GKE's internal CA. | ||
| longContextName := fmt.Sprintf("gke_%s_%s_%s", projectID, region, clusterName) | ||
| skipTLSCmd := exec.Command("kubectl", "config", "set-cluster", longContextName, | ||
| "--insecure-skip-tls-verify=true") | ||
| output, err = skipTLSCmd.CombinedOutput() |
| // Step 2: Rename context | ||
| // Step 2: Skip TLS verification to avoid x509 errors from GKE's internal CA. | ||
| longContextName := fmt.Sprintf("gke_%s_%s_%s", projectID, region, clusterName) | ||
| skipTLSCmd := exec.Command("kubectl", "config", "set-cluster", longContextName, |
There was a problem hiding this comment.
Are we skipping tls verification only while running the tests locally on a GKE cluster or is this the default pattern?
There was a problem hiding this comment.
This is only for tests.
There was a problem hiding this comment.
No, I meant is this change only for local or are we skipping tls everywhere?
There was a problem hiding this comment.
We can skip TLS everywhere for now. In case this requires a holistic fix via operator we can take that up later I think.
|
|
||
| if r.IsMultiRegion { | ||
| for _, clusterName := range r.Clusters { | ||
| err = patchKubeDNSForCustomDomains(t, clusterName, kubeConfigPath) |
There was a problem hiding this comment.
Why do we need to patch kubedns with custom cluster domains? Is this mandatory for Multi region tests to work?
There was a problem hiding this comment.
Yes, For the multi region tests we use cluster1.local and cluster2.local which needs to be resolved.
Are you suggesting an alternate way here ?
There was a problem hiding this comment.
Even for single region tests we use cluster1.local and we don't patch kubedns for those tests right?
There was a problem hiding this comment.
There was a problem hiding this comment.
I will check this once because long back the tests are passing with custom domains.
There was a problem hiding this comment.
https://github.com/cockroachdb/helm-charts/actions/runs/21127552978 - this is in Jan and we dont have changes to our e2e tests during this time.
| if r.Provider == "gcp" { | ||
| k8s.WaitUntilServiceAvailable(t, kubectlOptions, "cockroachdb-join", 30, 5*time.Second) | ||
| err := k8s.RunKubectlE(t, kubectlOptions, "patch", "service", "cockroachdb-join", | ||
| "--type=merge", "-p", `{"spec":{"publishNotReadyAddresses":true}}`) |
There was a problem hiding this comment.
We should not make this change for join service, the behaviour should be the same irrespective of provider.
If we install a CRDB cluster manually on a GKE cluster, it works as expected without any modifications right?
There was a problem hiding this comment.
I previously faced the issue with cluster join, that's why had to add this. I tried again without this and works fine now. I have addressed this comment.
|
|
||
| // CreateCACertificate creates CA cert and key at the same path. | ||
| func (r *Region) CreateCACertificate(t *testing.T) error { | ||
| r.CleanUpCACertificate(t) |
There was a problem hiding this comment.
Doesnt this create a new CA cert for every test? or Did we refactor to use only one CA for all the tests?
There was a problem hiding this comment.
Each test still creates its own CA cert — that behavior hasn't changed. CreateCACertificate is called once at the start of each test function (across both single-region and multi-region tests).
The CleanUpCACertificate call added at line 377 is purely defensive: it deletes any leftover ca.crt/ca.key files before generating a fresh CA. The problem it fixes is that if a previous test crashed mid-run and left stale cert files on disk, the subsequent cockroach cert create-ca command would fail because the files already exist. So it's a "clean slate before creation" guard, not a refactor of the CA lifecycle.
There was a problem hiding this comment.
Thanks, want to confirm this.
| t.Logf("error fetching stateful set") | ||
| return false, err | ||
| t.Logf("transient error fetching stateful set, will retry: %v", err) | ||
| return false, nil |
There was a problem hiding this comment.
why did we make changes to this method? This is not being used by any operator tests right?
There was a problem hiding this comment.
It's a fix made just for the consistency, Nothing to do with tests.
There was a problem hiding this comment.
Yeah, but i want to understand why this change is made now? Did we face any issues with timeout while running tests?
There was a problem hiding this comment.
There was an Intermittent issue. It's anyway a very minimal change. I will go ahead and remove it if it's out of scope for this.
|
Do we know the exact reason why the existing gcp runs are failing? |
715c391 to
6acba72
Compare
|
@NishanthNalluri For the current fixes in the PR, the build for GCP is passing : https://github.com/cockroachdb/helm-charts/actions/runs/26157873855 |



Fixes the existing e2e tests running in GCP for the single and multi region