e2e/operator: Fix GCP single-region and multi-region test infrastructure by shreyaskm623 · Pull Request #635 · cockroachdb/helm-charts

shreyaskm623 · 2026-05-17T12:37:35Z

Fixes the existing e2e tests running in GCP for the single and multi region

shreyaskm623 · 2026-05-18T06:29:34Z

Attaching the Screen shot below for the runs of the single and multi region tests which are passing. I have tested them on the cockroach-shreyaskm project, I'm sure it should work even in the helm-testing project ( where we run the actual weekly test suite ) as these are all project agnostic ( @NishanthNalluri to confirm ).

Single region :

Multi region :

Copilot

Pull request overview

This PR updates the operator e2e test harness to stabilize GCP single-region and multi-region infrastructure, primarily around DNS/cluster-domain handling and readiness behavior.

Changes:

Make cluster/pod readiness polling resilient to transient Kubernetes API errors by retrying instead of failing immediately.
Normalize cluster-domain handling (defaulting single-region to cluster.local) and add GCP-specific join-service patching required for pod discovery.
Extend GCP infra setup to support multi-region custom domains (kube-dns stubDomains + node-local-dns restart), allow overriding the node service account, and tweak initial node counts.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
tests/testutil/require.go	Treats certain Kubernetes API errors as transient during readiness polling to reduce test flakiness.
tests/e2e/operator/region.go	Defaults single-region domain to `cluster.local`, adjusts region config generation, patches join service on GCP, and ensures CA artifacts are cleaned before regeneration.
tests/e2e/operator/infra/gcp.go	Adds kube-dns/node-local-dns patching for custom domains in multi-region GCP tests, supports node service account override, and adjusts regional cluster autoscaling parameters.
tests/e2e/operator/infra/common.go	Updates kubeconfig setup to modify the cluster TLS verification behavior before renaming contexts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+	// Step 2: Skip TLS verification to avoid x509 errors from GKE's internal CA.
 	longContextName := fmt.Sprintf("gke_%s_%s_%s", projectID, region, clusterName)
+	skipTLSCmd := exec.Command("kubectl", "config", "set-cluster", longContextName,
+		"--insecure-skip-tls-verify=true")
+	output, err = skipTLSCmd.CombinedOutput()


NishanthNalluri · 2026-05-19T05:18:11Z

-	// Step 2: Rename context
+	// Step 2: Skip TLS verification to avoid x509 errors from GKE's internal CA.
 	longContextName := fmt.Sprintf("gke_%s_%s_%s", projectID, region, clusterName)
+	skipTLSCmd := exec.Command("kubectl", "config", "set-cluster", longContextName,


Are we skipping tls verification only while running the tests locally on a GKE cluster or is this the default pattern?

This is only for tests.

No, I meant is this change only for local or are we skipping tls everywhere?

We can skip TLS everywhere for now. In case this requires a holistic fix via operator we can take that up later I think.

NishanthNalluri · 2026-05-19T05:19:31Z

+
+	if r.IsMultiRegion {
+		for _, clusterName := range r.Clusters {
+			err = patchKubeDNSForCustomDomains(t, clusterName, kubeConfigPath)


Why do we need to patch kubedns with custom cluster domains? Is this mandatory for Multi region tests to work?

Yes, For the multi region tests we use cluster1.local and cluster2.local which needs to be resolved.

Are you suggesting an alternate way here ?

Even for single region tests we use cluster1.local and we don't patch kubedns for those tests right?

That's exactly why the current tests are failing. Going forward, we'll use cluster.local for the single region

You can refer to the screen shot below, after that particular log, the tests will start failing. Even when I was testing, I faced the similar issue.

I will check this once because long back the tests are passing with custom domains.

https://github.com/cockroachdb/helm-charts/actions/runs/21127552978 - this is in Jan and we dont have changes to our e2e tests during this time.

NishanthNalluri · 2026-05-19T05:26:36Z

+	if r.Provider == "gcp" {
+		k8s.WaitUntilServiceAvailable(t, kubectlOptions, "cockroachdb-join", 30, 5*time.Second)
+		err := k8s.RunKubectlE(t, kubectlOptions, "patch", "service", "cockroachdb-join",
+			"--type=merge", "-p", `{"spec":{"publishNotReadyAddresses":true}}`)


We should not make this change for join service, the behaviour should be the same irrespective of provider.

If we install a CRDB cluster manually on a GKE cluster, it works as expected without any modifications right?

I previously faced the issue with cluster join, that's why had to add this. I tried again without this and works fine now. I have addressed this comment.

NishanthNalluri · 2026-05-19T05:28:15Z


 // CreateCACertificate creates CA cert and key at the same path.
 func (r *Region) CreateCACertificate(t *testing.T) error {
+	r.CleanUpCACertificate(t)


Doesnt this create a new CA cert for every test? or Did we refactor to use only one CA for all the tests?

Each test still creates its own CA cert — that behavior hasn't changed. CreateCACertificate is called once at the start of each test function (across both single-region and multi-region tests).

The CleanUpCACertificate call added at line 377 is purely defensive: it deletes any leftover ca.crt/ca.key files before generating a fresh CA. The problem it fixes is that if a previous test crashed mid-run and left stale cert files on disk, the subsequent cockroach cert create-ca command would fail because the files already exist. So it's a "clean slate before creation" guard, not a refactor of the CA lifecycle.

Thanks, want to confirm this.

NishanthNalluri · 2026-05-19T05:32:36Z

-				t.Logf("error fetching stateful set")
-				return false, err
+				t.Logf("transient error fetching stateful set, will retry: %v", err)
+				return false, nil


why did we make changes to this method? This is not being used by any operator tests right?

It's a fix made just for the consistency, Nothing to do with tests.

Yeah, but i want to understand why this change is made now? Did we face any issues with timeout while running tests?

There was an Intermittent issue. It's anyway a very minimal change. I will go ahead and remove it if it's out of scope for this.

NishanthNalluri · 2026-05-19T05:36:02Z

Do we know the exact reason why the existing gcp runs are failing?

shreyaskm623 · 2026-05-20T11:54:54Z

@NishanthNalluri For the current fixes in the PR, the build for GCP is passing : https://github.com/cockroachdb/helm-charts/actions/runs/26157873855

shreyaskm623 self-assigned this May 17, 2026

shreyaskm623 force-pushed the shreyaskm/fix-existing-gcp-test branch from 75edd41 to dfef36d Compare May 18, 2026 06:23

shreyaskm623 requested review from NishanthNalluri, nameisbhaskar and pritesh-lahoti May 18, 2026 06:30

shreyaskm623 marked this pull request as ready for review May 18, 2026 09:50

shreyaskm623 requested a review from Copilot May 19, 2026 05:04

Copilot started reviewing on behalf of shreyaskm623 May 19, 2026 05:05 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

NishanthNalluri reviewed May 19, 2026

View reviewed changes

e2e/operator: Fix GCP single-region and multi-region test infrastructure

6acba72

shreyaskm623 force-pushed the shreyaskm/fix-existing-gcp-test branch from 715c391 to 6acba72 Compare May 20, 2026 05:55

Conversation

shreyaskm623 commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shreyaskm623 commented May 18, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NishanthNalluri commented May 19, 2026

Uh oh!

shreyaskm623 commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

shreyaskm623 commented May 17, 2026 •

edited

Loading