Skip to content

e2e/operator: Fix GCP single-region and multi-region test infrastructure#635

Open
shreyaskm623 wants to merge 1 commit into
masterfrom
shreyaskm/fix-existing-gcp-test
Open

e2e/operator: Fix GCP single-region and multi-region test infrastructure#635
shreyaskm623 wants to merge 1 commit into
masterfrom
shreyaskm/fix-existing-gcp-test

Conversation

@shreyaskm623
Copy link
Copy Markdown
Contributor

@shreyaskm623 shreyaskm623 commented May 17, 2026

Fixes the existing e2e tests running in GCP for the single and multi region

@shreyaskm623 shreyaskm623 self-assigned this May 17, 2026
@shreyaskm623 shreyaskm623 force-pushed the shreyaskm/fix-existing-gcp-test branch from 75edd41 to dfef36d Compare May 18, 2026 06:23
@shreyaskm623
Copy link
Copy Markdown
Contributor Author

Attaching the Screen shot below for the runs of the single and multi region tests which are passing. I have tested them on the cockroach-shreyaskm project, I'm sure it should work even in the helm-testing project ( where we run the actual weekly test suite ) as these are all project agnostic ( @NishanthNalluri to confirm ).

Single region :

Screenshot 2026-05-18 at 11 56 21 AM

Multi region :

Screenshot 2026-05-18 at 11 58 08 AM

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the operator e2e test harness to stabilize GCP single-region and multi-region infrastructure, primarily around DNS/cluster-domain handling and readiness behavior.

Changes:

  • Make cluster/pod readiness polling resilient to transient Kubernetes API errors by retrying instead of failing immediately.
  • Normalize cluster-domain handling (defaulting single-region to cluster.local) and add GCP-specific join-service patching required for pod discovery.
  • Extend GCP infra setup to support multi-region custom domains (kube-dns stubDomains + node-local-dns restart), allow overriding the node service account, and tweak initial node counts.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
tests/testutil/require.go Treats certain Kubernetes API errors as transient during readiness polling to reduce test flakiness.
tests/e2e/operator/region.go Defaults single-region domain to cluster.local, adjusts region config generation, patches join service on GCP, and ensures CA artifacts are cleaned before regeneration.
tests/e2e/operator/infra/gcp.go Adds kube-dns/node-local-dns patching for custom domains in multi-region GCP tests, supports node service account override, and adjusts regional cluster autoscaling parameters.
tests/e2e/operator/infra/common.go Updates kubeconfig setup to modify the cluster TLS verification behavior before renaming contexts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +305 to +309
// Step 2: Skip TLS verification to avoid x509 errors from GKE's internal CA.
longContextName := fmt.Sprintf("gke_%s_%s_%s", projectID, region, clusterName)
skipTLSCmd := exec.Command("kubectl", "config", "set-cluster", longContextName,
"--insecure-skip-tls-verify=true")
output, err = skipTLSCmd.CombinedOutput()
// Step 2: Rename context
// Step 2: Skip TLS verification to avoid x509 errors from GKE's internal CA.
longContextName := fmt.Sprintf("gke_%s_%s_%s", projectID, region, clusterName)
skipTLSCmd := exec.Command("kubectl", "config", "set-cluster", longContextName,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we skipping tls verification only while running the tests locally on a GKE cluster or is this the default pattern?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only for tests.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I meant is this change only for local or are we skipping tls everywhere?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can skip TLS everywhere for now. In case this requires a holistic fix via operator we can take that up later I think.


if r.IsMultiRegion {
for _, clusterName := range r.Clusters {
err = patchKubeDNSForCustomDomains(t, clusterName, kubeConfigPath)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to patch kubedns with custom cluster domains? Is this mandatory for Multi region tests to work?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, For the multi region tests we use cluster1.local and cluster2.local which needs to be resolved.

Are you suggesting an alternate way here ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even for single region tests we use cluster1.local and we don't patch kubedns for those tests right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly why the current tests are failing. Going forward, we'll use cluster.local for the single region

You can refer to the screen shot below, after that particular log, the tests will start failing. Even when I was testing, I faced the similar issue.

Screenshot 2026-05-19 at 3 22 39 PM

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will check this once because long back the tests are passing with custom domains.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/cockroachdb/helm-charts/actions/runs/21127552978 - this is in Jan and we dont have changes to our e2e tests during this time.

Comment thread tests/e2e/operator/region.go Outdated
if r.Provider == "gcp" {
k8s.WaitUntilServiceAvailable(t, kubectlOptions, "cockroachdb-join", 30, 5*time.Second)
err := k8s.RunKubectlE(t, kubectlOptions, "patch", "service", "cockroachdb-join",
"--type=merge", "-p", `{"spec":{"publishNotReadyAddresses":true}}`)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not make this change for join service, the behaviour should be the same irrespective of provider.

If we install a CRDB cluster manually on a GKE cluster, it works as expected without any modifications right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I previously faced the issue with cluster join, that's why had to add this. I tried again without this and works fine now. I have addressed this comment.


// CreateCACertificate creates CA cert and key at the same path.
func (r *Region) CreateCACertificate(t *testing.T) error {
r.CleanUpCACertificate(t)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesnt this create a new CA cert for every test? or Did we refactor to use only one CA for all the tests?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each test still creates its own CA cert — that behavior hasn't changed. CreateCACertificate is called once at the start of each test function (across both single-region and multi-region tests).

The CleanUpCACertificate call added at line 377 is purely defensive: it deletes any leftover ca.crt/ca.key files before generating a fresh CA. The problem it fixes is that if a previous test crashed mid-run and left stale cert files on disk, the subsequent cockroach cert create-ca command would fail because the files already exist. So it's a "clean slate before creation" guard, not a refactor of the CA lifecycle.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, want to confirm this.

Comment thread tests/testutil/require.go
t.Logf("error fetching stateful set")
return false, err
t.Logf("transient error fetching stateful set, will retry: %v", err)
return false, nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we make changes to this method? This is not being used by any operator tests right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a fix made just for the consistency, Nothing to do with tests.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but i want to understand why this change is made now? Did we face any issues with timeout while running tests?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was an Intermittent issue. It's anyway a very minimal change. I will go ahead and remove it if it's out of scope for this.

@NishanthNalluri
Copy link
Copy Markdown
Contributor

Do we know the exact reason why the existing gcp runs are failing?

@shreyaskm623 shreyaskm623 force-pushed the shreyaskm/fix-existing-gcp-test branch from 715c391 to 6acba72 Compare May 20, 2026 05:55
@shreyaskm623
Copy link
Copy Markdown
Contributor Author

@NishanthNalluri For the current fixes in the PR, the build for GCP is passing : https://github.com/cockroachdb/helm-charts/actions/runs/26157873855

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants