Skip to content

Fix 3 race conditions#711

Merged
wikkyk merged 3 commits into
mainfrom
claude/fix-flaky-test-race-ewxDQ
May 11, 2026
Merged

Fix 3 race conditions#711
wikkyk merged 3 commits into
mainfrom
claude/fix-flaky-test-race-ewxDQ

Conversation

@wikkyk
Copy link
Copy Markdown
Collaborator

@wikkyk wikkyk commented Apr 10, 2026

Issue #, if available:
#339

Description of changes:
See commit messages for details.
Each commit has an independent fix for a different race.
The missing credentials condition commit changes runtime, the other two fix test setup.

Testing performed:
looped make test for 2 weeks under varying load.

@wikkyk wikkyk marked this pull request as draft April 10, 2026 07:12
@wikkyk wikkyk changed the title Move context initialization to NewTestEnvironment constructor Fix occasional test panic(?) Apr 10, 2026
@wikkyk wikkyk force-pushed the claude/fix-flaky-test-race-ewxDQ branch from 850a595 to 00a1b4b Compare April 15, 2026 09:33
@wikkyk wikkyk force-pushed the claude/fix-flaky-test-race-ewxDQ branch from 00a1b4b to 0f1afdc Compare April 15, 2026 09:35
@wikkyk wikkyk force-pushed the claude/fix-flaky-test-race-ewxDQ branch from 0f1afdc to 969fd4d Compare April 15, 2026 09:41
@wikkyk wikkyk force-pushed the claude/fix-flaky-test-race-ewxDQ branch from 969fd4d to 831fc88 Compare April 15, 2026 09:55
@wikkyk wikkyk changed the title Fix occasional test panic(?) Fix 3 race conditions Apr 15, 2026
@wikkyk wikkyk marked this pull request as ready for review April 15, 2026 10:18
Copilot AI review requested due to automatic review settings April 15, 2026 10:18
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Addresses intermittent test panics (Issue #339) by removing race-prone test setup patterns around envtest manager context initialization, and refines controller condition reporting for missing Proxmox credentials.

Changes:

  • Update envtest helper to require a parent context at construction time and use an internally-managed derived context for manager startup.
  • Adjust webhook/controller test suites to match the new envtest helper API and manager start behavior.
  • Introduce a new ProxmoxCluster condition reason for “credentials secret not found” and use it during reconciliation.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
test/helpers/envtest.go Makes envtest lifecycle context owned by TestEnvironment (constructor-provided), changes StartManager to use stored context.
internal/webhook/webhook_suite_test.go Updates suite setup to pass managerCtx into NewTestEnvironment and calls StartManager() without args.
internal/controller/suite_test.go Updates suite setup to pass managerCtx into NewTestEnvironment and calls StartManager() without args.
internal/controller/proxmoxcluster_controller_test.go Uses k8sClient directly when refreshing cluster objects to avoid relying on the env manager client.
internal/controller/proxmoxcluster_controller.go Sets Proxmox availability condition reason to the new “CredentialsNotFound” when credential secret reconciliation errors.
api/v1alpha2/conditions_consts.go Adds ProxmoxClusterProxmoxAvailableCredentialsNotFoundReason constant.

Comment thread test/helpers/envtest.go
@@ -148,17 +148,17 @@ func NewTestEnvironment(setupWebhook bool, pmClient proxmox.Client) *TestEnviron
Port: whio.LocalServingPort,
},
}
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context.WithCancel(ctx) will panic if a nil context is passed into NewTestEnvironment. Even though current call sites pass managerCtx, this helper is easy to reuse elsewhere; consider guarding against nil (e.g., default to context.Background() or fail fast with a clear panic/log message) before calling context.WithCancel.

Suggested change
}
}
if ctx == nil {
ctx = context.Background()
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. this is actually handled elsewhere
  2. it doesn't matter, it's just a test
  3. we want to panic here because it means setup is broken

Comment thread internal/controller/proxmoxcluster_controller.go
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment thread api/v1alpha2/conditions_consts.go Outdated
Comment thread pkg/scope/cluster.go
Comment thread internal/controller/proxmoxcluster_controller.go
@wikkyk
Copy link
Copy Markdown
Collaborator Author

wikkyk commented Apr 15, 2026

because I made many changes I'll loop it for a few more days

Copilot AI review requested due to automatic review settings April 16, 2026 11:23
@wikkyk wikkyk force-pushed the claude/fix-flaky-test-race-ewxDQ branch from f7dac33 to e04f567 Compare April 16, 2026 11:23
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread pkg/scope/cluster.go
Reason: infrav1.ProxmoxClusterProxmoxAvailableProxmoxUnreachableReason,
Reason: infrav1.ProxmoxClusterProxmoxAvailableCredentialsNotFoundReason,
Message: "credentials secret not found",
})
Copy link

Copilot AI Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setupProxmoxClient sets a failure condition when the credentials Secret is missing, but when this error bubbles up during NewClusterScope creation the scope isn't closed/patch-applied, so this condition update likely never gets persisted to the API server. Consider moving the condition-setting to a place that is guaranteed to patch (e.g., controller reconcile), or ensure NewClusterScope calls Close()/PatchObject() before returning the error from setupProxmoxClient so users can reliably observe the CredentialsNotFound condition.

Suggested change
})
})
if patchErr := s.patchHelper.Patch(ctx, s.ProxmoxCluster); patchErr != nil {
return nil, errors.Wrapf(patchErr, "failed to patch ProxmoxCluster after setting credentials missing condition")
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is mostly unrelated to this PR. I opened a bug report: #726

Comment thread api/v1alpha2/conditions_consts.go
Copilot AI review requested due to automatic review settings April 16, 2026 11:47
Comment thread internal/controller/proxmoxcluster_controller.go Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment thread api/v1alpha2/conditions_consts.go Outdated
@wikkyk wikkyk force-pushed the claude/fix-flaky-test-race-ewxDQ branch from a27f418 to c1fd49c Compare April 20, 2026 12:35
Copilot AI review requested due to automatic review settings April 20, 2026 12:35
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

@wikkyk wikkyk force-pushed the claude/fix-flaky-test-race-ewxDQ branch from c1fd49c to 267db36 Compare April 20, 2026 12:42
@sonarqubecloud
Copy link
Copy Markdown

Comment thread internal/controller/proxmoxcluster_controller_test.go
Comment thread internal/controller/suite_test.go
@wikkyk
Copy link
Copy Markdown
Collaborator Author

wikkyk commented May 5, 2026

@65278
Copy link
Copy Markdown
Collaborator

65278 commented May 7, 2026

Before githerp eats the raw files, here's a relevant output

2026-04-29T00:37:31.3939014Z   [FAILED] in [It] - /home/runner/work/cluster-api-provider-proxmox/cluster-api-provider-proxmox/internal/controller/proxmoxcluster_controller_test.go:649 @ 04/29/26 00:37:29.555
2026-04-29T00:37:31.3939096Z   << Timeline
2026-04-29T00:37:31.3939103Z 
2026-04-29T00:37:31.3939216Z   [FAILED] Timed out after 10.000s.
2026-04-29T00:37:31.3939296Z   Expected
2026-04-29T00:37:31.3939393Z       <bool>: false
2026-04-29T00:37:31.3939468Z   to be true
2026-04-29T00:37:31.3940133Z   In [It] at: /home/runner/work/cluster-api-provider-proxmox/cluster-api-provider-proxmox/internal/controller/proxmoxcluster_controller_test.go:649 @ 04/29/26 00:37:29.555
2026-04-29T00:37:31.3940216Z ------------------------------
2026-04-29T00:37:31.3940287Z S
2026-04-29T00:37:31.3940292Z 
2026-04-29T00:37:31.3940381Z Summarizing 1 Failure:
2026-04-29T00:37:31.3940749Z   [FAIL] External Credentials Tests [It] should remove ProxmoxCluster finalizer if the secret does not exist
2026-04-29T00:37:31.3941283Z   /home/runner/work/cluster-api-provider-proxmox/cluster-api-provider-proxmox/internal/controller/proxmoxcluster_controller_test.go:649

@wikkyk wikkyk removed the e2e/none skip all e2e tests (documentation etc) - overrides all e2e/* labels label May 7, 2026
wikkyk and others added 3 commits May 8, 2026 09:53
TestEnvironment.ctx was only set inside StartManager(), which runs in a
goroutine. When the goroutine hadn't executed before the first test spec
called GetContext(), the nil context reached logr.FromContext() causing a
panic. Move context initialization into NewTestEnvironment() so it is
available immediately at construction time.

• [PANICKED] [0.010 seconds]
External Credentials Tests Reconcile an ProxmoxCluster [It] create and destroy a cluster
.../internal/controller/proxmoxcluster_controller_test.go:251

[PANICKED] Test Panicked
In [It] at: .../panic.go:262 @ 04/08/26 21:04:45.553

runtime error: invalid memory address or nil pointer dereference

Full Stack Trace
github.com/go-logr/logr.FromContext({0x0?, 0x0?})
.../github.com/go-logr/logr@v1.4.3/context_slog.go:30 +0x1c
[...]

Co-authored-by: Claude <noreply@anthropic.com>
reconcileFailedClusterState clears ProxmoxUnreachable conditions when
ProxmoxClient is non-nil. But reconcileNormalCredentialsSecret reused the
same ProxmoxUnreachable reason for credentials-secret-not-found errors,
causing the condition to flip-flop between False and True every cycle.
Under load, tests polling via the cached client could never observe a
stable False state.

Introduce CredentialsNotFound reason for credentials secret errors so
reconcileFailedClusterState does not clear them.

• [FAILED] [10.010 seconds]
External Credentials Tests [It] should remove ProxmoxCluster finalizer if the secret does not exist
.../internal/controller/proxmoxcluster_controller_test.go:326

  Timeline >>
  2026-04-10T10:30:52Z	INFO	Waiting for Cluster Controller to set OwnerRef on ProxmoxCluster	{"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "a1bc750a-bacd-4094-81ab-0df4764b6c01"}
  2026-04-10T10:30:52Z	INFO	Waiting for Cluster Controller to set OwnerRef on ProxmoxCluster	{"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "173d45ce-9a84-400d-9676-08a5299e5a3a"}
  2026-04-10T10:30:52Z	INFO	metadata.finalizers: "cluster.cluster.x-k8s.io": prefer a domain-qualified finalizer name including a path (/) to avoid accidental conflicts with other finalizer writers
  2026-04-10T10:30:52Z	INFO	Waiting for Cluster Controller to set OwnerRef on ProxmoxCluster	{"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "d9b7f78b-0832-4f6d-b7b6-5105294cadf6"}
  2026-04-10T10:30:52Z	INFO	Reconciling ProxmoxCluster	{"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "20cabe76-2855-4ed7-9033-e97250c67a75", "cluster": {"name":"test-wj8xn","namespace":"default"}}
  2026-04-10T10:30:52Z	INFO	metadata.finalizers: "proxmoxcluster.infrastructure.cluster.x-k8s.io": prefer a domain-qualified finalizer name including a path (/) to avoid accidental conflicts with other finalizer writers
  2026-04-10T10:30:53Z	ERROR	Reconciler error	{"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "20cabe76-2855-4ed7-9033-e97250c67a75", "error": "secrets \"2y6mqh\" not found"}
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
  	.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:353
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
  	.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:300
  sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1
  	.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:202
  2026-04-10T10:30:53Z	INFO	Reconciling ProxmoxCluster	{"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "a39b1275-3bd0-4fab-97f7-2bb7d85ff43c", "cluster": {"name":"test-wj8xn","namespace":"default"}}
  2026-04-10T10:30:53Z	ERROR	Reconciler error	{"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "a39b1275-3bd0-4fab-97f7-2bb7d85ff43c", "error": "reconciling cluster failure state", "errorVerbose": "reconciling cluster failure state\ngithub.com/ionos-cloud/cluster-api-provider-proxmox/internal/controller.(*ProxmoxClusterReconciler).reconcileFailedClusterState\n\t.../internal/controller/proxmoxcluster_controller.go:261\ngithub.com/ionos-cloud/cluster-api-provider-proxmox/internal/controller.(*ProxmoxClusterReconciler).reconcileNormal\n\t.../internal/controller/proxmoxcluster_controller.go:212\ngithub.com/ionos-cloud/cluster-api-provider-proxmox/internal/controller.(*ProxmoxClusterReconciler).Reconcile\n\t.../internal/controller/proxmoxcluster_controller.go:153\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:340\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:202\nruntime.goexit\n\t/lib/go/src/runtime/asm_amd64.s:1771"}
[...]

Co-authored-by: Claude <noreply@anthropic.com>
refreshCluster reads a just-created ProxmoxCluster via the manager's
cached client. Under load, the informer watch event for the creation may
not have been delivered yet, causing a spurious NotFound. Switch to
k8sClient which reads directly from the API server.

• [FAILED] [0.132 seconds]
External Credentials Tests Reconcile an ProxmoxCluster [It] create and destroy a cluster
.../internal/controller/proxmoxcluster_controller_test.go:251

  Timeline >>
  2026-04-12T07:43:20Z	INFO	Starting EventSource	{"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "source": "kind source: *v1beta2.Cluster"}
  2026-04-12T07:43:20Z	INFO	Starting EventSource	{"controller": "proxmoxmachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxMachine", "source": "kind source: *v1beta2.Machine"}
  2026-04-12T07:43:20Z	INFO	Starting EventSource	{"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "source": "kind source: *v1alpha2.ProxmoxCluster"}
  2026-04-12T07:43:20Z	INFO	Starting EventSource	{"controller": "proxmoxmachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxMachine", "source": "kind source: *v1alpha2.ProxmoxMachine"}
  2026-04-12T07:43:20Z	INFO	metadata.finalizers: "cluster.cluster.x-k8s.io": prefer a domain-qualified finalizer name including a path (/) to avoid accidental conflicts with other finalizer writers
  [FAILED] in [It] - .../internal/controller/proxmoxcluster_controller_test.go:553 @ 04/12/26 07:43:20.168
  << Timeline

  [FAILED] Expected success, but got an error:
      <*errors.StatusError | 0x2e4d9abee460>:
      ProxmoxCluster.infrastructure.cluster.x-k8s.io "proxmox-test-h7dwn" not found
      {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "ProxmoxCluster.infrastructure.cluster.x-k8s.io \"proxmox-test-h7dwn\" not found",
              Reason: "NotFound",
              Details: {
                  Name: "proxmox-test-h7dwn",
                  Group: "infrastructure.cluster.x-k8s.io",
                  Kind: "ProxmoxCluster",
                  UID: "",
                  Causes: nil,
                  RetryAfterSeconds: 0,
              },
              Code: 404,
          },
      }
  In [It] at: .../internal/controller/proxmoxcluster_controller_test.go:553 @ 04/12/26 07:43:20.168

Co-authored-by: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 8, 2026 07:53
@wikkyk wikkyk force-pushed the claude/fix-flaky-test-race-ewxDQ branch from 267db36 to dbb56b5 Compare May 8, 2026 07:53
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

@65278
Copy link
Copy Markdown
Collaborator

65278 commented May 8, 2026

Okay, I've performed the following testing:

I've deployed a cluster, stopped the controller, and edited its state to "False" and set the reason infrav1.ProxmoxClusterPromoxAvailableProxmoxUnreachableReason. After restart of the controller, the cluster reconciled.

In so far, I'll green light this. The rest is perfectly harmless.

Copy link
Copy Markdown
Collaborator

@65278 65278 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two of three fixes are harmless, and the other one I tested as good as I can.

@wikkyk wikkyk merged commit 61acf12 into main May 11, 2026
15 of 16 checks passed
@wikkyk wikkyk deleted the claude/fix-flaky-test-race-ewxDQ branch May 11, 2026 08:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants