Fix 3 race conditions#711
Conversation
850a595 to
00a1b4b
Compare
00a1b4b to
0f1afdc
Compare
0f1afdc to
969fd4d
Compare
969fd4d to
831fc88
Compare
There was a problem hiding this comment.
Pull request overview
Addresses intermittent test panics (Issue #339) by removing race-prone test setup patterns around envtest manager context initialization, and refines controller condition reporting for missing Proxmox credentials.
Changes:
- Update envtest helper to require a parent context at construction time and use an internally-managed derived context for manager startup.
- Adjust webhook/controller test suites to match the new envtest helper API and manager start behavior.
- Introduce a new ProxmoxCluster condition reason for “credentials secret not found” and use it during reconciliation.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
test/helpers/envtest.go |
Makes envtest lifecycle context owned by TestEnvironment (constructor-provided), changes StartManager to use stored context. |
internal/webhook/webhook_suite_test.go |
Updates suite setup to pass managerCtx into NewTestEnvironment and calls StartManager() without args. |
internal/controller/suite_test.go |
Updates suite setup to pass managerCtx into NewTestEnvironment and calls StartManager() without args. |
internal/controller/proxmoxcluster_controller_test.go |
Uses k8sClient directly when refreshing cluster objects to avoid relying on the env manager client. |
internal/controller/proxmoxcluster_controller.go |
Sets Proxmox availability condition reason to the new “CredentialsNotFound” when credential secret reconciliation errors. |
api/v1alpha2/conditions_consts.go |
Adds ProxmoxClusterProxmoxAvailableCredentialsNotFoundReason constant. |
| @@ -148,17 +148,17 @@ func NewTestEnvironment(setupWebhook bool, pmClient proxmox.Client) *TestEnviron | |||
| Port: whio.LocalServingPort, | |||
| }, | |||
| } | |||
There was a problem hiding this comment.
context.WithCancel(ctx) will panic if a nil context is passed into NewTestEnvironment. Even though current call sites pass managerCtx, this helper is easy to reuse elsewhere; consider guarding against nil (e.g., default to context.Background() or fail fast with a clear panic/log message) before calling context.WithCancel.
| } | |
| } | |
| if ctx == nil { | |
| ctx = context.Background() | |
| } |
There was a problem hiding this comment.
- this is actually handled elsewhere
- it doesn't matter, it's just a test
- we want to panic here because it means setup is broken
|
because I made many changes I'll loop it for a few more days |
f7dac33 to
e04f567
Compare
| Reason: infrav1.ProxmoxClusterProxmoxAvailableProxmoxUnreachableReason, | ||
| Reason: infrav1.ProxmoxClusterProxmoxAvailableCredentialsNotFoundReason, | ||
| Message: "credentials secret not found", | ||
| }) |
There was a problem hiding this comment.
setupProxmoxClient sets a failure condition when the credentials Secret is missing, but when this error bubbles up during NewClusterScope creation the scope isn't closed/patch-applied, so this condition update likely never gets persisted to the API server. Consider moving the condition-setting to a place that is guaranteed to patch (e.g., controller reconcile), or ensure NewClusterScope calls Close()/PatchObject() before returning the error from setupProxmoxClient so users can reliably observe the CredentialsNotFound condition.
| }) | |
| }) | |
| if patchErr := s.patchHelper.Patch(ctx, s.ProxmoxCluster); patchErr != nil { | |
| return nil, errors.Wrapf(patchErr, "failed to patch ProxmoxCluster after setting credentials missing condition") | |
| } |
There was a problem hiding this comment.
this is mostly unrelated to this PR. I opened a bug report: #726
a27f418 to
c1fd49c
Compare
c1fd49c to
267db36
Compare
|
|
Before githerp eats the raw files, here's a relevant output |
TestEnvironment.ctx was only set inside StartManager(), which runs in a
goroutine. When the goroutine hadn't executed before the first test spec
called GetContext(), the nil context reached logr.FromContext() causing a
panic. Move context initialization into NewTestEnvironment() so it is
available immediately at construction time.
• [PANICKED] [0.010 seconds]
External Credentials Tests Reconcile an ProxmoxCluster [It] create and destroy a cluster
.../internal/controller/proxmoxcluster_controller_test.go:251
[PANICKED] Test Panicked
In [It] at: .../panic.go:262 @ 04/08/26 21:04:45.553
runtime error: invalid memory address or nil pointer dereference
Full Stack Trace
github.com/go-logr/logr.FromContext({0x0?, 0x0?})
.../github.com/go-logr/logr@v1.4.3/context_slog.go:30 +0x1c
[...]
Co-authored-by: Claude <noreply@anthropic.com>
reconcileFailedClusterState clears ProxmoxUnreachable conditions when
ProxmoxClient is non-nil. But reconcileNormalCredentialsSecret reused the
same ProxmoxUnreachable reason for credentials-secret-not-found errors,
causing the condition to flip-flop between False and True every cycle.
Under load, tests polling via the cached client could never observe a
stable False state.
Introduce CredentialsNotFound reason for credentials secret errors so
reconcileFailedClusterState does not clear them.
• [FAILED] [10.010 seconds]
External Credentials Tests [It] should remove ProxmoxCluster finalizer if the secret does not exist
.../internal/controller/proxmoxcluster_controller_test.go:326
Timeline >>
2026-04-10T10:30:52Z INFO Waiting for Cluster Controller to set OwnerRef on ProxmoxCluster {"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "a1bc750a-bacd-4094-81ab-0df4764b6c01"}
2026-04-10T10:30:52Z INFO Waiting for Cluster Controller to set OwnerRef on ProxmoxCluster {"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "173d45ce-9a84-400d-9676-08a5299e5a3a"}
2026-04-10T10:30:52Z INFO metadata.finalizers: "cluster.cluster.x-k8s.io": prefer a domain-qualified finalizer name including a path (/) to avoid accidental conflicts with other finalizer writers
2026-04-10T10:30:52Z INFO Waiting for Cluster Controller to set OwnerRef on ProxmoxCluster {"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "d9b7f78b-0832-4f6d-b7b6-5105294cadf6"}
2026-04-10T10:30:52Z INFO Reconciling ProxmoxCluster {"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "20cabe76-2855-4ed7-9033-e97250c67a75", "cluster": {"name":"test-wj8xn","namespace":"default"}}
2026-04-10T10:30:52Z INFO metadata.finalizers: "proxmoxcluster.infrastructure.cluster.x-k8s.io": prefer a domain-qualified finalizer name including a path (/) to avoid accidental conflicts with other finalizer writers
2026-04-10T10:30:53Z ERROR Reconciler error {"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "20cabe76-2855-4ed7-9033-e97250c67a75", "error": "secrets \"2y6mqh\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:353
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:300
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1
.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:202
2026-04-10T10:30:53Z INFO Reconciling ProxmoxCluster {"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "a39b1275-3bd0-4fab-97f7-2bb7d85ff43c", "cluster": {"name":"test-wj8xn","namespace":"default"}}
2026-04-10T10:30:53Z ERROR Reconciler error {"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "ProxmoxCluster": {"name":"proxmox-test-bzck7","namespace":"default"}, "namespace": "default", "name": "proxmox-test-bzck7", "reconcileID": "a39b1275-3bd0-4fab-97f7-2bb7d85ff43c", "error": "reconciling cluster failure state", "errorVerbose": "reconciling cluster failure state\ngithub.com/ionos-cloud/cluster-api-provider-proxmox/internal/controller.(*ProxmoxClusterReconciler).reconcileFailedClusterState\n\t.../internal/controller/proxmoxcluster_controller.go:261\ngithub.com/ionos-cloud/cluster-api-provider-proxmox/internal/controller.(*ProxmoxClusterReconciler).reconcileNormal\n\t.../internal/controller/proxmoxcluster_controller.go:212\ngithub.com/ionos-cloud/cluster-api-provider-proxmox/internal/controller.(*ProxmoxClusterReconciler).Reconcile\n\t.../internal/controller/proxmoxcluster_controller.go:153\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:340\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t.../mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:202\nruntime.goexit\n\t/lib/go/src/runtime/asm_amd64.s:1771"}
[...]
Co-authored-by: Claude <noreply@anthropic.com>
refreshCluster reads a just-created ProxmoxCluster via the manager's
cached client. Under load, the informer watch event for the creation may
not have been delivered yet, causing a spurious NotFound. Switch to
k8sClient which reads directly from the API server.
• [FAILED] [0.132 seconds]
External Credentials Tests Reconcile an ProxmoxCluster [It] create and destroy a cluster
.../internal/controller/proxmoxcluster_controller_test.go:251
Timeline >>
2026-04-12T07:43:20Z INFO Starting EventSource {"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "source": "kind source: *v1beta2.Cluster"}
2026-04-12T07:43:20Z INFO Starting EventSource {"controller": "proxmoxmachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxMachine", "source": "kind source: *v1beta2.Machine"}
2026-04-12T07:43:20Z INFO Starting EventSource {"controller": "proxmoxcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxCluster", "source": "kind source: *v1alpha2.ProxmoxCluster"}
2026-04-12T07:43:20Z INFO Starting EventSource {"controller": "proxmoxmachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "ProxmoxMachine", "source": "kind source: *v1alpha2.ProxmoxMachine"}
2026-04-12T07:43:20Z INFO metadata.finalizers: "cluster.cluster.x-k8s.io": prefer a domain-qualified finalizer name including a path (/) to avoid accidental conflicts with other finalizer writers
[FAILED] in [It] - .../internal/controller/proxmoxcluster_controller_test.go:553 @ 04/12/26 07:43:20.168
<< Timeline
[FAILED] Expected success, but got an error:
<*errors.StatusError | 0x2e4d9abee460>:
ProxmoxCluster.infrastructure.cluster.x-k8s.io "proxmox-test-h7dwn" not found
{
ErrStatus: {
TypeMeta: {Kind: "", APIVersion: ""},
ListMeta: {
SelfLink: "",
ResourceVersion: "",
Continue: "",
RemainingItemCount: nil,
},
Status: "Failure",
Message: "ProxmoxCluster.infrastructure.cluster.x-k8s.io \"proxmox-test-h7dwn\" not found",
Reason: "NotFound",
Details: {
Name: "proxmox-test-h7dwn",
Group: "infrastructure.cluster.x-k8s.io",
Kind: "ProxmoxCluster",
UID: "",
Causes: nil,
RetryAfterSeconds: 0,
},
Code: 404,
},
}
In [It] at: .../internal/controller/proxmoxcluster_controller_test.go:553 @ 04/12/26 07:43:20.168
Co-authored-by: Claude <noreply@anthropic.com>
267db36 to
dbb56b5
Compare
|
Okay, I've performed the following testing: I've deployed a cluster, stopped the controller, and edited its state to "False" and set the reason In so far, I'll green light this. The rest is perfectly harmless. |
65278
left a comment
There was a problem hiding this comment.
Two of three fixes are harmless, and the other one I tested as good as I can.



Issue #, if available:
#339
Description of changes:
See commit messages for details.
Each commit has an independent fix for a different race.
The missing credentials condition commit changes runtime, the other two fix test setup.
Testing performed:
looped
make testfor 2 weeks under varying load.