:seedling: test/e2e: cancel spec context on fast-fail to unblock waits by guettli · Pull Request #2116 · syself/cluster-api-provider-hetzner

guettli · 2026-06-23T08:48:12Z

Summary

The CI failure on #2113 (CI run) was caused by a physical bare metal server failing to boot into rescue mode, which is completely unrelated to that PR's changes.

However, the failure revealed that the existing fail-fast mechanism in logStatusContinuously isn't actually fast. When it detects a PermanentError on a HetznerBareMetalHost, it calls Fail() from the goroutine. That records the Ginkgo failure and exits the goroutine, but the main It goroutine keeps blocking on WaitForControlPlaneToBeReady until the full 1200-second (20-minute) timeout expires, wasting ~10 minutes of CI time per occurrence.
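A rough, stdlib-only sketch of why recording a failure in one goroutine does not wake a wait in another. It does not use Ginkgo: `failed`, `watch`, `blockingWait`, and `demo` are hypothetical stand-ins, and the durations are shrunk from minutes to milliseconds.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// failed is a hypothetical stand-in for the failure state that Fail() records.
var failed atomic.Bool

// watch imitates the status goroutine: it notices a problem, records the
// failure, and returns. Nothing here signals the goroutine that is waiting.
func watch(detectAfter time.Duration, done chan<- struct{}) {
	defer close(done)
	time.Sleep(detectAfter)
	failed.Store(true)
}

// blockingWait imitates a wait that only observes its own timeout.
func blockingWait(timeout time.Duration) time.Duration {
	start := time.Now()
	<-time.After(timeout)
	return time.Since(start)
}

// demo runs the watcher and the wait side by side and reports how long the
// wait took and whether a failure had been recorded in the meantime.
func demo() (waited time.Duration, failureRecorded bool) {
	done := make(chan struct{})
	go watch(20*time.Millisecond, done)
	waited = blockingWait(200 * time.Millisecond)
	<-done
	return waited, failed.Load()
}

func main() {
	waited, recorded := demo()
	fmt.Printf("failure recorded: %v, wait still took %v\n",
		recorded, waited.Round(10*time.Millisecond))
}
```

The failure is recorded early, yet the wait still runs for its whole timeout because nothing it listens to has changed.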

Fix

Add a specCancelFn (protected by specCancelMu) in caph.go, updated per-spec via setSpecCancel/clearSpecCancel.
In CaphClusterDeploymentSpec's It block, create a cancellable child context (specCtx) and register its cancel. Pass specCtx to ApplyClusterTemplateAndWait.
In logStatusContinuously, call fn() (cancel the spec context) before Fail() so WaitForControlPlaneToBeReady aborts immediately.
AfterEach still uses the original ctx, so cleanup is unaffected.
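The steps above can be sketched roughly as follows. This is a minimal illustration rather than the PR's actual code: `specCancelFn`, `specCancelMu`, `setSpecCancel`, and `clearSpecCancel` are the names from the description, while `cancelSpec`, `waitForReady`, and `demo` are hypothetical helpers, and the timings are shrunk so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Names taken from the PR description; the bodies are assumptions.
var (
	specCancelMu sync.Mutex
	specCancelFn context.CancelFunc
)

// setSpecCancel registers the cancel function of the currently running spec.
func setSpecCancel(cancel context.CancelFunc) {
	specCancelMu.Lock()
	defer specCancelMu.Unlock()
	specCancelFn = cancel
}

// clearSpecCancel drops the registration once the spec has finished.
func clearSpecCancel() {
	specCancelMu.Lock()
	defer specCancelMu.Unlock()
	specCancelFn = nil
}

// cancelSpec is a hypothetical helper for the watcher goroutine. It cancels
// the registered spec context, if any, and reports whether one was registered.
func cancelSpec() bool {
	specCancelMu.Lock()
	defer specCancelMu.Unlock()
	if specCancelFn == nil {
		return false
	}
	specCancelFn()
	return true
}

// waitForReady stands in for a long wait such as WaitForControlPlaneToBeReady.
// It returns when ctx is cancelled or when the timeout expires.
func waitForReady(ctx context.Context, timeout time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(timeout):
		return errors.New("timed out waiting for control plane")
	}
}

// demo imitates the It block: derive specCtx, register its cancel, then wait.
func demo() (time.Duration, error) {
	ctx := context.Background() // plays the role of the original ctx
	specCtx, cancel := context.WithCancel(ctx)
	defer cancel()
	setSpecCancel(cancel)
	defer clearSpecCancel()

	// Stand-in for logStatusContinuously detecting a PermanentError.
	go func() {
		time.Sleep(50 * time.Millisecond)
		cancelSpec() // in the real test, Fail() would follow
	}()

	start := time.Now()
	err := waitForReady(specCtx, 5*time.Second)
	return time.Since(start), err
}

func main() {
	elapsed, err := demo()
	fmt.Printf("wait returned after %v: %v\n", elapsed.Round(10*time.Millisecond), err)
}
```

Because only the child context is cancelled, anything that still holds the parent `ctx` (cleanup, in the PR's case) keeps a live context.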

Result

When hardware reboots fail (PermanentError after ~10 min), the test now aborts within ~30 seconds of detection instead of waiting the full 1200-second timeout.

Test plan

CI baremetal test passes on healthy hardware (context cancellation is a no-op in the happy path)
On hardware failure, test fails fast (~30s after PermanentError is set) instead of waiting 20 minutes

🤖 Generated with Claude Code

When logStatusContinuously detects a PermanentError or NoAvailableHost condition, it previously called Fail() from the goroutine. That records the Ginkgo failure and exits the goroutine, but the main It goroutine keeps blocking on WaitForControlPlaneToBeReady until the full 1200-second timeout.

Fix: add a per-spec cancel function (specCancelFn, guarded by specCancelMu in caph.go) that the fail-fast goroutine can call alongside Fail(). Each It block creates a cancellable child context (specCtx), registers its cancel function via setSpecCancel, and passes specCtx to ApplyClusterTemplateAndWait. When the goroutine fires, specCtx is cancelled immediately and the wait aborts, saving up to 10+ minutes of CI time when hardware reboots fail. AfterEach still uses the original ctx so cleanup is unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Committing as: thomas.guettler@syself.com

No need for an intermediate fn variable: specCancelFn is a context.CancelFunc that does not touch specCancelMu, so calling it directly under the lock is safe and simpler.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Committing as: thomas.guettler@syself.com

github-actions bot added the size/S (denotes a PR that changes 20-50 lines, ignoring generated files) and area/test (changes made in the test directory) labels Jun 23, 2026

guettli requested a review from Dhairya-Arora01 June 23, 2026 08:55

guettli and others added 2 commits June 23, 2026 12:03

Merge branch 'main' into fix/bm-e2e-fail-fast-context-cancel

3a58fbb

guettli wants to merge 3 commits into main from fix/bm-e2e-fail-fast-context-cancel
