🌱 test/e2e: cancel spec context on fast-fail to unblock waits#2116

Open
guettli wants to merge 3 commits into main from fix/bm-e2e-fail-fast-context-cancel

guettli commented on Jun 23, 2026

Summary

The CI failure on #2113 (see its CI run) was caused by a physical bare-metal server failing to boot into rescue mode, which is completely unrelated to that PR's changes.

However, the failure revealed that the existing fail-fast mechanism in logStatusContinuously isn't actually fast. When it detects a PermanentError on a HetznerBareMetalHost, it calls Fail() from the goroutine. That records the Ginkgo failure and exits the goroutine, but the main It goroutine keeps blocking on WaitForControlPlaneToBeReady until the full 1200-second timeout expires. Since the PermanentError appears after roughly 10 minutes, each occurrence wastes the remaining ~10 minutes of CI time.

Fix

  • Add a specCancelFn (protected by specCancelMu) in caph.go, updated per-spec via setSpecCancel/clearSpecCancel.
  • In CaphClusterDeploymentSpec's It block, create a cancellable child context (specCtx) and register its cancel. Pass specCtx to ApplyClusterTemplateAndWait.
  • In logStatusContinuously, call fn() (cancel the spec context) before Fail() so WaitForControlPlaneToBeReady aborts immediately.
  • AfterEach still uses the original ctx, so cleanup is unaffected.
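
The steps above can be sketched in isolation as follows. This is a minimal illustration, not the code from this PR: specCancelFn, specCancelMu, setSpecCancel, and clearSpecCancel are named as in the description, while cancelSpec, waitForReady, runSpec, and demo are hypothetical stand-ins (waitForReady plays the role of a long blocking wait such as WaitForControlPlaneToBeReady, and the timer in runSpec plays the role of logStatusContinuously detecting a PermanentError). The Ginkgo Fail() call is left out so the sketch runs with the standard library only.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	specCancelMu sync.Mutex
	specCancelFn context.CancelFunc // cancel function of the currently running spec, or nil
)

// setSpecCancel registers the cancel function of the running spec.
func setSpecCancel(fn context.CancelFunc) {
	specCancelMu.Lock()
	defer specCancelMu.Unlock()
	specCancelFn = fn
}

// clearSpecCancel unregisters it once the spec is done.
func clearSpecCancel() {
	specCancelMu.Lock()
	defer specCancelMu.Unlock()
	specCancelFn = nil
}

// cancelSpec (hypothetical helper) is what the fail-fast goroutine would call
// before Fail(). It copies the function under the lock and calls it outside.
func cancelSpec() bool {
	specCancelMu.Lock()
	fn := specCancelFn
	specCancelMu.Unlock()
	if fn == nil {
		return false
	}
	fn()
	return true
}

// waitForReady (hypothetical) blocks until the timeout elapses or ctx is cancelled.
func waitForReady(ctx context.Context, timeout time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(timeout):
		return errors.New("timed out waiting for readiness")
	}
}

// runSpec mimics an It block: it derives specCtx, registers its cancel
// function, and blocks on the wait while a watcher may cancel it early.
func runSpec(ctx context.Context, failAfter, timeout time.Duration) (time.Duration, error) {
	specCtx, cancel := context.WithCancel(ctx)
	defer cancel()
	setSpecCancel(cancel)
	defer clearSpecCancel()

	go func() {
		select {
		case <-time.After(failAfter): // simulated PermanentError detection
			cancelSpec()
		case <-specCtx.Done(): // spec finished first, nothing to do
		}
	}()

	start := time.Now()
	err := waitForReady(specCtx, timeout)
	return time.Since(start), err
}

// demo runs one spec whose watcher fires long before the wait would time out.
func demo() (time.Duration, error) {
	return runSpec(context.Background(), 50*time.Millisecond, 5*time.Second)
}

func main() {
	elapsed, err := demo()
	fmt.Printf("wait returned after %v: %v\n", elapsed.Round(10*time.Millisecond), err)
}
```

Because only specCtx is cancelled, the parent ctx passed to runSpec stays usable afterwards, which mirrors the point that AfterEach cleanup keeps working with the original context.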

Result

When hardware reboots fail (PermanentError after ~10 min), the test now aborts within ~30 seconds of detection instead of waiting the full 1200-second timeout.

Test plan

  • CI baremetal test passes on healthy hardware (context cancellation is a no-op in the happy path)
  • On hardware failure, test fails fast (~30s after PermanentError is set) instead of waiting 20 minutes

🤖 Generated with Claude Code

When logStatusContinuously detects a PermanentError or NoAvailableHost
condition, it previously called Fail() from the goroutine. That records the
Ginkgo failure and exits the goroutine, but the main It goroutine keeps
blocking on WaitForControlPlaneToBeReady until the full 1200-second timeout.

Fix: add a per-spec cancel function (specCancelFn, guarded by specCancelMu in
caph.go) that the fail-fast goroutine can call alongside Fail(). Each It block
creates a cancellable child context (specCtx), registers its cancel function
via setSpecCancel, and passes specCtx to ApplyClusterTemplateAndWait. When the
goroutine fires, specCtx is cancelled immediately and the wait aborts, saving
roughly 10 minutes of CI time when hardware reboots fail.

AfterEach still uses the original ctx so cleanup is unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Committing as: thomas.guettler@syself.com
github-actions bot added the size/S (denotes a PR that changes 20-50 lines, ignoring generated files) and area/test (changes made in the test directory) labels on Jun 23, 2026
guettli requested a review from Dhairya-Arora01 on June 23, 2026 08:55
guettli and others added 2 commits June 23, 2026 12:03
No need for an intermediate fn variable — specCancelFn is a
context.CancelFunc that does not touch specCancelMu, so calling it
directly under the lock is safe and simpler.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Committing as: thomas.guettler@syself.com
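
The reasoning in this commit message can be checked with a small standalone sketch. It assumes the cancel function comes from context.WithCancel. The variable names follow the PR, while cancelSpec and demo are hypothetical. A context.CancelFunc never acquires specCancelMu, so calling it while holding the lock cannot self-deadlock, and calling it a second time is a no-op.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

var (
	specCancelMu sync.Mutex
	specCancelFn context.CancelFunc
)

// cancelSpec (hypothetical helper) calls the registered cancel function
// directly while holding the lock, without an intermediate variable.
func cancelSpec() bool {
	specCancelMu.Lock()
	defer specCancelMu.Unlock()
	if specCancelFn == nil {
		return false
	}
	specCancelFn() // does not touch specCancelMu, so this cannot deadlock
	return true
}

// demo registers a cancel function and invokes cancelSpec twice.
func demo() (first, second bool, err error) {
	ctx, cancel := context.WithCancel(context.Background())

	specCancelMu.Lock()
	specCancelFn = cancel
	specCancelMu.Unlock()

	first = cancelSpec()
	second = cancelSpec() // repeated calls to a CancelFunc do nothing
	return first, second, ctx.Err()
}

func main() {
	first, second, err := demo()
	fmt.Println(first, second, err) // prints "true true context canceled"
}
```

The same pattern would not be safe for an arbitrary callback that might itself call setSpecCancel or clearSpecCancel, since sync.Mutex is not reentrant.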