test: make WaitUntilNodeReady robust to watch disconnections #7827

Closed
r2k1 wants to merge 1 commit into main from r2k1/fix-wait-until-node-ready

Conversation

r2k1 (Contributor) commented Feb 7, 2026

Summary

  • Replace single-shot Kubernetes Watch in WaitUntilNodeReady with a watch retry loop that reconnects on disconnection or error instead of silently exiting
  • Use immediately-invoked func literal so watcher.Stop() is always called via defer
  • Fail fast on node deletion events
  • Adopt toolkit.LogStepf for timing (consistent with codebase pattern)
  • Better failure messages: dump full node JSON on timeout
  • Simplify timing in prepareAKSNode: WaitUntilNodeReady already logs its own duration via LogStepf

Known limitation

The watch-based approach may miss nodes that are already Ready before the watch starts (there is no initial List check). In practice this is fine for e2e tests, since the watch is started before VMSS provisioning completes, but a future improvement could add an initial List before entering the watch loop.

Test plan

  • go build ./... passes
  • e2e tests pass with node provisioning

r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from 0c0cda5 to 667faf9 on February 7, 2026 09:39
Copilot AI review requested due to automatic review settings on February 7, 2026 09:45
r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from 667faf9 to d62b821 on February 7, 2026 09:45
r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from d62b821 to cf7a371 on February 7, 2026 09:45
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread on e2e/kube.go (outdated), lines 150 to 165:
nodes, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
if err != nil {
	logf(ctx, "failed to list nodes: %v", err)
} else {
	foundNode = nil
	for _, node := range nodes.Items {
		if strings.HasPrefix(node.Name, vmssName) {
			foundNode = &node
			if isNodeReady(&node) {
				return node.Name, nil
			}
			break
		}
	}
}

Copilot AI commented Feb 7, 2026

If Nodes().List keeps failing (e.g., auth/RBAC issue), this loop will only log the error and then eventually return either "node with prefix ... not found" or "node ... not ready" on context timeout, which can hide the real root cause. Consider tracking the last List error and returning it (or returning immediately for non-retryable errors) when ctx.Done() fires so the final failure message reflects the API failure.

Comment thread on e2e/kube.go (outdated)
r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from cf7a371 to cab645d on February 7, 2026 09:57
r2k1 changed the title from "fix(e2e): make WaitUntilNodeReady robust to watch disconnections" to "test: make WaitUntilNodeReady robust to watch disconnections" on Feb 7, 2026
Copilot AI review requested due to automatic review settings on February 7, 2026 22:56
r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from cab645d to 3b80171 on February 7, 2026 22:56
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread on e2e/kube.go, lines +155 to +166:
+	watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
+	if err != nil {
+		t.Logf("failed to start node watch: %v, retrying in 5s", err)
+		select {
+		case <-ctx.Done():
+		case <-time.After(5 * time.Second):
+		}
+		return ""
+	}
+	defer watcher.Stop()
-	for _, cond := range node.Status.Conditions {
-		if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
-			t.Logf("node %s is ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
-			return node.Name
+	for event := range watcher.ResultChan() {
Copilot AI commented Feb 7, 2026

WaitUntilNodeReady starts a watch and then only reacts to future events. If the node is already present (or already Ready) when the watch begins, there may be no subsequent events and this can block until the context times out. Consider doing an initial Nodes().List/Nodes().Get check (or switching fully to PollUntilContextTimeout + List) to evaluate current state before watching/polling again.

Comment thread on e2e/kube.go, lines 145 to +156:
 func (k *Kubeclient) WaitUntilNodeReady(ctx context.Context, t testing.TB, vmssName string) string {
 	startTime := time.Now()
-	t.Logf("waiting for node %s to be ready in k8s API", vmssName)
+	t.Logf("waiting for node %s to be ready", vmssName)
 	defer func() {
-		t.Logf("waited for node %s to be ready in k8s API for %s", vmssName, time.Since(startTime))
+		t.Logf("waited for node %s to be ready for %s", vmssName, time.Since(startTime))
 	}()

-	var node *corev1.Node = nil
-	watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
-	require.NoError(t, err, "failed to start watching nodes")
-	defer watcher.Stop()
-
-	for event := range watcher.ResultChan() {
-		if event.Type != watch.Added && event.Type != watch.Modified {
-			continue
-		}
-
-		var nodeFromEvent *corev1.Node
-		switch v := event.Object.(type) {
-		case *corev1.Node:
-			nodeFromEvent = v
-		default:
-			t.Logf("skipping object type %T", event.Object)
-			continue
-		}
-
-		if !strings.HasPrefix(nodeFromEvent.Name, vmssName) {
-			continue
-		}
-
-		// found the right node. Use it!
-		node = nodeFromEvent
-		nodeTaints, _ := json.Marshal(node.Spec.Taints)
-		nodeConditions, _ := json.Marshal(node.Status.Conditions)
+	var lastNode *corev1.Node
+	for ctx.Err() == nil {
+		name := func() string {
+			watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
+			if err != nil {
Copilot AI commented Feb 7, 2026

The PR description says this was switched to ticker-based polling using List and that the function returns (string, error) instead of calling t.Fatalf, but the implementation still uses Watch and still calls t.Fatalf on failure/deleted events. Either update the implementation to match (List+polling, return error) or adjust the PR description to reflect the actual behavior.

Comment thread on e2e/kube.go (outdated):
 		}

-		t.Logf("node %s is not ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
+		t.Logf("node watch disconnected, restarting")
Copilot AI commented Feb 7, 2026

This logs "node watch disconnected, restarting" even when the context is done (e.g., timeout/cancel), because the log happens after the watch loop returns but before the next loop condition check. Consider guarding this log with a ctx.Err() == nil check to avoid misleading output on expected shutdown.

Suggested change:
-		t.Logf("node watch disconnected, restarting")
+		if ctx.Err() == nil {
+			t.Logf("node watch disconnected, restarting")
+		}

r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from 3b80171 to 1ee4c6b on February 7, 2026 23:39
Inline single-use helpers (failNodeNotReady, isNodeReady, logNodeDetail)
into WaitUntilNodeReady. Add watch retry logic for disconnections and
context cancellation. Add timing log for bastion creation.
Copilot AI review requested due to automatic review settings on April 23, 2026 23:31
r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from 1ee4c6b to 8ab5fe0 on April 23, 2026 23:31
r2k1 closed this on Apr 23, 2026
r2k1 deleted the r2k1/fix-wait-until-node-ready branch on April 23, 2026 23:34
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread on e2e/kube.go, lines +156 to +157:
	watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
	if err != nil {
Copilot AI commented Apr 23, 2026

WaitUntilNodeReady now relies solely on a watch stream. If the node is already Ready before the watch starts (or if it becomes Ready without subsequent status updates), this can block until the context deadline even though the node is ready. Consider doing an initial Nodes().List to detect an already-ready node, then starting the watch from the returned resourceVersion to avoid missing events between list and watch.

Suggested change:
-	watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
-	if err != nil {
+	nodeList, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
+	if err != nil {
+		t.Logf("failed to list nodes: %v, retrying in 5s", err)
+		select {
+		case <-ctx.Done():
+		case <-time.After(5 * time.Second):
+		}
+		return ""
+	}
+	for i := range nodeList.Items {
+		node := &nodeList.Items[i]
+		if !strings.HasPrefix(node.Name, vmssName) {
+			continue
+		}
+		lastNode = node
+		for _, cond := range node.Status.Conditions {
+			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
+				t.Logf("node %s is ready", node.Name)
+				return node.Name
+			}
+		}
+	}
+	watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{ResourceVersion: nodeList.ResourceVersion})
+	if err != nil {

Comment thread on e2e/kube.go, lines +167 to +188:
+		for event := range watcher.ResultChan() {
+			if event.Type == watch.Error {
+				t.Logf("node watch error: %v", event.Object)
+				return ""
+			}
+			node, ok := event.Object.(*corev1.Node)
+			if !ok || !strings.HasPrefix(node.Name, vmssName) {
+				continue
+			}
+			if event.Type == watch.Deleted {
+				t.Fatalf("node %s was deleted", node.Name)
+			}
+			lastNode = node
+			for _, cond := range node.Status.Conditions {
+				if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
+					t.Logf("node %s is ready", node.Name)
+					return node.Name
+				}
+			}
+		}
-
-		t.Logf("node %s is not ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
-	}
-
-	return false, nil
-	})
-
-	if err != nil {
-		if lastNode == nil {
-			t.Fatalf("%q haven't appeared in k8s API server: %v", vmssName, err)
+		return ""
+	}()
Copilot AI commented Apr 23, 2026

When watcher.ResultChan() closes (e.g., apiserver closes the connection) or when a watch.Error event occurs, the code immediately re-establishes a new watch with no delay. This can spin in a tight loop and hammer the API server during outages/flaky connections. Add a small backoff (similar to the 5s delay used on watch start failures) before retrying after channel close/error events.
