test: make WaitUntilNodeReady robust to watch disconnections #7827

Closed
r2k1 wants to merge 1 commit into main from r2k1/fix-wait-until-node-ready

Conversation

r2k1 (Contributor) commented Feb 7, 2026

Summary

  • Replace single-shot Kubernetes Watch in WaitUntilNodeReady with a watch retry loop that reconnects on disconnection or error instead of silently exiting
  • Use immediately-invoked func literal so watcher.Stop() is always called via defer
  • Fail fast on node deletion events
  • Adopt toolkit.LogStepf for timing (consistent with codebase pattern)
  • Better failure messages: dump full node JSON on timeout
  • Simplify timing in prepareAKSNode: WaitUntilNodeReady already logs its own duration via LogStepf

Known limitation

The watch-based approach may miss nodes that are already Ready before the watch starts (there is no initial List check). In practice this is fine for e2e tests, since the watch is started before VMSS provisioning completes, but a future improvement could add an initial List before entering the watch loop.

Test plan

  • go build ./... passes
  • e2e tests pass with node provisioning

r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from 0c0cda5 to 667faf9 on February 7, 2026 09:39
Copilot AI review requested due to automatic review settings on February 7, 2026 09:45
r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from 667faf9 to d62b821 on February 7, 2026 09:45
r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from d62b821 to cf7a371 on February 7, 2026 09:45
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread on e2e/kube.go (outdated), lines 150 to 165:
nodes, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
if err != nil {
	logf(ctx, "failed to list nodes: %v", err)
} else {
	foundNode = nil
	for _, node := range nodes.Items {
		if strings.HasPrefix(node.Name, vmssName) {
			foundNode = &node
			if isNodeReady(&node) {
				return node.Name, nil
			}
			break
		}
	}
}

Copilot AI commented Feb 7, 2026

If Nodes().List keeps failing (e.g., auth/RBAC issue), this loop will only log the error and then eventually return either "node with prefix ... not found" or "node ... not ready" on context timeout, which can hide the real root cause. Consider tracking the last List error and returning it (or returning immediately for non-retryable errors) when ctx.Done() fires so the final failure message reflects the API failure.

Comment thread on e2e/kube.go (outdated)
r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from cf7a371 to cab645d on February 7, 2026 09:57
r2k1 changed the title from "fix(e2e): make WaitUntilNodeReady robust to watch disconnections" to "test: make WaitUntilNodeReady robust to watch disconnections" on Feb 7, 2026
Copilot AI review requested due to automatic review settings on February 7, 2026 22:56
r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from cab645d to 3b80171 on February 7, 2026 22:56
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread on e2e/kube.go, lines +155 to +166:
+	watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
+	if err != nil {
+		t.Logf("failed to start node watch: %v, retrying in 5s", err)
+		select {
+		case <-ctx.Done():
+		case <-time.After(5 * time.Second):
+		}
+		return ""
+	}
+	defer watcher.Stop()
-	for _, cond := range node.Status.Conditions {
-		if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
-			t.Logf("node %s is ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
-			return node.Name
+	for event := range watcher.ResultChan() {
Copilot AI commented Feb 7, 2026

WaitUntilNodeReady starts a watch and then only reacts to future events. If the node is already present (or already Ready) when the watch begins, there may be no subsequent events and this can block until the context times out. Consider doing an initial Nodes().List/Nodes().Get check (or switching fully to PollUntilContextTimeout + List) to evaluate current state before watching/polling again.

Comment thread on e2e/kube.go, lines 145 to +156:
 func (k *Kubeclient) WaitUntilNodeReady(ctx context.Context, t testing.TB, vmssName string) string {
 	startTime := time.Now()
-	t.Logf("waiting for node %s to be ready in k8s API", vmssName)
+	t.Logf("waiting for node %s to be ready", vmssName)
 	defer func() {
-		t.Logf("waited for node %s to be ready in k8s API for %s", vmssName, time.Since(startTime))
+		t.Logf("waited for node %s to be ready for %s", vmssName, time.Since(startTime))
 	}()

-	var node *corev1.Node = nil
-	watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
-	require.NoError(t, err, "failed to start watching nodes")
-	defer watcher.Stop()
-
-	for event := range watcher.ResultChan() {
-		if event.Type != watch.Added && event.Type != watch.Modified {
-			continue
-		}
-
-		var nodeFromEvent *corev1.Node
-		switch v := event.Object.(type) {
-		case *corev1.Node:
-			nodeFromEvent = v
-		default:
-			t.Logf("skipping object type %T", event.Object)
-			continue
-		}
-
-		if !strings.HasPrefix(nodeFromEvent.Name, vmssName) {
-			continue
-		}
-
-		// found the right node. Use it!
-		node = nodeFromEvent
-		nodeTaints, _ := json.Marshal(node.Spec.Taints)
-		nodeConditions, _ := json.Marshal(node.Status.Conditions)
+	var lastNode *corev1.Node
+	for ctx.Err() == nil {
+		name := func() string {
+			watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
+			if err != nil {
Copilot AI commented Feb 7, 2026

The PR description says this was switched to ticker-based polling using List and that the function returns (string, error) instead of calling t.Fatalf, but the implementation still uses Watch and still calls t.Fatalf on failure/deleted events. Either update the implementation to match (List+polling, return error) or adjust the PR description to reflect the actual behavior.

Comment thread on e2e/kube.go (outdated):
 		}

-		t.Logf("node %s is not ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
+		t.Logf("node watch disconnected, restarting")
Copilot AI commented Feb 7, 2026

This logs "node watch disconnected, restarting" even when the context is done (e.g., timeout/cancel), because the log happens after the watch loop returns but before the next loop condition check. Consider guarding this log with a ctx.Err() == nil check to avoid misleading output on expected shutdown.

Suggested change:
-		t.Logf("node watch disconnected, restarting")
+		if ctx.Err() == nil {
+			t.Logf("node watch disconnected, restarting")
+		}

r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from 3b80171 to 1ee4c6b on February 7, 2026 23:39
Inline single-use helpers (failNodeNotReady, isNodeReady, logNodeDetail)
into WaitUntilNodeReady. Add watch retry logic for disconnections and
context cancellation. Add timing log for bastion creation.
Copilot AI review requested due to automatic review settings on April 23, 2026 23:31
r2k1 force-pushed the r2k1/fix-wait-until-node-ready branch from 1ee4c6b to 8ab5fe0 on April 23, 2026 23:31
r2k1 closed this on Apr 23, 2026
r2k1 deleted the r2k1/fix-wait-until-node-ready branch on April 23, 2026 23:34
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread on e2e/kube.go, lines +156 to +157:
	watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
	if err != nil {
Copilot AI commented Apr 23, 2026

WaitUntilNodeReady now relies solely on a watch stream. If the node is already Ready before the watch starts (or if it becomes Ready without subsequent status updates), this can block until the context deadline even though the node is ready. Consider doing an initial Nodes().List to detect an already-ready node, then starting the watch from the returned resourceVersion to avoid missing events between list and watch.

Suggested change:
-	watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
-	if err != nil {
+	nodeList, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
+	if err != nil {
+		t.Logf("failed to list nodes: %v, retrying in 5s", err)
+		select {
+		case <-ctx.Done():
+		case <-time.After(5 * time.Second):
+		}
+		return ""
+	}
+	for i := range nodeList.Items {
+		node := &nodeList.Items[i]
+		if !strings.HasPrefix(node.Name, vmssName) {
+			continue
+		}
+		lastNode = node
+		for _, cond := range node.Status.Conditions {
+			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
+				t.Logf("node %s is ready", node.Name)
+				return node.Name
+			}
+		}
+	}
+	watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{ResourceVersion: nodeList.ResourceVersion})
+	if err != nil {

Comment thread on e2e/kube.go, lines +167 to +188:
+		for event := range watcher.ResultChan() {
+			if event.Type == watch.Error {
+				t.Logf("node watch error: %v", event.Object)
+				return ""
+			}
+			node, ok := event.Object.(*corev1.Node)
+			if !ok || !strings.HasPrefix(node.Name, vmssName) {
+				continue
+			}
+			if event.Type == watch.Deleted {
+				t.Fatalf("node %s was deleted", node.Name)
+			}
+			lastNode = node
+			for _, cond := range node.Status.Conditions {
+				if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
+					t.Logf("node %s is ready", node.Name)
+					return node.Name
+				}
+			}
+		}
-
-		t.Logf("node %s is not ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
-	}
-
-	return false, nil
-	})
-
-	if err != nil {
-		if lastNode == nil {
-			t.Fatalf("%q haven't appeared in k8s API server: %v", vmssName, err)
+		return ""
+	}()
Copilot AI commented Apr 23, 2026

When watcher.ResultChan() closes (e.g., apiserver closes the connection) or when a watch.Error event occurs, the code immediately re-establishes a new watch with no delay. This can spin in a tight loop and hammer the API server during outages/flaky connections. Add a small backoff (similar to the 5s delay used on watch start failures) before retrying after channel close/error events.
