test: make WaitUntilNodeReady robust to watch disconnections#7827
force-pushed from 0c0cda5 to 667faf9
force-pushed from 667faf9 to d62b821
force-pushed from d62b821 to cf7a371
```go
nodes, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
if err != nil {
	logf(ctx, "failed to list nodes: %v", err)
} else {
	foundNode = nil
	for _, node := range nodes.Items {
		if strings.HasPrefix(node.Name, vmssName) {
			foundNode = &node
			if isNodeReady(&node) {
				return node.Name, nil
			}
			break
		}
	}
}
```
If Nodes().List keeps failing (e.g., auth/RBAC issue), this loop will only log the error and then eventually return either "node with prefix ... not found" or "node ... not ready" on context timeout, which can hide the real root cause. Consider tracking the last List error and returning it (or returning immediately for non-retryable errors) when ctx.Done() fires so the final failure message reflects the API failure.
force-pushed from cf7a371 to cab645d
force-pushed from cab645d to 3b80171
```go
watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
if err != nil {
	t.Logf("failed to start node watch: %v, retrying in 5s", err)
	select {
	case <-ctx.Done():
	case <-time.After(5 * time.Second):
	}
	return ""
}
defer watcher.Stop()
```
```diff
-for _, cond := range node.Status.Conditions {
-	if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
-		t.Logf("node %s is ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
-		return node.Name
+for event := range watcher.ResultChan() {
```
WaitUntilNodeReady starts a watch and then only reacts to future events. If the node is already present (or already Ready) when the watch begins, there may be no subsequent events and this can block until the context times out. Consider doing an initial Nodes().List/Nodes().Get check (or switching fully to PollUntilContextTimeout + List) to evaluate current state before watching/polling again.
```diff
 func (k *Kubeclient) WaitUntilNodeReady(ctx context.Context, t testing.TB, vmssName string) string {
 	startTime := time.Now()
-	t.Logf("waiting for node %s to be ready in k8s API", vmssName)
+	t.Logf("waiting for node %s to be ready", vmssName)
 	defer func() {
-		t.Logf("waited for node %s to be ready in k8s API for %s", vmssName, time.Since(startTime))
+		t.Logf("waited for node %s to be ready for %s", vmssName, time.Since(startTime))
 	}()

-	var node *corev1.Node = nil
-	watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
-	require.NoError(t, err, "failed to start watching nodes")
-	defer watcher.Stop()
-
-	for event := range watcher.ResultChan() {
-		if event.Type != watch.Added && event.Type != watch.Modified {
-			continue
-		}
-
-		var nodeFromEvent *corev1.Node
-		switch v := event.Object.(type) {
-		case *corev1.Node:
-			nodeFromEvent = v
-		default:
-			t.Logf("skipping object type %T", event.Object)
-			continue
-		}
-
-		if !strings.HasPrefix(nodeFromEvent.Name, vmssName) {
-			continue
-		}
-
-		// found the right node. Use it!
-		node = nodeFromEvent
-		nodeTaints, _ := json.Marshal(node.Spec.Taints)
-		nodeConditions, _ := json.Marshal(node.Status.Conditions)
+	var lastNode *corev1.Node
+	for ctx.Err() == nil {
+		name := func() string {
+			watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
+			if err != nil {
```
The PR description says this was switched to ticker-based polling using List and that the function returns (string, error) instead of calling t.Fatalf, but the implementation still uses Watch and still calls t.Fatalf on failure/deleted events. Either update the implementation to match (List+polling, return error) or adjust the PR description to reflect the actual behavior.
```diff
 			}
-			t.Logf("node %s is not ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
+			t.Logf("node watch disconnected, restarting")
```
This logs "node watch disconnected, restarting" even when the context is done (e.g., timeout/cancel), because the log happens after the watch loop returns but before the next loop condition check. Consider guarding this log with a ctx.Err() == nil check to avoid misleading output on expected shutdown.
```suggestion
			if ctx.Err() == nil {
				t.Logf("node watch disconnected, restarting")
			}
```
force-pushed from 3b80171 to 1ee4c6b
Inline single-use helpers (failNodeNotReady, isNodeReady, logNodeDetail) into WaitUntilNodeReady. Add watch retry logic for disconnections and context cancellation. Add timing log for bastion creation.
force-pushed from 1ee4c6b to 8ab5fe0
```go
watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{})
if err != nil {
```
WaitUntilNodeReady now relies solely on a watch stream. If the node is already Ready before the watch starts (or if it becomes Ready without subsequent status updates), this can block until the context deadline even though the node is ready. Consider doing an initial Nodes().List to detect an already-ready node, then starting the watch from the returned resourceVersion to avoid missing events between list and watch.
```suggestion
	nodeList, err := k.Typed.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		t.Logf("failed to list nodes: %v, retrying in 5s", err)
		select {
		case <-ctx.Done():
		case <-time.After(5 * time.Second):
		}
		return ""
	}
	for i := range nodeList.Items {
		node := &nodeList.Items[i]
		if !strings.HasPrefix(node.Name, vmssName) {
			continue
		}
		lastNode = node
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				t.Logf("node %s is ready", node.Name)
				return node.Name
			}
		}
	}
	watcher, err := k.Typed.CoreV1().Nodes().Watch(ctx, metav1.ListOptions{ResourceVersion: nodeList.ResourceVersion})
	if err != nil {
```
```diff
 		for event := range watcher.ResultChan() {
 			if event.Type == watch.Error {
 				t.Logf("node watch error: %v", event.Object)
 				return ""
 			}
 			node, ok := event.Object.(*corev1.Node)
 			if !ok || !strings.HasPrefix(node.Name, vmssName) {
 				continue
 			}
 			if event.Type == watch.Deleted {
 				t.Fatalf("node %s was deleted", node.Name)
 			}
 			lastNode = node
 			for _, cond := range node.Status.Conditions {
 				if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
 					t.Logf("node %s is ready", node.Name)
 					return node.Name
 				}
 			}
 		}
-		t.Logf("node %s is not ready. Taints: %s Conditions: %s", node.Name, string(nodeTaints), string(nodeConditions))
-	}
-
-	return false, nil
-})
-
-if err != nil {
-	if lastNode == nil {
-		t.Fatalf("%q haven't appeared in k8s API server: %v", vmssName, err)
+		return ""
+	}()
```
When watcher.ResultChan() closes (e.g., apiserver closes the connection) or when a watch.Error event occurs, the code immediately re-establishes a new watch with no delay. This can spin in a tight loop and hammer the API server during outages/flaky connections. Add a small backoff (similar to the 5s delay used on watch start failures) before retrying after channel close/error events.
Summary

- `WaitUntilNodeReady` now uses a watch retry loop that reconnects on disconnection or error instead of silently exiting
- `watcher.Stop()` is always called via `defer`
- Uses `toolkit.LogStepf` for timing (consistent with the codebase pattern)
- Dropped the extra timing log in `prepareAKSNode`, since `WaitUntilNodeReady` already logs its own duration via `LogStepf`

Known limitation

The watch-based approach may miss nodes that are already Ready before the watch starts (no initial List check). In practice this is fine for these e2e tests, since the watch is started before VMSS provisioning completes, but a future improvement could add an initial `List` before entering the watch loop.

Test plan

- `go build ./...` passes