fix(ci): bootstrap virtualization print nodes for debug #2317

Merged
universal-itengineer merged 1 commit into main from fix/ci/nested-e2e-bootstrap
May 5, 2026
@universal-itengineer universal-itengineer commented May 5, 2026

Description

Make the nested E2E bootstrap diagnostics more tolerant of temporary Kubernetes API failures.

This change keeps debug output commands from failing the workflow when the nested cluster API temporarily returns errors, and updates the virt-handler readiness loop to recalculate worker nodes on every attempt. If worker nodes cannot be listed, the loop now keeps waiting instead of treating 0/0 as a successful virt-handler readiness state.

Why do we need it, and what problem does it solve?

Nested E2E clusters can briefly lose API availability while the control plane is recovering, for example when etcd or kube-apiserver is under load or restarting. In that window, diagnostic kubectl get node calls may return errors such as etcdserver: request timed out or 502 Bad Gateway and fail the CI job even though the cluster can recover shortly afterward.

Example error

E0505 03:17:44.603556    2966 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: an error on the server (\"<html>\\r\\n<head><title>502 Bad Gateway</title></head>\\r\\n<body>\\r\\n<center><h1>502 Bad Gateway</h1></center>\\r\\n<hr><center>nginx</center>\\r\\n</body>\\r\\n</html>\") has prevented the request from succeeding"
  .... has prevented the request from succeeding (get nodes)

Previously, the virt-handler readiness check also calculated the worker count only once, before the retry loop. If that initial request failed and produced 0 workers, the check could incorrectly succeed with 0/0 ready handlers.
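
The corrected loop can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: the function names, the `node-role.kubernetes.io/worker` label, the `d8-virtualization` namespace, the `virt-handler` DaemonSet name, and the retry defaults are all assumptions made for the example.

```shell
# Hypothetical sketch: names, labels, namespace and defaults are assumptions.
VIRT_NAMESPACE="${VIRT_NAMESPACE:-d8-virtualization}"

# Print the number of worker nodes; return non-zero if the API call fails.
count_workers() {
  local out
  out=$(kubectl get nodes -l node-role.kubernetes.io/worker -o name 2>/dev/null) || return 1
  # grep -c exits 1 when it counts zero lines, so tolerate that case.
  printf '%s\n' "$out" | grep -c . || true
}

# Print the number of ready virt-handler pods reported by the DaemonSet.
count_ready_handlers() {
  kubectl -n "$VIRT_NAMESPACE" get daemonset virt-handler \
    -o jsonpath='{.status.numberReady}' 2>/dev/null
}

# Wait until every worker has a ready virt-handler.
# $1 = max attempts (default 30), $2 = delay between attempts in seconds (default 10).
wait_for_virt_handler() {
  local attempts="${1:-30}"
  local delay="${2:-10}"
  local workers ready
  local i=1
  while [ "$i" -le "$attempts" ]; do
    ready=""
    # Recalculate the worker count on every attempt; a failed or empty
    # listing means "keep waiting", never "0/0 is ready".
    if ! workers=$(count_workers) || [ "$workers" -eq 0 ]; then
      echo "attempt $i/$attempts: cannot list worker nodes yet, retrying" >&2
    elif ready=$(count_ready_handlers) && [ "${ready:-0}" -ge "$workers" ]; then
      echo "virt-handler ready: $ready/$workers"
      return 0
    else
      echo "attempt $i/$attempts: virt-handler ${ready:-?}/$workers ready" >&2
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

The key design choice is that a worker count of zero is treated the same as a failed API call, so the loop can only succeed once at least one worker is visible and all of them report a ready handler.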

What is the expected result?

  1. Run the nested E2E bootstrap workflow.
  2. If the nested cluster API temporarily fails during diagnostic output, the workflow logs a warning and continues collecting available information.
  3. If worker nodes cannot be listed during virt-handler readiness checks, the workflow waits for the next retry instead of reporting success with 0/0 handlers.
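
The "log a warning and continue" behavior from step 2 could look something like the wrapper below. It is a minimal sketch under stated assumptions: `debug_run` is a hypothetical helper name and the warning text is invented for illustration, not taken from the workflow.

```shell
# Hypothetical helper: run a diagnostic command and never fail the caller.
# On a non-zero exit it prints a warning to stderr and returns success,
# so a temporary API error does not abort the CI step.
debug_run() {
  if ! "$@"; then
    echo "WARNING: diagnostic command failed (ignored): $*" >&2
  fi
  return 0
}
```

A step would then call, for example, `debug_run kubectl get nodes -o wide`, and a `502 Bad Gateway` from the API would show up in the log as a warning instead of failing the job. Only purely diagnostic commands should be wrapped this way, since the wrapper deliberately hides their exit status.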

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually.

Changelog entries

section: ci
type: fix
summary: Make nested E2E bootstrap diagnostics tolerate temporary Kubernetes API failures.
impact_level: low

@universal-itengineer universal-itengineer marked this pull request as ready for review May 5, 2026 09:09
@universal-itengineer universal-itengineer force-pushed the fix/ci/nested-e2e-bootstrap branch 2 times, most recently from aa93115 to cebef57 Compare May 5, 2026 09:12
Signed-off-by: Nikita Korolev <nikita.korolev@flant.com>
@universal-itengineer universal-itengineer force-pushed the fix/ci/nested-e2e-bootstrap branch from cebef57 to 3295cbd Compare May 5, 2026 09:13
@universal-itengineer universal-itengineer added this to the v1.9.0 milestone May 5, 2026
@universal-itengineer universal-itengineer merged commit 077b199 into main May 5, 2026
26 of 29 checks passed
@universal-itengineer universal-itengineer deleted the fix/ci/nested-e2e-bootstrap branch May 5, 2026 09:25

2 participants