ci: add libvirt diagnostics and network activation for CoreOS TMT SSH failures #2157
Closed
Copilot wants to merge 2 commits into
Add `dump_libvirt_diagnostics()` in `tmt.rs` to capture virsh state when SSH connectivity verification fails (VM list, network list, DHCP leases, dominfo, domiflist, dumpxml). Add a CI step in the `test-coreos` job to ensure the libvirt default network is active before running CoreOS tests.

Assisted-by: GitHub Copilot (claude-sonnet-4.5)
Agent-Logs-Url: https://github.com/bootc-dev/bootc/sessions/fc40be37-2aea-41f1-b021-4f5b1d804a4a
Co-authored-by: jmarrero <1894385+jmarrero@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix flaky CoreOS TMT SSH provisioning in CI" to "ci: add libvirt diagnostics and network activation for CoreOS TMT SSH failures" on Apr 22, 2026
The `test-coreos` job fails intermittently with an opaque `SSH connectivity check failed after 60 attempts` error. No VM or network state is captured, which makes root cause analysis impossible.

Changes

`crates/xtask/src/tmt.rs`
- `dump_libvirt_diagnostics(sh, vm_name)` is called on SSH verification failure in `run_tmt`, emitting:
  - `virsh list --all`, `virsh net-list --all`, `virsh net-dhcp-leases default`
  - `virsh dominfo`, `virsh domiflist`, `virsh dumpxml` for the failing VM
- Each command runs with `.ignore_status()` so diagnostics never block cleanup.

`.github/workflows/ci.yml`, `test-coreos` job
- Ensures the libvirt `default` network is active before launching the CoreOS VM; prints DHCP leases for confirmation.

The network activation step eliminates the most likely root cause (inactive `default` network / no DHCP). The diagnostics dump ensures any remaining failures produce actionable output instead of a bare timeout.

Original prompt
Fix flaky/failing CoreOS TMT SSH provisioning in CI (bootc-dev/bootc)
Context
- `.github/workflows/ci.yml` contains a `test-coreos` job that builds an image and runs `just test-tmt-on-coreos plan-bootc-install-on-coreos`.
- The job log ends with:
    Waiting for SSH... attempt 60/60
    SSH verification failed for plan /tmt/plans/tests-install/plan-bootc-install-on-coreos: Verifying SSH connectivity: SSH connectivity check failed after 60 attempts
    error: Recipe `test-tmt-on-coreos` failed ... exit code 1
- The failure was seen at `2a459ac9b31eac38a294e3febec3811cbb3ccc6e` (PR #2155, "tests: Skip BLI detection test when discoverable-partitions config is absent") and is not caused by the test change; it looks like libvirt/network/provisioning/guest boot regressions or missing diagnostics.

Goal
Required changes
Add robust debug capture when SSH connectivity verification fails in the CoreOS TMT harness.
- In the script behind `just test-tmt-on-coreos` (likely `hack/test-tmt-on-coreos` or a similar wrapper), add an `ERR` trap or explicit failure handler that prints:
  - `virsh list --all`
  - `virsh net-list --all`
  - `virsh net-dhcp-leases <network>` (at least `default`)
  - `virsh dominfo <domain>` / `virsh domiflist <domain>` / `virsh dumpxml <domain>` for the relevant domain.

Ensure libvirt network is active before starting CoreOS tests.
- In the `.github/workflows/ci.yml` `test-coreos` job, add a step before running the CoreOS test to check and start the libvirt network (`default` or the one used by the harness).

Keep changes minimal and safe.
Update documentation/comments as needed.
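The "check and start the libvirt network" requirement above could be sketched as a small shell helper. This is an illustration only, not the step added in this PR: the function name `ensure_libvirt_network` is hypothetical, and it assumes `virsh` already points at the right libvirt connection (a CI runner may need `sudo` or `-c qemu:///system`).

```shell
#!/usr/bin/env bash
# Sketch: make sure a libvirt network is active before the test VM boots.
# ensure_libvirt_network is a hypothetical helper name, not taken from the PR.
ensure_libvirt_network() {
    local net="${1:-default}"
    # `virsh net-info` prints a line such as "Active:         yes".
    if virsh net-info "$net" 2>/dev/null | grep -Eq '^Active:[[:space:]]+yes'; then
        echo "libvirt network '$net' is already active"
    else
        echo "libvirt network '$net' is not active; starting it"
        # Fails (non-zero return) if the network is not defined at all.
        virsh net-start "$net" || return 1
    fi
    # Print current leases for confirmation; an empty or failing listing is not fatal.
    virsh net-dhcp-leases "$net" || true
}
```

Calling `ensure_libvirt_network default` from a CI step before the test run would start the network only when it is inactive, so the step stays a no-op on runners where the network is already up.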
Deliverables
References
- The `test-tmt-on-coreos` recipe around line 211 and the failing TMT plan `tmt/plans/tests-install/plan-bootc-install-on-coreos`.
- `.github/workflows/ci.yml`.

User context
The following is the prior conversation context from the user's chat exploration (may be truncated):
User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.
Assistant: The job is failing because the Fedora CoreOS VM never becomes reachable over SSH during the TMT plan `tests-install/plan-bootc-install-on-coreos`:
    Waiting for SSH... attempt 60/60
    SSH verification failed ... SSH connectivity check failed after 60 attempts
    error: Recipe test-tmt-on-coreos failed ... exit code 1
This is almost always one of:
Targeted solution: make the CoreOS provision step fail earlier with actionable diagnostics, and ensure the libvirt network is usable
Right now, the harness times out without capturing why the guest isn’t reachable. Update the `just test-tmt-on-coreos ...` wrapper (the logs indicate it’s implemented by `hack/test-tmt-on-coreos`, failing around “line 211”) to dump libvirt/QEMU state when SSH doesn’t come up.

Code suggestion (add a failure trap + libvirt diagnostics)
In `hack/test-tmt-on-coreos`, wrap the run so that on failure you automatically print:
- `virsh list --all`
- `virsh dominfo <domain>`
- `virsh domiflist <domain>`
- `virsh net-list --all`
- `virsh net-dhcp-leases default` (or the network you use)
- `virsh console` capture if possible (or `virsh dumpxml <domain>` at minimum)

Example pattern:
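As an illustration of the failure-trap idea (not the snippet from the original conversation), a minimal shell sketch follows. It reuses the name `dump_libvirt_diagnostics` from this PR, but the shell form and the `DOMAIN` and `NETWORK` variables are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch: dump libvirt state when a command in the harness fails.
# The function name mirrors the PR; this shell version and the
# DOMAIN/NETWORK variables used below are illustrative assumptions.
dump_libvirt_diagnostics() {
    local domain="${1:-}" network="${2:-default}"
    echo "=== libvirt diagnostics ===" >&2
    # Every call is best-effort: `|| true` keeps a failing virsh
    # invocation from masking the original error or aborting the dump.
    virsh list --all >&2 || true
    virsh net-list --all >&2 || true
    virsh net-dhcp-leases "$network" >&2 || true
    if [ -n "$domain" ]; then
        virsh dominfo "$domain" >&2 || true
        virsh domiflist "$domain" >&2 || true
        virsh dumpxml "$domain" >&2 || true
    fi
}

# Usage in a wrapper script (left commented so sourcing this file has no side effects):
#   set -Eeuo pipefail
#   trap 'dump_libvirt_diagnostics "${DOMAIN:-}" "${NETWORK:-default}"' ERR
```

Writing to stderr keeps the dump out of any output the wrapper captures from stdout, and `set -E` makes the `ERR` trap fire for failures inside functions as well.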
This won’t “fix” networking by itself, but it turns the current opaque timeout into a...
This pull request was created from Copilot chat.