kola: Add soft-reboot support for external tests#4119
Conversation
|
Skipping CI for Draft Pull Request. |
depends on: coreos/coreos-assembler#4119
Is there a case where a soft-reboot would be a better test than a hard reboot? Is the goal here to reduce test time? |
|
Right now soft-reboots are not really supported by ostree but we are implementing soft-reboots for ostree on : ostreedev/ostree#3420 |
|
Ok, Just comment here when you move this out of draft |
cgwalters
left a comment
There was a problem hiding this comment.
Looks good to me overall! Main issue is code duplication with the "hard" reboot path but that's probably hard to fix and not worth doing.
| echo "test beginning" | ||
| # Check that boot_id stays the same across soft-reboot | ||
| INITIAL_BOOT_ID=$(cat /proc/sys/kernel/random/boot_id) | ||
| echo "Initial boot ID: $INITIAL_BOOT_ID" | logger |
There was a problem hiding this comment.
Don't need the |logger, we always log tests to the journal
| # Verify boot_id is the same (soft-reboot should not change it) | ||
| CURRENT_BOOT_ID=$(cat /proc/sys/kernel/random/boot_id) | ||
| echo "Current boot ID after soft-reboot: $CURRENT_BOOT_ID" | logger |
There was a problem hiding this comment.
How is this verified though? I think we need to save the boot id to a persistent file like /var/cache/kola-boot-id or something in the first boot and then compare it here.
| mark2) | ||
| echo "test in mark2" | ||
| FINAL_BOOT_ID=$(cat /proc/sys/kernel/random/boot_id) | ||
| echo "Final boot ID after forced soft-reboot: $FINAL_BOOT_ID" | logger |
|
OK test now dies with a timeout which is the same behavior I see when I do edit: looking at the console log the VM appears to come back. It's just cosa that loses the ability to ssh back in, even during a |
|
OK needed to modify qemu.go to get the test to comeback too. Now need to figureout this failed service. |
4944160 to
d0743f6
Compare
|
I am not sure of the story of that service but it looks like we could ignore that failed service without any real issues. |
|
My offhand guess as to what's happening here is that service isn't prepared for soft reboots. https://github.com/coreos/fedora-coreos-config/blob/20feb176f19c3142b7256c1eb5bf1cb7c53b29b9/overlay.d/05core/usr/lib/systemd/system/coreos-ignition-firstboot-complete.service#L4 is still triggered because we didn't change the kernel commandline across the soft reboot. I think CoreOS probably does need to fix this because soft rebooting from the initial boot is a sane thing to do and I see real world use cases for it. It is perhaps as simple as |
On coreos/coreos-assembler#4119 surfaced that this service would fail on soft-reboots, it's non fatal but would make the Kola tests fail. It looks like originally this was set to fail instead of risking not running it: https://github.com/coreos/fedora-coreos-config/blob/20feb176f19c3142b7256c1eb5bf1cb7c53b29b9/overlay.d/05core/usr/libexec/coreos-ignition-firstboot-complete#L14
coreos/coreos-assembler#4119 surfaced that this service would fail on soft-reboots, it's non fatal but would make the Kola tests fail. It looks like originally this was set to fail instead of risking not running it: https://github.com/coreos/fedora-coreos-config/blob/20feb176f19c3142b7256c1eb5bf1cb7c53b29b9/overlay.d/05core/usr/libexec/coreos-ignition-firstboot-complete#L14
|
Holy cow that was painful to figure out but fixes in #4133 get us improved error message handling to the point where I now see Which was the real bug here - with soft reboots |
coreos/coreos-assembler#4119 surfaced that this service would fail on soft-reboots, it's non fatal but would make the Kola tests fail. It looks like originally this was set to fail instead of risking not running it: https://github.com/coreos/fedora-coreos-config/blob/20feb176f19c3142b7256c1eb5bf1cb7c53b29b9/overlay.d/05core/usr/libexec/coreos-ignition-firstboot-complete#L14
coreos/coreos-assembler#4119 surfaced that this service would fail on soft-reboots, it's non fatal but would make the Kola tests fail. It looks like originally this was set to fail instead of risking not running it: https://github.com/coreos/fedora-coreos-config/blob/20feb176f19c3142b7256c1eb5bf1cb7c53b29b9/overlay.d/05core/usr/libexec/coreos-ignition-firstboot-complete#L14 Co-authored-by: Colin Walters <walters@verbum.org>
…t-reboots coreos/coreos-assembler#4119 surfaced that this service would fail on soft-reboots, it's non fatal but would make the Kola tests fail. It looks like originally this was set to fail instead of risking not running it: https://github.com/coreos/fedora-coreos-config/blob/20feb176f19c3142b7256c1eb5bf1cb7c53b29b9/overlay.d/05core/usr/libexec/coreos-ignition-firstboot-complete#L14 Co-authored-by: Colin Walters <walters@verbum.org>
coreos/coreos-assembler#4119 surfaced that this services using ConditionKernelCommandLine=ignition.firstboot would fail on soft-reboots, it's non fatal but would make the Kola tests fail. Co-authored-by: Colin Walters <walters@verbum.org>
|
/test rhcos |
coreos/coreos-assembler#4119 surfaced that this services using ConditionKernelCommandLine=ignition.firstboot would fail on soft-reboots, it's non fatal but would make the Kola tests fail. Co-authored-by: Jonathan Lebon <jonathan@jlebon.com> Co-authored-by: Colin Walters <walters@verbum.org>
coreos/coreos-assembler#4119 surfaced that this services using ConditionKernelCommandLine=ignition.firstboot would fail on soft-reboots, it's non fatal but would make the Kola tests fail. Co-authored-by: Jonathan Lebon <jonathan@jlebon.com> Co-authored-by: Colin Walters <walters@verbum.org>
|
/retest-required |
cgwalters
left a comment
There was a problem hiding this comment.
I don't think we had any blockers left right?
Implements soft-reboot capabilities for Kola, it enables tests to use systemd's soft-reboot functionality. The implementation follows the same pattern as regular reboots but for `systemctl soft-reboot`, tracks systemd boot timestamps rather than kernel boot IDs for state detection. Co-Authored-By: Colin Walters <walters@verbum.org> Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Colin Walters <walters@verbum.org> Signed-off-by: Joseph Marrero Corchado <jmarrero@redhat.com>
| // For soft-reboot, we don't change the boot ID, but the systemd SoftRebootsCount should increment | ||
| // We use systemd's SoftRebootsCount property to detect when the system has soft-rebooted | ||
| c := make(chan error) | ||
| go func() { |
There was a problem hiding this comment.
Completely unrelated to this but just for the future...I am increasingly of the opinion that the way tests should work is for the target machine to run autonomously, and report its status instead of being polled. IOW the kubernetes model, not the ansible model. It's a bit ironic that we ended up with a test system that follows more closely to the latter.
At least the external tests always run in a systemd unit. If we instead had an "agent reports status" model then this poll/retry code wouldn't be necessary. In fact the harness wouldn't even need to be aware of soft reboots in general!
But anyways, the test suite is what it is now.
cgwalters
left a comment
There was a problem hiding this comment.
Still LGTM (good catches on the further review!)
Implements soft-reboot capabilities for Kola,
it enables tests to use systemd's soft-reboot functionality.
The implementation follows the same pattern as regular reboots but for
systemctl soft-reboot, tracks systemd boottimestamps rather than kernel boot IDs for state detection.