tests/baseosmgr: e2e coverage for force-fallback and retry-update #1164
Open
eriknordmark wants to merge 4 commits into
Adds a new tests/baseosmgr/ suite with two scripts that exercise the two controller-driven knobs in pkg/pillar/cmd/baseosmgr that are otherwise reachable only through real partition flips and a real controller.

- force_fallback exercises handleForceFallback end-to-end: capture the pre-upgrade short version up front, trigger an EVE image update and wait for the new partition to commit (active), then bump the global force.fallback.counter knob and assert the device reboots back to the captured old version. This is the "switch back to the previous image" path: baseosmgr observes the counter change via ZedAgentStatus, the precondition (curr=active, other=unused with non-empty ShortVersion) holds after a successful upgrade, baseosmgr flips the other partition to "updating", and nodeagent reboots into it with BootReasonUpdate.

- retry_update exercises handleUpdateRetryCounter + isImageInErrorState end-to-end: configure a long upgrade-test window and a short fallback timer, trigger an update, black-hole the controller's IP via in-EVE iptables once the new partition is inprogress so the test window times out and BootReasonFallback fires, then bump RetryUpdateCounter via eveimage-update-retry and assert the same image succeeds on the second attempt (the iptables rule dies with the previous boot's in-memory state, so connectivity restores naturally for the retry's test window). This covers the "failed image" branch of isImageInErrorState (curr=active, other=inprogress with a matching activate=true config), the SetOtherPartitionStateUpdating / saveConfigRetryUpdateCounter path, and the second BootReasonUpdate cycle that nodeagent issues into the same partition.

Both tests configure timer.test.baseimage.update / timer.config.interval / timer.deviceinfo.interval up front so the cycle finishes in a few minutes per phase rather than the default ten.

Both tests run eveimage-remove plus a brief wait before eveimage-update so that re-running with the same EVE version still triggers a fresh install.

retry_update is gated on devmodel == ZedVirtual-4G or VBox because it relies on `eden eve ssh -- iptables -A OUTPUT -d <adam-ip> -j DROP` to silently drop controller traffic.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
retry_update commits the new EVE version (12.1.0 by default) onto the active partition as part of its success assertion. Without a revert, EVE stays on that version forever, which (a) leaves the device on a non-coverage build if the test ran in a coverage-instrumented G3 sequence, and (b) breaks any downstream test/suite that assumes EVE is on the originally-configured `eve.tag`.

Mirror the revert sequence already used by the standalone update_eve_image flow (eveimage-remove the leftover BaseOsConfig, download the original rootfs, eveimage-update back, wait for the active partition to flip, eveimage-remove again, eden eve reset).

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e_fallback
force_fallback pushes a BaseOsConfig for 12.1.0 to exercise the
force-fallback path, but at the end it runs only `eden eve reset` (which
clears device config), not `eveimage-remove`. The 12.1.0 BaseOsConfig therefore
lingers in adam after the test, which leaks state into subsequent
tests / suites — most notably retry_update, which begins with its
own eveimage-remove + eveimage-update and depends on adam being in a
known-clean state for the second eveimage-update to actually fire a
fresh install request.
Add a final eveimage-remove of {{ $short_version }} before the
eden eve reset.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two follow-ups for the baseosmgr suite:

1. Raise outer -test.timeout values in eden.baseosmgr.tests.txt to match realistic suite runtime on the coverage-instrumented EVE build. retry_update issues four sequential lim.test waits at -timewait 30m each, plus 6 minutes of exec sleep; the 60-minute outer cap killed the suite mid-revert. Bumped to 90m for retry_update and 45m for force_fallback. The test-script -timewait values are unchanged; only the wrapper cap relaxes.

2. Add a get-config assertion after every eveimage-remove call so the test fails loudly if the controller config still references the removed image. The current eden CLI EdgeNodeEVEImageRemove only removes the legacy baseosconfig list entry and leaves the modern single-block baseos field + contentInfo[] populated; this assertion surfaces that bug end-to-end. (The corresponding eden CLI fix lands in PR lf-edge#1172.)

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
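As an illustration of the get-config assertion in item 2, the check amounts to scanning the controller config for any remaining mention of the removed version. The sketch below is only a model: the key names `baseosconfig`, `baseos`, and `contentInfo` are taken loosely from the description above, the nested field names in the usage note are hypothetical, and the real test script works on `get-config` output rather than on a Python dict.

```python
import json

# Top-level config sections named in the description above. The exact JSON
# key spelling is an assumption made for this sketch.
FIELDS = ("baseosconfig", "baseos", "contentInfo")


def lingering_references(config, version):
    """Return the sections whose serialized content still mentions `version`."""
    return [
        field
        for field in FIELDS
        if field in config and version in json.dumps(config[field])
    ]


def assert_image_removed(config, version):
    """Fail loudly if any section still references the removed image."""
    leftovers = lingering_references(config, version)
    if leftovers:
        raise AssertionError(
            f"config still references {version} in: {', '.join(leftovers)}"
        )
```

For a config in which only the legacy list was cleared, for example `{"baseosconfig": [], "baseos": {"version": "12.1.0"}, "contentInfo": [{"name": "eve-12.1.0"}]}`, `lingering_references(config, "12.1.0")` reports the `baseos` and `contentInfo` sections, which is the situation the commit message describes.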
Summary
Adds a new `tests/baseosmgr/` suite with two e2e tests, `force_fallback` and `retry_update`, that exercise the two controller-driven knobs in `pkg/pillar/cmd/baseosmgr` that are otherwise reachable only through real partition flips and a real controller round-trip.
These complement the existing `tests/update_eve_image/` suite (which already covers the happy upgrade path + simple revert) and the new `tests/nodeagent/baseos_fallback_*` (#1162, which covers the fallback-on-cloud-disconnect path from nodeagent's perspective).
What's covered (and what isn't)
`force_fallback` exercises `handleForceFallback`'s precondition check
(curr=active, other=unused with non-empty `ShortVersion` — only true after a
prior successful upgrade), the
`/persist/checkpoint/forceFallbackCounter` plumbing, the
`SetOtherPartitionStateUpdating` path, and that nodeagent reacts to the
flip and reboots into the previous image. The version-based assertion
(`cmp` of the post-rollback active version against a captured `old_ver`
file) is unique to this path — only force-fallback can revert
`SwList[0].ShortVersion` to the pre-upgrade value.
`retry_update` exercises the "failed image" branch of
`isImageInErrorState` (curr=active, other=inprogress with a matching
`BaseOsConfig.Activate=true`), the
`SetOtherPartitionStateUpdating` + `saveConfigRetryUpdateCounter` path,
and the second `BootReasonUpdate` cycle that nodeagent issues into the
same partition. The black-hole is removed naturally by the post-fallback
reboot (in-memory iptables rule dies with the previous boot), so the
retry's test window passes against a working controller.
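The two partition-state conditions described above can be written down as a small model. This is a sketch of the conditions as stated in this description, not the actual Go code in `baseosmgr`: the `Partition` class and both function names are hypothetical, and "matching" config is modeled here as simple version equality, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical stand-in for the status of one EVE image partition."""
    state: str               # "active", "unused", "inprogress" or "updating"
    short_version: str = ""  # empty when the partition holds no known image


def force_fallback_allowed(curr, other):
    """Precondition described for handleForceFallback:
    curr=active, other=unused with a non-empty ShortVersion."""
    return (
        curr.state == "active"
        and other.state == "unused"
        and other.short_version != ""
    )


def image_in_error_state(curr, other, config_version, activate):
    """"Failed image" branch described for isImageInErrorState:
    curr=active, other=inprogress, and a matching config with Activate=true.
    Matching is modeled as version equality (an assumption)."""
    return (
        curr.state == "active"
        and other.state == "inprogress"
        and activate
        and other.short_version == config_version
    )
```

In this model, a device that has never been upgraded has an `other` partition with an empty `short_version`, so `force_fallback_allowed` is false; that is why the test performs a full upgrade before bumping the counter.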
Test plan
QEMU (`ZedVirtual-4G`). (deferred: eden harness is busy)
Per-test runtime estimate (laptop, KVM-accelerated):
Implementation notes
`eden info ... PartitionState:active --tail=1` before the upgrade,
saves the result to `old_ver`, and after the rollback runs the same
command again and `cmp`s against `old_ver`. This is more robust than
asserting `! stdout '$short_version'` because the post-fallback Info
publish-cadence might still echo a stale pre-reboot sample if the
assertion fires too early; the `cmp` is exact.
wait before `eveimage-update`, so re-running with the same EVE version
still triggers a fresh update.
controller polling so subsequent config pushes (the
`force.fallback.counter` bump or the `RetryUpdateCounter` bump from
`eveimage-update-retry`) propagate within ~10s. `timer.deviceinfo.interval=30`
ensures post-rollback Info publishes promptly so `lim.test` doesn't
wait 10 min for the next periodic push.
identical to the new `tests/nodeagent/baseos_fallback_blackhole.txt` in
#1162 (tests/nodeagent: e2e coverage for reboot paths, restart counter, and disk-space maintenance). Device-model gating is the same (`ZedVirtual-4G` / `VBox` only).
addressable on their own (`eden.escript.test -test.run TestEdenScripts/force_fallback -testdata tests/baseosmgr/testdata/`).
Folding them into a broader `smoke` / `eve-upgrade` workflow is a
follow-up.
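The difference between the exact `cmp` check and a negative `! stdout` check from the notes above can be shown with two small predicates. This is a generic illustration, not the test script itself: the function names and the `old-version` placeholder are hypothetical, and the empty-output case is just one way a negative check can pass without proving anything, not necessarily the scenario the notes have in mind.

```python
def version_absent(stdout, new_version):
    """Negative check: passes whenever the new version string is not present."""
    return new_version not in stdout


def version_matches(stdout, old_ver):
    """Exact check, analogous to `cmp` against the saved old_ver file."""
    return stdout == old_ver


# Placeholder for the contents of the old_ver file captured before the upgrade.
OLD_VER = "old-version\n"

# An empty or unrelated sample satisfies the negative check vacuously,
# but cannot satisfy the exact comparison.
empty_sample = ""
negative_passes = version_absent(empty_sample, "12.1.0")
exact_passes = version_matches(empty_sample, OLD_VER)
```

Here `negative_passes` is true while `exact_passes` is false, so only the exact comparison confirms that the active partition actually reports the pre-upgrade version.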
Suite structure
```
tests/baseosmgr/
├── Makefile (boilerplate matching tests/nodeagent/)
├── eden-config.yml
├── eden.baseosmgr.tests.txt
└── testdata/
├── force_fallback.txt
└── retry_update.txt
```
Why draft
Has not been run end-to-end yet (the eden harness was busy at the time of writing);
will be exercised on `ZedVirtual-4G` / `VBox` before being marked ready.
The unit-test side of this work is in the companion eve-side PRs:
(lifts package coverage from 0 % to ~75 %).
This PR is independent of those (it only touches eden), but the same
review eyes apply to the design.