tests/baseosmgr: e2e coverage for force-fallback and retry-update by eriknordmark · Pull Request #1164 · lf-edge/eden

eriknordmark · 2026-05-08T11:51:55Z

Summary

Adds a new `tests/baseosmgr/` suite with two e2e tests that exercise the two controller-driven knobs in `pkg/pillar/cmd/baseosmgr` that are otherwise reachable only through real partition flips and a real controller round-trip:

| Test | Path tested | Failure / drive mode |
| --- | --- | --- |
| `force_fallback` | `handleForceFallback` (`forcefallback.go`) | Bumps the global config knob `force.fallback.counter` after a successful upgrade and asserts the device reverts to the captured pre-upgrade version. |
| `retry_update` | `handleUpdateRetryCounter` + `isImageInErrorState` (`handlebaseos.go`) | Triggers an upgrade, black-holes the controller's IP via in-EVE `iptables` so the post-update test window times out (`BootReasonFallback`), then bumps `RetryUpdateCounter` via `eveimage-update-retry` and asserts the same image succeeds on the second attempt. |

These complement the existing `tests/update_eve_image/` suite (which already covers the happy upgrade path + simple revert) and the new `tests/nodeagent/baseos_fallback_*` (#1162, which covers the fallback-on-cloud-disconnect path from nodeagent's perspective).

What's covered (and what isn't)

`force_fallback` exercises `handleForceFallback`'s precondition check
(curr=active, other=unused with non-empty `ShortVersion` — only true after a
prior successful upgrade), the
`/persist/checkpoint/forceFallbackCounter` plumbing, the
`SetOtherPartitionStateUpdating` path, and that nodeagent reacts to the
flip and reboots into the previous image. The version-based assertion
(`cmp` of the post-rollback active version against a captured `old_ver`
file) is unique to this path — only force-fallback can revert
`SwList[0].ShortVersion` to the pre-upgrade value.
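The precondition described above can be sketched as a small predicate. This is an illustrative Python model, not the Go code in `forcefallback.go`: the `Partition` type, the state strings, and the version `12.0.0` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Hypothetical stand-in for the per-partition status baseosmgr inspects."""
    state: str          # e.g. "active", "unused", "inprogress", "updating"
    short_version: str  # empty when the partition holds no image


def force_fallback_allowed(curr: Partition, other: Partition) -> bool:
    """Model of the precondition in the text, not the real implementation.

    Fallback only makes sense when the current partition is committed
    (active) and the other partition still holds a previous image.
    """
    return (
        curr.state == "active"
        and other.state == "unused"
        and other.short_version != ""
    )


# After a successful upgrade: the old image is still on the other partition.
after_upgrade = force_fallback_allowed(
    Partition("active", "12.1.0"), Partition("unused", "12.0.0")
)
# Fresh install: the other partition has never held an image.
fresh_install = force_fallback_allowed(
    Partition("active", "12.0.0"), Partition("unused", "")
)
# After a failed upgrade attempt: the other partition is left inprogress.
failed_upgrade = force_fallback_allowed(
    Partition("active", "12.0.0"), Partition("inprogress", "12.1.0")
)
```

Only the first case satisfies the check, which is why the test has to complete a real upgrade before bumping `force.fallback.counter`.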

`retry_update` exercises the "failed image" branch of
`isImageInErrorState` (curr=active, other=inprogress with a matching
`BaseOsConfig.Activate=true`), the
`SetOtherPartitionStateUpdating` + `saveConfigRetryUpdateCounter` path,
and the second `BootReasonUpdate` cycle that nodeagent issues into the
same partition. The black-hole is removed naturally by the post-fallback
reboot (in-memory iptables rule dies with the previous boot), so the
retry's test window passes against a working controller.
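A rough model of the state `retry_update` has to reach before the retry can fire is sketched below. It is written in Python for illustration only; the `BaseOsConfig` fields, the version-based reading of "matching", and the counter comparison in `should_retry` are assumptions, not the logic in `handlebaseos.go`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseOsConfig:
    """Hypothetical subset of the controller-supplied base OS config."""
    version: str
    activate: bool


def is_failed_image(curr_state: str, other_state: str,
                    other_version: str, config: BaseOsConfig) -> bool:
    """True when the other partition holds an image that failed its test window."""
    return (
        curr_state == "active"
        and other_state == "inprogress"
        and config.activate
        and config.version == other_version  # assumed meaning of "matching"
    )


def should_retry(failed: bool, saved_counter: int, config_counter: int) -> bool:
    """Assumed trigger: retry only when the controller has bumped the counter."""
    return failed and config_counter != saved_counter


cfg = BaseOsConfig(version="12.1.0", activate=True)
failed = is_failed_image("active", "inprogress", "12.1.0", cfg)
retry_without_bump = should_retry(failed, saved_counter=0, config_counter=0)
retry_after_bump = should_retry(failed, saved_counter=0, config_counter=1)
```

Under these assumptions the image stays in its failed state until the counter changes, which matches the test's sequence of fallback first, `eveimage-update-retry` second.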

Test plan

- Both tests run locally against a coverage-instrumented EVE under QEMU (`ZedVirtual-4G`). (deferred: eden harness is busy)
- CI run on lf-edge/eden's harness.

Per-test runtime estimate (laptop, KVM-accelerated):

- `force_fallback`: ~10–15 min (one full upgrade-then-rollback cycle).
- `retry_update`: ~12–18 min (one failed-attempt cycle + one retry cycle).

Implementation notes

- Capture-then-compare for the version assertion: `force_fallback` runs `eden info ... PartitionState:active --tail=1` before the upgrade, saves the result to `old_ver`, and after the rollback runs the same command again and `cmp`s the output against `old_ver`. This is more robust than asserting `! stdout '$short_version'`, because the post-fallback Info stream might still echo a stale pre-reboot sample if the assertion fires too early; the `cmp` is exact.
- Idempotency across runs: both tests issue `eveimage-remove` plus a brief wait before `eveimage-update`, so re-running with the same EVE version still triggers a fresh update.
- Config-propagation timing: `timer.config.interval=10` shortens controller polling so subsequent config pushes (the `force.fallback.counter` bump or the `RetryUpdateCounter` bump from `eveimage-update-retry`) propagate within ~10s. `timer.deviceinfo.interval=30` ensures the post-rollback Info is published promptly, so `lim.test` doesn't wait 10 min for the next periodic push.
- Black-hole mechanism (`retry_update`): `eden eve ssh -- iptables -A OUTPUT -d <adam-ip> -j DROP`, identical to the new `tests/nodeagent/baseos_fallback_blackhole.txt` in #1162. Device-model gating is the same (`ZedVirtual-4G` / `VBox` only).
- The suite is not wired into broader workflows; the two tests are addressable on their own (`eden.escript.test -test.run TestEdenScripts/force_fallback -testdata tests/baseosmgr/testdata/`). Folding them into a broader `smoke` / `eve-upgrade` workflow is a follow-up.
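To illustrate one property of the capture-then-compare approach, the sketch below contrasts a negative substring check with an exact comparison against a captured value. It is a Python model rather than the escript itself; the sample outputs and the old version `12.0.0` are hypothetical.

```python
def negative_check(output: str, new_version: str) -> bool:
    """Passes whenever the new version is absent from the output."""
    return new_version not in output


def exact_check(output: str, old_capture: str) -> bool:
    """Passes only when the output equals the pre-upgrade capture."""
    return output == old_capture


old_capture = "12.0.0\n"   # hypothetical contents of the old_ver file
new_version = "12.1.0"

samples = {
    "rolled_back": "12.0.0\n",  # device is back on the old image
    "stale": "12.1.0\n",        # sample published before the reboot
    "empty": "",                # no Info sample received yet
}

negative = {name: negative_check(out, new_version) for name, out in samples.items()}
exact = {name: exact_check(out, old_capture) for name, out in samples.items()}
```

In this model the negative check also passes on empty output, while the exact comparison passes only once the device actually reports the captured version.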

Suite structure

```
tests/baseosmgr/
├── Makefile (boilerplate matching tests/nodeagent/)
├── eden-config.yml
├── eden.baseosmgr.tests.txt
└── testdata/
    ├── force_fallback.txt
    └── retry_update.txt
```

Why draft

Has not been run end-to-end yet (eden harness was busy at write time);
will be exercised on `ZedVirtual-4G` / `VBox` before marking ready.

The companion eve-side PRs, which carry the unit-test side of this work:

- eve#5921: `docs(baseosmgr): add architecture document`.
- eve#5922: `baseosmgr: add Phase-1 unit tests + pathConfig seam`.
- eve#5923: `baseosmgr: Phase 2 — Zboot/kubeapi/HVType seams + tests` (lifts package coverage from 0% to ~75%).

This PR is independent of those (it only touches eden), but the same
review eyes apply to the design.

Adds a new tests/baseosmgr/ suite with two scripts that exercise the two controller-driven knobs in pkg/pillar/cmd/baseosmgr that are otherwise reachable only through real partition flips and a real controller.

- force_fallback exercises handleForceFallback end-to-end: trigger an EVE image update and wait for the new partition to commit (active), capture the pre-upgrade short version up front, then bump the global force.fallback.counter knob and assert the device reboots back to the captured old version. This is the "switch back to the previous image" path: baseosmgr observes the counter change via ZedAgentStatus, the precondition (curr=active, other=unused with non-empty ShortVersion) holds after a successful upgrade, baseosmgr flips the other partition to "updating", and nodeagent reboots into it with BootReasonUpdate.
- retry_update exercises handleUpdateRetryCounter + isImageInErrorState end-to-end: configure a long upgrade-test window and a short fallback timer, trigger an update, black-hole the controller's IP via in-EVE iptables once the new partition is inprogress so the test window times out and BootReasonFallback fires, then bump RetryUpdateCounter via eveimage-update-retry and assert the same image succeeds on the second attempt (the iptables rule dies with the previous boot's in-memory state, so connectivity restores naturally for the retry's test window). This covers the "failed image" branch of isImageInErrorState (curr=active, other=inprogress with a matching activate=true config), the SetOtherPartitionStateUpdating / saveConfigRetryUpdateCounter path, and the second BootReasonUpdate cycle that nodeagent issues into the same partition.

Both tests configure timer.test.baseimage.update / timer.config.interval / timer.deviceinfo.interval up front so the cycle finishes in a few minutes per phase rather than the default ten.

Both tests eveimage-remove + brief wait before eveimage-update so re-running with the same EVE version still triggers a fresh install. retry_update is gated on devmodel == ZedVirtual-4G or VBox because it relies on `eden eve ssh -- iptables -A OUTPUT -d <adam-ip> -j DROP` to silently drop controller traffic.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

retry_update commits the new EVE version (12.1.0 by default) onto the active partition as part of its success assertion. Without a revert, EVE stays on that version forever, which (a) leaves the device on a non-coverage build if the test ran in a coverage-instrumented G3 sequence, and (b) breaks any downstream test/suite that assumes EVE is on the originally-configured `eve.tag`.

Mirror the revert sequence already used by the standalone update_eve_image flow (eveimage-remove the leftover BaseOsConfig, download the original rootfs, eveimage-update back, wait for the active partition to flip, eveimage-remove again, eden eve reset).

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…e_fallback

force_fallback pushes a BaseOsConfig for 12.1.0 to exercise the force-fallback path, but only `eden eve reset` (clears device config) at the end, not `eveimage-remove`. The 12.1.0 BaseOsConfig therefore lingers in adam after the test, which leaks state into subsequent tests / suites, most notably retry_update, which begins with its own eveimage-remove + eveimage-update and depends on adam being in a known-clean state for the second eveimage-update to actually fire a fresh install request.

Add a final eveimage-remove of {{ $short_version }} before the eden eve reset.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Two follow-ups for the baseosmgr suite:

1. Raise outer -test.timeout values in eden.baseosmgr.tests.txt to match realistic suite runtime on the coverage-instrumented EVE build. retry_update issues four sequential lim.test waits at -timewait 30m each, plus 6 minutes of exec sleep; the 60-minute outer cap killed the suite mid-revert. Bumped to 90m for retry_update and 45m for force_fallback. The test-script -timewait values are unchanged; only the wrapper cap relaxes.
2. Add a get-config assertion after every eveimage-remove call so the test fails loudly if the controller config still references the removed image. The current eden CLI EdgeNodeEVEImageRemove only removes the legacy baseosconfig list entry and leaves the modern single-block baseos field + contentInfo[] populated; this assertion surfaces that bug end-to-end. (The corresponding eden CLI fix lands in PR lf-edge#1172.)

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
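The timeout budget for retry_update can be worked through with the figures quoted in the commit message. The helper name below is hypothetical; the inputs (four waits, 30 minutes each, 6 minutes of sleep, caps of 60 and 90 minutes) come from the message itself.

```python
def worst_case_minutes(waits: int, timewait_min: int, sleep_min: int) -> int:
    """Upper bound on runtime if every lim.test wait runs to its full -timewait."""
    return waits * timewait_min + sleep_min


# retry_update: four sequential lim.test waits at 30m each, plus 6m of exec sleep.
retry_update_worst = worst_case_minutes(waits=4, timewait_min=30, sleep_min=6)

# Compare the theoretical worst case with the old and new outer caps.
overshoot = {cap: retry_update_worst - cap for cap in (60, 90)}

for cap, over in overshoot.items():
    print(f"outer cap {cap}m: worst case {retry_update_worst}m exceeds it by {over}m")
```

The theoretical worst case is 126 minutes, above both caps, so the 90m value reflects the realistic runtime mentioned in the message and relies on most waits returning well before their individual 30-minute limit.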

eriknordmark marked this pull request as ready for review May 9, 2026 06:16

eriknordmark requested a review from uncleDecart as a code owner May 9, 2026 06:16

eriknordmark requested review from europaul and rene May 10, 2026 08:36

eriknordmark mentioned this pull request May 10, 2026

lim.test TestInfo times out matching dinfo.systemAdapter.status.ports.ifname even though EVE publishes valid systemAdapter info #1166

Closed

eriknordmark and others added 4 commits May 15, 2026 11:17

eriknordmark force-pushed the baseosmgr-tests branch from 9edc2fe to 690bd96 Compare May 15, 2026 09:17

eriknordmark wants to merge 4 commits into lf-edge:master from eriknordmark:baseosmgr-tests
