tests/baseosmgr: e2e coverage for force-fallback and retry-update#1164

Open
eriknordmark wants to merge 4 commits into lf-edge:master from eriknordmark:baseosmgr-tests

@eriknordmark
Contributor

Summary

Adds a new `tests/baseosmgr/` suite with two e2e tests that exercise the two controller-driven knobs in `pkg/pillar/cmd/baseosmgr` that are otherwise reachable only through real partition flips and a real controller round-trip:

| Test | Path tested | Failure / drive mode |
| --- | --- | --- |
| `force_fallback` | `handleForceFallback` (`forcefallback.go`) | Bumps the global config knob `force.fallback.counter` after a successful upgrade and asserts the device reverts to the captured pre-upgrade version. |
| `retry_update` | `handleUpdateRetryCounter` + `isImageInErrorState` (`handlebaseos.go`) | Triggers an upgrade, black-holes the controller's IP via in-EVE `iptables` so the post-update test window times out (`BootReasonFallback`), then bumps `RetryUpdateCounter` via `eveimage-update-retry` and asserts the same image succeeds on the second attempt. |

These complement the existing `tests/update_eve_image/` suite (which already covers the happy upgrade path + simple revert) and the new `tests/nodeagent/baseos_fallback_*` (#1162, which covers the fallback-on-cloud-disconnect path from nodeagent's perspective).

What's covered (and what isn't)

`force_fallback` exercises `handleForceFallback`'s precondition check
(curr=active, other=unused with non-empty `ShortVersion` — only true after a
prior successful upgrade), the
`/persist/checkpoint/forceFallbackCounter` plumbing, the
`SetOtherPartitionStateUpdating` path, and that nodeagent reacts to the
flip and reboots into the previous image. The version-based assertion
(`cmp` of the post-rollback active version against a captured `old_ver`
file) is unique to this path — only force-fallback can revert
`SwList[0].ShortVersion` to the pre-upgrade value.
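
The precondition described above can be modelled as a small predicate. This is a minimal sketch in Python, not the Go code in `forcefallback.go`: the `Partition` class, its field names, and `should_force_fallback` are hypothetical names chosen for illustration, and the counter comparison is an assumption about how the persisted `forceFallbackCounter` value is used.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    # Hypothetical model of one EVE root partition; not the pillar type.
    state: str          # e.g. "active", "unused", "inprogress", "updating"
    short_version: str  # empty when the partition holds no image


def should_force_fallback(curr: Partition, other: Partition,
                          saved_counter: int, config_counter: int) -> bool:
    """Return True when a force-fallback request should flip partitions."""
    # Assumption: a request is acted on only when the configured counter
    # differs from the value persisted after the last handled request.
    if config_counter == saved_counter:
        return False
    # Precondition from the PR text: curr=active, other=unused with a
    # non-empty ShortVersion (only true after a prior successful upgrade).
    return (curr.state == "active"
            and other.state == "unused"
            and other.short_version != "")
```

Under this model, a freshly installed device whose other partition has never held an image returns `False` even when the counter is bumped, which is why the test performs a full upgrade first.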

`retry_update` exercises the "failed image" branch of
`isImageInErrorState` (curr=active, other=inprogress with a matching
`BaseOsConfig.Activate=true`), the
`SetOtherPartitionStateUpdating` + `saveConfigRetryUpdateCounter` path,
and the second `BootReasonUpdate` cycle that nodeagent issues into the
same partition. The black-hole is removed naturally by the post-fallback
reboot (in-memory iptables rule dies with the previous boot), so the
retry's test window passes against a working controller.
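
The "failed image" branch and the counter check can be sketched the same way. This is an illustrative Python model under stated assumptions, not the logic in `handlebaseos.go`: `BaseOsConfig` here is a hypothetical two-field stand-in, the function names are invented, and "matching" is assumed to mean that the config names the version sitting in the other partition.

```python
from dataclasses import dataclass


@dataclass
class BaseOsConfig:
    # Hypothetical stand-in for the controller's base OS config entry.
    version: str
    activate: bool


def is_failed_image(curr_state: str, other_state: str,
                    other_version: str, cfg: BaseOsConfig) -> bool:
    """Model of the "failed image" branch described in the PR text."""
    # Assumption: "matching" means the config names the version that sits
    # in the other partition and has Activate=true.
    return (curr_state == "active"
            and other_state == "inprogress"
            and cfg.activate
            and cfg.version == other_version)


def should_retry_update(saved_counter: int, config_counter: int,
                        failed: bool) -> bool:
    """Retry only when the counter was bumped and the image had failed."""
    return config_counter != saved_counter and failed
```

When `should_retry_update` holds, the PR text says the other partition is set back to updating and the new counter is saved, after which nodeagent issues the second `BootReasonUpdate` cycle.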

Test plan

  • Both tests run locally against a coverage-instrumented EVE under
    QEMU (`ZedVirtual-4G`). (deferred: eden harness is busy)
  • CI run on lf-edge/eden's harness.

Per-test runtime estimate (laptop, KVM-accelerated):

  • `force_fallback`: ~10–15 min (one full upgrade-then-rollback cycle).
  • `retry_update`: ~12–18 min (one failed-attempt cycle + one retry cycle).

Implementation notes

  • Capture-then-compare for the version assertion: `force_fallback` runs
    `eden info ... PartitionState:active --tail=1` before the upgrade,
    saves the result to `old_ver`, and after the rollback runs the same
    command again and `cmp`s against `old_ver`. This is more robust than
    asserting `! stdout '$short_version'`: if the assertion fires too
    early, the post-fallback Info stream may still carry a stale
    pre-reboot sample, whereas the `cmp` is an exact match against the
    captured value.
  • Idempotency across runs: both tests issue `eveimage-remove` + brief
    wait before `eveimage-update`, so re-running with the same EVE version
    still triggers a fresh update.
  • Config-propagation timing: `timer.config.interval=10` shortens
    controller polling so subsequent config pushes (the
    `force.fallback.counter` bump or the `RetryUpdateCounter` bump from
    `eveimage-update-retry`) propagate within ~10s. `timer.deviceinfo.interval=30`
    ensures post-rollback Info publishes promptly so `lim.test` doesn't
    wait 10 min for the next periodic push.
  • Black-hole mechanism (retry_update): `eden eve ssh -- iptables -A OUTPUT -d <adam-ip> -j DROP`,
    identical to the new `tests/nodeagent/baseos_fallback_blackhole.txt` in
    #1162 (tests/nodeagent: e2e coverage for reboot paths, restart counter, and disk-space maintenance). Device-model gating is the same (`ZedVirtual-4G` / `VBox` only).
  • Suite is not wired into broader workflows — the two tests are
    addressable on their own (`eden.escript.test -test.run TestEdenScripts/force_fallback -testdata tests/baseosmgr/testdata/`).
    Folding them into a broader `smoke` / `eve-upgrade` workflow is a
    follow-up.
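
The two assertion styles mentioned in the capture-then-compare note can be contrasted with a small Python model. This is a sketch only: the function names are hypothetical, the version strings in the usage note are placeholders, and it does not reproduce how the eden script engine evaluates `stdout` or `cmp`. It shows one property of the exact comparison, namely that it also rejects empty or unrelated output that a negative match would accept.

```python
def negative_match_passes(output: str, new_version: str) -> bool:
    # Models `! stdout '<new version>'`: passes whenever the new version
    # string is absent, including when the output is empty.
    return new_version not in output


def exact_compare_passes(output: str, old_ver: str) -> bool:
    # Models `cmp` against the captured old_ver content: passes only when
    # the output is identical to the pre-upgrade capture.
    return output == old_ver
```

For example, with a captured `old_ver` of `"11.0.0\n"` (a placeholder value), an empty output passes the negative match for `"12.1.0"` but fails the exact comparison, so only the second style proves the device actually reports the old version.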

Suite structure

```
tests/baseosmgr/
├── Makefile (boilerplate matching tests/nodeagent/)
├── eden-config.yml
├── eden.baseosmgr.tests.txt
└── testdata/
    ├── force_fallback.txt
    └── retry_update.txt
```

Why draft

Has not been run end-to-end yet (eden harness was busy at write time);
will be exercised on `ZedVirtual-4G` / `VBox` before marking ready.

The unit-test side of this work is in the companion eve-side PRs:

This PR is independent of those (it only touches eden), but the same
review eyes apply to the design.

eriknordmark and others added 4 commits May 15, 2026 11:17

Adds a new tests/baseosmgr/ suite with two scripts that exercise the
two controller-driven knobs in pkg/pillar/cmd/baseosmgr that are
otherwise reachable only through real partition flips and a real
controller.

- force_fallback exercises handleForceFallback end-to-end: trigger
  an EVE image update and wait for the new partition to commit
  (active), capture the pre-upgrade short version up front, then
  bump the global force.fallback.counter knob and assert the device
  reboots back to the captured old version. This is the "switch
  back to the previous image" path: baseosmgr observes the counter
  change via ZedAgentStatus, the precondition (curr=active,
  other=unused with non-empty ShortVersion) holds after a successful
  upgrade, baseosmgr flips the other partition to "updating", and
  nodeagent reboots into it with BootReasonUpdate.

- retry_update exercises handleUpdateRetryCounter +
  isImageInErrorState end-to-end: configure a long upgrade-test
  window and a short fallback timer, trigger an update, black-hole
  the controller's IP via in-EVE iptables once the new partition is
  inprogress so the test window times out and BootReasonFallback
  fires, then bump RetryUpdateCounter via eveimage-update-retry and
  assert the same image succeeds on the second attempt (the iptables
  rule dies with the previous boot's in-memory state, so
  connectivity restores naturally for the retry's test window).
  This covers the "failed image" branch of isImageInErrorState
  (curr=active, other=inprogress with a matching activate=true
  config), the SetOtherPartitionStateUpdating /
  saveConfigRetryUpdateCounter path, and the second BootReasonUpdate
  cycle that nodeagent issues into the same partition.

Both tests configure timer.test.baseimage.update / timer.config.interval
/ timer.deviceinfo.interval up front so the cycle finishes in a few
minutes per phase rather than the default ten. Both tests
eveimage-remove + brief wait before eveimage-update so re-running
with the same EVE version still triggers a fresh install.

retry_update is gated on devmodel == ZedVirtual-4G or VBox because
it relies on `eden eve ssh -- iptables -A OUTPUT -d <adam-ip> -j DROP`
to silently drop controller traffic.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

retry_update commits the new EVE version (12.1.0 by default) onto the
active partition as part of its success assertion. Without a revert,
EVE stays on that version forever, which (a) leaves the device on a
non-coverage build if the test ran in a coverage-instrumented G3
sequence, and (b) breaks any downstream test/suite that assumes EVE
is on the originally-configured `eve.tag`.

Mirror the revert sequence already used by the standalone
update_eve_image flow (eveimage-remove the leftover BaseOsConfig,
download the original rootfs, eveimage-update back, wait for the
active partition to flip, eveimage-remove again, eden eve reset).

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…e_fallback

force_fallback pushes a BaseOsConfig for 12.1.0 to exercise the
force-fallback path, but only `eden eve reset` (clears device config)
at the end, not `eveimage-remove`. The 12.1.0 BaseOsConfig therefore
lingers in adam after the test, which leaks state into subsequent
tests / suites — most notably retry_update, which begins with its
own eveimage-remove + eveimage-update and depends on adam being in a
known-clean state for the second eveimage-update to actually fire a
fresh install request.

Add a final eveimage-remove of {{ $short_version }} before the
eden eve reset.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Two follow-ups for the baseosmgr suite:

1. Raise outer -test.timeout values in eden.baseosmgr.tests.txt to
   match realistic suite runtime on the coverage-instrumented EVE
   build. retry_update issues four sequential lim.test waits at
   -timewait 30m each, plus 6 minutes of exec sleep — the 60-minute
   outer cap killed the suite mid-revert. Bumped to 90m for
   retry_update and 45m for force_fallback. The test-script
   -timewait values are unchanged; only the wrapper cap relaxes.

2. Add a get-config assertion after every eveimage-remove call so the
   test fails loudly if the controller config still references the
   removed image. The current eden CLI EdgeNodeEVEImageRemove only
   removes the legacy baseosconfig list entry and leaves the modern
   single-block baseos field + contentInfo[] populated; this
   assertion surfaces that bug end-to-end. (The corresponding eden
   CLI fix lands in PR lf-edge#1172.)

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>