diff --git a/pkg/pillar/docs/baseosmgr.md b/pkg/pillar/docs/baseosmgr.md
new file mode 100644
index 00000000000..8ce2e01aebb
--- /dev/null
+++ b/pkg/pillar/docs/baseosmgr.md
@@ -0,0 +1,502 @@
+# Base OS Manager
+
+## Overview
+
+`baseosmgr` is the EVE microservice responsible for the **A/B partition state
+machine** that backs every base-os upgrade. It owns the two `IMGA`/`IMGB`
+partitions: which one is `active`, which one is `inprogress` / `updating` /
+`unused`, and the version strings recorded in each. Its main jobs are:
+
+* take a controller-supplied `BaseOsConfig` (a content tree UUID and a desired
+  `BaseOsVersion`) and **install the new image into the *other* partition** —
+  but only after `volumemgr` has finished downloading and verifying it,
+* on `Activate=true`, **flip the other partition's GRUB state to `updating`**
+  so that the next reboot lands on it; the actual reboot is `nodeagent`'s job,
+* on `BaseOsConfig` **delete** while an install is still in flight, **cancel
+  the pending install worker** and, if the other partition has already been
+  flipped to `updating`, **roll it back to `unused`** so a later reboot does
+  not pick up the abandoned image,
+* on the post-upgrade test pass (signalled by `nodeagent` via
+  `ZbootConfig{TestComplete:true}` for the current partition), **commit the
+  upgrade** by calling `zboot.MarkCurrentPartitionStateActive`,
+* surface, into `BaseOsStatus`, the **reason a previous upgrade failed** —
+  copying the previous boot's `RebootReason`/`RebootTime` (received via
+  `NodeAgentStatus`) onto the BaseOsStatus that owns the now-`inprogress`
+  other partition,
+* respect the **kubevirt node-drain** protocol: defer the partition flip until
+  `zedkube` reports the node has finished draining,
+* implement the **force-fallback** knob: if the controller bumps
+  `ForceFallbackCounter`, mark the (currently `unused`) other partition as
+  `updating` so `nodeagent` reboots into the prior image,
+* implement the **retry-update
counter**: if the controller bumps
+  `RetryUpdateCounter` while the other partition is in `inprogress` with the
+  same version that just failed, kick the partition state back to `updating`
+  so the same image is tried again across a reboot,
+* maintain a published mirror of the on-disk GRUB state, **`ZbootStatus`**,
+  one entry per partition, that every other agent (`nodeagent`, `zedagent`)
+  reads instead of going through the `zboot` package directly.
+
+`baseosmgr` itself never reboots the device, never decides *when* to reboot,
+and never decides whether the upgrade is "good enough". Those policy
+decisions live in `nodeagent`. `baseosmgr` is the *mechanism* underneath:
+mostly a wrapper around `pkg/pillar/zboot`, the `volumemgr`/content-tree
+flow, and a tiny background `worker` pool that performs the actual `dd` of
+the rootfs image into the partition device.
+
+## Key Input/Output
+
+**baseosmgr consumes** (all via pubsub unless noted):
+
+* base os configuration
+  * `BaseOsConfig` from `zedagent`, keyed on `ContentTreeUUID`
+  * carries `BaseOsVersion`, `Activate`, `RetryUpdateCounter`. Empty
+    `ContentTreeUUID` is rejected as a config error.
+* content tree status
+  * `ContentTreeStatus` from `volumemgr`
+  * the download/load progress for the rootfs blob; baseosmgr will not
+    even consider installing until `State == LOADED`.
+* zboot config (the test-complete signal)
+  * `ZbootConfig` from `nodeagent`, one entry per partition (`IMGA`,
+    `IMGB`); only `TestComplete` is meaningful. When it flips to `true`
+    for the *current* partition while that partition is `inprogress`,
+    that is `nodeagent`'s "post-upgrade test passed, commit it" message.
+* nodeagent status (last reboot reason)
+  * `NodeAgentStatus` from `nodeagent`
+  * baseosmgr only consumes `RebootReason` / `RebootTime` / `RebootImage`
+    from the *previous* boot — used by `updateBaseOsStatusOnReboot` to
+    surface the failure onto whichever `BaseOsStatus` owns the partition
+    that we just rolled back from.
+* zedagent status (force-fallback knob)
+  * `ZedAgentStatus` from `zedagent`
+  * baseosmgr only consumes `ForceFallbackCounter`. A bump (relative to
+    the persistent file `/persist/checkpoint/forceFallbackCounter`)
+    triggers `handleForceFallback` to flip the *unused* other partition
+    to `updating`.
+* node-drain status (kubevirt builds)
+  * `kubeapi.NodeDrainStatus` from `zedkube`
+  * gates the partition flip when EVE is running kubevirt; while a drain
+    is in progress, baseosmgr stashes the deferred BaseOs uuid on the
+    context and re-runs the status update on `COMPLETE`.
+* global configuration
+  * `ConfigItemValueMap` from `zedagent`; only used to set log level —
+    baseosmgr has no behavior knobs of its own here.
+* on-disk state (read at start)
+  * `/persist/status/current_retry_update_counter` and
+    `/persist/status/config_retry_update_counter` — last seen value of
+    `RetryUpdateCounter` from a successful update, and from the most
+    recent `BaseOsConfig`. Used to detect whether the controller bumped
+    the counter relative to a known-good state.
+  * `/persist/checkpoint/forceFallbackCounter` — last seen value of
+    `ForceFallbackCounter`; baseline against incoming
+    `ZedAgentStatus.ForceFallbackCounter`.
+* on-disk state (read indirectly via `zboot`)
+  * GRUB env (`grubenv`) on the boot disk: partition states (`active`,
+    `inprogress`, `unused`, `updating`), short/long version strings.
+    baseosmgr publishes a pubsub mirror; everything else in pillar reads
+    that mirror instead of GRUB.
+* startup gates (synchronous waits, not subscriptions)
+  * `wait.WaitForOnboarded` (UUID known) — happens **before** any
+    publication or subscription is set up,
+  * `wait.WaitForVault` (vault unlocked) — `volumemgr` downloads EVE-OS
+    images into `/persist/vault` and is therefore not operational until
+    the vault is open,
+  * `containerd.WaitForUserContainerd` (user containerd ready) — needed
+    because the rootfs image lands as an OCI ref and the install worker
+    has to be able to read blobs out of it.
+
+**baseosmgr publishes**:
+
+* `BaseOsStatus`, keyed on `ContentTreeUUID` (consumed by `zedagent`,
+  forwarded to the controller)
+  * `BaseOsVersion`, `Activated`, `TooEarly` (failed because previous was
+    still in test), `PartitionLabel`/`PartitionState`/`PartitionDevice`,
+    `State` (the `volumemgr`-style state from `INITIAL` through `INSTALLED`),
+    and an `ErrorAndTime` block for failures.
+* `ZbootStatus` per partition (`IMGA`, `IMGB`), keyed on `PartitionLabel`
+  (consumed by `nodeagent` and `zedagent`)
+  * `PartitionState` (the GRUB state), `CurrentPartition` (`true` for the
+    one we booted from), `PartitionDevname` (e.g. `/dev/sda3`),
+    `ShortVersion`/`LongVersion` (read out of the partition's
+    `/etc/eve-release`), `TestComplete` (mirrors what we observed back
+    from the on-disk env after acting on `ZbootConfig`).
+* `BaseOSMgrStatus` with key `"global"`
+  * just `CurrentRetryUpdateCounter` — the value of `RetryUpdateCounter`
+    at the time of the last successful update commit. Consumed by
+    `zedagent` so the controller can tell which retry attempt's outcome
+    it is looking at.
+* `NodeDrainRequest` with key `"global"` (kubevirt builds)
+  * published when an upgrade has reached the partition-flip step but a
+    drain is still required; consumed by `zedkube`.
+* persistent files written under `/persist/`
+  * `status/current_retry_update_counter`, `status/config_retry_update_counter`
+    via `fileutils.WriteRename`,
+  * `checkpoint/forceFallbackCounter` likewise.
+* GRUB-env writes (via `zboot`, *not* via pubsub)
+  * `SetOtherPartitionStateUpdating` (after a successful install, and on
+    force-fallback / retry-update),
+  * `SetOtherPartitionStateUnused` (when version inside the freshly-dd'd
+    image does not match what we expected),
+  * `MarkCurrentPartitionStateActive` (commit, on
+    `ZbootConfig{TestComplete:true}` for the current partition).
+  * The actual `dd` of the rootfs onto the partition device is
+    `zboot.WriteToPartition`, called only from the install worker.
+
+## Components
+
+`baseosmgr` is one event loop in `Run()` plus a single-purpose background
+worker pool. The logical responsibilities are partitioned across the source
+files as follows.
+
+### Lifecycle / pubsub wiring (`baseosmgr.go`)
+
+`Run()` waits for onboarding, sets up the three publications
+(`BaseOsStatus`, `ZbootStatus`, `BaseOSMgrStatus`), reads the persistent
+retry counters, calls `updateAndPublishZbootStatusAll` to seed `ZbootStatus`
+from `zboot`, then activates all subscriptions. The 25-second `stillRunning`
+ticker is the only periodic work — the rest is pure event handling. The
+event loop blocks on `subGlobalConfig`, `subBaseOsConfig`, `subZbootConfig`,
+`subContentTreeStatus`, `subNodeAgentStatus`, `subZedAgentStatus`,
+`subNodeDrainStatus`, `worker.MsgChan()`, and the watchdog ticker.
+
+The same file contains the pubsub dispatch wrappers
+(`handleBaseOsConfigCreate/Modify/Delete`, `handleZbootConfigCreate/Modify/Delete`,
+`handleNodeAgentStatusCreate/Modify`, `handleGlobalConfigImpl`) and the trivial
+publication helpers `initializeSelfPublishHandles`, `publishBaseOSMgrStatus`.
+
+### BaseOs config processing (`handlebaseos.go`)
+
+`baseOsHandleStatusUpdate` → `doBaseOsStatusUpdate` is the heart of the
+agent. The decision tree, in order, is:
+
+1. content-tree errors → propagate to `BaseOsStatus.Error`,
+2. version already in `current` partition → mark `INSTALLED`/`Activated`,
+3.
version already in `other` partition (and `Activate=true`) → mark
+   `DOWNLOADED`, fall through to overwrite anyway (ContentTree might
+   have been re-downloaded),
+4. EVE-k vs non-EVE-k personality switch → reject with an error,
+5. `doBaseOsInstall`: `validatePartition` (refuse if other = `inprogress`
+   with same version — that's the "previous attempt failed" case), then
+   `checkBaseOsVolumeStatus` (returns *not done* until ContentTree is
+   `LOADED`),
+6. if `Activate=false` → `doBaseOsInactivate` (currently a no-op that
+   just notes "flip will happen on reboot"),
+7. `validateAndAssignPartition`: refuse if current = `inprogress` or
+   other = `active` (we're still in the test window of a different
+   upgrade); otherwise assign `PartitionLabel = otherPartName`,
+8. `doBaseOsActivate`: size check (image vs partition), call
+   `installDownloadedObjects` to schedule the `dd` worker, on worker
+   completion call `zboot.SetOtherPartitionStateUpdating`,
+   `checkInstalledVersion` (read `ShortVersion` back out of the freshly
+   written partition; on mismatch flip back to `unused`).
+
+The `baseOsHandleStatusUpdateUUID` wrapper is what `volumemgr` and the
+worker call back into; it adds the `shouldDeferForNodeDrain` check before
+re-entering `baseOsHandleStatusUpdate`.
+
+### Partition state mirror (`handlebaseos.go`)
+
+`updateAndPublishZbootStatusAll`, `updateAndPublishZbootStatus`,
+`createZbootStatus`, `getZbootStatus`, `publishZbootStatus` form the
+pubsub mirror of the GRUB env. `baseOsGetActivationStatus` and
+`baseOsSetPartitionInfoInStatus` propagate the published partition state
+into individual `BaseOsStatus` entries. `getPartitionState` prefers the
+published mirror but falls back to a live `zboot` read.
+
+### Test-complete / commit (`handlebaseos.go`)
+
+`handleZbootTestComplete` is the inbound side of nodeagent's
+"test-passed" signal.
When `ZbootConfig.TestComplete` flips to `true`
+for the current partition (and that partition is in `inprogress`), it
+calls `zboot.MarkCurrentPartitionStateActive`, mirrors the new state
+into `ZbootStatus`, calls `updateAndPublishBaseOsStatusAll` so every
+`BaseOsStatus` picks up the new `PartitionState`, then
+`maybeRetryInstall` to kick anything that had been `TooEarly`.
+
+### Reboot-reason surfacing (`handlebaseos.go`)
+
+`updateBaseOsStatusOnReboot` runs on every `NodeAgentStatus` arrival.
+If the *other* partition is `inprogress` (i.e. nodeagent rebooted us
+back off a failed upgrade) and a `BaseOsStatus` matches that
+partition's `ShortVersion`, `handleOtherPartRebootReason` copies the
+previous boot's `RebootReason`/`RebootTime` onto the BaseOsStatus, so
+the controller learns *why* the upgrade failed — typically `Fallback`
+from a missed test window, or a kernel panic stack from the new image.
+
+### Retry-update counter (`handlebaseos.go`)
+
+`handleUpdateRetryCounter` is invoked from
+`handleBaseOsConfigCreate/Modify` and from the tail of
+`handleZbootTestComplete`. The branches are:
+
+* current partition not `active` → ignore the counter (we're still
+  testing something),
+* `isImageInErrorState` (other partition is `inprogress` with a
+  matching `BaseOsConfig.Activate=true`) and counter changed → save
+  counter, re-arm the retry by calling
+  `zboot.SetOtherPartitionStateUpdating` so the partition flip
+  happens again on next reboot,
+* otherwise → just save the counter and republish `BaseOSMgrStatus`.
+
+### Force-fallback (`forcefallback.go`)
+
+`handleForceFallback` is the inbound from `ZedAgentStatus`. It writes
+`/persist/checkpoint/forceFallbackCounter` and, when the counter
+changes, checks for the very narrow precondition (current is
+`active`, other is `unused`, other has a non-empty `ShortVersion`)
+and does `zboot.SetOtherPartitionStateUpdating` plus a `BaseOsStatus`
+publish.
The precondition is intentionally strict: this is a
+"controller-driven roll-back to the previous image", not a generic
+partition reshuffler.
+
+### Content-tree dispatch (`handlevolumemgr.go`)
+
+`handleContentTreeStatusImpl` looks up which `BaseOsStatus` is using
+this content tree and calls `baseOsHandleStatusUpdateUUID` for each.
+This is what walks the install state machine forward as
+`volumemgr` advances the download.
+
+### Install worker (`handledownload.go`, `worker.go`)
+
+`installDownloadedObjects` / `installDownloadedObject` either submit
+work to the worker pool or pop a result from it. The worker (`installWorker`)
+calls `zboot.WriteToPartition(log, ref, target)` — this is the actual
+`dd if= of=/dev/sdX`. The result handler
+`processInstallWorkResult` re-enters
+`baseOsHandleStatusUpdateUUID`, where the `wres != nil` branch in
+`installDownloadedObject` then returns `proceed=true`.
+
+### Node-drain glue (`handlenodedrain.go`)
+
+`shouldDeferForNodeDrain` is called from `baseOsHandleStatusUpdateUUID`
+just before activation kicks in — i.e. before the install worker writes
+the image and before `zboot.SetOtherPartitionStateUpdating` flips the
+other partition state. It either returns `false` (non-kube build, drain
+not required, or drain `COMPLETE`) or stashes
+`ctx.deferredBaseOsID` and returns `true` (drain needed; will be
+retried when `NodeDrainStatus.COMPLETE` arrives).
+`handleNodeDrainStatusImpl` is the inbound side that picks the
+deferred id back up.
+
+## Control-flow
+
+There are five largely independent control paths through baseosmgr,
+plus two controller-driven side channels (force-fallback and retry-update).
+
+### 1.
Boot and onboarding
+
+```text
+Run()
+  └─ wait.WaitForOnboarded()            (block until we know our UUID)
+  └─ initializeSelfPublishHandles       (BaseOsStatus, ZbootStatus, BaseOSMgrStatus)
+  └─ readSavedCurrentRetryUpdateCounter
+  └─ readSavedConfigRetryUpdateCounter
+  └─ publishBaseOSMgrStatus             (first publication)
+  └─ initializeGlobalConfigHandles      (subscriptions activated immediately)
+  └─ initializeNodeAgentHandles         (NodeAgentStatus, ZedAgentStatus, ZbootConfig)
+  └─ initializeZedagentHandles          (BaseOsConfig)
+  └─ initializeVolumemgrHandles         (ContentTreeStatus)
+  └─ initializeNodeDrainHandles         (NodeDrainStatus, NodeDrainRequest)
+  └─ updateAndPublishZbootStatusAll     (seed ZbootStatus from zboot+grubenv)
+  └─ ctx.worker = worker.NewPool(... installWorker ...)
+  └─ pubZbootStatus.SignalRestarted()   (lets nodeagent know seeding is done)
+  └─ wait for GCInitialized
+  └─ wait.WaitForVault()
+  └─ containerd.WaitForUserContainerd()
+  └─ event loop
+```
+
+`pubZbootStatus.SignalRestarted()` is critical: `nodeagent` waits for it
+before it derives `updateInprogress` from the partition state.
+
+### 2. New BaseOsConfig (controller-driven upgrade)
+
+```text
+zedagent → BaseOsConfig{ContentTreeUUID, BaseOsVersion, Activate=true}
+  → handleBaseOsConfigCreate
+      → validateBaseOsConfig
+      → publishBaseOsStatus                    (empty, with version)
+      → baseOsHandleStatusUpdate
+          → baseOsGetActivationStatus
+          → doBaseOsStatusUpdate
+              ├─ same version in current?      → INSTALLED/Activated, return
+              ├─ same version in other?        → DOWNLOADED, fall through
+              ├─ EVE-k personality mismatch?   → error, return
+              ├─ doBaseOsInstall
+              │    ├─ validatePartition        (other=inprogress same ver? → fail)
+              │    └─ checkBaseOsVolumeStatus  (waits for ContentTreeStatus.LOADED)
+              ├─ Activate=false?               → doBaseOsInactivate, return
+              ├─ validateAndAssignPartition    (curr=inprogress|other=active?
→ TooEarly)
+              └─ doBaseOsActivate
+                   ├─ check partition state ∈ {unused,inprogress,updating}
+                   ├─ check partition size ≥ image size
+                   ├─ installDownloadedObjects
+                   │    └─ AddWorkInstall      (worker → zboot.WriteToPartition)
+                   ├─ wait for worker.MsgChan() result
+                   ├─ checkInstalledVersion    (read ShortVersion back; mismatch → unused)
+                   └─ zboot.SetOtherPartitionStateUpdating
+      → handleUpdateRetryCounter
+```
+
+The corresponding pubsub side-effects out of this path are
+`BaseOsStatus` republishes at every step (state, error, partition
+fields, `Activated` once the worker finishes) and a `ZbootStatus`
+republish when the partition state flips to `updating`.
+
+### 3. Volume / content-tree progress
+
+```text
+volumemgr → ContentTreeStatus(state advances DOWNLOADING→VERIFIED→LOADED)
+  → handleContentTreeStatusImpl
+      → lookupBaseOsStatusesByContentID
+      → for each: baseOsHandleStatusUpdateUUID
+          ├─ if content LOADED and we're about to flip:
+          │    shouldDeferForNodeDrain   (kube → stash deferredBaseOsID, return)
+          └─ baseOsHandleStatusUpdate    (re-enter doBaseOsStatusUpdate)
+```
+
+This is what walks the install state machine forward. While
+`MinState < LOADED`, `checkBaseOsVolumeStatus` returns `done=false`
+and the install path no-ops; once `LOADED` arrives — and we're
+about to actually switch to it — `baseOsHandleStatusUpdateUUID`
+is also where the kubevirt drain gate fires, before
+`doBaseOsActivate` runs. The same wrapper is re-entered on
+worker completion (`processInstallWorkResult`) and on drain
+completion (`handleNodeDrainStatusImpl`).
+
+### 4.
Test-complete / commit
+
+```text
+nodeagent → ZbootConfig{IMG?: TestComplete=true}
+  → handleZbootConfigImpl
+      → handleZbootTestComplete
+          ├─ Key() must be currentPart, currentPart must be inprogress
+          ├─ zboot.MarkCurrentPartitionStateActive
+          ├─ publishZbootStatus(curr: TestComplete=true)
+          ├─ updateAndPublishZbootStatusAll    (re-read all partition states)
+          ├─ updateAndPublishBaseOsStatusAll   (propagate into every BaseOsStatus)
+          ├─ maybeRetryInstall                 (kick anything previously TooEarly)
+          └─ handleUpdateRetryCounter          (sync currentUpdateRetry → save)
+```
+
+If `MarkCurrentPartitionStateActive` itself fails (rare —
+disk-write failure on the boot disk), the BaseOsStatus picks up the
+error and the partition is left in `inprogress`; on the next boot
+nodeagent will hit the fallback path again.
+
+### 5. Reboot-after-failed-update
+
+```text
+nodeagent → NodeAgentStatus{RebootReason, RebootImage, RebootTime}
+  → handleNodeAgentStatusImpl
+      → ctx.rebootReason / rebootTime / rebootImage = ...
+      → updateBaseOsStatusOnReboot
+          ├─ otherPart in inprogress?
+          ├─ matching BaseOsStatus by partLabel + ShortVersion?
+          └─ handleOtherPartRebootReason
+               ├─ if rebootImage == currentPart → no-op (we booted *this* image, no rollback)
+               └─ status.SetError(rebootReason, rebootTime)   // surface the failure
+```
+
+Validation that the failure happened on the *other* image (rather
+than the current one) is what the `rebootImage == curPart` early
+return handles.
+
+### 6. Side channels
+
+* **Force-fallback** (`forcefallback.go`): bumping
+  `ZedAgentStatus.ForceFallbackCounter` while curr=active and
+  other=unused is the controller's "switch back to the previous
+  image" knob. baseosmgr writes the new counter to
+  `/persist/checkpoint/forceFallbackCounter` and flips the other
+  partition to `updating`; nodeagent then reboots us into it.
+* **Retry-update** (`handlebaseos.go`): bumping
+  `BaseOsConfig.RetryUpdateCounter` while curr=active and
+  other=inprogress with a matching `BaseOsConfig.Activate=true` is
+  the controller's "try the failed image again" knob. baseosmgr
+  saves the counter to
+  `/persist/status/config_retry_update_counter` and flips the
+  other partition to `updating`.
+
+## Debugging
+
+### PubSub
+
+On a running device:
+
+```sh
+# What the controller asked for
+ls /run/zedagent/BaseOsConfig/
+cat /run/zedagent/BaseOsConfig/.json | jq
+
+# What baseosmgr is doing about it
+ls /run/baseosmgr/BaseOsStatus/
+cat /run/baseosmgr/BaseOsStatus/.json | jq
+
+# The mirror of the GRUB env (consumed by nodeagent + zedagent)
+cat /run/baseosmgr/ZbootStatus/IMGA.json | jq
+cat /run/baseosmgr/ZbootStatus/IMGB.json | jq
+
+# nodeagent's commit signal back to baseosmgr
+cat /persist/status/nodeagent/ZbootConfig/IMGA.json | jq
+cat /persist/status/nodeagent/ZbootConfig/IMGB.json | jq
+
+# The retry counter snapshot
+cat /run/baseosmgr/BaseOSMgrStatus/global.json | jq
+```
+
+A healthy idle device has `IMGx.PartitionState=active` for the current
+partition, `IMGy.PartitionState=unused` for the other, and
+`BaseOsStatus.Activated=true` for whichever uuid matches the
+`active` partition's `ShortVersion`. During an upgrade the other
+partition transitions `unused → updating → inprogress → active`.
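The partition lifecycle described above can be written down as a small transition table. The Go sketch below is illustrative only: `PartState`, `allowed`, and `CanTransition` are hypothetical names that do not come from `zboot` or `baseosmgr`, and the table lists only the transitions this document mentions for the partition being upgraded.

```go
package main

import "fmt"

// PartState is a hypothetical stand-in for the GRUB partition state strings.
type PartState string

const (
	Unused     PartState = "unused"
	Updating   PartState = "updating"
	InProgress PartState = "inprogress"
	Active     PartState = "active"
)

// allowed lists only the transitions named in this document for the
// partition being upgraded; the real rules live in zboot and GRUB.
var allowed = map[PartState][]PartState{
	// install finished, or force-fallback
	Unused: {Updating},
	// reboot into the new image, or cancel / version mismatch
	Updating: {InProgress, Unused},
	// test passed (commit), or retry-update re-arms the flip
	InProgress: {Active, Updating},
}

// CanTransition reports whether the table permits moving from one state to another.
func CanTransition(from, to PartState) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	// Walk the happy path of an upgrade, then try an illegal shortcut.
	path := []PartState{Unused, Updating, InProgress, Active}
	for i := 0; i+1 < len(path); i++ {
		fmt.Printf("%s -> %s allowed: %v\n", path[i], path[i+1], CanTransition(path[i], path[i+1]))
	}
	fmt.Printf("unused -> active allowed: %v\n", CanTransition(Unused, Active))
}
```

A table like this can help when reading `ZbootStatus` dumps: a state pair that is not in the table suggests something outside the paths described here touched the GRUB env.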
+
+Persistent files of interest under `/persist/`:
+
+* `status/current_retry_update_counter` — last counter that succeeded
+* `status/config_retry_update_counter` — last counter we acknowledged
+  from `BaseOsConfig`
+* `checkpoint/forceFallbackCounter` — last counter we acknowledged
+  from `ZedAgentStatus`
+
+### Logs
+
+Useful `grep` patterns:
+
+```text
+"doBaseOsStatusUpdate"                 – top-level state-machine entry, prints the BaseOsConfig
+"validatePartition"                    – early reject (other=inprogress same version)
+"validateAndAssignPartition"           – partition assignment / TooEarly path
+"Image size .* greater than partition" – size precheck failure
+"installWorker to install"             – the actual dd starting
+"installWorker DONE install"           – the dd finished
+"Mark other partition .* unused"       – version-mismatch rollback after install
+"checkInstalledVersion"                – reading version back out of the partition
+"handleZbootTestComplete"              – commit path entry
+"MarkCurrentPartitionStateActive"      – commit succeeded
+"Handle ForceFallbackCounter update"   – force-fallback knob bumped
+"ForceFallback from .* to"             – force-fallback actually firing
+"handleUpdateRetryCounter"             – retry-counter machinery
+"UpdateRetry from .* to"               – retry-counter actually firing
+"shouldDeferForNodeDrain"              – kubevirt drain gate
+"nodedrain-step:"                      – kubevirt drain glue
+"updateBaseOsStatusOnReboot"           – surfacing previous-boot error onto BaseOsStatus
+```
+
+### Forcing transitions for development
+
+* The normal upgrade path is exercised by
+  `eden controller edge-node eveimage-update file://.squashfs`,
+  which makes `zedagent` publish a `BaseOsConfig{Activate:true}`.
+  See `tests/update_eve_image/testdata/update_eve_image_http.txt`.
+* To exercise the *post-test commit* path quickly, set
+  `timer.test.baseimage.update=30` so nodeagent's test window is 30s
+  rather than the default 600s.
+* To exercise the *fallback* (rollback) path, see the eden tests
+  under `tests/nodeagent/testdata/baseos_fallback_*.txt`: they cut
+  controller reachability during the test window so nodeagent
+  reboots back to the previous image.
+* To exercise the *retry-update* path, install a known-bad image
+  (so the test window times out and the partition lands in
+  `inprogress`), then bump `RetryUpdateCounter` in `BaseOsConfig`.
+* To exercise *force-fallback*, after a successful upgrade so the
+  other partition is `unused` with a previous version, bump
+  `ForceFallbackCounter` in the controller.
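When setting up the retry-update or force-fallback scenarios above, it can help to state their preconditions as predicates. The following Go sketch restates the conditions given in this document; the `Part` type, both function names, and the use of `int` for the counters are assumptions made for illustration and are not the actual `baseosmgr` code.

```go
package main

import "fmt"

// Part is a hypothetical, simplified view of one ZbootStatus entry.
type Part struct {
	State        string // "active", "inprogress", "unused" or "updating"
	ShortVersion string
}

// ForceFallbackApplies restates the force-fallback precondition: the counter
// changed, current is active, and other is unused with a recorded version.
func ForceFallbackApplies(saved, incoming int, cur, other Part) bool {
	return incoming != saved &&
		cur.State == "active" &&
		other.State == "unused" &&
		other.ShortVersion != ""
}

// RetryUpdateApplies restates the retry-update precondition: the counter
// changed, current is active, and other is inprogress holding the version
// that an Activate=true BaseOsConfig asks for.
func RetryUpdateApplies(saved, incoming int, cur, other Part, cfgVersion string, activate bool) bool {
	return incoming != saved &&
		cur.State == "active" &&
		other.State == "inprogress" &&
		activate &&
		other.ShortVersion == cfgVersion
}

func main() {
	// Version strings here are made-up placeholders.
	cur := Part{State: "active", ShortVersion: "1.0.0"}
	prev := Part{State: "unused", ShortVersion: "0.9.0"}
	failed := Part{State: "inprogress", ShortVersion: "1.1.0"}

	fmt.Println(ForceFallbackApplies(0, 1, cur, prev))                // counter bumped
	fmt.Println(RetryUpdateApplies(2, 3, cur, failed, "1.1.0", true)) // counter bumped
	fmt.Println(RetryUpdateApplies(3, 3, cur, failed, "1.1.0", true)) // counter unchanged
}
```

In both cases a `true` result corresponds to the point where the document says the other partition is flipped to `updating`; if a test bump appears to do nothing, checking these conditions against `ZbootStatus` is a reasonable first step.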