Commit 463045b

eriknordmark and claude committed
docs(nodeagent): add architecture document

Adds pkg/pillar/docs/nodeagent.md describing the responsibilities, pubsub inputs and outputs, internal logical components, control flow through the four main paths (boot/onboarding, periodic timer tick, controller-driven device operation, baseos upgrade) and a debugging section. Structured to mirror the existing nim.md so the pillar docs remain consistent across microservices.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 63fa63d commit 463045b

1 file changed

Lines changed: 399 additions & 0 deletions

pkg/pillar/docs/nodeagent.md

@@ -0,0 +1,399 @@
1+
# Node Agent
2+
3+
## Overview
4+
5+
`nodeagent` is the EVE microservice responsible for the lifecycle of the node
6+
itself — as opposed to the workloads it runs. Its main jobs are:
7+
8+
* drive the **A/B baseos upgrade** process, in cooperation with `baseosmgr` and
9+
`zedagent`, including the post-upgrade *test* window during which the new
10+
image must prove itself by reaching the controller before being marked
11+
`active`,
12+
* **reboot, shut down or power off** the device when asked by the controller
13+
(via `zedagent`) or when an internal health timer expires,
14+
* watch a small set of **health signals** (controller reachability, vault
15+
state, TPM sanity, free disk space, certificate refusal, kubernetes node
16+
drain) and either trigger a reboot or push the device into
17+
*Maintenance Mode*,
18+
* on every boot, **reconstruct and report the previous reboot** (reason, boot
19+
reason, stack/dmesg, image, time) and bump the persistent `restartCounter`,
20+
* surface installer logs from the very first boot of a freshly installed image.
21+
22+
`nodeagent` is intentionally small (a single `nodeagentContext` event loop, no
23+
sub-packages); most of its complexity lives in *when* to do something rather
24+
than *how*. The "how" — flipping partition states, reaching the controller —
25+
is delegated to `baseosmgr`/`zboot` and `zedagent`.
26+
27+
## Key Input/Output
28+
29+
**nodeagent consumes** (all via pubsub unless noted):
30+
31+
* global configuration properties
32+
* `ConfigItemValueMap` from `zedagent`
33+
* supplies the four health-timer thresholds:
34+
`timer.reboot.no.network` (`ResetIfCloudGoneTime`),
35+
`timer.update.fallback.no.network` (`FallbackIfCloudGoneTime`),
36+
`timer.test.baseimage.update` (`MintimeUpdateSuccess`),
37+
`timer.vault.ready.cutoff` (`VaultReadyCutOffTime`)
38+
* zedagent status
39+
* `ZedAgentStatus` from `zedagent`
40+
* carries the controller-driven `RebootCmd` / `ShutdownCmd` / `PoweroffCmd`
41+
requests with a `RequestedRebootReason` and `RequestedBootReason`,
42+
* carries `ConfigGetStatus` (`Success`, `TemporaryFail`, `ReadSaved`,
43+
`Fail`) which is the heartbeat used to drive every "have we lost the
44+
controller?" timer,
45+
* carries `EdgeNodeCertsRefused` to drive the corresponding
46+
Maintenance Mode reason
47+
* zboot status
48+
* `ZbootStatus` per partition from `baseosmgr`
49+
* tells nodeagent when the *other* partition has been flipped to
50+
`updating` (→ schedule a reboot into the new image) and when the
51+
*current* partition has been flipped to `active` (→ upgrade fully
52+
committed)
53+
* domain status
54+
* `DomainStatus` from `domainmgr`
55+
* polled while a reboot/shutdown/poweroff is in flight to wait for all
56+
app domains to be halted
57+
* vault status
58+
* `VaultStatus` from `vaultmgr` — drives `MaintenanceModeReasonVaultLockedUp`
59+
and, if an upgrade is in progress and the vault never opens within
60+
`VaultReadyCutOffTime`, triggers a fallback reboot
61+
(`BootReasonVaultFailure`)
62+
* TPM sanity status
63+
* `TpmSanityStatus` from `tpmmgr` — drives
64+
`MaintenanceModeReasonTpmEncFailure`
65+
* volume manager status
66+
* `VolumeMgrStatus` from `volumemgr` — its `RemainingSpace` field drives
67+
`MaintenanceModeReasonNoDiskSpace`
68+
* node drain status (kubevirt builds only)
69+
* `kubeapi.NodeDrainStatus` from `zedkube`
70+
* keeps `WaitDrainInProgress` set in `NodeAgentStatus` so that `zedagent`
71+
holds back the controller-requested reboot/shutdown/poweroff until the
72+
kube node has finished draining its workloads
73+
* onboarding status
74+
* `OnboardingStatus` (via `wait.WaitForOnboarded`) — nodeagent blocks on
75+
this once before joining the main event loop
76+
* on-disk state (read at start)
77+
* `/persist/reboot-reason`, `/persist/boot-reason`, `/persist/reboot-stack`,
78+
`/persist/reboot-image` (via `agentlog.Get*`), used to reconstruct the
79+
previous reboot,
80+
* `/persist/SMART_details.json` and `/persist/SMART_details_previous.json`
81+
— SMART power-cycle counters from the storage controller; consulted
82+
when no reboot reason was recorded, to distinguish a dirty power-off
83+
(counter incremented) from a kernel panic / watchdog reset
84+
(counter unchanged),
85+
* `/run/global/first-boot` — marker dropped by the installer on the very
86+
first boot; presence sets the boot reason to `BootReasonFirst`,
87+
* `/persist/installer/installer.log` plus
88+
`/persist/installer/send-require` — installer output to be replayed
89+
into the regular log stream,
90+
* `/persist/status/restartcounter` — monotonic restart counter,
91+
* `/persist/fault-injection/readfile` — fault-injection knob.
92+
93+
**nodeagent publishes**:
94+
95+
* `NodeAgentStatus` (consumed by `zedagent`, ultimately surfaced to the
96+
controller)
97+
* current active partition (`IMGA` or `IMGB`),
98+
* `UpdateInprogress` plus the `RemainingTestTime` countdown shown to the
99+
operator during post-upgrade validation,
100+
* `DeviceReboot` / `DeviceShutdown` / `DevicePoweroff` plus
101+
`AllDomainsHalted` (the fine-grained progression of the operation),
102+
* `RebootReason`, `BootReason`, `RebootStack`, `RebootTime`, `RebootImage`
103+
from the previous boot,
104+
* `RestartCounter`,
105+
* `LocalMaintenanceMode` and the multi-reason
106+
`LocalMaintenanceModeReasons`,
107+
* `HVTypeKube`, `WaitDrainInProgress`
108+
* `ZbootConfig` — one entry per partition (`IMGA`, `IMGB`); the only
109+
meaningful field is `TestComplete`, which is flipped to `true` when the
110+
post-upgrade test window expires successfully and `baseosmgr` should
111+
commit the new image. This publication is **persistent**: it is read back
112+
on next boot.
113+
* `RebootReason` / `BootReason` files in `/persist/` — written via
114+
`agentlog.RebootReason()` just before issuing the actual `zboot.Reset()`
115+
or `zboot.Poweroff()` so that the *next* boot of nodeagent can pick them
116+
up.
117+
118+
## Components
119+
120+
Unlike NIM, nodeagent is not split into separately-testable components with
121+
Go interfaces between them. It is a single `nodeagentContext` running one
122+
goroutine for the main event loop. The logical responsibilities, however,
123+
are cleanly partitioned across the source files; that partitioning is the
124+
right place to think about coverage:
125+
126+
### Lifecycle / pubsub wiring (`nodeagent.go`)
127+
128+
`Run()` initializes the agent, creates the publishers and subscribers,
129+
starts a 10-second `tickerTimer` and a 25-second `stillRunning` watchdog,
130+
blocks for `GlobalConfig` and onboarding to be available, and then enters
131+
the main `select` loop. The same file also contains the handlers for the
132+
non-zboot subscriptions: `globalConfig`, `vaultStatus`, `volumeMgrStatus`,
133+
`tpmStatus`, and the `zedAgentStatus` ingress that translates controller
134+
device-ops into local `handleDeviceCmd()` calls.
135+
136+
### Reboot-reason reconstruction (`nodeagent.go`)
137+
138+
`handleLastRebootReason()` is called once at startup. It reads anything the
139+
*previous* boot left behind (`agentlog.GetRebootReason()`,
140+
`GetBootReason()`, `GetRebootImage()`), and if nothing was recorded it
141+
synthesizes a default using two side-channel signals:
142+
143+
* `/run/global/first-boot` (set by the installer) → `BootReasonFirst`,
144+
* the `PowerCycleCount` delta between current and previous SMART snapshots →
145+
`BootReasonPowerFail` (count went up: dirty power cycle) versus
146+
`BootReasonKernel` (count unchanged: kernel panic / watchdog with no
147+
kdump),
148+
* fallback: `BootReasonUnknown`.
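
The fallback chain above can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the code in `nodeagent.go`: the `smartSnapshot` type, its `Valid` flag and the `inferBootReason` name are hypothetical, the constant values are placeholders for the real pillar enum, and treating a missing snapshot or a decreasing counter as "unknown" is an assumption.

```go
package main

// BootReason is a stand-in for the pillar boot-reason enum (values are placeholders).
type BootReason string

const (
	BootReasonFirst     BootReason = "BootReasonFirst"
	BootReasonPowerFail BootReason = "BootReasonPowerFail"
	BootReasonKernel    BootReason = "BootReasonKernel"
	BootReasonUnknown   BootReason = "BootReasonUnknown"
)

// smartSnapshot is a hypothetical reduction of one SMART details file.
type smartSnapshot struct {
	Valid           bool  // false if the file was missing or unparsable
	PowerCycleCount int64 // power-cycle counter reported by the storage device
}

// inferBootReason picks a default boot reason when the previous boot
// recorded none, using the first-boot marker and the SMART counter delta.
func inferBootReason(firstBootMarker bool, prev, cur smartSnapshot) BootReason {
	switch {
	case firstBootMarker:
		return BootReasonFirst
	case !prev.Valid || !cur.Valid:
		return BootReasonUnknown // assumption: no usable snapshots, no inference
	case cur.PowerCycleCount > prev.PowerCycleCount:
		return BootReasonPowerFail // counter went up: dirty power cycle
	case cur.PowerCycleCount == prev.PowerCycleCount:
		return BootReasonKernel // counter unchanged: panic or watchdog reset
	default:
		return BootReasonUnknown // assumption: a decreasing counter is not trusted
	}
}
```

Keeping the decision free of file I/O makes each branch easy to cover with table-driven tests.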
149+
150+
The reboot stack, if any, is logged line-by-line and (if it is bigger than
151+
~1700 bytes) tail-truncated so that the publication fits in pubsub. This
152+
function is also where `restartCounter` gets read, incremented, and
153+
written back.
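
A size cap like the one described for the reboot stack could look roughly as follows. This sketch assumes that "tail-truncated" means the tail is dropped and the head is kept, and that the limit is counted in bytes; `truncateStack` and `maxStackBytes` are hypothetical names, and the real code may cut differently.

```go
package main

import "unicode/utf8"

// maxStackBytes mirrors the "~1700 bytes" figure from the text (assumed exact here).
const maxStackBytes = 1700

// truncateStack returns stack unchanged if it fits within limit bytes,
// otherwise the longest prefix of at most limit bytes that is valid UTF-8.
func truncateStack(stack string, limit int) string {
	if len(stack) <= limit {
		return stack
	}
	cut := limit
	// Back up until cut sits on a rune boundary so no character is split.
	for cut > 0 && !utf8.RuneStart(stack[cut]) {
		cut--
	}
	return stack[:cut]
}
```

A caller would use it as `truncateStack(rebootStack, maxStackBytes)` before publishing.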
154+
155+
### Health timers (`handletimers.go`)
156+
157+
`handleDeviceTimers()` fires every 10 seconds and is the heart of the
158+
agent. It only operates on its own monotonic `timeTickCount` (incremented
159+
by the timer interval), never on wall-clock time, so that NTP jumping the
160+
clock by decades on first boot does not trip every timer at once. It runs
161+
four checks in order:
162+
163+
1. **`handleFallbackOnCloudDisconnect`** — only when an upgrade is being
164+
tested. If the controller has been unreachable for
165+
`FallbackIfCloudGoneTime`, the new image is presumed bad: schedule a
166+
   reboot with `BootReasonFallback`; `baseosmgr` will then flip the
167+
partition back.
168+
2. **`handleRebootOnVaultLocked`** — if `vaultmgr` reports the vault as
169+
`DATASEC_AT_REST_ERROR`, wait at most `VaultReadyCutOffTime`. If an
170+
upgrade is in progress when the deadline fires, reboot with
171+
`BootReasonVaultFailure` (the upgrade fails); otherwise enter
172+
Maintenance Mode with `MaintenanceModeReasonVaultLockedUp`.
173+
3. **`handleResetOnCloudDisconnect`** — independently of any upgrade, if
174+
the controller has been unreachable for `ResetIfCloudGoneTime`,
175+
schedule a reboot with `BootReasonDisconnect`. This is the long-tail
176+
"we have lost the cloud, try a clean restart" timer, intended to
177+
recover from odd hardware/driver failures (for example a hung Ethernet
178+
adapter) that a reboot is likely to clear.
179+
4. **`handleUpgradeTestValidation`** — if a post-upgrade test is in flight
180+
(`testInprogress`) and `MintimeUpdateSuccess` has elapsed, declare the
181+
   image good: clear the test and set `ZbootConfig.TestComplete = true` so
182+
`baseosmgr` commits the partition.
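
The tick-count idea behind these checks can be illustrated with a short sketch. It covers only the two controller-reachability checks and skips the vault and test-validation steps; `tickClock`, `checkDisconnect` and the `action` values are hypothetical names chosen for the example, not identifiers from `handletimers.go`.

```go
package main

// tickClock counts elapsed seconds by summing ticker intervals, so a
// wall-clock jump (for example from NTP) never moves it.
type tickClock struct {
	intervalSec uint32 // ticker period, 10 in the text
	now         uint32 // seconds elapsed according to ticks only
}

// tick is called once per ticker firing.
func (c *tickClock) tick() { c.now += c.intervalSec }

// exceeded reports whether more than limitSec tick-seconds passed since `since`.
func (c *tickClock) exceeded(since, limitSec uint32) bool {
	return c.now-since > limitSec
}

type action int

const (
	actionNone action = iota
	actionFallbackReboot // would map to BootReasonFallback
	actionResetReboot    // would map to BootReasonDisconnect
)

// checkDisconnect applies the fallback check first (only during an upgrade
// test) and then the general reset check, in the order given in the text.
func checkDisconnect(c *tickClock, lastReachable uint32, updateInprogress bool,
	fallbackSec, resetSec uint32) action {
	if updateInprogress && c.exceeded(lastReachable, fallbackSec) {
		return actionFallbackReboot
	}
	if c.exceeded(lastReachable, resetSec) {
		return actionResetReboot
	}
	return actionNone
}
```

Because `lastReachable` is also recorded in tick-seconds, both sides of the comparison come from the same clock.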
183+
184+
`updateZedagentCloudConnectStatus()` translates `ConfigGetStatus`
185+
transitions into `lastControllerReachableTime` updates and into
186+
start/restart/clear of the test window.
187+
188+
`handleDeviceCmd()` and `scheduleNodeOperation()` are the entry points for
189+
both controller-driven (`RebootCmd`/`ShutdownCmd`/`PoweroffCmd`) and
190+
internally-driven device operations. They guard against double-trigger,
191+
update `NodeAgentStatus`, and spawn `handleNodeOperation()` in its own
192+
goroutine.
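
The double-trigger guard described above can be sketched as follows. The `opScheduler` type and its `publish` and `start` hooks are hypothetical stand-ins for `publishNodeAgentStatus()` and `go handleNodeOperation(op)`; the real functions take more arguments and live on `nodeagentContext`.

```go
package main

type nodeOp int

const (
	opReboot nodeOp = iota
	opShutdown
	opPoweroff
)

// opScheduler holds the minimum state needed to refuse a second request.
type opScheduler struct {
	inFlight bool            // true once any device operation was accepted
	current  nodeOp          // the accepted operation
	publish  func()          // stand-in for publishNodeAgentStatus()
	start    func(op nodeOp) // stand-in for `go handleNodeOperation(op)`
}

// schedule accepts the first request and ignores later ones, so a timer and
// a controller command cannot both start a shutdown sequence.
func (s *opScheduler) schedule(op nodeOp) bool {
	if s.inFlight {
		return false
	}
	s.inFlight = true
	s.current = op
	s.publish()
	s.start(op)
	return true
}
```

No mutex is used here on the assumption that all callers run on the single event-loop goroutine.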
193+
194+
`handleNodeOperation()` waits `minRebootDelay` (30s by default), persists
195+
the reboot reason via `agentlog.RebootReason()`, calls
196+
`waitForAllDomainsHalted()` (poll `DomainStatus` up to
197+
`maxDomainHaltTime`), `syscall.Sync()`, waits another `minRebootDelay`,
198+
flushes coverage data, and finally calls `zboot.Reset()` or
199+
`zboot.Poweroff()`. A 120-second backstop goroutine `os.Exit(0)`s the
200+
process if the zboot call hangs — the underlying `reboot` syscall has
201+
been seen to stall inside the kernel due to kernel bugs, so the backstop
202+
ensures the in-kernel watchdog takes over and restarts the node.
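
The backstop pattern can be shown in isolation. This is a sketch of the general technique rather than the nodeagent implementation: `callWithBackstop` is a hypothetical helper that reports a timeout to the caller, whereas the text describes a goroutine that exits the process directly.

```go
package main

import "time"

// callWithBackstop runs call in its own goroutine and reports whether it
// returned before limit elapsed. On timeout the goroutine is left behind,
// which is acceptable when the caller is about to exit the process anyway.
func callWithBackstop(call func(), limit time.Duration) bool {
	done := make(chan struct{})
	go func() {
		call()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(limit):
		return false
	}
}
```

A caller following the text would exit when the function returns `false`, for example with a 120-second limit around the reset call, so that the watchdog can restart the node.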
203+
204+
### A/B upgrade orchestration (`handlebaseos.go`, `handlezboot.go`)
205+
206+
`handleZbootStatusImpl()` is the inbound side. When the *current*
207+
partition transitions to `active` while we still thought the upgrade was
208+
in progress, the agent latches the upgrade as fully committed
209+
(`updateInprogress=false`, etc.). It then dispatches:
210+
211+
* `doZbootBaseOsInstallationComplete()` — the *other* partition just
212+
became `updating` (a new image was written): schedule a reboot with
213+
`BootReasonUpdate` and a friendly `NORMAL: baseos-update(...) to EVE
214+
version X reboot` message.
215+
* `doZbootBaseOsTestValidationComplete()` — the *current* partition's
216+
`TestComplete` flag has been observed back from `baseosmgr` after we
217+
set it; clear it on the config side and mark `updateComplete=true`.
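
The three reactions to a `ZbootStatus` update can be modelled as one function over a tiny state struct. All names here (`partStatus`, `upgradeCtx`, `onZbootStatus`, `configTestDone`) are hypothetical, partition states are plain strings for brevity, and the reboot is reduced to recording a reason; the real handlers also publish status and check partition labels.

```go
package main

// partStatus is a reduced view of one partition's ZbootStatus.
type partStatus struct {
	Current      bool   // true for the partition we booted from
	State        string // "active", "inprogress", "updating", ...
	TestComplete bool
}

// upgradeCtx holds the upgrade-related flags mentioned in the text.
type upgradeCtx struct {
	updateInprogress bool
	testComplete     bool
	updateComplete   bool
	configTestDone   bool   // our published ZbootConfig.TestComplete
	rebootReason     string // non-empty once a reboot was scheduled
}

func onZbootStatus(ctx *upgradeCtx, st partStatus) {
	// Current partition went active while an upgrade was pending: committed.
	if st.Current && ctx.updateInprogress && st.State == "active" {
		ctx.updateInprogress = false
		ctx.testComplete = false
		ctx.updateComplete = false
	}
	// Other partition holds a freshly written image: reboot into it.
	if !st.Current && st.State == "updating" && ctx.rebootReason == "" {
		ctx.rebootReason = "BootReasonUpdate"
	}
	// TestComplete observed back after we set it: clear our side.
	if st.Current && st.TestComplete && ctx.configTestDone {
		ctx.configTestDone = false
		ctx.updateComplete = true
	}
}
```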
218+
219+
`handlezboot.go` contains the small lookup helpers (`lookupZbootConfig`,
220+
`lookupZbootStatus`, `getZbootOtherPartition`,
221+
`isZbootOtherPartitionStateUpdating`, `publishZbootConfig*`). All
222+
real partition operations are delegated to the `pillar/zboot` package
223+
which knows about `IMGA`/`IMGB`/`grubenv`.
224+
225+
### Kube node-drain glue (`handlenodedrain.go`)
226+
227+
Kubevirt builds receive a `kubeapi.NodeDrainStatus` from `zedkube`. As
228+
long as a drain initiated by *device-op* (reboot/shutdown/poweroff) is
229+
between `REQUESTED` and `COMPLETE`, nodeagent keeps
230+
`WaitDrainInProgress` set so that `zedagent` continues to hold back the
231+
device op. On `COMPLETE`, the flag is cleared and the device op is
232+
allowed to proceed.
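
The flag computation can be reduced to a single predicate. The enum values below are assumptions made for the sketch: only `REQUESTED` and `COMPLETE` are named in the text, so the intermediate state and the requester type are hypothetical.

```go
package main

// drainStatus is an ordered stand-in for the kubeapi drain states.
type drainStatus int

const (
	drainUnknown   drainStatus = iota
	drainRequested             // REQUESTED in the text
	drainStarting              // hypothetical intermediate state
	drainComplete              // COMPLETE in the text
)

// drainRequester records who asked for the drain.
type drainRequester int

const (
	requesterNone drainRequester = iota
	requesterDeviceOp // reboot/shutdown/poweroff
	requesterOther    // anything else, ignored by this predicate
)

// waitDrainInProgress is true only while a device-op drain has been
// requested and has not yet completed.
func waitDrainInProgress(st drainStatus, by drainRequester) bool {
	return by == requesterDeviceOp && st >= drainRequested && st < drainComplete
}
```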
233+
234+
### Maintenance Mode
235+
236+
Maintenance Mode is a multi-reason flag (`MaintenanceModeMultiReason`)
237+
maintained via two helpers, `addMaintenanceModeReason()` and
238+
`removeMaintenanceModeReason()`. Each contributing handler (vault, TPM,
239+
disk space, certs-refused) calls these and re-publishes
240+
`NodeAgentStatus`. The mode is only fully cleared when *every* reason
241+
has been removed.
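
A multi-reason flag of this kind can be sketched as a small set type. The `reasonSet` type and the reason strings are hypothetical; the real `MaintenanceModeMultiReason` and its helpers may be shaped differently.

```go
package main

// reason names one cause for Maintenance Mode.
type reason string

// reasonSet is an ordered set of active reasons.
type reasonSet []reason

// add inserts r and reports whether the set changed.
func (s *reasonSet) add(r reason) bool {
	for _, have := range *s {
		if have == r {
			return false
		}
	}
	*s = append(*s, r)
	return true
}

// remove deletes r and reports whether the set changed.
func (s *reasonSet) remove(r reason) bool {
	for i, have := range *s {
		if have == r {
			*s = append((*s)[:i], (*s)[i+1:]...)
			return true
		}
	}
	return false
}

// inMaintenance is true while at least one reason remains.
func (s reasonSet) inMaintenance() bool { return len(s) > 0 }
```

Returning whether the set changed lets a handler skip re-publishing status when nothing moved.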
242+
243+
## Control-flow
244+
245+
There are four largely independent control paths through nodeagent.
246+
247+
### 1. Boot and onboarding
248+
249+
```text
250+
Run()
251+
└─ subscribe GlobalConfig
252+
└─ wait for GCInitialized (sets log levels)
253+
└─ parseSMARTData()
254+
└─ handleLastRebootReason() (publishes nothing yet,
255+
└─ handleInstallationLog() updates ctx fields)
256+
└─ create publications, ZbootConfig, NodeAgentStatus
257+
└─ subscribe vault/volume/tpm
258+
└─ publishZbootConfigAll() (one entry per partition)
259+
└─ ctx.updateInprogress = zboot.IsCurrentPartitionStateInProgress()
260+
└─ publishNodeAgentStatus() (first publication)
261+
└─ subscribe DomainStatus
262+
└─ wait.WaitForOnboarded()
263+
└─ setTestStartTime() (no-op unless updateInprogress)
264+
└─ subscribe ZbootStatus, ZedAgentStatus, NodeDrainStatus
265+
└─ event loop
266+
```
267+
268+
### 2. Periodic timer tick (every 10s)
269+
270+
```text
271+
tickerTimer fires
272+
→ updateTickerTime() (advance ctx.timeTickCount)
273+
→ handleFallbackOnCloudDisconnect() (only if updateInprogress)
274+
→ handleRebootOnVaultLocked() (only if vault disabled)
275+
→ handleResetOnCloudDisconnect() (always)
276+
→ handleUpgradeTestValidation() (only if testInprogress)
277+
```
278+
279+
### 3. Controller-driven device operation
280+
281+
```text
282+
zedagent publishes ZedAgentStatus{RebootCmd:true, …}
283+
→ handleZedAgentStatusImpl()
284+
→ handleDeviceCmd(op=Reboot)
285+
→ scheduleNodeOperation(reason, bootReason, op)
286+
→ ctx.deviceReboot = true
287+
→ publishNodeAgentStatus() (zedagent now sees DeviceReboot)
288+
→ go handleNodeOperation(op)
289+
├ wait minRebootDelay
290+
├ agentlog.RebootReason(...) (persists reason for next boot)
291+
├ waitForAllDomainsHalted()
292+
├ ctx.allDomainsHalted = true; publish
293+
├ syscall.Sync(); wait minRebootDelay
294+
├ flushCoverage
295+
└ zboot.Reset() / zboot.Poweroff()
296+
```
297+
298+
The very same `scheduleNodeOperation()` is what the health-timer checks
299+
call when they decide the device must be reset.
300+
301+
### 4. Baseos upgrade
302+
303+
```text
304+
(a) "other partition is updating" — new image just written
305+
  baseosmgr → ZbootStatus(other = updating)
306+
→ handleZbootStatusImpl()
307+
→ doZbootBaseOsInstallationComplete()
308+
→ scheduleNodeOperation(BootReasonUpdate, Reboot)
309+
310+
(b) post-reboot, current partition still inprogress — test window
311+
Run() sets updateInprogress = true
312+
→ setTestStartTime() once GlobalConfig is in
313+
ConfigGetStatus = Success keeps lastControllerReachableTime fresh
314+
After MintimeUpdateSuccess seconds:
315+
handleUpgradeTestValidation()
316+
→ initiateBaseOsControllerTestComplete()
317+
→ publish ZbootConfig{TestComplete:true} for curPart
318+
319+
(c) baseosmgr acknowledges by flipping curPart to active and clearing
320+
its TestComplete in ZbootStatus:
321+
handleZbootStatusImpl():
322+
if curPart && updateInprogress && state==active:
323+
updateInprogress = false; testComplete = false; updateComplete = false
324+
doZbootBaseOsTestValidationComplete():
325+
republish ZbootConfig with TestComplete=false; updateComplete=true
326+
```
327+
328+
If the test window times out without the controller being reachable,
329+
`handleFallbackOnCloudDisconnect()` instead schedules a fallback
330+
reboot (`BootReasonFallback`). `baseosmgr` then rolls the partition
331+
back on the next boot.
332+
333+
## Debugging
334+
335+
### PubSub
336+
337+
On a running device:
338+
339+
```sh
340+
jq . /run/nodeagent/NodeAgentStatus/nodeagent.json
341+
jq . /persist/status/nodeagent/ZbootConfig/IMGA.json
342+
jq . /persist/status/nodeagent/ZbootConfig/IMGB.json
343+
```
344+
345+
The first shows the agent's view of update/reboot state and the list of
346+
maintenance-mode reasons. The other two show whether nodeagent has
347+
asked `baseosmgr` to commit the new image (`TestComplete`).
348+
349+
Persistent files of interest under `/persist/`:
350+
351+
* `status/restartcounter` — number of restarts of pillar
352+
* `reboot-reason`, `boot-reason`, `reboot-stack`, `reboot-image`:
353+
written just before reboot; consumed and discarded on next boot
354+
* `SMART_details.json`, `SMART_details_previous.json` — power-cycle
355+
counter snapshots used by the boot-reason heuristic
356+
* `installer/installer.log`, `installer/send-require` — installer
357+
output to be replayed into the log stream on first post-install boot
358+
359+
### Logs
360+
361+
Useful `grep` patterns:
362+
363+
```text
364+
"Current partition RebootReason" – previous boot's reason as read at startup
365+
"found bootReason" – previous boot's BootReason
366+
"Default RebootReason" – nodeagent had to synthesize one
367+
"Starting upgrade validation for" – post-upgrade test window opening
368+
"inprogress, waiting for" – periodic countdown of remaining test time
369+
"Upgrade Validation Test Complete" – post-upgrade test window expired OK
370+
"Exceeded fallback outage" – BootReasonFallback path firing
371+
"Exceeded outage for controller" – BootReasonDisconnect path firing
372+
"Exceeded time for vault to be ready" – BootReasonVaultFailure path firing
373+
"setting MaintenanceModeReason" – addMaintenanceModeReason()
374+
"clearing MaintenanceModeReason" – removeMaintenanceModeReason()
375+
"No reason to be in maintenance mode" – mode fully cleared
376+
"baseos-update(" – BootReasonUpdate scheduling
377+
"handleNodeOperation: minRebootDelay" – the 30s pre-reboot pause
378+
"waitForAllDomainsHalted" – polling DomainStatus before reboot
379+
"Doing a sync.." – just before zboot.Reset/Poweroff
380+
"nodedrain-step:" – kube node-drain glue
381+
```
382+
383+
### Forcing transitions for development
384+
385+
* Reboot/shutdown/poweroff via the controller is the normal path; on
386+
a dev device it can also be exercised by making `zedagent` publish
387+
`ZedAgentStatus{RebootCmd:true,…}`.
388+
* The fallback / reset timers can be exercised by cutting controller
389+
reachability (`eden eve link down` in eden) for longer than the
390+
configured `timer.update.fallback.no.network` /
391+
`timer.reboot.no.network`.
392+
* The post-upgrade test window can be shortened with
393+
`timer.test.baseimage.update=30` (used by the
394+
`update_eve_image` eden test).
395+
* The fault-injection knob `/persist/fault-injection/readfile` causes
396+
nodeagent to read an arbitrary file at startup. Pointing it at a
397+
large file is the easiest way to drive pillar into an out-of-memory
398+
condition, which then triggers the OOM-killer and a watchdog reboot —
399+
useful for exercising the OOM/watchdog path end-to-end.
