From 5b6e46682624a2473f240df8c25dfe50ac60d7c5 Mon Sep 17 00:00:00 2001 From: Vasundhara Volam Date: Wed, 18 Mar 2026 17:23:06 +0000 Subject: [PATCH 01/14] Add Smart Switch DPU Reliability and Availability HLD Add High Level Design document covering DPU failure scenarios for Smart Switch, including software failures, hardware failures, and NPU/switch level failures. Signed-off-by: Vasundhara Volam --- .../pmon/enhance-dpu-robustness.md | 622 ++++++++++++++++++ 1 file changed, 622 insertions(+) create mode 100644 doc/smart-switch/pmon/enhance-dpu-robustness.md diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md new file mode 100644 index 00000000000..8950cab5c7a --- /dev/null +++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md @@ -0,0 +1,622 @@ +# Smart Switch: PMON: Enhance DPU Robustness # + +## Table of Content ## + +- [Revision](#revision) +- [Scope](#scope) +- [Definitions/Abbreviations](#definitionsabbreviations) +- [Overview](#overview) +- [Terminology](#terminology) +- [Critical Processes for DPU Management](#critical-processes-for-dpu-management) +- [Timers and Thresholds](#timers-and-thresholds) +- [DPU Status DB Info](#dpu-status-db-info) +- [DPU Recovery State Machine](#dpu-recovery-state-machine) +- [DPU Software Failures](#dpu-software-failures) + - [Critical Process Restart on DPU](#critical-process-restart-on-dpu) + - [Critical Process Persistently Down on DPU](#critical-process-persistently-down-on-dpu) + - [pmon Crash on NPU](#pmon-crash-on-npu) + - [databasedpu Crash on NPU](#databasedpu-crash-on-npu) +- [DPU Hardware Failures](#dpu-hardware-failures) + - [DPU Hardware Failure (Complete DPU Down)](#dpu-hardware-failure-complete-dpu-down) + - [DPU Power Failure / Unexpected Shutdown](#dpu-power-failure--unexpected-shutdown) + - [PCIe Failure](#pcie-failure) +- [NPU / Switch Level Failures](#npu--switch-level-failures) + - [NPU Kernel Crash / Memory 
Exhaustion](#npu-kernel-crash--memory-exhaustion) +- [Planned Operations](#planned-operations) + - [DPU Graceful Shutdown](#dpu-graceful-shutdown) + - [DPU Cold Reboot](#dpu-cold-reboot) + - [Full SmartSwitch Reboot](#full-smartswitch-reboot) +- [Scenario DB State Summary](#scenario-db-state-summary) +- [Repository Change Summary](#repository-change-summary) +- [References](#references) + +--- + +## Revision ## + +| Rev | Author | Change Description | +| :---: | :----------------: | -------------------------------------- | +| 0.1 | Vasundhara Volam | Initial Version | + +--- + +## Scope ## + +This document covers the High Level Design for DPU failure scenarios on a SmartSwitch from the PMON (Platform Monitor) perspective — specifically focused on detection, DB state management, and recovery actions performed by `chassisd` and other PMON sub-daemons. + +The scope includes: + +- DPU software failures (critical process crashes and restarts on DPU; pmon and databasedpu crashes on NPU) +- DPU hardware failures (complete DPU down, power failure / unexpected shutdown, PCIe failure) +- NPU/switch-level failures (kernel crash, memory exhaustion) +- DB state tracking for DPU failure detection and recovery (new and existing DB entries) +- DB state tracking for planned operations +- PMON critical process definitions and criticality levels +- Timers and thresholds used by PMON for failure detection and recovery + +--- + +## Definitions/Abbreviations ## + +| Term | Meaning | +| ---- | ------------------------------------------------------- | +| API | Application Programming Interface | +| ASIC | Application-Specific Integrated Circuit | +| CLI | Command-Line Interface | +| DB | Redis Database | +| DPU | Data Processing Unit | +| gNOI | gRPC Network Operations Interface | +| gRPC | Google Remote Procedure Call | +| NPU | Network Processing Unit | +| PCIe | PCI Express (Peripheral Component Interconnect Express) | +| PMON | Platform Monitor | +| RPC | Remote Procedure Call | +| 
SAI | Switch Abstraction Interface | + +--- + +## Overview ## + +SmartSwitch consists of one NPU (switch ASIC) and multiple DPUs. All front panel ports are connected to the NPU. DPUs are connected to the NPU via PCIe and back-panel ports. + +The PMON (Platform Monitor) daemon on the NPU is responsible for monitoring DPU health and managing DPU lifecycle operations. Its primary sub-daemon, `chassisd`, continuously polls DPU states (midplane, control plane, data plane), detects failures, performs recovery actions (power-cycle, PCIe rescan), and updates database entries to reflect DPU readiness. + +This document enumerates all failure scenarios that can occur on a DPU or its supporting infrastructure from the PMON perspective, describes detection mechanisms driven by `chassisd`, recovery paths, and the corresponding database state changes. It also covers planned operations (graceful shutdown, cold reboot, full SmartSwitch reboot) and the DB state changes introduced to support them. + +--- + +## Terminology ## + +| Term | Explanation | +| ---- | ----------- | +| chassisd | Chassis daemon running inside `pmon` on the NPU; monitors DPU health states, manages DPU power-cycle and reset operations | +| pmon | Platform Monitor daemon on NPU; hosts `chassisd` and other hardware monitoring sub-daemons | +| syncd | Sync daemon; manages SAI API calls to DPU ASIC | +| control plane state | DPU SONiC is booted up, all containers are up, interfaces are up, and DPU is ready to accept configuration. Derived from SYSTEM_READY in STATE_DB. Values: `"unknown"`, `"up"`, `"down"`. | +| midplane link state | The PCIe link between the NPU and DPU is operational. Monitored and updated by NPU pmon `chassisd` via the `is_midplane_reachable` platform API. Values: `"unknown"`, `"up"`, `"down"`. | +| dataplane state | Configuration is downloaded, pipeline stages are up, and DPU hardware (port/ASIC) is ready to take traffic. Values: `"unknown"`, `"up"`, `"down"`. 
| + +--- + +## Critical Processes for DPU Management ## + +The following processes are critical for SmartSwitch DPU lifecycle management. A failure in any of these impacts the ability to monitor, recover, or manage DPUs. + +**PMON-managed processes (on NPU):** + +| Process | Role | Failure Impact | +| ------- | ---- | -------------- | +| `chassisd` | Monitors DPU health (midplane, control plane, data plane); manages power-cycle, reset, and DB state updates | All DPU failure detection and recovery stops; no DB updates | +| `pcied` | Monitors PCIe link state between NPU and DPUs; updates `PCIE_DETACH_INFO` in STATE_DB | PCIe failures go undetected; `PCIE_DETACH_INFO` not updated | + +**Other critical NPU processes:** + +| Process | Container | Role | Failure Impact | +| ------- | --------- | ---- | -------------- | +| `gnoi_reboot_daemon.py` | `gnmi` | Sends gNOI Reboot RPCs to DPUs for graceful shutdown / reboot | Graceful shutdown and planned reboot operations fail; DPU cannot be halted cleanly before power-cycle | +| `sysmgr` | Host | Routes DPU planned shutdown and reboot requests to host services for execution | Planned DPU reset operations cannot be carried out | + + +--- + +## Timers and Thresholds ## + +All timers and thresholds used by PMON for DPU failure detection and recovery are listed below. Values shown are defaults; some are configurable via `platform.json`. + +| Timer / Threshold | Default Value | Configurable | Used By | Description | +| ----------------- | :-----------: | :----------: | ------- | ----------- | +| `chassisd` health poll interval | 10 seconds | No | `chassisd` | Interval at which `chassisd` polls `dpu_control_plane_state`, `dpu_data_plane_state`, and `dpu_midplane_link_state` | +| DPU auto-recovery timeout | 60 seconds | Yes (`platform.json`) | `chassisd` | Time allowed for a DPU to recover from a critical process restart before escalating. 
If `dpu_control_plane_state` or `dpu_midplane_link_state` remains `down` beyond this timeout, `chassisd` initiates a DPU reset, provided the DPU is still up and running. | +| DPU power-cycle timeout | 180 seconds | Yes | `chassisd` | Time `chassisd` waits, measured from the first failure detection, for `dpu_control_plane_state` to return to `up` before issuing a power-cycle | +| `pcied` PCIe poll interval | 60 seconds | No | `pcied` | Interval at which `pcied` checks PCIe link status for all DPUs. A PCIe failure may go undetected for up to 60 seconds. | +| `dpu_halt_services_timeout` | 60 seconds | Yes (`platform.json`) | `gnoi_reboot_daemon.py` | Maximum time to wait for DPU services to halt gracefully during reboot/shutdown | +| `reset_limit` | 5 | Yes (`platform.json`) | `chassisd` | Maximum number of consecutive unplanned power-cycle attempts before marking DPU as unrecoverable | + +> **Note:** `chassisd` polls `dpu_data_plane_state` alongside `dpu_control_plane_state` and `dpu_midplane_link_state`, but `dpu_data_plane_state` alone does not trigger recovery actions. A data-plane-down with control-plane-up scenario indicates that the DPU SONiC stack is running but the data plane pipeline has not converged; this is expected during initial programming or after a configuration change. Recovery is triggered only when `dpu_control_plane_state` or `dpu_midplane_link_state` transitions to `down`. The `dpu_data_plane_state` is used by `chassisd` solely to determine full DPU readiness for setting `ready_status` to `true`. + +--- + +## DPU Status DB Info ## + +### Existing DB entries ### + +The following DB entries track the DPU lifecycle state and are referenced during failure detection and recovery. 
+ +**DPU State in CHASSIS_STATE_DB:** + +``` +DPU_STATE|DPU: +{ + "dpu_control_plane_state": "up" | "down", + "dpu_control_plane_time": "", + "dpu_data_plane_state": "up" | "down", + "dpu_data_plane_time": "", + "dpu_midplane_link_state": "up" | "down", + "dpu_midplane_link_time": "" +} +``` + +**PCIe Detach Info in STATE_DB:** + +``` +PCIE_DETACH_INFO|DPU: +{ + "dpu_id": "", + "dpu_state": "detaching" | "detached" | "reattached", + "bus_info": "[DDDD:]BB:SS.F" +} +``` + +**Graceful Shutdown / Reboot Tracking in STATE_DB:** + +``` +CHASSIS_MODULE_TABLE|DPU: +{ + "oper_status": "Online" | "Offline", + "state_transition_in_progress": "True" | "False", + "transition_start_time": "", + "transition_type": "shutdown" | "reboot" | "none" +} +``` + +> **Note:** The `state_transition_in_progress`, `transition_start_time`, and `transition_type` fields are managed by the graceful-shutdown implementation in [sonic-gnmi](https://github.com/sonic-net/sonic-gnmi) and [sonic-utilities](https://github.com/sonic-net/sonic-utilities). These fields are not managed by sonic-platform-daemons. + +### New DB entries ### + +The following DB entries will now be newly created to track DPU failure states. + +**DPU additional Info in CHASSIS_STATE_DB on NPU** + +``` +DPU_STATE|DPU: +{ + "ready_status": "true" | "false", + "recovery_status": "recoverable" | "unrecoverable", + "reset_count": "", + "last_down_time": "", + "last_ready_time": "" +} +``` + +| Field | Description | Set by | Cleared by | +| ----- | ----------- | ------ | ---------- | +| `ready_status` | Set to `"true"` when the DPU is fully up and ready (midplane, control plane, data plane all up). Set to `"false"` when the DPU goes down or undergoes a reset. | `chassisd` | `chassisd` (set to `"false"` on failure/reset) | +| `recovery_status` | Set to `"recoverable"` on initialization. Set to `"unrecoverable"` when `reset_count` reaches `reset_limit`. 
| `chassisd` | `chassisd` (reset to `"recoverable"` on planned restart) | +| `reset_count` | Number of unplanned DPU resets. Reset to 0 on `chassisd` reset on NPU (e.g., NPU reboot, `pmon` restart). | `chassisd` | `chassisd` | +| `last_down_time` | UTC timestamp of the last time the DPU went down | `chassisd` | — | +| `last_ready_time` | UTC timestamp of the last time the DPU became ready | `chassisd` | — | + +**DPU Auto-Recovery Feature in CONFIG_DB on NPU** + +``` +FEATURE|dpu-auto-recovery: +{ + "state": "enabled" | "disabled" | "always_disabled", + "auto_restart": "enabled" | "disabled", + "high_mem_alert": "disabled" +} +``` + +| Field | Default | Description | +| ----- | ------- | ----------- | +| `state` | `enabled` | Enable or disable the DPU auto-recovery feature. When `disabled` or `always_disabled`, `chassisd` will not automatically power-cycle DPUs on failure. | +| `auto_restart` | `enabled` | Standard SONiC FEATURE table field — enables `systemd` to restart the feature's associated service if it crashes. | +| `high_mem_alert` | `disabled` | Standard SONiC FEATURE table field — high memory usage alert threshold. | + +> **Note:** `dpu-auto-recovery` is **not** a separate service or container. It is a feature flag entry in CONFIG_DB's `FEATURE` table, read by `chassisd` (running inside the `pmon` container) to determine whether automatic DPU power-cycle recovery is enabled. The `auto_restart` and `high_mem_alert` fields are standard SONiC FEATURE table fields required by the feature infrastructure; they do not govern `chassisd` itself. When `state` is `disabled`, `chassisd` still monitors and updates DPU states in CHASSIS_STATE_DB, but will not initiate automatic power-cycle recovery. Manual intervention is required to recover failed DPUs. + +--- + +## DPU Recovery State Machine ## + +The following diagram shows the state transitions managed by `chassisd` for a single DPU. 
Each box represents a `chassisd`-observed DPU state; edges show the triggers and actions. + +```mermaid +stateDiagram-v2 + [*] --> Booting : DPU power on + + Booting --> Ready : All states up + + Ready --> SWFailure : Control plane down + SWFailure --> Ready : Self recovers within 60s + SWFailure --> PowerCycle : 180s timeout expires + + Ready --> PowerCycle : HW failure detected + + PowerCycle --> Booting : Power cycle issued + PowerCycle --> Unrecoverable : reset count >= reset limit + + Ready --> PlannedShutdown : CLI module shutdown + Ready --> PlannedReboot : CLI reboot DPU + + PlannedShutdown --> Offline : gNOI HALT then power down + PlannedReboot --> Booting : gNOI HALT then power cycle + + Offline --> Booting : CLI module startup + + Unrecoverable --> Booting : chassisd reset on NPU +``` + +| State | `ready_status` | `recovery_status` | Key DB Indicators | +| ----- | :------------: | :----------------: | ----------------- | +| **Booting** | `false` | `recoverable` | `dpu_control_plane_state: unknown` | +| **Ready** | `true` | `recoverable` | All three states `up` | +| **SWFailure** | `false` | `recoverable` | `dpu_control_plane_state: down`, `dpu_midplane_link_state: up` | +| **PowerCycle** | `false` | `recoverable` | `chassisd` issuing power-cycle; `reset_count` incremented | +| **Offline** | `false` | `recoverable` | `oper_status: Offline` | +| **Unrecoverable** | `false` | `unrecoverable` | `reset_count` ≥ `reset_limit` | + +--- + +## DPU Software Failures ## + +### Critical process restart on DPU ### + +**Description:** +A process in the `syncd` or `swss` containers crashes on the DPU, the container supervisor restarts it successfully, and the DPU recovers on its own within the auto-recovery timeout (default: 60 seconds). No power-cycle is needed. + +**Detection (by PMON):** +- `chassisd` on the NPU polls `dpu_control_plane_state` every 10 seconds and observes it as `down`. 
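
The polling-based detection and the two timeouts can be sketched as follows. This is a minimal illustration, not the `chassisd` implementation: the class `OutageTracker`, its method `observe`, and the phase names it returns are hypothetical, and the 60-second and 180-second values are the defaults listed in the Timers and Thresholds table. The sketch assumes "beyond the timeout" means strictly more than 60 seconds and that the 180-second timer is measured from the first poll that saw `down`.

```python
from typing import Optional

AUTO_RECOVERY_TIMEOUT = 60   # seconds; default auto-recovery timeout
POWER_CYCLE_TIMEOUT = 180    # seconds; measured from first detection


class OutageTracker:
    """Hypothetical helper that tracks how long the control plane has been down."""

    def __init__(self) -> None:
        self.down_since: Optional[float] = None

    def observe(self, control_plane_state: str, now: float) -> str:
        """Feed one poll sample and return the phase the DPU is in."""
        if control_plane_state == "up":
            recovered = self.down_since is not None
            self.down_since = None
            return "recovered" if recovered else "ready"
        if self.down_since is None:
            self.down_since = now  # first poll that observed "down"
        elapsed = now - self.down_since
        if elapsed >= POWER_CYCLE_TIMEOUT:
            return "power_cycle"
        if elapsed > AUTO_RECOVERY_TIMEOUT:
            return "persistent_failure"
        return "auto_recovery_window"
```

Calling `observe()` once per 10-second poll with the current `dpu_control_plane_state` yields `auto_recovery_window` while the DPU may still self-recover, and `power_cycle` only once the full 180 seconds have elapsed.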
+ +**PMON Action:** +- `chassisd` sets `ready_status` to `false` and updates `last_down_time` for the corresponding DPU. +- `chassisd` waits up to 60 seconds for the DPU to self-recover. +- Once `dpu_control_plane_state` transitions back to `up`, `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. + +**DB State Transition:** + +| DB Field | Before | During Failure | After Recovery | +| -------- | :----: | :------------: | :------------: | +| `dpu_control_plane_state` | `up` | `down` | `up` | +| `ready_status` | `true` | `false` | `true` | +| `last_down_time` | — | `` | — | +| `last_ready_time` | — | — | `` | + +--- + +### Critical process persistently down on DPU ### + +**Description:** +A critical process in `syncd`, `swss`, `pmon`, or `database` crashes on the DPU and **remains down beyond the auto-recovery timeout** (i.e., the container supervisor cannot successfully restart it, or the process keeps crash-looping). Unlike a transient restart, this scenario indicates a persistent failure that requires a DPU power-cycle to recover. + +**Detection (by PMON):** +- `chassisd` on the NPU polls `dpu_control_plane_state` every 10 seconds and observes it as `down`. +- The state remains `down` beyond the 60-second auto-recovery timeout. + +**PMON Action:** +- `chassisd` sets `ready_status` to `false` and updates `last_down_time` for the corresponding DPU. +- After the power-cycle timeout elapses without recovery (default: 180 seconds, measured from the time the failure is first detected, so the total wait is 180 seconds including the initial 60-second auto-recovery window), `chassisd` issues a power-cycle of the DPU and increments `reset_count`. +- Once `dpu_control_plane_state` transitions back to `up`, `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. 
+- If `reset_count` reaches `reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. + +**DB State Transition:** + +| DB Field | Before | During Failure | After Recovery | +| -------- | :----: | :------------: | :------------: | +| `dpu_control_plane_state` | `up` | `down` | `up` | +| `ready_status` | `true` | `false` | `true` | +| `last_down_time` | — | `` | — | +| `last_ready_time` | — | — | `` | +| `reset_count` | N | N | N+1 | +| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `reset_limit`) | + +--- + +### pmon crash on NPU ### + +**Description:** +The `pmon` (Platform Monitor) daemon on the NPU crashes. This is a **critical** PMON failure — `chassisd` and all other PMON sub-daemons stop, halting all DPU health monitoring. + +**Detection (by PMON):** +- Not self-detectable. `systemd` detects the `pmon` container is down and restarts it. +- DPU health state updates to `CHASSIS_STATE_DB` stop during the outage. + +**PMON Action:** +- On `chassisd` bringup sequence after restart, `chassisd` sets `ready_status` to `false` and updates `last_down_time` for **all** DPUs. +- `chassisd` re-polls all DPU states and updates `CHASSIS_STATE_DB` with current values. +- For each DPU found healthy, `chassisd` sets `ready_status` back to `true` and updates `last_ready_time`. + +**DB State Transition:** + +| DB Field | Before | During Failure | After Recovery | +| -------- | :----: | :------------: | :------------: | +| `ready_status` (all DPUs) | `true` | stale | `false` → `true` (per DPU) | +| `last_down_time` (all DPUs) | — | — | `` | +| `last_ready_time` (all DPUs) | — | — | `` (per DPU) | + +> **Note:** If only `chassisd` crashes within the `pmon` container (while `pmon` itself stays running), `supervisord` inside `pmon` restarts `chassisd` automatically. 
The recovery behavior is identical to the full `pmon` crash case described above: `chassisd` re-initializes all DPU states on startup. + +--- + +### databasedpu crash on NPU ### + +**Description:** +The `databasedpu` (per-DPU Redis database instance) on the NPU crashes. Each DPU has a dedicated Redis instance on the NPU (port 6381 + DPU ID, bound to midplane bridge IP 169.254.200.254). + +**Detection (by PMON):** +- `chassisd` cannot read DPU state from the corresponding Redis instance. + +**PMON Action:** +- `chassisd` detects loss of DPU state, sets `ready_status` to `false`, and updates `last_down_time`. +- After `systemd` restarts the Redis instance and the DPU reconnects, `chassisd` polls DPU state, sets `ready_status` back to `true`, and updates `last_ready_time` once all states are verified. + +**DB State Transition:** + +| DB Field | Before | During Failure | After Recovery | +| -------- | :----: | :------------: | :------------: | +| `ready_status` | `true` | `false` | `true` | +| `last_down_time` | — | `` | — | +| `last_ready_time` | — | — | `` | + +--- + +## DPU Hardware Failures ## + +### DPU Hardware Failure (Complete DPU Down) ### + +**Description:** +A DPU completely fails due to a hardware fault, thermal event, or unrecoverable error. The DPU is no longer responsive on the midplane or back-panel ports. + +**Detection (by PMON):** +- NPU: the operational state of the DPU, `CHASSIS_MODULE_TABLE|DPU|oper_status`, is set to `Offline`. + +**PMON Action:** +- `chassisd` sets `ready_status` to `false` and updates `last_down_time` for the corresponding DPU. +- `chassisd` power-cycles the DPU **immediately** (no 180-second timeout, because the DPU is already confirmed non-functional via `oper_status: Offline`) and increments `reset_count`. +- After the power-cycle, the DPU goes through the full boot sequence: midplane attach → PCIe rescan → SONiC boot → container startup. 
+- `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. +- If `reset_count` reaches `reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. + +**DB State Transition:** + +| DB Field | Before | During Failure | After Recovery | +| -------- | :----: | :------------: | :------------: | +| `oper_status` | `Online` | `Offline` | `Online` | +| `ready_status` | `true` | `false` | `true` | +| `last_down_time` | — | `` | — | +| `last_ready_time` | — | — | `` | +| `reset_count` | N | N | N+1 | +| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `reset_limit`) | + +--- + +### DPU Power Failure / Unexpected Shutdown ### + +**Description:** +The DPU loses power unexpectedly or shuts down without graceful notification (e.g., voltage regulator failure, firmware crash). + +**Detection (by PMON):** +- NPU `pmon` detects midplane ping failure → `dpu_midplane_link_state` set to `down`. +- `dpu_control_plane_state` transitions to `down`. + +**PMON Action:** +- `chassisd` sets `ready_status` to `false` and updates `last_down_time`. +- `chassisd` power-cycles the DPU **immediately** (no 180-second timeout — midplane and control plane are already confirmed down) and increments `reset_count`. +- After power-cycle, `chassisd` verifies all DPU states, sets `ready_status` back to `true`, and updates `last_ready_time`. 
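
The difference between the timed path used for software failures and the immediate power-cycle used here can be sketched as a single decision function. This is only an illustration under stated assumptions: `recovery_action` and its return values are hypothetical names, the 180-second value is the documented default, and the `auto_recovery_enabled` flag stands in for the `state` field of `FEATURE|dpu-auto-recovery` in CONFIG_DB.

```python
POWER_CYCLE_TIMEOUT = 180  # seconds; default from the Timers and Thresholds table


def recovery_action(midplane: str, control_plane: str, seconds_down: float,
                    auto_recovery_enabled: bool = True) -> str:
    """Hypothetical decision helper; both states are "up" or "down"."""
    if midplane == "up" and control_plane == "up":
        return "none"
    if not auto_recovery_enabled:
        # Feature disabled: states are still tracked, but no power-cycle is issued.
        return "monitor_only"
    if midplane == "down" and control_plane == "down":
        # DPU is confirmed unreachable, so waiting out the timeout adds nothing.
        return "power_cycle"
    if seconds_down >= POWER_CYCLE_TIMEOUT:
        return "power_cycle"
    return "wait"
```

With the midplane still up, a control-plane outage returns `wait` until 180 seconds have passed; with both down, the function returns `power_cycle` straight away.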
+ +**DB State Transition:** + +| DB Field | Before | During Failure | After Recovery | +| -------- | :----: | :------------: | :------------: | +| `dpu_midplane_link_state` | `up` | `down` | `up` | +| `dpu_control_plane_state` | `up` | `down` | `up` | +| `ready_status` | `true` | `false` | `true` | +| `last_down_time` | — | `` | — | +| `last_ready_time` | — | — | `` | +| `reset_count` | N | N | N+1 | + +--- + +### PCIe Failure ### + +**Description:** +The PCIe bus between the NPU and a local DPU fails, making the DPU unreachable from the NPU. The DPU may still be running internally but is disconnected from the NPU. + +**Detection (by PMON):** +- `pcied` detects PCIe link down and updates `PCIE_DETACH_INFO|DPU` in STATE_DB with `dpu_state: detached`. +- Independently, `chassisd` detects midplane loss via `is_midplane_reachable()` polling and updates `dpu_midplane_link_state` → `down` in CHASSIS_STATE_DB. + +**PMON Action:** +- `chassisd` sets `ready_status` to `false` and updates `last_down_time`. +- `chassisd` power-cycles the DPU **immediately** (no 180-second timeout — midplane link loss and PCIe detach confirm the DPU is unreachable) and increments `reset_count`. +- After power-cycle, PCIe rescan is performed: + - Platform vendor API: `pci_reattach()` (provided by `sonic_platform`). +- `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. 
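
The bookkeeping around an unplanned power-cycle (marking the DPU not ready, counting the reset, rescanning PCIe only when a detach was recorded, and stopping at `reset_limit`) might look like the sketch below. The function `handle_unreachable_dpu` and the action strings are hypothetical; the dictionaries are in-memory stand-ins for `DPU_STATE` and `PCIE_DETACH_INFO`, not real Redis access, and the limit of 5 is the documented default.

```python
from typing import Dict, List

RESET_LIMIT = 5  # default `reset_limit`


def handle_unreachable_dpu(dpu_state: Dict[str, str],
                           pcie_info: Dict[str, str],
                           now: str) -> List[str]:
    """Update an in-memory copy of DPU_STATE and return the planned actions."""
    dpu_state["ready_status"] = "false"
    dpu_state["last_down_time"] = now
    if dpu_state["recovery_status"] == "unrecoverable":
        return []  # automatic power-cycling has already been stopped
    actions = ["power_cycle"]
    dpu_state["reset_count"] = str(int(dpu_state["reset_count"]) + 1)
    if pcie_info.get("dpu_state") == "detached":
        actions.append("pci_reattach")  # rescan only if a detach was recorded
    if int(dpu_state["reset_count"]) >= RESET_LIMIT:
        dpu_state["recovery_status"] = "unrecoverable"
    return actions
```

Returning the actions instead of performing them keeps the sketch free of platform calls; a real daemon would invoke the vendor power-cycle and `pci_reattach()` APIs at those points.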
+ +**DB State Transition:** + +| DB Field | Before | During Failure | After Recovery | +| -------- | :----: | :------------: | :------------: | +| `dpu_midplane_link_state` | `up` | `down` | `up` | +| `PCIE_DETACH_INFO` `dpu_state` | `reattached` | `detached` | `reattached` | +| `ready_status` | `true` | `false` | `true` | +| `last_down_time` | — | `` | — | +| `last_ready_time` | — | — | `` | +| `reset_count` | N | N | N+1 | + +--- + +## NPU / Switch Level Failures ## + +### NPU Kernel Crash / Memory Exhaustion ### + +**Description:** +The entire switch (NPU + all DPUs) goes down due to kernel panic or memory exhaustion. All DPUs on the switch are impacted simultaneously. + +**Detection (by PMON):** +- On NPU recovery, `chassisd` reads the reboot cause from `/host/reboot-cause/reboot-cause.txt`. If the reboot cause indicates a kernel crash or memory exhaustion (e.g., `Kernel Panic`), `chassisd` treats all DPU states as potentially stale and triggers re-initialization. + +**PMON Action:** +- On recovery, `chassisd` initializes all DPU states as `down`, sets `ready_status` to `false`, and updates `last_down_time` for all DPUs. +- `chassisd` re-establishes midplane connectivity and polls each DPU's state. +- If a DPU is still running and healthy (midplane, control plane, data plane all `up`), `chassisd` sets `ready_status` back to `true` and updates `last_ready_time`. +- If a DPU is unresponsive or in a bad state, `chassisd` sends gNOI Reboot RPC to reset it. Each such DPU then goes through: midplane attach → PCIe rescan → SONiC boot → container startup. 
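
The re-initialization pass after an NPU crash can be sketched as follows: every DPU starts as not ready, and only DPUs whose three polled states are all `up` are marked ready again. The function `reinitialize` and the keys `DPU0` and `DPU1` used in the usage note are illustrative assumptions; the field names are the ones defined in the DB sections of this document.

```python
from typing import Dict, List, Tuple

STATE_KEYS = (
    "dpu_midplane_link_state",
    "dpu_control_plane_state",
    "dpu_data_plane_state",
)


def reinitialize(polled: Dict[str, Dict[str, str]],
                 now: str) -> Tuple[Dict[str, Dict[str, str]], List[str]]:
    """Map freshly polled per-DPU states to readiness fields and a reset list."""
    db: Dict[str, Dict[str, str]] = {}
    needs_reset: List[str] = []
    for name, states in polled.items():
        # Every DPU is treated as down until its states are verified.
        entry = {"ready_status": "false", "last_down_time": now}
        if all(states.get(key) == "up" for key in STATE_KEYS):
            entry["ready_status"] = "true"
            entry["last_ready_time"] = now
        else:
            needs_reset.append(name)
        db[name] = entry
    return db, needs_reset
```

For example, passing one healthy entry keyed `DPU0` and one entry keyed `DPU1` with its control plane down would mark only `DPU0` ready and return `DPU1` in the reset list.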
+ +**DB State Transition:** + +| DB Field | Before Crash | On NPU Recovery | After DPU Recovery | +| -------- | :----------: | :-------------: | :----------------: | +| `ready_status` (all DPUs) | `true` | `false` | `true` (per DPU) | +| `last_down_time` (all DPUs) | — | `` | — | +| `last_ready_time` (all DPUs) | — | — | `` (per DPU) | + +--- + +## Planned Operations ## + +### DPU Graceful Shutdown ### + +**Description:** +Orderly shutdown of a DPU via CLI command: `config chassis module shutdown DPU`. + +**PMON Sequence:** +1. `chassisd` calls `set_admin_state(down)` → `module_base.py` triggers `graceful_shutdown_handler()`. +2. `CHASSIS_MODULE_TABLE` in STATE_DB updated: + - `state_transition_in_progress`: `True` + - `transition_start_time`: `` + - `transition_type`: `shutdown` +3. `chassisd` updates CHASSIS_STATE_DB: + - `DPU_STATE|DPU`: `ready_status`: `false`, `last_down_time`: `` +4. `gnoi_reboot_daemon.py` detects the transition and sends gNOI Reboot RPC (Method: `HALT`) to DPU. +5. DPU gracefully shuts down all services via `reboot -p`. +6. NPU polls `gnoi_client -rpc RebootStatus` until `active=false` (services terminated). +7. `state_transition_in_progress` set to `False`. +8. `module_base.py` calls platform API `power_down()` to power off DPU. +9. PCIe detach: platform vendor API `pci_detach()`. +10. Sensor ignore configs added, sensord restarted. + +**DB State Transition:** + +| DB Field | Before | After Shutdown | +| -------- | :----: | :------------: | +| `ready_status` | `true` | `false` | +| `last_down_time` | — | `` | +| `oper_status` | `Online` | `Offline` | +| `state_transition_in_progress` | `False` | `True` → `False` | + +**Race Condition Handling:** +- If module shutdown is requested during a DPU reboot: operation fails; retry after reboot completes. +- If switch reboot is requested during module shutdown: graceful shutdown completes; switch reboot proceeds. +- Concurrent startup/shutdown on the same module: fails; user retries later. 
+- If `config chassis module shutdown` is issued while `chassisd` is in the middle of an auto-recovery power-cycle for the same DPU: `chassisd` detects the admin-down request, aborts the auto-recovery loop, and proceeds with the graceful shutdown sequence. +- If `pcied` detects a PCIe failure and updates `PCIE_DETACH_INFO` at the same time `chassisd` initiates a power-cycle due to midplane loss: `chassisd` holds a per-DPU lock during the power-cycle sequence. `pcied` updates `PCIE_DETACH_INFO` independently (no lock contention). `chassisd` reads `PCIE_DETACH_INFO` during its power-cycle flow and performs a PCIe rescan if `dpu_state` is `detached`. No conflicting actions occur because `pcied` only publishes state and takes no recovery action itself, while `chassisd` only reads that state and acts on it. + +--- + +### DPU Cold Reboot ### + +**Description:** +Reboot a DPU with a full power-cycle via CLI: `reboot -d `. + +**PMON Sequence:** +1. NPU sends gNOI Reboot RPC (Method: `HALT`) to DPU. +2. NPU polls gNOI `RebootStatus` until `active=false` and `Status=STATUS_SUCCESS`. +3. Timeout: `dpu_halt_services_timeout` (read from `platform.json`, default 60 seconds). +4. PCIe detach: platform vendor API `pci_detach()`. +5. Platform vendor reboot API invoked (DPU cold boot / power-cycle). +6. PCIe reattach: platform vendor API `pci_reattach()`. +7. DPU boots, services start, reports `dpu_control_plane_state=up`. +8. `chassisd` verifies all DPU states and sets `ready_status` to `true`. + +**DB State Transition:** + +| DB Field | Before | During Reboot | After Recovery | +| -------- | :----: | :-----------: | :------------: | +| `ready_status` | `true` | `false` | `true` | +| `last_down_time` | — | `` | — | +| `last_ready_time` | — | — | `` | +| `PCIE_DETACH_INFO` `dpu_state` | `reattached` | `detaching` → `detached` | `reattached` | + +**Error handling:** +- If the gNOI service is unreachable: detach PCIe and proceed after timeout. 
+- If PCIe reattach fails: error handling + restoration mechanism triggered. +- If DPU stuck: hardware watchdog triggers reset (vendor-specific). + +--- + +### Full SmartSwitch Reboot ### + +**Description:** +Planned reboot of the entire SmartSwitch (NPU + all DPUs) via CLI: `reboot`. All DPUs are gracefully shut down in parallel before the NPU reboots. + +**PMON Sequence:** +1. NPU sends gNOI Reboot RPC (Method: `HALT`) to **all** DPUs in parallel (multiple threads). +2. NPU polls gNOI `RebootStatus` for each DPU until `active=false` and `Status=STATUS_SUCCESS`. +3. Timeout per DPU: `dpu_halt_services_timeout` (default from `platform.json`, typically 60 seconds). +4. For each DPU: PCIe detach via platform vendor API `pci_detach()`. +5. NPU proceeds with its own reboot sequence. +6. On NPU boot, PCIe enumeration discovers all DPUs. +7. `chassisd` power-cycles each DPU and performs PCIe reattach. +8. Each DPU boots: midplane attach → SONiC boot → container startup → reports `dpu_control_plane_state=up`. + +**DB State Transition:** + +| DB Field | Before | During Reboot | After Recovery | +| -------- | :----: | :-----------: | :------------: | +| `ready_status` (all DPUs) | `true` | `false` | `true` (per DPU) | +| `last_down_time` (all DPUs) | — | `` | — | +| `last_ready_time` (all DPUs) | — | — | `` (per DPU) | +| `PCIE_DETACH_INFO` `dpu_state` (per DPU) | `reattached` | `detaching` → `detached` | `reattached` | + +**Error handling:** +- If a DPU does not respond to gNOI Reboot RPC within the timeout: NPU proceeds with PCIe detach and continues the reboot. The unresponsive DPU is cold-booted on NPU recovery. +- If a DPU fails to come back after the full switch reboot: `chassisd` retries power-cycle up to `reset_limit` (tracked via `reset_count`). If still unresponsive, `chassisd` sets `recovery_status` to `"unrecoverable"`. 
+- If the NPU reboot is initiated while a DPU graceful shutdown is in progress: the graceful shutdown completes first, then the NPU reboot proceeds.
+
+---
+
+## Scenario DB State Summary ##
+
+| DPU Scenario | `dpu_control_plane_state` | `dpu_midplane_link_state` | `ready_status` | PMON Action |
+| ------------ | :-----------------------: | :-----------------------: | :-----------: | ----------- |
+| DPU booting – initial state | unknown | unknown | false | `chassisd` polls; waiting for DPU to come up |
+| DPU healthy and running – first boot | up | up | true | Set `ready_status=true` after verifying all states |
+| DPU crash / unplanned reboot | down | down | false | Power-cycle DPU; increment `reset_count` |
+| DPU up after crash | up | up | true | Set `ready_status=true` after verifying all states |
+| DPU stuck (lost connectivity) | down | down | false | Power-cycle DPU; increment `reset_count` |
+| DPU up after losing connectivity / reboot | up | up | true | Set `ready_status=true` after verifying all states |
+| DPU control plane restart – critical services | down → up | up | false → true | Wait for auto-recovery; set `ready_status=true` on recovery |
+| NPU/DPU OS upgrade | down → up | up | false → true | Re-poll DPU states on NPU recovery |
+| DPU dead – power cycle | down | down | false | Power-cycle DPU; increment `reset_count` |
+| DPU dead – unrecoverable | down | down | false | `reset_count` reached `reset_limit`; `recovery_status` set to `"unrecoverable"`; raise alert |
+| Full SmartSwitch reboot (planned) | down → up | down → up | false → true | gNOI halt; power-cycle; re-verify |
+
+---
+
+## Repository Change Summary ##
+
+| Repository | Component | Changes |
+| ---------- | --------- | ------- |
+| [sonic-platform-daemons](https://github.com/sonic-net/sonic-platform-daemons) | `chassisd` | DPU failure detection, automated power-cycle recovery, new CHASSIS_STATE_DB fields (`ready_status`, `recovery_status`, `reset_count`, `last_down_time`, `last_ready_time`) |
+| [sonic-buildimage](https://github.com/sonic-net/sonic-buildimage) | PMON container | Configuration updates for new `chassisd` failure recovery features |
+
+---
+
+## References ##
+
+- [Smart Switch PMON](../pmon/smartswitch-pmon.md)
+- [Smart Switch Graceful Shutdown](../graceful-shutdown/graceful-shutdown.md)
+- [Smart Switch Reboot HLD](../reboot/reboot-hld.md)
+- [Smart Switch Database Architecture](../smart-switch-database-architecture/smart-switch-database-design.md)
+- [Smart Switch IP Address Assignment](../ip-address-assigment/smart-switch-ip-address-assignment.md)
+- [Smart Switch DPU Upgrade HLD](../upgrade/dpu-upgrade-hld.md)

From 67f434677aad1ef5e2c7db46b6484f22a4cf3f1e Mon Sep 17 00:00:00 2001
From: Vasundhara Volam
Date: Mon, 27 Apr 2026 21:42:19 +0000
Subject: [PATCH 02/14] [pmon]: Fix DPU state values - replace unknown with down

DPU control plane, midplane, and data plane states are always 'down'
during booting, never 'unknown'. Update terminology, state machine table,
and scenario summary accordingly.

Signed-off-by: Vasundhara Volam
---
 doc/smart-switch/pmon/enhance-dpu-robustness.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md
index 8950cab5c7a..1661e655cdf 100644
--- a/doc/smart-switch/pmon/enhance-dpu-robustness.md
+++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md
@@ -92,9 +92,9 @@ This document enumerates all failure scenarios that can occur on a DPU or its su
 | chassisd | Chassis daemon running inside `pmon` on the NPU; monitors DPU health states, manages DPU power-cycle and reset operations |
 | pmon | Platform Monitor daemon on NPU; hosts `chassisd` and other hardware monitoring sub-daemons |
 | syncd | Sync daemon; manages SAI API calls to DPU ASIC |
-| control plane state | DPU SONiC is booted up, all containers are up, interfaces are up, and DPU is ready to accept configuration. Derived from SYSTEM_READY in STATE_DB.
Derived from SYSTEM_READY in STATE_DB. Values: `"unknown"`, `"up"`, `"down"`. | -| midplane link state | The PCIe link between the NPU and DPU is operational. Monitored and updated by NPU pmon `chassisd` via the `is_midplane_reachable` platform API. Values: `"unknown"`, `"up"`, `"down"`. | -| dataplane state | Configuration is downloaded, pipeline stages are up, and DPU hardware (port/ASIC) is ready to take traffic. Values: `"unknown"`, `"up"`, `"down"`. | +| control plane state | DPU SONiC is booted up, all containers are up, interfaces are up, and DPU is ready to accept configuration. Derived from SYSTEM_READY in STATE_DB. Values: `"up"`, `"down"`. | +| midplane link state | The PCIe link between the NPU and DPU is operational. Monitored and updated by NPU pmon `chassisd` via the `is_midplane_reachable` platform API. Values: `"up"`, `"down"`. | +| dataplane state | Configuration is downloaded, pipeline stages are up, and DPU hardware (port/ASIC) is ready to take traffic. Values: `"up"`, `"down"`. | --- @@ -259,7 +259,7 @@ stateDiagram-v2 | State | `ready_status` | `recovery_status` | Key DB Indicators | | ----- | :------------: | :----------------: | ----------------- | -| **Booting** | `false` | `recoverable` | `dpu_control_plane_state: unknown` | +| **Booting** | `false` | `recoverable` | `dpu_control_plane_state: down` | | **Ready** | `true` | `recoverable` | All three states `up` | | **SWFailure** | `false` | `recoverable` | `dpu_control_plane_state: down`, `dpu_midplane_link_state: up` | | **PowerCycle** | `false` | `recoverable` | `chassisd` issuing power-cycle; `reset_count` incremented | @@ -589,7 +589,7 @@ Planned reboot of the entire SmartSwitch (NPU + all DPUs) via CLI: `reboot`. 
All | DPU Scenario | `dpu_control_plane_state` | `dpu_midplane_link_state` | `ready_status` | PMON Action | | ------------ | :-----------------------: | :-----------------------: | :-----------: | ----------- | -| DPU booting – initial state | unknown | unknown | false | `chassisd` polls; waiting for DPU to come up | +| DPU booting – initial state | down | down | false | `chassisd` polls; waiting for DPU to come up | | DPU healthy and running – first boot | up | up | true | Set `ready_status=true` after verifying all states | | DPU crash / unplanned reboot | down | down | false | Power-cycle DPU; increment `reset_count` | | DPU up after crash | up | up | true | Set `ready_status=true` after verifying all states | From 174a2f7ce583a9ca8b69de89e814fa44fd8d68b7 Mon Sep 17 00:00:00 2001 From: Vasundhara Volam Date: Thu, 30 Apr 2026 00:18:06 +0000 Subject: [PATCH 03/14] [doc][smart-switch][pmon]: Clarify DPU recovery semantics and auto-recovery gating - Replace the ambiguous two-timer model (60s auto-recovery + 180s power-cycle) with a single, clearly-named dpu_auto_recovery_timeout (60s). Update the timer table, state machine edge labels, and all DPU software/hardware failure scenarios to use the consistent name. - Rename 'Critical process' subsections to 'Process' for accuracy; update TOC anchors and Scope wording accordingly. - Add ManualIntervention state to the DPU recovery state machine and gate SWFailure/HW-failure transitions on the auto-recovery feature flag. Add a global note plus per-scenario 'When auto-recovery is disabled' bullets so the FEATURE|dpu-auto-recovery=disabled behavior is consistent across every failure scenario. - Rework NPU Kernel Crash recovery: chassisd unconditionally power-cycles every admin-up DPU via the platform vendor path (power_down/pci_detach/power_up/pci_reattach) instead of using gNOI Reboot RPC against potentially unresponsive DPUs. Admin-down DPUs are left offline. 
Add reset_count row to the DB transition table with a note about chassisd-restart zeroing. - Fix 'Table of Content' typo and add Existing/New DB entries sub-entries to the table of contents. - Replace literal pipe inside backticks in the state table cell with HTML entity so the markdown table renders correctly on GitHub. Signed-off-by: Vasundhara Volam --- .../pmon/enhance-dpu-robustness.md | 57 ++++++++++++------- 1 file changed, 36 insertions(+), 21 deletions(-) diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md index 1661e655cdf..55caf2615fb 100644 --- a/doc/smart-switch/pmon/enhance-dpu-robustness.md +++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md @@ -1,6 +1,6 @@ # Smart Switch: PMON: Enhance DPU Robustness # -## Table of Content ## +## Table of Contents ## - [Revision](#revision) - [Scope](#scope) @@ -10,10 +10,12 @@ - [Critical Processes for DPU Management](#critical-processes-for-dpu-management) - [Timers and Thresholds](#timers-and-thresholds) - [DPU Status DB Info](#dpu-status-db-info) + - [Existing DB entries](#existing-db-entries) + - [New DB entries](#new-db-entries) - [DPU Recovery State Machine](#dpu-recovery-state-machine) - [DPU Software Failures](#dpu-software-failures) - - [Critical Process Restart on DPU](#critical-process-restart-on-dpu) - - [Critical Process Persistently Down on DPU](#critical-process-persistently-down-on-dpu) + - [Process Restart on DPU](#process-restart-on-dpu) + - [Process Persistently Down on DPU](#process-persistently-down-on-dpu) - [pmon Crash on NPU](#pmon-crash-on-npu) - [databasedpu Crash on NPU](#databasedpu-crash-on-npu) - [DPU Hardware Failures](#dpu-hardware-failures) @@ -46,7 +48,7 @@ This document covers the High Level Design for DPU failure scenarios on a SmartS The scope includes: -- DPU software failures (critical process crashes and restarts on DPU; pmon and databasedpu crashes on NPU) +- DPU software failures (process crashes and restarts on 
DPU; pmon and databasedpu crashes on NPU) - DPU hardware failures (complete DPU down, power failure / unexpected shutdown, PCIe failure) - NPU/switch-level failures (kernel crash, memory exhaustion) - DB state tracking for DPU failure detection and recovery (new and existing DB entries) @@ -126,8 +128,7 @@ All timers and thresholds used by PMON for DPU failure detection and recovery ar | Timer / Threshold | Default Value | Configurable | Used By | Description | | ----------------- | :-----------: | :----------: | ------- | ----------- | | `chassisd` health poll interval | 10 seconds | No | `chassisd` | Interval at which `chassisd` polls `dpu_control_plane_state`, `dpu_data_plane_state`, and `dpu_midplane_link_state` | -| DPU auto-recovery timeout | 60 seconds | Yes (`platform.json`) | `chassisd` | Time allowed for a DPU to recover from a critical process restart before escalating. If `dpu_control_plane_state` or `dpu_midplane_state` remains `down` beyond this timeout, `chassisd` initiates a DPU reset, if DPU is still up and running. | -| DPU power-cycle timeout | 180 seconds | Yes | `chassisd` | Time `chassisd` waits for `dpu_control_plane_state` to return to `up` before issuing a power-cycle | +| `dpu_auto_recovery_timeout` | 60 seconds | Yes (`platform.json`) | `chassisd` | Self-heal grace period after a software failure is first detected. `chassisd` waits this long for the DPU to recover on its own (e.g., container supervisor restarts the failing process). If `dpu_control_plane_state` or `dpu_midplane_link_state` is still `down` when this timer expires, `chassisd` issues a DPU power-cycle. | | `pcied` PCIe poll interval | 60 seconds | No | `pcied` | Interval at which `pcied` checks PCIe link status for all DPUs. A PCIe failure may go undetected for up to 60 seconds. 
| | `dpu_halt_services_timeout` | 60 seconds | Yes (`platform.json`) | `gnoi_reboot_daemon.py` | Maximum time to wait for DPU services to halt gracefully during reboot/shutdown | | `reset_limit` | 5 | Yes (`platform.json`) | `chassisd` | Maximum number of consecutive unplanned power-cycle attempts before marking DPU as unrecoverable | @@ -238,14 +239,18 @@ stateDiagram-v2 Booting --> Ready : All states up Ready --> SWFailure : Control plane down - SWFailure --> Ready : Self recovers within 60s - SWFailure --> PowerCycle : 180s timeout expires + SWFailure --> Ready : Self recovers before dpu_auto_recovery_timeout (60s) + SWFailure --> PowerCycle : dpu_auto_recovery_timeout (60s) expires [auto-recovery enabled] + SWFailure --> ManualIntervention : dpu_auto_recovery_timeout (60s) expires [auto-recovery disabled] - Ready --> PowerCycle : HW failure detected + Ready --> PowerCycle : HW failure detected [auto-recovery enabled] + Ready --> ManualIntervention : HW failure detected [auto-recovery disabled] PowerCycle --> Booting : Power cycle issued PowerCycle --> Unrecoverable : reset count >= reset limit + ManualIntervention --> Booting : Operator power-cycle / module startup + Ready --> PlannedShutdown : CLI module shutdown Ready --> PlannedReboot : CLI reboot DPU @@ -263,6 +268,7 @@ stateDiagram-v2 | **Ready** | `true` | `recoverable` | All three states `up` | | **SWFailure** | `false` | `recoverable` | `dpu_control_plane_state: down`, `dpu_midplane_link_state: up` | | **PowerCycle** | `false` | `recoverable` | `chassisd` issuing power-cycle; `reset_count` incremented | +| **ManualIntervention** | `false` | `recoverable` | DPU down; `FEATURE|dpu-auto-recovery` `state` is `disabled` / `always_disabled`; no power-cycle issued; awaits operator action | | **Offline** | `false` | `recoverable` | `oper_status: Offline` | | **Unrecoverable** | `false` | `unrecoverable` | `reset_count` ≥ `reset_limit` | @@ -270,17 +276,17 @@ stateDiagram-v2 ## DPU Software Failures ## -### 
Critical process restart on DPU ### +### Process restart on DPU ### **Description:** -When any process in the `syncd` or `swss` dockers crashes on the DPU, but the container supervisor successfully restarts the process and the DPU recovers on its own within the auto-recovery timeout (default: 60 seconds). No power-cycle is needed. +When any process crashes on the DPU, but the container supervisor successfully restarts the process and the DPU recovers on its own within `dpu_auto_recovery_timeout` (default: 60 seconds). No power-cycle is needed. **Detection (by PMON):** - `chassisd` on the NPU polls `dpu_control_plane_state` every 10 seconds and observes it as `down`. **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time` for the corresponding DPU. -- `chassisd` waits up to 60 seconds for the DPU to self-recover. +- `chassisd` waits up to `dpu_auto_recovery_timeout` (60 seconds) for the DPU to self-recover. - Once `dpu_control_plane_state` transitions back to `up`, `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. **DB State Transition:** @@ -294,20 +300,21 @@ When any process in the `syncd` or `swss` dockers crashes on the DPU, but the co --- -### Critical process persistently down on DPU ### +### Process persistently down on DPU ### **Description:** -When any critical process in `syncd`, `swss`, `pmon`, or `database` crashes on the DPU and **remains down beyond the auto-recovery timeout** (i.e., the container supervisor cannot successfully restart it, or the process keeps crash-looping). Unlike a transient restart, this scenario indicates a persistent failure that requires a DPU power-cycle to recover. +When any process crashes on the DPU and **remains down beyond `dpu_auto_recovery_timeout`** (i.e., the container supervisor cannot successfully restart it, or the process keeps crash-looping). 
Unlike a transient restart, this scenario indicates a persistent failure that requires a DPU power-cycle to recover. **Detection (by PMON):** - `chassisd` on the NPU polls `dpu_control_plane_state` every 10 seconds and observes it as `down`. -- State remains `down` beyond the 60-second auto-recovery timeout. +- State remains `down` beyond `dpu_auto_recovery_timeout` (default: 60 seconds). **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time` for the corresponding DPU. -- After the power-cycle timeout (default: 180 seconds, measured from the time failure is first detected — i.e., total wait is 180 seconds, which includes the initial 60-second auto-recovery window) elapses without recovery, `chassisd` issues a power-cycle of the DPU and increments `reset_count`. +- When `dpu_auto_recovery_timeout` (default: 60 seconds, measured from the time failure is first detected) expires without recovery, `chassisd` issues a power-cycle of the DPU and increments `reset_count`. - Once `dpu_control_plane_state` transitions back to `up`, `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. - If `reset_count` reaches `reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. +- **When auto-recovery is disabled:** `chassisd` skips the power-cycle. The DPU remains in **ManualIntervention** with `ready_status: false`; operator must reset the DPU manually. **DB State Transition:** @@ -382,10 +389,11 @@ A DPU completely fails due to hardware fault, thermal event, or unrecoverable er **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time` for the corresponding DPU. -- `chassisd` power-cycles the DPU **immediately** (no 180-second timeout — the DPU is already confirmed non-functional via `oper_status: Offline`) and increments `reset_count`. 
+- `chassisd` power-cycles the DPU **immediately** (skipping `dpu_auto_recovery_timeout` — the DPU is already confirmed non-functional via `oper_status: Offline`) and increments `reset_count`. - After power-cycle, DPU goes through full boot sequence: midplane attach → PCIe rescan → SONiC boot → container startup. - `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. - If `reset_count` reaches `reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. +- **When auto-recovery is disabled:** `chassisd` skips the immediate power-cycle. The DPU remains in **ManualIntervention** with `oper_status: Offline` and `ready_status: false`; operator must trigger recovery. **DB State Transition:** @@ -411,8 +419,9 @@ The DPU loses power unexpectedly or shuts down without graceful notification (e. **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time`. -- `chassisd` power-cycles the DPU **immediately** (no 180-second timeout — midplane and control plane are already confirmed down) and increments `reset_count`. +- `chassisd` power-cycles the DPU **immediately** (skipping `dpu_auto_recovery_timeout` — midplane and control plane are already confirmed down) and increments `reset_count`. - After power-cycle, `chassisd` verifies all DPU states, sets `ready_status` back to `true`, and updates `last_ready_time`. +- **When auto-recovery is disabled:** `chassisd` skips the power-cycle. The DPU remains in **ManualIntervention** with `ready_status: false`; operator must trigger recovery. **DB State Transition:** @@ -438,10 +447,11 @@ The PCIe bus between the NPU and a local DPU fails, making the DPU unreachable f **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time`. 
-- `chassisd` power-cycles the DPU **immediately** (no 180-second timeout — midplane link loss and PCIe detach confirm the DPU is unreachable) and increments `reset_count`. +- `chassisd` power-cycles the DPU **immediately** (skipping `dpu_auto_recovery_timeout` — midplane link loss and PCIe detach confirm the DPU is unreachable) and increments `reset_count`. - After power-cycle, PCIe rescan is performed: - Platform vendor API: `pci_reattach()` (provided by `sonic_platform`). - `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. +- **When auto-recovery is disabled:** `chassisd` skips both the power-cycle and the PCIe reattach. `PCIE_DETACH_INFO|DPU|dpu_state` remains `detached` and the DPU stays in **ManualIntervention** until the operator triggers recovery. **DB State Transition:** @@ -469,8 +479,10 @@ The entire switch (NPU + all DPUs) goes down due to kernel panic or memory exhau **PMON Action:** - On recovery, `chassisd` initializes all DPU states as `down`, sets `ready_status` to `false`, and updates `last_down_time` for all DPUs. - `chassisd` re-establishes midplane connectivity and polls each DPU's state. -- If a DPU is still running and healthy (midplane, control plane, data plane all `up`), `chassisd` sets `ready_status` back to `true` and updates `last_ready_time`. -- If a DPU is unresponsive or in a bad state, `chassisd` sends gNOI Reboot RPC to reset it. Each such DPU then goes through: midplane attach → PCIe rescan → SONiC boot → container startup. +- For every admin-up DPU, irrespective of its observed state (healthy, degraded, or unresponsive), `chassisd` issues a platform vendor power-cycle (`power_down()` → `pci_detach()` → `power_up()` → `pci_reattach()`) to guarantee a known-good starting state after the NPU crash, and increments `reset_count`. +- Admin-down DPUs (`oper_status: Offline`) are left powered off; `chassisd` does not reset them. 
+- After each DPU comes back, `chassisd` verifies all DPU states (midplane, control plane, data plane) and, on success, sets `ready_status` back to `true` and updates `last_ready_time`. +- **When auto-recovery is disabled:** `chassisd` skips the unconditional power-cycle for admin-up DPUs. Each DPU is left in its post-crash state with `ready_status: false` and remains in **ManualIntervention** awaiting operator action. **DB State Transition:** @@ -479,6 +491,9 @@ The entire switch (NPU + all DPUs) goes down due to kernel panic or memory exhau | `ready_status` (all DPUs) | `true` | `false` | `true` (per DPU) | | `last_down_time` (all DPUs) | — | `` | — | | `last_ready_time` (all DPUs) | — | — | `` (per DPU) | +| `reset_count` (per admin-up DPU) | N | N | N+1 | + +> **Note:** `reset_count` is reset to 0 on `chassisd` startup (per the field definition), so the "Before Crash" value above is the count as observed by the freshly restarted `chassisd` after the NPU comes back — effectively starting from 0. --- From b43d0412f933e0df4a2df06be5f6754f7666c715 Mon Sep 17 00:00:00 2001 From: Vasundhara Volam Date: Thu, 30 Apr 2026 05:53:52 +0000 Subject: [PATCH 04/14] [doc][smart-switch][pmon]: Reset DPU immediately on control-plane-down Drop the dpu_auto_recovery_timeout self-heal grace period. chassisd now initiates a DPU power-cycle as soon as it observes dpu_control_plane_state (or dpu_midplane_link_state) as down on its next 10s health poll, regardless of whether the failure is a transient process restart or a persistent crash-loop. - Remove dpu_auto_recovery_timeout from the timer table; clarify chassisd health poll interval description to state immediate power-cycle on detection. - Combine 'Process restart on DPU' and 'Process persistently down on DPU' into a single 'Process crash/restart on DPU' section since chassisd applies the same recovery path in both cases. Update TOC and DB transition table accordingly. 
- State machine: keep SWFailure as a transient state on control-plane-down, branching directly into PowerCycle (auto-recovery enabled) or ManualIntervention (auto-recovery disabled) without any timer wait. HW-failure path goes directly from Ready to PowerCycle / ManualIntervention. - Drop 'skipping dpu_auto_recovery_timeout' parentheticals from HW Failure / Power Failure / PCIe Failure scenarios. Update Scenario DB State Summary row for control plane restart to reflect the immediate power-cycle behavior. Signed-off-by: Vasundhara Volam --- .../pmon/enhance-dpu-robustness.md | 54 +++++-------------- 1 file changed, 13 insertions(+), 41 deletions(-) diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md index 55caf2615fb..f925f75e2f1 100644 --- a/doc/smart-switch/pmon/enhance-dpu-robustness.md +++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md @@ -14,8 +14,7 @@ - [New DB entries](#new-db-entries) - [DPU Recovery State Machine](#dpu-recovery-state-machine) - [DPU Software Failures](#dpu-software-failures) - - [Process Restart on DPU](#process-restart-on-dpu) - - [Process Persistently Down on DPU](#process-persistently-down-on-dpu) + - [Process Crash/Restart on DPU](#process-crashrestart-on-dpu) - [pmon Crash on NPU](#pmon-crash-on-npu) - [databasedpu Crash on NPU](#databasedpu-crash-on-npu) - [DPU Hardware Failures](#dpu-hardware-failures) @@ -127,8 +126,7 @@ All timers and thresholds used by PMON for DPU failure detection and recovery ar | Timer / Threshold | Default Value | Configurable | Used By | Description | | ----------------- | :-----------: | :----------: | ------- | ----------- | -| `chassisd` health poll interval | 10 seconds | No | `chassisd` | Interval at which `chassisd` polls `dpu_control_plane_state`, `dpu_data_plane_state`, and `dpu_midplane_link_state` | -| `dpu_auto_recovery_timeout` | 60 seconds | Yes (`platform.json`) | `chassisd` | Self-heal grace period after a software failure is first 
detected. `chassisd` waits this long for the DPU to recover on its own (e.g., container supervisor restarts the failing process). If `dpu_control_plane_state` or `dpu_midplane_link_state` is still `down` when this timer expires, `chassisd` issues a DPU power-cycle. | +| `chassisd` health poll interval | 10 seconds | No | `chassisd` | Interval at which `chassisd` polls `dpu_control_plane_state`, `dpu_data_plane_state`, and `dpu_midplane_link_state`. As soon as `dpu_control_plane_state` or `dpu_midplane_link_state` is observed as `down`, `chassisd` initiates a DPU power-cycle (no self-heal grace period). | | `pcied` PCIe poll interval | 60 seconds | No | `pcied` | Interval at which `pcied` checks PCIe link status for all DPUs. A PCIe failure may go undetected for up to 60 seconds. | | `dpu_halt_services_timeout` | 60 seconds | Yes (`platform.json`) | `gnoi_reboot_daemon.py` | Maximum time to wait for DPU services to halt gracefully during reboot/shutdown | | `reset_limit` | 5 | Yes (`platform.json`) | `chassisd` | Maximum number of consecutive unplanned power-cycle attempts before marking DPU as unrecoverable | @@ -239,9 +237,8 @@ stateDiagram-v2 Booting --> Ready : All states up Ready --> SWFailure : Control plane down - SWFailure --> Ready : Self recovers before dpu_auto_recovery_timeout (60s) - SWFailure --> PowerCycle : dpu_auto_recovery_timeout (60s) expires [auto-recovery enabled] - SWFailure --> ManualIntervention : dpu_auto_recovery_timeout (60s) expires [auto-recovery disabled] + SWFailure --> PowerCycle : auto-recovery enabled + SWFailure --> ManualIntervention : auto-recovery disabled Ready --> PowerCycle : HW failure detected [auto-recovery enabled] Ready --> ManualIntervention : HW failure detected [auto-recovery disabled] @@ -266,7 +263,7 @@ stateDiagram-v2 | ----- | :------------: | :----------------: | ----------------- | | **Booting** | `false` | `recoverable` | `dpu_control_plane_state: down` | | **Ready** | `true` | `recoverable` | All three states 
`up` | -| **SWFailure** | `false` | `recoverable` | `dpu_control_plane_state: down`, `dpu_midplane_link_state: up` | +| **SWFailure** | `false` | `recoverable` | `dpu_control_plane_state: down`, `dpu_midplane_link_state: up`; transient state before `chassisd` selects `PowerCycle` or `ManualIntervention` based on the auto-recovery feature flag | | **PowerCycle** | `false` | `recoverable` | `chassisd` issuing power-cycle; `reset_count` incremented | | **ManualIntervention** | `false` | `recoverable` | DPU down; `FEATURE|dpu-auto-recovery` `state` is `disabled` / `always_disabled`; no power-cycle issued; awaits operator action | | **Offline** | `false` | `recoverable` | `oper_status: Offline` | @@ -276,43 +273,18 @@ stateDiagram-v2 ## DPU Software Failures ## -### Process restart on DPU ### +### Process crash/restart on DPU ### **Description:** -When any process crashes on the DPU, but the container supervisor successfully restarts the process and the DPU recovers on its own within `dpu_auto_recovery_timeout` (default: 60 seconds). No power-cycle is needed. +Any process crashes on the DPU and `dpu_control_plane_state` transitions to `down`. `chassisd` does not wait for the container supervisor to self-heal — it issues a DPU power-cycle as soon as it observes `dpu_control_plane_state: down` on its next poll. **Detection (by PMON):** - `chassisd` on the NPU polls `dpu_control_plane_state` every 10 seconds and observes it as `down`. **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time` for the corresponding DPU. -- `chassisd` waits up to `dpu_auto_recovery_timeout` (60 seconds) for the DPU to self-recover. -- Once `dpu_control_plane_state` transitions back to `up`, `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. 
- -**DB State Transition:** - -| DB Field | Before | During Failure | After Recovery | -| -------- | :----: | :------------: | :------------: | -| `dpu_control_plane_state` | `up` | `down` | `up` | -| `ready_status` | `true` | `false` | `true` | -| `last_down_time` | — | `` | — | -| `last_ready_time` | — | — | `` | - ---- - -### Process persistently down on DPU ### - -**Description:** -When any process crashes on the DPU and **remains down beyond `dpu_auto_recovery_timeout`** (i.e., the container supervisor cannot successfully restart it, or the process keeps crash-looping). Unlike a transient restart, this scenario indicates a persistent failure that requires a DPU power-cycle to recover. - -**Detection (by PMON):** -- `chassisd` on the NPU polls `dpu_control_plane_state` every 10 seconds and observes it as `down`. -- State remains `down` beyond `dpu_auto_recovery_timeout` (default: 60 seconds). - -**PMON Action:** -- `chassisd` sets `ready_status` to `false` and updates `last_down_time` for the corresponding DPU. -- When `dpu_auto_recovery_timeout` (default: 60 seconds, measured from the time failure is first detected) expires without recovery, `chassisd` issues a power-cycle of the DPU and increments `reset_count`. -- Once `dpu_control_plane_state` transitions back to `up`, `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. +- `chassisd` immediately issues a power-cycle of the DPU and increments `reset_count`. +- After the DPU comes back, `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. - If `reset_count` reaches `reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. - **When auto-recovery is disabled:** `chassisd` skips the power-cycle. 
The DPU remains in **ManualIntervention** with `ready_status: false`; operator must reset the DPU manually. @@ -389,7 +361,7 @@ A DPU completely fails due to hardware fault, thermal event, or unrecoverable er **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time` for the corresponding DPU. -- `chassisd` power-cycles the DPU **immediately** (skipping `dpu_auto_recovery_timeout` — the DPU is already confirmed non-functional via `oper_status: Offline`) and increments `reset_count`. +- `chassisd` immediately power-cycles the DPU (the DPU is already confirmed non-functional via `oper_status: Offline`) and increments `reset_count`. - After power-cycle, DPU goes through full boot sequence: midplane attach → PCIe rescan → SONiC boot → container startup. - `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. - If `reset_count` reaches `reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. @@ -419,7 +391,7 @@ The DPU loses power unexpectedly or shuts down without graceful notification (e. **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time`. -- `chassisd` power-cycles the DPU **immediately** (skipping `dpu_auto_recovery_timeout` — midplane and control plane are already confirmed down) and increments `reset_count`. +- `chassisd` immediately power-cycles the DPU (midplane and control plane are already confirmed down) and increments `reset_count`. - After power-cycle, `chassisd` verifies all DPU states, sets `ready_status` back to `true`, and updates `last_ready_time`. - **When auto-recovery is disabled:** `chassisd` skips the power-cycle. The DPU remains in **ManualIntervention** with `ready_status: false`; operator must trigger recovery. 
@@ -447,7 +419,7 @@ The PCIe bus between the NPU and a local DPU fails, making the DPU unreachable f **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time`. -- `chassisd` power-cycles the DPU **immediately** (skipping `dpu_auto_recovery_timeout` — midplane link loss and PCIe detach confirm the DPU is unreachable) and increments `reset_count`. +- `chassisd` immediately power-cycles the DPU (midplane link loss and PCIe detach confirm the DPU is unreachable) and increments `reset_count`. - After power-cycle, PCIe rescan is performed: - Platform vendor API: `pci_reattach()` (provided by `sonic_platform`). - `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. @@ -610,7 +582,7 @@ Planned reboot of the entire SmartSwitch (NPU + all DPUs) via CLI: `reboot`. All | DPU up after crash | up | up | true | Set `ready_status=true` after verifying all states | | DPU stuck (lost connectivity) | down | down | false | Power-cycle DPU; increment `reset_count` | | DPU up after losing connectivity / reboot | up | up | true | Set `ready_status=true` after verifying all states | -| DPU control plane restart – critical services | down → up | up | false → true | Wait for auto-recovery; set `ready_status=true` on recovery | +| DPU control plane restart – critical services | down → up | up | false → true | Power-cycle DPU; increment `reset_count`; set `ready_status=true` on recovery | | NPU/DPU OS upgrade | down → up | up | false → true | Re-poll DPU states on NPU recovery | | DPU dead – power cycle | down | down | false | Power-cycle DPU; increment `reset_count` | | DPU dead – unrecoverable | down | down | false | `reset_count` reached `reset_limit`; `recovery_status` set to `"unrecoverable"`; raise alert | From a91d038650083b7b79a018003d2534ab7b2479d9 Mon Sep 17 00:00:00 2001 From: Vasundhara Volam Date: Tue, 26 May 2026 22:31:40 +0000 Subject: [PATCH 05/14] 
[doc][smart-switch][pmon]: Address review comments on DPU robustness HLD - Clarify databasedpu crash detection: chassisd detects indirectly via dpu_control_plane_state going down, not by monitoring databasedpuN Redis instances directly. - Add auto-recovery trigger disambiguation: document that chassisd checks state_transition_in_progress before triggering recovery, skipping auto-recovery during planned shutdown/reboot operations. - Add Testing section with sonic-mgmt test plan covering all failure mode scenarios (8 test classes) and test infrastructure details. Signed-off-by: Vasundhara Volam --- .../pmon/enhance-dpu-robustness.md | 44 +++++++++++++++++-- 1 file changed, 40 insertions(+), 4 deletions(-) diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md index f925f75e2f1..0478376de65 100644 --- a/doc/smart-switch/pmon/enhance-dpu-robustness.md +++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md @@ -28,6 +28,7 @@ - [DPU Cold Reboot](#dpu-cold-reboot) - [Full SmartSwitch Reboot](#full-smartswitch-reboot) - [Scenario DB State Summary](#scenario-db-state-summary) +- [Testing](#testing) - [Repository Change Summary](#repository-change-summary) - [References](#references) @@ -133,6 +134,8 @@ All timers and thresholds used by PMON for DPU failure detection and recovery ar > **Note:** `chassisd` polls `dpu_data_plane_state` alongside `dpu_control_plane_state` and `dpu_midplane_link_state`, but `dpu_data_plane_state` alone does not trigger recovery actions. A data-plane-down with control-plane-up scenario indicates that the DPU SONiC stack is running but the data plane pipeline has not converged — this is expected during initial programming or after a configuration change. Recovery is triggered only when `dpu_control_plane_state` or `dpu_midplane_link_state` transitions to `down`. The `dpu_data_plane_state` is used by `chassisd` solely to determine full DPU readiness for setting `ready_status` to `true`. 
+> **Auto-recovery trigger vs. planned operations:** `chassisd` auto-recovery is triggered **only** for unplanned failures. During planned operations (graceful shutdown via `config chassis module shutdown` or DPU reboot via `reboot -d`), the `state_transition_in_progress` field in `CHASSIS_MODULE_TABLE|DPU` is set to `True` **before** the DPU control plane goes down. When `chassisd` observes `dpu_control_plane_state: down`, it checks `state_transition_in_progress`: if `True`, `chassisd` skips auto-recovery because the shutdown/reboot is intentional. Auto-recovery is only initiated when `dpu_control_plane_state` or `dpu_midplane_link_state` transitions to `down` **and** no planned transition is in progress (`state_transition_in_progress == False`). There is no additional timeout configured for this check — the distinction is purely flag-based. + --- ## DPU Status DB Info ## @@ -330,22 +333,30 @@ The `pmon` (Platform Monitor) daemon on the NPU crashes. This is a **critical** ### databasedpu crash on NPU ### **Description:** -The `databasedpu` (per-DPU Redis database instance) on the NPU crashes. Each DPU has a dedicated Redis instance on the NPU (port 6381 + DPU ID, bound to midplane bridge IP 169.254.200.254). +The `databasedpu` (per-DPU Redis database instance) on the NPU crashes. Each DPU has a dedicated Redis instance on the NPU (port 6381 + DPU ID, bound to midplane bridge IP 169.254.200.254). These per-DPU Redis instances host the DPU's APPL_DB, CONFIG_DB, STATE_DB, etc. — they are **not** the same as CHASSIS_STATE_DB. The DPU's `orchagent` and `syncd` read/write from these instances, while `chassisd` monitors DPU health via CHASSIS_STATE_DB (a separate Redis instance on the NPU). **Detection (by PMON):** -- `chassisd` cannot read DPU state from the corresponding Redis instance. +- `chassisd` does **not** directly detect the `databasedpuN` service crash. The detection is indirect: + 1. 
When `databasedpuN` crashes, the DPU's `orchagent` and other services lose DB connectivity. + 2. Critical services on the DPU fail, causing `SYSTEM_READY` on the DPU to go `false`. + 3. The DPU updates its `dpu_control_plane_state` to `down` in CHASSIS_STATE_DB (which is a separate Redis instance and remains accessible). + 4. `chassisd` observes `dpu_control_plane_state: down` on its next poll cycle. +- If the `databasedpuN` crash is caused by a midplane failure, then CHASSIS_STATE_DB also becomes inaccessible from the DPU side, and `chassisd` detects the failure via `dpu_midplane_link_state: down` instead. **PMON Action:** -- `chassisd` detects loss of DPU state, sets `ready_status` to `false`, and updates `last_down_time`. -- After `systemd` restarts the Redis instance and DPU reconnects, `chassisd` polls DPU state, sets `ready_status` back to `true`, and updates `last_ready_time` once all states are verified. +- `chassisd` sets `ready_status` to `false` and updates `last_down_time`. +- If `dpu_control_plane_state` or `dpu_midplane_link_state` is observed as `down`, `chassisd` initiates a DPU power-cycle (same as any other unplanned failure). +- After `systemd` restarts the Redis instance and the DPU recovers (or after power-cycle recovery), `chassisd` polls DPU state, sets `ready_status` back to `true`, and updates `last_ready_time` once all states are verified. **DB State Transition:** | DB Field | Before | During Failure | After Recovery | | -------- | :----: | :------------: | :------------: | +| `dpu_control_plane_state` | `up` | `down` | `up` | | `ready_status` | `true` | `false` | `true` | | `last_down_time` | — | `` | — | | `last_ready_time` | — | — | `` | +| `reset_count` | N | N | N+1 | --- @@ -590,12 +601,37 @@ Planned reboot of the entire SmartSwitch (NPU + all DPUs) via CLI: `reboot`. All --- +## Testing ## + +All DPU failure mode tests run with **auto-recovery enabled** by default (the production configuration). 
Specific tests explicitly disable and re-enable auto-recovery to validate the `ManualIntervention` path. The existing SmartSwitch test suites (e.g., `test_reload_dpu`) continue to run unmodified — auto-recovery does not interfere with planned reboot/shutdown tests because `chassisd` checks `state_transition_in_progress` before triggering recovery. + +Test implementation: [`tests/smartswitch/platform_tests/test_dpu_failure_modes.py`](https://github.com/sonic-net/sonic-mgmt/blob/master/tests/smartswitch/platform_tests/test_dpu_failure_modes.py) + +| Test Class | Scenario | Validates | +| ---------- | -------- | --------- | +| `TestDatabaseDpuCrash` | Kill per-DPU Redis instance (`databasedpuN`) on NPU | `chassisd` detects loss (`ready_status=false`), `systemd` restarts service, `chassisd` recovers (`ready_status=true`) | +| `TestPcieFailure` | Remove DPU PCIe device via sysfs | `pcied` detects detach (`PCIE_DETACH_INFO dpu_state=detached`), `chassisd` marks DPU not-ready, power-cycles, performs PCIe rescan, recovers | +| `TestControlPlaneOnlyDown` | Stop critical container (`swss`) on DPU | `dpu_control_plane_state=down` while midplane stays up; `chassisd` detects and power-cycles DPU | +| `TestAutoRecoveryDisabled` | Disable `FEATURE\|dpu-auto-recovery`, trigger failure | Confirms `chassisd` does NOT power-cycle (ManualIntervention); re-enable and verify recovery | +| `TestUnrecoverableState` | Repeatedly trigger failures until `reset_count` ≥ `reset_limit` | `recovery_status=unrecoverable`; `chassisd` stops retrying | +| `TestStateMachineTransitions` | Planned shutdown → offline → startup → ready | `last_down_time` and `last_ready_time` updated correctly; `recovery_status` stays `recoverable` | +| `TestShutdownDuringAutoRecovery` | Issue module shutdown while `chassisd` is mid-recovery | `chassisd` aborts auto-recovery, DPU transitions to Offline cleanly | +| `TestDpuFailureAfterConfigReload` | Config reload on NPU, then trigger DPU failure | `chassisd` recovery 
works post-reload; `reset_count` increments | + +**Test infrastructure:** +- Shared `ensure_all_dpus_ready` fixture (in `tests/smartswitch/conftest.py`) ensures all testable DPUs are admin-up, online, and DB-ready before each test, and recovers any offline DPUs in teardown. +- Tests use `assert_dpu_db_state_ready()` helper to verify full DPU readiness (`ready_status=true`, `recovery_status=recoverable`, all planes up). +- Topology: `smartswitch` — requires NPU DUT with DPU SSH access via `dpuhosts`. + +--- + ## Repository Change Summary ## | Repository | Component | Changes | | ---------- | --------- | ------- | | [sonic-platform-daemons](https://github.com/sonic-net/sonic-platform-daemons) | `chassisd` | DPU failure detection, automated power-cycle recovery, new CHASSIS_STATE_DB fields (`ready_status`, `recovery_status`, `reset_count`, `last_down_time`, `last_ready_time`) | | [sonic-buildimage](https://github.com/sonic-net/sonic-buildimage) | PMON container | Configuration updates for new `chassisd` failure recovery features | +| [sonic-mgmt](https://github.com/sonic-net/sonic-mgmt) | --- | DPU failure mode tests (`test_dpu_failure_modes.py`) | --- From be62b4d229add0a2a5cb16db6544c74bc8229481 Mon Sep 17 00:00:00 2001 From: Vasundhara Volam Date: Thu, 28 May 2026 04:29:17 +0000 Subject: [PATCH 06/14] [doc][smart-switch][pmon]: Address review round 2 on DPU robustness HLD - Clarify recovery timing: power-cycle triggered on same poll cycle that detects failure (no additional timeout beyond 10s poll interval). - Add CLI section: show chassis modules status extended to display ready_status, recovery_status, reset_count, last_down_time, last_ready_time from CHASSIS_STATE_DB. - Fix PCIe failure recovery: since DPU is already offline, chassisd updates the status and does not power-cycle. 
Signed-off-by: Vasundhara Volam --- .../pmon/enhance-dpu-robustness.md | 32 +++++++++++++++---- 1 file changed, 26 insertions(+), 6 deletions(-) diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md index 0478376de65..6ec2aae1aa2 100644 --- a/doc/smart-switch/pmon/enhance-dpu-robustness.md +++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md @@ -28,6 +28,7 @@ - [DPU Cold Reboot](#dpu-cold-reboot) - [Full SmartSwitch Reboot](#full-smartswitch-reboot) - [Scenario DB State Summary](#scenario-db-state-summary) +- [CLI](#cli) - [Testing](#testing) - [Repository Change Summary](#repository-change-summary) - [References](#references) @@ -286,7 +287,7 @@ Any process crashes on the DPU and `dpu_control_plane_state` transitions to `dow **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time` for the corresponding DPU. -- `chassisd` immediately issues a power-cycle of the DPU and increments `reset_count`. +- On the **same poll cycle** that detects `dpu_control_plane_state: down`, `chassisd` issues a power-cycle of the DPU and increments `reset_count`. There is no additional timeout or grace period — the only detection latency is the 10-second poll interval itself. - After the DPU comes back, `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. - If `reset_count` reaches `reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. - **When auto-recovery is disabled:** `chassisd` skips the power-cycle. The DPU remains in **ManualIntervention** with `ready_status: false`; operator must reset the DPU manually. @@ -430,11 +431,7 @@ The PCIe bus between the NPU and a local DPU fails, making the DPU unreachable f **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time`. 
-- `chassisd` immediately power-cycles the DPU (midplane link loss and PCIe detach confirm the DPU is unreachable) and increments `reset_count`. -- After power-cycle, PCIe rescan is performed: - - Platform vendor API: `pci_reattach()` (provided by `sonic_platform`). -- `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. -- **When auto-recovery is disabled:** `chassisd` skips both the power-cycle and the PCIe reattach. `PCIE_DETACH_INFO|DPU|dpu_state` remains `detached` and the DPU stays in **ManualIntervention** until the operator triggers recovery. +- If the DPU is already offline/unreachable (midplane link down, PCIe detached), `chassisd` does **not** issue an immediate power-cycle. `chassisd` sets `ready_status` to `false` and updates `last_down_time`. **DB State Transition:** @@ -601,6 +598,29 @@ Planned reboot of the entire SmartSwitch (NPU + all DPUs) via CLI: `reboot`. All --- +## CLI ## + +The new DPU recovery state fields are exposed to operators via the existing `show chassis modules status` command, extended to include recovery information: + +``` +admin@sonic:~$ show chassis modules status + Name Description Physical-Slot Oper-Status Admin-Status Ready-Status Recovery-Status Reset-Count Last-Down-Time Last-Ready-Time +------ ------------- --------------- ------------- -------------- -------------- ----------------- ------------- -------------------------- -------------------------- + DPU0 DPU Module 0 1 Online up true recoverable 2 2026-05-28 10:15:30 UTC 2026-05-28 10:18:45 UTC + DPU1 DPU Module 1 2 Online up true recoverable 0 — 2026-05-28 09:00:12 UTC + DPU2 DPU Module 2 3 Offline down false unrecoverable 5 2026-05-28 11:02:00 UTC — +``` + +| Column | Source DB Field | Description | +| ------ | --------------- | ----------- | +| `Ready-Status` | `CHASSIS_STATE_DB: DPU_STATE\|DPU: ready_status` | Whether the DPU is fully up and serving traffic | +| 
`Recovery-Status` | `CHASSIS_STATE_DB: DPU_STATE\|DPU: recovery_status` | `recoverable` or `unrecoverable` | +| `Reset-Count` | `CHASSIS_STATE_DB: DPU_STATE\|DPU: reset_count` | Number of unplanned power-cycles since last `chassisd` restart | +| `Last-Down-Time` | `CHASSIS_STATE_DB: DPU_STATE\|DPU: last_down_time` | UTC timestamp of last DPU failure | +| `Last-Ready-Time` | `CHASSIS_STATE_DB: DPU_STATE\|DPU: last_ready_time` | UTC timestamp of last successful DPU recovery | + +--- + ## Testing ## All DPU failure mode tests run with **auto-recovery enabled** by default (the production configuration). Specific tests explicitly disable and re-enable auto-recovery to validate the `ManualIntervention` path. The existing SmartSwitch test suites (e.g., `test_reload_dpu`) continue to run unmodified — auto-recovery does not interfere with planned reboot/shutdown tests because `chassisd` checks `state_transition_in_progress` before triggering recovery. From ff62ae3df655e6fa49e8fe7459795298455363a6 Mon Sep 17 00:00:00 2001 From: Vasundhara Volam Date: Thu, 28 May 2026 17:28:37 +0000 Subject: [PATCH 07/14] [doc][smart-switch][pmon]: Address review round 3 on DPU robustness HLD - Add dpu_boot_timeout timer (300s default) to handle stuck-boot scenarios - Update state machine: Booting transitions to PowerCycle/ManualIntervention on timeout - Split CLI: keep 'show chassis modules status' lean, add 'show chassis modules recovery' - Include Ready-Status in both CLI outputs - Rename reset_limit to dpu_reset_limit for consistency Signed-off-by: Vasundhara Volam --- .../pmon/enhance-dpu-robustness.md | 49 ++++++++++++------- 1 file changed, 32 insertions(+), 17 deletions(-) diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md index 6ec2aae1aa2..ba3ef3e0df5 100644 --- a/doc/smart-switch/pmon/enhance-dpu-robustness.md +++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md @@ -131,7 +131,8 @@ All timers and thresholds used by PMON 
for DPU failure detection and recovery ar | `chassisd` health poll interval | 10 seconds | No | `chassisd` | Interval at which `chassisd` polls `dpu_control_plane_state`, `dpu_data_plane_state`, and `dpu_midplane_link_state`. As soon as `dpu_control_plane_state` or `dpu_midplane_link_state` is observed as `down`, `chassisd` initiates a DPU power-cycle (no self-heal grace period). | | `pcied` PCIe poll interval | 60 seconds | No | `pcied` | Interval at which `pcied` checks PCIe link status for all DPUs. A PCIe failure may go undetected for up to 60 seconds. | | `dpu_halt_services_timeout` | 60 seconds | Yes (`platform.json`) | `gnoi_reboot_daemon.py` | Maximum time to wait for DPU services to halt gracefully during reboot/shutdown | -| `reset_limit` | 5 | Yes (`platform.json`) | `chassisd` | Maximum number of consecutive unplanned power-cycle attempts before marking DPU as unrecoverable | +| `dpu_boot_timeout` | 300 seconds | Yes (`platform.json`) | `chassisd` | Maximum time to wait for a DPU to reach `Ready` state (all planes up) after a power-cycle. If the DPU does not become ready within this timeout, `chassisd` treats it as a boot failure and initiates another power-cycle (incrementing `reset_count`). Covers broken-image and stuck-boot scenarios. | +| `dpu_reset_limit` | 5 | Yes (`platform.json`) | `chassisd` | Maximum number of consecutive unplanned power-cycle attempts before marking DPU as unrecoverable | > **Note:** `chassisd` polls `dpu_data_plane_state` alongside `dpu_control_plane_state` and `dpu_midplane_link_state`, but `dpu_data_plane_state` alone does not trigger recovery actions. A data-plane-down with control-plane-up scenario indicates that the DPU SONiC stack is running but the data plane pipeline has not converged — this is expected during initial programming or after a configuration change. Recovery is triggered only when `dpu_control_plane_state` or `dpu_midplane_link_state` transitions to `down`. 
The `dpu_data_plane_state` is used by `chassisd` solely to determine full DPU readiness for setting `ready_status` to `true`. @@ -204,7 +205,7 @@ DPU_STATE|DPU: | Field | Description | Set by | Cleared by | | ----- | ----------- | ------ | ---------- | | `ready_status` | Set to `"true"` when the DPU is fully up and ready (midplane, control plane, data plane all up). Set to `"false"` when the DPU goes down or undergoes a reset. | `chassisd` | `chassisd` (set to `"false"` on failure/reset) | -| `recovery_status` | Set to `"recoverable"` on initialization. Set to `"unrecoverable"` when `reset_count` reaches `reset_limit`. | `chassisd` | `chassisd` (reset to `"recoverable"` on planned restart) | +| `recovery_status` | Set to `"recoverable"` on initialization. Set to `"unrecoverable"` when `reset_count` reaches `dpu_reset_limit`. | `chassisd` | `chassisd` (reset to `"recoverable"` on planned restart) | | `reset_count` | Number of unplanned DPU resets. Reset to 0 on `chassisd` reset on NPU (e.g., NPU reboot, `pmon` restart). 
| `chassisd` | `chassisd` | | `last_down_time` | UTC timestamp of the last time the DPU went down | `chassisd` | — | | `last_ready_time` | UTC timestamp of the last time the DPU became ready | `chassisd` | — | @@ -239,6 +240,8 @@ stateDiagram-v2 [*] --> Booting : DPU power on Booting --> Ready : All states up + Booting --> PowerCycle : boot timeout expired [auto-recovery enabled] + Booting --> ManualIntervention : boot timeout expired [auto-recovery disabled] Ready --> SWFailure : Control plane down SWFailure --> PowerCycle : auto-recovery enabled @@ -265,13 +268,13 @@ stateDiagram-v2 | State | `ready_status` | `recovery_status` | Key DB Indicators | | ----- | :------------: | :----------------: | ----------------- | -| **Booting** | `false` | `recoverable` | `dpu_control_plane_state: down` | +| **Booting** | `false` | `recoverable` | `dpu_control_plane_state: down`; `chassisd` starts `dpu_boot_timeout` timer — if DPU does not reach Ready before timeout, triggers PowerCycle (or ManualIntervention if auto-recovery disabled) | | **Ready** | `true` | `recoverable` | All three states `up` | | **SWFailure** | `false` | `recoverable` | `dpu_control_plane_state: down`, `dpu_midplane_link_state: up`; transient state before `chassisd` selects `PowerCycle` or `ManualIntervention` based on the auto-recovery feature flag | | **PowerCycle** | `false` | `recoverable` | `chassisd` issuing power-cycle; `reset_count` incremented | | **ManualIntervention** | `false` | `recoverable` | DPU down; `FEATURE|dpu-auto-recovery` `state` is `disabled` / `always_disabled`; no power-cycle issued; awaits operator action | | **Offline** | `false` | `recoverable` | `oper_status: Offline` | -| **Unrecoverable** | `false` | `unrecoverable` | `reset_count` ≥ `reset_limit` | +| **Unrecoverable** | `false` | `unrecoverable` | `reset_count` ≥ `dpu_reset_limit` | --- @@ -289,7 +292,7 @@ Any process crashes on the DPU and `dpu_control_plane_state` transitions to `dow - `chassisd` sets `ready_status` to 
`false` and updates `last_down_time` for the corresponding DPU. - On the **same poll cycle** that detects `dpu_control_plane_state: down`, `chassisd` issues a power-cycle of the DPU and increments `reset_count`. There is no additional timeout or grace period — the only detection latency is the 10-second poll interval itself. - After the DPU comes back, `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. -- If `reset_count` reaches `reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. +- If `reset_count` reaches `dpu_reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. - **When auto-recovery is disabled:** `chassisd` skips the power-cycle. The DPU remains in **ManualIntervention** with `ready_status: false`; operator must reset the DPU manually. **DB State Transition:** @@ -301,7 +304,7 @@ Any process crashes on the DPU and `dpu_control_plane_state` transitions to `dow | `last_down_time` | — | `` | — | | `last_ready_time` | — | — | `` | | `reset_count` | N | N | N+1 | -| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `reset_limit`) | +| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) | --- @@ -376,7 +379,7 @@ A DPU completely fails due to hardware fault, thermal event, or unrecoverable er - `chassisd` immediately power-cycles the DPU (the DPU is already confirmed non-functional via `oper_status: Offline`) and increments `reset_count`. - After power-cycle, DPU goes through full boot sequence: midplane attach → PCIe rescan → SONiC boot → container startup. - `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. 
-- If `reset_count` reaches `reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. +- If `reset_count` reaches `dpu_reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. - **When auto-recovery is disabled:** `chassisd` skips the immediate power-cycle. The DPU remains in **ManualIntervention** with `oper_status: Offline` and `ready_status: false`; operator must trigger recovery. **DB State Transition:** @@ -388,7 +391,7 @@ A DPU completely fails due to hardware fault, thermal event, or unrecoverable er | `last_down_time` | — | `` | — | | `last_ready_time` | — | — | `` | | `reset_count` | N | N | N+1 | -| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `reset_limit`) | +| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) | --- @@ -575,7 +578,7 @@ Planned reboot of the entire SmartSwitch (NPU + all DPUs) via CLI: `reboot`. All **Error handling:** - If a DPU does not respond to gNOI Reboot RPC within the timeout: NPU proceeds with PCIe detach and continues the reboot. The unresponsive DPU is cold-booted on NPU recovery. -- If a DPU fails to come back after the full switch reboot: `chassisd` retries power-cycle up to `reset_limit` (tracked via `reset_count`). If still unresponsive, `chassisd` sets `recovery_status` to `"unrecoverable"`. +- If a DPU fails to come back after the full switch reboot: `chassisd` retries power-cycle up to `dpu_reset_limit` (tracked via `reset_count`). If still unresponsive, `chassisd` sets `recovery_status` to `"unrecoverable"`. - If the NPU reboot is initiated while a DPU graceful shutdown is in progress: the graceful shutdown completes first, then the NPU reboot proceeds. --- @@ -593,22 +596,34 @@ Planned reboot of the entire SmartSwitch (NPU + all DPUs) via CLI: `reboot`. 
All | DPU control plane restart – critical services | down → up | up | false → true | Power-cycle DPU; increment `reset_count`; set `ready_status=true` on recovery | | NPU/DPU OS upgrade | down → up | up | false → true | Re-poll DPU states on NPU recovery | | DPU dead – power cycle | down | down | false | Power-cycle DPU; increment `reset_count` | -| DPU dead – unrecoverable | down | down | false | `reset_count` reached `reset_limit`; `recovery_status` set to `"unrecoverable"`; raise alert | +| DPU dead – unrecoverable | down | down | false | `reset_count` reached `dpu_reset_limit`; `recovery_status` set to `"unrecoverable"`; raise alert | | Full SmartSwitch reboot (planned) | down → up | down → up | false → true | gNOI halt; power-cycle; re-verify | --- ## CLI ## -The new DPU recovery state fields are exposed to operators via the existing `show chassis modules status` command, extended to include recovery information: +The existing `show chassis modules status` command is extended to include a `Ready-Status` column: ``` admin@sonic:~$ show chassis modules status - Name Description Physical-Slot Oper-Status Admin-Status Ready-Status Recovery-Status Reset-Count Last-Down-Time Last-Ready-Time ------- ------------- --------------- ------------- -------------- -------------- ----------------- ------------- -------------------------- -------------------------- - DPU0 DPU Module 0 1 Online up true recoverable 2 2026-05-28 10:15:30 UTC 2026-05-28 10:18:45 UTC - DPU1 DPU Module 1 2 Online up true recoverable 0 — 2026-05-28 09:00:12 UTC - DPU2 DPU Module 2 3 Offline down false unrecoverable 5 2026-05-28 11:02:00 UTC — + Name Description Physical-Slot Oper-Status Admin-Status Serial Ready-Status +------ ---------------------- --------------- ------------- -------------- ------------ -------------- + DPU0 N/A Online up true + DPU1 N/A Online up true + ... 
+``` + +A new `show chassis modules recovery` command exposes detailed recovery state: + +``` +admin@sonic:~$ show chassis modules recovery + Name Ready-Status Recovery-Status Reset-Count Last-Down-Time Last-Ready-Time +------ -------------- ----------------- ------------- -------------------------- -------------------------- + DPU0 true recoverable 2 2026-05-28 10:15:30 UTC 2026-05-28 10:18:45 UTC + DPU1 true recoverable 0 — 2026-05-28 09:00:12 UTC + DPU2 false unrecoverable 5 2026-05-28 11:02:00 UTC — + .. ``` | Column | Source DB Field | Description | @@ -633,7 +648,7 @@ Test implementation: [`tests/smartswitch/platform_tests/test_dpu_failure_modes.p | `TestPcieFailure` | Remove DPU PCIe device via sysfs | `pcied` detects detach (`PCIE_DETACH_INFO dpu_state=detached`), `chassisd` marks DPU not-ready, power-cycles, performs PCIe rescan, recovers | | `TestControlPlaneOnlyDown` | Stop critical container (`swss`) on DPU | `dpu_control_plane_state=down` while midplane stays up; `chassisd` detects and power-cycles DPU | | `TestAutoRecoveryDisabled` | Disable `FEATURE\|dpu-auto-recovery`, trigger failure | Confirms `chassisd` does NOT power-cycle (ManualIntervention); re-enable and verify recovery | -| `TestUnrecoverableState` | Repeatedly trigger failures until `reset_count` ≥ `reset_limit` | `recovery_status=unrecoverable`; `chassisd` stops retrying | +| `TestUnrecoverableState` | Repeatedly trigger failures until `reset_count` ≥ `dpu_reset_limit` | `recovery_status=unrecoverable`; `chassisd` stops retrying | | `TestStateMachineTransitions` | Planned shutdown → offline → startup → ready | `last_down_time` and `last_ready_time` updated correctly; `recovery_status` stays `recoverable` | | `TestShutdownDuringAutoRecovery` | Issue module shutdown while `chassisd` is mid-recovery | `chassisd` aborts auto-recovery, DPU transitions to Offline cleanly | | `TestDpuFailureAfterConfigReload` | Config reload on NPU, then trigger DPU failure | `chassisd` recovery works 
post-reload; `reset_count` increments | From 57283fe8ffa01e0d3deb08b2d694e7a2ea2b1f64 Mon Sep 17 00:00:00 2001 From: Vasundhara Volam Date: Thu, 28 May 2026 18:25:03 +0000 Subject: [PATCH 08/14] [doc][smart-switch][pmon]: Fix contradictions and add missing corner cases - Fix PCIe Failure: chassisd does power-cycle (midplane down implies recovery) - Fix pmon crash: document reset_count reset to 0 on chassisd restart - Fix DPU Power Failure: add recovery_status and dpu_reset_limit handling - Clarify SWFailure: both planes down = HW failure path (skips SWFailure) - Add boot timeout note for planned reboot scenarios - Add state machine edges: Booting->Offline, Unrecoverable->Booting via operator - Clarify recovery_status clear: chassisd restart or operator module startup - Add syslog warning for stuck data plane (control up, data down) - Add multi-DPU sequential recovery note with optional parallel flag - Add race condition: shutdown during Booting cancels timer, goes to Offline - Change dpu_reset_limit default from 5 to 2 - Clarify data plane warning applies only during Booting state - Fix Ready->data-plane-down: ready_status set to false (no recovery action) - Add recovery_status to databasedpu crash DB state table - Fix scenario summary: midplane transitions down->up during initial boot Signed-off-by: Vasundhara Volam --- .../pmon/enhance-dpu-robustness.md | 44 +++++++++++++------ 1 file changed, 30 insertions(+), 14 deletions(-) diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md index ba3ef3e0df5..dab85e22c72 100644 --- a/doc/smart-switch/pmon/enhance-dpu-robustness.md +++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md @@ -132,9 +132,9 @@ All timers and thresholds used by PMON for DPU failure detection and recovery ar | `pcied` PCIe poll interval | 60 seconds | No | `pcied` | Interval at which `pcied` checks PCIe link status for all DPUs. A PCIe failure may go undetected for up to 60 seconds. 
| | `dpu_halt_services_timeout` | 60 seconds | Yes (`platform.json`) | `gnoi_reboot_daemon.py` | Maximum time to wait for DPU services to halt gracefully during reboot/shutdown | | `dpu_boot_timeout` | 300 seconds | Yes (`platform.json`) | `chassisd` | Maximum time to wait for a DPU to reach `Ready` state (all planes up) after a power-cycle. If the DPU does not become ready within this timeout, `chassisd` treats it as a boot failure and initiates another power-cycle (incrementing `reset_count`). Covers broken-image and stuck-boot scenarios. | -| `dpu_reset_limit` | 5 | Yes (`platform.json`) | `chassisd` | Maximum number of consecutive unplanned power-cycle attempts before marking DPU as unrecoverable | +| `dpu_reset_limit` | 2 | Yes (`platform.json`) | `chassisd` | Maximum number of consecutive unplanned power-cycle attempts before marking DPU as unrecoverable | -> **Note:** `chassisd` polls `dpu_data_plane_state` alongside `dpu_control_plane_state` and `dpu_midplane_link_state`, but `dpu_data_plane_state` alone does not trigger recovery actions. A data-plane-down with control-plane-up scenario indicates that the DPU SONiC stack is running but the data plane pipeline has not converged — this is expected during initial programming or after a configuration change. Recovery is triggered only when `dpu_control_plane_state` or `dpu_midplane_link_state` transitions to `down`. The `dpu_data_plane_state` is used by `chassisd` solely to determine full DPU readiness for setting `ready_status` to `true`. +> **Note:** `chassisd` polls `dpu_data_plane_state` alongside `dpu_control_plane_state` and `dpu_midplane_link_state`, but `dpu_data_plane_state` alone does not trigger recovery actions. A data-plane-down with control-plane-up scenario indicates that the DPU SONiC stack is running but the data plane pipeline has not converged — this is expected during initial programming or after a configuration change. 
Recovery is triggered only when `dpu_control_plane_state` or `dpu_midplane_link_state` transitions to `down`. The `dpu_data_plane_state` is used by `chassisd` solely to determine full DPU readiness for setting `ready_status` to `true`. During the **Booting** state, if `dpu_control_plane_state` transitions to `up` but `dpu_data_plane_state` remains `down` when `dpu_boot_timeout` expires, `chassisd` logs a WARNING-level syslog (`DPU: data plane not up after s`) but does **not** trigger a power-cycle. The `ready_status` remains `false` until the operator investigates or the data plane recovers. This warning only applies during boot — if a DPU is already in **Ready** state and `dpu_data_plane_state` drops to `down` while control plane stays `up`, `chassisd` sets `ready_status` to `false` but does not trigger a power-cycle or timeout. The DPU stays operational (no recovery action) until the data plane recovers or control plane also goes down. > **Auto-recovery trigger vs. planned operations:** `chassisd` auto-recovery is triggered **only** for unplanned failures. During planned operations (graceful shutdown via `config chassis module shutdown` or DPU reboot via `reboot -d`), the `state_transition_in_progress` field in `CHASSIS_MODULE_TABLE|DPU` is set to `True` **before** the DPU control plane goes down. When `chassisd` observes `dpu_control_plane_state: down`, it checks `state_transition_in_progress`: if `True`, `chassisd` skips auto-recovery because the shutdown/reboot is intentional. Auto-recovery is only initiated when `dpu_control_plane_state` or `dpu_midplane_link_state` transitions to `down` **and** no planned transition is in progress (`state_transition_in_progress == False`). There is no additional timeout configured for this check — the distinction is purely flag-based. 
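The flag-based distinction between unplanned failures and planned operations can be sketched as a single predicate. This is a minimal illustration, not the `chassisd` implementation: the function name and the boolean representation of `state_transition_in_progress` are assumptions, and the `FEATURE|dpu-auto-recovery` check is deliberately left out.

```python
def should_auto_recover(control_plane: str, midplane: str, data_plane: str,
                        state_transition_in_progress: bool) -> bool:
    """Return True when an unplanned failure should start auto-recovery.

    Hypothetical helper; the field values mirror the DB states in the text.
    """
    # A planned shutdown/reboot sets the flag before the control plane drops,
    # so a "down" observation is intentional and must not trigger recovery.
    if state_transition_in_progress:
        return False
    # data_plane is accepted but ignored on purpose: per the note above it
    # only feeds ready_status and never triggers recovery by itself.
    return control_plane == "down" or midplane == "down"
```

Under these assumptions, a data-plane-only outage returns `False`, as does any outage observed while a planned transition is in progress.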
@@ -205,7 +205,7 @@ DPU_STATE|DPU: | Field | Description | Set by | Cleared by | | ----- | ----------- | ------ | ---------- | | `ready_status` | Set to `"true"` when the DPU is fully up and ready (midplane, control plane, data plane all up). Set to `"false"` when the DPU goes down or undergoes a reset. | `chassisd` | `chassisd` (set to `"false"` on failure/reset) | -| `recovery_status` | Set to `"recoverable"` on initialization. Set to `"unrecoverable"` when `reset_count` reaches `dpu_reset_limit`. | `chassisd` | `chassisd` (reset to `"recoverable"` on planned restart) | +| `recovery_status` | Set to `"recoverable"` on initialization. Set to `"unrecoverable"` when `reset_count` reaches `dpu_reset_limit`. Reset back to `"recoverable"` (and `reset_count` to 0) on: (1) `chassisd` restart (pmon crash / NPU reboot), or (2) operator-initiated `config chassis module startup DPU` on an unrecoverable DPU. | `chassisd` | `chassisd` (reset to `"recoverable"` on chassisd restart or operator module startup) | | `reset_count` | Number of unplanned DPU resets. Reset to 0 on `chassisd` reset on NPU (e.g., NPU reboot, `pmon` restart). 
| `chassisd` | `chassisd` | | `last_down_time` | UTC timestamp of the last time the DPU went down | `chassisd` | — | | `last_ready_time` | UTC timestamp of the last time the DPU became ready | `chassisd` | — | @@ -243,18 +243,19 @@ stateDiagram-v2 Booting --> PowerCycle : boot timeout expired [auto-recovery enabled] Booting --> ManualIntervention : boot timeout expired [auto-recovery disabled] - Ready --> SWFailure : Control plane down + Ready --> SWFailure : Control plane down (midplane up) SWFailure --> PowerCycle : auto-recovery enabled SWFailure --> ManualIntervention : auto-recovery disabled - Ready --> PowerCycle : HW failure detected [auto-recovery enabled] - Ready --> ManualIntervention : HW failure detected [auto-recovery disabled] + Ready --> PowerCycle : HW failure detected (midplane down) [auto-recovery enabled] + Ready --> ManualIntervention : HW failure detected (midplane down) [auto-recovery disabled] PowerCycle --> Booting : Power cycle issued - PowerCycle --> Unrecoverable : reset count >= reset limit + PowerCycle --> Unrecoverable : reset count >= dpu_reset_limit ManualIntervention --> Booting : Operator power-cycle / module startup + Booting --> Offline : CLI module shutdown during boot Ready --> PlannedShutdown : CLI module shutdown Ready --> PlannedReboot : CLI reboot DPU @@ -263,18 +264,18 @@ stateDiagram-v2 Offline --> Booting : CLI module startup - Unrecoverable --> Booting : chassisd reset on NPU + Unrecoverable --> Booting : Operator module startup or chassisd restart ``` | State | `ready_status` | `recovery_status` | Key DB Indicators | | ----- | :------------: | :----------------: | ----------------- | | **Booting** | `false` | `recoverable` | `dpu_control_plane_state: down`; `chassisd` starts `dpu_boot_timeout` timer — if DPU does not reach Ready before timeout, triggers PowerCycle (or ManualIntervention if auto-recovery disabled) | | **Ready** | `true` | `recoverable` | All three states `up` | -| **SWFailure** | `false` | `recoverable` 
| `dpu_control_plane_state: down`, `dpu_midplane_link_state: up`; transient state before `chassisd` selects `PowerCycle` or `ManualIntervention` based on the auto-recovery feature flag | +| **SWFailure** | `false` | `recoverable` | `dpu_control_plane_state: down`, `dpu_midplane_link_state: up`; transient state before `chassisd` selects `PowerCycle` or `ManualIntervention` based on the auto-recovery feature flag. If **both** control plane and midplane are down, this is treated as a HW failure — skips SWFailure and goes directly to PowerCycle/ManualIntervention. | | **PowerCycle** | `false` | `recoverable` | `chassisd` issuing power-cycle; `reset_count` incremented | | **ManualIntervention** | `false` | `recoverable` | DPU down; `FEATURE|dpu-auto-recovery` `state` is `disabled` / `always_disabled`; no power-cycle issued; awaits operator action | | **Offline** | `false` | `recoverable` | `oper_status: Offline` | -| **Unrecoverable** | `false` | `unrecoverable` | `reset_count` ≥ `dpu_reset_limit` | +| **Unrecoverable** | `false` | `unrecoverable` | `reset_count` ≥ `dpu_reset_limit`; operator can recover via `config chassis module startup DPU` which resets `recovery_status` to `recoverable` and `reset_count` to 0 | --- @@ -318,7 +319,7 @@ The `pmon` (Platform Monitor) daemon on the NPU crashes. This is a **critical** - DPU health state updates to `CHASSIS_STATE_DB` stop during the outage. **PMON Action:** -- On `chassisd` bringup sequence after restart, `chassisd` sets `ready_status` to `false` and updates `last_down_time` for **all** DPUs. +- On `chassisd` bringup sequence after restart, `chassisd` resets `reset_count` to 0 for **all** DPUs, sets `ready_status` to `false`, and updates `last_down_time` for **all** DPUs. - `chassisd` re-polls all DPU states and updates `CHASSIS_STATE_DB` with current values. - For each DPU found healthy, `chassisd` sets `ready_status` back to `true` and updates `last_ready_time`. 
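The bring-up sequence above can be sketched against an in-memory dictionary standing in for `CHASSIS_STATE_DB`. The function name, the `poll_states` callback, and the timestamp format are assumptions made for illustration; resetting `recovery_status` follows the field description earlier in this patch.

```python
from datetime import datetime, timezone

STATE_FIELDS = ("dpu_midplane_link_state", "dpu_control_plane_state",
                "dpu_data_plane_state")


def reinit_on_chassisd_start(dpu_state, poll_states, now=None):
    """Re-initialize per-DPU recovery fields after a chassisd restart.

    dpu_state:   {dpu_name: {field: value}}, a stand-in for the DB entries.
    poll_states: hypothetical callable returning the three plane states.
    """
    stamp = now or datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    # Step 1: every DPU gets a fresh retry budget and is marked not ready.
    for entry in dpu_state.values():
        entry.update(reset_count=0, recovery_status="recoverable",
                     ready_status="false", last_down_time=stamp)
    # Step 2: re-poll, and mark a DPU ready only if all three planes are up.
    for name, entry in dpu_state.items():
        states = poll_states(name)
        if all(states.get(field) == "up" for field in STATE_FIELDS):
            entry["ready_status"] = "true"
            entry["last_ready_time"] = stamp
    return dpu_state
```

A DPU that was previously `unrecoverable` comes out of this routine as `recoverable` with `reset_count` 0, which is the fresh retry budget the patch describes.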
@@ -329,6 +330,9 @@ The `pmon` (Platform Monitor) daemon on the NPU crashes. This is a **critical** | `ready_status` (all DPUs) | `true` | stale | `false` → `true` (per DPU) | | `last_down_time` (all DPUs) | — | — | `` | | `last_ready_time` (all DPUs) | — | — | `` (per DPU) | +| `reset_count` (all DPUs) | N | stale | 0 | + +> **Note:** `reset_count` is reset to 0 on `chassisd` restart. This means a DPU that was close to `dpu_reset_limit` gets a fresh retry budget after a pmon crash. This is by design — the pmon restart itself represents operator-level intervention, and persistent hardware issues will be caught again within `dpu_reset_limit` attempts. > **Note:** If only `chassisd` crashes within the `pmon` container (while `pmon` itself stays running), `supervisord` inside `pmon` restarts `chassisd` automatically. The recovery behavior is identical to the full `pmon` crash case described above — `chassisd` re-initializes all DPU states on startup. @@ -361,6 +365,7 @@ The `databasedpu` (per-DPU Redis database instance) on the NPU crashe | `last_down_time` | — | `` | — | | `last_ready_time` | — | — | `` | | `reset_count` | N | N | N+1 | +| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) | --- @@ -408,6 +413,7 @@ The DPU loses power unexpectedly or shuts down without graceful notification (e. - `chassisd` sets `ready_status` to `false` and updates `last_down_time`. - `chassisd` immediately power-cycles the DPU (midplane and control plane are already confirmed down) and increments `reset_count`. - After power-cycle, `chassisd` verifies all DPU states, sets `ready_status` back to `true`, and updates `last_ready_time`. +- If `reset_count` reaches `dpu_reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. - **When auto-recovery is disabled:** `chassisd` skips the power-cycle. 
The DPU remains in **ManualIntervention** with `ready_status: false`; operator must trigger recovery. **DB State Transition:** @@ -420,6 +426,7 @@ The DPU loses power unexpectedly or shuts down without graceful notification (e. | `last_down_time` | — | `` | — | | `last_ready_time` | — | — | `` | | `reset_count` | N | N | N+1 | +| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) | --- @@ -434,7 +441,10 @@ The PCIe bus between the NPU and a local DPU fails, making the DPU unreachable f **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time`. -- If the DPU is already offline/unreachable (midplane link down, PCIe detached), `chassisd` does **not** issue an immediate power-cycle. `chassisd `sets `ready_status` back to `false`, and updates `last_down_time`. +- Since midplane is down (PCIe detached implies midplane unreachable), `chassisd` initiates a DPU power-cycle (`power_down()` → `pci_detach()` → `power_up()` → `pci_reattach()`) and increments `reset_count`. +- After power-cycle and PCIe rescan, `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. +- If `reset_count` reaches `dpu_reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. +- **When auto-recovery is disabled:** `chassisd` skips the power-cycle. The DPU remains in **ManualIntervention** with `ready_status: false`; operator must trigger recovery. 
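A minimal sketch of the power-cycle path in the list above, using a stand-in module object. `FakeDpuModule` and `recover_dpu` are hypothetical names; the four step names are taken from the text as written. One assumption is made explicit: the power-cycle that brings `reset_count` up to `dpu_reset_limit` is still issued, and the DPU is marked unrecoverable afterwards, which matches the `N+1` column of the DB tables.

```python
class FakeDpuModule:
    """Test double that records the platform calls made against it."""

    def __init__(self):
        self.calls = []

    def power_down(self):
        self.calls.append("power_down")

    def pci_detach(self):
        self.calls.append("pci_detach")

    def power_up(self):
        self.calls.append("power_up")

    def pci_reattach(self):
        self.calls.append("pci_reattach")


def recover_dpu(module, entry, dpu_reset_limit, auto_recovery_enabled):
    """Apply one unplanned-failure recovery attempt; return recovery_status."""
    entry["ready_status"] = "false"
    # Auto-recovery disabled, or retry budget already spent: no power-cycle.
    if not auto_recovery_enabled or entry["recovery_status"] == "unrecoverable":
        return entry["recovery_status"]
    # Power-cycle in the order given in the text.
    for step in ("power_down", "pci_detach", "power_up", "pci_reattach"):
        getattr(module, step)()
    entry["reset_count"] += 1
    if entry["reset_count"] >= dpu_reset_limit:
        entry["recovery_status"] = "unrecoverable"
    return entry["recovery_status"]
```

With the default `dpu_reset_limit` of 2, the second consecutive attempt marks the DPU unrecoverable and a third call issues no further platform calls.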
**DB State Transition:** @@ -446,6 +456,7 @@ The PCIe bus between the NPU and a local DPU fails, making the DPU unreachable f | `last_down_time` | — | `` | — | | `last_ready_time` | — | — | `` | | `reset_count` | N | N | N+1 | +| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) | --- @@ -514,6 +525,7 @@ Orderly shutdown of a DPU via CLI command: `config chassis module shutdown DPU`. 7. DPU boots, services start, reports `dpu_control_plane_state=up`. 8. `chassisd` verifies all DPU states and sets `ready_status` to `true`. +> **Note:** `dpu_boot_timeout` applies to the Booting phase after a planned reboot as well. If the DPU fails to reach `Ready` within the timeout (e.g., broken image installed during upgrade), `chassisd` treats it as a boot failure and initiates another power-cycle, incrementing `reset_count`. The planned reboot's `state_transition_in_progress` flag is already cleared by step 5, so it does not suppress the boot-timeout recovery. + **DB State Transition:** | DB Field | Before | During Reboot | After Recovery | @@ -567,6 +581,8 @@ Planned reboot of the entire SmartSwitch (NPU + all DPUs) via CLI: `reboot`. All 7. `chassisd` power-cycles each DPU and performs PCIe reattach. 8. Each DPU boots: midplane attach → SONiC boot → container startup → reports `dpu_control_plane_state=up`. +> **Note — Multiple DPU recovery:** When multiple (or all) DPUs need recovery simultaneously, `chassisd` issues power-cycles sequentially (one DPU at a time) to avoid power-rail overload and PCIe bus contention. The `dpu_boot_timeout` is tracked per-DPU independently. If a platform supports parallel DPU power-cycle (declared in `platform.json` via `parallel_dpu_recovery: true`), `chassisd` may issue power-cycles in parallel batches. + **DB State Transition:** | DB Field | Before | During Reboot | After Recovery | @@ -587,7 +603,7 @@ Planned reboot of the entire SmartSwitch (NPU + all DPUs) via CLI: `reboot`. 
All | DPU Scenario | `dpu_control_plane_state` | `dpu_midplane_link_state` | `ready_status` | PMON Action | | ------------ | :-----------------------: | :-----------------------: | :-----------: | ----------- | -| DPU booting – initial state | down | down | false | `chassisd` polls; waiting for DPU to come up | +| DPU booting – initial state | down | down → up | false | `chassisd` polls; midplane comes up first, then waits for control plane and data plane within `dpu_boot_timeout` | | DPU healthy and running – first boot | up | up | true | Set `ready_status=true` after verifying all states | | DPU crash / unplanned reboot | down | down | false | Power-cycle DPU; increment `reset_count` | | DPU up after crash | up | up | true | Set `ready_status=true` after verifying all states | @@ -622,7 +638,7 @@ admin@sonic:~$ show chassis modules recovery ------ -------------- ----------------- ------------- -------------------------- -------------------------- DPU0 true recoverable 2 2026-05-28 10:15:30 UTC 2026-05-28 10:18:45 UTC DPU1 true recoverable 0 — 2026-05-28 09:00:12 UTC - DPU2 false unrecoverable 5 2026-05-28 11:02:00 UTC — + DPU2 false unrecoverable 2 2026-05-28 11:02:00 UTC — .. ``` From de5e500dca46401d162a45fa1e04560962298fab Mon Sep 17 00:00:00 2001 From: Vasundhara Volam Date: Fri, 29 May 2026 00:09:29 +0000 Subject: [PATCH 09/14] [doc][smart-switch][pmon]: Add DPU self-recovery (HW watchdog) section Add HW watchdog-aware recovery for midplane-down events. When dpu_midplane_link_state goes down, chassisd enters WaitForWatchdog and waits dpu_boot_timeout (600s) for DPU to self-recover via HW watchdog before issuing its own power-cycle. Validates reboot cause (Kernel Panic, Memory Exhaustion, Watchdog) to accept self-recovery. dpu_boot_timeout is reused for both Booting (wait after power-cycle) and WaitForWatchdog (wait for HW watchdog self-recovery). 
Also fixes DPU Power Failure and PCIe Failure sections to go through WaitForWatchdog instead of immediate power-cycle, consistent with the state machine. Signed-off-by: Vasundhara Volam --- .../pmon/enhance-dpu-robustness.md | 117 ++++++++++++++++-- 1 file changed, 107 insertions(+), 10 deletions(-) diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md index dab85e22c72..9af969c90ab 100644 --- a/doc/smart-switch/pmon/enhance-dpu-robustness.md +++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md @@ -23,6 +23,7 @@ - [PCIe Failure](#pcie-failure) - [NPU / Switch Level Failures](#npu--switch-level-failures) - [NPU Kernel Crash / Memory Exhaustion](#npu-kernel-crash--memory-exhaustion) +- [DPU Kernel Panic / Memory Exhaustion (HW Watchdog)](#dpu-kernel-panic--memory-exhaustion-hw-watchdog) - [Planned Operations](#planned-operations) - [DPU Graceful Shutdown](#dpu-graceful-shutdown) - [DPU Cold Reboot](#dpu-cold-reboot) @@ -128,10 +129,10 @@ All timers and thresholds used by PMON for DPU failure detection and recovery ar | Timer / Threshold | Default Value | Configurable | Used By | Description | | ----------------- | :-----------: | :----------: | ------- | ----------- | -| `chassisd` health poll interval | 10 seconds | No | `chassisd` | Interval at which `chassisd` polls `dpu_control_plane_state`, `dpu_data_plane_state`, and `dpu_midplane_link_state`. As soon as `dpu_control_plane_state` or `dpu_midplane_link_state` is observed as `down`, `chassisd` initiates a DPU power-cycle (no self-heal grace period). | +| `chassisd` health poll interval | 10 seconds | No | `chassisd` | Interval at which `chassisd` polls `dpu_control_plane_state`, `dpu_data_plane_state`, and `dpu_midplane_link_state`. As soon as `dpu_control_plane_state` is observed as `down` (with midplane still up), `chassisd` initiates a DPU power-cycle (no self-heal grace period). 
When `dpu_midplane_link_state` goes down, `chassisd` enters **WaitForWatchdog** and waits `dpu_boot_timeout` for the DPU to self-recover via HW watchdog before power-cycling. | | `pcied` PCIe poll interval | 60 seconds | No | `pcied` | Interval at which `pcied` checks PCIe link status for all DPUs. A PCIe failure may go undetected for up to 60 seconds. | | `dpu_halt_services_timeout` | 60 seconds | Yes (`platform.json`) | `gnoi_reboot_daemon.py` | Maximum time to wait for DPU services to halt gracefully during reboot/shutdown | -| `dpu_boot_timeout` | 300 seconds | Yes (`platform.json`) | `chassisd` | Maximum time to wait for a DPU to reach `Ready` state (all planes up) after a power-cycle. If the DPU does not become ready within this timeout, `chassisd` treats it as a boot failure and initiates another power-cycle (incrementing `reset_count`). Covers broken-image and stuck-boot scenarios. | +| `dpu_boot_timeout` | 600 seconds | Yes (`platform.json`) | `chassisd` | Maximum time to wait for a DPU to reach `Ready` state (all planes up) after a power-cycle or after entering **WaitForWatchdog** (midplane down). In **Booting** state: if the DPU does not become ready within this timeout, `chassisd` treats it as a boot failure and initiates another power-cycle (incrementing `reset_count`). In **WaitForWatchdog** state: if the DPU does not self-recover within this timeout, `chassisd` issues its own power-cycle. If the DPU comes back within this timeout AND reports a recognized reboot cause (`Kernel Panic`, `Memory Exhaustion`, or `Watchdog`), `chassisd` accepts the self-recovery without issuing its own power-cycle. | | `dpu_reset_limit` | 2 | Yes (`platform.json`) | `chassisd` | Maximum number of consecutive unplanned power-cycle attempts before marking DPU as unrecoverable | > **Note:** `chassisd` polls `dpu_data_plane_state` alongside `dpu_control_plane_state` and `dpu_midplane_link_state`, but `dpu_data_plane_state` alone does not trigger recovery actions. 
A data-plane-down with control-plane-up scenario indicates that the DPU SONiC stack is running but the data plane pipeline has not converged — this is expected during initial programming or after a configuration change. Recovery is triggered only when `dpu_control_plane_state` or `dpu_midplane_link_state` transitions to `down`. The `dpu_data_plane_state` is used by `chassisd` solely to determine full DPU readiness for setting `ready_status` to `true`. During the **Booting** state, if `dpu_control_plane_state` transitions to `up` but `dpu_data_plane_state` remains `down` when `dpu_boot_timeout` expires, `chassisd` logs a WARNING-level syslog (`DPU: data plane not up after s`) but does **not** trigger a power-cycle. The `ready_status` remains `false` until the operator investigates or the data plane recovers. This warning only applies during boot — if a DPU is already in **Ready** state and `dpu_data_plane_state` drops to `down` while control plane stays `up`, `chassisd` sets `ready_status` to `false` but does not trigger a power-cycle or timeout. The DPU stays operational (no recovery action) until the data plane recovers or control plane also goes down. 
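The role of `dpu_data_plane_state` described in this note can be sketched with two small helpers. Both function names and the returned action labels are hypothetical; the sketch only encodes the rules stated above and does not model the auto-recovery feature flag.

```python
def ready_status(midplane: str, control_plane: str, data_plane: str) -> str:
    """ready_status is "true" only when all three planes report "up"."""
    planes = (midplane, control_plane, data_plane)
    return "true" if all(state == "up" for state in planes) else "false"


def on_boot_timeout(control_plane: str, data_plane: str) -> str:
    """Action when dpu_boot_timeout expires while the DPU is in Booting.

    Returns a hypothetical label:
      "warn_only"    - control plane up, data plane down: log WARNING, no reset
      "boot_failure" - control plane never came up: treat as a failed boot
    """
    if control_plane == "up" and data_plane == "down":
        return "warn_only"
    return "boot_failure"
```

A DPU in **Ready** whose data plane drops therefore reports `ready_status` of `"false"` without any recovery action being selected.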
@@ -247,8 +248,11 @@ stateDiagram-v2 SWFailure --> PowerCycle : auto-recovery enabled SWFailure --> ManualIntervention : auto-recovery disabled - Ready --> PowerCycle : HW failure detected (midplane down) [auto-recovery enabled] - Ready --> ManualIntervention : HW failure detected (midplane down) [auto-recovery disabled] + Ready --> WaitForWatchdog : HW failure (midplane down) [auto-recovery enabled] + Ready --> ManualIntervention : HW failure (midplane down) [auto-recovery disabled] + + WaitForWatchdog --> Ready : DPU self-recovered, reboot cause valid + WaitForWatchdog --> PowerCycle : timeout expired OR reboot cause invalid PowerCycle --> Booting : Power cycle issued PowerCycle --> Unrecoverable : reset count >= dpu_reset_limit @@ -271,8 +275,9 @@ stateDiagram-v2 | ----- | :------------: | :----------------: | ----------------- | | **Booting** | `false` | `recoverable` | `dpu_control_plane_state: down`; `chassisd` starts `dpu_boot_timeout` timer — if DPU does not reach Ready before timeout, triggers PowerCycle (or ManualIntervention if auto-recovery disabled) | | **Ready** | `true` | `recoverable` | All three states `up` | -| **SWFailure** | `false` | `recoverable` | `dpu_control_plane_state: down`, `dpu_midplane_link_state: up`; transient state before `chassisd` selects `PowerCycle` or `ManualIntervention` based on the auto-recovery feature flag. If **both** control plane and midplane are down, this is treated as a HW failure — skips SWFailure and goes directly to PowerCycle/ManualIntervention. | +| **SWFailure** | `false` | `recoverable` | `dpu_control_plane_state: down`, `dpu_midplane_link_state: up`; transient state before `chassisd` selects `PowerCycle` or `ManualIntervention` based on the auto-recovery feature flag. If **both** control plane and midplane are down, this is treated as a HW failure — skips SWFailure and goes to **WaitForWatchdog** (if auto-recovery enabled). 
| | **PowerCycle** | `false` | `recoverable` | `chassisd` issuing power-cycle; `reset_count` incremented | +| **WaitForWatchdog** | `false` | `recoverable` | Midplane down detected; `chassisd` waiting up to `dpu_boot_timeout` (600s) for DPU HW watchdog to power-cycle and bring DPU back. On DPU return, validates `dpu_reboot_cause` — accepts `Kernel Panic`, `Memory Exhaustion`, or `Watchdog`. | | **ManualIntervention** | `false` | `recoverable` | DPU down; `FEATURE|dpu-auto-recovery` `state` is `disabled` / `always_disabled`; no power-cycle issued; awaits operator action | | **Offline** | `false` | `recoverable` | `oper_status: Offline` | | **Unrecoverable** | `false` | `unrecoverable` | `reset_count` ≥ `dpu_reset_limit`; operator can recover via `config chassis module startup DPU` which resets `recovery_status` to `recoverable` and `reset_count` to 0 | @@ -411,10 +416,12 @@ The DPU loses power unexpectedly or shuts down without graceful notification (e. **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time`. -- `chassisd` immediately power-cycles the DPU (midplane and control plane are already confirmed down) and increments `reset_count`. -- After power-cycle, `chassisd` verifies all DPU states, sets `ready_status` back to `true`, and updates `last_ready_time`. +- Since `dpu_midplane_link_state` is down, `chassisd` enters the **WaitForWatchdog** state and starts the `dpu_boot_timeout` timer (600 seconds) to allow the DPU to self-recover via HW watchdog. +- If the DPU self-recovers within the timeout and reports a valid reboot cause (`Kernel Panic`, `Memory Exhaustion`, or `Watchdog`), `chassisd` accepts the recovery without issuing its own power-cycle. +- If the timeout expires or the reboot cause is invalid, `chassisd` power-cycles the DPU and increments `reset_count`. +- After recovery, `chassisd` verifies all DPU states, sets `ready_status` back to `true`, and updates `last_ready_time`. 
- If `reset_count` reaches `dpu_reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. -- **When auto-recovery is disabled:** `chassisd` skips the power-cycle. The DPU remains in **ManualIntervention** with `ready_status: false`; operator must trigger recovery. +- **When auto-recovery is disabled:** `chassisd` skips the WaitForWatchdog and power-cycle. The DPU remains in **ManualIntervention** with `ready_status: false`; operator must trigger recovery. **DB State Transition:** @@ -441,10 +448,12 @@ The PCIe bus between the NPU and a local DPU fails, making the DPU unreachable f **PMON Action:** - `chassisd` sets `ready_status` to `false` and updates `last_down_time`. -- Since midplane is down (PCIe detached implies midplane unreachable), `chassisd` initiates a DPU power-cycle (`power_down()` → `pci_detach()` → `power_up()` → `pci_reattach()`) and increments `reset_count`. +- Since midplane is down (PCIe detached implies midplane unreachable), `chassisd` enters the **WaitForWatchdog** state and starts the `dpu_boot_timeout` timer (600 seconds) to allow the DPU to self-recover via HW watchdog. +- If the DPU self-recovers within the timeout and reports a valid reboot cause (`Kernel Panic`, `Memory Exhaustion`, or `Watchdog`), `chassisd` accepts the recovery without issuing its own power-cycle. +- If the timeout expires or the reboot cause is invalid, `chassisd` power-cycles the DPU (`power_down()` → `pci_detach()` → `power_up()` → `pci_reattach()`) and increments `reset_count`. - After power-cycle and PCIe rescan, `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. - If `reset_count` reaches `dpu_reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. -- **When auto-recovery is disabled:** `chassisd` skips the power-cycle. 
The DPU remains in **ManualIntervention** with `ready_status: false`; operator must trigger recovery. +- **When auto-recovery is disabled:** `chassisd` skips the WaitForWatchdog and power-cycle. The DPU remains in **ManualIntervention** with `ready_status: false`; operator must trigger recovery. **DB State Transition:** @@ -491,6 +500,93 @@ The entire switch (NPU + all DPUs) goes down due to kernel panic or memory exhau --- +## DPU Kernel Panic / Memory Exhaustion (HW Watchdog) ## + +**Description:** +A DPU experiences a kernel panic or memory exhaustion, causing the Linux kernel to crash. If a hardware watchdog is enabled on the DPU, the watchdog timer fires and power-cycles the DPU automatically without NPU intervention. The DPU goes through a full cold boot and comes back to a healthy state on its own. + +Without HW watchdog awareness, `chassisd` would detect `dpu_midplane_link_state: down` and immediately issue its own power-cycle of the DPU — which is redundant and disruptive (it interrupts the DPU's in-progress self-recovery via watchdog). + +**Platform Configuration:** + +The `dpu_boot_timeout` in `platform.json` controls the WaitForWatchdog grace period (same timer used for boot-after-power-cycle): + +```json +{ + "dpu_boot_timeout": 600 +} +``` + +The 600-second default must be long enough for a full DPU HW watchdog trigger + power-cycle + cold boot sequence. The HW watchdog itself may have a timeout of 30–120s before it fires, plus the DPU boot time. + +**Detection (by PMON):** +- `chassisd` detects `dpu_midplane_link_state: down` on its next poll cycle — same as any other HW failure. + +**PMON Action (when midplane goes down):** +1. `chassisd` sets `ready_status` to `false` and updates `last_down_time`. +2. `chassisd` enters the **WaitForWatchdog** state and starts the `dpu_boot_timeout` timer (default 600 seconds). +3. `chassisd` continues polling the DPU state every 10 seconds during this period. +4. 
**If the DPU comes back** (midplane up → control plane up) within the timeout: + - `chassisd` reads the DPU's previous reboot cause from `CHASSIS_STATE_DB: DPU_STATE|DPU: dpu_reboot_cause` (written by DPU's `determine-reboot-cause` service on boot). + - **If reboot cause is `Kernel Panic`, `Memory Exhaustion`, or `Watchdog`:** `chassisd` accepts the self-recovery — transitions to **Ready**, sets `ready_status` to `true`, updates `last_ready_time`. `reset_count` is **not** incremented (the HW watchdog handled recovery autonomously). + - **If reboot cause is anything else** (e.g., `Unknown`, `Software`, `Power Loss`): `chassisd` does **not** trust the self-recovery. It proceeds with a full power-cycle (`PowerCycle` state), incrementing `reset_count`, to guarantee a clean DPU state. +5. **If the timeout expires** (DPU not back after 600 seconds): + - `chassisd` transitions to **PowerCycle** (if auto-recovery enabled) or **ManualIntervention** (if disabled). + - Standard recovery flow applies: power-cycle, increment `reset_count`, wait for boot via `dpu_boot_timeout`. +6. If `reset_count` reaches `dpu_reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"`. + +**DPU Reboot Cause Reporting:** + +The DPU reports its reboot cause to `CHASSIS_STATE_DB` on the NPU after every boot: + +``` +DPU_STATE|DPU: +{ + "dpu_reboot_cause": "Kernel Panic" | "Memory Exhaustion" | "Watchdog" | "Power Loss" | "Software" | "Unknown" +} +``` + +The DPU's `determine-reboot-cause` service reads from `/host/reboot-cause/reboot-cause.txt` on boot and publishes the cause to CHASSIS_STATE_DB via the midplane. `chassisd` reads this field after the DPU's `dpu_control_plane_state` transitions back to `up`. 
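The WaitForWatchdog decision in steps 4 and 5 can be summarized as a small function. This is a sketch under stated assumptions: the function name and its arguments are hypothetical, and the returned strings are the state names used in this document rather than identifiers from `chassisd`.

```python
ACCEPTED_REBOOT_CAUSES = frozenset({"Kernel Panic", "Memory Exhaustion", "Watchdog"})


def wait_for_watchdog_outcome(dpu_returned: bool, reboot_cause: str = "",
                              auto_recovery_enabled: bool = True) -> str:
    """Pick the next state once WaitForWatchdog ends.

    dpu_returned: True if midplane and control plane came back before
                  dpu_boot_timeout expired, False if the timer expired.
    reboot_cause: value read from dpu_reboot_cause after the DPU returned.
    """
    if dpu_returned:
        # Self-recovery is trusted only for a recognized reboot cause;
        # reset_count stays unchanged in that case.
        if reboot_cause in ACCEPTED_REBOOT_CAUSES:
            return "Ready"
        # Any other cause: force a clean state with a full power-cycle.
        return "PowerCycle"
    # Timeout expired without the DPU coming back.
    return "PowerCycle" if auto_recovery_enabled else "ManualIntervention"
```

Only the `PowerCycle` outcomes increment `reset_count`; the `Ready` outcome leaves it at `N`, matching the first DB state transition table.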
+ +**DB State Transition (DPU self-recovers — reboot cause is valid: Kernel Panic / Memory Exhaustion / Watchdog):** + +| DB Field | Before | During WaitForWatchdog | After Self-Recovery | +| -------- | :----: | :-------------------: | :-----------------: | +| `dpu_midplane_link_state` | `up` | `down` | `up` | +| `dpu_control_plane_state` | `up` | `down` | `up` | +| `ready_status` | `true` | `false` | `true` | +| `last_down_time` | — | `` | — | +| `last_ready_time` | — | — | `` | +| `reset_count` | N | N | N (unchanged) | +| `recovery_status` | `recoverable` | `recoverable` | `recoverable` | +| `dpu_reboot_cause` | (previous) | (stale) | `Kernel Panic` / `Memory Exhaustion` / `Watchdog` | + +**DB State Transition (DPU self-recovers — reboot cause invalid → power-cycle anyway):** + +| DB Field | Before | During WaitForWatchdog | After Power-Cycle | +| -------- | :----: | :-------------------: | :---------------: | +| `dpu_midplane_link_state` | `up` | `down` → `up` (self-recovered) → `down` (power-cycle) | `up` | +| `ready_status` | `true` | `false` | `true` | +| `reset_count` | N | N | N+1 | +| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) | + +**DB State Transition (DPU does NOT self-recover — timeout expires):** + +| DB Field | Before | During WaitForWatchdog | After Power-Cycle | +| -------- | :----: | :-------------------: | :---------------: | +| `dpu_midplane_link_state` | `up` | `down` | `up` | +| `ready_status` | `true` | `false` | `true` | +| `reset_count` | N | N | N+1 | +| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) | + +> **Note:** The `dpu_boot_timeout` (600s) is used for both **Booting** (wait for DPU to come up after power-cycle) and **WaitForWatchdog** (wait for DPU to self-recover via HW watchdog). The 600-second value accounts for the HW watchdog trigger delay (30–120s) + DPU power-cycle + cold boot sequence. 
+ +> **Note:** During **WaitForWatchdog**, if a planned operation (`config chassis module shutdown`) is requested, `chassisd` cancels the watchdog wait timer, powers down the DPU, and transitions to **Offline**. + +> **Note:** This feature only applies to **HW failures** (midplane down). For SW failures (control plane down, midplane still up), `chassisd` still power-cycles immediately because the HW watchdog would not fire in a software-only failure scenario (the DPU hardware is still running). + +--- + ## Planned Operations ## ### DPU Graceful Shutdown ### @@ -614,6 +710,7 @@ Planned reboot of the entire SmartSwitch (NPU + all DPUs) via CLI: `reboot`. All | DPU dead – power cycle | down | down | false | Power-cycle DPU; increment `reset_count` | | DPU dead – unrecoverable | down | down | false | `reset_count` reached `dpu_reset_limit`; `recovery_status` set to `"unrecoverable"`; raise alert | | Full SmartSwitch reboot (planned) | down → up | down → up | false → true | gNOI halt; power-cycle; re-verify | +| DPU kernel panic / mem exhaustion (HW watchdog) | down → up | down → up | false → true | Wait `dpu_boot_timeout` (600s); if DPU self-recovers AND reboot cause is `Kernel Panic`/`Memory Exhaustion`/`Watchdog`, accept recovery (`reset_count` unchanged); otherwise power-cycle | --- From 34cbf3068a241b987c11b37503b563e73fd0521e Mon Sep 17 00:00:00 2001 From: Vasundhara Volam Date: Fri, 29 May 2026 19:53:56 +0000 Subject: [PATCH 10/14] Simplify failure scenarios into unified DPU recovery model - Add dpu_self_recovery_timeout (300s) for DPU self-recovery grace period - Consolidate all DPU failure types into single 'DPU Failure' category with unified WaitForSelfRecovery state - Consolidate NPU failures into 'NPU Ungraceful Reboot' category - Update state machine: replace SWFailure/WaitForWatchdog with WaitForSelfRecovery state - Make Key DB Indicators column explicit with exact DB field conditions - Remove unused auto_restart/high_mem_alert from dpu-auto-recovery 
feature - Simplify chassisd health poll interval and dpu_boot_timeout descriptions - Fix gRPC abbreviation expansion Signed-off-by: Vasundhara Volam --- .../pmon/enhance-dpu-robustness.md | 386 ++++-------------- 1 file changed, 75 insertions(+), 311 deletions(-) diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md index 9af969c90ab..50a5d8c7634 100644 --- a/doc/smart-switch/pmon/enhance-dpu-robustness.md +++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md @@ -13,17 +13,9 @@ - [Existing DB entries](#existing-db-entries) - [New DB entries](#new-db-entries) - [DPU Recovery State Machine](#dpu-recovery-state-machine) -- [DPU Software Failures](#dpu-software-failures) - - [Process Crash/Restart on DPU](#process-crashrestart-on-dpu) - - [pmon Crash on NPU](#pmon-crash-on-npu) - - [databasedpu Crash on NPU](#databasedpu-crash-on-npu) -- [DPU Hardware Failures](#dpu-hardware-failures) - - [DPU Hardware Failure (Complete DPU Down)](#dpu-hardware-failure-complete-dpu-down) - - [DPU Power Failure / Unexpected Shutdown](#dpu-power-failure--unexpected-shutdown) - - [PCIe Failure](#pcie-failure) -- [NPU / Switch Level Failures](#npu--switch-level-failures) - - [NPU Kernel Crash / Memory Exhaustion](#npu-kernel-crash--memory-exhaustion) -- [DPU Kernel Panic / Memory Exhaustion (HW Watchdog)](#dpu-kernel-panic--memory-exhaustion-hw-watchdog) +- [Unplanned Failures](#unplanned-failures) + - [DPU Failure](#dpu-failure) + - [NPU Ungraceful Reboot](#npu-ungraceful-reboot) - [Planned Operations](#planned-operations) - [DPU Graceful Shutdown](#dpu-graceful-shutdown) - [DPU Cold Reboot](#dpu-cold-reboot) @@ -50,9 +42,8 @@ This document covers the High Level Design for DPU failure scenarios on a SmartS The scope includes: -- DPU software failures (process crashes and restarts on DPU; pmon and databasedpu crashes on NPU) -- DPU hardware failures (complete DPU down, power failure / unexpected shutdown, PCIe failure) -- 
NPU/switch-level failures (kernel crash, memory exhaustion) +- DPU failures (any event causing control plane or midplane to go down — process crashes, hardware faults, power loss, PCIe failures, kernel panics) +- NPU ungraceful reboot (kernel panic, unknown reboot cause — triggers power-cycle of all DPUs) - DB state tracking for DPU failure detection and recovery (new and existing DB entries) - DB state tracking for planned operations - PMON critical process definitions and criticality levels @@ -70,7 +61,7 @@ The scope includes: | DB | Redis Database | | DPU | Data Processing Unit | | gNOI | gRPC Network Operations Interface | -| gRPC | Google Remote Procedure Call | +| gRPC | gRPC Remote Procedure Call | | NPU | Network Processing Unit | | PCIe | PCI Express (Peripheral Component Interconnect Express) | | PMON | Platform Monitor | @@ -96,9 +87,9 @@ This document enumerates all failure scenarios that can occur on a DPU or its su | chassisd | Chassis daemon running inside `pmon` on the NPU; monitors DPU health states, manages DPU power-cycle and reset operations | | pmon | Platform Monitor daemon on NPU; hosts `chassisd` and other hardware monitoring sub-daemons | | syncd | Sync daemon; manages SAI API calls to DPU ASIC | -| control plane state | DPU SONiC is booted up, all containers are up, interfaces are up, and DPU is ready to accept configuration. Derived from SYSTEM_READY in STATE_DB. Values: `"up"`, `"down"`. | -| midplane link state | The PCIe link between the NPU and DPU is operational. Monitored and updated by NPU pmon `chassisd` via the `is_midplane_reachable` platform API. Values: `"up"`, `"down"`. | -| dataplane state | Configuration is downloaded, pipeline stages are up, and DPU hardware (port/ASIC) is ready to take traffic. Values: `"up"`, `"down"`. | +| control plane state | Indicates whether DPU SONiC is booted up, all containers are up, and DPU is ready to accept configuration. Derived from SYSTEM_READY in STATE_DB. Values: `"up"`, `"down"`. 
| +| midplane link state | Indicates whether the PCIe link between the NPU and DPU is operational. Monitored and updated by NPU pmon `chassisd` via the `is_midplane_reachable` platform API. Values: `"up"`, `"down"`. | +| dataplane state | Indicates whether configuration is downloaded, pipeline stages are up, and DPU hardware (port/ASIC) is ready to take traffic. Values: `"up"`, `"down"`. | --- @@ -129,13 +120,14 @@ All timers and thresholds used by PMON for DPU failure detection and recovery ar | Timer / Threshold | Default Value | Configurable | Used By | Description | | ----------------- | :-----------: | :----------: | ------- | ----------- | -| `chassisd` health poll interval | 10 seconds | No | `chassisd` | Interval at which `chassisd` polls `dpu_control_plane_state`, `dpu_data_plane_state`, and `dpu_midplane_link_state`. As soon as `dpu_control_plane_state` is observed as `down` (with midplane still up), `chassisd` initiates a DPU power-cycle (no self-heal grace period). When `dpu_midplane_link_state` goes down, `chassisd` enters **WaitForWatchdog** and waits `dpu_boot_timeout` for the DPU to self-recover via HW watchdog before power-cycling. | +| `chassisd` health poll interval | 10 seconds | No | `chassisd` | Interval at which `chassisd` polls `dpu_control_plane_state`, `dpu_data_plane_state`, and `dpu_midplane_link_state`. | | `pcied` PCIe poll interval | 60 seconds | No | `pcied` | Interval at which `pcied` checks PCIe link status for all DPUs. A PCIe failure may go undetected for up to 60 seconds. | | `dpu_halt_services_timeout` | 60 seconds | Yes (`platform.json`) | `gnoi_reboot_daemon.py` | Maximum time to wait for DPU services to halt gracefully during reboot/shutdown | -| `dpu_boot_timeout` | 600 seconds | Yes (`platform.json`) | `chassisd` | Maximum time to wait for a DPU to reach `Ready` state (all planes up) after a power-cycle or after entering **WaitForWatchdog** (midplane down). 
In **Booting** state: if the DPU does not become ready within this timeout, `chassisd` treats it as a boot failure and initiates another power-cycle (incrementing `reset_count`). In **WaitForWatchdog** state: if the DPU does not self-recover within this timeout, `chassisd` issues its own power-cycle. If the DPU comes back within this timeout AND reports a recognized reboot cause (`Kernel Panic`, `Memory Exhaustion`, or `Watchdog`), `chassisd` accepts the self-recovery without issuing its own power-cycle. | +| `dpu_boot_timeout` | 600 seconds | Yes (`platform.json`) | `chassisd` | Maximum time to wait for a DPU to reach `Ready` state (all planes up) after a power-cycle. | +| `dpu_self_recovery_timeout` | 300 seconds | Yes (`platform.json`) | `chassisd` | Maximum time to wait for a DPU to self-recover before `chassisd` initiates a power-cycle. This grace period allows the DPU to recover from transient failures without external intervention. | | `dpu_reset_limit` | 2 | Yes (`platform.json`) | `chassisd` | Maximum number of consecutive unplanned power-cycle attempts before marking DPU as unrecoverable | -> **Note:** `chassisd` polls `dpu_data_plane_state` alongside `dpu_control_plane_state` and `dpu_midplane_link_state`, but `dpu_data_plane_state` alone does not trigger recovery actions. A data-plane-down with control-plane-up scenario indicates that the DPU SONiC stack is running but the data plane pipeline has not converged — this is expected during initial programming or after a configuration change. Recovery is triggered only when `dpu_control_plane_state` or `dpu_midplane_link_state` transitions to `down`. The `dpu_data_plane_state` is used by `chassisd` solely to determine full DPU readiness for setting `ready_status` to `true`. 
During the **Booting** state, if `dpu_control_plane_state` transitions to `up` but `dpu_data_plane_state` remains `down` when `dpu_boot_timeout` expires, `chassisd` logs a WARNING-level syslog (`DPU: data plane not up after s`) but does **not** trigger a power-cycle. The `ready_status` remains `false` until the operator investigates or the data plane recovers. This warning only applies during boot — if a DPU is already in **Ready** state and `dpu_data_plane_state` drops to `down` while control plane stays `up`, `chassisd` sets `ready_status` to `false` but does not trigger a power-cycle or timeout. The DPU stays operational (no recovery action) until the data plane recovers or control plane also goes down. +> **Note:** `chassisd` polls `dpu_data_plane_state` alongside `dpu_control_plane_state` and `dpu_midplane_link_state`, but `dpu_data_plane_state` alone does not trigger recovery actions. A data-plane-down with control-plane-up scenario indicates that the DPU SONiC stack is running but the data plane pipeline has not converged — this is expected during initial programming or after a configuration change. Recovery is triggered only when `dpu_control_plane_state` or `dpu_midplane_link_state` transitions to `down`. The `dpu_data_plane_state` is used by `chassisd` solely to determine full DPU readiness for setting `ready_status` to `true`. > **Auto-recovery trigger vs. planned operations:** `chassisd` auto-recovery is triggered **only** for unplanned failures. During planned operations (graceful shutdown via `config chassis module shutdown` or DPU reboot via `reboot -d`), the `state_transition_in_progress` field in `CHASSIS_MODULE_TABLE|DPU` is set to `True` **before** the DPU control plane goes down. When `chassisd` observes `dpu_control_plane_state: down`, it checks `state_transition_in_progress`: if `True`, `chassisd` skips auto-recovery because the shutdown/reboot is intentional. 
Auto-recovery is only initiated when `dpu_control_plane_state` or `dpu_midplane_link_state` transitions to `down` **and** no planned transition is in progress (`state_transition_in_progress == False`). There is no additional timeout configured for this check — the distinction is purely flag-based. @@ -216,19 +208,15 @@ DPU_STATE|DPU: ``` FEATURE|dpu-auto-recovery: { - "state": "enabled" | "disabled" | "always_disabled", - "auto_restart": "enabled" | "disabled", - "high_mem_alert": "disabled" + "state": "enabled" | "disabled" | "always_disabled" } ``` | Field | Default | Description | | ----- | ------- | ----------- | | `state` | `enabled` | Enable or disable the DPU auto-recovery feature. When `disabled` or `always_disabled`, `chassisd` will not automatically power-cycle DPUs on failure. | -| `auto_restart` | `enabled` | Standard SONiC FEATURE table field — enables `systemd` to restart the feature's associated service if it crashes. | -| `high_mem_alert` | `disabled` | Standard SONiC FEATURE table field — high memory usage alert threshold. | -> **Note:** `dpu-auto-recovery` is **not** a separate service or container. It is a feature flag entry in CONFIG_DB's `FEATURE` table, read by `chassisd` (running inside the `pmon` container) to determine whether automatic DPU power-cycle recovery is enabled. The `auto_restart` and `high_mem_alert` fields are standard SONiC FEATURE table fields required by the feature infrastructure; they do not govern `chassisd` itself. When `state` is `disabled`, `chassisd` still monitors and updates DPU states in CHASSIS_STATE_DB, but will not initiate automatic power-cycle recovery. Manual intervention is required to recover failed DPUs. +> **Note:** `dpu-auto-recovery` is **not** a separate service or container. It is a feature flag entry in CONFIG_DB's `FEATURE` table, read by `chassisd` (running inside the `pmon` container) to determine whether automatic DPU power-cycle recovery is enabled. 
When `state` is `disabled`, `chassisd` still monitors and updates DPU states in CHASSIS_STATE_DB, but will not initiate automatic power-cycle recovery. Manual intervention is required to recover failed DPUs. --- @@ -244,15 +232,11 @@ stateDiagram-v2 Booting --> PowerCycle : boot timeout expired [auto-recovery enabled] Booting --> ManualIntervention : boot timeout expired [auto-recovery disabled] - Ready --> SWFailure : Control plane down (midplane up) - SWFailure --> PowerCycle : auto-recovery enabled - SWFailure --> ManualIntervention : auto-recovery disabled + Ready --> WaitForSelfRecovery : Control plane OR midplane down [auto-recovery enabled] + Ready --> ManualIntervention : Control plane OR midplane down [auto-recovery disabled] - Ready --> WaitForWatchdog : HW failure (midplane down) [auto-recovery enabled] - Ready --> ManualIntervention : HW failure (midplane down) [auto-recovery disabled] - - WaitForWatchdog --> Ready : DPU self-recovered, reboot cause valid - WaitForWatchdog --> PowerCycle : timeout expired OR reboot cause invalid + WaitForSelfRecovery --> Booting : DPU self-recovered (midplane up OR control plane up) + WaitForSelfRecovery --> PowerCycle : self-recovery timeout expired PowerCycle --> Booting : Power cycle issued PowerCycle --> Unrecoverable : reset count >= dpu_reset_limit @@ -273,219 +257,87 @@ stateDiagram-v2 | State | `ready_status` | `recovery_status` | Key DB Indicators | | ----- | :------------: | :----------------: | ----------------- | -| **Booting** | `false` | `recoverable` | `dpu_control_plane_state: down`; `chassisd` starts `dpu_boot_timeout` timer — if DPU does not reach Ready before timeout, triggers PowerCycle (or ManualIntervention if auto-recovery disabled) | -| **Ready** | `true` | `recoverable` | All three states `up` | -| **SWFailure** | `false` | `recoverable` | `dpu_control_plane_state: down`, `dpu_midplane_link_state: up`; transient state before `chassisd` selects `PowerCycle` or `ManualIntervention` based on the 
auto-recovery feature flag. If **both** control plane and midplane are down, this is treated as a HW failure — skips SWFailure and goes to **WaitForWatchdog** (if auto-recovery enabled). | -| **PowerCycle** | `false` | `recoverable` | `chassisd` issuing power-cycle; `reset_count` incremented | -| **WaitForWatchdog** | `false` | `recoverable` | Midplane down detected; `chassisd` waiting up to `dpu_boot_timeout` (600s) for DPU HW watchdog to power-cycle and bring DPU back. On DPU return, validates `dpu_reboot_cause` — accepts `Kernel Panic`, `Memory Exhaustion`, or `Watchdog`. | -| **ManualIntervention** | `false` | `recoverable` | DPU down; `FEATURE|dpu-auto-recovery` `state` is `disabled` / `always_disabled`; no power-cycle issued; awaits operator action | -| **Offline** | `false` | `recoverable` | `oper_status: Offline` | -| **Unrecoverable** | `false` | `unrecoverable` | `reset_count` ≥ `dpu_reset_limit`; operator can recover via `config chassis module startup DPU` which resets `recovery_status` to `recoverable` and `reset_count` to 0 | +| **Booting** | `false` | `recoverable` | `dpu_midplane_link_state: up`, `dpu_control_plane_state: down`; `dpu_boot_timeout` timer running | +| **Ready** | `true` | `recoverable` | `dpu_midplane_link_state: up`, `dpu_control_plane_state: up`, `dpu_data_plane_state: up` | +| **WaitForSelfRecovery** | `false` | `recoverable` | `dpu_control_plane_state: down` OR `dpu_midplane_link_state: down`; `dpu_self_recovery_timeout` timer running | +| **PowerCycle** | `false` | `recoverable` | `dpu_midplane_link_state: down`, `dpu_control_plane_state: down`; `reset_count` incremented; power-cycle in progress | +| **ManualIntervention** | `false` | `recoverable` | `dpu_control_plane_state: down` OR `dpu_midplane_link_state: down`; `FEATURE|dpu-auto-recovery` `state`: `disabled` / `always_disabled` | +| **Offline** | `false` | `recoverable` | `oper_status: Offline`; DPU admin-down via `config chassis module shutdown` | +| **Unrecoverable** | 
`false` | `unrecoverable` | `reset_count` ≥ `dpu_reset_limit`; `dpu_control_plane_state: down`; `dpu_midplane_link_state: down` |

 ---

-## DPU Software Failures ##
+## Unplanned Failures ##

-### Process crash/restart on DPU ###
+### DPU Failure ###

 **Description:**
-Any process crashes on the DPU and `dpu_control_plane_state` transitions to `down`. `chassisd` does not wait for the container supervisor to self-heal — it issues a DPU power-cycle as soon as it observes `dpu_control_plane_state: down` on its next poll.
+Any unplanned event that causes `dpu_control_plane_state` or `dpu_midplane_link_state` to transition to `down`. This covers all DPU failure scenarios: process crashes on DPU, DPU hardware faults, power loss, PCIe failures, kernel panics, and memory exhaustion.

 **Detection (by PMON):**
-- `chassisd` on the NPU polls `dpu_control_plane_state` every 10 seconds and observes it as `down`.
+- `chassisd` on the NPU polls `dpu_control_plane_state`, `dpu_data_plane_state`, and `dpu_midplane_link_state` every 10 seconds.
+- A failure is detected when either `dpu_control_plane_state` or `dpu_midplane_link_state` transitions to `down`.

 **PMON Action:**
-- `chassisd` sets `ready_status` to `false` and updates `last_down_time` for the corresponding DPU.
-- On the **same poll cycle** that detects `dpu_control_plane_state: down`, `chassisd` issues a power-cycle of the DPU and increments `reset_count`. There is no additional timeout or grace period — the only detection latency is the 10-second poll interval itself.
-- After the DPU comes back, `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`.
-- If `reset_count` reaches `dpu_reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts.
-- **When auto-recovery is disabled:** `chassisd` skips the power-cycle.
The DPU remains in **ManualIntervention** with `ready_status: false`; operator must reset the DPU manually.
-
-**DB State Transition:**
-
-| DB Field | Before | During Failure | After Recovery |
-| -------- | :----: | :------------: | :------------: |
+1. `chassisd` sets `ready_status` to `false` and updates `last_down_time` for the affected DPU.
+2. `chassisd` enters the **WaitForSelfRecovery** state and starts the `dpu_self_recovery_timeout` timer (default 300 seconds). This grace period allows the DPU to recover from transient failures (e.g., process restart, HW watchdog-triggered reboot) without external intervention.
+3. `chassisd` continues polling the DPU state every 10 seconds during this period.
+4. **If the DPU self-recovers** (midplane comes back up OR control plane comes back up) within the `dpu_self_recovery_timeout`:
+   - `chassisd` transitions to the **Booting** state and starts the `dpu_boot_timeout` timer (default 600 seconds) to wait for full DPU readiness (all planes up).
+   - Once all states are up, `chassisd` sets `ready_status` to `true` and updates `last_ready_time`. `reset_count` is **not** incremented (the DPU recovered autonomously).
+   - If the DPU does not reach full readiness within `dpu_boot_timeout`, `chassisd` treats it as a boot failure and initiates a power-cycle (incrementing `reset_count`).
+5. **If both `dpu_control_plane_state` and `dpu_midplane_link_state` remain down** after `dpu_self_recovery_timeout` expires:
+   - `chassisd` transitions to **PowerCycle**, issues a power-cycle of the DPU, and increments `reset_count`.
+   - After the power-cycle, `chassisd` transitions to **Booting** and starts the `dpu_boot_timeout` timer.
+   - Once the DPU reaches full readiness, `chassisd` sets `ready_status` to `true` and updates `last_ready_time`.
+6. If `reset_count` reaches `dpu_reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts.
+7.
**When auto-recovery is disabled:** `chassisd` skips the self-recovery wait and power-cycle. The DPU remains in **ManualIntervention** with `ready_status: false`; operator must reset the DPU manually.
+
+**DB State Transition (DPU self-recovers within `dpu_self_recovery_timeout`):**
+
+| DB Field | Before | During WaitForSelfRecovery | After Recovery |
+| -------- | :----: | :------------------------: | :------------: |
 | `dpu_control_plane_state` | `up` | `down` | `up` |
+| `dpu_midplane_link_state` | `up` | `up` or `down` | `up` |
 | `ready_status` | `true` | `false` | `true` |
 | `last_down_time` | — | `` | — |
 | `last_ready_time` | — | — | `` |
-| `reset_count` | N | N | N+1 |
-| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) |
-
----
-
-### pmon crash on NPU ###
-
-**Description:**
-The `pmon` (Platform Monitor) daemon on the NPU crashes. This is a **critical** PMON failure — `chassisd` and all other PMON sub-daemons stop, halting all DPU health monitoring.
-
-**Detection (by PMON):**
-- Not self-detectable. `systemd` detects the `pmon` container is down and restarts it.
-- DPU health state updates to `CHASSIS_STATE_DB` stop during the outage.
-
-**PMON Action:**
-- On `chassisd` bringup sequence after restart, `chassisd` resets `reset_count` to 0 for **all** DPUs, sets `ready_status` to `false`, and updates `last_down_time` for **all** DPUs.
-- `chassisd` re-polls all DPU states and updates `CHASSIS_STATE_DB` with current values.
-- For each DPU found healthy, `chassisd` sets `ready_status` back to `true` and updates `last_ready_time`.
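
As a reading aid for the unified **DPU Failure** flow that this patch adds, the transitions among **Ready**, **WaitForSelfRecovery**, **Booting**, **PowerCycle**, and **Unrecoverable** can be sketched as a small Python model. This is a minimal sketch and not `chassisd` code: the `Dpu` class, the `poll()` and `power_cycle()` helpers, and the constant names are all hypothetical. It also makes two assumptions where the text leaves room for interpretation: "self-recovered" is taken to mean that any plane that was down when the failure was detected is up again, and the reset limit is checked right after `reset_count` is incremented.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    READY = "Ready"
    WAIT = "WaitForSelfRecovery"
    BOOTING = "Booting"
    MANUAL = "ManualIntervention"
    UNRECOVERABLE = "Unrecoverable"


# Defaults quoted in the Timers and Thresholds table.
POLL_S = 10                    # chassisd health poll interval
SELF_RECOVERY_TIMEOUT_S = 300  # dpu_self_recovery_timeout
BOOT_TIMEOUT_S = 600           # dpu_boot_timeout
RESET_LIMIT = 2                # dpu_reset_limit


@dataclass
class Dpu:
    """Hypothetical per-DPU bookkeeping for this sketch."""
    state: State = State.READY
    reset_count: int = 0
    waited_s: int = 0  # seconds spent in the current timed state
    down_planes: set = field(default_factory=set)  # planes down at detection


def power_cycle(dpu):
    """Model PowerCycle as one step: count it, then boot or give up.

    Assumption: the limit is evaluated after the increment.
    """
    dpu.reset_count += 1
    dpu.waited_s = 0
    if dpu.reset_count >= RESET_LIMIT:
        dpu.state = State.UNRECOVERABLE
    else:
        dpu.state = State.BOOTING


def poll(dpu, midplane_up, control_up, data_up, auto_recovery=True):
    """Advance the model by one poll interval and return the new state."""
    if dpu.state is State.READY:
        # Data plane alone never triggers recovery, so it is ignored here.
        if not (midplane_up and control_up):
            dpu.waited_s = 0
            dpu.down_planes = {
                name
                for name, up in (("midplane", midplane_up), ("control", control_up))
                if not up
            }
            dpu.state = State.WAIT if auto_recovery else State.MANUAL

    elif dpu.state is State.WAIT:
        dpu.waited_s += POLL_S
        up_now = {"midplane": midplane_up, "control": control_up}
        if any(up_now[plane] for plane in dpu.down_planes):
            # Self-recovered: no reset_count increment, wait for full readiness.
            dpu.state, dpu.waited_s = State.BOOTING, 0
        elif dpu.waited_s >= SELF_RECOVERY_TIMEOUT_S:
            power_cycle(dpu)

    elif dpu.state is State.BOOTING:
        dpu.waited_s += POLL_S
        if midplane_up and control_up and data_up:
            dpu.state = State.READY
        elif dpu.waited_s >= BOOT_TIMEOUT_S:
            if auto_recovery:
                power_cycle(dpu)
            else:
                dpu.state = State.MANUAL

    # ManualIntervention and Unrecoverable wait for the operator.
    return dpu.state
```

Under these assumptions, a DPU that never comes back is marked unrecoverable after one self-recovery wait and one boot wait, because the second power-cycle brings `reset_count` up to the default `dpu_reset_limit` of 2.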
- -**DB State Transition:** - -| DB Field | Before | During Failure | After Recovery | -| -------- | :----: | :------------: | :------------: | -| `ready_status` (all DPUs) | `true` | stale | `false` → `true` (per DPU) | -| `last_down_time` (all DPUs) | — | — | `` | -| `last_ready_time` (all DPUs) | — | — | `` (per DPU) | -| `reset_count` (all DPUs) | N | stale | 0 | - -> **Note:** `reset_count` is reset to 0 on `chassisd` restart. This means a DPU that was close to `dpu_reset_limit` gets a fresh retry budget after a pmon crash. This is by design — the pmon restart itself represents operator-level intervention, and persistent hardware issues will be caught again within `dpu_reset_limit` attempts. - -> **Note:** If only `chassisd` crashes within the `pmon` container (while `pmon` itself stays running), `supervisord` inside `pmon` restarts `chassisd` automatically. The recovery behavior is identical to the full `pmon` crash case described above — `chassisd` re-initializes all DPU states on startup. - ---- - -### databasedpu crash on NPU ### - -**Description:** -The `databasedpu` (per-DPU Redis database instance) on the NPU crashes. Each DPU has a dedicated Redis instance on the NPU (port 6381 + DPU ID, bound to midplane bridge IP 169.254.200.254). These per-DPU Redis instances host the DPU's APPL_DB, CONFIG_DB, STATE_DB, etc. — they are **not** the same as CHASSIS_STATE_DB. The DPU's `orchagent` and `syncd` read/write from these instances, while `chassisd` monitors DPU health via CHASSIS_STATE_DB (a separate Redis instance on the NPU). - -**Detection (by PMON):** -- `chassisd` does **not** directly detect the `databasedpuN` service crash. The detection is indirect: - 1. When `databasedpuN` crashes, the DPU's `orchagent` and other services lose DB connectivity. - 2. Critical services on the DPU fail, causing `SYSTEM_READY` on the DPU to go `false`. - 3. 
The DPU updates its `dpu_control_plane_state` to `down` in CHASSIS_STATE_DB (which is a separate Redis instance and remains accessible). - 4. `chassisd` observes `dpu_control_plane_state: down` on its next poll cycle. -- If the `databasedpuN` crash is caused by a midplane failure, then CHASSIS_STATE_DB also becomes inaccessible from the DPU side, and `chassisd` detects the failure via `dpu_midplane_link_state: down` instead. - -**PMON Action:** -- `chassisd` sets `ready_status` to `false` and updates `last_down_time`. -- If `dpu_control_plane_state` or `dpu_midplane_link_state` is observed as `down`, `chassisd` initiates a DPU power-cycle (same as any other unplanned failure). -- After `systemd` restarts the Redis instance and the DPU recovers (or after power-cycle recovery), `chassisd` polls DPU state, sets `ready_status` back to `true`, and updates `last_ready_time` once all states are verified. +| `reset_count` | N | N | N (unchanged) | +| `recovery_status` | `recoverable` | `recoverable` | `recoverable` | -**DB State Transition:** +**DB State Transition (DPU does NOT self-recover — power-cycle issued):** -| DB Field | Before | During Failure | After Recovery | -| -------- | :----: | :------------: | :------------: | +| DB Field | Before | During WaitForSelfRecovery | After Power-Cycle Recovery | +| -------- | :----: | :------------------------: | :------------------------: | | `dpu_control_plane_state` | `up` | `down` | `up` | -| `ready_status` | `true` | `false` | `true` | -| `last_down_time` | — | `` | — | -| `last_ready_time` | — | — | `` | -| `reset_count` | N | N | N+1 | -| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) | - ---- - -## DPU Hardware Failures ## - -### DPU Hardware Failure (Complete DPU Down) ### - -**Description:** -A DPU completely fails due to hardware fault, thermal event, or unrecoverable error. The DPU is no longer responsive on the midplane or back-panel ports. 
- -**Detection (by PMON):** -- NPU: Oper state of the DPU `CHASSIS_MODULE_TABLE|DPU|oper_status` is set to `offline`. - -**PMON Action:** -- `chassisd` sets `ready_status` to `false` and updates `last_down_time` for the corresponding DPU. -- `chassisd` immediately power-cycles the DPU (the DPU is already confirmed non-functional via `oper_status: Offline`) and increments `reset_count`. -- After power-cycle, DPU goes through full boot sequence: midplane attach → PCIe rescan → SONiC boot → container startup. -- `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. -- If `reset_count` reaches `dpu_reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. -- **When auto-recovery is disabled:** `chassisd` skips the immediate power-cycle. The DPU remains in **ManualIntervention** with `oper_status: Offline` and `ready_status: false`; operator must trigger recovery. - -**DB State Transition:** - -| DB Field | Before | During Failure | After Recovery | -| -------- | :----: | :------------: | :------------: | -| `oper_status` | `Online` | `Offline` | `Online` | -| `ready_status` | `true` | `false` | `true` | -| `last_down_time` | — | `` | — | -| `last_ready_time` | — | — | `` | -| `reset_count` | N | N | N+1 | -| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) | - ---- - -### DPU Power Failure / Unexpected Shutdown ### - -**Description:** -The DPU loses power unexpectedly or shuts down without graceful notification (e.g., voltage regulator failure, firmware crash). - -**Detection (by PMON):** -- NPU `pmon` detects midplane ping failure → `dpu_midplane_link_state` set to `down`. -- `dpu_control_plane_state` transitions to `down`. - -**PMON Action:** -- `chassisd` sets `ready_status` to `false` and updates `last_down_time`. 
-- Since `dpu_midplane_link_state` is down, `chassisd` enters the **WaitForWatchdog** state and starts the `dpu_boot_timeout` timer (600 seconds) to allow the DPU to self-recover via HW watchdog. -- If the DPU self-recovers within the timeout and reports a valid reboot cause (`Kernel Panic`, `Memory Exhaustion`, or `Watchdog`), `chassisd` accepts the recovery without issuing its own power-cycle. -- If the timeout expires or the reboot cause is invalid, `chassisd` power-cycles the DPU and increments `reset_count`. -- After recovery, `chassisd` verifies all DPU states, sets `ready_status` back to `true`, and updates `last_ready_time`. -- If `reset_count` reaches `dpu_reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. -- **When auto-recovery is disabled:** `chassisd` skips the WaitForWatchdog and power-cycle. The DPU remains in **ManualIntervention** with `ready_status: false`; operator must trigger recovery. - -**DB State Transition:** - -| DB Field | Before | During Failure | After Recovery | -| -------- | :----: | :------------: | :------------: | | `dpu_midplane_link_state` | `up` | `down` | `up` | -| `dpu_control_plane_state` | `up` | `down` | `up` | | `ready_status` | `true` | `false` | `true` | | `last_down_time` | — | `` | — | | `last_ready_time` | — | — | `` | | `reset_count` | N | N | N+1 | | `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) | ---- +> **Note:** `chassisd` polls `dpu_data_plane_state` alongside `dpu_control_plane_state` and `dpu_midplane_link_state`, but `dpu_data_plane_state` alone does not trigger recovery actions. The `dpu_data_plane_state` is used solely to determine full DPU readiness for setting `ready_status` to `true`. -### PCIe Failure ### - -**Description:** -The PCIe bus between the NPU and a local DPU fails, making the DPU unreachable from the NPU. 
The DPU may still be running internally but is disconnected from the NPU. - -**Detection (by PMON):** -- `pcied` detects PCIe link down and updates `PCIE_DETACH_INFO|DPU` in STATE_DB with `dpu_state: detached`. -- Independently, `chassisd` detects midplane loss via `is_midplane_reachable()` polling and updates `dpu_midplane_link_state` → `down` in CHASSIS_STATE_DB. - -**PMON Action:** -- `chassisd` sets `ready_status` to `false` and updates `last_down_time`. -- Since midplane is down (PCIe detached implies midplane unreachable), `chassisd` enters the **WaitForWatchdog** state and starts the `dpu_boot_timeout` timer (600 seconds) to allow the DPU to self-recover via HW watchdog. -- If the DPU self-recovers within the timeout and reports a valid reboot cause (`Kernel Panic`, `Memory Exhaustion`, or `Watchdog`), `chassisd` accepts the recovery without issuing its own power-cycle. -- If the timeout expires or the reboot cause is invalid, `chassisd` power-cycles the DPU (`power_down()` → `pci_detach()` → `power_up()` → `pci_reattach()`) and increments `reset_count`. -- After power-cycle and PCIe rescan, `chassisd` verifies all DPU states (midplane, control plane, data plane), sets `ready_status` back to `true`, and updates `last_ready_time`. -- If `reset_count` reaches `dpu_reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"` and stops further automatic power-cycle attempts. -- **When auto-recovery is disabled:** `chassisd` skips the WaitForWatchdog and power-cycle. The DPU remains in **ManualIntervention** with `ready_status: false`; operator must trigger recovery. 
- -**DB State Transition:** - -| DB Field | Before | During Failure | After Recovery | -| -------- | :----: | :------------: | :------------: | -| `dpu_midplane_link_state` | `up` | `down` | `up` | -| `PCIE_DETACH_INFO` `dpu_state` | `reattached` | `detached` | `reattached` | -| `ready_status` | `true` | `false` | `true` | -| `last_down_time` | — | `` | — | -| `last_ready_time` | — | — | `` | -| `reset_count` | N | N | N+1 | -| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) | +> **Note:** During **WaitForSelfRecovery**, if a planned operation (`config chassis module shutdown`) is requested, `chassisd` cancels the self-recovery timer, powers down the DPU, and transitions to **Offline**. --- -## NPU / Switch Level Failures ## - -### NPU Kernel Crash / Memory Exhaustion ### +### NPU Ungraceful Reboot ### **Description:** -The entire switch (NPU + all DPUs) goes down due to kernel panic or memory exhaustion. All DPUs on the switch are impacted simultaneously. +The NPU reboots unexpectedly due to kernel panic, memory exhaustion, or other unplanned event. All DPUs on the switch are potentially impacted. **Detection (by PMON):** -- On NPU recovery, `chassisd` reads the reboot cause from `/host/reboot-cause/reboot-cause.txt`. If the reboot cause indicates a kernel crash or memory exhaustion (e.g., `Kernel Panic`), `chassisd` treats all DPU states as potentially stale and triggers re-initialization. +- On NPU recovery, `chassisd` reads the reboot cause from `/host/reboot-cause/reboot-cause.txt`. +- If the reboot cause is `Kernel Panic` or `Unknown`, `chassisd` treats all DPU states as potentially stale and triggers recovery for all DPUs. **PMON Action:** -- On recovery, `chassisd` initializes all DPU states as `down`, sets `ready_status` to `false`, and updates `last_down_time` for all DPUs. -- `chassisd` re-establishes midplane connectivity and polls each DPU's state. 
-- For every admin-up DPU, irrespective of its observed state (healthy, degraded, or unresponsive), `chassisd` issues a platform vendor power-cycle (`power_down()` → `pci_detach()` → `power_up()` → `pci_reattach()`) to guarantee a known-good starting state after the NPU crash, and increments `reset_count`.
+- On recovery, `chassisd` resets `reset_count` to 0 for all DPUs, sets `ready_status` to `false`, and updates `last_down_time` for all DPUs.
+- For every admin-up DPU, `chassisd` issues a power-cycle (`power_down()` → `pci_detach()` → `power_up()` → `pci_reattach()`) to guarantee a known-good starting state after the ungraceful reboot, and increments `reset_count`.
 - Admin-down DPUs (`oper_status: Offline`) are left powered off; `chassisd` does not reset them.
 - After each DPU comes back, `chassisd` verifies all DPU states (midplane, control plane, data plane) and, on success, sets `ready_status` back to `true` and updates `last_ready_time`.
-- **When auto-recovery is disabled:** `chassisd` skips the unconditional power-cycle for admin-up DPUs. Each DPU is left in its post-crash state with `ready_status: false` and remains in **ManualIntervention** awaiting operator action.
+- **When auto-recovery is disabled:** `chassisd` skips the power-cycle for admin-up DPUs. Each DPU is left in its post-crash state with `ready_status: false` and remains in **ManualIntervention** awaiting operator action.
 **DB State Transition:**
@@ -494,96 +346,11 @@ The entire switch (NPU + all DPUs) goes down due to kernel panic or memory exhau
 | `ready_status` (all DPUs) | `true` | `false` | `true` (per DPU) |
 | `last_down_time` (all DPUs) | — | `` | — |
 | `last_ready_time` (all DPUs) | — | — | `` (per DPU) |
-| `reset_count` (per admin-up DPU) | N | N | N+1 |
-
-> **Note:** `reset_count` is reset to 0 on `chassisd` startup (per the field definition), so the "Before Crash" value above is the count as observed by the freshly restarted `chassisd` after the NPU comes back — effectively starting from 0.
-
----
-
-## DPU Kernel Panic / Memory Exhaustion (HW Watchdog) ##
-
-**Description:**
-A DPU experiences a kernel panic or memory exhaustion, causing the Linux kernel to crash. If a hardware watchdog is enabled on the DPU, the watchdog timer fires and power-cycles the DPU automatically without NPU intervention. The DPU goes through a full cold boot and comes back to a healthy state on its own.
-
-Without HW watchdog awareness, `chassisd` would detect `dpu_midplane_link_state: down` and immediately issue its own power-cycle of the DPU — which is redundant and disruptive (it interrupts the DPU's in-progress self-recovery via watchdog).
-
-**Platform Configuration:**
-
-The `dpu_boot_timeout` in `platform.json` controls the WaitForWatchdog grace period (same timer used for boot-after-power-cycle):
-
-```json
-{
-  "dpu_boot_timeout": 600
-}
-```
-
-The 600-second default must be long enough for a full DPU HW watchdog trigger + power-cycle + cold boot sequence. The HW watchdog itself may have a timeout of 30–120s before it fires, plus the DPU boot time.
-
-**Detection (by PMON):**
-- `chassisd` detects `dpu_midplane_link_state: down` on its next poll cycle — same as any other HW failure.
-
-**PMON Action (when midplane goes down):**
-1. `chassisd` sets `ready_status` to `false` and updates `last_down_time`.
-2. `chassisd` enters the **WaitForWatchdog** state and starts the `dpu_boot_timeout` timer (default 600 seconds).
-3. `chassisd` continues polling the DPU state every 10 seconds during this period.
-4. **If the DPU comes back** (midplane up → control plane up) within the timeout:
-   - `chassisd` reads the DPU's previous reboot cause from `CHASSIS_STATE_DB: DPU_STATE|DPU: dpu_reboot_cause` (written by DPU's `determine-reboot-cause` service on boot).
-   - **If reboot cause is `Kernel Panic`, `Memory Exhaustion`, or `Watchdog`:** `chassisd` accepts the self-recovery — transitions to **Ready**, sets `ready_status` to `true`, updates `last_ready_time`. `reset_count` is **not** incremented (the HW watchdog handled recovery autonomously).
-   - **If reboot cause is anything else** (e.g., `Unknown`, `Software`, `Power Loss`): `chassisd` does **not** trust the self-recovery. It proceeds with a full power-cycle (`PowerCycle` state), incrementing `reset_count`, to guarantee a clean DPU state.
-5. **If the timeout expires** (DPU not back after 600 seconds):
-   - `chassisd` transitions to **PowerCycle** (if auto-recovery enabled) or **ManualIntervention** (if disabled).
-   - Standard recovery flow applies: power-cycle, increment `reset_count`, wait for boot via `dpu_boot_timeout`.
-6. If `reset_count` reaches `dpu_reset_limit`, `chassisd` sets `recovery_status` to `"unrecoverable"`.
-
-**DPU Reboot Cause Reporting:**
-
-The DPU reports its reboot cause to `CHASSIS_STATE_DB` on the NPU after every boot:
-
-```
-DPU_STATE|DPU:
-{
-  "dpu_reboot_cause": "Kernel Panic" | "Memory Exhaustion" | "Watchdog" | "Power Loss" | "Software" | "Unknown"
-}
-```
-
-The DPU's `determine-reboot-cause` service reads from `/host/reboot-cause/reboot-cause.txt` on boot and publishes the cause to CHASSIS_STATE_DB via the midplane. `chassisd` reads this field after the DPU's `dpu_control_plane_state` transitions back to `up`.
-
-**DB State Transition (DPU self-recovers — reboot cause is valid: Kernel Panic / Memory Exhaustion / Watchdog):**
-
-| DB Field | Before | During WaitForWatchdog | After Self-Recovery |
-| -------- | :----: | :-------------------: | :-----------------: |
-| `dpu_midplane_link_state` | `up` | `down` | `up` |
-| `dpu_control_plane_state` | `up` | `down` | `up` |
-| `ready_status` | `true` | `false` | `true` |
-| `last_down_time` | — | `` | — |
-| `last_ready_time` | — | — | `` |
-| `reset_count` | N | N | N (unchanged) |
-| `recovery_status` | `recoverable` | `recoverable` | `recoverable` |
-| `dpu_reboot_cause` | (previous) | (stale) | `Kernel Panic` / `Memory Exhaustion` / `Watchdog` |
-
-**DB State Transition (DPU self-recovers — reboot cause invalid → power-cycle anyway):**
-
-| DB Field | Before | During WaitForWatchdog | After Power-Cycle |
-| -------- | :----: | :-------------------: | :---------------: |
-| `dpu_midplane_link_state` | `up` | `down` → `up` (self-recovered) → `down` (power-cycle) | `up` |
-| `ready_status` | `true` | `false` | `true` |
-| `reset_count` | N | N | N+1 |
-| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) |
+| `reset_count` (all admin-up DPUs) | N | 0 | 1 |
-**DB State Transition (DPU does NOT self-recover — timeout expires):**
+> **Note:** `reset_count` is reset to 0 on `chassisd` startup. This means a DPU that was close to `dpu_reset_limit` gets a fresh retry budget after an NPU reboot. This is by design — persistent hardware issues will be caught again within `dpu_reset_limit` attempts.
-| DB Field | Before | During WaitForWatchdog | After Power-Cycle |
-| -------- | :----: | :-------------------: | :---------------: |
-| `dpu_midplane_link_state` | `up` | `down` | `up` |
-| `ready_status` | `true` | `false` | `true` |
-| `reset_count` | N | N | N+1 |
-| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) |
-
-> **Note:** The `dpu_boot_timeout` (600s) is used for both **Booting** (wait for DPU to come up after power-cycle) and **WaitForWatchdog** (wait for DPU to self-recover via HW watchdog). The 600-second value accounts for the HW watchdog trigger delay (30–120s) + DPU power-cycle + cold boot sequence.
-
-> **Note:** During **WaitForWatchdog**, if a planned operation (`config chassis module shutdown`) is requested, `chassisd` cancels the watchdog wait timer, powers down the DPU, and transitions to **Offline**.
-
-> **Note:** This feature only applies to **HW failures** (midplane down). For SW failures (control plane down, midplane still up), `chassisd` still power-cycles immediately because the HW watchdog would not fire in a software-only failure scenario (the DPU hardware is still running).
+> **Note — Multiple DPU recovery:** When multiple (or all) DPUs need recovery simultaneously, `chassisd` issues power-cycles sequentially (one DPU at a time) to avoid power-rail overload and PCIe bus contention. The `dpu_boot_timeout` is tracked per-DPU independently. If a platform supports parallel DPU power-cycle (declared in `platform.json` via `parallel_dpu_recovery: true`), `chassisd` may issue power-cycles in parallel batches.
 ---
@@ -700,17 +467,14 @@ Planned reboot of the entire SmartSwitch (NPU + all DPUs) via CLI: `reboot`. All
 | DPU Scenario | `dpu_control_plane_state` | `dpu_midplane_link_state` | `ready_status` | PMON Action |
 | ------------ | :-----------------------: | :-----------------------: | :-----------: | ----------- |
 | DPU booting – initial state | down | down → up | false | `chassisd` polls; midplane comes up first, then waits for control plane and data plane within `dpu_boot_timeout` |
-| DPU healthy and running – first boot | up | up | true | Set `ready_status=true` after verifying all states |
-| DPU crash / unplanned reboot | down | down | false | Power-cycle DPU; increment `reset_count` |
-| DPU up after crash | up | up | true | Set `ready_status=true` after verifying all states |
-| DPU stuck (lost connectivity) | down | down | false | Power-cycle DPU; increment `reset_count` |
-| DPU up after losing connectivity / reboot | up | up | true | Set `ready_status=true` after verifying all states |
-| DPU control plane restart – critical services | down → up | up | false → true | Power-cycle DPU; increment `reset_count`; set `ready_status=true` on recovery |
-| NPU/DPU OS upgrade | down → up | up | false → true | Re-poll DPU states on NPU recovery |
-| DPU dead – power cycle | down | down | false | Power-cycle DPU; increment `reset_count` |
+| DPU healthy and running | up | up | true | Set `ready_status=true` after verifying all states |
+| DPU failure (control plane or midplane down) | down | up or down | false | Enter **WaitForSelfRecovery**; wait `dpu_self_recovery_timeout` (300s) for self-recovery |
+| DPU self-recovers within timeout | down → up | down → up | false → true | Transition to **Booting**; wait `dpu_boot_timeout` for full readiness; `reset_count` unchanged |
+| DPU does NOT self-recover (timeout expires) | down | down | false | Power-cycle DPU; increment `reset_count`; wait `dpu_boot_timeout` |
+| DPU up after power-cycle | up | up | true | Set `ready_status=true` after verifying all states |
 | DPU dead – unrecoverable | down | down | false | `reset_count` reached `dpu_reset_limit`; `recovery_status` set to `"unrecoverable"`; raise alert |
+| NPU ungraceful reboot (Kernel Panic / Unknown) | down → up | down → up | false → true | Power-cycle all admin-up DPUs; increment `reset_count`; wait `dpu_boot_timeout` |
 | Full SmartSwitch reboot (planned) | down → up | down → up | false → true | gNOI halt; power-cycle; re-verify |
-| DPU kernel panic / mem exhaustion (HW watchdog) | down → up | down → up | false → true | Wait `dpu_boot_timeout` (600s); if DPU self-recovers AND reboot cause is `Kernel Panic`/`Memory Exhaustion`/`Watchdog`, accept recovery (`reset_count` unchanged); otherwise power-cycle |
 ---
@@ -757,9 +521,9 @@ Test implementation: [`tests/smartswitch/platform_tests/test_dpu_failure_modes.p
 | Test Class | Scenario | Validates |
 | ---------- | -------- | --------- |
-| `TestDatabaseDpuCrash` | Kill per-DPU Redis instance (`databasedpuN`) on NPU | `chassisd` detects loss (`ready_status=false`), `systemd` restarts service, `chassisd` recovers (`ready_status=true`) |
+| `TestDatabaseDpuCrash` | Kill per-DPU Redis instance (`databasedpuN`) on NPU | `chassisd` detects loss (`ready_status=false`), enters **WaitForSelfRecovery**; `systemd` restarts service; DPU self-recovers within timeout (`ready_status=true`) |
 | `TestPcieFailure` | Remove DPU PCIe device via sysfs | `pcied` detects detach (`PCIE_DETACH_INFO dpu_state=detached`), `chassisd` marks DPU not-ready, power-cycles, performs PCIe rescan, recovers |
-| `TestControlPlaneOnlyDown` | Stop critical container (`swss`) on DPU | `dpu_control_plane_state=down` while midplane stays up; `chassisd` detects and power-cycles DPU |
+| `TestControlPlaneOnlyDown` | Stop critical container (`swss`) on DPU | `dpu_control_plane_state=down` while midplane stays up; `chassisd` enters **WaitForSelfRecovery**, waits `dpu_self_recovery_timeout`, then power-cycles DPU |
 | `TestAutoRecoveryDisabled` | Disable `FEATURE\|dpu-auto-recovery`, trigger failure | Confirms `chassisd` does NOT power-cycle (ManualIntervention); re-enable and verify recovery |
 | `TestUnrecoverableState` | Repeatedly trigger failures until `reset_count` ≥ `dpu_reset_limit` | `recovery_status=unrecoverable`; `chassisd` stops retrying |
 | `TestStateMachineTransitions` | Planned shutdown → offline → startup → ready | `last_down_time` and `last_ready_time` updated correctly; `recovery_status` stays `recoverable` |

From 981447770bd4a28d4d6540f603a2f891c69b95e3 Mon Sep 17 00:00:00 2001
From: Vasundhara Volam
Date: Fri, 29 May 2026 22:34:31 +0000
Subject: [PATCH 11/14] Address PR review: clarify Booting state DB indicators

Update the Booting state's Key DB Indicators to use timer-based condition: 'dpu_boot_timeout timer running AND NOT (midplane up AND control plane up)' instead of assuming specific intermediate link states.

Signed-off-by: Vasundhara Volam
---
 doc/smart-switch/pmon/enhance-dpu-robustness.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md
index 50a5d8c7634..8a5d7b91707 100644
--- a/doc/smart-switch/pmon/enhance-dpu-robustness.md
+++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md
@@ -257,7 +257,7 @@ stateDiagram-v2
 | State | `ready_status` | `recovery_status` | Key DB Indicators |
 | ----- | :------------: | :----------------: | ----------------- |
-| **Booting** | `false` | `recoverable` | `dpu_midplane_link_state: up`, `dpu_control_plane_state: down`; `dpu_boot_timeout` timer running |
+| **Booting** | `false` | `recoverable` | `dpu_boot_timeout` timer running AND NOT (`dpu_midplane_link_state: up` AND `dpu_control_plane_state: up`) |
 | **Ready** | `true` | `recoverable` | `dpu_midplane_link_state: up`, `dpu_control_plane_state: up`, `dpu_data_plane_state: up` |
 | **WaitForSelfRecovery** | `false` | `recoverable` | `dpu_control_plane_state: down` OR `dpu_midplane_link_state: down`; `dpu_self_recovery_timeout` timer running |
 | **PowerCycle** | `false` | `recoverable` | `dpu_midplane_link_state: down`, `dpu_control_plane_state: down`; `reset_count` incremented; power-cycle in progress |

From d4218aca843330b4c064f50a5cb9face2eff4fbc Mon Sep 17 00:00:00 2001
From: Vasundhara Volam
Date: Sat, 30 May 2026 01:57:40 +0000
Subject: [PATCH 12/14] Fix table rendering: escape pipe in FEATURE|dpu-auto-recovery

Escape the pipe character in the ManualIntervention row's Key DB Indicators column to prevent GitHub Markdown from breaking the table.

Signed-off-by: Vasundhara Volam
---
 doc/smart-switch/pmon/enhance-dpu-robustness.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md
index 8a5d7b91707..9b633946fa1 100644
--- a/doc/smart-switch/pmon/enhance-dpu-robustness.md
+++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md
@@ -261,7 +261,7 @@ stateDiagram-v2
 | **Ready** | `true` | `recoverable` | `dpu_midplane_link_state: up`, `dpu_control_plane_state: up`, `dpu_data_plane_state: up` |
 | **WaitForSelfRecovery** | `false` | `recoverable` | `dpu_control_plane_state: down` OR `dpu_midplane_link_state: down`; `dpu_self_recovery_timeout` timer running |
 | **PowerCycle** | `false` | `recoverable` | `dpu_midplane_link_state: down`, `dpu_control_plane_state: down`; `reset_count` incremented; power-cycle in progress |
-| **ManualIntervention** | `false` | `recoverable` | `dpu_control_plane_state: down` OR `dpu_midplane_link_state: down`; `FEATURE|dpu-auto-recovery` `state`: `disabled` / `always_disabled` |
+| **ManualIntervention** | `false` | `recoverable` | `dpu_control_plane_state: down` OR `dpu_midplane_link_state: down`; `FEATURE\|dpu-auto-recovery` `state`: `disabled` / `always_disabled` |
 | **Offline** | `false` | `recoverable` | `oper_status: Offline`; DPU admin-down via `config chassis module shutdown` |
 | **Unrecoverable** | `false` | `unrecoverable` | `reset_count` ≥ `dpu_reset_limit`; `dpu_control_plane_state: down`; `dpu_midplane_link_state: down` |

From 9877b27ce0943dfe8ffafd5466aa63da21be3fec Mon Sep 17 00:00:00 2001
From: Vasundhara Volam
Date: Mon, 1 Jun 2026 18:22:43 +0000
Subject: [PATCH 13/14] Address PR review: fix state table, diagram, and step ordering
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Booting state: add dpu_data_plane_state to NOT condition
- Split 'After Power-Cycle Recovery' column into success vs. limit reached to clarify unrecoverable state
- Add WaitForSelfRecovery/PowerCycle → Offline transitions to Mermaid diagram for CLI module shutdown during recovery
- Reorder graceful shutdown steps: power_down → pci_detach → clear state_transition_in_progress → sensor config
- Fix Full SmartSwitch Reboot: chassisd detects reboot cause on startup (DPUs not guaranteed to reattach before NPU goes down)

Signed-off-by: Vasundhara Volam
---
 .../pmon/enhance-dpu-robustness.md | 32 ++++++++++---------
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md
index 9b633946fa1..d111b0dd1e7 100644
--- a/doc/smart-switch/pmon/enhance-dpu-robustness.md
+++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md
@@ -237,9 +237,11 @@ stateDiagram-v2
     WaitForSelfRecovery --> Booting : DPU self-recovered (midplane up OR control plane up)
     WaitForSelfRecovery --> PowerCycle : self-recovery timeout expired
+    WaitForSelfRecovery --> Offline : CLI module shutdown
     PowerCycle --> Booting : Power cycle issued
     PowerCycle --> Unrecoverable : reset count >= dpu_reset_limit
+    PowerCycle --> Offline : CLI module shutdown
     ManualIntervention --> Booting : Operator power-cycle / module startup
@@ -257,7 +259,7 @@ stateDiagram-v2
 | State | `ready_status` | `recovery_status` | Key DB Indicators |
 | ----- | :------------: | :----------------: | ----------------- |
-| **Booting** | `false` | `recoverable` | `dpu_boot_timeout` timer running AND NOT (`dpu_midplane_link_state: up` AND `dpu_control_plane_state: up`) |
+| **Booting** | `false` | `recoverable` | `dpu_boot_timeout` timer running AND NOT (`dpu_midplane_link_state: up` AND `dpu_control_plane_state: up` AND `dpu_data_plane_state: up`) |
 | **Ready** | `true` | `recoverable` | `dpu_midplane_link_state: up`, `dpu_control_plane_state: up`, `dpu_data_plane_state: up` |
 | **WaitForSelfRecovery** | `false` | `recoverable` | `dpu_control_plane_state: down` OR `dpu_midplane_link_state: down`; `dpu_self_recovery_timeout` timer running |
 | **PowerCycle** | `false` | `recoverable` | `dpu_midplane_link_state: down`, `dpu_control_plane_state: down`; `reset_count` incremented; power-cycle in progress |
@@ -307,15 +309,15 @@ Any unplanned event that causes `dpu_control_plane_state` or `dpu_midplane_link_
 **DB State Transition (DPU does NOT self-recover — power-cycle issued):**
-| DB Field | Before | During WaitForSelfRecovery | After Power-Cycle Recovery |
-| -------- | :----: | :------------------------: | :------------------------: |
-| `dpu_control_plane_state` | `up` | `down` | `up` |
-| `dpu_midplane_link_state` | `up` | `down` | `up` |
-| `ready_status` | `true` | `false` | `true` |
-| `last_down_time` | — | `` | — |
-| `last_ready_time` | — | — | `` |
-| `reset_count` | N | N | N+1 |
-| `recovery_status` | `recoverable` | `recoverable` | `recoverable` (or `unrecoverable` if N+1 ≥ `dpu_reset_limit`) |
+| DB Field | Before | During WaitForSelfRecovery | After Power-Cycle (success) | After Power-Cycle (limit reached) |
+| -------- | :----: | :------------------------: | :-------------------------: | :-------------------------------: |
+| `dpu_control_plane_state` | `up` | `down` | `up` | `down` |
+| `dpu_midplane_link_state` | `up` | `down` | `up` | `down` |
+| `ready_status` | `true` | `false` | `true` | `false` |
+| `last_down_time` | — | `` | — | `` |
+| `last_ready_time` | — | — | `` | — |
+| `reset_count` | N | N | N+1 | N+1 (= `dpu_reset_limit`) |
+| `recovery_status` | `recoverable` | `recoverable` | `recoverable` | `unrecoverable` |
 > **Note:** `chassisd` polls `dpu_data_plane_state` alongside `dpu_control_plane_state` and `dpu_midplane_link_state`, but `dpu_data_plane_state` alone does not trigger recovery actions. The `dpu_data_plane_state` is used solely to determine full DPU readiness for setting `ready_status` to `true`.
@@ -372,9 +374,9 @@ Orderly shutdown of a DPU via CLI command: `config chassis module shutdown DPU
 **Note — Multiple DPU recovery:** When multiple (or all) DPUs need recovery simultaneously, `chassisd` issues power-cycles sequentially (one DPU at a time) to avoid power-rail overload and PCIe bus contention. The `dpu_boot_timeout` is tracked per-DPU independently. If a platform supports parallel DPU power-cycle (declared in `platform.json` via `parallel_dpu_recovery: true`), `chassisd` may issue power-cycles in parallel batches.

From 28949dd329085998d1b90070367f1878aec1213c Mon Sep 17 00:00:00 2001
From: Vasundhara Volam
Date: Mon, 1 Jun 2026 18:34:42 +0000
Subject: [PATCH 14/14] Merge PlannedShutdown and Offline into single AdminDown state
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Simplify the state machine by combining PlannedShutdown and Offline into a single AdminDown state. The state machine now waits for admin state up to transition back to Booting. The actual planned shutdown actions (gNOI HALT, power_down, pci_detach) are tracked separately via state_transition_in_progress flag. Also update PlannedReboot to be a direct Ready → Booting transition.
Signed-off-by: Vasundhara Volam
---
 .../pmon/enhance-dpu-robustness.md | 26 +++++++++----------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/doc/smart-switch/pmon/enhance-dpu-robustness.md b/doc/smart-switch/pmon/enhance-dpu-robustness.md
index d111b0dd1e7..7262248a860 100644
--- a/doc/smart-switch/pmon/enhance-dpu-robustness.md
+++ b/doc/smart-switch/pmon/enhance-dpu-robustness.md
@@ -237,22 +237,20 @@ stateDiagram-v2
     WaitForSelfRecovery --> Booting : DPU self-recovered (midplane up OR control plane up)
     WaitForSelfRecovery --> PowerCycle : self-recovery timeout expired
-    WaitForSelfRecovery --> Offline : CLI module shutdown
+    WaitForSelfRecovery --> AdminDown : CLI module shutdown
+    WaitForSelfRecovery --> Booting : CLI reboot DPU (cancel timer, power-cycle)
     PowerCycle --> Booting : Power cycle issued
     PowerCycle --> Unrecoverable : reset count >= dpu_reset_limit
-    PowerCycle --> Offline : CLI module shutdown
+    PowerCycle --> AdminDown : CLI module shutdown
     ManualIntervention --> Booting : Operator power-cycle / module startup
-    Booting --> Offline : CLI module shutdown during boot
-    Ready --> PlannedShutdown : CLI module shutdown
-    Ready --> PlannedReboot : CLI reboot DPU
+    Booting --> AdminDown : CLI module shutdown during boot
+    Ready --> AdminDown : CLI module shutdown (gNOI HALT then power down)
+    Ready --> Booting : CLI reboot DPU (gNOI HALT then power cycle)
-    PlannedShutdown --> Offline : gNOI HALT then power down
-    PlannedReboot --> Booting : gNOI HALT then power cycle
-
-    Offline --> Booting : CLI module startup
+    AdminDown --> Booting : CLI module startup
     Unrecoverable --> Booting : Operator module startup or chassisd restart
 ```
@@ -264,7 +262,7 @@ stateDiagram-v2
 | **WaitForSelfRecovery** | `false` | `recoverable` | `dpu_control_plane_state: down` OR `dpu_midplane_link_state: down`; `dpu_self_recovery_timeout` timer running |
 | **PowerCycle** | `false` | `recoverable` | `dpu_midplane_link_state: down`, `dpu_control_plane_state: down`; `reset_count` incremented; power-cycle in progress |
 | **ManualIntervention** | `false` | `recoverable` | `dpu_control_plane_state: down` OR `dpu_midplane_link_state: down`; `FEATURE\|dpu-auto-recovery` `state`: `disabled` / `always_disabled` |
-| **Offline** | `false` | `recoverable` | `oper_status: Offline`; DPU admin-down via `config chassis module shutdown` |
+| **AdminDown** | `false` | `recoverable` | `oper_status: Offline`; DPU admin-down via `config chassis module shutdown` |
 | **Unrecoverable** | `false` | `unrecoverable` | `reset_count` ≥ `dpu_reset_limit`; `dpu_control_plane_state: down`; `dpu_midplane_link_state: down` |
 ---
@@ -321,7 +319,7 @@ Any unplanned event that causes `dpu_control_plane_state` or `dpu_midplane_link_
 > **Note:** `chassisd` polls `dpu_data_plane_state` alongside `dpu_control_plane_state` and `dpu_midplane_link_state`, but `dpu_data_plane_state` alone does not trigger recovery actions. The `dpu_data_plane_state` is used solely to determine full DPU readiness for setting `ready_status` to `true`.
-> **Note:** During **WaitForSelfRecovery**, if a planned operation (`config chassis module shutdown`) is requested, `chassisd` cancels the self-recovery timer, powers down the DPU, and transitions to **Offline**.
+> **Note:** During **WaitForSelfRecovery**, if a planned operation (`config chassis module shutdown`) is requested, `chassisd` cancels the self-recovery timer, powers down the DPU, and transitions to **AdminDown**.
 ---
@@ -390,7 +388,7 @@ Orderly shutdown of a DPU via CLI command: `config chassis module shutdown DPU