[pmon]: HLD: Enhance DPU Robustness in Smart Switch #2310
Open
vvolam wants to merge 4 commits into
Add High Level Design document covering DPU failure scenarios for Smart Switch, including software failures, hardware failures, and NPU/switch level failures. Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
Collaborator
/azp run
No pipelines are associated with this pull request.
DPU control plane, midplane, and data plane states are always 'down' during booting, never 'unknown'. Update terminology, state machine table, and scenario summary accordingly. Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
…covery gating

- Replace the ambiguous two-timer model (60s auto-recovery + 180s power-cycle) with a single, clearly-named dpu_auto_recovery_timeout (60s). Update the timer table, state machine edge labels, and all DPU software/hardware failure scenarios to use the consistent name.
- Rename 'Critical process' subsections to 'Process' for accuracy; update TOC anchors and Scope wording accordingly.
- Add ManualIntervention state to the DPU recovery state machine and gate SWFailure/HW-failure transitions on the auto-recovery feature flag. Add a global note plus per-scenario 'When auto-recovery is disabled' bullets so the FEATURE|dpu-auto-recovery=disabled behavior is consistent across every failure scenario.
- Rework NPU Kernel Crash recovery: chassisd unconditionally power-cycles every admin-up DPU via the platform vendor path (power_down/pci_detach/power_up/pci_reattach) instead of using gNOI Reboot RPC against potentially unresponsive DPUs. Admin-down DPUs are left offline. Add reset_count row to the DB transition table with a note about chassisd-restart zeroing.
- Fix 'Table of Content' typo and add Existing/New DB entries sub-entries to the table of contents.
- Replace literal pipe inside backticks in the state table cell with HTML entity so the markdown table renders correctly on GitHub.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
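The NPU Kernel Crash recovery described in this commit could be sketched roughly as below. This is an illustration under stated assumptions, not the chassisd code: the `FakeDpu` class, its `admin_up` attribute, and the `recover_after_npu_kernel_crash` function are hypothetical names. Only the four step names and the admin-up/admin-down rule come from the commit message, and the step order simply follows the order in which the message lists them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeDpu:
    """Hypothetical stand-in for a platform DPU object.

    It records the calls made on it instead of touching hardware.
    """
    name: str
    admin_up: bool
    calls: List[str] = field(default_factory=list)

    def power_down(self) -> None:
        self.calls.append("power_down")

    def pci_detach(self) -> None:
        self.calls.append("pci_detach")

    def power_up(self) -> None:
        self.calls.append("power_up")

    def pci_reattach(self) -> None:
        self.calls.append("pci_reattach")


def recover_after_npu_kernel_crash(dpus: List[FakeDpu]) -> List[str]:
    """Power-cycle every admin-up DPU; leave admin-down DPUs offline.

    Returns the names of the DPUs that were power-cycled.
    """
    cycled = []
    for dpu in dpus:
        if not dpu.admin_up:
            continue  # admin-down DPUs are left offline
        # Step order assumed from the commit message wording.
        dpu.power_down()
        dpu.pci_detach()
        dpu.power_up()
        dpu.pci_reattach()
        cycled.append(dpu.name)
    return cycled
```

Note that the sketch does not check DPU health before cycling, mirroring the "unconditionally" wording: after an NPU kernel crash a DPU may be unresponsive, so no RPC to the DPU is attempted.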
Drop the dpu_auto_recovery_timeout self-heal grace period. chassisd now initiates a DPU power-cycle as soon as it observes dpu_control_plane_state (or dpu_midplane_link_state) as down on its next 10s health poll, regardless of whether the failure is a transient process restart or a persistent crash-loop.

- Remove dpu_auto_recovery_timeout from the timer table; clarify chassisd health poll interval description to state immediate power-cycle on detection.
- Combine 'Process restart on DPU' and 'Process persistently down on DPU' into a single 'Process crash/restart on DPU' section since chassisd applies the same recovery path in both cases. Update TOC and DB transition table accordingly.
- State machine: keep SWFailure as a transient state on control-plane-down, branching directly into PowerCycle (auto-recovery enabled) or ManualIntervention (auto-recovery disabled) without any timer wait. HW-failure path goes directly from Ready to PowerCycle / ManualIntervention.
- Drop 'skipping dpu_auto_recovery_timeout' parentheticals from HW Failure / Power Failure / PCIe Failure scenarios. Update Scenario DB State Summary row for control plane restart to reflect the immediate power-cycle behavior.

Signed-off-by: Vasundhara Volam <vvolam@microsoft.com>
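The state machine branching described in this commit can be sketched as follows. This is a minimal illustration rather than the HLD's or chassisd's implementation: `State`, `DpuHealth`, and `recovery_path` are hypothetical names. The state names, the two DB field names, and the auto-recovery flag come from the commit message. Routing a midplane-link-down observation through SWFailure in the same way as control-plane-down is an assumption, since the message only spells out the control-plane case.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class State(Enum):
    READY = "Ready"
    SW_FAILURE = "SWFailure"
    POWER_CYCLE = "PowerCycle"
    MANUAL_INTERVENTION = "ManualIntervention"


@dataclass
class DpuHealth:
    """Values observed on one health poll (hypothetical container)."""
    dpu_control_plane_state: str  # "up" or "down"
    dpu_midplane_link_state: str  # "up" or "down"


def recovery_path(health: DpuHealth,
                  auto_recovery_enabled: bool,
                  hw_failure: bool = False) -> List[State]:
    """Return the states visited after one health poll, starting from Ready."""
    # The auto-recovery feature flag gates the final state.
    final = (State.POWER_CYCLE if auto_recovery_enabled
             else State.MANUAL_INTERVENTION)
    if hw_failure:
        # HW-failure path: Ready goes directly to the final state.
        return [State.READY, final]
    if "down" in (health.dpu_control_plane_state,
                  health.dpu_midplane_link_state):
        # SWFailure is transient: no timer wait before branching.
        # Assumption: midplane-link-down takes the same path.
        return [State.READY, State.SW_FAILURE, final]
    return [State.READY]
```

For example, `recovery_path(DpuHealth("down", "up"), auto_recovery_enabled=False)` passes through SWFailure and ends in ManualIntervention, with no grace period in between.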
What I did
Add a High Level Design document for DPU failure scenarios on SmartSwitch from the PMON (Platform Monitor) perspective.
Why I did it
SmartSwitch DPU lifecycle management requires clear specification of failure detection, DB state tracking, and recovery actions performed by chassisd and other PMON sub-daemons. This HLD documents all failure and planned operation scenarios to guide implementation.
How I did it
Added doc/smart-switch/pmon/enhance-dpu-robustness.md covering:
- ready_status, recovery_status, reset_count, last_down_time, last_ready_time in CHASSIS_STATE_DB
- FEATURE|dpu-auto-recovery in CONFIG_DB
- platform.json
How to verify it
Review the HLD document for completeness and correctness of failure scenarios, DB state transitions, and recovery actions.