Commit dbc0b72

csplinter authored and easymrgr committed
Clarify well-known XIDs, add repair for MNG nodes that don't join, clarify defaults can be overridden
cr: https://code.amazon.com/reviews/CR-254939183
1 parent 518e553 commit dbc0b72

3 files changed

Lines changed: 3 additions & 3 deletions

latest/ug/nodes/node-health-nma.adoc

Lines changed: 1 addition & 1 deletion
@@ -143,7 +143,7 @@ The monitoring condition is `AcceleratedHardwareReady` for issues in the followi
 
 The node monitoring agent detects NVIDIA XID errors from GPU kernel logs. XID errors fall into two categories:
 
-* *Well-known XID codes* – Critical errors that set a node condition (`AcceleratedHardwareReady=False`) and trigger auto repair when enabled. The reason code format is `NvidiaXID[Code]Error`.
+* *Well-known XID codes* – Critical errors that set a node condition (`AcceleratedHardwareReady=False`) and trigger auto repair when enabled. The reason code format is `NvidiaXID[Code]Error`. The well-known XID codes that the EKS node monitoring agent detects may not represent the full list of NVIDIA XID codes that require repair actions.
 * *Unknown XID codes* – Logged as Kubernetes events only. These don't trigger auto repair. The reason code format is `NvidiaXID[Code]Warning`. To investigate unknown XID errors, review your kernel logs with `dmesg | grep -i nvrm`.
 
 For more information on XID errors, see link:https://docs.nvidia.com/deploy/xid-errors/index.html#topic_5_1[Xid Errors] in the _NVIDIA GPU Deployment and Management Documentation_. For more information on the individual XID messages, see link:https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html#understanding-xid-messages[Understanding Xid Messages] in the _NVIDIA GPU Deployment and Management Documentation_.
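
Reviewer note: the two reason-code formats quoted in this hunk (`NvidiaXID[Code]Error` and `NvidiaXID[Code]Warning`) can be told apart mechanically. The sketch below is illustrative only and is not part of this commit or of the node monitoring agent. It assumes `[Code]` is a decimal number, and the function name `classify_xid_reason` is hypothetical.

```python
import re

# Matches the two reason-code formats quoted in the hunk above.
# Assumption: [Code] is a decimal XID number.
_REASON_RE = re.compile(r"^NvidiaXID(\d+)(Error|Warning)$")


def classify_xid_reason(reason):
    """Return (xid_code, category) for a reason string, or None if it does not match.

    category is "well-known" for the ...Error form (sets a node condition and can
    trigger auto repair when enabled) and "unknown" for the ...Warning form
    (logged as a Kubernetes event only).
    """
    match = _REASON_RE.match(reason)
    if match is None:
        return None
    code = int(match.group(1))
    category = "well-known" if match.group(2) == "Error" else "unknown"
    return code, category
```

For example, `classify_xid_reason("NvidiaXID1Warning")` yields a tuple tagged `"unknown"`; the code number there is an arbitrary placeholder, not a statement about which XIDs the agent treats as well-known.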

latest/ug/nodes/node-health.adoc

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ With the EKS node monitoring agent, the following categories of node health issu
 
 EKS automatic node repair continuously monitors node health, reacts to detected problems, and replaces or reboots nodes when possible. This improves cluster reliability with minimal manual intervention and helps reduce application downtime.
 
-By itself, EKS automatic node repair reacts to the `Ready` conditions of the kubelet and any manually deleted node objects. When EKS automatic node repair is enabled with the node monitoring agent installed, EKS automatic node repair reacts to additional node conditions: `AcceleratedHardwareReady`, `ContainerRuntimeReady`, `KernelReady`, `NetworkingReady`, and `StorageReady`.
+By itself, EKS automatic node repair reacts to the `Ready` conditions of the kubelet, any manually deleted node objects, and EKS managed node group instances that fail to join the cluster. When EKS automatic node repair is enabled with the node monitoring agent installed, EKS automatic node repair reacts to additional node conditions: `AcceleratedHardwareReady`, `ContainerRuntimeReady`, `KernelReady`, `NetworkingReady`, and `StorageReady`.
 
 EKS automatic node repair does not react to standard Kubernetes `DiskPressure`, `MemoryPressure`, or `PIDPressure` node conditions. These conditions often indicate issues with application behavior, workload configuration, or resource limits rather than node-level failures, making it difficult to determine an appropriate default repair action. In these scenarios, workloads are subject to the Kubernetes link:https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction[node pressure eviction behavior].
 
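
Reviewer note: the condition rules stated in this hunk can be summarized as a small lookup. This is a rough sketch of the documented behavior as worded above, not the actual EKS implementation. The function name `repair_reacts_to` and the `agent_installed` flag are hypothetical, and the sketch covers node conditions only (not deleted node objects or instances that fail to join).

```python
# Condition that automatic node repair reacts to by itself, per the text above.
BASE_CONDITIONS = {"Ready"}

# Additional conditions handled when the node monitoring agent is installed.
AGENT_CONDITIONS = {
    "AcceleratedHardwareReady",
    "ContainerRuntimeReady",
    "KernelReady",
    "NetworkingReady",
    "StorageReady",
}


def repair_reacts_to(condition_type, agent_installed):
    """Return True if the documented rules say auto repair reacts to this condition."""
    if condition_type in BASE_CONDITIONS:
        return True
    # Pressure conditions (DiskPressure, MemoryPressure, PIDPressure) are in
    # neither set, so they fall through to False regardless of the agent.
    return agent_installed and condition_type in AGENT_CONDITIONS
```

Under these assumptions, `repair_reacts_to("KernelReady", agent_installed=False)` is `False`, while the same call with `agent_installed=True` is `True`.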

latest/ug/nodes/node-repair.adoc

Lines changed: 1 addition & 1 deletion
@@ -66,7 +66,7 @@ For a detailed list of node health issues detected by the EKS node monitoring ag
 
 |===
 
-EKS automatic node repair actions are disabled in the following scenarios. In-progress node repair actions continue in each scenario.
+EKS automatic node repair actions are disabled in the following scenarios by default. In-progress node repair actions continue in each scenario. See <<configure-node-repair>> for how to override these default settings.
 
 *EKS managed node groups*
 
