When a GPU node goes into Gardener maintenance, surface that context in the DriverReady condition instead of showing a generic "not ready" status that looks like a real failure.
Without this, every Gardener maintenance window looks like a broken cluster. Platform teams open support tickets and spend time investigating, only to find that it was scheduled maintenance. This feature eliminates that noise.
User experience
Today - maintenance window looks like a real outage:
NAME   READY   REASON           MESSAGE
gpu    False   DriverNotReady   nvidia-driver-daemonset: 0/1 nodes ready
Platform team panics, opens ticket.
After - maintenance is transparent:
NAME   READY     REASON              MESSAGE
gpu    Unknown   NodeInMaintenance   Node gpu-node-1 is in scheduled Gardener maintenance, driver will recover automatically
Platform team sees it, understands it, does nothing.
How it works
Gardener annotates or taints nodes during maintenance. On every reconcile, before setting DriverReady=Unknown, we check whether the affected GPU nodes carry Gardener's maintenance annotation or taint. If they do, we set Reason=NodeInMaintenance instead of the generic Reason=Progressing.
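A minimal sketch of that decision, in Python, may make the check easier to reason about. Everything named here is an assumption for illustration: the `MAINTENANCE_KEY` value is a placeholder (the real Gardener key is still unverified, see the open question), and `Node`, `in_maintenance`, and `unknown_reason` are hypothetical names. The sketch also assumes NodeInMaintenance applies only when every affected node is in maintenance, so a genuine failure on another node is not masked.

```python
from dataclasses import dataclass, field

# Placeholder only: the actual Gardener annotation/taint key is unverified.
MAINTENANCE_KEY = "example.invalid/gardener-maintenance"


@dataclass
class Node:
    """Hypothetical, trimmed-down view of a Kubernetes node."""
    name: str
    annotations: dict = field(default_factory=dict)
    taints: list = field(default_factory=list)  # taint keys only


def in_maintenance(node):
    """True if the node carries the assumed maintenance annotation or taint."""
    return MAINTENANCE_KEY in node.annotations or MAINTENANCE_KEY in node.taints


def unknown_reason(affected, generic_message):
    """Pick (reason, message) to use when setting DriverReady=Unknown.

    affected: GPU nodes whose driver is currently not ready.
    generic_message: the message the controller would have used anyway.
    """
    # Assumption: all affected nodes must be in maintenance; a mixed set
    # falls back to the generic reason to avoid hiding a real failure.
    if affected and all(in_maintenance(n) for n in affected):
        names = ", ".join(n.name for n in affected)
        subject = f"Node {names} is" if len(affected) == 1 else f"Nodes {names} are"
        message = (f"{subject} in scheduled Gardener maintenance, "
                   "driver will recover automatically")
        return ("NodeInMaintenance", message)
    return ("Progressing", generic_message)


if __name__ == "__main__":
    node = Node("gpu-node-1", annotations={MAINTENANCE_KEY: "true"})
    print(unknown_reason([node], "nvidia-driver-daemonset: 0/1 nodes ready"))
```

Whether a mixed set of nodes (some in maintenance, some not) should report NodeInMaintenance is a design choice this proposal does not settle; the conservative fallback above is one option that keeps the "no false positives" criterion easy to satisfy.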
Open question
Confirm the exact annotation or taint key that Gardener sets on nodes during maintenance. This needs verification against Gardener internals before implementation.
Acceptance criteria
- GPU node in Gardener maintenance → DriverReady=Unknown, Reason=NodeInMaintenance with clear message
- Maintenance completes, driver recovers → condition returns to DriverReady=True automatically
- Non-maintenance driver failure → existing Reason=Progressing unchanged, no false positives