Gardener-aware status #72

@vrdc-sap

Description

When a GPU node goes into Gardener maintenance, surface that context in the DriverReady condition instead of showing a generic "not ready" status that looks like a real failure.

Without this, every Gardener maintenance window looks like a broken cluster. Platform teams open support tickets and spend time investigating, only to find that it was scheduled maintenance. This feature eliminates that noise.

User experience

Today - maintenance window looks like a real outage:

    NAME   READY   REASON           MESSAGE
    gpu    False   DriverNotReady   nvidia-driver-daemonset: 0/1 nodes ready

Platform team panics, opens ticket.

After - maintenance is transparent:

    NAME   READY     REASON              MESSAGE
    gpu    Unknown   NodeInMaintenance   Node gpu-node-1 is in scheduled Gardener maintenance, driver will recover automatically

Platform team sees it, understands it, does nothing.

How it works

Gardener annotates or taints nodes during maintenance. On every reconcile, before setting DriverReady=Unknown, we check whether the affected GPU nodes carry Gardener maintenance annotations or taints. If they do, we set Reason=NodeInMaintenance instead of the generic Reason=Progressing.
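
A minimal sketch of that reason selection, using plain structs rather than the real Kubernetes types. Everything named here is an assumption for illustration: the `Node` and `Condition` types, the `driverReadyCondition` function, the `Ready` reason for the healthy case, and above all the `maintenanceAnnotation` key, which is a placeholder until the open question below is answered. The sketch also assumes one possible policy: NodeInMaintenance is reported only when every not-ready GPU node is in maintenance, so a real failure on another node is not masked.

```go
package main

import (
	"fmt"
	"strings"
)

// maintenanceAnnotation is a HYPOTHETICAL placeholder key. The real Gardener
// annotation or taint key is still unconfirmed.
const maintenanceAnnotation = "example.invalid/maintenance"

// Node is a simplified stand-in for a Kubernetes node (hypothetical type).
type Node struct {
	Name        string
	Annotations map[string]string
	DriverReady bool
}

// Condition mirrors the READY / REASON / MESSAGE columns of the status output.
type Condition struct {
	Status  string
	Reason  string
	Message string
}

// driverReadyCondition picks the DriverReady condition for a set of GPU nodes.
func driverReadyCondition(nodes []Node) Condition {
	var notReady, inMaintenance []string
	for _, n := range nodes {
		if n.DriverReady {
			continue
		}
		notReady = append(notReady, n.Name)
		if _, ok := n.Annotations[maintenanceAnnotation]; ok {
			inMaintenance = append(inMaintenance, n.Name)
		}
	}
	progress := fmt.Sprintf("nvidia-driver-daemonset: %d/%d nodes ready",
		len(nodes)-len(notReady), len(nodes))

	switch {
	case len(notReady) == 0:
		// "Ready" is an assumed reason name for the healthy case.
		return Condition{Status: "True", Reason: "Ready", Message: progress}
	case len(inMaintenance) == len(notReady):
		// Every not-ready node is in maintenance: explain instead of alarming.
		subject, verb := "Node", "is"
		if len(inMaintenance) > 1 {
			subject, verb = "Nodes", "are"
		}
		msg := fmt.Sprintf("%s %s %s in scheduled Gardener maintenance, driver will recover automatically",
			subject, strings.Join(inMaintenance, ", "), verb)
		return Condition{Status: "Unknown", Reason: "NodeInMaintenance", Message: msg}
	default:
		// At least one not-ready node is NOT in maintenance: keep existing behaviour.
		return Condition{Status: "Unknown", Reason: "Progressing", Message: progress}
	}
}

func main() {
	nodes := []Node{{
		Name:        "gpu-node-1",
		Annotations: map[string]string{maintenanceAnnotation: "true"},
	}}
	c := driverReadyCondition(nodes)
	fmt.Println(c.Status, c.Reason, c.Message)
}
```

Whether a mix of maintenance and non-maintenance nodes should fall back to Progressing, as it does here, is a design choice the issue does not settle.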

Open question

Confirm the exact annotation or taint key that Gardener sets on nodes during maintenance. This needs verification against Gardener internals before implementation.
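
Until the key is confirmed, one option is to keep the detection logic independent of any specific key by passing the candidate keys in as configuration. The sketch below assumes that approach. The `MaintenanceKeys` type, its `Matches` method, the simplified `Node` and `Taint` structs, and the `example.invalid/...` keys are all hypothetical and not taken from Gardener.

```go
package main

import "fmt"

// Taint and Node are simplified stand-ins for the Kubernetes types (hypothetical).
type Taint struct {
	Key    string
	Effect string
}

type Node struct {
	Name        string
	Annotations map[string]string
	Taints      []Taint
}

// MaintenanceKeys lists the annotation and taint keys that mark a node as
// being in maintenance. The actual keys are unconfirmed, so they are
// supplied by the caller instead of being hard-coded.
type MaintenanceKeys struct {
	Annotations []string
	Taints      []string
}

// Matches reports whether the node carries any configured annotation or taint key.
func (k MaintenanceKeys) Matches(n Node) bool {
	for _, key := range k.Annotations {
		// Looking up a key in a nil map is safe in Go and yields ok == false.
		if _, ok := n.Annotations[key]; ok {
			return true
		}
	}
	for _, t := range n.Taints {
		for _, key := range k.Taints {
			if t.Key == key {
				return true
			}
		}
	}
	return false
}

func main() {
	// Placeholder keys for illustration only.
	keys := MaintenanceKeys{
		Annotations: []string{"example.invalid/maintenance"},
		Taints:      []string{"example.invalid/maintenance-taint"},
	}
	node := Node{
		Name:   "gpu-node-1",
		Taints: []Taint{{Key: "example.invalid/maintenance-taint", Effect: "NoSchedule"}},
	}
	fmt.Println(keys.Matches(node))
}
```

With this shape, switching to the verified key later is a configuration change rather than a code change.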

Acceptance criteria

  • GPU node in Gardener maintenance → DriverReady=Unknown, Reason=NodeInMaintenance with clear message
  • Maintenance completes, driver recovers → condition returns to DriverReady=True automatically
  • Non-maintenance driver failure → existing Reason=Progressing is unchanged and NodeInMaintenance is not reported (no false positives)
