GPU capacity and allocation in status #69

@vrdc-sap

To do after the time-slicing feature is added.

kubectl get gpu should show the total number of GPUs, how many are in use, and across which namespaces.

Before
NAME   READY   REASON   DRIVER VERSION   NODES READY   AGE
gpu    True    Ready    590              1             5m

After (with time-slicing, shipped together)
NAME   READY   REASON   DRIVER VERSION   NODES READY   TOTAL GPUs   ALLOCATED   AGE
gpu    True    Ready    590              3             12           8           5m

TOTAL GPUs
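Read from the node labels that NVIDIA's GPU feature discovery sets automatically once the driver stack is running: nvidia.com/gpu.count per node, summed across all GPU nodes. We don't set this, we only read it. One caveat for time-slicing: as far as the NVIDIA GPU Operator time-slicing documentation describes it, nvidia.com/gpu.count keeps reporting physical GPUs, while the sharing factor appears in nvidia.com/gpu.replicas and in the node's advertised nvidia.com/gpu capacity (count × replicas). For this column to reflect the sharing configuration, we would multiply count by replicas or read the node's allocatable nvidia.com/gpu directly; this needs verifying against the installed operator version. No user action is needed either way.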
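
A minimal sketch of the label summing described above, not the controller's actual code. It assumes node objects are plain dicts shaped like kubectl get nodes -o json items; the function name total_gpus and the use_replicas flag are hypothetical. Whether nvidia.com/gpu.count already includes time-slicing replicas should be checked on a real cluster; the flag only shows how nvidia.com/gpu.replicas could be folded in if it does not.

```python
from typing import Iterable, Mapping

GPU_COUNT_LABEL = "nvidia.com/gpu.count"
GPU_REPLICAS_LABEL = "nvidia.com/gpu.replicas"


def total_gpus(nodes: Iterable[Mapping], use_replicas: bool = False) -> int:
    """Sum the GPU count label over every node that carries it.

    use_replicas is a hypothetical switch: when True, each node's count
    is multiplied by its nvidia.com/gpu.replicas label (default 1).
    """
    total = 0
    for node in nodes:
        labels = node.get("metadata", {}).get("labels") or {}
        raw_count = labels.get(GPU_COUNT_LABEL)
        if raw_count is None:
            continue  # node has no GPU label, so it is not a GPU node
        try:
            count = int(raw_count)
        except ValueError:
            continue  # malformed label value; skip rather than fail
        replicas = 1
        if use_replicas:
            try:
                replicas = max(int(labels.get(GPU_REPLICAS_LABEL, "1")), 1)
            except ValueError:
                replicas = 1
        total += count * replicas
    return total
```

With three nodes labelled nvidia.com/gpu.count: "4", this returns 12, matching the TOTAL GPUs value in the example output.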

ALLOCATED
Computed by listing all running pods across all namespaces and summing their nvidia.com/gpu resource requests over every container. A pod requesting nvidia.com/gpu: 2 contributes 2 to the count. We don't set this either; we derive it from cluster state on every reconcile. It reflects current demand, not capacity.
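
The derivation above could look roughly like this. It is a sketch under stated assumptions: pods are plain dicts shaped like kubectl get pods -A -o json items, only pods in the Running phase and only regular containers (not init containers) are counted, and the name allocated_gpus is hypothetical. The per-namespace breakdown is included because the issue also asks which namespaces the GPUs are in use across.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple

GPU_RESOURCE = "nvidia.com/gpu"


def allocated_gpus(pods: Iterable[Mapping]) -> Tuple[int, Dict[str, int]]:
    """Return (total, per_namespace) GPU requests of Running pods."""
    per_namespace: Dict[str, int] = defaultdict(int)
    for pod in pods:
        if pod.get("status", {}).get("phase") != "Running":
            continue  # only running pods count toward ALLOCATED
        namespace = pod.get("metadata", {}).get("namespace", "default")
        for container in pod.get("spec", {}).get("containers", []):
            resources = container.get("resources") or {}
            requests = resources.get("requests") or {}
            limits = resources.get("limits") or {}
            # Fall back to limits in case a manifest sets only limits.
            raw = requests.get(GPU_RESOURCE, limits.get(GPU_RESOURCE, 0))
            per_namespace[namespace] += int(raw)
    # Drop namespaces that run pods but request no GPUs.
    used = {ns: n for ns, n in per_namespace.items() if n > 0}
    return sum(used.values()), used
```

Recomputing this from a fresh pod list on every reconcile keeps the value stateless: nothing has to be stored or corrected when pods are deleted.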
