Add xpumd health source support for GPU plugin (#2279)
The GPU plugin currently gets health data from the Level-Zero sidecar, which requires a privileged container. XPUMD 2.x exposes a local gRPC streaming API (WatchDeviceHealth) that provides equivalent health info without needing a privileged sidecar.
Signed-off-by: Tuomas Katila <tuomas.katila@intel.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
cmd/gpu_plugin/README.md: 18 additions & 0 deletions
@@ -19,6 +19,7 @@ Table of Contents
 * [CDI support](#cdi-support)
 * [KMD and UMD](#kmd-and-umd)
 * [Health management](#health-management)
+* [xpumd health source](#xpumd-health-source)
 * [by-path mounting](#by-path-mounting)
 * [Issues with media workloads on multi-GPU setups](#issues-with-media-workloads-on-multi-gpu-setups)
 * [Workaround for QSV and VA-API](#workaround-for-qsv-and-va-api)
@@ -59,6 +60,7 @@ For workloads on different KMDs, see [KMD and UMD](#kmd-and-umd).
 | -enable-monitoring | - | disabled | Enable '*_monitoring' resource that provides access to all Intel GPU devices on the node, [see use](./monitoring.md) |
 | -monitoring-mode | string | single | How monitoring resources are registered: single or split |
 | -health-management | - | disabled | Enable health management by requesting data from oneAPI/Level-Zero interface. Requires [GPU Level-Zero](../gpu_levelzero/) sidecar. See [health management](#health-management) |
+| -xpumd-endpoint | string | "" | Unix socket path for xpumd health service (e.g. `/run/xpumd/intelxpuinfo.sock`). When set, xpumd is used as the health data source instead of the Level-Zero sidecar. Cannot be combined with `-health-management`. Temperature limits are specified in xpumd service configuration, not with GPU plugin flags. See [xpumd health source](#xpumd-health-source) |
 | -wsl | - | disabled | Adapt plugin to run in the WSL environment. Requires [GPU Level-Zero](../gpu_levelzero/) sidecar. |
 | -shared-dev-num | int | 1 | Number of containers that can share the same GPU device |
 | -allow-ids | string | "" | A list of PCI Device IDs that are allowed to be registered as resources. Default is empty (=all registered). Cannot be used together with `deny-ids`. |
@@ -265,6 +267,22 @@ Kubernetes Device Plugin API allows passing device's healthiness to Kubelet. By
 
 Temperature limit can be provided via the command line argument, default is 100C.
 
+### xpumd health source
+
+As an alternative to the Level-Zero sidecar, the GPU plugin can obtain device health data from [Intel XPU Manager (xpumd)](https://github.com/intel/xpumanager) v2.x, which provides equivalent health information without requiring a privileged sidecar container.
+
+When xpumd is running on the host, it creates a Unix socket (default: `/run/xpumd/intelxpuinfo.sock`). The GPU plugin connects to this socket and streams health events. A device is reported as `Unhealthy` if any health domain reports severity `WARNING` or higher.
+
+To use xpumd as the health source, deploy the plugin with the provided Kustomize overlay:
+
+The overlay mounts `/run/xpumd` from the host (read-only) into the plugin pod and passes the required `-xpumd-endpoint` and `-enable-monitoring` flags automatically.
+
+> **Note**: The `-xpumd-endpoint` and (sidecar) `-health-management` flags are mutually exclusive. The sidecar-specific temperature limit flags (`-temp-limit`, `-gpu-temp-limit`, `-memory-temp-limit`) are not applicable when using `xpumd` as the health source.
+
 ### By-path mounting
 
 The DRM devices for the Intel GPUs register `by-path` symlinks under `/dev/dri/by-path`. For each GPU character device, there is a corresponding symlink in the by-path directory:
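The health rule stated in the xpumd health source section (a device is `Unhealthy` if any health domain reports severity `WARNING` or higher) can be illustrated with a small Go sketch. This is not the plugin's implementation: the `Severity` type, its levels other than `WARNING`, the domain names, and `isHealthy` are all hypothetical names chosen for illustration.

```go
package main

import "fmt"

// Severity is a hypothetical stand-in for the severity levels xpumd reports.
// Only WARNING is named in the README; the other levels are assumptions.
type Severity int

const (
	SeverityOK Severity = iota
	SeverityWarning
	SeverityCritical
)

// isHealthy returns false as soon as any health domain reports
// WARNING or a higher severity, mirroring the rule in the README.
func isHealthy(domains map[string]Severity) bool {
	for _, sev := range domains {
		if sev >= SeverityWarning {
			return false
		}
	}
	return true
}

func main() {
	// Domain names here are illustrative only.
	dev := map[string]Severity{
		"temperature": SeverityOK,
		"memory":      SeverityWarning,
	}
	fmt.Println(isHealthy(dev)) // false
}
```

Ordering the severity levels numerically keeps the "WARNING or higher" check to a single comparison, assuming the real levels are ordered the same way.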
 		return fmt.Errorf("failed to validate deny-ids: %w", err)
 	}
 
-	return true
+	switch opts.monitoringMode {
+	case monitoringModeSingle:
+	case monitoringModeSplit:
+	default:
+		return newArgError(fmt.Sprintf("invalid value for monitoring-mode, valid values: %s, %s",
+			monitoringModeSplit, monitoringModeSingle))
+	}
+
+	return nil
+}
+
+func checkArgs(opts cliOptions) error {
+	if err := checkBasics(opts); err != nil {
+		return fmt.Errorf("%w", err)
+	}
+
+	if opts.wslScan {
+		if opts.enableMonitoring {
+			return newArgError("monitoring is not supported within WSL.")
+		}
+
+		if opts.healthManagement || opts.xpumdEndpoint != "" {
+			return newArgError("health management is not supported within WSL.")
+		}
+	}
+
+	if opts.healthManagement && opts.xpumdEndpoint != "" {
+		return newArgError("cannot use both Level-Zero sidecar and xpumd for health management.")
+	}
+
+	if opts.xpumdEndpoint != "" {
+		if opts.globalTempLimit != defaultTempLimit ||
+			opts.gpuTempLimit != defaultTempLimit ||
+			opts.memoryTempLimit != defaultTempLimit {
+			return newArgError("temperature limits do not work with xpumd health source")
+		}
+	}
+
+	return nil
 }
 
 func main() {
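To make the flag-compatibility rules in `checkArgs` easier to try out in isolation, here is a self-contained sketch. It is a simplification, not the plugin's code: `options` stands in for the relevant fields of `cliOptions`, `errors.New` stands in for `newArgError`, the `checkBasics` step is omitted, and `defaultTempLimit` is assumed to be 100 based on the default the README states.

```go
package main

import (
	"errors"
	"fmt"
)

// defaultTempLimit is assumed to be 100, the default stated in the README.
const defaultTempLimit = 100

// options is a simplified stand-in for the cliOptions fields checkArgs reads.
type options struct {
	wslScan          bool
	enableMonitoring bool
	healthManagement bool
	xpumdEndpoint    string
	globalTempLimit  int
	gpuTempLimit     int
	memoryTempLimit  int
}

// defaults returns options with every temperature limit left at its default.
func defaults() options {
	return options{
		globalTempLimit: defaultTempLimit,
		gpuTempLimit:    defaultTempLimit,
		memoryTempLimit: defaultTempLimit,
	}
}

// validate applies the checks in the same order as the diff.
// errors.New stands in for the plugin's newArgError helper.
func validate(o options) error {
	healthRequested := o.healthManagement || o.xpumdEndpoint != ""

	if o.wslScan && o.enableMonitoring {
		return errors.New("monitoring is not supported within WSL")
	}
	if o.wslScan && healthRequested {
		return errors.New("health management is not supported within WSL")
	}
	if o.healthManagement && o.xpumdEndpoint != "" {
		return errors.New("cannot use both Level-Zero sidecar and xpumd for health management")
	}
	if o.xpumdEndpoint != "" &&
		(o.globalTempLimit != defaultTempLimit ||
			o.gpuTempLimit != defaultTempLimit ||
			o.memoryTempLimit != defaultTempLimit) {
		return errors.New("temperature limits do not work with xpumd health source")
	}
	return nil
}

func main() {
	o := defaults()
	o.xpumdEndpoint = "/run/xpumd/intelxpuinfo.sock"
	fmt.Println(validate(o)) // <nil>

	o.healthManagement = true
	fmt.Println(validate(o)) // cannot use both Level-Zero sidecar and xpumd for health management
}
```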
@@ -734,69 +843,57 @@ func main() {
 	flag.StringVar(&prefix, "prefix", "", "Prefix for devfs & sysfs paths")
 	flag.BoolVar(&opts.enableMonitoring, "enable-monitoring", false, "whether to enable monitoring (= all GPUs) resource(s). See also --monitoring-mode")
 	flag.StringVar(&opts.monitoringMode, "monitoring-mode", monitoringModeSingle, "monitoring resource mode when --enable-monitoring is set: single (combined gpu.intel.com/monitoring resource) or split (per-driver i915_monitoring/xe_monitoring resources)")
-	flag.BoolVar(&opts.healthManagement, "health-management", false, "enable GPU health management")
+	flag.BoolVar(&opts.healthManagement, "health-management", false, "enable Level-Zero sidecar based GPU health management")
+	flag.StringVar(&opts.xpumdEndpoint, "xpumd-endpoint", "", "enable xpumd based health management. Argument is unix socket path for the xpumd health service (e.g. /run/xpumd/intelxpuinfo.sock). When set, health data is retrieved from xpumd")
 	flag.StringVar(&opts.bypathMount, "bypath", bypathOptionSingle, "DRI device 'by-path/' directory mounting options: single, none, all. Default: single")
 	flag.BoolVar(&opts.wslScan, "wsl", false, "scan for / use WSL devices")
-	flag.IntVar(&opts.sharedDevNum, "shared-dev-num", 1, "number of containers sharing the same GPU device")
-	flag.IntVar(&opts.globalTempLimit, "temp-limit", 100, "Global temperature limit at which device is marked unhealthy")
-	flag.IntVar(&opts.gpuTempLimit, "gpu-temp-limit", 100, "GPU temperature limit at which device is marked unhealthy")
-	flag.IntVar(&opts.memoryTempLimit, "memory-temp-limit", 100, "Memory temperature limit at which device is marked unhealthy")
+	flag.IntVar(&opts.sharedDevNum, "shared-dev-num", 1, "number of containers sharing the same GPU device.")
+	flag.IntVar(&opts.globalTempLimit, "temp-limit", defaultTempLimit, "Global temperature limit at which device is marked unhealthy. Use with health-management.")
+	flag.IntVar(&opts.gpuTempLimit, "gpu-temp-limit", defaultTempLimit, "GPU temperature limit at which device is marked unhealthy. Use with health-management.")
+	flag.IntVar(&opts.memoryTempLimit, "memory-temp-limit", defaultTempLimit, "Memory temperature limit at which device is marked unhealthy. Use with health-management.")
 	flag.StringVar(&opts.preferredAllocationPolicy, "allocation-policy", "none", "modes of allocating GPU devices: balanced, packed and none")
 	flag.StringVar(&opts.allowIDs, "allow-ids", "", "comma-separated list of device IDs to allow (e.g. 0x49c5,0x49c6)")
 	flag.StringVar(&opts.denyIDs, "deny-ids", "", "comma-separated list of device IDs to deny (e.g. 0x49c5,0x49c6)")
 
 	flag.Parse()
 
-	if opts.sharedDevNum < 1 {
-		klog.Error("The number of containers sharing the same GPU must greater than zero")
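One consequence of comparing the temperature limits against `defaultTempLimit` is that an explicit `-temp-limit 100` cannot be told apart from an unset flag. If that distinction ever matters, the standard library's `flag.FlagSet.Visit` reports only the flags that were actually passed. The sketch below is an illustration of that alternative, not what the commit does; the `explicitlySet` helper and the FlagSet name are hypothetical, and only two of the plugin's flags are registered.

```go
package main

import (
	"flag"
	"fmt"
)

// explicitlySet parses args with a private FlagSet and returns the names
// of the flags that were actually present on the command line.
func explicitlySet(args []string) (map[string]bool, error) {
	fs := flag.NewFlagSet("gpu_plugin_sketch", flag.ContinueOnError)
	fs.String("xpumd-endpoint", "", "unix socket path for the xpumd health service")
	fs.Int("temp-limit", 100, "global temperature limit")

	if err := fs.Parse(args); err != nil {
		return nil, err
	}

	set := map[string]bool{}
	// Visit calls the function only for flags that have been set.
	fs.Visit(func(f *flag.Flag) { set[f.Name] = true })

	return set, nil
}

func main() {
	set, err := explicitlySet([]string{
		"-xpumd-endpoint", "/run/xpumd/intelxpuinfo.sock",
		"-temp-limit", "100",
	})
	if err != nil {
		fmt.Println(err)
		return
	}
	// Both flags were passed, even though temp-limit equals its default.
	fmt.Println(set["xpumd-endpoint"] && set["temp-limit"]) // true
}
```

The default-comparison approach in the diff is simpler and needs no extra state, at the cost of silently accepting a temperature flag that happens to equal the default.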