
Skip unhealthy GPUs #123

@zhangzhiqiang123456

Description


Apr 28 15:08:10 master-02 containerd[3862899]: time="2026-04-28T15:08:10.865224875+08:00" level=error msg="Failed to pipe stdout of container "8c9ea3e0970fab7df3893a1fca87318bf892d3a8cc85eaf33807f335533b8bb6"" error="reading from a closed fifo"
Apr 28 15:08:10 master-02 containerd[3862899]: time="2026-04-28T15:08:10.865314872+08:00" level=error msg="Failed to pipe stderr of container "8c9ea3e0970fab7df3893a1fca87318bf892d3a8cc85eaf33807f335533b8bb6"" error="reading from a closed fifo"
Apr 28 15:08:10 master-02 containerd[3862899]: time="2026-04-28T15:08:10.867655874+08:00" level=error msg="StartContainer for "8c9ea3e0970fab7df3893a1fca87318bf892d3a8cc85eaf33807f335533b8bb6" failed" error="failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: detection error: nvml error: unknown error: unknown"
Apr 28 15:08:11 master-02 containerd[3862899]: time="2026-04-28T15:08:11.377688987+08:00" level=info msg="RemoveContainer for "2e7179e792b6a19ff32f9b6dca8ef8ad7c75f6b96246df89c96d61cc7a663dd0""
Apr 28 15:08:11 master-02 containerd[3862899]: time="2026-04-28T15:08:11.382539511+08:00" level=info msg="RemoveContainer for "2e7179e792b6a19ff32f9b6dca8ef8ad7c75f6b96246df89c96d61cc7a663dd0" returns successfully"

root@master-02:/etc/containerd# nvidia-smi
Unable to determine the device handle for GPU4: 0000:98:00.0: Unknown Error
Tue Apr 28 15:13:21 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 Off | 00000000:16:00.0 Off | N/A |
| 0% 26C P8 12W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 Off | 00000000:36:00.0 Off | N/A |
| 0% 28C P8 2W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 5090 Off | 00000000:46:00.0 Off | N/A |
| 0% 27C P8 9W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 5090 Off | 00000000:56:00.0 Off | N/A |
| 0% 26C P8 11W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA GeForce RTX 5090 Off | 00000000:B8:00.0 Off | N/A |
| 0% 27C P8 11W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA GeForce RTX 5090 Off | 00000000:C8:00.0 Off | N/A |
| 0% 26C P8 7W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA GeForce RTX 5090 Off | 00000000:D8:00.0 Off | N/A |
| 0% 26C P8 3W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
root@master-02:/etc/containerd#

[Tue Apr 14 21:58:17 2026] Hardware name: Stone Technology SuperSvr G5208 PCIE5/G5208 PCIE5, BIOS EG32.29.03 08/11/2025
[Tue Apr 14 22:01:57 2026] Hardware name: Stone Technology SuperSvr G5208 PCIE5/G5208 PCIE5, BIOS EG32.29.03 08/11/2025
[Tue Apr 14 22:10:29 2026] Hardware name: Stone Technology SuperSvr G5208 PCIE5/G5208 PCIE5, BIOS EG32.29.03 08/11/2025
[Tue Apr 28 07:02:29 2026] NVRM: GPU at PCI:0000:98:00: GPU-622ea5ee-3562-b92a-fb69-e9ea7cf1937e
[Tue Apr 28 07:02:29 2026] NVRM: Xid (PCI:0000:98:00): 79, GPU has fallen off the bus.
[Tue Apr 28 07:02:29 2026] NVRM: GPU 0000:98:00.0: GPU has fallen off the bus.
[Tue Apr 28 07:02:29 2026] NVRM: GPU 0000:98:00.0: GPU serial number is 0.
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[Tue Apr 28 07:02:29 2026] NVRM: Xid (PCI:0000:98:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)

The errors above prevent volcano-device-plugin-d8kfw from starting. Could volcano-device-plugin be made more robust, so that a failure on some of the GPUs does not stop the plugin from serving the remaining healthy ones?
