4月 28 15:08:10 master-02 containerd[3862899]: time="2026-04-28T15:08:10.865224875+08:00" level=error msg="Failed to pipe stdout of container "8c9ea3e0970fab7df3893a1fca87318bf892d3a8cc85eaf33807f335533b8bb6"" error="reading from a closed fifo"
4月 28 15:08:10 master-02 containerd[3862899]: time="2026-04-28T15:08:10.865314872+08:00" level=error msg="Failed to pipe stderr of container "8c9ea3e0970fab7df3893a1fca87318bf892d3a8cc85eaf33807f335533b8bb6"" error="reading from a closed fifo"
4月 28 15:08:10 master-02 containerd[3862899]: time="2026-04-28T15:08:10.867655874+08:00" level=error msg="StartContainer for "8c9ea3e0970fab7df3893a1fca87318bf892d3a8cc85eaf33807f335533b8bb6" failed" error="failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: detection error: nvml error: unknown error: unknown"
4月 28 15:08:11 master-02 containerd[3862899]: time="2026-04-28T15:08:11.377688987+08:00" level=info msg="RemoveContainer for "2e7179e792b6a19ff32f9b6dca8ef8ad7c75f6b96246df89c96d61cc7a663dd0""
4月 28 15:08:11 master-02 containerd[3862899]: time="2026-04-28T15:08:11.382539511+08:00" level=info msg="RemoveContainer for "2e7179e792b6a19ff32f9b6dca8ef8ad7c75f6b96246df89c96d61cc7a663dd0" returns successfully"
root@master-02:/etc/containerd# nvidia-smi
Unable to determine the device handle for GPU4: 0000:98:00.0: Unknown Error
Tue Apr 28 15:13:21 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 Off | 00000000:16:00.0 Off | N/A |
| 0% 26C P8 12W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 Off | 00000000:36:00.0 Off | N/A |
| 0% 28C P8 2W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 5090 Off | 00000000:46:00.0 Off | N/A |
| 0% 27C P8 9W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 5090 Off | 00000000:56:00.0 Off | N/A |
| 0% 26C P8 11W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA GeForce RTX 5090 Off | 00000000:B8:00.0 Off | N/A |
| 0% 27C P8 11W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA GeForce RTX 5090 Off | 00000000:C8:00.0 Off | N/A |
| 0% 26C P8 7W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA GeForce RTX 5090 Off | 00000000:D8:00.0 Off | N/A |
| 0% 26C P8 3W / 575W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
root@master-02:/etc/containerd#
[二 4月 14 21:58:17 2026] Hardware name: Stone Technology SuperSvr G5208 PCIE5/G5208 PCIE5, BIOS EG32.29.03 08/11/2025
[二 4月 14 22:01:57 2026] Hardware name: Stone Technology SuperSvr G5208 PCIE5/G5208 PCIE5, BIOS EG32.29.03 08/11/2025
[二 4月 14 22:10:29 2026] Hardware name: Stone Technology SuperSvr G5208 PCIE5/G5208 PCIE5, BIOS EG32.29.03 08/11/2025
[二 4月 28 07:02:29 2026] NVRM: GPU at PCI:0000:98:00: GPU-622ea5ee-3562-b92a-fb69-e9ea7cf1937e
[二 4月 28 07:02:29 2026] NVRM: Xid (PCI:0000:98:00): 79, GPU has fallen off the bus.
[二 4月 28 07:02:29 2026] NVRM: GPU 0000:98:00.0: GPU has fallen off the bus.
[二 4月 28 07:02:29 2026] NVRM: GPU 0000:98:00.0: GPU serial number is 0.
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[二 4月 28 07:02:29 2026] NVRM: Xid (PCI:0000:98:00): 154, GPU recovery action changed from 0x0 (None) to 0x1 (GPU Reset Required)
The errors above prevent volcano-device-plugin-d8kfw from starting. Could volcano-device-plugin be made more robust, so that a fault on some of the GPUs does not stop the plugin from working?
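The requested skip-unhealthy-device behavior could be sketched roughly as below. This is a hypothetical illustration, not the actual volcano-device-plugin code: `enumerate_healthy_devices` and `fake_probe` are invented names, and the probe stands in for a per-index NVML handle lookup (in the real plugin this would be `nvmlDeviceGetHandleByIndex` via the Go NVML bindings, with NVML error codes matched instead of a blanket exception).

```python
# Hypothetical sketch: probe each GPU index individually and collect
# failures per device, instead of aborting the whole device plugin on
# the first NVML "unknown error" (as happens when one GPU has fallen
# off the bus, like GPU 4 at 0000:98:00.0 in the logs above).

def enumerate_healthy_devices(device_count, probe):
    """Return (healthy, unhealthy) lists of device indices.

    probe(i) should return a device handle or raise on failure,
    mirroring nvmlDeviceGetHandleByIndex semantics.
    """
    healthy, unhealthy = [], []
    for i in range(device_count):
        try:
            probe(i)            # per-device probe; failure affects only this index
            healthy.append(i)
        except Exception:       # a real plugin would match specific NVML error codes
            unhealthy.append(i)
    return healthy, unhealthy


def fake_probe(i):
    """Simulated probe: index 4 raises, matching the fallen-off GPU."""
    if i == 4:
        raise RuntimeError("Unknown Error")
    return f"handle-{i}"


healthy, unhealthy = enumerate_healthy_devices(8, fake_probe)
print(healthy)    # [0, 1, 2, 3, 5, 6, 7]
print(unhealthy)  # [4]
```

The healthy indices would then be advertised to the kubelet as allocatable, while the unhealthy ones could be reported as `Unhealthy` through the device plugin's ListAndWatch stream rather than crashing the pod at startup.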