I'm using PyTorch 2.8 + CUDA 12.8 on H200s. Multi-GPU training works on the physical cards, but fails as soon as I switch to vGPU (HAMi-core). The full error log is below, followed by a minimal sketch of the launch pattern.
Detected 2 GPUs
[HAMI-core Msg(114:139755231597184:libvgpu.c:839)]: Initializing.....
[HAMI-core Msg(113:140578808201856:libvgpu.c:839)]: Initializing.....
W0105 17:19:48.131000 85 site-packages/torch/multiprocessing/spawn.py:174] Terminating process 96 via signal SIGTERM
Traceback (most recent call last):
File "/root/code/gpu/train.py", line 466, in
main()
File "/root/code/gpu/train.py", line 365, in main
mp.spawn(
File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 364, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 320, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 201, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
[HAMI-core Msg(113:140578808201856:utils.c:32)]: unified_lock locked, waiting 1 second...
[HAMI-core Msg(87:139677377852032:utils.c:32)]: unified_lock locked, waiting 1 second...
[HAMI-core Msg(86:140498350031488:utils.c:32)]: unified_lock locked, waiting 1 second...
[HAMI-core Msg(85:139909703299712:multiprocess_memory_limit.c:455)]: Calling exit handler 85
[HAMI-core Msg(114:139755231597184:libvgpu.c:855)]: Initialized
W0105 17:19:48.737000 17 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 86 closing signal SIGTERM
W0105 17:19:48.739000 17 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 87 closing signal SIGTERM
W0105 17:19:48.740000 17 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 88 closing signal SIGTERM
E0105 17:19:48.806000 17 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 85) of binary: /root/.conda/envs/pytorch29/bin/python3.12
Traceback (most recent call last):
File "/root/.conda/envs/pytorch29/bin/torchrun", line 7, in
sys.exit(main())
^^^^^^
File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/distributed/run.py", line 936, in main
run(args)
File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/distributed/run.py", line 927, in run
elastic_launch(
File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 156, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2026-01-05_17:19:48
host : research-task-dongbiao.hu-ef600a-task-0
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 85)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[HAMI-core Msg(17:140159803941504:multiprocess_memory_limit.c:455)]: Calling exit handler 17
[HAMI-core Warn(17:140159803941504:multiprocess_memory_limit.c:475)]: Failed to take lock on exit: errno=110
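For reference, here is a minimal sketch of the launch pattern the traceback points at: torchrun (the elastic agent, pid 17 in the log) starts the train.py workers, and main() then calls mp.spawn, so the process that dies with SIGSEGV is a spawned child of an elastic worker. This is not the actual train.py; the worker function, the dummy model, and the environment defaults are assumptions for illustration only. The @record decorator is included only because the error summary above shows error_file: <N/A>; with it, the worker-side traceback is written out instead of being lost.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.elastic.multiprocessing.errors import record
from torch.nn.parallel import DistributedDataParallel as DDP


def run_worker(rank: int, world_size: int) -> None:
    # One process per visible (v)GPU; NCCL is the usual backend on H200.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Tiny DDP step: backward() triggers an NCCL all-reduce across ranks.
    model = DDP(torch.nn.Linear(16, 16).cuda(rank), device_ids=[rank])
    loss = model(torch.randn(8, 16, device=rank)).sum()
    loss.backward()

    dist.destroy_process_group()


@record  # writes the worker traceback to the elastic error file instead of <N/A>
def main() -> None:
    world_size = torch.cuda.device_count()  # the log reports 2 GPUs detected
    print(f"Detected {world_size} GPUs")
    # The SIGSEGV in the log is raised from exactly this call path:
    # spawn -> start_processes -> context.join().
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    main()
```

If a reduced version like this also segfaults under the HAMi vGPU allocation but runs fine on the physical H200s, the crash is most likely in the CUDA/NCCL process-group setup that libvgpu.so intercepts, not in the training code itself.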