H200 cannot do multi-GPU training #97

@atriaops

Description


I'm using PyTorch 2.8 + CUDA 12.8 on H200. Multi-GPU training works on the physical cards, but it fails as soon as I use vGPU.
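For context, here is a minimal sketch of the kind of spawn-based DDP entrypoint the traceback below points at. The actual train.py is not attached to this issue, so every name, size, and hyperparameter in this sketch is illustrative, not the reporter's code; the failing log follows after it.

```python
# Hypothetical minimal reproducer for mp.spawn + DDP on N local GPUs.
# All model/optimizer details are placeholders, not the issue's train.py.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank: int, world_size: int) -> None:
    # One process per GPU; NCCL is the usual backend for CUDA training.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DDP(torch.nn.Linear(16, 16).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):
        opt.zero_grad()
        out = model(torch.randn(8, 16, device=rank))
        out.sum().backward()  # triggers the NCCL all-reduce across ranks
        opt.step()

    dist.destroy_process_group()


def main() -> None:
    world_size = torch.cuda.device_count()
    print(f"Detected {world_size} GPUs")
    mp.spawn(worker, args=(world_size,), nprocs=world_size)


if __name__ == "__main__":
    main()
```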
Detected 2 GPUs
[HAMI-core Msg(114:139755231597184:libvgpu.c:839)]: Initializing.....
[HAMI-core Msg(113:140578808201856:libvgpu.c:839)]: Initializing.....
W0105 17:19:48.131000 85 site-packages/torch/multiprocessing/spawn.py:174] Terminating process 96 via signal SIGTERM
Traceback (most recent call last):
  File "/root/code/gpu/train.py", line 466, in <module>
    main()
  File "/root/code/gpu/train.py", line 365, in main
    mp.spawn(
  File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 364, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 320, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 201, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
[HAMI-core Msg(113:140578808201856:utils.c:32)]: unified_lock locked, waiting 1 second...
[HAMI-core Msg(87:139677377852032:utils.c:32)]: unified_lock locked, waiting 1 second...
[HAMI-core Msg(86:140498350031488:utils.c:32)]: unified_lock locked, waiting 1 second...
[HAMI-core Msg(85:139909703299712:multiprocess_memory_limit.c:455)]: Calling exit handler 85
[HAMI-core Msg(114:139755231597184:libvgpu.c:855)]: Initialized
W0105 17:19:48.737000 17 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 86 closing signal SIGTERM
W0105 17:19:48.739000 17 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 87 closing signal SIGTERM
W0105 17:19:48.740000 17 site-packages/torch/distributed/elastic/multiprocessing/api.py:908] Sending process 88 closing signal SIGTERM
E0105 17:19:48.806000 17 site-packages/torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 0 (pid: 85) of binary: /root/.conda/envs/pytorch29/bin/python3.12
Traceback (most recent call last):
  File "/root/.conda/envs/pytorch29/bin/torchrun", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 156, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.conda/envs/pytorch29/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 293, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2026-01-05_17:19:48
host : research-task-dongbiao.hu-ef600a-task-0
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 85)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

[HAMI-core Msg(17:140159803941504:multiprocess_memory_limit.c:455)]: Calling exit handler 17
[HAMI-core Warn(17:140159803941504:multiprocess_memory_limit.c:475)]: Failed to take lock on exit: errno=110
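Note that error_file is reported as <N/A> above. Per the elastic errors page linked in the log, wrapping the launched entrypoint with the `record` decorator makes a failing rank write its Python traceback to an error file, which would show where the SIGSEGV-adjacent failure originates. A minimal sketch, assuming train.py defines a main() entrypoint:

```python
# Sketch only: `record` is the documented helper from the page linked in the
# log above; the main() body here is a placeholder for the real entrypoint.
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main() -> None:
    # training code launched via `torchrun train.py` goes here
    pass


if __name__ == "__main__":
    main()
```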
