Example train_ddp.py breaks #295

@kasakun

Hi, I was following the guide in the README to run torchft locally.

# start lighthouse
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

# start a replica in another shell
export REPLICA_GROUP_ID=0
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29600 --nnodes=1 --nproc_per_node=1 -- train_ddp.py

# start another replica
export REPLICA_GROUP_ID=1
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29601 --nnodes=1 --nproc_per_node=1 -- train_ddp.py

After I ran

export REPLICA_GROUP_ID=0
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29600 --nnodes=1 --nproc_per_node=1 -- train_ddp.py

It failed immediately, and I saw:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-02_07:33:17
  host      : xxxxxxxxxxxxxxxxxxxx
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 47674)
  error_file: /mnt/tmp/torchelastic_4m_9eon3/none_ywq7poit/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/miniforge/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
      return f(*args, **kwargs)
    File "/mnt/task_runtime/train_ddp.py", line 192, in main
      loss.backward()
    File "/miniforge/lib/python3.10/site-packages/torch/_tensor.py", line 625, in backward
      torch.autograd.backward(
    File "/miniforge/lib/python3.10/site-packages/torch/autograd/__init__.py", line 354, in backward
      _engine_run_backward(
    File "/miniforge/lib/python3.10/site-packages/torch/autograd/graph.py", line 841, in _engine_run_backward
      return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    File "/mnt/task_runtime/torchft/ddp.py", line 78, in _comm_hook
      assert fut._fut
  AttributeError: 'Future' object has no attribute '_fut'
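
One detail in the traceback worth noting: the failing line is `assert fut._fut`, yet the error raised is an `AttributeError`, not an `AssertionError`. That suggests the object reaching `_comm_hook` has no `_fut` attribute at all, rather than having a `_fut` that is falsy. The sketch below illustrates the distinction with hypothetical stand-in classes (`WrappedFuture`, `PlainFuture`, and `hook_check` are made up for illustration and are not the torchft or PyTorch types); which concrete `Future` type is actually passed in depends on the torchft commit and PyTorch version, which this report alone does not settle.

```python
# Hypothetical stand-ins; these are NOT the real torchft / torch classes.
class WrappedFuture:
    """A future-like wrapper that keeps an inner future in `_fut`."""

    def __init__(self, inner):
        self._fut = inner


class PlainFuture:
    """A future-like object that has no `_fut` attribute."""


def hook_check(fut):
    """Mimic the failing line `assert fut._fut` and report what it raises."""
    try:
        assert fut._fut
        return "ok"
    except AssertionError:
        return "AssertionError"  # `_fut` exists but is falsy
    except AttributeError:
        return "AttributeError"  # `_fut` does not exist on the object


results = [
    hook_check(WrappedFuture(object())),  # inner future present
    hook_check(WrappedFuture(None)),      # attribute present, value falsy
    hook_check(PlainFuture()),            # attribute missing entirely
]
print(results)  # ['ok', 'AssertionError', 'AttributeError']
```

Only the third case matches the error reported here, which is consistent with the hook receiving a different `Future` type than the one it expects.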

This seems to happen only on the latest main: 024f850
After I reset it to 8ef24c0, I no longer see this issue.

Some extra info if it helps

python -c "import torch; print(torch.__version__)"
2.9.1+cu128
