
Multinode and multi-GPU training gets stuck #3

@OriAlpha

Description


I have been trying to run the example chess_finetune.py with FSDP. I have multiple nodes, each with multiple GPUs. Here is my configuration:

#SBATCH --job-name=chess_finetune  # create a short name for your job
#SBATCH --output=chess_finetune.out      # file to write stdout
#SBATCH --nodes=2                  # node count
#SBATCH --cpus-per-task=4          # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --gres=gpu:2               # number of gpus per node
#SBATCH --time=01:00:00            # total run time limit (HH:MM:SS)
#SBATCH --ntasks=2                 # total number of tasks across all nodes
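For reference, a job script like the one above usually launches one `torchrun` per node, with every node pointing at the same rendezvous endpoint on the first node. This is a minimal sketch, not the poster's actual launch line; the port and the choice of the first node as rendezvous host are assumptions:

```shell
# Hypothetical launch step for the sbatch script above:
# use the first allocated node as the rendezvous host.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# One torchrun per task (ntasks=2, i.e. one per node), 2 GPUs each.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=2 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:29500" \
  chess_finetune.py
```

If the launch command does not match the SLURM geometry (e.g. `--nnodes` or `--nproc_per_node` disagreeing with `--nodes`/`--gres`), the rendezvous waits forever for workers that never arrive, which looks exactly like a silent hang.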

Training seems to start but gets stuck without any updates. Is there any way to debug this and check for bugs?
Logs:

[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] 
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] *****************************************
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] *****************************************
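A common cause of a silent hang right after this warning is that worker nodes cannot reach the rendezvous endpoint (`MASTER_ADDR:MASTER_PORT`), often due to a firewall between nodes. Setting `NCCL_DEBUG=INFO` and `TORCH_DISTRIBUTED_DEBUG=DETAIL` in the environment makes torch.distributed far more verbose. As a quick first check, a stdlib-only sketch to run on each worker node (the host and port here are placeholders, not values from the original post):

```python
import socket

def endpoint_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run this on every node with the same MASTER_ADDR/MASTER_PORT the trainer uses,
# e.g.: endpoint_reachable("master-node-hostname", 29500)
```

If this returns False on any node while the rank-0 process is up, the rendezvous can never complete and training will appear to start but make no progress.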
