
Multinode and multi-GPU training gets stuck #3

@OriAlpha

Description


I have been trying to run the example chess_finetune.py with FSDP. I have multiple nodes, each with multiple GPUs. Here is my configuration:

#SBATCH --job-name=chess_finetune  # create a short name for your job
#SBATCH --output=chess_finetune.out      # file to write stdout
#SBATCH --nodes=2                  # node count
#SBATCH --cpus-per-task=4          # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --gres=gpu:2               # number of gpus per node
#SBATCH --time=01:00:00            # total run time limit (HH:MM:SS)
#SBATCH --ntasks=2                 # total number of tasks across all nodes
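For reference, a job script like the one above usually launches one `torchrun` per node, with every node pointing at the same rendezvous endpoint on the first node. This is a minimal sketch, not the poster's actual launch line; the port and the choice of the first node as rendezvous host are assumptions:

```shell
# Hypothetical launch step for the sbatch script above:
# use the first allocated node as the rendezvous host.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# One torchrun per task (ntasks=2, i.e. one per node), 2 GPUs each.
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=2 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:29500" \
  chess_finetune.py
```

If the launch command does not match the SLURM geometry (e.g. `--nnodes` or `--nproc_per_node` disagreeing with `--nodes`/`--gres`), the rendezvous waits forever for workers that never arrive, which looks exactly like a silent hang.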

Training seems to start but gets stuck without any updates. Is there any way to debug this and check for bugs?
Logs:

[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] 
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] *****************************************
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] *****************************************
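A common cause of a silent hang right after this warning is that worker nodes cannot reach the rendezvous endpoint (`MASTER_ADDR:MASTER_PORT`), often due to a firewall between nodes. Setting `NCCL_DEBUG=INFO` and `TORCH_DISTRIBUTED_DEBUG=DETAIL` in the environment makes torch.distributed far more verbose. As a quick first check, a stdlib-only sketch to run on each worker node (the host and port here are placeholders, not values from the original post):

```python
import socket

def endpoint_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run this on every node with the same MASTER_ADDR/MASTER_PORT the trainer uses,
# e.g.: endpoint_reachable("master-node-hostname", 29500)
```

If this returns False on any node while the rank-0 process is up, the rendezvous can never complete and training will appear to start but make no progress.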
