I have been trying to run the example chess_finetune.py with FSDP. I have multiple nodes, each with multiple GPUs. Here is my configuration:
#SBATCH --job-name=chess_finetune # create a short name for your job
#SBATCH --output=chess_finetune.out # file to write stdout
#SBATCH --nodes=2 # node count
#SBATCH --cpus-per-task=4 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --gres=gpu:2 # number of gpus per node
#SBATCH --time=01:00:00 # total run time limit (HH:MM:SS)
#SBATCH --ntasks=2 # total number of tasks across all nodes
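For completeness, the launch line at the end of my batch script looks roughly like this (simplified; the rendezvous port is arbitrary and the script path is a placeholder):

head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
srun torchrun --nnodes=2 --nproc_per_node=2 \
    --rdzv_id="$SLURM_JOB_ID" --rdzv_backend=c10d \
    --rdzv_endpoint="$head_node:29500" \
    chess_finetune.py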
Training seems to start but then gets stuck without any further output. Is there a way to debug this and check for bugs?
Logs:
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING]
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] *****************************************
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-04-11 15:18:25,356] torch.distributed.run: [WARNING] *****************************************
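Nothing appears after this warning. I'm guessing I could turn on more verbose logging before the launch line, something like the following (assuming the standard NCCL / torch.distributed debug variables, which may or may not be the right approach here):

export NCCL_DEBUG=INFO                 # print NCCL init/transport details per rank
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra consistency checks in torch.distributed collectives
export TORCH_CPP_LOG_LEVEL=INFO        # surface c10d rendezvous/store log messages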