This folder contains an end-to-end example of training a UNet model using the DALI library with the Zarr data format. The code for this example is parallelized using PyTorch DDP (Distributed Data Parallel) and can be run on multiple GPUs or nodes.
To run on 1 GPU, use the following commands:
module load conda
conda activate gpuhackathon
./train_unet.py
To run on single node multi-GPU, use the following command:
torchrun --nnodes=1 --nproc-per-node=4 train_unet.py --distributed
To run with nsys:
module purge
module load ncarenv/23.09
module reset
module load cuda
Check that the output of the following command matches /glade/u/apps/common/23.08/spack/opt/spack/cuda/12.2.1/bin/nsys:
which nsys
Then run with nsys profiling:
nsys profile -t nvtx,cuda,osrt --gpu-metrics-device all --force-overwrite=true --output=training_benchmark python train_unet.py
To run without data loading:
export synthetic="True"; nsys profile -t nvtx,cuda,osrt --gpu-metrics-device all --force-overwrite=true --output=training_benchmark python train_unet.py
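The command above switches off data loading through the `synthetic` environment variable. How `train_unet.py` actually reads that variable is not shown here, so the following is only a minimal sketch of one way a script could interpret it. The helper name `use_synthetic_data` and the set of accepted truthy strings are assumptions made for illustration.

```python
import os

# Strings treated as "true" (an assumption; the real script may only accept "True").
_TRUE_VALUES = {"1", "true", "yes", "on"}


def use_synthetic_data(env=None):
    """Return True when the `synthetic` environment variable requests synthetic data.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    value = env.get("synthetic", "False")
    return value.strip().lower() in _TRUE_VALUES


if __name__ == "__main__":
    source = "synthetic tensors" if use_synthetic_data() else "the Zarr data loader"
    print(f"Training batches would come from {source}")
```

Parsing the value case-insensitively means `export synthetic="True"` and `export synthetic=true` behave the same, while leaving the variable unset keeps the normal data loading path.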
To run on multi-node multi-GPU, use the following command (full example here):
MASTER_ADDR=$head_node_ip MASTER_PORT=1234 mpiexec -np 8 ./train_unet.py --distributed
In the above command, replace $head_node_ip with the IP address of the head node.
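When the script is started by `mpiexec` rather than `torchrun`, the rank and world size arrive under launcher-specific environment variables. The sketch below shows one way a script could normalize them before initializing DDP. It is not taken from `train_unet.py`: the helper name `resolve_dist_env` is hypothetical, and the variable names checked (`RANK`/`WORLD_SIZE`/`LOCAL_RANK` from `torchrun`, `OMPI_COMM_WORLD_*` from Open MPI, `PMI_RANK`/`PMI_SIZE` from PMI-based launchers) are assumptions about which launcher is in use on your system.

```python
import os

# Candidate variable names, checked in order (assumed launchers: torchrun, Open MPI, PMI).
_RANK_VARS = ("RANK", "OMPI_COMM_WORLD_RANK", "PMI_RANK")
_SIZE_VARS = ("WORLD_SIZE", "OMPI_COMM_WORLD_SIZE", "PMI_SIZE")
_LOCAL_RANK_VARS = ("LOCAL_RANK", "OMPI_COMM_WORLD_LOCAL_RANK")


def _first_int(env, names, default):
    """Return the first variable in `names` that is set, converted to int."""
    for name in names:
        if name in env:
            return int(env[name])
    return default


def resolve_dist_env(env=None):
    """Collect rank, world size, and rendezvous address into one dict.

    Falls back to single-process defaults when no launcher variables are set.
    """
    env = os.environ if env is None else env
    return {
        "rank": _first_int(env, _RANK_VARS, 0),
        "world_size": _first_int(env, _SIZE_VARS, 1),
        "local_rank": _first_int(env, _LOCAL_RANK_VARS, 0),
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
    }


if __name__ == "__main__":
    print(resolve_dist_env())
```

The resulting values could then be passed to `torch.distributed.init_process_group`. The head node address itself still has to come from the job scheduler.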
For example, using PBS:
# Determine the number of nodes:
nnodes=$(< $PBS_NODEFILE wc -l)
nodes=( $( cat $PBS_NODEFILE ) )
head_node=${nodes[0]}
head_node_ip=$(ssh $head_node hostname -i | awk '{print $1}')
Average throughput on 1 GPU (A100-40GB): 92.34 samples/sec, ~1000 seconds per epoch (~35 minutes)
Average throughput on 4 GPUs (A100-40GB): 800.5 samples/sec, 500 seconds per epoch (16 minutes)