@@ -126,7 +126,6 @@ python main_multi_gpu.py \
126126</details >
127127
128128## Training
129- ### Training with single GPU
130129To train the CycleMLP model on ImageNet2012 on a single GPU, run the following script from the command line:
131130``` shell
132131sh run_train.sh
@@ -141,9 +140,11 @@ python main_single_gpu.py \
141140 -data_path=' /dataset/imagenet' \
142141```
143142
144- ### Training with multi-GPU
145- Run training using multi-GPUs:
143+ <details >
146144
145+ <summary >
146+ Run training on multiple GPUs:
147+ </summary >
147148
148149
149150``` shell
@@ -159,55 +160,9 @@ python main_multi_gpu.py \
159160 -data_path=' /dataset/imagenet' \
160161```
161162
162-
163-
164- ### Training with multi-node
165- PaddleViT also supports multi-node distributed training in collective mode.
166-
167- Suppose you have 2 hosts (referred to as nodes) with 4 GPUs each.
168- The nodes' IP addresses are `192.168.0.16` and `192.168.0.17`.
169-
170- Then modify the following lines of `run_train_multi_node.sh`:
171- ``` shell
172- CUDA_VISIBLE_DEVICES=0,1,2,3 # IDs of the GPUs to use on this node
173-
174- -ips= '192.168.0.16, 192.168.0.17' # separated by commas
175- ```
176- Run the training script on every node:
177- ``` shell
178- sh run_train_multi_node.sh
179- ```
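
As a minimal sketch of the two values edited in the multi-node script, the snippet below builds the comma-separated node list and counts the GPU IDs assigned to a node. `join_ips` and `count_gpus` are hypothetical helpers, not part of PaddleViT, and the IP addresses are the example ones used in this section.

```shell
# Sketch only: join_ips and count_gpus are hypothetical helpers, not PaddleViT code.

# Join node IPs with commas. The body runs in a subshell so IFS is not changed globally.
join_ips() (
  IFS=','
  echo "$*"
)

# Count the device IDs in a CUDA_VISIBLE_DEVICES-style string such as "0,1,2,3".
count_gpus() (
  IFS=','
  set -- $1
  echo "$#"
)

devices=0,1,2,3
ips=$(join_ips 192.168.0.16 192.168.0.17)
gpus_per_node=$(count_gpus "$devices")

echo "ips=${ips}"
echo "gpus_per_node=${gpus_per_node}"
```

Generating the list this way avoids stray spaces between addresses, which can matter if the value is passed unquoted.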
180-
181- <details >
182- <summary >It is possible to train with multi-node even when you have only one machine</summary >
183-
184- 1. Install docker and paddle. For more details, please refer to
185- [here](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/docker/fromdocker.html).
186-
187- 2. Create a network between docker containers.
188- ``` shell
189- docker network create -d bridge paddle_net
190- ```
191- 3. Create multiple containers as virtual hosts/nodes. For example, create 2 containers
192- with 2 GPUs each.
193- ``` shell
194- docker run --name paddle0 -it -d --gpus '"device=0,1"' --network paddle_net \
195- paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
196- docker run --name paddle1 -it -d --gpus '"device=2,3"' --network paddle_net \
197- paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
198- ```
199- > Note:
200- > 1. The same GPU can be assigned to different containers, but this may cause out-of-memory (OOM) errors because multiple models will run on the same GPU.
201- > 2. Use `-v` to mount the PaddleViT repository into the container.
202-
203- 4. Modify `run_train_multi_node.sh` as described above and run the training script in every container.
204-
205- > Note: Use `ping` or `ip a` to check the containers' IP addresses.
206-
207163</details >
208164
209165
210-
211166## Visualization Attention Map
212167** (coming soon)**
213168