Commit 901f3ed

Merge pull request #60 from priyankakinij/main
Update enroot/pyxis doc with GPU partitioning content.
2 parents 4bb47eb + 13bc244 commit 901f3ed

1 file changed

Lines changed: 44 additions & 0 deletions

docs/container-runtime/enroot-pyxis-installation.md

@@ -187,3 +187,47 @@ Device Node IDs Temp Power Partitions SCLK

================================================== End of ROCm SMI Log ===================================================
```

## Enroot and Pyxis with GPU partitioning

With Enroot and Pyxis, partitioned GPUs can be used just like unpartitioned GPUs. For this to work, Slurm must first recognize the partitioned GPUs as generic resources (GRES).

The following configuration changes are required:

1. Add the line below to `/etc/slurm/gres.conf` so that Slurm automatically detects the number of GRES resources whenever the GPUs are partitioned.

`AutoDetect=rsmi`

Example `gres.conf` file:

```bash
AutoDetect=rsmi
Name=gpu File=/dev/dri/renderD128
Name=gpu File=/dev/dri/renderD136
Name=gpu File=/dev/dri/renderD144
Name=gpu File=/dev/dri/renderD152
Name=gpu File=/dev/dri/renderD160
Name=gpu File=/dev/dri/renderD168
Name=gpu File=/dev/dri/renderD176
Name=gpu File=/dev/dri/renderD184
```

2. If GRES is specified in the node definition in `/etc/slurm/slurm.conf`, make sure it gives the correct number of GPUs for that node.

For example:

```bash
NodeName=localhost CPUs=160 Boards=1 SocketsPerBoard=2 CoresPerSocket=80 ThreadsPerCore=1 RealMemory=1285717 Gres=gpu:8
```

`Gres=gpu:8` can also be omitted if the partitioning changes frequently.

3. Restart Slurm on both the head node and the worker node.

Head node:

```bash
sudo service slurmctld restart && sudo service slurmd restart
```

Worker node:

```bash
sudo service slurmd restart
```
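
The edits from steps 1 and 2 can be sanity-checked before restarting. The following is a minimal Python sketch, not part of Slurm or Pyxis: `has_rsmi_autodetect` and `declared_gpu_count` are hypothetical helper names, and the config text is passed in as strings so the example stays self-contained. It only checks that `AutoDetect=rsmi` is present and reads the GPU count from a `Gres=` field, assuming the `gpu:<count>` or `gpu:<type>:<count>` form.

```python
import re


def has_rsmi_autodetect(gres_conf_text):
    """Return True if an uncommented AutoDetect=rsmi line is present."""
    for raw in gres_conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if line.lower() == "autodetect=rsmi":
            return True
    return False


def declared_gpu_count(node_line):
    """Return the GPU count from a NodeName line's Gres= field.

    Returns None when Gres= is omitted or carries no explicit gpu count.
    """
    match = re.search(r"\bGres=(\S+)", node_line, flags=re.IGNORECASE)
    if match is None:
        return None
    for entry in match.group(1).split(","):
        parts = entry.split(":")
        if parts[0].lower() == "gpu" and parts[-1].isdigit():
            return int(parts[-1])
    return None


gres_conf = "AutoDetect=rsmi\nName=gpu File=/dev/dri/renderD128\n"
node_line = "NodeName=localhost CPUs=160 Boards=1 Gres=gpu:8"

print(has_rsmi_autodetect(gres_conf))
print(declared_gpu_count(node_line))
```

If `declared_gpu_count` returns a number that differs from the number of GPU partitions on the node, the `Gres=` value in `slurm.conf` likely needs updating or removing.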

Pyxis can now use all the partitioned GPUs as resources and allocate them as requested.

```bash
root@node2:~# srun --gres=gpu:62 --container-image=./rocm+pytorch+latest.sqsh --pty bash
root@node2:/var/lib/jenkins# python3
Python 3.12.10 | packaged by conda-forge | (main, Apr 10 2025, 22:21:13) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
62
>>> exit()
```
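
When the requested partition count varies between jobs, the `srun` invocation above can be assembled programmatically. This is a small sketch under stated assumptions: `build_srun_command` is a hypothetical helper, and it uses only the flags that appear in the session above (`--gres`, `--container-image`, `--pty`).

```python
import shlex


def build_srun_command(gpu_count, image, shell="bash"):
    """Build an srun argument list requesting `gpu_count` GPUs in a container."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return [
        "srun",
        f"--gres=gpu:{gpu_count}",
        f"--container-image={image}",
        "--pty",
        shell,
    ]


command = build_srun_command(62, "./rocm+pytorch+latest.sqsh")
# shlex.join quotes arguments where needed, so the result is safe to paste into a shell.
print(shlex.join(command))
```

Returning a list rather than a string keeps the arguments separate, so the same value can be passed directly to `subprocess.run` without shell quoting concerns.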