Commit bec3363

Merge pull request #824 from AaltoSciComp/pytorch-arm
PyTorch container example for ARM GPUs
2 parents 80eaf0c + 89a3252 commit bec3363

1 file changed

Lines changed: 92 additions & 0 deletions

triton/usage/gracehopper.rst

@@ -76,3 +76,95 @@ The hello-world.cu could be e.g.::
       return 0;
   }

Building a simple PyTorch environment
-------------------------------------

Nvidia provides ARM containers for PyTorch, which you can use as a starting point for your own containers.
This example shows how you can extend such a container by installing additional packages with pip.

A new PyTorch container is built each month; you can browse the selection
in the `Nvidia PyTorch catalog <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`__.
The following container definition file selects the February 2026 Nvidia PyTorch container with PyTorch
version 2.11 as a starting point:

.. code-block:: none

   Bootstrap: docker
   From: nvcr.io/nvidia/pytorch:26.02-py3

   %post
       pip install transformers==4.57.6 pyyaml==6.0.1

   %help
       An Apptainer image based on Nvidia's PyTorch container with ARM CPU architecture.

       The bootstrapped container runs on Ubuntu 24.04, and contains CUDA version 13.1,
       OpenMPI 4.1.7, Python 3.12 and PyTorch 2.11.

       This image extends the bootstrapped PyTorch container with the transformers package
       from HuggingFace.

       PyYAML is a package required by transformers that has already been installed by the
       operating system package manager in the container. Pip will try to update PyYAML
       (to 6.0.3 at the time this image was created), which will fail the build because
       pip cannot change packages installed by the system package manager.
       Thus, PyYAML has to be pinned to the version used by the system package manager.


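One way to find the version to pin is to ask pip inside the base container which PyYAML it already has.
The following is a minimal sketch, not part of the original example: ``pip_show_to_pin`` is a hypothetical
helper name, and it assumes the ``Name:`` and ``Version:`` fields that ``pip show`` prints.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not part of the original guide):
# read `pip show <package>` output on stdin and print "name==version".
pip_show_to_pin() {
    awk -F': ' '
        $1 == "Name"    { name = $2 }
        $1 == "Version" { version = $2 }
        END { if (name != "" && version != "") print name "==" version }
    '
}

# Example usage on an ARM node (this pulls the base image, which can take a while):
#   apptainer exec docker://nvcr.io/nvidia/pytorch:26.02-py3 pip show pyyaml | pip_show_to_pin
```

The printed pin can then be copied into the ``pip install`` line of the definition file.
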
You can add other packages you need to the ``pip install`` command in the container definition above.
We also recommend documenting your container in the ``%help`` section
(which packages you have added and why).
Save your container definition to a file (here we will use ``pytorch-transformers-arm.def``) and
track it in version control so that it is backed up and the build stays reproducible.

Next, you need to build the container image (SIF file) from the definition file.
The build has to run on an ARM device, which you can achieve, for example, with
the following sbatch script ``build-pytorch-transformers-arm-container.sh``:

.. code-block:: slurm

   #!/bin/bash
   #SBATCH --job-name=build-arm-container
   #SBATCH --partition=gpu-grace-h200-141g
   #SBATCH --cpus-per-task=4
   #SBATCH --gpus=1
   #SBATCH --time=01:00:00
   #SBATCH --mem=128G

   # You can replace $WRKDIR with $PWD to create the cache in your current working dir
   mkdir -p "$WRKDIR/apptainer_cache"
   export APPTAINER_CACHEDIR="$WRKDIR/apptainer_cache"

   apptainer build pytorch-transformers-arm.sif pytorch-transformers-arm.def

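Because the build has to run on ARM hardware, it can help to fail early if the job lands on the wrong
architecture. The guard below is a sketch under assumptions: ``is_arm64`` is a hypothetical helper name,
and it relies on Linux reporting 64-bit ARM as ``aarch64`` in ``uname -m``.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not part of the original guide):
# succeed only if the architecture string names 64-bit ARM.
# Without an argument, the architecture is taken from `uname -m`.
is_arm64() {
    local arch="${1:-$(uname -m)}"
    [ "$arch" = "aarch64" ] || [ "$arch" = "arm64" ]
}

# Example usage in the build script, before the `apptainer build` line:
#   is_arm64 || { echo "Not an ARM node: $(uname -m)" >&2; exit 1; }
```
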
After you have successfully built your container, you can start using it in your scripts.
Here is a simple example of how to run an imaginary Python training script with two arguments
using the container:

.. code-block:: sh

   #!/bin/bash
   #SBATCH --job-name=train-script
   #SBATCH --partition=gpu-grace-h200-141g
   #SBATCH --cpus-per-task=8
   #SBATCH --gpus=1
   #SBATCH --time=04:00:00
   #SBATCH --mem=256G

   # The --nv argument makes the GPU available within the container
   apptainer exec --nv pytorch-transformers-arm.sif \
       python train_script.py \
       --arg1 foo \
       --arg2 bar

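Before submitting a long training job, a quick check that PyTorch sees the GPU inside the container can
save queue time. This is a minimal sketch: ``check_container_gpu`` is a hypothetical helper name, and it
only wraps the ``apptainer exec --nv`` call used above around a Python one-liner.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not part of the original guide):
# print the PyTorch version and whether CUDA is available inside the given SIF image.
check_container_gpu() {
    local image="$1"
    apptainer exec --nv "$image" \
        python -c 'import torch; print(torch.__version__, torch.cuda.is_available())'
}

# Example usage inside a short GPU job:
#   check_container_gpu pytorch-transformers-arm.sif
```

If the second printed value is ``False``, the usual suspects are a missing ``--nv`` flag or a job that was
not allocated a GPU.
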
You simply need to prepend the ``apptainer exec`` command to the calls to your scripts.
For a more comprehensive tutorial on Apptainer, please see
`the third lesson <https://coderefinery.github.io/hpc-containers/>`__ of our
`Tuesday Tools & Techniques for HPC (TTT4HPC) course <../../training/scip/ttt4hpc-2024.rst>`__.
Keep in mind that the lesson assumes x86 architecture instead of ARM,
so adjust its examples accordingly.
In other words, be sure to select an ARM container as your starting point,
and run the build script on ARM hardware as shown above.
If you want or need help setting up your ARM containers,
you can always join the `SciComp garage <../../help/garage.rst>`__.
