        return 0;
    }

Building a simple PyTorch environment
-------------------------------------

Nvidia provides ARM containers for PyTorch, which you can use as a starting point for your own containers.
This example shows how to extend such a container by installing additional packages with pip.

Nvidia publishes a new PyTorch container each month. You can browse the available versions
in the `Nvidia PyTorch catalog <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch>`__.
The following container definition file uses the February 2026 container (tag ``26.02-py3``),
which includes PyTorch 2.11, as its starting point:

.. code-block:: none

   Bootstrap: docker
   From: nvcr.io/nvidia/pytorch:26.02-py3

   %post
       pip install transformers==4.57.6 pyyaml==6.0.1

   %help
       An Apptainer image based on Nvidia's PyTorch container for the ARM CPU architecture.

       The bootstrapped container runs Ubuntu 24.04 and contains CUDA 13.1,
       OpenMPI 4.1.7, Python 3.12 and PyTorch 2.11.

       This image extends the bootstrapped PyTorch container with the transformers
       package from Hugging Face.

       PyYAML, a package required by transformers, has already been installed in the
       container by the operating system package manager. Pip will try to update PyYAML
       (to 6.0.3 at the time this image was created), which fails the build because
       pip cannot change packages installed by the system package manager.
       Thus, PyYAML has to be pinned to the version installed by the system package manager.

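Before pinning a package such as PyYAML, you may want to check which version the base container already ships. The sketch below wraps this check in a small helper. The function name ``image_pkg_version`` is hypothetical, and the helper assumes that ``python -m pip`` works inside the image. Run it on an ARM node, like the build itself.

.. code-block:: bash

   # Hypothetical helper (not part of Apptainer): print the installed version
   # of a Python package inside a container image.
   image_pkg_version() {
       local image="$1" pkg="$2"
       # "pip show" prints lines such as "Version: 6.0.1"; keep only the number
       apptainer exec "$image" python -m pip show "$pkg" | awk '/^Version:/ {print $2}'
   }

For example, ``image_pkg_version docker://nvcr.io/nvidia/pytorch:26.02-py3 pyyaml`` queries the base image straight from the registry. ``image_pkg_version pytorch-transformers-arm.sif pyyaml`` checks an image you have already built.
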
You can add any other packages you need to the ``pip install`` command in the container definition above.
We also recommend documenting your container in the ``%help`` section
(which packages you have added and why).
Save your container definition to a file (here we use ``pytorch-transformers-arm.def``) and
track it with version control, so that it is backed up and your build stays reproducible.

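Apptainer stores both the ``%help`` text and the full definition file inside the built SIF image, so this documentation travels with the image. The following sketch prints both using the ``apptainer run-help`` and ``apptainer inspect --deffile`` commands. The helper name ``show_image_docs`` is hypothetical.

.. code-block:: bash

   # Hypothetical helper: print the documentation stored in a SIF image.
   show_image_docs() {
       local image="$1"
       # Prints the %help section of the definition file
       apptainer run-help "$image"
       # Prints the definition file the image was built from
       apptainer inspect --deffile "$image"
   }

Once the image has been built as described below, ``show_image_docs pytorch-transformers-arm.sif`` lets you check what it contains without looking up the original definition file.
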
Next, you need to build the container image (SIF file) from the definition file.
Building needs to happen on an ARM machine. One way to achieve this is to save the following
script as ``build-pytorch-transformers-arm-container.sh`` and submit it with
``sbatch build-pytorch-transformers-arm-container.sh`` to a partition with ARM (Grace) CPUs:

.. code-block:: slurm

   #!/bin/bash
   #SBATCH --job-name=build-arm-container
   #SBATCH --partition=gpu-grace-h200-141g
   #SBATCH --cpus-per-task=4
   #SBATCH --gpus=1
   #SBATCH --time=01:00:00
   #SBATCH --mem=128G

   # You can replace $WRKDIR with $PWD to create the cache in your current working directory
   mkdir -p "$WRKDIR/apptainer_cache"
   export APPTAINER_CACHEDIR="$WRKDIR/apptainer_cache"

   apptainer build pytorch-transformers-arm.sif pytorch-transformers-arm.def

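If you reuse the build script, two optional additions can make it more robust. This is a sketch, not a required part of the workflow. The hypothetical ``require_arm`` function fails unless ``uname -m`` reports ``aarch64`` (64-bit ARM). The cache directory falls back to the current directory when ``$WRKDIR`` is not set.

.. code-block:: bash

   # Hypothetical guard: fail unless this host is 64-bit ARM (aarch64).
   require_arm() {
       local arch
       arch="$(uname -m)"
       if [ "$arch" != "aarch64" ]; then
           echo "This build must run on ARM (aarch64), but this host is ${arch}." >&2
           return 1
       fi
   }

   # Use $WRKDIR if it is set, otherwise fall back to the current directory.
   export APPTAINER_CACHEDIR="${WRKDIR:-$PWD}/apptainer_cache"
   mkdir -p "$APPTAINER_CACHEDIR"

In the script, you would then call ``require_arm || exit 1`` before the ``apptainer build`` line. A job accidentally submitted to a non-ARM partition then stops immediately instead of producing an image for the wrong architecture.
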
After you have built your container, you can start using it in your scripts.
Here is a simple example of how to run a hypothetical Python training script with two arguments
using the container:

.. code-block:: slurm

   #!/bin/bash
   #SBATCH --job-name=train-script
   #SBATCH --partition=gpu-grace-h200-141g
   #SBATCH --cpus-per-task=8
   #SBATCH --gpus=1
   #SBATCH --time=04:00:00
   #SBATCH --mem=256G

   # The --nv argument makes the GPU available within the container
   apptainer exec --nv pytorch-transformers-arm.sif \
       python train_script.py \
       --arg1 foo \
       --arg2 bar

In other words, you only need to prefix the commands that run your scripts with ``apptainer exec --nv pytorch-transformers-arm.sif``.
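
If you run several commands in the same job, a small shell function can save repetition. In this sketch, ``in_container`` and ``SIF_IMAGE`` are hypothetical names chosen for illustration:

.. code-block:: bash

   # Hypothetical default image; override by setting SIF_IMAGE beforehand.
   SIF_IMAGE="${SIF_IMAGE:-pytorch-transformers-arm.sif}"

   # Run any command inside the container, with --nv to expose the GPUs.
   in_container() {
       apptainer exec --nv "$SIF_IMAGE" "$@"
   }

For example, ``in_container python -c 'import torch; print(torch.cuda.is_available())'`` should print ``True`` when the job has a GPU allocated. This is a quick way to confirm that ``--nv`` works before starting a long training run.
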
For a more comprehensive tutorial on Apptainer, please see
`the third lesson <https://coderefinery.github.io/hpc-containers/>`__ of our
`Tuesday Tools & Techniques for HPC (TTT4HPC) course <../../training/scip/ttt4hpc-2024.rst>`__.
Keep in mind that the lesson assumes the x86 architecture rather than ARM, so adapt its examples:
select an ARM container as your starting point, and build it on ARM hardware as shown above.
If you want or need help setting up your ARM containers,
you can always join `SciComp garage <../../help/garage.rst>`__.
