
Commit e7e8ce5

Merge pull request #415 from Modalities/apptainer
Apptainer Setup
2 parents dba03a6 + 5bec93c commit e7e8ce5

3 files changed: 156 additions & 1 deletion

File tree:

* README.md
* container/modalities.def
* container/slurm_singularity.sbatch

README.md

Lines changed: 53 additions & 1 deletion
@@ -30,7 +30,7 @@ For training and evaluation of a model, feel free to checkout [this](https://git
## Installation

-There are two ways to install Modalities. If you want to use the latest nightly version, or if you want to modify the code base itself, we recommend installing Modalities directly from source.
+There are multiple ways to install Modalities. If you want to use the latest nightly version, or if you want to modify the code base itself, we recommend installing Modalities directly from source.

If you want to use Modalities as a library and register your custom components with Modalities, you can install it directly via pip which provides you with the latest stable version.

@@ -88,6 +88,58 @@ uv pip install -e .[tests,linting]
pre-commit install --install-hooks
```

### Option 4: Containerized Setup via Singularity / Apptainer

If you prefer an isolated, reproducible environment or you are deploying to an HPC center that already supports Apptainer / Singularity, you can build and run Modalities using the provided `modalities.def` file in the `container` folder.

Note: Commands shown with `singularity` work the same with `apptainer`; simply substitute the command name (e.g. `apptainer build ...`, `apptainer exec ...`, `apptainer test ...`). If both are installed, choose one and use it consistently.

#### 1. Build the image

Use `--fakeroot` if you don't have root but your system enables user namespaces; otherwise omit it.

```sh
singularity build modalities.sif modalities.def             # standard build
# or (if allowed / required on your system)
singularity build --fakeroot modalities.sif modalities.def
```

This will:
* Pull the base image `nvcr.io/nvidia/nemo:25.09`.
* Install nightly PyTorch (per the definition file) and flash-attention.
* Clone and install `modalities` inside the container.
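
Once the build has finished, the image metadata can be sanity-checked with the standard `inspect` subcommand:

```sh
singularity inspect modalities.sif   # prints the image's labels and build metadata
```
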
#### 2. Run the built-in smoke test

The `%test` section defined in `modalities.def` is executed with:

```sh
singularity test modalities.sif
```

Expected output contains lines similar to:

```
Torch import OK
Modalities import OK
```

If this step fails, the container is not usable yet; inspect the earlier build logs.
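
When debugging a failed build, it helps to keep the full build output around; a minimal sketch (the `tee` target `build.log` is just an arbitrary file name):

```sh
singularity build --fakeroot modalities.sif modalities.def 2>&1 | tee build.log
```
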
#### 3. Launch training inside the container

```sh
singularity exec --nv modalities.sif bash -lc '\
cd /opt/repos/modalities && \
torchrun --nnodes 1 --nproc_per_node 1 --rdzv-endpoint=0.0.0.0:29503 \
    src/modalities/__main__.py run \
    --config_file_path config_files/training/config_lorem_ipsum_long_fsdp2_pp_tp.yaml --test_comm'
```
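
Before a longer run, a quick way to confirm that the container actually sees the GPUs (a minimal check, not part of the repository):

```sh
singularity exec --nv modalities.sif \
    python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```
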
To iterate on local code without rebuilding the image, bind-mount your checkout (the host repo then overrides the cloned one inside the container), e.g. as shown below:
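
```sh
singularity exec --nv --bind $PWD:/opt/repos/modalities modalities.sif bash
```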

For multi-node training with Slurm, see the example sbatch file `container/slurm_singularity.sbatch`.
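
After filling in the `#SBATCH` placeholders (account, partition, log paths) and the host paths at the top of that script, the job is submitted as usual:

```sh
sbatch container/slurm_singularity.sbatch
```
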
## Usage

Modalities provides several entry points to interact with the framework. The following section lists the available entry points and their respective functionalities.

container/modalities.def

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@

Bootstrap: docker
From: nvcr.io/nvidia/nemo:25.09

%environment
    export PYTHONNOUSERSITE=1

%post
    set -e
    # Working dirs plus bind-mount target directories (the latter are site-specific, e.g. FZJ filesystems)
    mkdir -p /opt/repos /opt/modalities/config_files/training /e /p /etc/FZJ
    python -m pip install --upgrade pip || true

    # Remove the preinstalled PyTorch and install the PyTorch nightly build
    rm -rf /usr/local/lib/python3.12/dist-packages/torch* /usr/local/lib/python3.12/dist-packages/pytorch_triton* || true
    python -m pip install --pre --no-cache-dir --index-url https://download.pytorch.org/whl/nightly/cu129 torch torchvision

    # Clone repos (network required at build time)
    git clone --depth 1 https://github.com/Dao-AILab/flash-attention.git /opt/repos/flash-attention
    git clone --branch main --depth 1 https://github.com/Modalities/modalities.git /opt/repos/modalities

    # Install flash-attention
    cd /opt/repos/flash-attention
    MAX_JOBS=4 python setup.py install

    # Install modalities
    cd /opt/repos/modalities
    pip install -e .

%test
    python - <<'EOF'
import torch
print("Torch import OK")
import modalities
print("Modalities import OK")
EOF
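
After building, one can verify that the nightly PyTorch installed in `%post` is the copy inside the image; the printed version string should contain a `dev` tag:

```sh
singularity exec modalities.sif python -c "import torch; print(torch.__version__)"
```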

container/slurm_singularity.sbatch

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@

#!/bin/bash
# SLURM SUBMIT SCRIPT
#SBATCH --exclusive
#SBATCH --account=your_account
#SBATCH --partition=your_partition
#SBATCH --qos=your_qos
#SBATCH --job-name=modalities
#SBATCH --output=/path/to/logs/log_%j.out
#SBATCH --error=/path/to/logs/log_%j.err
#SBATCH --time=00:10:00
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --nodes=32
#SBATCH --gres=gpu:4
#SBATCH --mem=0

# Paths (set these to real host locations before submitting):
SINGULARITY_IMAGE=./modalities.sif             # Singularity image file.
CONTAINER_HOME=/path/to/container/home/on/host # Acts as $HOME inside the container (-H).
MODALITIES_DIR=/path/to/modalities/on/host     # Host repo path bind-mounted into the container.

#### Environment variables ####
export CXX=g++
export CC=gcc

# NCCL/UCX settings
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_IB_TIMEOUT=50
export UCX_RC_TIMEOUT=4s
export NCCL_SOCKET_IFNAME=ib0
export GLOO_SOCKET_IFNAME=ib0
export NCCL_IB_RETRY_CNT=10
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_DEBUG=INFO
export NCCL_ASYNC_ERROR_HANDLING=1

# Enable logging
set -x -e
echo "START TIME: $(date)"

##### Network parameters #####
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000

# Launch the job via srun; Singularity provides GPUs with --nv.
# Additional bind mounts (-B) can be added as needed, e.g. to access
# datasets or scratch space on the host system.
# Note: \$SLURM_NODEID is escaped so that it is expanded per task inside
# the container, not once at submission time.
srun singularity exec --nv \
    -H $CONTAINER_HOME \
    -B /dev/infiniband:/dev/infiniband \
    -B $MODALITIES_DIR:/opt/modalities \
    "$SINGULARITY_IMAGE" bash -lc "
    cd /opt/modalities
    torchrun \
        --node_rank=\$SLURM_NODEID \
        --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
        --rdzv_id test_pp \
        --nnodes $SLURM_JOB_NUM_NODES \
        --nproc_per_node 4 \
        --rdzv_backend c10d \
        src/modalities/__main__.py run \
        --config_file_path config_files/training/config_lorem_ipsum_long_fsdp2.yaml \
        --test_comm
    "

echo "END TIME: $(date)"
echo "=== FINISHED ==="
