Commit ce10816

Refresh workload and registry examples
1 parent 9db02ec commit ce10816

11 files changed

Lines changed: 43 additions & 38 deletions

config.example/group_vars/all.yml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -316,5 +316,5 @@ standalone_container_registry_port: "5000"
316316
# Configuration for NGC-Ready playbook #
317317
################################################################################
318318
ngc_ready_cuda_container: "nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04"
319-
ngc_ready_pytorch: "nvcr.io/nvidia/pytorch:24.04-py3"
320-
ngc_ready_tensorflow: "nvcr.io/nvidia/tensorflow:24.04-tf2-py3"
319+
ngc_ready_pytorch: "nvcr.io/nvidia/pytorch:26.04-py3"
320+
ngc_ready_tensorflow: "nvcr.io/nvidia/tensorflow:25.02-tf2-py3"
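
Note on the defaults above: in an air-gapped deployment these three variables are overridden to point at a local registry instead of `nvcr.io`. The sketch below shows one way to derive the mirrored references from the defaults. It is an illustration only: the helper `to_mirror` and the registry host `registry.example.com:5000` are hypothetical and are not part of this commit or of DeepOps.

```python
def to_mirror(image_ref: str, mirror: str, upstream: str = "nvcr.io") -> str:
    """Swap the upstream registry host for a mirror host, keeping repo and tag."""
    prefix = upstream + "/"
    if not image_ref.startswith(prefix):
        return image_ref  # not hosted on the upstream registry; leave untouched
    return mirror.rstrip("/") + "/" + image_ref[len(prefix):]


# Default values as set in this commit.
defaults = {
    "ngc_ready_cuda_container": "nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04",
    "ngc_ready_pytorch": "nvcr.io/nvidia/pytorch:26.04-py3",
    "ngc_ready_tensorflow": "nvcr.io/nvidia/tensorflow:25.02-tf2-py3",
}

mirror = "registry.example.com:5000"  # hypothetical local registry (assumption)
for name, ref in defaults.items():
    # Emit lines in the same `key: "value"` shape used by group_vars/all.yml.
    print(f'{name}: "{to_mirror(ref, mirror)}"')
```

The printed lines can be pasted into a site-specific variables file once the images have actually been pushed to the mirror; the sketch does not check that they exist there.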

docs/airgap/ngc-ready.md

Lines changed: 6 additions & 6 deletions
@@ -37,8 +37,8 @@ For instructions on setting up an HTTP mirror, see the [doc on HTTP mirrors](./m
3737
Container images are only needed if you want to run the tests built into the playbook:
3838

3939
- nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
40-
- nvcr.io/nvidia/pytorch:24.04-py3
41-
- nvcr.io/nvidia/tensorflow:24.04-tf2-py3
40+
- nvcr.io/nvidia/pytorch:26.04-py3
41+
- nvcr.io/nvidia/tensorflow:25.02-tf2-py3
4242

4343
For instructions on setting up a Docker registry mirror, see the [doc on Docker mirrors](./mirror-docker-images.md).
4444

@@ -62,8 +62,8 @@ For instructions on setting up an HTTP mirror, see the [doc on HTTP mirrors](./m
6262
Container images (how to mirror) are only needed if you want to run the tests built into the playbook:
6363

6464
- nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
65-
- nvcr.io/nvidia/pytorch:24.04-py3
66-
- nvcr.io/nvidia/tensorflow:24.04-tf2-py3
65+
- nvcr.io/nvidia/pytorch:26.04-py3
66+
- nvcr.io/nvidia/tensorflow:25.02-tf2-py3
6767

6868
For instructions on setting up a Docker registry mirror, see the [doc on Docker mirrors](./mirror-docker-images.md).
6969

@@ -177,8 +177,8 @@ If running the container tests as part of the NGC-Ready playbook, set the follow
177177

178178
```bash
179179
ngc_ready_cuda_container: "<your-container-registry>/nvidia/cuda:12.4.1-base-ubuntu22.04"
180-
ngc_ready_pytorch: "<your-container-registry>/nvidia/pytorch:24.04-py3"
181-
ngc_ready_tensorflow: "<your-container-registry>/nvidia/tensorflow:24.04-tf2-py3"
180+
ngc_ready_pytorch: "<your-container-registry>/nvidia/pytorch:26.04-py3"
181+
ngc_ready_tensorflow: "<your-container-registry>/nvidia/tensorflow:25.02-tf2-py3"
182182
```
183183

184184
## Running the NGC-Ready playbook

docs/container/nginx-docker-cache.md

Lines changed: 2 additions & 2 deletions
@@ -42,8 +42,8 @@ The following variables are the most common configuration you may want to adjust
4242

4343
| Variable | Default value | Description |
4444
| ------------------------------------------ | ---------------------------------------- | ----------------------------------------------------------------------------- |
45-
| `nginx_docker_cache_image` | `"rpardini/docker-registry-proxy:0.6.1"` | Container image used to deploy the proxy |
46-
| `nginx_docker_cache_registry_string` | `"quay.io k8s.gcr.io gcr.io nvcr.io"` | Space-separated list of registries to proxy |
45+
| `nginx_docker_cache_image` | `"rpardini/docker-registry-proxy:0.6.5"` | Container image used to deploy the proxy |
46+
| `nginx_docker_cache_registry_string` | `"registry.k8s.io quay.io k8s.gcr.io gcr.io nvcr.io"` | Space-separated list of registries to proxy; `k8s.gcr.io` is retained for older clusters while current Kubernetes images use `registry.k8s.io` |
4747
| `nginx_docker_cache_manifests` | `"false"` | Flag to determine whether to cache image manifests |
4848
| `nginx_docker_cache_manifest_default_time` | "1h" | If manifests are cached, time to cache them |
4949
| `nginx_docker_cache_hostgroup` | `"cache"` | Ansible inventory host group where proxy is deployed |
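
Note on `nginx_docker_cache_registry_string`: because the value is a space-separated list of registry hosts, it is easy to miss a host when new images are introduced. The sketch below reports image references whose registry host is absent from the string. The helper names and the `ghcr.io` image are hypothetical, and the host-detection rule (first path component containing a dot or a port, or equal to `localhost`) is an assumption based on the usual Docker reference convention, not on the proxy's own logic.

```python
def registry_host(image_ref: str) -> str:
    """Return the registry host of an image reference, defaulting to docker.io."""
    first, sep, _ = image_ref.partition("/")
    # The first path component is treated as a registry host only when it
    # looks like one: it contains a dot or a port, or it is "localhost".
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def uncovered_images(registry_string: str, images: list[str]) -> list[str]:
    """List images whose registry host is missing from the space-separated string."""
    proxied = set(registry_string.split())
    return [img for img in images if registry_host(img) not in proxied]


# Value set by this commit.
registry_string = "registry.k8s.io quay.io k8s.gcr.io gcr.io nvcr.io"
images = [
    "nvcr.io/nvidia/pytorch:26.04-py3",
    "ghcr.io/example/tool:1.0",  # hypothetical image on a host not in the list
]
print(uncovered_images(registry_string, images))
```

With the values above, only the `ghcr.io` reference is reported. Whether Docker Hub images need an explicit entry depends on how the proxy is configured, so treat any `docker.io` result as a prompt to check rather than as an error.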

docs/k8s-cluster/kubernetes-usage.md

Lines changed: 5 additions & 5 deletions
@@ -10,7 +10,7 @@ Kubernetes Usage Guide
1010

1111
## Introduction
1212

13-
Most of the following examples can be configured and executed through the Kubernetes Dashboard. For a basic run-through on how to leverage the Kubernetes Dashboard, please see the [official documentation](https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/). The following examples `kubectl` on the master node instead.
13+
Most of the following examples can be configured and executed through the Kubernetes Dashboard. For a basic run-through on how to leverage the Kubernetes Dashboard, please see the [official documentation](https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/). The following examples use `kubectl` on the master node instead.
1414

1515
## Simple Commands
1616

@@ -63,12 +63,12 @@ kubectl get pods --all-namespaces
6363
4. Delete the job (and the corresponding pod).
6464

6565
```bash
66-
kubectl delete job cuda-job
66+
kubectl delete job pytorch-job
6767
```
6868

6969
## Using NGC Containers with Kubernetes and Launching Jobs
7070

71-
[NVIDIA GPU Cloud (NGC)](https://docs.nvidia.com/ngc/ngc-introduction) manages a catalog of fully integrated and optimized DL framework containers that take full advantage of NVIDIA GPUs in both single and multi-GPU configurations. They include NVIDIA CUDA® Toolkit, DIGITS workflow, and the following DL frameworks: NVCaffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, PyTorch, TensorFlow, Theano, and Torch. These framework containers are delivered ready-to-run, including all necessary dependencies such as the CUDA runtime and NVIDIA libraries.
71+
[NVIDIA GPU Cloud (NGC)](https://docs.nvidia.com/ngc/ngc-introduction) manages a catalog of optimized GPU containers for CUDA, PyTorch, TensorFlow, Triton Inference Server, RAPIDS, and other NVIDIA software. Use the NGC catalog and the NVIDIA framework container release notes to choose the current image for your workload.
7272

7373
To access the NGC container registry via Kubernetes, add a secret which will be employed when Kubernetes asks NGC to pull container images from it.
7474

@@ -105,9 +105,9 @@ To access the NGC container registry via Kubernetes, add a secret which will be
105105
- name: nvcr.dgxkey
106106
containers:
107107
- name: pytorch-container
108-
image: nvcr.io/nvidia/pytorch:19.02-py3
108+
image: nvcr.io/nvidia/pytorch:26.04-py3
109109
command: ["/bin/sh"]
110-
args: ["-c", "python /workspace/examples/upstream/mnist/main.py"]
110+
args: ["-c", "python -c 'import torch; print(\"cuda_available=\", torch.cuda.is_available()); print(\"device_count=\", torch.cuda.device_count())'"]
111111
resources:
112112
limits:
113113
nvidia.com/gpu: 1

docs/slurm-cluster/README.md

Lines changed: 4 additions & 4 deletions
@@ -87,7 +87,7 @@ default parameters that can be overriden:
8787
```bash
8888
# String; Container for nccl performance/validation tests. Either docker
8989
# tag or can be path to sqsh file.
90-
base_container: "nvcr.io/nvidia/tensorflow:21.09-tf2-py3"
90+
base_container: "nvcr.io/nvidia/pytorch:26.04-py3"
9191

9292
# String; Container to be created or one that might exist with nccl tests.
9393
# If `compile_nccl_tests` is True, it must be a sqsh file.
@@ -166,17 +166,17 @@ NOTE: This will use Pyxis to download a container.
166166

167167
```bash
168168
ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml \
169-
-e '{base_container: nvcr.io/nvidia/pytorch:21.09-py3}' \
169+
-e '{base_container: nvcr.io/nvidia/pytorch:26.04-py3}' \
170170
-e '{nccl_tests_container: "${HOME}/enroot_images/nccl_tests_torch_val.sqsh"}' \
171171
-e '{num_nodes: 2}' \
172172
-e '{srun_exports: "NCCL_DEBUG=INFO,OMPI_MCA_pml=^ucx,OMPI_MCA_coll=^hcoll"}' \
173173
-e '{cleanup: True}'
174174
```
175175

176-
3. Example to run on 1 node using existing NCCL container from a docker repo.
176+
3. Example to run on 1 node using an existing NCCL test container from a site registry.
177177
```bash
178178
ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml \
179-
-e '{nccl_tests_container: deepops/nccl-tests-tf20.06-ubuntu18.04:latest}' \
179+
-e '{nccl_tests_container: registry.example.com/hpc/nccl-tests:latest}' \
180180
-e '{compile_nccl_tests: False}' \
181181
-e '{num_nodes: 1}'
182182
```

docs/slurm-cluster/slurm-perf-cluster.md

Lines changed: 3 additions & 3 deletions
@@ -254,7 +254,7 @@ If errors are noticed when running `sinfo -R`, it's also helpful to search the l
254254
sudo journalctl -e | grep slurm
255255
```
256256

257-
To re-run the test manually, from the slurm login node...
257+
To re-run the test manually, run the commands below from the slurm login node. Replace `registry.example.com/hpc/nccl-tests:latest` with your site's current NCCL tests image or a `.sqsh` image built by `playbooks/slurm-cluster/slurm-validation.yml`.
258258

259259
```bash
260260
# on the slurm login node
@@ -269,7 +269,7 @@ scancel <job_id>
269269
sudo scontrol update nodename=<node_names> state=idle
270270

271271
# run the test again
272-
srun -N <num_nodes> --mpi=pmix --exclusive --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04 --ntasks-per-node=8 -G <num_nodes x num_gpus_per_node> all_reduce_perf -b 1M -e 4G -f 2 -g <num_gpus_per_node>
272+
srun -N <num_nodes> --mpi=pmix --exclusive --container-image=registry.example.com/hpc/nccl-tests:latest --ntasks-per-node=8 -G <num_nodes x num_gpus_per_node> all_reduce_perf -b 1M -e 4G -f 2 -g <num_gpus_per_node>
273273
```
274274

275275
### Performance validation test results are suboptimal
@@ -289,7 +289,7 @@ Try running the test from the slurm login node, but with debug output enabled...
289289

290290
```bash
291291
# from the slurm login node
292-
$ NCCL_DEBUG=INFO srun -N <num_nodes> --mpi=pmix --exclusive --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04 --ntasks-per-node=8 -G <num_nodes x num_gpus_per_node> all_reduce_perf -b 1M -e 4G -f 2 -g <num_gpus_per_node>
292+
$ NCCL_DEBUG=INFO srun -N <num_nodes> --mpi=pmix --exclusive --container-image=registry.example.com/hpc/nccl-tests:latest --ntasks-per-node=8 -G <num_nodes x num_gpus_per_node> all_reduce_perf -b 1M -e 4G -f 2 -g <num_gpus_per_node>
293293

294294
# examine the output, looking for any mention of `GDRDMA`
295295
# for example: `NET/IB/0/GDRDMA`

docs/slurm-cluster/slurm-single-node.md

Lines changed: 13 additions & 11 deletions
@@ -368,11 +368,11 @@ compute-session:start_rootless_docker.sh
368368
```
369369
370370
An option “--quiet” can be passed to the “start_rootless_docker.sh” script to
371-
hide rootless docker messages. Pull/run a docker image:
371+
hide rootless docker messages. Pull/run a site-maintained NCCL tests image:
372372
373373
```bash
374374
compute-session:docker run --gpus=all --rm -it \
375-
deepops/nccl-tests-tf20.06-ubuntu18.04:latest \
375+
registry.example.com/hpc/nccl-tests:latest \
376376
mpirun --allow-run-as-root -np 2 all_reduce_perf -b 1M -e 4G -f 2 -g 1
377377
```
378378
@@ -386,7 +386,7 @@ module load rootless-docker
386386
387387
start_rootless_docker.sh --quiet
388388
389-
docker run --gpus=all --rm -t deepops/nccl-tests-tf20.06-ubuntu18.04:latest \
389+
docker run --gpus=all --rm -t registry.example.com/hpc/nccl-tests:latest \
390390
mpirun --allow-run-as-root -np 2 all_reduce_perf -b 1M -e 4G -f 2 -g 1
391391
392392
stop_rootless_docker.sh
@@ -403,7 +403,7 @@ starting the container and checking the number of GPUs and CPUs available.
403403
404404
```bash
405405
compute-session:docker run --gpus=all --rm -it \
406-
deepops/nccl-tests-tf20.06-ubuntu18.04:latest \
406+
registry.example.com/hpc/nccl-tests:latest \
407407
bash -c 'echo NGPUS: $(nvidia-smi -L | wc -l) NCPUS: $(nproc)'
408408
NGPUS: 2 NCPUS: 2
409409
```
@@ -416,7 +416,7 @@ already does not have permission to outside of the container.
416416
417417
```bash
418418
compute-session:docker run --gpus=all --rm -it -v ${PWD}:${PWD} --workdir=${PWD} \
419-
deepops/nccl-tests-tf20.06-ubuntu18.04:latest bash -c 'touch somefile-in-container'
419+
registry.example.com/hpc/nccl-tests:latest bash -c 'touch somefile-in-container'
420420
```
421421
422422
Then outside of the container.
@@ -434,7 +434,7 @@ outside of the container.
434434
435435
```bash
436436
compute-session:docker run --gpus=all --rm -it -v /etc/slurm:/slurm --workdir=${PWD} \
437-
deepops/nccl-tests-tf20.06-ubuntu18.04:latest bash -c 'cat /slurm/slurmdbd.conf'
437+
registry.example.com/hpc/nccl-tests:latest bash -c 'cat /slurm/slurmdbd.conf'
438438
cat: /slurm/slurmdbd.conf: Permission denied
439439
```
440440
@@ -464,13 +464,15 @@ Singularity and enroot could also be deployed via DeepOps. These would be
464464
useful for multi-node jobs if running on more than one DGX system.
465465
Enroot with pyxis can be tested by running:
466466
467+
The examples below use `registry.example.com/hpc/nccl-tests:latest` as a placeholder for a site-maintained NCCL tests image.
468+
467469
```bash
468470
login-session:srun --mpi=pmi2 --ntasks=2 --gpus-per-task=1 \
469-
--container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest \
471+
--container-image=registry.example.com/hpc/nccl-tests:latest \
470472
all_reduce_perf -b 1M -e 4G -f 2 -g 1
471473
```
472474
473-
The pyxis+enroot is invoked via option “ --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest”
475+
Pyxis+enroot is invoked via the option “--container-image=registry.example.com/hpc/nccl-tests:latest”
474476
to run the “all_reduce_perf” nccl test. Refer to enroot and pyxis documentation
475477
for further details.
476478
@@ -490,7 +492,7 @@ Then invoke as:
490492
491493
```bash
492494
login-session:srun --ntasks=2 --gpus-per-task=1 --no-container-remap-root \
493-
--container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest --container-workdir=${PWD} \
495+
--container-image=registry.example.com/hpc/nccl-tests:latest --container-workdir=${PWD} \
494496
test-allreduce.sh
495497
```
496498
@@ -507,7 +509,7 @@ Singularity could be used in a similar fashion to enroot. Don’t forget the
507509
508510
```bash
509511
login-session:srun --mpi=pmi2 --ntasks=2 --gpus-per-task=1 \
510-
singularity exec --nv docker://deepops/nccl-tests-tf20.06-ubuntu18.04:latest \
512+
singularity exec --nv docker://registry.example.com/hpc/nccl-tests:latest \
511513
all_reduce_perf -b 1M -e 4G -f 2 -g 1
512514
```
513515
@@ -516,7 +518,7 @@ with enroot):
516518
517519
```bash
518520
login-session:srun --ntasks=2 --gpus-per-task=1 \
519-
singularity exec --nv docker://deepops/nccl-tests-tf20.06-ubuntu18.04:latest \
521+
singularity exec --nv docker://registry.example.com/hpc/nccl-tests:latest \
520522
${PWD}/test_allreduce.sh
521523
```
522524

playbooks/slurm-cluster/slurm-validation.yml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
1111
vars:
1212
# String; Container for nccl performance/validation tests. Either docker
1313
# repo or can be path to sqsh file.
14-
base_container: "nvcr.io/nvidia/tensorflow:21.09-tf2-py3"
14+
base_container: "nvcr.io/nvidia/pytorch:26.04-py3"
1515
# String; Container to be created or one that might exist with nccl tests.
1616
# If `compile_nccl_tests` is True, it must be a sqsh file.
1717
nccl_tests_container: "${HOME}/enroot_images/nccl_tests_slurm_val.sqsh"

roles/nginx-docker-registry-cache/defaults/main.yml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ nginx_docker_cache_image: "rpardini/docker-registry-proxy:0.6.5"
55
nginx_docker_cache_mirror_path: "/opt/deepops/nginx-docker-cache/mirror"
66
nginx_docker_cache_ca_path: "/opt/deepops/nginx-docker-cache/ca"
77

8-
nginx_docker_cache_registry_string: "quay.io k8s.gcr.io gcr.io nvcr.io"
8+
nginx_docker_cache_registry_string: "registry.k8s.io quay.io k8s.gcr.io gcr.io nvcr.io"
99
nginx_docker_cache_manifests: "false"
1010
nginx_docker_cache_manifest_default_time: "1h"
1111

workloads/examples/k8s/dask-rapids/docker/Dockerfile

Lines changed: 4 additions & 1 deletion
@@ -5,9 +5,12 @@ USER root
55

66
RUN apt-get update && \
77
apt-get install -y --no-install-recommends font-manager && \
8+
mkdir -p /opt/rapids/notebooks && \
9+
chown -R rapids:conda /opt/rapids && \
810
rm -rf /var/lib/apt/lists/*
911

1012
USER rapids
13+
WORKDIR /opt/rapids/notebooks
1114

1215
# Copy the parallel sum notebook in
13-
COPY --chown=rapids:conda ParallelSum.ipynb /home/rapids/notebooks/ParallelSum.ipynb
16+
COPY --chown=rapids:conda ParallelSum.ipynb ./ParallelSum.ipynb
