Refresh workload and registry examples

dholt · dholt · commit ce10816bfa8e · 2026-06-02T17:19:48.000-06:00
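
Review aid, not part of the change: this commit adds `registry.k8s.io` to `nginx_docker_cache_registry_string` and points several examples at the placeholder `registry.example.com/hpc/nccl-tests:latest`. The sketch below is one way to check which image references used in the docs have their registry host listed in that space-separated string. The helper names `registry_of` and `uncovered` are hypothetical, and the host detection follows the common Docker naming convention rather than anything defined in this repository.

```python
# Sketch: list image references whose registry host is missing from the
# space-separated registry string. Helper names are illustrative only.

REGISTRY_STRING = "registry.k8s.io quay.io k8s.gcr.io gcr.io nvcr.io"

IMAGES = [
    "nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04",
    "nvcr.io/nvidia/pytorch:26.04-py3",
    "nvcr.io/nvidia/tensorflow:25.02-tf2-py3",
    "registry.example.com/hpc/nccl-tests:latest",
]


def registry_of(image: str) -> str:
    """Return the registry host of an image reference.

    Assumes the usual Docker convention: the first path component is a
    registry host only if it contains '.' or ':' or equals 'localhost';
    otherwise the reference is treated as a Docker Hub image.
    """
    first, sep, _ = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def uncovered(images, registry_string):
    """Return the images whose registry host is not in the string."""
    listed = set(registry_string.split())
    return [img for img in images if registry_of(img) not in listed]


print(uncovered(IMAGES, REGISTRY_STRING))
# prints ['registry.example.com/hpc/nccl-tests:latest']
```

The placeholder registry is reported because it is not in the list. A site that replaces it with a real registry would need to decide separately whether to add that host to `nginx_docker_cache_registry_string`.
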
diff --git a/config.example/group_vars/all.yml b/config.example/group_vars/all.yml
@@ -316,5 +316,5 @@ standalone_container_registry_port: "5000"
 # Configuration for NGC-Ready playbook                                         #
 ################################################################################
 ngc_ready_cuda_container: "nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04"
-ngc_ready_pytorch: "nvcr.io/nvidia/pytorch:24.04-py3"
-ngc_ready_tensorflow: "nvcr.io/nvidia/tensorflow:24.04-tf2-py3"
+ngc_ready_pytorch: "nvcr.io/nvidia/pytorch:26.04-py3"
+ngc_ready_tensorflow: "nvcr.io/nvidia/tensorflow:25.02-tf2-py3"
diff --git a/docs/airgap/ngc-ready.md b/docs/airgap/ngc-ready.md
@@ -37,8 +37,8 @@ For instructions on setting up an HTTP mirror, see the [doc on HTTP mirrors](./m
 Container images are only needed if you want to run the tests built into the playbook:
 
 - nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
-- nvcr.io/nvidia/pytorch:24.04-py3
-- nvcr.io/nvidia/tensorflow:24.04-tf2-py3
+- nvcr.io/nvidia/pytorch:26.04-py3
+- nvcr.io/nvidia/tensorflow:25.02-tf2-py3
 
 For instructions on setting up a Docker registry mirror, see the [doc on Docker mirrors](./mirror-docker-images.md).
 
@@ -62,8 +62,8 @@ For instructions on setting up an HTTP mirror, see the [doc on HTTP mirrors](./m
 Container images (how to mirror) are only needed if you want to run the tests built into the playbook:
 
 - nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
-- nvcr.io/nvidia/pytorch:24.04-py3
-- nvcr.io/nvidia/tensorflow:24.04-tf2-py3
+- nvcr.io/nvidia/pytorch:26.04-py3
+- nvcr.io/nvidia/tensorflow:25.02-tf2-py3
 
 For instructions on setting up a Docker registry mirror, see the [doc on Docker mirrors](./mirror-docker-images.md).
 
@@ -177,8 +177,8 @@ If running the container tests as part of the NGC-Ready playbook, set the follow
 
 ```bash
 ngc_ready_cuda_container: "<your-container-registry>/nvidia/cuda:12.4.1-base-ubuntu22.04"
-ngc_ready_pytorch: "<your-container-registry>/nvidia/pytorch:24.04-py3"
-ngc_ready_tensorflow: "<your-container-registry>/nvidia/tensorflow:24.04-tf2-py3"
+ngc_ready_pytorch: "<your-container-registry>/nvidia/pytorch:26.04-py3"
+ngc_ready_tensorflow: "<your-container-registry>/nvidia/tensorflow:25.02-tf2-py3"
 ```
 
 ## Running the NGC-Ready playbook
diff --git a/docs/container/nginx-docker-cache.md b/docs/container/nginx-docker-cache.md
@@ -42,8 +42,8 @@ The following variables are the most common configuration you may want to adjust
 
 | Variable                                   | Default value                            | Description                                                                   |
 | ------------------------------------------ | ---------------------------------------- | ----------------------------------------------------------------------------- |
-| `nginx_docker_cache_image`                 | `"rpardini/docker-registry-proxy:0.6.1"` | Container image used to deploy the proxy                                      |
-| `nginx_docker_cache_registry_string`       | `"quay.io k8s.gcr.io gcr.io nvcr.io"`    | Space-separated list of registries to proxy                                   |
+| `nginx_docker_cache_image`                 | `"rpardini/docker-registry-proxy:0.6.5"` | Container image used to deploy the proxy                                      |
+| `nginx_docker_cache_registry_string`       | `"registry.k8s.io quay.io k8s.gcr.io gcr.io nvcr.io"` | Space-separated list of registries to proxy; `k8s.gcr.io` is retained for older clusters while current Kubernetes images use `registry.k8s.io` |
 | `nginx_docker_cache_manifests`             | `"false"`                                | Flag to determine whether to cache image manifests                            |
 | `nginx_docker_cache_manifest_default_time` | "1h"                                     | If manifests are cached, time to cache them                                   |
 | `nginx_docker_cache_hostgroup`             | `"cache"`                                | Ansible inventory host group where proxy is deployed                          |
diff --git a/docs/k8s-cluster/kubernetes-usage.md b/docs/k8s-cluster/kubernetes-usage.md
@@ -10,7 +10,7 @@ Kubernetes Usage Guide
 
 ## Introduction
 
-Most of the following examples can be configured and executed through the Kubernetes Dashboard. For a basic run-through on how to leverage the Kubernetes Dashboard, please see the [official documentation](https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/). The following examples `kubectl` on the master node instead.
+Most of the following examples can be configured and executed through the Kubernetes Dashboard. For a basic run-through on how to leverage the Kubernetes Dashboard, please see the [official documentation](https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/). The following examples use `kubectl` on the master node instead.
 
 ## Simple Commands
 
@@ -63,12 +63,12 @@ kubectl get pods --all-namespaces
 4. Delete the job (and the corresponding pod).
 
    ```bash
-   kubectl delete job cuda-job
+   kubectl delete job pytorch-job
    ```
 
 ## Using NGC Containers with Kubernetes and Launching Jobs
 
-[NVIDIA GPU Cloud (NGC)](https://docs.nvidia.com/ngc/ngc-introduction) manages a catalog of fully integrated and optimized DL framework containers that take full advantage of NVIDIA GPUs in both single and multi-GPU configurations. They include NVIDIA CUDA® Toolkit, DIGITS workflow, and the following DL frameworks: NVCaffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, PyTorch, TensorFlow, Theano, and Torch. These framework containers are delivered ready-to-run, including all necessary dependencies such as the CUDA runtime and NVIDIA libraries.
+[NVIDIA GPU Cloud (NGC)](https://docs.nvidia.com/ngc/ngc-introduction) hosts a catalog of GPU-optimized containers for CUDA, PyTorch, TensorFlow, Triton Inference Server, RAPIDS, and other NVIDIA software. Use the NGC catalog and the NVIDIA framework container release notes to choose the current image for your workload.
 
 To access the NGC container registry via Kubernetes, add a secret which will be employed when Kubernetes asks NGC to pull container images from it.
 
@@ -105,9 +105,9 @@ To access the NGC container registry via Kubernetes, add a secret which will be
            - name: nvcr.dgxkey
          containers:
            - name: pytorch-container
-             image: nvcr.io/nvidia/pytorch:19.02-py3
+             image: nvcr.io/nvidia/pytorch:26.04-py3
              command: ["/bin/sh"]
-             args: ["-c", "python /workspace/examples/upstream/mnist/main.py"]
+             args: ["-c", "python -c 'import torch; print(\"cuda_available=\", torch.cuda.is_available()); print(\"device_count=\", torch.cuda.device_count())'"]
              resources:
                limits:
                  nvidia.com/gpu: 1
diff --git a/docs/slurm-cluster/README.md b/docs/slurm-cluster/README.md
@@ -87,7 +87,7 @@ default parameters that can be overriden:
 ```bash
     # String; Container for nccl performance/validation tests. Either docker
     #   tag or can be path to sqsh file.
-    base_container: "nvcr.io/nvidia/tensorflow:21.09-tf2-py3"
+    base_container: "nvcr.io/nvidia/pytorch:26.04-py3"
 
     # String; Container to be created or one that might exist with nccl tests.
     #   If `compile_nccl_tests` is True, it must be a sqsh file.
@@ -166,17 +166,17 @@ NOTE: This will use Pyxis to download a container.
 
    ```bash
    ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml \
-     -e '{base_container: nvcr.io/nvidia/pytorch:21.09-py3}' \
+     -e '{base_container: nvcr.io/nvidia/pytorch:26.04-py3}' \
      -e '{nccl_tests_container: "${HOME}/enroot_images/nccl_tests_torch_val.sqsh"}' \
      -e '{num_nodes: 2}' \
      -e '{srun_exports: "NCCL_DEBUG=INFO,OMPI_MCA_pml=^ucx,OMPI_MCA_coll=^hcoll"}' \
      -e '{cleanup: True}'
    ```
 
-3. Example to run on 1 node using existing NCCL container from a docker repo.
+3. Example to run on 1 node using an existing NCCL test container from a site registry.
    ```bash
    ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml \
-     -e '{nccl_tests_container: deepops/nccl-tests-tf20.06-ubuntu18.04:latest}' \
+     -e '{nccl_tests_container: registry.example.com/hpc/nccl-tests:latest}' \
      -e '{compile_nccl_tests: False}' \
      -e '{num_nodes: 1}'
    ```
diff --git a/docs/slurm-cluster/slurm-perf-cluster.md b/docs/slurm-cluster/slurm-perf-cluster.md
@@ -254,7 +254,7 @@ If errors are noticed when running `sinfo -R`, it's also helpful to search the l
 sudo journalctl -e | grep slurm
 ```
 
-To re-run the test manually, from the slurm login node...
+To re-run the test manually from the Slurm login node, use the commands below. Replace `registry.example.com/hpc/nccl-tests:latest` with your site's current NCCL tests image or a `.sqsh` image built by `playbooks/slurm-cluster/slurm-validation.yml`.
 
 ```bash
 # on the slurm login node
@@ -269,7 +269,7 @@ scancel <job_id>
 sudo scontrol update nodename=<node_names> state=idle
 
 # run the test again
-srun -N <num_nodes> --mpi=pmix --exclusive --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04 --ntasks-per-node=8 -G <num_nodes x num_gpus_per_node> all_reduce_perf -b 1M -e 4G -f 2 -g <num_gpus_per_node>
+srun -N <num_nodes> --mpi=pmix --exclusive --container-image=registry.example.com/hpc/nccl-tests:latest --ntasks-per-node=8 -G <num_nodes x num_gpus_per_node> all_reduce_perf -b 1M -e 4G -f 2 -g <num_gpus_per_node>
 ```
 
 ### Performance validation test results are suboptimal
@@ -289,7 +289,7 @@ Try running the test from the slurm login node, but with debug output enabled...
 
 ```bash
 # from the slurm login node
-$ NCCL_DEBUG=INFO srun -N <num_nodes> --mpi=pmix --exclusive --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04 --ntasks-per-node=8 -G <num_nodes x num_gpus_per_node> all_reduce_perf -b 1M -e 4G -f 2 -g <num_gpus_per_node>
+$ NCCL_DEBUG=INFO srun -N <num_nodes> --mpi=pmix --exclusive --container-image=registry.example.com/hpc/nccl-tests:latest --ntasks-per-node=8 -G <num_nodes x num_gpus_per_node> all_reduce_perf -b 1M -e 4G -f 2 -g <num_gpus_per_node>
 
 # examine the output, looking for any mention of `GDRDMA`
 # for example: `NET/IB/0/GDRDMA`
diff --git a/docs/slurm-cluster/slurm-single-node.md b/docs/slurm-cluster/slurm-single-node.md
@@ -368,11 +368,11 @@ compute-session:start_rootless_docker.sh
 ```
 
 An option “--quiet” can be passed to the “start_rootless_docker.sh” script to
-hide rootless docker messages. Pull/run a docker image:
+hide rootless docker messages. Pull/run a site-maintained NCCL tests image:
 
 ```bash
 compute-session:docker run --gpus=all --rm -it \
-  deepops/nccl-tests-tf20.06-ubuntu18.04:latest \
+  registry.example.com/hpc/nccl-tests:latest \
   mpirun --allow-run-as-root -np 2  all_reduce_perf -b 1M -e 4G -f 2 -g 1
 ```
 
@@ -386,7 +386,7 @@ module load rootless-docker
 
 start_rootless_docker.sh --quiet
 
-docker run --gpus=all --rm -t deepops/nccl-tests-tf20.06-ubuntu18.04:latest \
+docker run --gpus=all --rm -t registry.example.com/hpc/nccl-tests:latest \
   mpirun --allow-run-as-root -np 2  all_reduce_perf -b 1M -e 4G -f 2 -g 1
 
 stop_rootless_docker.sh
@@ -403,7 +403,7 @@ starting the container and checking the number of GPUs and CPUs available.
 
 ```bash
 compute-session:docker run --gpus=all --rm -it \
-  deepops/nccl-tests-tf20.06-ubuntu18.04:latest \
+  registry.example.com/hpc/nccl-tests:latest \
   bash -c 'echo NGPUS: $(nvidia-smi -L | wc -l) NCPUS: $(nproc)'
 NGPUS: 2 NCPUS: 2
 ```
@@ -416,7 +416,7 @@ already does not have permission to outside of the container.
 
 ```bash
 compute-session:docker run --gpus=all --rm -it -v ${PWD}:${PWD} --workdir=${PWD} \
-  deepops/nccl-tests-tf20.06-ubuntu18.04:latest bash -c 'touch somefile-in-container'
+  registry.example.com/hpc/nccl-tests:latest bash -c 'touch somefile-in-container'
 ```
 
 Then outside of the container.
@@ -434,7 +434,7 @@ outside of the container.
 
 ```bash
 compute-session:docker run --gpus=all --rm -it -v /etc/slurm:/slurm --workdir=${PWD} \
-  deepops/nccl-tests-tf20.06-ubuntu18.04:latest bash -c 'cat /slurm/slurmdbd.conf'
+  registry.example.com/hpc/nccl-tests:latest bash -c 'cat /slurm/slurmdbd.conf'
 cat: /slurm/slurmdbd.conf: Permission denied
 ```
 
@@ -464,13 +464,15 @@ Singularity and enroot could also be deployed via DeepOps. These would be
 useful for multi-node jobs if running on more than one DGX system.
 Enroot with pyxis can be tested by running:
 
+The examples below use `registry.example.com/hpc/nccl-tests:latest` as a placeholder; replace it with your site-maintained NCCL tests image.
+
 ```bash
 login-session:srun --mpi=pmi2 --ntasks=2 --gpus-per-task=1 \
-  --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest \
+  --container-image=registry.example.com/hpc/nccl-tests:latest \
   all_reduce_perf -b 1M -e 4G -f 2 -g 1
 ```
 
-The pyxis+enroot is invoked via option “ --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest”
+Pyxis+enroot is invoked via the option “--container-image=registry.example.com/hpc/nccl-tests:latest”
 to run the “all_reduce_perf” nccl test. Refer to enroot and pyxis documentation
 for further details.
 
@@ -490,7 +492,7 @@ Then invoke as:
 
 ```bash
 login-session:srun --ntasks=2 --gpus-per-task=1 --no-container-remap-root \
-  --container-image=deepops/nccl-tests-tf20.06-ubuntu18.04:latest --container-workdir=${PWD} \
+  --container-image=registry.example.com/hpc/nccl-tests:latest --container-workdir=${PWD} \
   test-allreduce.sh
 ```
 
@@ -507,7 +509,7 @@ Singularity could be used in a similar fashion to enroot. Don’t forget the
 
 ```bash
 login-session:srun --mpi=pmi2 --ntasks=2 --gpus-per-task=1 \
-  singularity exec --nv docker://deepops/nccl-tests-tf20.06-ubuntu18.04:latest \
+  singularity exec --nv docker://registry.example.com/hpc/nccl-tests:latest \
     all_reduce_perf -b 1M -e 4G -f 2 -g 1
 ```
 
@@ -516,7 +518,7 @@ with enroot):
 
 ```bash
 login-session:srun --ntasks=2 --gpus-per-task=1 \
-  singularity exec --nv docker://deepops/nccl-tests-tf20.06-ubuntu18.04:latest \
+  singularity exec --nv docker://registry.example.com/hpc/nccl-tests:latest \
     ${PWD}/test_allreduce.sh
 ```
 
diff --git a/playbooks/slurm-cluster/slurm-validation.yml b/playbooks/slurm-cluster/slurm-validation.yml
@@ -11,7 +11,7 @@
   vars:
     # String; Container for nccl performance/validation tests. Either docker
     #   repo or can be path to sqsh file.
-    base_container: "nvcr.io/nvidia/tensorflow:21.09-tf2-py3"
+    base_container: "nvcr.io/nvidia/pytorch:26.04-py3"
     # String; Container to be created or one that might exist with nccl tests.
     #   If `compile_nccl_tests` is True, it must be a sqsh file.
     nccl_tests_container: "${HOME}/enroot_images/nccl_tests_slurm_val.sqsh"
diff --git a/roles/nginx-docker-registry-cache/defaults/main.yml b/roles/nginx-docker-registry-cache/defaults/main.yml
@@ -5,7 +5,7 @@ nginx_docker_cache_image: "rpardini/docker-registry-proxy:0.6.5"
 nginx_docker_cache_mirror_path: "/opt/deepops/nginx-docker-cache/mirror"
 nginx_docker_cache_ca_path: "/opt/deepops/nginx-docker-cache/ca"
 
-nginx_docker_cache_registry_string: "quay.io k8s.gcr.io gcr.io nvcr.io"
+nginx_docker_cache_registry_string: "registry.k8s.io quay.io k8s.gcr.io gcr.io nvcr.io"
 nginx_docker_cache_manifests: "false"
 nginx_docker_cache_manifest_default_time: "1h"
 
diff --git a/workloads/examples/k8s/dask-rapids/docker/Dockerfile b/workloads/examples/k8s/dask-rapids/docker/Dockerfile
@@ -5,9 +5,12 @@ USER root
 
 RUN apt-get update && \
     apt-get install -y --no-install-recommends font-manager && \
+    mkdir -p /opt/rapids/notebooks && \
+    chown -R rapids:conda /opt/rapids && \
     rm -rf /var/lib/apt/lists/*
 
 USER rapids
+WORKDIR /opt/rapids/notebooks
 
 # Copy the parallel sum notebook in
-COPY --chown=rapids:conda ParallelSum.ipynb /home/rapids/notebooks/ParallelSum.ipynb
+COPY --chown=rapids:conda ParallelSum.ipynb ./ParallelSum.ipynb
diff --git a/workloads/examples/k8s/pytorch-job.yml b/workloads/examples/k8s/pytorch-job.yml
@@ -8,9 +8,9 @@ spec:
     spec:
       containers:
         - name: pytorch-container
-          image: nvcr.io/nvidia/pytorch:19.02-py3
+          image: nvcr.io/nvidia/pytorch:26.04-py3
           command: ["/bin/sh"]
-          args: ["-c", "python /workspace/examples/upstream/mnist/main.py"]
+          args: ["-c", "python -c 'import torch; print(\"cuda_available=\", torch.cuda.is_available()); print(\"device_count=\", torch.cuda.device_count())'"]
           resources:
             limits:
               nvidia.com/gpu: 1