refactor: version PyTorch directory structure for multi-version support by Eren-Jeager123 · Pull Request #6091 · aws/deep-learning-containers

Eren-Jeager123 · 2026-05-12T23:06:34Z

Summary

Restructure PyTorch to support maintaining multiple versions concurrently (1-year support window). Versioned directories isolate each version's build artifacts. Workflows are version-agnostic — no new workflow files needed when adding a future version.

1. Versioned directory structure

docker/pytorch/2.11/
├── Dockerfile.cuda        # Uses ARG DLC_PYTORCH_VERSION=2.11 for COPY paths
├── Dockerfile.cpu
├── versions-cuda.env
├── versions-cpu.env
├── cuda/pyproject.toml
├── cuda/uv.lock
├── cpu/pyproject.toml
└── cpu/uv.lock

Config files versioned by name: pytorch-2.11-ec2-cuda.yml, etc.
Shared scripts (scripts/pytorch/) and tests (test/pytorch/) stay in place.

2. PR workflows — matrix over detected versions

gatekeeper → detect-versions → build-images (matrix) → sanity / security / telemetry / single-gpu / efa

detect-versions: scans diff for docker/pytorch/X.Y/ or pytorch-X.Y-*.yml paths, outputs JSON array. Falls back to LATEST_PYTORCH_VERSION for shared file changes.
build-images: matrix over detected versions with fail-fast: false. Each leg loads its own config (via yq), builds with --build-arg DLC_PYTORCH_VERSION, runs unit tests. CI image tags include version (pytorch-runtime-2.11-pr-{PR}) to avoid collisions.
Downstream tests: consume config + image-uri from build-images outputs. Run against the latest version's image.

Multi-version PRs build ALL changed versions in parallel. Tests run for the latest.

3. Autorelease workflows — cron → config file mapping

case "$CRON" in
  "00 17 * * 1,3") echo "config-file=.github/config/image/pytorch-2.11-ec2-cuda.yml" ;;
esac

Versions staggered 10 minutes apart. Manual dispatch accepts config file path. PyTorch version derived from load-config's framework-version output — no redundant version variable.

4. Dockerfile ARG for version paths

Dockerfiles use ARG DLC_PYTORCH_VERSION=2.11 instead of hardcoding:

COPY docker/pytorch/${DLC_PYTORCH_VERSION}/cuda/pyproject.toml /tmp/build/

Workflows pass --build-arg DLC_PYTORCH_VERSION=${{ matrix.version }} to ensure correctness.

5. Additional fixes

test_versions.py — reads DLC_PYTORCH_VERSION env var to locate versioned versions-*.env
upload_cached_wheels.sh — accepts Dockerfile path as parameter (previously hardcoded a path that never existed)

Adding PyTorch 2.12

Copy docker/pytorch/2.11/ → docker/pytorch/2.12/, update ARG DLC_PYTORCH_VERSION=2.12 + version pins
Create 4 config files (pytorch-2.12-{ec2,sagemaker}-{cuda,cpu}.yml)
Autorelease: add one cron line + one case entry per workflow
PR workflows: auto-detect from diff — no changes needed

Test plan

PR workflow detects version from docker/pytorch/2.11/** changes
Matrix builds in parallel when multiple versions change (CI tags include version — no collision)
Unit tests run for each matrix version with correct DLC_PYTORCH_VERSION
Downstream tests receive config from build-images outputs
DLC_PYTORCH_VERSION build-arg resolves correctly in Dockerfile COPY paths
Autorelease workflow_dispatch with config-file input works
upload_cached_wheels.sh uses parameterized Dockerfile path

🤖 Generated with Claude Code

Move all PyTorch build artifacts under docker/pytorch/2.11/ to support maintaining multiple PyTorch versions concurrently (1-year support window per version). Structure change: docker/pytorch/Dockerfile.cuda → docker/pytorch/2.11/Dockerfile.cuda docker/pytorch/versions-cuda.env → docker/pytorch/2.11/versions-cuda.env docker/pytorch/cuda/pyproject.toml → docker/pytorch/2.11/cuda/pyproject.toml (same pattern for CPU) .github/config/image/pytorch-ec2-cuda.yml → pytorch-2.11-ec2-cuda.yml .github/workflows/pr-pytorch-ec2-cuda.yml → pr-pytorch-2.11-ec2-cuda.yml (same pattern for all 4 variants × PR + autorelease) Adding PyTorch 2.12 when it releases means creating docker/pytorch/2.12/, new configs, and new workflows — without touching 2.11. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The versions env file moved from docker/pytorch/versions-cuda.env to docker/pytorch/2.11/versions-cuda.env. Use glob to find the file under any version directory. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

test_versions.py now requires DLC_PYTORCH_VERSION env var (e.g., "2.11") to locate the correct versions-{cuda,cpu}.env under the versioned directory. All 4 PR workflows pass it via docker run -e. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Same fix as the PR workflows — autorelease workflows also run unit tests that need the versioned env file path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Only unit tests (test_versions.py) need this env var. Removed from single-GPU and multi-GPU test containers in EC2 CUDA workflows. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The script had a hardcoded docker/pytorch/Dockerfile path that no longer exists after the versioned restructure. Accept the Dockerfile path as a parameter and update all 3 callers to pass it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

PR workflows detect the PyTorch version from changed file paths (docker/pytorch/X.Y/ or pytorch-X.Y-*.yml), falling back to LATEST_PYTORCH_VERSION env var for shared file changes. Autorelease workflows use multi-cron scheduling with a case mapping from cron expression to version. Staggered 10 min apart per version. Also supports workflow_dispatch with an explicit pytorch-version input. Adding PyTorch 2.12: create docker/pytorch/2.12/ + config files, then add one cron line + one case entry per autorelease workflow. No new workflow files needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sirutBuasai · 2026-05-19T18:11:04Z

+      version: ${{ steps.version.outputs.version }}
+      config-file: ${{ steps.version.outputs.config-file }}
+    steps:
+      - name: Determine PyTorch version


Why do we need to determine pytorch version and then construct a config file name based off that? Why dont we simply use the config file as input, or simple do

case "$CRON" in "00 17 * * 1,3") CONF_FILE="<path_to_pt2.11_cpu>" ;; # "10 17 * * 1,3") CONF_FILE="<path_to_pt2.12_cpu>" ;; *) echo "::error::Unknown cron: $CRON"; exit 1 ;; esac

Then any usage of the actual Pytorch version field can just be derived from the config file using the output from load_config job.

Good point I forgot we can use the one from load_config

sirutBuasai · 2026-05-19T18:13:05Z

-  CONFIG_FILE: ".github/config/image/pytorch-ec2-cuda.yml"

 jobs:
+  determine-version:


sirutBuasai · 2026-05-19T18:14:23Z

-  CONFIG_FILE: ".github/config/image/pytorch-sagemaker-cpu.yml"

 jobs:
+  determine-version:


sirutBuasai · 2026-05-19T18:14:33Z

-  CONFIG_FILE: ".github/config/image/pytorch-sagemaker-cuda.yml"

 jobs:
+  determine-version:


sirutBuasai · 2026-05-19T19:04:02Z

+        run: |
+          VERSION=$(git diff --name-only origin/main...HEAD \
+            | grep -oP 'docker/pytorch/\K[0-9]+\.[0-9]+' \
+            | sort -u | head -1)


Multi-version PR silently tests only one version. so if we make a change for both PT 2.11 and PT 2.12, we would expect both jobs to run. The current logic only run one PR

I thought that it's rare to change more than 1 version each time previously; Ya however I will change this.

sirutBuasai · 2026-05-19T19:05:25Z

  # ============================================================
  load-config:
-    needs: [gatekeeper]
+    needs: [gatekeeper, check-changes]


we should be able to parallelize the check-changes with load-config. Serializing this slows down the PR a little bit due to these two jobs being extremely small but the job startup each takes 30s

sirutBuasai · 2026-05-19T19:07:13Z

Take a look at https://docs.github.com/en/actions/reference/workflows-and-actions/workflow-syntax#jobsjob_idstrategymatrix matrix job submission. This would allow us to define a list of configurations that the workflow can in sets in parallel of each other.

https://github.com/aws/deep-learning-containers/blob/main/.github/workflows/dispatch-vllm-benchmark.yml is also another good example that we have that uses matrix run mechanism

sirutBuasai · 2026-05-19T19:08:04Z

-      - ".github/config/image/pytorch-ec2-cpu.yml"
+      - ".github/config/image/pytorch-*-ec2-cpu.yml"
      - ".github/workflows/pr-pytorch-ec2-cpu.yml"
      - "docker/pytorch/**"


this is too broad, this means that a change in cuda dockerfile will trigger cpu as well

sirutBuasai · 2026-05-19T19:08:53Z

-          source docker/pytorch/versions-cpu.env
+          VERSION="${{ needs.check-changes.outputs.pytorch-version }}"
+          source docker/pytorch/${VERSION}/versions-cpu.env
          CI_IMAGE_URI="${{ vars.CI_AWS_ACCOUNT_ID }}.dkr.ecr.${{ vars.AWS_REGION }}.amazonaws.com/ci:pytorch-cpu-runtime-pr-${{ github.event.pull_request.number }}"


With matrix running, we'll need to add some kind of version string to the ci image. This is because if PT 2.11 and PT 2.12 runs at the same time, we run into naming collision

sirutBuasai · 2026-05-19T19:09:28Z

sirutBuasai · 2026-05-19T19:09:33Z

sirutBuasai · 2026-05-19T19:09:39Z

sirutBuasai · 2026-05-19T19:10:45Z

 ENV PATH="/opt/venv/bin:${PATH}"

-COPY docker/pytorch/cuda/pyproject.toml docker/pytorch/cuda/uv.lock /tmp/build/
+COPY docker/pytorch/2.11/cuda/pyproject.toml docker/pytorch/2.11/cuda/uv.lock /tmp/build/


If you're going to have to declare the version across the file, we should use ARG PYTORCH_VERSION=2.11 then use COPY .../${PYTORCH_VERSION}/...

Per team feedback: remove the redundant version output from the determine-config job. Map cron directly to config file path. Derive the docker directory version from load-config's framework-version output (cut major.minor from "2.11.0" → "2.11"). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Addresses team feedback: - Multi-version PRs now build ALL changed versions in parallel (not just the first detected one) - Removed separate load-config job — config parsing inlined into build-images and detect-versions (eliminates 30s serialization) - Uses strategy.matrix with fail-fast: false for parallel builds Structure: gatekeeper → detect-versions → build-images (matrix) → test jobs Only the latest version runs the full test suite (sanity, security, telemetry, single-gpu). All versions validate that the build compiles. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…s outputs Per team feedback: detect-versions should only detect versions and path changes. Config values are now output by the build-images matrix job (which already loads config per version). Downstream test jobs reference build-images outputs instead. Also removes the latest-version guard on unit tests — all matrix versions now run unit tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…oded paths Per team feedback: Dockerfiles now use ARG DLC_PYTORCH_VERSION=2.11 for COPY paths instead of hardcoding "docker/pytorch/2.11/..." throughout. Workflows pass --build-arg DLC_PYTORCH_VERSION to ensure the value matches the matrix version. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CUDA workflows only trigger on CUDA-related paths: docker/pytorch/*/Dockerfile.cuda, docker/pytorch/*/cuda/**, versions-cuda.env CPU workflows only trigger on CPU-related paths: docker/pytorch/*/Dockerfile.cpu, docker/pytorch/*/cpu/**, versions-cpu.env Previously all 4 workflows used docker/pytorch/** which meant a CUDA Dockerfile change triggered the CPU workflow (and vice versa). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Docker requires ARGs to be re-declared after each FROM — global ARGs are only available in FROM lines, not in stage instructions like COPY. Without the re-declaration, DLC_PYTORCH_VERSION resolves to empty string causing "not found" errors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

All test jobs (sanity, security, telemetry, single-gpu, efa, sagemaker) now matrix over detected versions. Each version gets its own test run with a constructed image-uri based on the version-specific CI tag. GitHub Actions supports strategy.matrix on reusable workflow calls. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The load-config action installs yq but our inlined config step didn't. yq is not available by default on CodeBuild runners, causing all config outputs (framework, container-type, etc.) to be empty. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When a workflow is cancelled mid-EFA-test, the finally block may not execute, leaking p4d.24xlarge instances and EIPs. This adds a cleanup_stale_efa_instances() call at the start of each test run that terminates instances tagged "CI-CD EFA efa-test" older than 4 hours and releases orphaned EIPs. Prevents: AddressLimitExceeded errors from accumulated leaked EIPs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Logs LD_LIBRARY_PATH, ofi-nccl lib presence, all_reduce_perf binary, fi_info output, NCCL lib path, and full NCCL log on failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

aws-deep-learning-containers-ci Bot added the authorized label May 12, 2026

Eren-Jeager123 and others added 6 commits May 12, 2026 23:17

fix: pass DLC_PYTORCH_VERSION to autorelease unit tests

dcc1f70

Same fix as the PR workflows — autorelease workflows also run unit tests that need the versioned env file path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: remove DLC_PYTORCH_VERSION from non-unit-test jobs

5995526

Only unit tests (test_versions.py) need this env var. Removed from single-GPU and multi-GPU test containers in EC2 CUDA workflows. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

sirutBuasai reviewed May 19, 2026

View reviewed changes

Comment thread .github/workflows/pr-pytorch-ec2-cuda.yml

Copy link
Copy Markdown

Member

sirutBuasai May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

sirutBuasai reviewed May 19, 2026

View reviewed changes

Comment thread .github/workflows/pr-pytorch-sagemaker-cpu.yml

Copy link
Copy Markdown

Member

sirutBuasai May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

sirutBuasai reviewed May 19, 2026

View reviewed changes

Comment thread .github/workflows/pr-pytorch-sagemaker-cuda.yml

Copy link
Copy Markdown

Member

sirutBuasai May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

sirutBuasai reviewed May 19, 2026

View reviewed changes

Eren-Jeager123 and others added 9 commits May 19, 2026 19:56

debug: log config file content to diagnose empty matrix outputs

edd4c39

Eren-Jeager123 and others added 2 commits May 19, 2026 22:50

debug: add diagnostic logging to EFA NCCL test

5c8953f

Logs LD_LIBRARY_PATH, ofi-nccl lib presence, all_reduce_perf binary, fi_info output, NCCL lib path, and full NCCL log on failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Conversation

Eren-Jeager123 commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Versioned directory structure

2. PR workflows — matrix over detected versions

3. Autorelease workflows — cron → config file mapping

4. Dockerfile ARG for version paths

5. Additional fixes

Adding PyTorch 2.12

Test plan

Uh oh!

sirutBuasai May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Eren-Jeager123 commented May 12, 2026 •

edited

Loading

sirutBuasai May 19, 2026 •

edited

Loading