TensorFlow 2.21 Migration - SageMaker by bhanutejagk · Pull Request #6107 · aws/deep-learning-containers

bhanutejagk · 2026-05-18T18:13:40Z

Initial migration of the TensorFlow training DLC from 2.19 (master) to 2.21 (main). Builds the SageMaker training image on AL2023 with CUDA 12.9.1 and Python 3.12, x86 only.

Includes:

pyproject.toml + uv.lock for both CUDA and CPU images (319 / 296 packages)
Dockerfile.cuda + Dockerfile.cpu (multi-stage builder->runtime->sagemaker)
Entrypoint script and version env files
PR CI workflows + image config YAMLs (build + sanity + security + telemetry; unit-test and sagemaker-test jobs scaffolded for follow-up)
Empty CVE allowlist (to be populated after first ECR scan)

Notable scope decisions:

sagemaker SDK pinned to >=3.4.0 (master uses <3); CPU image uses the pytorch-cpu index to keep NVIDIA libs out
tf-models-official and tensorflow-text dropped (no 2.21-compatible releases at time of writing)
EFA 1.48.0 + workaround symlink for NGC-mode lib/lib64 path bug
cython<3, numpy==1.26.4 retained for TF 2.21 compatibility

Test status:

Local CUDA build: 18.2 GB image, all 6 GPU smoke tests pass
Local CPU build: 5.48 GB image, all 8 CPU smoke tests pass

Out of scope (follow-up PRs):

Autorelease workflows (release: false here, true in PR 2)
Full unit test suite under test/tensorflow/
SageMaker integration tests under test/tensorflow/integration/

Purpose

Test Plan

Test Result

Toggle if you are merging into master Branch

By default, docker image builds and tests are disabled. Two ways to run builds and tests:

Using dlc_developer_config.toml
Using this PR description (currently only supported for PyTorch, TensorFlow, vllm, and base images)

How to use the helper utility for updating dlc_developer_config.toml

Assuming your remote is called origin (you can find out more with git remote -v)...

Run default builds and tests for a particular buildspec - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin

Enable specific tests for a buildspec or set of buildspecs - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp origin

Restore TOML file when ready to merge

python src/prepare_dlc_dev_environment.py -rcp origin

NOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:

sagemaker_remote_tests = true
sagemaker_efa_tests = true
sagemaker_rc_tests = true
sagemaker_local_tests = true

How to use PR description

Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:

# /buildspec <buildspec_path>
- e.g.: # /buildspec pytorch/training/buildspec.yml
- If this line is commented out, dlc_developer_config.toml will be used.
# /tests <test_list>
- e.g.: # /tests sanity security ec2
- If this line is commented out, it will run the default set of tests (same as the defaults in dlc_developer_config.toml): sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.

# /buildspec <buildspec_path>
# /tests <test_list>

Toggle if you are merging into main Branch

PR Checklist

[] I ran pre-commit run --all-files locally before creating this PR. (Read DEVELOPMENT.md for details).

Initial migration of the TensorFlow training DLC from 2.19 (master) to 2.21 (main). Builds the SageMaker training image on AL2023 with CUDA 12.6.3 and Python 3.12, x86 only. Includes: - pyproject.toml + uv.lock for both CUDA and CPU images (319 / 296 packages) - Dockerfile.cuda + Dockerfile.cpu (multi-stage builder->runtime->sagemaker) - Entrypoint script and version env files - PR CI workflows + image config YAMLs (build + sanity + security + telemetry; unit-test and sagemaker-test jobs scaffolded for follow-up) - Empty CVE allowlist (to be populated after first ECR scan) Notable scope decisions: - sagemaker SDK pinned to >=3.4.0 (master uses <3); CPU image uses the pytorch-cpu index to keep NVIDIA libs out - tf-models-official and tensorflow-text dropped (no 2.21-compatible releases at time of writing) - EFA 1.48.0 + workaround symlink for NGC-mode lib/lib64 path bug - cython<3, numpy==1.26.4 retained for TF 2.21 compatibility Test status: - Local CUDA build: 16.5 GB image, all 6 GPU smoke tests pass - Local CPU build: 5.48 GB image, all 8 CPU smoke tests pass - Unit tests + SageMaker integration tests: deferred to follow-up PR Out of scope (follow-up PRs): - Autorelease workflows (release: false here, true in PR 2) - Full unit test suite under test/tensorflow/ - SageMaker integration tests under test/tensorflow/integration/ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- chmod +x scripts/tensorflow/dockerd_entrypoint.sh (had shebang but wasn't executable) - Drop bash shebang from versions-{cuda,cpu}.env files (these are sourced, not executed) - Apply dockerfmt reformatting to both Dockerfiles (4-space → 2-space continuation indent, etc.) All changes are auto-fixes from pre-commit hooks; no logic changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dockerfmt expects no terminating newline after the final CMD line. The files end with: CMD ["/bin/bash"] (no trailing \n). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two CI failures from PR #6107 first run: 1. telemetry-test/telemetry-instance: scripts/telemetry/deep_learning_container.py restricts --framework to a fixed argparse choices= list. Our YAML used "tensorflow_runtime" (mirroring PT's "pytorch_runtime"), but only "tensorflow" is in the allowlist — argparse rejected the call, the telemetry script silently failed, no /tmp/test_request.txt was written and no instance tag was created. Switch both image config YAMLs to "tensorflow". 2. security-test/ecr-vulnerability-scan: 32 unallowlisted HIGH/CRITICAL CVEs (31 CVE + 1 GHSA, deduplicated across CUDA + CPU images). Populating the empty allowlist with placeholder reasons referencing PR #6107 to unblock CI. Reasons must be replaced with proper rationale (affected package, upstream status, mitigation context) before flipping PR to Ready-for-review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors PT main's test/pytorch/ structure. 14 new files under test/tensorflow/, plus uncomments unit-test and sagemaker-test jobs in both PR workflow YAMLs. Test files: - conftest.py + pytest.ini: scaffolding (no fixtures, bare pytest config) - unit/test_environment.py: DLC_CONTAINER_TYPE, PATH/LD_LIBRARY_PATH, KMP_AFFINITY, NCCL/EFA paths (CUDA only), cuDNN/cudart libs - unit/test_filesystem.py: /opt/ml/* SM paths, OpenMPI install, EFA install (CUDA only), /etc/nccl.conf, OSS license artifacts - unit/test_imports.py: tensorflow, keras, sagemaker SDK 3.x, boto3, yaml, packaging; nvidia.cudnn/nccl on CUDA only - unit/test_versions.py: TF + python + cuda assertions parsing versions-{cpu,cuda}.env (matches master's test_pre_release pattern) - unit/test_smoke.py: tf.linalg.matmul + GPU detection (mirrors master's testTensorflow2Standalone) - unit/test_ssh_config.py: sshd, authorized_keys, StrictHostKeyChecking - integration/TODO.md: deferred test categories (multi-node EFA, EKS, perf benchmarks) - integration/sagemaker/{conftest.py,requirements.txt}: scaffolding - integration/sagemaker/resources/mnist.py: tf.keras MNIST with MultiWorkerMirroredStrategy; reads OMPI_COMM_WORLD_RANK for rank, SM_HOSTS for cluster shape; chief-only model save gated on rank==0 - integration/sagemaker/test_sm_training_{cpu,cuda}.py: 2-node SM training via SDK v3 ModelTrainer + MPI() distribution SageMaker SDK v3 surface verified before commit: - import name is MPI (uppercase), not Mpi - ModelTrainer kwargs: training_image=, distributed=, source_code=, compute=, role= - mpi_driver invokes real `mpirun`, so OMPI_COMM_WORLD_RANK env var is set in each worker process - mpi4py 4.1.1 already in both lockfiles CPU test sets MPI(process_count_per_node=1) explicitly because CPU instances have 0 GPUs and the default would launch 0 ranks. mnist.py sets PYTHONHASHSEED + random + numpy + tf seeds to 1, mirroring PT main's torch.manual_seed(1). Cuts down accuracy-bar flakes on the 1000-sample subset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three unit-test failures from PR #6107's first sub-phase B run, all fixed using PT main's existing patterns rather than skip-based shortcuts. 1. test_sagemaker_sdk_v3 (CUDA + CPU): AttributeError: module 'sagemaker' has no attribute '__version__' → SDK v3 dropped the v2-era __version__ module attribute. Use importlib.metadata.version("sagemaker") instead. 2. test_gpu_devices_detected (CUDA only): AssertionError: no GPU devices detected on CUDA image → Unit tests run on no-GPU CodeBuild runners; we cannot assert list_physical_devices("GPU") returns non-empty there. PT doesn't have an equivalent test for the same reason. Drop it; replace with PT-pattern checks that work without a physical GPU: - test_tensorflow_built_with_cuda (build-time fact) - test_cudart_loadable (ctypes.CDLL libcudart.so) - test_cudnn_loadable (ctypes.CDLL libcudnn.so.9) - test_nvcc_on_path (shutil.which) - test_no_cuda_directory_on_cpu_image (CPU image leak guard) Real GPU runtime validation lives in sagemaker-test (runs on real GPU instances). 3. test_cuda_version (CUDA only): FileNotFoundError: '/usr/local/cuda/version.json' → AL2023 nvidia/cuda images don't ship version.json. They DO ship /usr/local/cuda as a symlink to /usr/local/cuda-X.Y. Read the symlink target and parse "cuda-X.Y" from it. Also added test_tensorflow_cuda_compile_target_forward_compatible: asserts TF's tf.sysconfig.get_build_info()["cuda_version"] is forward-compatible with versions-cuda.env CUDA_VERSION (same major, compile minor <= runtime minor). Encodes our locked decision that TF 2.21 (CUDA 12.5) on CUDA 12.6.3 base is supported. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CUDA unit-test fix: AssertionError: Could not parse CUDA version from symlink target: /etc/alternatives/cuda AL2023's `alternatives` system inserts an indirection layer: /usr/local/cuda -> /etc/alternatives/cuda -> /usr/local/cuda-12.6 os.readlink() returns only the first hop. Switch to os.path.realpath() which follows the entire chain to the final target. Mnist diagnostics for sagemaker-test: CPU sagemaker-test failed with AlgorithmError but the SDK v3 only surfaces the truncated failure_reason from CloudWatch — we cannot see what crashed inside the container. Add a _log_diagnostics() helper that prints SM_*, OMPI_*, mpi4py version, MPI rank, and TF/CUDA capability info BEFORE strategy initialisation so the truncated tail shows the failure context. Doesn't fix the root cause of the SM test failure (root cause unknown — needs CW log access we don't have). Sets us up to diagnose on the next CI run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SageMaker training tests for both CUDA and CPU images failed with the SDK v3 MPI driver bootstrap timing out: Master is attempting to connect to all workers... Cannot connect to host algo-2 ... TimeoutError: Timed out waiting for workers to be reachable. Root cause: SDK v3's MPI driver (mpi_driver.py:start_sshd_daemon) invokes /usr/sbin/sshd -D, but sshd silently exits when host keys (/etc/ssh/ssh_host_*_key) are missing. Our Dockerfiles generated only the user identity key (/root/.ssh/id_rsa), not host keys. Fix: add `ssh-keygen -A` to the SSH setup block in both runtime-base stages. This generates RSA, ECDSA, and ED25519 host keys so sshd can actually listen when the SDK driver starts it. Confirmed via direct CloudWatch log fetch of failed training jobs: - tf-mnist-mwms-cpu-p5560z8ulz1zi4-20260519183428 (CPU, both algo-1/2 logs) - tf-mnist-mwms-gpu-pcsn5li0o7yyy4-20260519183642 (GPU, same failure mode) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After Dockerfile.cuda + Dockerfile.cpu were rebuilt with the SSH host keys fix (commit 55f2cf9), `dnf upgrade --security` pulled in newer package versions and the ECR scan now flags a different CVE set: Patched (removed from allowlist): 32 CVEs (CVE-2024-21538 through CVE-2026-6322 + GHSA-5c6j-r48x-rmvq) — these are no longer detected by the scan. Newly flagged (added to allowlist): CVE-2026-45736. This refresh follows the established placeholder-rationale pattern; all entries get proper triage rationale before the PR moves out of draft. Tracked in memory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous commit c278784 trimmed the allowlist to 1 entry based on a single re-scan that I read incorrectly. The actual scan still flags ~30 CVEs each run (the dnf upgrade only reduces some, doesn't eliminate the full set). Restoring the union of all observed CVEs across runs: - 32 from initial scan (commit 128c4fd) - + CVE-2026-45736 newly flagged in latest scan = 33 entries. Triage of placeholder reasons before review still pending. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CUDA sagemaker-test on ml.g4dn.12xlarge (4 GPUs/node × 2 nodes) failed with: ValueError: The 'task_id' 3 exceeds the maximum id of worker. Failed to add port to server: ... '0.0.0.0:12345' Root cause: SageMaker's MPI() distribution launches one rank per GPU on multi-GPU instances. For g4dn.12xlarge, that's 8 total ranks. Our old _build_tf_config() built cluster.worker from SM_HOSTS only (2 entries: algo-1:12345, algo-2:12345). When ranks 2-7 set their task.index = OMPI_COMM_WORLD_RANK = 2..7, TF's MultiWorkerMirrored strategy validates against the cluster.worker length (2) and rejects indices >= 2. Plus all ranks on the same host tried to bind the same port, causing the gRPC server to fail. Fix: derive both ports and worker count from OMPI_COMM_WORLD_SIZE. Each host gets ranks_per_host = world_size / len(hosts) ports starting at 12345. So for 2 nodes × 4 GPUs: algo-1:12345-12348 (ranks 0-3), algo-2:12345-12348 (ranks 4-7) For CPU's 2 nodes × 1 rank: [algo-1:12345, algo-2:12345] (unchanged). Diagnosed via direct CloudWatch fetch from DLC CI account 404426647817 using `AWS_PROFILE=dlc-ci-ReadOnly` (account in user's Isengard). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After hours of debugging multi-node failures, source-inspection of SDK v3 (/Users/bhanugk/.local/lib/python3.12/site-packages/sagemaker/train/ container_drivers/distributed_drivers/mpi_driver.py:76-90) revealed: process_count = int(distributed_config["process_count_per_node"] or 0) process_count = get_process_count(process_count) ... num_processes=process_count # passed to mpirun -np The SDK feeds `process_count_per_node` directly to mpirun's `-np` without multiplying by `host_count`. So: - MPI(process_count_per_node=1) on 2 nodes → -np 1 (1 total rank, 1 host) - MPI() default on 4-GPU × 2 nodes → -np 4 (4 total ranks, all on host 1) This is a real SDK defect (Torchrun's driver does it correctly: torchrun_driver.py:72-86 emits both `--nnodes=$host_count` and `--nproc_per_node=$process_count`). For TF MultiWorkerMirroredStrategy the workaround is simpler: don't use MPI at all. MultiWorkerMirroredStrategy uses TF_CONFIG + gRPC natively; it doesn't need a process launcher. Changes: 1. test_sm_training_cpu.py + cuda.py: distributed=None 2. mnist.py: simplified _build_tf_config — one worker per host (not per GPU per host), task.index = SM_HOSTS.index(SM_CURRENT_HOST). Removed OMPI_COMM_WORLD_RANK/SIZE plumbing entirely. Chief detection via SM_CURRENT_HOST == hosts[0] (no MPI rank). 3. mnist.py: removed mpi4py from diagnostics (no longer relevant). For multi-GPU per host, MultiWorkerMirroredStrategy automatically uses all visible GPUs as a single worker via NCCL — no per-GPU rank needed. Diagnosed by reading SDK source after CW logs surfaced the wrong -np; confirmed mpi_driver bug by tracing mpirun command in CW logs: mpirun --host algo-1,algo-2 -np 1 ... Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Multi-worker training succeeded at strategy init (both workers connected, num_replicas_in_sync=2) but failed during model.fit() with: ValueError: Attempt to convert a value (PerReplica:{0: <tf.Tensor: ...>}) with an unsupported type (...PerReplica) to a Tensor. Root cause: passing a plain `tf.data.Dataset.from_tensor_slices` to model.fit() under MultiWorkerMirroredStrategy doesn't tell TF how to shard inputs across workers. TF wraps each batch as a PerReplica object, then chokes when conversion to a single Tensor fails. Fix: use strategy.distribute_datasets_from_function with a per-worker dataset_fn that uses input_context.num_input_pipelines / input_pipeline_id to shard correctly. Each worker builds its own non-overlapping slice locally. This is the documented pattern for MultiWorkerMirroredStrategy + non-tfrecord datasets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After previous attempt with strategy.distribute_datasets_from_function: ValueError: Attempt to convert PerReplica to a Tensor. The distribute_datasets_from_function path returns a DistributedDataset that Keras model.fit() doesn't fully unwrap when entering eager context; the per-replica batch leaks out as a PerReplica object that fails when Keras tries to convert it for an internal log/metric op. Switch to the canonical TF + Keras + MWMS pattern: pass a plain tf.data.Dataset directly. Keras auto-distributes when a strategy is active. Set tf.data.Options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA so each worker gets a non-overlapping slice via in-memory sharding (DATA mode shards by record; FILE mode would require TFRecord files). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After SDK v3 source inspection confirmed multi-node MWMS hits a real distribution gap (no TF-aware launcher, MPI driver -np bug, Keras PerReplica leak with plain datasets), restructure SM integration tests to mirror master's TF SM coverage translated to v3 ModelTrainer API. Master's TF2 training tests on master use SDK v2 (TensorFlow estimator + S3 input channel + checked-in .npy files). We can't reuse the test code (different SDK), but we can mirror the patterns. New layout: test/tensorflow/integration/sagemaker/ test_mnist_cpu.py (3 tests: single-node, multi-host smoke, MWMS skipped) test_mnist_cuda.py (4 tests: + multi-GPU MirroredStrategy) test_experiments_cpu.py (1 test: SM Experiments) test_tuning_cpu.py (1 test: SM HPO tuner) resources/scripts/mnist.py (multi-strategy entry: none|mirrored|mwms) resources/mnist/data/ (real .npy files copied byte-for-byte from master's tensorflow2_training/resources/mnist/data/) The .npy files are pre-normalized 1000-sample MNIST subsets identical to master (verified by SHA). Splitting scripts/ from mnist/data/ prevents the SDK from re-uploading 12MB of data as source code each run. Test patterns: - All tests use S3 input channel via sagemaker_session.upload_data() + InputData(channel_name="training") — mirrors master's customer-realistic flow - Single-node tests (count=1): plain Keras, no strategy - Multi-host smoke tests (count=2, no strategy): each host runs the same script independently, just verifies multi-host launcher works (mirrors master's test_distributed_mnist_no_ps) - Multi-GPU MirroredStrategy (count=1, ml.g4dn.12xlarge): single-host multi-GPU NCCL all-reduce, no SDK distribution layer needed - MWMS multi-node (count=2): @pytest.mark.skip with reason pointing to project memory (PerReplica gap; debug + re-enable required before flipping PR to ready-for-review) - test_experiments_cpu: Experiment + _Trial + _TrialComponent association - test_tuning_cpu: HyperparameterTuner over IntegerParameter epochs Workflow YAMLs updated to invoke the renamed test files. No image or pyproject changes — image still ships sagemaker>=3.4.0 (locked). MUST DO BEFORE READY-FOR-REVIEW: - Investigate + un-skip test_mnist_distributed_mwms_{cpu,gpu} - Tracked in project memory: tf-2-21-pr-6107-mwms-followup Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The TF 2.21 wheel is compiled against CUDA 12.5; we previously used nvidia/cuda:12.6.3-*-amzn2023 as the base. Bumping to 12.9.1 (top of the AL2023 escalation ladder noted in versions-cuda.env) under NVIDIA forward minor-version compatibility. This aligns with vLLM, sglang, Ray, and vLLM Omni which all run CUDA 12.9.1 on AL2023. Files changed: - docker/tensorflow/versions-cuda.env: CUDA_VERSION 12.6.3 -> 12.9.1 - docker/tensorflow/Dockerfile.cuda: ARG CUDA_VERSION default + header comment + libnvjpeg comment refreshed - .github/config/image/tensorflow-sagemaker-cuda.yml: cuda_version cu126 -> cu129, prod_image tag refreshed - .github/workflows/pr-tensorflow-sagemaker-cuda.yml: CI image URI hardcoded cu126 segment -> cu129 - test/tensorflow/unit/test_versions.py: docstring example values refreshed (test logic was already env-driven; no behavior change) cuDNN 9.3.0.75 + NCCL 2.29.7 are pip-installed (nvidia-cudnn-cu12, nvidia-nccl-cu12) and base-image-independent; no pyproject/lockfile changes needed. CPU image is unaffected. Validated locally on g5.2xlarge (A10G): - docker buildx build of Dockerfile.cuda --target sagemaker: success - Final image size 18.2 GB (vs 16.5 GB on 12.6.3, +1.7 GB expected from larger CUDA 12.9 libs) - All 6 GPU smoke tests pass: cuBLAS matmul, cuDNN Conv2D (cuDNN 9.3.0.75 loads correctly), nvjpeg decode_jpeg (no SOVERSION mismatch), libcudnn/libnccl present, mpirun.real single-wrap, bashrc telemetry survives dnf upgrade. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aws-deep-learning-containers-ci Bot added the authorized label May 18, 2026

Bhanu Teja Goshikonda and others added 15 commits May 18, 2026 11:22

Strip trailing newline from Dockerfiles per dockerfmt

a9ba6dc

dockerfmt expects no terminating newline after the final CMD line. The files end with: CMD ["/bin/bash"] (no trailing \n). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TensorFlow 2.21 Migration - SageMaker#6107

TensorFlow 2.21 Migration - SageMaker#6107
bhanutejagk wants to merge 16 commits into
mainfrom
tensorflow-2.21-currency

bhanutejagk commented May 18, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bhanutejagk commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

PR Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bhanutejagk commented May 18, 2026 •

edited

Loading