Skip to content

TensorFlow 2.21 Migration - SageMaker#6107

Open
bhanutejagk wants to merge 16 commits into
mainfrom
tensorflow-2.21-currency
Open

TensorFlow 2.21 Migration - SageMaker#6107
bhanutejagk wants to merge 16 commits into
mainfrom
tensorflow-2.21-currency

Conversation

@bhanutejagk
Copy link
Copy Markdown
Contributor

@bhanutejagk bhanutejagk commented May 18, 2026

Initial migration of the TensorFlow training DLC from 2.19 (master) to 2.21 (main). Builds the SageMaker training image on AL2023 with CUDA 12.9.1 and Python 3.12, x86 only.

Includes:

  • pyproject.toml + uv.lock for both CUDA and CPU images (319 / 296 packages)
  • Dockerfile.cuda + Dockerfile.cpu (multi-stage builder->runtime->sagemaker)
  • Entrypoint script and version env files
  • PR CI workflows + image config YAMLs (build + sanity + security + telemetry; unit-test and sagemaker-test jobs scaffolded for follow-up)
  • Empty CVE allowlist (to be populated after first ECR scan)

Notable scope decisions:

  • sagemaker SDK pinned to >=3.4.0 (master uses <3); CPU image uses the pytorch-cpu index to keep NVIDIA libs out
  • tf-models-official and tensorflow-text dropped (no 2.21-compatible releases at time of writing)
  • EFA 1.48.0 + workaround symlink for NGC-mode lib/lib64 path bug
  • cython<3, numpy==1.26.4 retained for TF 2.21 compatibility

Test status:

  • Local CUDA build: 18.2 GB image, all 6 GPU smoke tests pass
  • Local CPU build: 5.48 GB image, all 8 CPU smoke tests pass

Out of scope (follow-up PRs):

  • Autorelease workflows (release: false here, true in PR 2)
  • Full unit test suite under test/tensorflow/
  • SageMaker integration tests under test/tensorflow/integration/

Purpose

Test Plan

Test Result


Toggle if you are merging into master Branch

By default, docker image builds and tests are disabled. Two ways to run builds and tests:

  1. Using dlc_developer_config.toml
  2. Using this PR description (currently only supported for PyTorch, TensorFlow, vllm, and base images)
How to use the helper utility for updating dlc_developer_config.toml

Assuming your remote is called origin (you can find out more with git remote -v)...

  • Run default builds and tests for a particular buildspec - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin

  • Enable specific tests for a buildspec or set of buildspecs - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp origin

  • Restore TOML file when ready to merge

python src/prepare_dlc_dev_environment.py -rcp origin

NOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:

  • sagemaker_remote_tests = true
  • sagemaker_efa_tests = true
  • sagemaker_rc_tests = true
  • sagemaker_local_tests = true
How to use PR description Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:
  • # /buildspec <buildspec_path>
    • e.g.: # /buildspec pytorch/training/buildspec.yml
    • If this line is commented out, dlc_developer_config.toml will be used.
  • # /tests <test_list>
    • e.g.: # /tests sanity security ec2
    • If this line is commented out, it will run the default set of tests (same as the defaults in dlc_developer_config.toml): sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.
# /buildspec <buildspec_path>
# /tests <test_list>
Toggle if you are merging into main Branch

PR Checklist

  • [] I ran pre-commit run --all-files locally before creating this PR. (Read DEVELOPMENT.md for details).

Initial migration of the TensorFlow training DLC from 2.19 (master)
to 2.21 (main). Builds the SageMaker training image on AL2023 with
CUDA 12.6.3 and Python 3.12, x86 only.

Includes:
- pyproject.toml + uv.lock for both CUDA and CPU images (319 / 296
  packages)
- Dockerfile.cuda + Dockerfile.cpu (multi-stage builder->runtime->sagemaker)
- Entrypoint script and version env files
- PR CI workflows + image config YAMLs (build + sanity + security +
  telemetry; unit-test and sagemaker-test jobs scaffolded for follow-up)
- Empty CVE allowlist (to be populated after first ECR scan)

Notable scope decisions:
- sagemaker SDK pinned to >=3.4.0 (master uses <3); CPU image uses
  the pytorch-cpu index to keep NVIDIA libs out
- tf-models-official and tensorflow-text dropped (no 2.21-compatible
  releases at time of writing)
- EFA 1.48.0 + workaround symlink for NGC-mode lib/lib64 path bug
- cython<3, numpy==1.26.4 retained for TF 2.21 compatibility

Test status:
- Local CUDA build: 16.5 GB image, all 6 GPU smoke tests pass
- Local CPU build: 5.48 GB image, all 8 CPU smoke tests pass
- Unit tests + SageMaker integration tests: deferred to follow-up PR

Out of scope (follow-up PRs):
- Autorelease workflows (release: false here, true in PR 2)
- Full unit test suite under test/tensorflow/
- SageMaker integration tests under test/tensorflow/integration/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bhanu Teja Goshikonda and others added 15 commits May 18, 2026 11:22
- chmod +x scripts/tensorflow/dockerd_entrypoint.sh (had shebang but
  wasn't executable)
- Drop bash shebang from versions-{cuda,cpu}.env files (these are
  sourced, not executed)
- Apply dockerfmt reformatting to both Dockerfiles (4-space → 2-space
  continuation indent, etc.)

All changes are auto-fixes from pre-commit hooks; no logic changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dockerfmt expects no terminating newline after the final CMD line.
The files end with: CMD ["/bin/bash"] (no trailing \n).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CI failures from PR #6107 first run:

1. telemetry-test/telemetry-instance: scripts/telemetry/deep_learning_container.py
   restricts --framework to a fixed argparse choices= list. Our YAML used
   "tensorflow_runtime" (mirroring PT's "pytorch_runtime"), but only
   "tensorflow" is in the allowlist — argparse rejected the call, the
   telemetry script silently failed, no /tmp/test_request.txt was written
   and no instance tag was created. Switch both image config YAMLs to
   "tensorflow".

2. security-test/ecr-vulnerability-scan: 32 unallowlisted HIGH/CRITICAL
   CVEs (31 CVE + 1 GHSA, deduplicated across CUDA + CPU images).
   Populating the empty allowlist with placeholder reasons referencing
   PR #6107 to unblock CI. Reasons must be replaced with proper rationale
   (affected package, upstream status, mitigation context) before
   flipping PR to Ready-for-review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors PT main's test/pytorch/ structure. 14 new files under
test/tensorflow/, plus uncomments unit-test and sagemaker-test jobs in
both PR workflow YAMLs.

Test files:
- conftest.py + pytest.ini: scaffolding (no fixtures, bare pytest config)
- unit/test_environment.py: DLC_CONTAINER_TYPE, PATH/LD_LIBRARY_PATH,
  KMP_AFFINITY, NCCL/EFA paths (CUDA only), cuDNN/cudart libs
- unit/test_filesystem.py: /opt/ml/* SM paths, OpenMPI install, EFA
  install (CUDA only), /etc/nccl.conf, OSS license artifacts
- unit/test_imports.py: tensorflow, keras, sagemaker SDK 3.x, boto3,
  yaml, packaging; nvidia.cudnn/nccl on CUDA only
- unit/test_versions.py: TF + python + cuda assertions parsing
  versions-{cpu,cuda}.env (matches master's test_pre_release pattern)
- unit/test_smoke.py: tf.linalg.matmul + GPU detection (mirrors master's
  testTensorflow2Standalone)
- unit/test_ssh_config.py: sshd, authorized_keys, StrictHostKeyChecking
- integration/TODO.md: deferred test categories (multi-node EFA, EKS,
  perf benchmarks)
- integration/sagemaker/{conftest.py,requirements.txt}: scaffolding
- integration/sagemaker/resources/mnist.py: tf.keras MNIST with
  MultiWorkerMirroredStrategy; reads OMPI_COMM_WORLD_RANK for rank,
  SM_HOSTS for cluster shape; chief-only model save gated on rank==0
- integration/sagemaker/test_sm_training_{cpu,cuda}.py: 2-node SM
  training via SDK v3 ModelTrainer + MPI() distribution

SageMaker SDK v3 surface verified before commit:
- import name is MPI (uppercase), not Mpi
- ModelTrainer kwargs: training_image=, distributed=, source_code=,
  compute=, role=
- mpi_driver invokes real `mpirun`, so OMPI_COMM_WORLD_RANK env var is
  set in each worker process
- mpi4py 4.1.1 already in both lockfiles

CPU test sets MPI(process_count_per_node=1) explicitly because CPU
instances have 0 GPUs and the default would launch 0 ranks.

mnist.py sets PYTHONHASHSEED + random + numpy + tf seeds to 1, mirroring
PT main's torch.manual_seed(1). Cuts down accuracy-bar flakes on the
1000-sample subset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three unit-test failures from PR #6107's first sub-phase B run, all
fixed using PT main's existing patterns rather than skip-based shortcuts.

1. test_sagemaker_sdk_v3 (CUDA + CPU):
   AttributeError: module 'sagemaker' has no attribute '__version__'
   → SDK v3 dropped the v2-era __version__ module attribute.
   Use importlib.metadata.version("sagemaker") instead.

2. test_gpu_devices_detected (CUDA only):
   AssertionError: no GPU devices detected on CUDA image
   → Unit tests run on no-GPU CodeBuild runners; we cannot assert
   list_physical_devices("GPU") returns non-empty there. PT doesn't
   have an equivalent test for the same reason. Drop it; replace
   with PT-pattern checks that work without a physical GPU:
   - test_tensorflow_built_with_cuda  (build-time fact)
   - test_cudart_loadable             (ctypes.CDLL libcudart.so)
   - test_cudnn_loadable              (ctypes.CDLL libcudnn.so.9)
   - test_nvcc_on_path                (shutil.which)
   - test_no_cuda_directory_on_cpu_image (CPU image leak guard)
   Real GPU runtime validation lives in sagemaker-test (runs on real
   GPU instances).

3. test_cuda_version (CUDA only):
   FileNotFoundError: '/usr/local/cuda/version.json'
   → AL2023 nvidia/cuda images don't ship version.json. They DO ship
   /usr/local/cuda as a symlink to /usr/local/cuda-X.Y. Read the
   symlink target and parse "cuda-X.Y" from it.

   Also added test_tensorflow_cuda_compile_target_forward_compatible:
   asserts TF's tf.sysconfig.get_build_info()["cuda_version"] is
   forward-compatible with versions-cuda.env CUDA_VERSION (same major,
   compile minor <= runtime minor). Encodes our locked decision that
   TF 2.21 (CUDA 12.5) on CUDA 12.6.3 base is supported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CUDA unit-test fix:
  AssertionError: Could not parse CUDA version from symlink target:
  /etc/alternatives/cuda

  AL2023's `alternatives` system inserts an indirection layer:
    /usr/local/cuda -> /etc/alternatives/cuda -> /usr/local/cuda-12.6

  os.readlink() returns only the first hop. Switch to os.path.realpath()
  which follows the entire chain to the final target.

Mnist diagnostics for sagemaker-test:
  CPU sagemaker-test failed with AlgorithmError but the SDK v3 only
  surfaces the truncated failure_reason from CloudWatch — we cannot
  see what crashed inside the container. Add a _log_diagnostics()
  helper that prints SM_*, OMPI_*, mpi4py version, MPI rank, and
  TF/CUDA capability info BEFORE strategy initialisation so the
  truncated tail shows the failure context.

  Doesn't fix the root cause of the SM test failure (root cause
  unknown — needs CW log access we don't have). Sets us up to
  diagnose on the next CI run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SageMaker training tests for both CUDA and CPU images failed with
the SDK v3 MPI driver bootstrap timing out:

  Master is attempting to connect to all workers...
  Cannot connect to host algo-2
  ...
  TimeoutError: Timed out waiting for workers to be reachable.

Root cause: SDK v3's MPI driver (mpi_driver.py:start_sshd_daemon)
invokes /usr/sbin/sshd -D, but sshd silently exits when host keys
(/etc/ssh/ssh_host_*_key) are missing. Our Dockerfiles generated only
the user identity key (/root/.ssh/id_rsa), not host keys.

Fix: add `ssh-keygen -A` to the SSH setup block in both runtime-base
stages. This generates RSA, ECDSA, and ED25519 host keys so sshd can
actually listen when the SDK driver starts it.

Confirmed via direct CloudWatch log fetch of failed training jobs:
- tf-mnist-mwms-cpu-p5560z8ulz1zi4-20260519183428 (CPU, both algo-1/2 logs)
- tf-mnist-mwms-gpu-pcsn5li0o7yyy4-20260519183642 (GPU, same failure mode)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After Dockerfile.cuda + Dockerfile.cpu were rebuilt with the SSH host
keys fix (commit 55f2cf9), `dnf upgrade --security` pulled in newer
package versions and the ECR scan now flags a different CVE set:

  Patched (removed from allowlist): 32 CVEs (CVE-2024-21538 through
  CVE-2026-6322 + GHSA-5c6j-r48x-rmvq) — these are no longer detected
  by the scan.

  Newly flagged (added to allowlist): CVE-2026-45736.

This refresh follows the established placeholder-rationale pattern;
all entries get proper triage rationale before the PR moves out of
draft. Tracked in memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit c278784 trimmed the allowlist to 1 entry based on a
single re-scan that I read incorrectly. The actual scan still flags
~30 CVEs each run (the dnf upgrade only reduces some, doesn't eliminate
the full set).

Restoring the union of all observed CVEs across runs:
- 32 from initial scan (commit 128c4fd)
- + CVE-2026-45736 newly flagged in latest scan
= 33 entries.

Triage of placeholder reasons before review still pending.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CUDA sagemaker-test on ml.g4dn.12xlarge (4 GPUs/node × 2 nodes) failed
with:

  ValueError: The 'task_id' 3 exceeds the maximum id of worker.
  Failed to add port to server: ... '0.0.0.0:12345'

Root cause: SageMaker's MPI() distribution launches one rank per GPU
on multi-GPU instances. For g4dn.12xlarge, that's 8 total ranks. Our
old _build_tf_config() built cluster.worker from SM_HOSTS only (2
entries: algo-1:12345, algo-2:12345). When ranks 2-7 set their
task.index = OMPI_COMM_WORLD_RANK = 2..7, TF's MultiWorkerMirrored
strategy validates against the cluster.worker length (2) and rejects
indices >= 2. Plus all ranks on the same host tried to bind the same
port, causing the gRPC server to fail.

Fix: derive both ports and worker count from OMPI_COMM_WORLD_SIZE.
Each host gets ranks_per_host = world_size / len(hosts) ports starting
at 12345. So for 2 nodes × 4 GPUs:
  algo-1:12345-12348 (ranks 0-3), algo-2:12345-12348 (ranks 4-7)
For CPU's 2 nodes × 1 rank: [algo-1:12345, algo-2:12345] (unchanged).

Diagnosed via direct CloudWatch fetch from DLC CI account 404426647817
using `AWS_PROFILE=dlc-ci-ReadOnly` (account in user's Isengard).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After hours of debugging multi-node failures, source-inspection of SDK v3
(/Users/bhanugk/.local/lib/python3.12/site-packages/sagemaker/train/
container_drivers/distributed_drivers/mpi_driver.py:76-90) revealed:

  process_count = int(distributed_config["process_count_per_node"] or 0)
  process_count = get_process_count(process_count)
  ...
  num_processes=process_count   # passed to mpirun -np

The SDK feeds `process_count_per_node` directly to mpirun's `-np` without
multiplying by `host_count`. So:
  - MPI(process_count_per_node=1) on 2 nodes → -np 1 (1 total rank, 1 host)
  - MPI() default on 4-GPU × 2 nodes → -np 4 (4 total ranks, all on host 1)

This is a real SDK defect (Torchrun's driver does it correctly:
torchrun_driver.py:72-86 emits both `--nnodes=$host_count` and
`--nproc_per_node=$process_count`). For TF MultiWorkerMirroredStrategy
the workaround is simpler: don't use MPI at all. MultiWorkerMirroredStrategy
uses TF_CONFIG + gRPC natively; it doesn't need a process launcher.

Changes:
1. test_sm_training_cpu.py + cuda.py: distributed=None
2. mnist.py: simplified _build_tf_config — one worker per host (not per
   GPU per host), task.index = SM_HOSTS.index(SM_CURRENT_HOST). Removed
   OMPI_COMM_WORLD_RANK/SIZE plumbing entirely. Chief detection via
   SM_CURRENT_HOST == hosts[0] (no MPI rank).
3. mnist.py: removed mpi4py from diagnostics (no longer relevant).

For multi-GPU per host, MultiWorkerMirroredStrategy automatically uses
all visible GPUs as a single worker via NCCL — no per-GPU rank needed.

Diagnosed by reading SDK source after CW logs surfaced the wrong -np;
confirmed mpi_driver bug by tracing mpirun command in CW logs:
  mpirun --host algo-1,algo-2 -np 1 ...

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Multi-worker training succeeded at strategy init (both workers connected,
num_replicas_in_sync=2) but failed during model.fit() with:

  ValueError: Attempt to convert a value (PerReplica:{0: <tf.Tensor: ...>})
  with an unsupported type (...PerReplica) to a Tensor.

Root cause: passing a plain `tf.data.Dataset.from_tensor_slices` to
model.fit() under MultiWorkerMirroredStrategy doesn't tell TF how to
shard inputs across workers. TF wraps each batch as a PerReplica object,
then chokes when conversion to a single Tensor fails.

Fix: use strategy.distribute_datasets_from_function with a per-worker
dataset_fn that uses input_context.num_input_pipelines /
input_pipeline_id to shard correctly. Each worker builds its own
non-overlapping slice locally.

This is the documented pattern for MultiWorkerMirroredStrategy +
non-tfrecord datasets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After previous attempt with strategy.distribute_datasets_from_function:

  ValueError: Attempt to convert PerReplica to a Tensor.

The distribute_datasets_from_function path returns a DistributedDataset
that Keras model.fit() doesn't fully unwrap when entering eager context;
the per-replica batch leaks out as a PerReplica object that fails when
Keras tries to convert it for an internal log/metric op.

Switch to the canonical TF + Keras + MWMS pattern: pass a plain
tf.data.Dataset directly. Keras auto-distributes when a strategy is
active. Set tf.data.Options.experimental_distribute.auto_shard_policy =
AutoShardPolicy.DATA so each worker gets a non-overlapping slice via
in-memory sharding (DATA mode shards by record; FILE mode would require
TFRecord files).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After SDK v3 source inspection confirmed multi-node MWMS hits a real
distribution gap (no TF-aware launcher, MPI driver -np bug, Keras
PerReplica leak with plain datasets), restructure SM integration tests
to mirror master's TF SM coverage translated to v3 ModelTrainer API.

Master's TF2 training tests on master use SDK v2 (TensorFlow
estimator + S3 input channel + checked-in .npy files). We can't reuse
the test code (different SDK), but we can mirror the patterns.

New layout:
  test/tensorflow/integration/sagemaker/
    test_mnist_cpu.py            (3 tests: single-node, multi-host smoke, MWMS skipped)
    test_mnist_cuda.py           (4 tests: + multi-GPU MirroredStrategy)
    test_experiments_cpu.py      (1 test: SM Experiments)
    test_tuning_cpu.py           (1 test: SM HPO tuner)
    resources/scripts/mnist.py   (multi-strategy entry: none|mirrored|mwms)
    resources/mnist/data/        (real .npy files copied byte-for-byte from
                                  master's tensorflow2_training/resources/mnist/data/)

The .npy files are pre-normalized 1000-sample MNIST subsets identical
to master (verified by SHA). Splitting scripts/ from mnist/data/
prevents the SDK from re-uploading 12MB of data as source code each run.

Test patterns:
- All tests use S3 input channel via sagemaker_session.upload_data()
  + InputData(channel_name="training") — mirrors master's customer-realistic flow
- Single-node tests (count=1): plain Keras, no strategy
- Multi-host smoke tests (count=2, no strategy): each host runs the same
  script independently, just verifies multi-host launcher works (mirrors
  master's test_distributed_mnist_no_ps)
- Multi-GPU MirroredStrategy (count=1, ml.g4dn.12xlarge): single-host
  multi-GPU NCCL all-reduce, no SDK distribution layer needed
- MWMS multi-node (count=2): @pytest.mark.skip with reason pointing to
  project memory (PerReplica gap; debug + re-enable required before
  flipping PR to ready-for-review)
- test_experiments_cpu: Experiment + _Trial + _TrialComponent association
- test_tuning_cpu: HyperparameterTuner over IntegerParameter epochs

Workflow YAMLs updated to invoke the renamed test files. No image
or pyproject changes — image still ships sagemaker>=3.4.0 (locked).

MUST DO BEFORE READY-FOR-REVIEW:
- Investigate + un-skip test_mnist_distributed_mwms_{cpu,gpu}
- Tracked in project memory: tf-2-21-pr-6107-mwms-followup

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TF 2.21 wheel is compiled against CUDA 12.5; we previously used
nvidia/cuda:12.6.3-*-amzn2023 as the base. Bumping to 12.9.1 (top of
the AL2023 escalation ladder noted in versions-cuda.env) under NVIDIA
forward minor-version compatibility. This aligns with vLLM, sglang,
Ray, and vLLM Omni which all run CUDA 12.9.1 on AL2023.

Files changed:
- docker/tensorflow/versions-cuda.env: CUDA_VERSION 12.6.3 -> 12.9.1
- docker/tensorflow/Dockerfile.cuda: ARG CUDA_VERSION default + header
  comment + libnvjpeg comment refreshed
- .github/config/image/tensorflow-sagemaker-cuda.yml: cuda_version
  cu126 -> cu129, prod_image tag refreshed
- .github/workflows/pr-tensorflow-sagemaker-cuda.yml: CI image URI
  hardcoded cu126 segment -> cu129
- test/tensorflow/unit/test_versions.py: docstring example values
  refreshed (test logic was already env-driven; no behavior change)

cuDNN 9.3.0.75 + NCCL 2.29.7 are pip-installed (nvidia-cudnn-cu12,
nvidia-nccl-cu12) and base-image-independent; no pyproject/lockfile
changes needed. CPU image is unaffected.

Validated locally on g5.2xlarge (A10G):
- docker buildx build of Dockerfile.cuda --target sagemaker: success
- Final image size 18.2 GB (vs 16.5 GB on 12.6.3, +1.7 GB expected
  from larger CUDA 12.9 libs)
- All 6 GPU smoke tests pass: cuBLAS matmul, cuDNN Conv2D (cuDNN
  9.3.0.75 loads correctly), nvjpeg decode_jpeg (no SOVERSION
  mismatch), libcudnn/libnccl present, mpirun.real single-wrap,
  bashrc telemetry survives dnf upgrade.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant