feat: add DETR-ResNet50 object detection fine-tuning test case by aravneelaws · Pull Request #1068 · awslabs/awsome-distributed-ai

aravneelaws · 2026-04-28T20:44:21Z

Purpose

Adds a new test case for fine-tuning DETR-ResNet50 for object detection on Amazon SageMaker HyperPod EKS, demonstrating the Qualcomm AI Hub + AWS integration for cloud-to-edge ML workflows.

Changes

Add detr_main.py: PyTorch DDP training script for DETR-ResNet50 fine-tuning with torchmetrics mAP evaluation, Qualcomm AI Hub model wrapper with custom detection heads, checkpoint resume support, and distributed training via torchrun
Add Dockerfile: Container image with pinned dependencies (torch 2.5.1, qai-hub-models 0.30.2, transformers 4.51.3, torchmetrics 1.6.1) and pre-downloaded model weights for air-gapped operation
Add kubernetes/detr-resnet50-finetune.yaml-template: Parameterized PyTorchJob manifest with elastic policy, EFA networking, NCCL tuning, and automatic checkpoint resume
Add README.md: Overview, architecture, prerequisites, training arguments, and expected results
Add kubernetes/README.md: Step-by-step deployment guide with PVC naming guidance
Add data/README.md: Dataset preparation instructions for Supervisely Supermarket Shelves format
Add .gitignore

Test Plan

Environment:

AWS Service: Amazon SageMaker HyperPod EKS
Instance type: ml.g5.8xlarge (NVIDIA A10G)
Number of nodes: 2

Test commands:

# Build and push container image
docker build -t ${REGISTRY}detr-resnet50-finetune:v3 -f Dockerfile .
docker push ${REGISTRY}detr-resnet50-finetune:v3

# Generate and deploy training manifest
export IMAGE_URI=${REGISTRY}detr-resnet50-finetune:v3
export NUM_NODES=2
export INSTANCE_TYPE=ml.g5.8xlarge
envsubst '$IMAGE_URI $NUM_NODES $INSTANCE_TYPE' \
  < kubernetes/detr-resnet50-finetune.yaml-template \
  > detr-resnet50-finetune.yaml
kubectl apply -f detr-resnet50-finetune.yaml

Test Results

Distributed training verified with 2x ml.g5.8xlarge nodes over EFA:

DDP confirmed: DistributedSampler active (5 train batches/worker, split from 9 single-GPU), NCCL process group initialized over EFA with Libfabric
Loss: Decreased from ~1.45 (epoch 1) to ~1.25 (epoch 22) on the 36-image Supermarket Shelves dataset
Checkpoint resume: Verified — training correctly resumes from /fsx/checkpoint/checkpoint.pth.tar on restart
Air-gapped operation: Model weights baked into Docker image; no internet access required from training pods (TRANSFORMERS_OFFLINE=1)
Restart behavior: restartPolicy: ExitCode with maxRestarts: 3 prevents infinite restart loop on successful completion

Directory Structure

3.test_cases/
└── pytorch/
    └── detr-finetune/
        ├── Dockerfile
        ├── README.md
        ├── detr_main.py
        ├── .gitignore
        ├── data/
        │  └── README.md
        └── kubernetes/
           ├── README.md
           └── detr-resnet50-finetune.yaml-template

Checklist

I have read the contributing guidelines (https://github.com/awslabs/awsome-distributed-training/blob/main/CONTRIBUTING.md).
I am working against the latest main branch.
I have searched existing open and recently merged PRs to confirm this is not a duplicate.
The contribution is self-contained with documentation and scripts.
External dependencies are pinned to a specific version or tag (no latest).
A README is included or updated with prerequisites, instructions, and known issues.
New test cases follow the expected directory structure (#directory-structure).

Add a new PyTorch DDP test case for fine-tuning DETR-ResNet50 on a custom object detection dataset (supermarket shelves) using Qualcomm AI Hub pre-trained weights. Designed for distributed training on SageMaker HyperPod with EKS orchestration. Key components: - Training script with torchmetrics mAP evaluation - Dockerfile with HPC-optimized base image (EFA + NCCL) - Kubeflow PyTorchJob manifest with envsubst parameterization - Dataset preparation and Kubernetes deployment documentation Tested on 2x ml.g5.8xlarge with elastic scaling (2-36 replicas).

- Bump qai-hub-models to 0.30.2 for torch 2.5.1 compatibility - Bake DETR-ResNet50 weights into Docker image for air-gapped operation - Add TRANSFORMERS_OFFLINE/HF_HUB_OFFLINE env vars to YAML template - Remove model-cache emptyDir volume that overwrote baked-in weights - Remove redundant runtime model pre-download step - Clarify model weight provenance (HuggingFace Hub via QAI Hub wrapper) - Add PVC naming guidance (fsx-pvc default, sed command for alternatives) - Note that Docker build requires internet for weight download

- Mention SageMaker HyperPod EKS alongside vanilla EKS in prerequisites - Note that Kubeflow Training Operator is pre-installed on HyperPod - Remove visible  directive from README titles - Correct expected validation loss from ~0.64 to ~1.24 - Remove unverified training time estimate - Change restartPolicy from OnFailure to ExitCode to prevent infinite loop - Reduce maxRestarts from 100 to 3 - Auto-resume from checkpoint on legitimate restarts

- Fix DDP: change --dist-url default to env:// and pass it in YAML template - Fix LOCAL_RANK: read from environment when launched via torchrun - Fix DDP unused params: add find_unused_parameters=True (custom heads replace original COCO heads, leaving original parameters unused) - Fix RandomHorizontalFlip: remove it since it flips images but not bounding box coordinates, causing misaligned training data - Remove unused import numpy, QAI_HUB_MODEL_ID env var, /workspace/src from PYTHONPATH - Change maxReplicas from hardcoded 36 to $NUM_NODES - Clarify batch-size default in README (code=8, YAML template=4) Tested on 2x ml.g5.8xlarge HyperPod EKS cluster: confirmed DDP active via DistributedSampler (5 batches/worker vs 9 single-GPU), NCCL over EFA, and loss decreasing from ~1.45 to ~1.25 over 22 epochs.

KeitaW

Review Batch 1/5 — Structure & Repository Hygiene

Thanks for the detailed PR description and the test plan. Posting in 5 themed batches; this first one covers repo-hygiene items that touch multiple files.

Directory placement under `pytorch/` rather than under a library

Path: 3.test_cases/pytorch/detr-finetune/ (whole tree)

The convention documented in the project's is 3.test_cases/<framework>/<library>/[<model>/]. Looking at the existing siblings (pytorch/ddp/, pytorch/FSDP/, pytorch/deepspeed/, pytorch/distillation/, pytorch/torchtitan/, …) the second level is normally a parallelism strategy or a library, and the model lives one level deeper. Since this test case is plain DDP, I'd suggest moving it to 3.test_cases/pytorch/ddp/detr-finetune/ to match precedent — that keeps pytorch/<library>/<model> shape consistent and makes it discoverable alongside the other DDP examples. A few existing entries (nanoVLM, verl, picotron) do sit directly under pytorch/, so there is some prior inconsistency, but matching the dominant pattern would be my preference.

Missing copyright/license headers

Files: Dockerfile, detr_main.py, kubernetes/detr-resnet50-finetune.yaml-template, all the new READMEs.

All new contributions should carry the license header:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

(adjust comment marker per file type —  for markdown). I noticed none of the new files in this PR include it — could you add the header to the Dockerfile, the training script, the YAML template, and the markdown files? It's a small change but the CI lint and downstream license tooling key on it.

KeitaW · 2026-04-29T09:51:17Z

@@ -0,0 +1,56 @@
+# DETR-ResNet50 Object Detection Training Container
+# Base image: HPC-optimized with pre-configured EFA and NCCL for distributed training
+FROM public.ecr.aws/hpc-cloud/nccl-tests:latest


Base image pinned to :latest

Per the contributing guidelines and the CI version-check workflow, container image tags must be pinned to a specific version or commit — never latest. This matters because the nccl-tests base is rebuilt periodically with new EFA/NCCL/CUDA versions, and using :latest means a future rebuild of this Dockerfile could land on a stack that silently changes the EFA installer version, NCCL version, or CUDA version, all of which the repo's CI explicitly enforces minimums for (EFA >= 1.47.0, NCCL >= 2.28, CUDA >= 13.0).

Could you pin to a specific tag? You can list available tags with aws ecr-public describe-image-tags --repository-name nccl-tests --registry-id 098967265814, or pick the digest used during your verified test run. Other test cases in the repo (e.g., 3.test_cases/pytorch/FSDP/Dockerfile) use tagged base images — that's good precedent to follow.

:latest is consistent with all other test cases using this base image (FSDP, DDP, nanoVLM, trl, distillation). The images I build in this will be pinned but I will keep it consistent with others here.

KeitaW

Review Batch 2/5 — Deployment Pipeline (K8s)

Findings on the PyTorchJob template and Kubernetes deployment instructions.

KeitaW · 2026-04-29T09:51:50Z

+              - name: NCCL_DEBUG
+                value: "INFO"
+              - name: NCCL_SOCKET_IFNAME
+                value: "eth0"


NCCL_SOCKET_IFNAME=eth0 breaks on EFA-equipped instances

This is the most important finding for me. Positive interface selection like eth0 is fragile on AWS GPU instances: instances with EFA enroll additional interfaces (efa0, eth1, etc.) and on some AMIs the primary may not be named eth0 (e.g., predictable interface names like ens5). The repo convention (see the EFA cheatsheet and micro-benchmarks/nccl-tests/nccl-tests.Dockerfile) is exclusion-based.

The fact that your verified test run worked on ml.g5.8xlarge is consistent with that AMI happening to expose eth0, but it isn't portable. I'd suggest switching to ^lo so this template works on the broader instance family and on future AMIs.

Suggested change

value: "eth0"

value: "^lo"

KeitaW · 2026-04-29T09:51:50Z

+              resources:
+                requests:
+                  nvidia.com/gpu: 1
+                  vpc.amazonaws.com/efa: 1
+                limits:
+                  nvidia.com/gpu: 1
+                  vpc.amazonaws.com/efa: 1


Worker manifest does not declare CPU/memory requests

I noticed CPU and memory aren't requested. For a single-GPU PyTorchJob on ml.g5.8xlarge (32 vCPU, 128 GiB) this usually works fine, but without a CPU request the kubelet won't reserve resources and a sidecar / DaemonSet competing for the node can cause OOM-killed dataloader workers. (Side note: the elasticPolicy.metrics block above references CPU utilisation; in practice Training Operator honours minReplicas/maxReplicas/rdzvBackend for elastic Worker replicas and largely ignores the metrics target, so this isn't load-bearing — just worth a CPU request for kubelet accounting.)

Suggested change

resources:

requests:

nvidia.com/gpu: 1

vpc.amazonaws.com/efa: 1

limits:

nvidia.com/gpu: 1

vpc.amazonaws.com/efa: 1

resources:

requests:

cpu: "8"

memory: "32Gi"

nvidia.com/gpu: 1

vpc.amazonaws.com/efa: 1

limits:

cpu: "8"

memory: "32Gi"

nvidia.com/gpu: 1

vpc.amazonaws.com/efa: 1

KeitaW · 2026-04-29T09:51:50Z

+export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
+export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
+export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
+docker build -t ${REGISTRY}detr-finetune:latest -f Dockerfile .


Image tag :latest in the deployment example

The PR body shows you used detr-resnet50-finetune:v3 for the actual verified run. I'd suggest aligning the docs with that — using a versioned tag (or a ${VERSION} variable) makes the example consistent with the no-latest rule and prevents readers from accidentally picking up a stale image when iterating. The same applies to lines 61 and 69 below (docker image push and IMAGE_URI).

Side note: the PR body uses detr-resnet50-finetune while this README uses detr-finetune for the ECR repo name — picking one and using it consistently across the README, the directory, and the YAML default would make the docs easier to follow.

Suggested change

docker build -t ${REGISTRY}detr-finetune:latest -f Dockerfile .

docker build -t ${REGISTRY}detr-finetune:v1 -f Dockerfile .

KeitaW · 2026-04-29T09:51:50Z

+
+```bash
+# Create repository if needed
+REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"detr-finetune\" | wc -l)


aws ecr describe-repositories | grep ... is fragile

This relies on the literal quoting in the JSON output — it works today but it'll match any other repo whose name contains detr-finetune as a substring, and it'll silently fail if AWS ever changes JSON formatting. The cleaner pattern other test cases use is to query the specific repo by name and create on failure.

Suggested change

REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"detr-finetune\" | wc -l)

aws ecr describe-repositories --repository-names detr-finetune \

>/dev/null 2>&1 || aws ecr create-repository --repository-name detr-finetune

KeitaW

Review Batch 3/5 — Training Code Quality

Findings on detr_main.py. The most consequential here are the checkpoint-save gate (every rank writes the same file under torchrun) and the resume path's dependency on original_state_dict. The simplified positional DETR loss is intentional for the workshop demo, but worth calling out so the next reader doesn't carry it into a production setting unchanged.

KeitaW · 2026-04-29T09:53:03Z

+        if not args.multiprocessing_distributed or (
+            args.multiprocessing_distributed and args.rank % ngpus_per_node == 0
+        ):


Checkpoint save fires on every rank when launched via torchrun

The YAML template launches via python3 -m torch.distributed.run (i.e. torchrun) without setting --multiprocessing-distributed, so args.multiprocessing_distributed is False (action="store_true", default False). With False, the gate short-circuits to True on every rank, so every worker pod calls torch.save(...) against /fsx/checkpoint/checkpoint.pth.tar simultaneously. After DDP all-reduce the parameters are identical across ranks, so the harm isn't divergent content — it's a write race on the same file (and a second race on shutil.copyfile to model_best.pth.tar). Lustre is mostly forgiving but the race can yield truncated reads on the next resume, and there's no dist.barrier() between save and the start of the next epoch.

References:

PyTorch DDP tutorial — Save and Load Checkpoints

PyTorch DDP design notes

PyTorch ImageNet example main.py — canonical rank-0 pattern

Suggested change

if not args.multiprocessing_distributed or (

args.multiprocessing_distributed and args.rank % ngpus_per_node == 0

):

is_rank_zero = (not args.distributed) or args.rank == 0

if is_rank_zero:

KeitaW · 2026-04-29T09:53:03Z

+        scheduler.step()
+
+        # Track best validation loss
+        is_best = final_val_loss < best_loss


Validation loss is rank-local; "best" decision diverges across ranks

Each rank's DistributedSampler(val_dataset, shuffle=False) (line 831) shards validation across workers, so validate() returns the loss over only that rank's shard. There's no dist.all_reduce on losses.avg before this is_best comparison. With the checkpoint-save bug above unfixed, two ranks may both decide they are best on different epochs and both copy to model_best.pth.tar. Even after fixing the gate, the saved checkpoint reflects only rank 0's view of the data.

Could you all-reduce losses.sum and losses.count (or losses.avg) across the process group before deciding is_best? Something like:

if args.distributed: t = torch.tensor([losses.sum, losses.count], device=f"cuda:{args.gpu}") dist.all_reduce(t, op=dist.ReduceOp.SUM) losses.avg = (t[0] / t[1]).item()

The same applies to evaluate_map() — MeanAveragePrecision.compute() is being called on a rank-local shard, so the printed mAP is partial. For a small (9-image) val set this might be tolerable for a demo, but it's worth noting in docs that the reported mAP is rank-0's view.

KeitaW · 2026-04-29T09:53:03Z

+        img_name = ann_file.stem  # e.g. "001.jpg" from "001.jpg.json"
+        img_path = self.images_dir / img_name
+        for ext in [".jpg", ".jpeg", ".png"]:
+            candidate = self.images_dir / f"{img_name}{ext}"
+            if candidate.exists():
+                img_path = candidate
+                break


Image lookup loop is dead code

Because Path("001.jpg.json").stem is "001.jpg", img_name already includes the extension; the fallback loop builds 001.jpg.jpg, 001.jpg.jpeg, 001.jpg.png — none of which exist. The initial assignment img_path = self.images_dir / img_name is what actually finds the image, so it works for .jpg but is silently wrong for .png annotations (it would still resolve via the initial assignment too).

I'd suggest replacing with a single with_suffix-aware lookup:

Suggested change

img_name = ann_file.stem # e.g. "001.jpg" from "001.jpg.json"

img_path = self.images_dir / img_name

for ext in [".jpg", ".jpeg", ".png"]:

candidate = self.images_dir / f"{img_name}{ext}"

if candidate.exists():

img_path = candidate

break

img_stem = ann_file.name.removesuffix(".json") # e.g. "001.jpg"

img_path = self.images_dir / img_stem

if not img_path.exists():

raise FileNotFoundError(img_path)

It's not a runtime bug today (the Supervisely format names annotations <image_basename>.<ext>.json), but the loop is misleading code that suggests a behavior it doesn't actually implement.

KeitaW · 2026-04-29T09:53:03Z

+        original_state_dict = state["state_dict"]
+        cleaned_state_dict = {}
+
+        for key, value in original_state_dict.items():
+            new_key = key.replace("module.", "")  # Remove DDP prefix
+            new_key = new_key.replace("qai_model.", "")  # Remove wrapper prefix
+            cleaned_state_dict[new_key] = value
+
+        state_to_save = state.copy()
+        state_to_save["state_dict"] = cleaned_state_dict
+        state_to_save["original_state_dict"] = original_state_dict


Resume path depends on original_state_dict and breaks silently if absent

save_checkpoint overwrites state["state_dict"] with a cleaned dict that strips module. (DDP) and qai_model. (wrapper) prefixes, then stores the original under state["original_state_dict"]. On resume (line 794):

state_dict_to_load = checkpoint.get("original_state_dict", checkpoint["state_dict"]) model.load_state_dict(state_dict_to_load)

If original_state_dict is ever missing — e.g., a checkpoint produced by an earlier or future variant of this script, or someone hand-edits the file — load_state_dict will be called with the cleaned dict against a DDP+wrapper model and fail with one missing-key error per parameter. The verified test plan shows resume works because both keys are present, but the design is fragile.

I'd suggest keeping state["state_dict"] as the trainable form and storing the cleaned version under a separate key like inference_state_dict (or saving two files, one for resume and one for export). The current shape inverts the convention every other PyTorch example follows and will surprise readers.

KeitaW · 2026-04-29T09:53:03Z

+        }
+
+
+def create_model(num_classes: int, pretrained: bool = True):


--pretrained flag is documented but ignored

pretrained is accepted as a parameter but never read — DETRModel.from_pretrained() is called unconditionally on line 370. The README and CLI table both advertise --pretrained as a meaningful toggle ("use pre-trained model"). I'd suggest either honouring the flag (e.g., construct an unpretrained model when --pretrained is unset) or removing it from the argparser and the docs to avoid drift between code and docs.

KeitaW · 2026-04-29T09:53:03Z

+            args.batch_size = int(args.batch_size / ngpus_per_node)
+            args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
+            model = torch.nn.parallel.DistributedDataParallel(
+                model, device_ids=[args.gpu], find_unused_parameters=True)


find_unused_parameters=True may be masking a real graph

find_unused_parameters=True adds a per-iteration traversal cost and is generally a band-aid. Looking at QAIHubDETRWrapper.forward, all parameters of class_embed and bbox_embed are used on every step, and the DETR backbone is also fully used. If the unused-params flag is needed, it likely points at a frozen submodule somewhere in the QAI Hub wrapper. Could you confirm which parameters are unused? In most cases you can either (a) drop the flag entirely, (b) freeze the backbone explicitly with requires_grad_(False) and exclude it from the optimizer, which is faster than relying on DDP's discovery.

KeitaW · 2026-04-29T09:53:03Z

+            loss = total_loss / valid_samples
+        else:
+            loss = torch.tensor(0.0, requires_grad=True).cuda(args.gpu)


Empty-batch loss path is incompatible with removing find_unused_parameters

When every target in a batch is empty, this constructs a leaf tensor with no graph and calls .backward() on it. With find_unused_parameters=True (current setting) DDP tolerates the resulting "no parameters received gradient" state. If the previous finding (drop find_unused_parameters) is acted on, this branch will fire DDP's "some parameters did not receive grad" assertion and crash the worker. I'd suggest skipping the optimizer step entirely on empty batches:

Suggested change

loss = total_loss / valid_samples

else:

loss = torch.tensor(0.0, requires_grad=True).cuda(args.gpu)

if valid_samples == 0:

continue

loss = total_loss / valid_samples

KeitaW · 2026-04-29T09:53:03Z

+            class_loss = criterion(sample_logits, sample_labels)
+
+            # Bounding box L1 loss
+            sample_boxes = pred_boxes[batch_idx, :num_objects, :]
+            target_boxes = target["boxes"][:num_objects, :].cuda(args.gpu)
+            bbox_loss = nn.functional.l1_loss(sample_boxes, target_boxes)
+
+            # Combined loss (bbox weighted higher for detection)
+            combined_loss = class_loss + 5.0 * bbox_loss
+            total_loss += combined_loss
+            valid_samples += 1


Per-batch loss is not the DETR matching loss

This pairs the first num_objects queries with the first num_objects ground-truth boxes positionally, which doesn't match what DETR actually trains against (Hungarian matching between all 100 queries and the GT set, with a "no-object" class for unmatched queries). The PR description says this is a workshop/demo training, and the loss decreases on the small dataset, so I understand the simplification — but I'd strongly suggest adding a docstring or README note that the loss is intentionally simplified and not the standard DETR loss, so a reader doesn't copy this into a production setting and find that mAP plateaus.

Alternatively, since transformers is already a dep, you can call DetrForObjectDetection's built-in loss directly with labels= as the typical HF training pattern — it implements the matcher.

KeitaW · 2026-04-29T09:53:03Z

+            f.write(f"Completed: {datetime.now().strftime('%Y-%m-%d')}\n")
+        print(f"[OK] Training statistics saved to: {stats_file}")
+    except Exception as e:
+        print(f"[WARN] Could not save statistics file: {e}")


No dist.destroy_process_group() at exit

After training completes, main_worker returns without tearing down the NCCL process group. PyTorch will emit warnings on exit and, depending on the NCCL version, can leak GPU memory across pod restarts (relevant because maxRestarts: 3 means the same node may rerun this script). I'd suggest adding at the end of main_worker (or wrapping the body in try/finally):

if args.distributed: dist.destroy_process_group()

KeitaW · 2026-04-29T09:53:03Z

+                metric.update(preds, tgts)
+
+    results = metric.compute()
+    print("[INFO] mAP Evaluation Results:")


evaluate_map prints from every rank

The mAP results are printed unconditionally, so every rank dumps its own (partial, shard-local) numbers to kubectl logs. Combined with the rank-local metric issue noted above, readers see N inconsistent mAP tables per epoch. Gating the print block on args.rank == 0 (or a small log() helper) makes the output unambiguous.

KeitaW

Review Batch 4/5 — Infrastructure & NCCL Configuration

One inline finding here on the Dockerfile's air-gapped claim. The NCCL_SOCKET_IFNAME issue from Batch 2 is the other infrastructure-side item, kept there because it's anchored to the YAML.

KeitaW · 2026-04-29T09:53:19Z

+# Pre-download DETR-ResNet50 weights so the container works in air-gapped environments.
+# The weights originate from facebook/detr-resnet-50 on HuggingFace Hub; the QAI Hub
+# Model.from_pretrained() wrapper downloads them via the transformers library.
+# NOTE: This step requires internet access during docker build.
+ENV TRANSFORMERS_CACHE=/workspace/cache/transformers
+ENV HF_HOME=/workspace/cache/huggingface
+RUN mkdir -p /workspace/cache/transformers /workspace/cache/huggingface && \
+    python3 -c "from qai_hub_models.models.detr_resnet50 import Model; m = Model.from_pretrained(); print('DETR-ResNet50 weights cached')"


"Air-gapped" runtime claim has a qai_hub telemetry caveat

HF_HUB_OFFLINE=1 and TRANSFORMERS_OFFLINE=1 cover the HuggingFace path that DETRModel.from_pretrained() uses today, so the runtime-no-internet claim holds for the current code path. However, qai-hub-models 0.30.x transitively installs the qai_hub client which can read ~/.qai_hub/client.ini and attempt API calls if any helper beyond Model.from_pretrained() is invoked (e.g., compile, target_runtime).

I'd suggest either:

adding a one-line note in the README that "air-gapped" is verified for Model.from_pretrained() only, plus adding QAIHM_DISABLE_TELEMETRY=1 to the YAML env block defensively, or

adding a runtime assertion in detr_main.py that no outbound HTTPS is needed by setting HTTPS_PROXY= to an unreachable address during the smoke test.

KeitaW

Review Batch 5/5 — Documentation Consistency + Wrap-up

Final batch — small documentation drift items, the things that look great, and a suggested priority for addressing the findings across all 5 batches.

Things That Look Great

Clear, complete test plan: The PR body documents the exact verified configuration (2x ml.g5.8xlarge, NCCL/EFA confirmed, loss curve, checkpoint resume tested, air-gapped behaviour) — that's exactly the level of evidence I want to see for a new test case.
Self-contained contribution: Dockerfile, training script, Kubernetes manifest, dataset prep doc, and main README all live together and cross-link cleanly. New readers can go from zero to running.
Pinned Python dependencies: Every pip dependency in the Dockerfile is pinned to a specific version (torch==2.5.1, qai-hub-models==0.30.2, transformers==4.51.3, …), and the inline comment explaining why qai-hub-models is installed without the [detr-resnet50] extra is the kind of context I love seeing in Dockerfiles.
envsubst whitelist: The deployment instructions correctly use an explicit variable whitelist (envsubst '$IMAGE_URI $NUM_NODES $INSTANCE_TYPE'), avoiding accidental substitution of YAML's ${...} patterns.
PyTorchJob with elastic policy and restartPolicy: ExitCode: Using Kubeflow's PyTorchJob with maxRestarts: 3 is the right pattern for distributed training — it stays well under the "burning GPU hours on repeated failures" threshold and is consistent with other K8s test cases in the repo.
Air-gapped runtime: Pre-baking model weights into the image with TRANSFORMERS_OFFLINE=1/HF_HUB_OFFLINE=1 set at runtime is a thoughtful pattern for clusters without NAT, and it's clearly documented in the README.
Checkpoint resume from FSx: The script's RESUME_FLAG pattern in the YAML command is a clean way to wire up automatic resume on pod restarts, and it's verified end-to-end in the test plan.
.gitignore: Sensibly scoped — model artifacts, caches, and wandb/ are excluded so accidental commits of large outputs are unlikely.

Suggested priority

Items I'd want addressed before merge (correctness / portability):

NCCL_SOCKET_IFNAME=eth0 → ^lo (Batch 2) — portability across AMIs
Checkpoint save fires on every rank under torchrun (Batch 3) — file-write race on FSx
Resume path depends on original_state_dict (Batch 3) — fragile checkpoint format
Base image :latest (Batch 1) — pin to a specific tag
Missing copyright headers (Batch 1) — required by repo conventions

Everything else is merge-friendly follow-up work.

KeitaW · 2026-04-29T09:54:05Z

+# The weights originate from facebook/detr-resnet-50 on HuggingFace Hub; the QAI Hub
+# Model.from_pretrained() wrapper downloads them via the transformers library.
+# NOTE: This step requires internet access during docker build.
+ENV TRANSFORMERS_CACHE=/workspace/cache/transformers


TRANSFORMERS_CACHE is deprecated in favor of HF_HOME

Per the HuggingFace docs TRANSFORMERS_CACHE has been deprecated for several major versions in favor of HF_HOME. With transformers==4.51.3 it still works but emits a deprecation warning. Could you drop TRANSFORMERS_CACHE here (and in the YAML template at lines 56–57) and rely on HF_HOME alone? That keeps the test case forward-compatible with future transformers versions.

Suggested change

ENV TRANSFORMERS_CACHE=/workspace/cache/transformers

KeitaW · 2026-04-29T09:54:06Z

+              - name: TRANSFORMERS_CACHE
+                value: "/workspace/cache/transformers"


Drop TRANSFORMERS_CACHE (deprecated) — HF_HOME alone is sufficient

Same finding as in the Dockerfile: per the HuggingFace docs, TRANSFORMERS_CACHE is deprecated. The next two lines (HF_HOME=/workspace/cache/huggingface) cover the cache path correctly.

Suggested change

- name: TRANSFORMERS_CACHE

value: "/workspace/cache/transformers"

KeitaW · 2026-04-29T09:54:06Z

+
+### Different Classes
+
+Edit `meta.json` to define your own classes:


meta.json location is unclear in this section

The Customization section says "Edit meta.json to define your own classes" without saying where the file lives. A reader has to jump to data/README.md to learn it goes at the dataset root next to Supermarket shelves/. A one-line note here would save the trip:

Suggested change

Edit `meta.json` to define your own classes:

Edit `meta.json` (place at `<data-dir>/meta.json` — see [data/README.md](data/README.md)) to define your own classes:

KeitaW · 2026-04-29T09:54:06Z

+Checkpoints are saved to the directory specified by `CHECKPOINT_DIR` environment
+variable (default: `/tmp/checkpoints`):


CHECKPOINT_DIR default in README differs from K8s default

The README says checkpoints land in CHECKPOINT_DIR with default /tmp/checkpoints. The K8s template script (yaml-template lines 102–106) sets CHECKPOINT_DIR to /fsx/checkpoint when writable — which is the actually-used path during the verified test run. Could you clarify in the README that the script-default differs from the deployment-default, or just point to /fsx/checkpoint since that's what reviewers checking the test plan will see in kubectl logs?

Suggested change

Checkpoints are saved to the directory specified by `CHECKPOINT_DIR` environment

variable (default: `/tmp/checkpoints`):

Checkpoints are saved to the directory specified by `CHECKPOINT_DIR` environment

variable (script default: `/tmp/checkpoints`; the Kubernetes deployment sets this to `/fsx/checkpoint`):

KeitaW

Few comments

aravneelaws · 2026-04-30T20:49:07Z

Thanks for the thorough review, @KeitaW. I will fix them and notify once done.
Quick clarification on the directory structure - while I see the point of keeping it under DDP, I also see the point for being its own directory inside pytorch (similar to nanoVLM). This is the first object detection model we have here. Given that, do you think it is ok to keep it here or still better to move it to inside the DDP directory?

- Add copyright/license header to detr_main.py (MIT-0) - Change NCCL_SOCKET_IFNAME from eth0 to ^lo for portability across AMIs - Gate checkpoint save on rank 0 only (was racing on all ranks under torchrun) - Invert state_dict convention: save DDP-prefixed form as state_dict (standard PyTorch convention), cleaned form as inference_state_dict Tested on 2x ml.g5.8xlarge with ^lo: DDP init over EFA succeeded, training to epoch 24 with val loss ~1.13, checkpoint saved from rank 0 only (worker-1 confirmed 0 saves).

Training code (detr_main.py): - All-reduce validation loss across ranks before is_best decision (3.2) - Fix dead code in image lookup loop: use removesuffix instead (3.3) - Remove --pretrained flag: model always loads pre-trained weights (3.5) - Document find_unused_parameters=True rationale in code comment (3.6) - Skip empty batches with continue instead of zero-loss tensor (3.7) - Document simplified positional loss vs Hungarian matching (3.8) - Add dist.destroy_process_group() at exit (3.9) - Gate evaluate_map prints and final summary on rank 0 only (3.10) Deployment (YAML template): - Add cpu/memory resource requests (8 CPU, 32Gi) (2.2) - Add QAIHM_DISABLE_TELEMETRY=1 env var (4.1) - Remove --pretrained from torchrun command (3.5) Documentation: - Use versioned tag (v1) and consistent ECR repo name in README (2.3) - Replace fragile grep ECR check with --repository-names query (2.4) - Remove deprecated TRANSFORMERS_CACHE, use HF_HOME only (5.1/5.2) - Add meta.json path reference in customization section (5.3) - Clarify CHECKPOINT_DIR defaults (script vs K8s) (5.4) - Document simplified loss in Expected Results note (3.8) Tested on 2x ml.g5.8xlarge: DDP training, all-reduce val loss, rank-0-only checkpoint saves and mAP prints all verified.

aravneelaws · 2026-05-01T17:30:13Z

@KeitaW I have addressed all the comments except the directory structure and the :latest tag in aws-hpc base docker image for consistency. Please let me know if you have another comments. Thanks!

KeitaW

Review Batch 1/3 (round 2) — Round-2 Status & Structure & Repository Hygiene

Posting the round-2 pass in 3 themed batches; this first one carries the resolution table for the round-1 findings, plus the structure / hygiene items that are still open.

Summary

Thanks for the careful pass through the round-1 feedback — most of it has landed cleanly. The portability fixes (NCCL ^lo, CPU/memory requests, QAIHM_DISABLE_TELEMETRY, HF_HOME-only) are in, the checkpoint-save and validation-loss correctness issues are resolved with a clean is_rank_zero gate plus an all_reduce(AVG), the resume path now stores the original state dict directly, and the K8s docs are now consistent (detr-resnet50-finetune everywhere, version-tagged image, --repository-names).

One thing has shifted from "policy nit" to merge blocker since round 1: PR #1070 bumps nccl-tests:latest to CUDA 13.0.2, and torch==2.5.1 has no CUDA-13 wheels — the moment #1070 lands, this Dockerfile starts producing broken containers. The :latest pin needs to become a concrete tag (or torch needs to bump to ≥2.9). See the inline comment on Dockerfile:3 below.

Beyond that: the directory should move to pytorch/ddp/detr-finetune/, the README still advertises a --pretrained flag that argparse no longer accepts (Batch 2), and copyright headers are still missing on the non-.py files (5 inline comments below). The CUDA pin and the --pretrained README fix are hard pre-merge; the directory move I'd also do pre-merge; the headers are merge-friendly follow-up.

Resolution Status of Round-1 Findings

#	Finding	Status
1	Base image pinned to `:latest`	❌ Open — and now a merge-blocker: PR #1070 bumps `nccl-tests:latest` to CUDA 13.0.2, which breaks `torch==2.5.1` (no CUDA-13 wheels). See inline.
2	Directory placement under `pytorch/` rather than under a library	❌ Open — `nanoVLM`/`verl`/`picotron` are themselves libraries (single-model, model level omitted), not the model-at-second-level pattern this PR uses; `pytorch/ddp/detr-finetune/` matches the documented `<framework>/<library>/<model>/` shape
3	Missing copyright/license headers	🟡 Partial — `detr_main.py` now has it; `Dockerfile`, YAML template, and the 3 READMEs still don't (5 inline comments below)
4	`NCCL_SOCKET_IFNAME=eth0` breaks on EFA-equipped instances	✅ Resolved — YAML now `^lo`
5	Image tag `:latest` in the deployment example	✅ Resolved — kubernetes README now uses `${VERSION}=v1` end-to-end
6	Worker manifest does not declare CPU/memory requests	✅ Resolved — `cpu: "8" / memory: "32Gi"` requests + limits added
7	Checkpoint save fires on every rank under `torchrun`	✅ Resolved — gated on `is_rank_zero`
8	Validation loss is rank-local; "best" decision diverges across ranks	✅ Resolved — `dist.all_reduce(loss_tensor, op=ReduceOp.AVG)` (small bias note in Batch 2)
9	Image lookup loop in `SupermarketDataset.__getitem__` is dead code	✅ Resolved — replaced with `removesuffix(".json")` and explicit `FileNotFoundError`
10	Resume path depends on `original_state_dict` and breaks silently if absent	✅ Resolved — `state_dict` now keeps the trainable form; cleaned dict moved to `inference_state_dict`
11	`--pretrained` flag is documented but ignored	🟡 Partial — flag removed from argparse, but the README still passes it (will now fail with "unrecognized arguments"); see Batch 2
12	No `dist.destroy_process_group()` at exit	✅ Resolved
13	Empty-batch loss path is incompatible with removing `find_unused_parameters`	✅ Resolved — now `continue`s when `valid_samples == 0`
14	`find_unused_parameters=True` masks a real graph	🟡 Partial — flag still set to `True`, but the inline comment honestly explains the COCO-heads-still-attached cause; deleting the dead heads in the wrapper is cleaner (Batch 2)
15	Per-batch loss is not the DETR matching loss	✅ Resolved — explicit NOTE in `train()` and the README
16	`print()` from all ranks instead of rank-0-only logging	🟡 Partial — final-stats and mAP prints are gated on rank 0; per-epoch banners still fire from every rank (Batch 2)
17	"Air-gapped" runtime claim has a `qai_hub` telemetry caveat	✅ Resolved — `QAIHM_DISABLE_TELEMETRY=1` in YAML env block
18	`TRANSFORMERS_CACHE` is deprecated in favor of `HF_HOME`	✅ Resolved — only `HF_HOME` set in both Dockerfile and YAML
19	Image name inconsistency between docs and PR test plan	✅ Resolved — `detr-resnet50-finetune` used everywhere
20	`meta.json` location in main README is unclear	✅ Resolved — main README points at `<data-dir>/meta.json`
21	`CHECKPOINT_DIR` default in README differs from K8s default	✅ Resolved — README explicitly notes both
22	`aws ecr describe-repositories \| grep` is fragile	✅ Resolved — switched to `--repository-names ... \|\| create-repository` pattern
23	`evaluate_map` prints from every rank	✅ Resolved — mAP table gated on rank 0

Legend: ✅ Resolved · 🟡 Partial · ❌ Open · ➖ N/A (scope changed)

Directory placement — `pytorch/ddp/detr-finetune/` matches the convention

(Cross-cutting; not anchored to a single line.)

Re-reading the convention more carefully: nanoVLM, verl, and picotron aren't "models sitting directly under pytorch/" — they're libraries themselves (single-model libraries where the <model>/ level is omitted because the library effectively is the model). The documented shape is:

3.test_cases/<framework>/<library>/<model>/

with <model>/ optionally collapsed when the library has a single model. detr-finetune is a model (the library being plain PyTorch DDP), so the correct slot is pytorch/<library>/detr-finetune/. Since this test case is plain DDP with torch.distributed.run and no specialized library, that puts it at pytorch/ddp/detr-finetune/ alongside the other DDP examples — which was my original suggestion. I'd ask you to make that move before merge; it's a git mv plus a relative-path bump in the READMEs (../../../1.architectures → ../../../../1.architectures and similar).

KeitaW · 2026-05-07T13:00:49Z

+# DETR-ResNet50 Object Detection Training Container
+# Base image: HPC-optimized with pre-configured EFA and NCCL for distributed training


Optional: MIT-0 license header — Dockerfile

After auditing the repo: only 5 of 19 Dockerfiles in 3.test_cases/ carry the SPDX header (~26%), so this is not a universal convention — softening from "missing" to "optional, for consistency with the recently-headered files." Skip if you prefer matching the majority pattern.

If you do want to add it:

Suggested change

# DETR-ResNet50 Object Detection Training Container

# Base image: HPC-optimized with pre-configured EFA and NCCL for distributed training

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

# SPDX-License-Identifier: MIT-0

KeitaW · 2026-05-07T13:00:49Z

@@ -0,0 +1,55 @@
+# DETR-ResNet50 Object Detection Training Container
+# Base image: HPC-optimized with pre-configured EFA and NCCL for distributed training
+FROM public.ecr.aws/hpc-cloud/nccl-tests:latest


nccl-tests:latest × torch==2.5.1 is about to become a tested-config break

Edited after self-audit: the failure-mode mechanism in my original wording was overstated — see correction below. The pin/bump recommendation stands.

PR #1070 bumps public.ecr.aws/hpc-cloud/nccl-tests from CUDA 12.9.1 to CUDA 13.0.2 (and NCCL 2.30.4, EFA 1.48), and once it merges, :latest will resolve to that new image. Per PyTorch's release-compatibility matrix, torch==2.5.1 only ships wheels for CUDA 11.8 / 12.1 / 12.4 — there's no CUDA-13 build of 2.5.x.

Correction on the failure mode (my original comment was wrong here): PyTorch wheels bundle their own libcudart/cuBLAS/cuDNN, and NVIDIA drivers are forward-compatible — a CUDA-13 driver runs CUDA-12-linked binaries fine. So import torch and torch.cuda.is_available() would likely succeed. The real risk is NCCL plugin / ABI mismatch: the new base image will ship aws-ofi-nccl built against system NCCL 2.30; torch 2.5.1 bundles its own NCCL ~2.21. Whichever wins at LD_LIBRARY_PATH resolution governs collectives, and the OFI plugin built for NCCL 2.30 is unlikely to be ABI-compatible with torch's bundled 2.21 — symptoms would surface in NCCL_DEBUG=INFO logs at process-group init or fall back to a TCP path that silently degrades EFA throughput. Either way, the combination is untested — your verified v3 test was on the CUDA-12.9.1 stack.

Two ways out, either is fine:

Pin the base image to a CUDA-12-era tag so the stack stays compatible with torch==2.5.1. The pre-nccl-tests: bump to CUDA 13.0.2 / NCCL 2.30.4 and add sm_103 (B300) #1070 published tag is the obvious anchor (or use whichever digest your verified v3 test run pulled).

Bump torch to a version with CUDA-13 wheels so the test case rides along with nccl-tests: bump to CUDA 13.0.2 / NCCL 2.30.4 and add sm_103 (B300) #1070. The matrix shows torch>=2.9 ships CUDA 13.0 binaries; torch==2.9.x (and matching torchvision==0.24.x) is the minimum.

I'd lean toward option 1 (pin to the pre-bump tag) because your verified test run was on the CUDA-12 stack — bumping torch is a behavior change that should ride its own verification. Either way, this needs to be resolved before merge.

Suggested change

FROM public.ecr.aws/hpc-cloud/nccl-tests:latest

FROM public.ecr.aws/hpc-cloud/nccl-tests:cuda12.9.1-efa1.47.0-ofiv1.18.0-ncclv2.29.3-1-testsv2.17.9

KeitaW · 2026-05-07T13:00:49Z

@@ -0,0 +1,148 @@
+apiVersion: "kubeflow.org/v1"


Optional: MIT-0 license header — YAML template

After auditing the repo: only 13 of 70 YAML files in 3.test_cases/ carry the SPDX header (~19%), so this is not a universal convention — softening from "missing" to "optional." Skip if you'd rather match the majority pattern.

If you do want to add it:

Suggested change

apiVersion: "kubeflow.org/v1"

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

# SPDX-License-Identifier: MIT-0

apiVersion: "kubeflow.org/v1"

KeitaW · 2026-05-07T13:00:49Z

@@ -0,0 +1,176 @@
+# Kubernetes Deployment


~~Missing MIT-0 license header — kubernetes/README.md~~ — withdrawn

Withdrawn — see retraction on the main README.md comment. Repo convention is that markdown READMEs don't carry SPDX headers (1/54).

KeitaW

Review Batch 2/3 (round 2) — Training Code Quality

Findings on detr_main.py and the README's training instructions. The most consequential is the --pretrained README regression (the basic-usage commands will now fail argparse). The other items are smaller polish.

Per-rank prints still spam logs (downgraded from round 1)

Cross-cutting across detr_main.py (examples: lines 187, 365, 367, 370, 644, 649, 758, 762, 808, 873-876).

Final-stats and mAP-table prints are now correctly gated on rank 0 — that's the bulk of the noise gone. The remaining per-rank prints (model-load banners, [OK] Checkpoint saved, [INFO] Loading checkpoint, dataset summary [INFO] Loaded N train samples, [INFO] Starting training for ..., etc.) still fire from every worker, so kubectl logs on multiple workers still shows N copies of each banner. Not a correctness issue and merge-friendly, but a small log() helper would tighten the K8s logs:

def log(*args, **kwargs):
    is_rank_zero = (not getattr(_args, "distributed", False)) or getattr(_args, "rank", 0) == 0
    if is_rank_zero:
        print(*args, **kwargs)

(Or, more locally, gate the existing print(...) calls with if is_rank_zero: where they already compute that flag — main_worker and save_checkpoint are the hot ones.)

KeitaW · 2026-05-07T13:00:57Z

+To run training locally with a single GPU:
+
+```bash
+python detr_main.py /path/to/data --epochs 50 --batch-size 4 --lr 1e-4 --pretrained --num-classes 2


README still passes --pretrained even though argparse no longer accepts it

This is a small regression introduced by the round-1 cleanup. You removed the --pretrained flag from argparse (good — it was a no-op), but this basic-usage command and the distributed example below it still pass --pretrained. Running these now exits with error: unrecognized arguments: --pretrained from argparse. The create_model docstring at lines 358-360 also still lists pretrained as a parameter (see Batch 3).

Could you also drop --pretrained from the torchrun example a few lines below (line 81)? Same fix, same line removed.

Suggested change

python detr_main.py /path/to/data --epochs 50 --batch-size 4 --lr 1e-4 --pretrained --num-classes 2

python detr_main.py /path/to/data --epochs 50 --batch-size 4 --lr 1e-4 --num-classes 2

KeitaW · 2026-05-07T13:00:57Z

+            # find_unused_parameters=True is required because the QAI Hub DETR wrapper
+            # replaces the original 92-class COCO detection heads with custom heads, but
+            # the original heads (class_labels_classifier, bbox_predictor) remain in the
+            # model's parameter list. They are never used in forward(), so DDP would
+            # otherwise error with "Expected to have finished reduction". This flag has
+            # no effect on accuracy -- it only tells DDP to skip unused params in the
+            # gradient all-reduce.
+            model = torch.nn.parallel.DistributedDataParallel(
+                model, device_ids=[args.gpu], find_unused_parameters=True)
+        else:
+            model.cuda()
+            # See comment above for find_unused_parameters rationale.
+            model = torch.nn.parallel.DistributedDataParallel(
+                model, find_unused_parameters=True)


find_unused_parameters=True rationale — accepting the band-aid

Edited after self-audit: my original suggestion to del self.qai_model.model.class_labels_classifier and del ...bbox_predictor was wrong and would break forward(). qai_model.model is HuggingFace's DetrForObjectDetection, whose forward unconditionally evaluates both heads:

logits = self.class_labels_classifier(sequence_output) pred_boxes = self.bbox_predictor(sequence_output).sigmoid()

Deleting them would raise AttributeError on the next forward pass. Please ignore the original suggestion — apologies for the misdirection.

The cleaner alternative would be to skip DetrForObjectDetection.forward entirely and call the inner DetrModel directly (self.qai_model.model.model(pixel_values=images)) — but that's a wrapper refactor, not a one-line fix, and the cost of find_unused_parameters=True on this small model is negligible. The flag with your explanatory comment is a fine final state.

KeitaW · 2026-05-07T13:00:57Z

+        if args.distributed:
+            loss_tensor = torch.tensor([final_val_loss], device=f"cuda:{args.gpu}")
+            dist.all_reduce(loss_tensor, op=dist.ReduceOp.AVG)
+            final_val_loss = loss_tensor.item()


ReduceOp.AVG of per-rank means is biased when shards are unequal

This fixes the round-1 "best decision diverges" bug — thank you. One small wrinkle worth knowing about: averaging per-rank losses.avg values with ReduceOp.AVG is exactly correct only when all ranks see the same number of validation samples. With a 9-image val set across 2 ranks and DistributedSampler(val_dataset, shuffle=False) (default drop_last=False), the sampler pads to 10 by duplicating samples — so one image gets weighted 2× in the global average. For the 36-image workshop demo this bias is small enough to ignore, but if you want bit-exact aggregation:

if args.distributed: t = torch.tensor([losses.sum, losses.count], device=f"cuda:{args.gpu}") dist.all_reduce(t, op=dist.ReduceOp.SUM) final_val_loss = (t[0] / t[1]).item()

(This requires returning losses.sum/losses.count from validate() instead of losses.avg, or moving the all-reduce inside validate().) Optional; current code is correct enough for this demo.

KeitaW

Review Batch 3/3 (round 2) — Documentation Consistency + Wrap-up

Last batch — two small doc-vs-code drift items, the things that look great this round, and a suggested merge order.

Things That Look Great

Round-1 turnaround: 17 of 23 findings fully resolved in one pass, with another 4 partially addressed and only 2 deliberately deferred — that's a thorough response, and the resolution table in Batch 1 is mostly green.
Validation-loss all-reduce: The dist.all_reduce(... ReduceOp.AVG) plus the is_rank_zero checkpoint gate cleanly fix the rank-divergence-on-is_best correctness issue from round 1.
is_rank_zero helper computed locally where needed: Same one-liner used at detr_main.py:460, :900, and :918 — readable and consistent without overengineering a global helper.
Honest workshop-demo NOTE on the loss: The inline comment at detr_main.py:498-502 and the README "Note" at README.md:157-161 calling out the simplified positional matching (vs. DETR's Hungarian) are exactly the kind of "don't copy this into production" framing that protects future readers.
Detailed find_unused_parameters rationale: The new comment block at detr_main.py:769-775 makes the design tradeoff explicit. Even if the cleaner fix is to delete the dead heads, the explanation is honest about what the band-aid is doing and why.
Air-gapped defense in depth: Adding QAIHM_DISABLE_TELEMETRY=1 to the YAML in addition to TRANSFORMERS_OFFLINE=1 / HF_HUB_OFFLINE=1 removes the last potential outbound-HTTPS path and matches the README's "no internet at runtime" claim.
Kubernetes README polish: The image-name unification (detr-resnet50-finetune everywhere), the ${VERSION} variable instead of :latest, and the --repository-names ... || create-repository pattern are all clean small wins that materially improve the deployment guide's reliability.
Resume path now stores the trainable form as state_dict: Inverting the previous shape (cleaned dict was the state_dict, original was the fallback) restores the convention every other PyTorch script follows and avoids the silent missing-key failure mode from round 1.

Suggested merge order

Hard pre-merge:

Dockerfile:3 CUDA-13 incompatibility (Batch 1) — pin the base image, otherwise this Dockerfile breaks the moment #1070 publishes.
README --pretrained flag (Batch 2) — current basic-usage commands fail argparse.

Pre-merge if you have appetite:

Directory move to pytorch/ddp/detr-finetune/ (Batch 1) — matches the documented layout convention.

Merge-friendly follow-ups (won't block on my side):

License headers on the 5 non-.py files (Batch 1).
Per-rank print() cleanup (Batch 2).
create_model docstring + Training Configuration table corrections (Batch 3).
find_unused_parameters removal via dead-head deletion (Batch 2) and ReduceOp.AVG bias fix (Batch 2) — both optional polish.

KeitaW · 2026-05-07T13:01:05Z

+| Batch Size | 4 per GPU | Optimized for DETR memory requirements |
+| Image Size | 800x800 | DETR standard input resolution |
+| Loss | CrossEntropy + 5x L1 bbox | Classification + weighted bbox regression |
+| Augmentation | HFlip, ColorJitter | Applied during training only |


Training Configuration table claims Augmentation | HFlip, ColorJitter but HFlip is intentionally omitted

The table lists augmentations as HFlip, ColorJitter, but the script explicitly omits RandomHorizontalFlip at detr_main.py:818-820 with a clear comment explaining why (the standard transform flips images but not bounding boxes, leading to misaligned image-box pairs). Could you update the table to match what's actually applied?

Suggested change

| Augmentation | HFlip, ColorJitter | Applied during training only |

| Augmentation | ColorJitter | HFlip omitted -- standard transform doesn't flip box coords |

KeitaW · 2026-05-07T13:01:05Z

+        num_classes: Number of target object classes (excluding background).
+        pretrained: Whether to load pre-trained weights.


create_model docstring still references removed pretrained parameter

The function signature was simplified to def create_model(num_classes: int): but the docstring's Args block still lists:

pretrained: Whether to load pre-trained weights.

Small cleanup — drop that line so the docstring matches the signature. The suggestion replaces the num_classes: line plus the now-stale pretrained: line with the single accurate line:

Suggested change

num_classes: Number of target object classes (excluding background).

pretrained: Whether to load pre-trained weights.

num_classes: Number of target object classes (excluding background).

KeitaW · 2026-05-07T13:12:11Z

Hi @aravneelaws , updated the review.

KeitaW

Review Batch 4/3 (round 2 follow-up) — Self-audit corrections + missed findings

A self-audit (/lt:review on my own round-2 comments) surfaced one real bug I missed and several wording corrections. The wording corrections have been applied directly via PATCH to the original comments — you'll see "Edited after self-audit" / "Withdrawn" prefixes on those threads. This batch carries the 3 new findings I should have flagged in the first round-2 pass.

Summary of edits already applied to round-2 comments

Original comment	Action
Missing license header — `README.md` / `kubernetes/README.md` / `data/README.md`	Withdrawn — only 1 of 54 READMEs in `3.test_cases/` carries the SPDX header; pushing it on `.md` files isn't actual repo convention.
Missing license header — `Dockerfile`, YAML template	Softened to optional — only ~26% of Dockerfiles and ~19% of YAMLs in `3.test_cases/` carry the header.
`nccl-tests:latest` × `torch==2.5.1`	Mechanism corrected — the failure mode is more likely NCCL plugin ABI mismatch (system NCCL 2.30 vs torch's bundled 2.21) than `libcudart`. Pin/bump recommendation stands.
`find_unused_parameters=True` cleaner-fix	Suggestion withdrawn — `del self.qai_model.model.class_labels_classifier` would break HF `DetrForObjectDetection.forward`, which unconditionally evaluates the heads. The flag with your explanatory comment is a fine final state.
`--pretrained` README finding	Tiny edit — line reference 82 → 81.

New findings (3 inline comments below)

The new ones, in priority order:

🔴 CHECKPOINT_DIR falls through to ephemeral storage when /fsx/checkpoint doesn't pre-exist (yaml-template L106-111) — the entire --resume mechanism is silently broken on first run unless someone manually pre-creates /fsx/checkpoint on the FSx volume. Your verified test plan only worked because the test environment had it pre-created.
🟡 QAIHubDETRWrapper.forward fallback chain is dead code that would crash if reached (detr_main.py L327-340) — only the last_hidden_state branch produces a 256-dim tensor that fits self.class_embed; the prediction_logits and dict-style fallbacks would silently produce wrong shapes.
🟡 --nproc_per_node=1 is hardcoded in the YAML regardless of INSTANCE_TYPE (yaml-template L125) — anyone scaling beyond ml.g5.8xlarge (per the K8s README's Scaling section) silently gets 1/N GPU utilization.

KeitaW · 2026-05-07T13:23:05Z

+                  if [ -w "/fsx/checkpoint" ]; then
+                    export CHECKPOINT_DIR="/fsx/checkpoint"
+                  else
+                    export CHECKPOINT_DIR="/workspace/checkpoints"
+                    mkdir -p ${CHECKPOINT_DIR}
+                  fi


[ -w "/fsx/checkpoint" ] falls through to ephemeral storage when the dir doesn't pre-exist

Observation: Bash [ -w PATH ] returns false (exit 1) when PATH doesn't exist. So if /fsx/checkpoint isn't pre-created on the FSx volume, the script silently sets CHECKPOINT_DIR=/workspace/checkpoints — which lives in the ephemeral container layer, not on FSx.

Impact: The entire resume mechanism (maxRestarts: 3, RESUME_FLAG="--resume=..." block on lines 113-118, and the verified "resume works" claim in the PR body) hinges on checkpoints surviving pod restarts. With this bug they don't survive — /workspace/checkpoints vanishes when the pod is replaced. Your test plan's verified-resume result must have worked because the test environment had /fsx/checkpoint pre-created; on a fresh deploy it silently fails.

Suggestion: Test the FSx mountpoint and create the subdir unconditionally instead of probing it:

Suggested change

if [ -w "/fsx/checkpoint" ]; then

export CHECKPOINT_DIR="/fsx/checkpoint"

else

export CHECKPOINT_DIR="/workspace/checkpoints"

mkdir -p ${CHECKPOINT_DIR}

fi

# Setup checkpoint directory on FSx (created if missing)

if [ -d "/fsx" ]; then

export CHECKPOINT_DIR="/fsx/checkpoint"

else

export CHECKPOINT_DIR="/workspace/checkpoints"

fi

mkdir -p "${CHECKPOINT_DIR}"

(Apologies for missing this in the first round-2 pass — it's the highest-severity finding in this PR.)

KeitaW · 2026-05-07T13:23:05Z

+        if hasattr(outputs, "last_hidden_state"):
+            hidden_states = outputs.last_hidden_state
+        elif hasattr(outputs, "prediction_logits"):
+            hidden_states = outputs.prediction_logits
+        else:
+            # Try dict-style access
+            hidden_states = outputs.get("logits", outputs.get("pred_logits"))
+
+        if hidden_states is None:
+            raise RuntimeError(
+                "Could not extract hidden states from QAI Hub DETR model. "
+                f"Output type: {type(outputs)}, "
+                f"Available attributes: {[a for a in dir(outputs) if not a.startswith('_')]}"
+            )


Hidden-state extraction fallback chain is dead code that would crash silently

Observation: Per HF's DetrObjectDetectionOutput, outputs always has last_hidden_state, so the first if branch always wins. The prediction_logits and dict-style logits/pred_logits fallbacks below it can never execute — and if they ever did (e.g., HF rename), they would produce a tensor of the wrong shape for self.class_embed = Linear(256, num_classes+1): outputs.logits on a DetrForObjectDetection is the already-classified [B, num_queries, 92] tensor, not the 256-dim decoder hidden states.

Impact: This is the "defensive getattr where the attr is guaranteed" pattern — the fallback chain looks like robustness but actually masks failure. If HF renames or restructures the output, the Linear layer would crash with a confusing shape error several lines later instead of failing fast with AttributeError.

Suggestion: Replace the chain with a direct access plus an explicit assertion:

Suggested change

if hasattr(outputs, "last_hidden_state"):

hidden_states = outputs.last_hidden_state

elif hasattr(outputs, "prediction_logits"):

hidden_states = outputs.prediction_logits

else:

# Try dict-style access

hidden_states = outputs.get("logits", outputs.get("pred_logits"))

if hidden_states is None:

raise RuntimeError(

"Could not extract hidden states from QAI Hub DETR model. "

f"Output type: {type(outputs)}, "

f"Available attributes: {[a for a in dir(outputs) if not a.startswith('_')]}"

)

hidden_states = outputs.last_hidden_state

assert hidden_states.shape[-1] == 256, (

f"Expected 256-dim decoder hidden states from QAI Hub DETR, "

f"got {hidden_states.shape}. The QAI Hub model output schema may have changed."

)

KeitaW · 2026-05-07T13:23:05Z

+
+                  echo "Starting DETR object detection training..."
+                  python3 -m torch.distributed.run \
+                    --nproc_per_node=1 \


--nproc_per_node=1 is hardcoded — multi-GPU instances will be silently underutilized

Observation: The launch line hardcodes --nproc_per_node=1, but kubernetes/README.md (Scaling section, line 153) invites users to bump INSTANCE_TYPE=ml.g5.12xlarge (4 GPUs) or larger. The script's per-process logic uses LOCAL_RANK so multi-GPU launching would work — but the YAML doesn't expose a way to vary nproc_per_node alongside INSTANCE_TYPE.

Impact: On ml.g5.12xlarge you'd use 1/4 GPUs; on ml.g5.48xlarge 1/8; on ml.p5.48xlarge 1/8 H100s. Cost-per-step quadruples or worse. The README's scaling instructions are misleading without this knob.

Suggestion: Expose it as a substituted variable:

Suggested change

--nproc_per_node=1 \

--nproc_per_node=${NPROC_PER_NODE:-1} \

…and document NPROC_PER_NODE in kubernetes/README.md (Scaling section) alongside INSTANCE_TYPE, with a note that NPROC_PER_NODE should match the instance's GPU count (ml.g5.8xlarge=1, ml.g5.12xlarge=4, ml.g5.48xlarge=8, ml.p5.48xlarge=8). Add it to the envsubst whitelist on K8s README line 71 too.

- Pin Dockerfile base image to cuda12.8.1-efa1.43.2-ofiv1.16.3-ncclv2.27.7-1 (prevents CUDA 13 / torch 2.5.1 ABI mismatch from PR #1070) - Move directory from pytorch/detr-finetune/ to pytorch/ddp/detr-finetune/ (follows <framework>/<library>/<model> convention) - Remove --pretrained from README examples (arg was removed from argparse) - Fix CHECKPOINT_DIR: check [ -d /fsx ] + mkdir -p (was using -w on non-existent dir, causing silent fallback to ephemeral storage) - Replace QAIHubDETRWrapper.forward fallback chain with direct outputs.last_hidden_state access + shape assertion - Parameterize --nproc_per_node=${NPROC_PER_NODE:-1} for multi-GPU support - Fix training config table: note HFlip intentionally omitted - Remove stale pretrained param from create_model docstring - Update relative path references after directory move Tested on 2x ml.g5.8xlarge: training runs, CHECKPOINT_DIR correctly resolves to /fsx/checkpoint with mkdir -p, DDP active.

aravneelaws · 2026-05-12T23:54:12Z

Thanks for the thorough round-2 review @KeitaW! All hard pre-merge and pre-merge items have been addressed in commit cc14ab6:

Hard pre-merge (all fixed):

Base image pinned — Using cuda12.8.1-efa1.43.2-ofiv1.16.3-ncclv2.27.7-1-testsv2.16.9 (the newest available tag compatible with torch==2.5.1 CUDA 12.4 wheels). Note: the tag you suggested (cuda12.9.1-efa1.47.0-...) doesn't exist in the registry yet — I used the closest available. Happy to bump once that tag is published.
--pretrained removed from README examples — Both single-GPU and torchrun examples updated.
CHECKPOINT_DIR fixed — Now uses [ -d "/fsx" ] + mkdir -p "${CHECKPOINT_DIR}" so it correctly creates the directory on fresh deploys instead of silently falling back to ephemeral storage.

Pre-merge (all fixed):

Directory moved to 3.test_cases/pytorch/ddp/detr-finetune/ — agreed with your reasoning that this is a model using DDP, not a library. Relative path references updated.
QAIHubDETRWrapper.forward fallback chain removed — Replaced with direct outputs.last_hidden_state access + shape assertion. The fallbacks were dead code that would have produced wrong shapes.
--nproc_per_node parameterized — Now ${NPROC_PER_NODE:-1}, supporting multi-GPU instances without code changes.

Follow-ups also addressed:

Training config table updated (HFlip intentionally omitted — standard transform doesn't flip box coords)
Stale pretrained param removed from create_model docstring

Regarding :latest base image (1.3): As noted earlier, :latest is used by all other test cases (FSDP, DDP, nanoVLM, trl, distillation). I've now pinned to a specific tag to address the CUDA 13 compatibility concern from PR #1070.

Regarding find_unused_parameters=True (3.6): Acknowledged your withdrawal — keeping the flag with the explanatory comment as final state.

Regarding ReduceOp.AVG bias (3.2 follow-up): Agreed this is "correct enough" for the 9-image demo val set. Can address in a follow-up if needed.

All changes tested on 2x ml.g5.8xlarge HyperPod EKS cluster — DDP active, checkpoints saving to /fsx/checkpoint, loss decreasing.

KeitaW

LGTM thanks @aravneelaws for addressing the comments!

aravneelaws added 4 commits April 22, 2026 18:26

KeitaW reviewed Apr 29, 2026

View reviewed changes

KeitaW requested changes Apr 29, 2026

View reviewed changes

aravneelaws added 2 commits April 30, 2026 17:21

KeitaW reviewed May 7, 2026

View reviewed changes

KeitaW approved these changes May 15, 2026

View reviewed changes

KeitaW merged commit 9c2a9a7 into main May 15, 2026
5 checks passed

KeitaW deleted the finetune_detr_rn50 branch May 15, 2026 06:57

	docker build -t ${REGISTRY}detr-finetune:latest -f Dockerfile .
	docker build -t ${REGISTRY}detr-finetune:v1 -f Dockerfile .

	REGISTRY_COUNT=$(aws ecr describe-repositories \| grep \"detr-finetune\" \| wc -l)
	aws ecr describe-repositories --repository-names detr-finetune \
	>/dev/null 2>&1 \|\| aws ecr create-repository --repository-name detr-finetune

		}


		def create_model(num_classes: int, pretrained: bool = True):

		- name: TRANSFORMERS_CACHE
		value: "/workspace/cache/transformers"


		### Different Classes

		Edit `meta.json` to define your own classes:

	Edit `meta.json` to define your own classes:
	Edit `meta.json` (place at `<data-dir>/meta.json` — see [data/README.md](data/README.md)) to define your own classes:

		Checkpoints are saved to the directory specified by `CHECKPOINT_DIR` environment
		variable (default: `/tmp/checkpoints`):

		# DETR-ResNet50 Object Detection Training Container
		# Base image: HPC-optimized with pre-configured EFA and NCCL for distributed training

	FROM public.ecr.aws/hpc-cloud/nccl-tests:latest
	FROM public.ecr.aws/hpc-cloud/nccl-tests:cuda12.9.1-efa1.47.0-ofiv1.18.0-ncclv2.29.3-1-testsv2.17.9

	python detr_main.py /path/to/data --epochs 50 --batch-size 4 --lr 1e-4 --pretrained --num-classes 2
	python detr_main.py /path/to/data --epochs 50 --batch-size 4 --lr 1e-4 --num-classes 2

	\| Augmentation \| HFlip, ColorJitter \| Applied during training only \|
	\| Augmentation \| ColorJitter \| HFlip omitted -- standard transform doesn't flip box coords \|

		num_classes: Number of target object classes (excluding background).
		pretrained: Whether to load pre-trained weights.

Conversation

aravneelaws commented Apr 28, 2026

Purpose

Changes

Test Plan

Environment:

Test commands:

Test Results

Directory Structure

Checklist

Uh oh!

KeitaW left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Review Batch 1/5 — Structure & Repository Hygiene

Directory placement under pytorch/ rather than under a library

Missing copyright/license headers

Uh oh!

Choose a reason for hiding this comment

Base image pinned to :latest

Uh oh!

Choose a reason for hiding this comment

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 2/5 — Deployment Pipeline (K8s)

Uh oh!

Choose a reason for hiding this comment

NCCL_SOCKET_IFNAME=eth0 breaks on EFA-equipped instances

Uh oh!

Choose a reason for hiding this comment

Worker manifest does not declare CPU/memory requests

Uh oh!

Choose a reason for hiding this comment

Image tag :latest in the deployment example

Uh oh!

Choose a reason for hiding this comment

aws ecr describe-repositories | grep ... is fragile

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 3/5 — Training Code Quality

Uh oh!

Choose a reason for hiding this comment

Checkpoint save fires on every rank when launched via torchrun

Uh oh!

Choose a reason for hiding this comment

Validation loss is rank-local; "best" decision diverges across ranks

Uh oh!

Choose a reason for hiding this comment

Image lookup loop is dead code

Uh oh!

Choose a reason for hiding this comment

Resume path depends on original_state_dict and breaks silently if absent

Uh oh!

Choose a reason for hiding this comment

--pretrained flag is documented but ignored

Uh oh!

Choose a reason for hiding this comment

find_unused_parameters=True may be masking a real graph

Uh oh!

Choose a reason for hiding this comment

Empty-batch loss path is incompatible with removing find_unused_parameters

Uh oh!

Choose a reason for hiding this comment

Per-batch loss is not the DETR matching loss

Uh oh!

Choose a reason for hiding this comment

No dist.destroy_process_group() at exit

Uh oh!

Choose a reason for hiding this comment

evaluate_map prints from every rank

Uh oh!

KeitaW left a comment

Choose a reason for hiding this comment

Review Batch 4/5 — Infrastructure & NCCL Configuration

Uh oh!

Choose a reason for hiding this comment

"Air-gapped" runtime claim has a qai_hub telemetry caveat

Uh oh!

KeitaW left a comment •

edited

Loading

Directory placement under `pytorch/` rather than under a library

Base image pinned to `:latest`

`NCCL_SOCKET_IFNAME=eth0` breaks on EFA-equipped instances

Image tag `:latest` in the deployment example

`aws ecr describe-repositories | grep ...` is fragile

Checkpoint save fires on every rank when launched via `torchrun`

Resume path depends on `original_state_dict` and breaks silently if absent

`--pretrained` flag is documented but ignored

`find_unused_parameters=True` may be masking a real graph

Empty-batch loss path is incompatible with removing `find_unused_parameters`

No `dist.destroy_process_group()` at exit

`evaluate_map` prints from every rank

"Air-gapped" runtime claim has a `qai_hub` telemetry caveat

`TRANSFORMERS_CACHE` is deprecated in favor of `HF_HOME`

Drop `TRANSFORMERS_CACHE` (deprecated) — `HF_HOME` alone is sufficient

`meta.json` location is unclear in this section

`CHECKPOINT_DIR` default in README differs from K8s default

Directory placement — `pytorch/ddp/detr-finetune/` matches the convention

KeitaW May 7, 2026 •

edited

Loading

Optional: MIT-0 license header — `Dockerfile`

KeitaW May 7, 2026 •

edited

Loading

`nccl-tests:latest` × `torch==2.5.1` is about to become a tested-config break