name: hyperpod-performance-debugger
description: Diagnose performance issues on Amazon SageMaker HyperPod clusters — uneven NCCL bandwidth across nodes and poor filesystem throughput. Read-only. Surfaces host-side signals (Xid, ECC, NVLink, EFA reachability, FSx saturation) and routes to the appropriate sibling skill (hyperpod-node-debugger, hyperpod-nccl, hyperpod-version-checker, hyperpod-issue-report) for any remediation. Triggers on uneven NCCL across nodes, straggler node, FSx slow, checkpoint slow, dataloader slow, filesystem bottleneck, FSx throughput, cross-AZ latency, topology mismatch.
metadata:
  version: 0.0.1

HyperPod Performance Debugger

This skill covers two scenarios:

  1. Uneven NCCL performance across nodes: the workload runs faster on some node sets than others, pairwise bandwidth varies, or a straggler is suspected.
  2. Poor filesystem performance: training stalls on data loading, checkpoint save/load dominates step time, or FSx throughput is saturated.

Scope and delegation

Route findings outside the two in-scope scenarios to the owner skill below.

| Concern observed | Route to |
| --- | --- |
| GPU hardware fault, ECC, NVLink, Xid, DCGM diagnostics, drain/replace | hyperpod-node-debugger (§ F Hardware/Auto-Repair, § G GPU) |
| Cannot allocate memory at os.fork(), root volume exhausted | hyperpod-node-debugger (§ I Resource Exhaustion) |
| NCCL timeouts, hangs, AllReduce stalls, EFA TCP fallback, RDMA memlock | hyperpod-nccl |
| EFA / NCCL / CUDA / NVIDIA driver version drift across nodes | hyperpod-version-checker |
| EFA self-referencing security-group rule missing (single node) | hyperpod-node-debugger § A (EFA / Security Group) |
| EFA self-referencing security-group rule missing (cluster-wide) | hyperpod-cluster-debugger § A (EFA Health Checks) |
| Slurm node state changes (drain / resume / reboot) | hyperpod-slurm-debugger |
| Diagnostic bundle for AWS Support | hyperpod-issue-report |
| Shell access on a node | hyperpod-ssm |

Operating policy

  • Read-only. Print commands the customer runs; do not execute commands that modify state.
  • Container vs host version comparisons go through hyperpod-version-checker.
  • Xid lines, ECC counts, NVLink lane state, and thermal readings get surfaced; the catalog and verdict live in hyperpod-node-debugger § G.

Workflow

  1. Confirm the symptom is uneven NCCL or poor filesystem performance. If neither, route to the matching sibling skill above.
  2. Run scripts/perf-snapshot.sh (read-only) to gather host-side signals for the suspect node and FSx filesystems mounted on it.
  3. For each [CONCERN] line in the script output, open the matching section below and read the supporting reference.
  4. After the per-incident diagnosis, recommend the HyperPod platform health features in § Continuous health coverage so the customer gets ongoing protection.

Step 1: Run the snapshot

bash scripts/perf-snapshot.sh --cluster <CLUSTER_NAME_OR_ARN> --region <REGION>

# Scope to one suspect node:
bash scripts/perf-snapshot.sh --cluster <CLUSTER_NAME_OR_ARN> --region <REGION> --node <INSTANCE_ID>

The script samples one node by default. It collects host-side data via hyperpod-ssm: nvidia-smi output (temperature, SM clocks, PCIe link width, ECC, NVLink, topo -m), recent dmesg Xid lines, EFA port state and fi_info provider visibility, EFA installer + kernel module versions, CPU governor, NVL72 Fabric Manager state, FSx CloudWatch utilization, df -h / lfs df -h per mount, host iowait, /dev/shm size, and root-volume usage. All read-only.

Tags: [OK] healthy · [CONCERN] signal worth investigating (carries a pointer to the owner skill) · [INFO] informational.
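To work through the output tag by tag, a small filter can help. The sketch below assumes each finding is a single line carrying a literal tag such as [CONCERN]; the exact line layout of perf-snapshot.sh is not specified here, and the function names and the snapshot.txt file name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes one finding per line, each containing a literal
# [OK], [CONCERN], or [INFO] tag somewhere in the line.

# concern_lines: read snapshot output on stdin, print only the [CONCERN] lines.
concern_lines() {
  grep -F '[CONCERN]' || true   # no matches is not treated as an error
}

# tag_counts: read snapshot output on stdin, print "<count> <tag>" per tag seen.
tag_counts() {
  grep -oE '\[(OK|CONCERN|INFO)\]' | sort | uniq -c | awk '{print $1, $2}'
}

# Usage (snapshot.txt is a hypothetical file name):
#   bash scripts/perf-snapshot.sh --cluster <CLUSTER_NAME_OR_ARN> --region <REGION> | tee snapshot.txt
#   tag_counts < snapshot.txt
#   concern_lines < snapshot.txt
```

Both functions only read text, so they stay within the read-only policy.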

Host vs container scope. The script runs on the host via SSM and reports host-scope values. Many setups ship the EFA / libfabric / OFI-NCCL / CUDA stack inside the training container by design — a host value of unknown is not by itself a defect. What matters for performance is the stack the workload actually uses. Verify versions inside the container (and across nodes) via hyperpod-version-checker before drawing conclusions.

Step 2: Match signal → section

Observation Section
Pairwise NCCL bandwidth varies across node pairs / suspected straggler A: Uneven NCCL Performance
Nodes spread across AZs / network-node-layer labels / UltraServer boundaries A
EFA port not ACTIVE on a node, missing OFI plugin, or FI provider not visible A + route to hyperpod-node-debugger § A; hyperpod-version-checker for cross-node version compare
iostat shows high iowait, FSx CloudWatch utilization sustained near 100% B: Poor Filesystem Performance
DataLoader stalls, checkpoint dominates step time B
Xid line in dmesg, uncorrectable ECC, inactive NVLink lane, GPU ≥ 88°C Route to hyperpod-node-debugger § G
Container vs host version drift suspected Route to hyperpod-version-checker
Cannot allocate memory at os.fork(), root volume full, OOM events Route to hyperpod-node-debugger § I
NCCL timeout, hang, TCP fallback (NET/OFI Using TCP), RDMA memlock Route to hyperpod-nccl

A: Uneven NCCL Performance

The customer reports identical training jobs running with different step times on different node sets, pairwise bandwidth variance, or some allocations consistently slower than others despite identical code.

Per the official troubleshooting guide, the common contributing factors are network topology differences between nodes (cross-AZ, cross-rack, cross-UltraServer), degraded EFA performance on some nodes, mixed instance types or generations within an instance group, and CPU frequency scaling differences.

Diagnostic pass (read-only)

The host-side data points — GPU thermal/ECC/PCIe/clocks, Xid, NVLink lanes, EFA port state and provider visibility, CPU governor, EFA/OFI/driver versions, nvidia-smi topo -m — are all collected by scripts/perf-snapshot.sh (Step 1 above). The script tags [CONCERN] with thresholds and emits routing pointers; rerun it per suspect node via --node <INSTANCE_ID>.
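For a manual cross-check of one of these signals, the CPU governor can be read straight from sysfs. This is a minimal sketch, not the script's own logic: it assumes the standard Linux cpufreq layout under /sys/devices/system/cpu, and governor_summary is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: summarizes CPU frequency governors on one host (read-only).

# governor_summary: print "<count> <governor>" for each distinct scaling governor.
# $1 is the sysfs CPU root; it defaults to the standard Linux location.
governor_summary() {
  local root=${1:-/sys/devices/system/cpu}
  cat "$root"/cpu*/cpufreq/scaling_governor 2>/dev/null \
    | sort | uniq -c | awk '{print $1, $2}'
}

# Usage on a node:
#   governor_summary
```

More than one output line on a node, or different governors between two nodes, is the kind of frequency-scaling difference worth comparing against step-time variance.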

For driver / CUDA / NCCL / EFA / OFI version drift across nodes, run the hyperpod-version-checker skill.

Pairwise NCCL bandwidth test

Run the standard nccl-tests recipes from awslabs/awsome-distributed-training. For an N-node cluster, run all-reduce across every pair of nodes (N(N-1)/2 pairs) and record busbw for each pair. Pairs more than ~5% below the run mean (the threshold the AWS validation script flags) are straggler candidates.

Expected busbw per SKU is published in the AI-on-HyperPod NCCL test guide. Benchmark the specific instance type before relying on a number.

Pairwise scripts, HyperPod topology surfaces (HyperPod API, EKS labels, Slurm topology.conf), and GB200 NVL72 specifics are in references/perf-details.md § Uneven NCCL.

Topology verification

HyperPod exposes topology through three operator-visible surfaces:

  • HyperPod API: aws sagemaker describe-cluster-node returns NodeDetails.Placement.AvailabilityZone / AvailabilityZoneId and NodeDetails.UltraServerInfo.Id (UltraServer SKUs only).
  • EKS labels: topology.kubernetes.io/zone, topology.k8s.aws/network-node-layer-{1,2,3} (highest-numbered = closest to instance), topology.k8s.aws/ultraserver-id.
  • Slurm: HyperPod auto-generates topology.conf. Inspect via scontrol show topology.

Tightly coupled work should share the same AZ, the same highest-numbered network-node-layer label (EKS) or the same Slurm topology block, and, for NVL72 jobs, the same UltraServerInfo.Id / topology.k8s.aws/ultraserver-id. If the cluster is spread across AZs or layers, topology must be re-established at provisioning time. Route provisioning changes to hyperpod-cluster-debugger § B (Capacity & AZ).
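On an EKS-orchestrated cluster, the spread can be checked by counting nodes per label value. The sketch below is a minimal example under assumptions: group_counts is a hypothetical helper, and the kubectl invocation in the usage comment relies on custom-columns with escaped dots in the label key, which should be verified against the installed kubectl version.

```shell
#!/usr/bin/env bash
# Sketch only: counts nodes per topology value (read-only).

# group_counts: stdin has "<node> <value>" per line;
# prints "<count> <value>" for each distinct value, sorted by value.
group_counts() {
  awk 'NF >= 2 { c[$2]++ } END { for (v in c) print c[v], v }' | sort -k2
}

# Usage (zone label shown; swap in another topology label as needed):
#   kubectl get nodes --no-headers \
#     -o 'custom-columns=NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone' \
#     | group_counts
```

More than one output line means the nodes in the listing span more than one value of that label, for example more than one AZ.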


B: Poor Filesystem Performance

The customer reports training bottlenecked on data loading, checkpoint save/load dominating step time, executables/scripts loading slowly, or iowait high.

Per the official troubleshooting guide, the resolution path follows this order:

  1. Check CloudWatch metrics on the filesystem.
  2. Check the provisioned performance configuration against workload requirements.
  3. Investigate which operations are causing the I/O — workload demand vs inefficient pattern.
  4. Consider upgrading provisioned performance.
  5. Choose the filesystem type that matches the I/O pattern.

This skill covers steps 1–3. Steps 4–5 are customer decisions; surface the data and let the customer pick.

Diagnostic pass (read-only)

scripts/perf-snapshot.sh (Step 1 above) covers the on-node side of this pass: it discovers FSx mounts, calls aws cloudwatch get-metric-statistics on DataReadBytes and (for OpenZFS) FileServerDiskIopsUtilization, prints df -h for /fsx, /opt/dlami/nvme, and /opt/sagemaker, runs lfs df -h per Lustre mount, and reports iostat iowait. It tags [CONCERN] when OpenZFS IOPS utilization sustains ≥ 80% or iowait > 20%.

For longer windows or additional metrics (DataWriteBytes, Lustre DiskIopsUtilization, OpenZFS FileServerDiskThroughputUtilization), drive the query directly:

aws cloudwatch get-metric-statistics --region <REGION> \
  --namespace AWS/FSx --metric-name DataReadBytes \
  --dimensions Name=FileSystemId,Value=<FSID> \
  --start-time "$(date -u -d '3 hours ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time   "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 60 --statistics Sum Maximum

The full per-filesystem-type metric catalog is in references/perf-details.md § Filesystem.
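The Sum statistic for DataReadBytes is a byte count per period, so comparing it with provisioned throughput needs a division by the period. The sketch below shows that conversion; sum_to_mbps is a hypothetical helper, and it reports decimal MB/s (1 MB = 1,000,000 bytes), which is an assumption to check against the unit used for the filesystem's provisioned throughput.

```shell
#!/usr/bin/env bash
# Sketch only: converts per-period byte sums into a throughput rate.

# sum_to_mbps: stdin has one Sum value (bytes per period) per line.
# $1 is the period in seconds (default 60). Prints one MB/s value per line.
sum_to_mbps() {
  local period=${1:-60}
  awk -v p="$period" 'NF { printf "%.1f\n", $1 / p / 1000000 }'
}

# Usage (append to the query above, requesting only the Sum statistic):
#   aws cloudwatch get-metric-statistics <same arguments as the query above> \
#     --query 'Datapoints[].Sum' --output text | tr '\t' '\n' | sum_to_mbps 60
```

For example, a 60-second datapoint with a Sum of 6,000,000,000 bytes corresponds to 100 MB/s of read throughput.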

Branches

Provisioned capacity is saturated. CloudWatch utilization sustained near 100% across the workload window. Customer decision: scale up the filesystem.

  • FSx for Lustre throughput scales with StorageCapacity × PerUnitStorageThroughput; capacity changes are non-disruptive.
  • FSx for OpenZFS — increase provisioned IOPS or throughput.
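As a worked example of the Lustre scaling rule, the product can be computed directly. This sketch assumes PerUnitStorageThroughput is expressed in MB/s per TiB and that the capacity has already been converted to TiB (the API reports StorageCapacity in GiB); lustre_throughput is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: baseline FSx for Lustre throughput from the two provisioned values.

# lustre_throughput: $1 = storage capacity in TiB,
#                    $2 = PerUnitStorageThroughput in MB/s per TiB.
# Prints the resulting throughput in MB/s, rounded to a whole number.
lustre_throughput() {
  awk -v tib="$1" -v per="$2" 'BEGIN { printf "%.0f\n", tib * per }'
}

# Usage:
#   lustre_throughput 2.4 50
```

Comparing this figure with the observed peak from CloudWatch gives the customer a concrete basis for deciding whether to raise capacity or the per-unit throughput tier.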

I/O pattern is inefficient. CloudWatch shows headroom but the workload is still I/O-bound. Customer decision: change the application.

  • DataLoader: raise num_workers, and set pin_memory=True and persistent_workers=True.
  • Checkpointing: use async + sharded (torch.distributed.checkpoint.async_save plus FSDP SHARDED_STATE_DICT). FULL_STATE_DICT serializes through rank 0 and is a frequent root cause.
  • Small-file workloads: Lustre is optimized for large sequential I/O. For millions of small files, use WebDataset / tar shards, FSx for OpenZFS, or NVMe scratch.

Filesystem-selection guidance and the async-checkpoint pattern are in references/perf-details.md § Filesystem.


Continuous health coverage

Once the immediate incident is diagnosed, recommend HyperPod's built-in health features so problems are caught before the next training run rather than after another customer-reported regression.

  • Enable NodeRecovery=Automatic on the cluster. The Health Monitoring Agent (HMA) continuously monitors GPU- and Trainium-based instances and marks instances unhealthy on detected failure. With auto-recovery enabled, HyperPod reboots or replaces the node — no operator intervention.

  • Enable OnStartDeepHealthChecks on every GPU instance group with both check categories:

    • InstanceStress: stress-ng on CPU/memory/disk, GPU and PCI device count verification, DCGM level-4 diagnostics (memory test included), and EFA loopback bandwidth/latency.
    • InstanceConnectivity: multi-node NCCL all-reduce.

    Every newly provisioned or auto-replaced node passes the same hardware bar before accepting jobs.

  • Run on-demand deep health checks when this skill or any sibling surfaces a hardware concern but the cluster is mid-workload. aws sagemaker start-cluster-health-check runs the same checks against a specific instance group; nodes are placed in a Slurm maintenance reservation and the check is queued until any running job completes (not preempted). Console: HyperPod → Clusters → Instances → Run deep health checks.

    Not supported when NodeProvisioningMode=Continuous; one on-demand request per cluster at a time. Requires the latest AMI — run UpdateClusterSoftware first.

Logs land in CloudWatch at /aws/sagemaker/Clusters/<cluster_name>/<cluster_id> under DeepHealthCheckResults/<log_stream_id>, and on each node at /var/log/aws/clusters/sagemaker-deep-health-check.log.
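To locate the result streams from the CLI, the log group name can be assembled from the cluster name and ID and passed to a read-only listing call. The helper name dhc_log_group is hypothetical, and the stream prefix is taken from the path described above rather than verified against a live cluster.

```shell
#!/usr/bin/env bash
# Sketch only: builds the deep-health-check log group name (no AWS calls made here).

# dhc_log_group: $1 = cluster name, $2 = cluster ID.
dhc_log_group() {
  printf '/aws/sagemaker/Clusters/%s/%s\n' "$1" "$2"
}

# Usage (read-only listing of matching log streams):
#   aws logs describe-log-streams --region <REGION> \
#     --log-group-name "$(dhc_log_group <cluster_name> <cluster_id>)" \
#     --log-stream-name-prefix DeepHealthCheckResults/
```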

References

  • references/perf-details.md — pairwise NCCL test recipes, HyperPod topology check, GB200 NVL72 placement; CloudWatch metric catalog per filesystem type, async-checkpoint pattern, filesystem selection guide.

External: