Commit 95381e8

feat(sagemaker-ai): add HyperPod debugging skills (#169)
Add new skills for diagnosing and troubleshooting HyperPod clusters:

- hyperpod-cluster-debugger: cluster-wide diagnostics
- hyperpod-nccl: NCCL failure diagnosis
- hyperpod-node-debugger: per-node issue triage
- hyperpod-performance-debugger: performance bottleneck analysis
- hyperpod-slurm-debugger: Slurm scheduler issues

Also updates hyperpod-ssm, hyperpod-version-checker, and hyperpod-issue-report with related improvements. Updates README with new skill documentation.
1 parent 38a9152 commit 95381e8

32 files changed

Lines changed: 13840 additions & 28 deletions

plugins/sagemaker-ai/README.md

Lines changed: 30 additions & 15 deletions
@@ -3,24 +3,29 @@
 This plugin brings deep AWS AI/ML expertise directly into your coding assistant, covering the surface area of [Amazon SageMaker AI](https://aws.amazon.com/sagemaker/ai/); currently, skills are provided to assist with the following capability areas:
 
 - **Model Customization** — End-to-end guided workflows for fine-tuning foundation models, from use case definition through data preparation, training, evaluation, and deployment on Amazon SageMaker AI.
-- **HyperPod Cluster Operations** — Remote command execution on nodes via SSM, version checking, and diagnostic reporting for SageMaker HyperPod training clusters.
+- **HyperPod Cluster Operations** — Remote command execution on nodes via SSM, version checking, diagnostic reporting, and deep debugging for SageMaker HyperPod training clusters.
 
 ## Agent Skills
 
-| # | Skill | Description | Documentation |
-| -- | -------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------- |
-| 1 | `planning` | Builds a dynamic, step-by-step plan tailored to your intents | [SKILL.md](skills/planning/SKILL.md) |
-| 2 | `directory-management` | Manages project directory setup, artifact organization, and plan association for new or existing projects | [SKILL.md](skills/directory-management/SKILL.md) |
-| 3 | `use-case-specification` | Guided, conversational process to define your model customization use case goals, key stakeholders, and success criteria | [SKILL.md](skills/use-case-specification/SKILL.md) |
-| 4 | `dataset-evaluation` | Dataset quality validation, format detection, and data requirements analysis | [SKILL.md](skills/dataset-evaluation/SKILL.md) |
-| 5 | `dataset-transformation` | Dataset format conversion and preparation for SageMaker-compatible training formats | [SKILL.md](skills/dataset-transformation/SKILL.md) |
-| 6 | `finetuning-setup` | Fine-tuning technique selection (SFT, DPO, RLVR, etc.) and base model selection | [SKILL.md](skills/finetuning-setup/SKILL.md) |
-| 7 | `finetuning` | Hyperparameter configuration and training job execution | [SKILL.md](skills/finetuning/SKILL.md) |
-| 8 | `model-evaluation` | Evaluation design, benchmark selection, LLM-as-a-judge, and model comparison | [SKILL.md](skills/model-evaluation/SKILL.md) |
-| 9 | `model-deployment` | Deployment configuration and endpoint setup (SageMaker or Bedrock) | [SKILL.md](skills/model-deployment/SKILL.md) |
-| 10 | `hyperpod-ssm` | Remote command execution and file transfer on HyperPod cluster nodes via SSM | [SKILL.md](skills/hyperpod-ssm/SKILL.md) |
-| 11 | `hyperpod-version-checker` | Check and compare software component versions across HyperPod cluster nodes | [SKILL.md](skills/hyperpod-version-checker/SKILL.md) |
-| 12 | `hyperpod-issue-report` | Generate diagnostic reports for HyperPod troubleshooting and support cases | [SKILL.md](skills/hyperpod-issue-report/SKILL.md) |
+| # | Skill | Description | Documentation |
+| -- | ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------- |
+| 1 | `planning` | Builds a dynamic, step-by-step plan tailored to your intents | [SKILL.md](skills/planning/SKILL.md) |
+| 2 | `directory-management` | Manages project directory setup, artifact organization, and plan association for new or existing projects | [SKILL.md](skills/directory-management/SKILL.md) |
+| 3 | `use-case-specification` | Guided, conversational process to define your model customization use case goals, key stakeholders, and success criteria | [SKILL.md](skills/use-case-specification/SKILL.md) |
+| 4 | `dataset-evaluation` | Dataset quality validation, format detection, and data requirements analysis | [SKILL.md](skills/dataset-evaluation/SKILL.md) |
+| 5 | `dataset-transformation` | Dataset format conversion and preparation for SageMaker-compatible training formats | [SKILL.md](skills/dataset-transformation/SKILL.md) |
+| 6 | `finetuning-setup` | Fine-tuning technique selection (SFT, DPO, RLVR, etc.) and base model selection | [SKILL.md](skills/finetuning-setup/SKILL.md) |
+| 7 | `finetuning` | Hyperparameter configuration and training job execution | [SKILL.md](skills/finetuning/SKILL.md) |
+| 8 | `model-evaluation` | Evaluation design, benchmark selection, LLM-as-a-judge, and model comparison | [SKILL.md](skills/model-evaluation/SKILL.md) |
+| 9 | `model-deployment` | Deployment configuration and endpoint setup (SageMaker or Bedrock) | [SKILL.md](skills/model-deployment/SKILL.md) |
+| 10 | `hyperpod-ssm` | Remote command execution and file transfer on HyperPod cluster nodes via SSM | [SKILL.md](skills/hyperpod-ssm/SKILL.md) |
+| 11 | `hyperpod-version-checker` | Check and compare software component versions across HyperPod cluster nodes | [SKILL.md](skills/hyperpod-version-checker/SKILL.md) |
+| 12 | `hyperpod-issue-report` | Generate diagnostic reports for HyperPod troubleshooting and support cases | [SKILL.md](skills/hyperpod-issue-report/SKILL.md) |
+| 13 | `hyperpod-cluster-debugger` | Diagnose cluster-wide HyperPod problems — creation failures, EFA health, lifecycle scripts, capacity | [SKILL.md](skills/hyperpod-cluster-debugger/SKILL.md) |
+| 14 | `hyperpod-nccl` | Diagnose NCCL failures — training hangs, AllReduce timeouts, EFA errors, rendezvous failures | [SKILL.md](skills/hyperpod-nccl/SKILL.md) |
+| 15 | `hyperpod-node-debugger` | Diagnose per-node issues — GPU hardware, EFA, disk/memory pressure, container runtime | [SKILL.md](skills/hyperpod-node-debugger/SKILL.md) |
+| 16 | `hyperpod-performance-debugger` | Diagnose performance issues — uneven NCCL bandwidth, filesystem throughput, straggler nodes | [SKILL.md](skills/hyperpod-performance-debugger/SKILL.md) |
+| 17 | `hyperpod-slurm-debugger` | Diagnose Slurm scheduler issues — nodes stuck down/drain, jobs pending, GRES miscounts, auto-resume | [SKILL.md](skills/hyperpod-slurm-debugger/SKILL.md) |
 
 ## MCP Servers
 
@@ -99,12 +104,22 @@ The HyperPod skills provide operational tooling for Amazon SageMaker HyperPod AI
 - **`hyperpod-ssm`** — Run commands and transfer files on cluster nodes via AWS Systems Manager (SSM), without needing direct SSH access.
 - **`hyperpod-version-checker`** — Check and compare software component versions (drivers, libraries, frameworks) across cluster nodes to identify drift or incompatibilities.
 - **`hyperpod-issue-report`** — Generate comprehensive issue reports that collect system state, logs, and configuration details for troubleshooting or support case submission.
+- **`hyperpod-cluster-debugger`** — Diagnose cluster-wide problems including creation/deployment failures, EFA health checks, lifecycle script errors, and capacity issues.
+- **`hyperpod-nccl`** — Diagnose NCCL failures and training-pod issues such as AllReduce timeouts, EFA/libfabric errors, rendezvous failures, and container OOM.
+- **`hyperpod-node-debugger`** — Diagnose per-node issues including GPU hardware faults (XID, ECC, NVLink), EFA, disk/memory pressure, and container runtime problems.
+- **`hyperpod-performance-debugger`** — Diagnose performance bottlenecks such as uneven NCCL bandwidth across nodes, filesystem throughput issues, and straggler nodes.
+- **`hyperpod-slurm-debugger`** — Diagnose Slurm scheduler and node-daemon issues including nodes stuck in down/drain, jobs pending, GRES miscounts, and auto-resume failures.
 
 ### Examples
 
 - "Check the GPU memory usage on all nodes in my HyperPod cluster using SSM"
 - "Check driver versions on my HyperPod cluster"
 - "Generate an issue report for my HyperPod cluster"
+- "My HyperPod cluster creation failed, help me debug it"
+- "Training is hanging with NCCL timeout errors"
+- "A node in my cluster is unhealthy, diagnose it"
+- "My training is slower than expected across nodes"
+- "Slurm jobs are stuck pending even though nodes show idle"
 
 ## Supported Environments
 