|
3 | 3 | This plugin brings deep AWS AI/ML expertise directly into your coding assistant, covering the surface area of [Amazon SageMaker AI](https://aws.amazon.com/sagemaker/ai/); currently, skills are provided to assist with the following capability areas: |
4 | 4 |
|
5 | 5 | - **Model Customization** — End-to-end guided workflows for fine-tuning foundation models, from use case definition through data preparation, training, evaluation, and deployment on Amazon SageMaker AI. |
6 | | -- **HyperPod Cluster Operations** — Remote command execution on nodes via SSM, version checking, and diagnostic reporting for SageMaker HyperPod training clusters. |
| 6 | +- **HyperPod Cluster Operations** — Remote command execution on nodes via SSM, version checking, diagnostic reporting, and deep debugging for SageMaker HyperPod training clusters. |
7 | 7 |
|
8 | 8 | ## Agent Skills |
9 | 9 |
|
10 | | -| # | Skill | Description | Documentation | |
11 | | -| -- | -------------------------- | ------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------- | |
12 | | -| 1 | `planning` | Builds a dynamic, step-by-step plan tailored to your intents | [SKILL.md](skills/planning/SKILL.md) | |
13 | | -| 2 | `directory-management` | Manages project directory setup, artifact organization, and plan association for new or existing projects | [SKILL.md](skills/directory-management/SKILL.md) | |
14 | | -| 3 | `use-case-specification` | Guided, conversational process to define your model customization use case goals, key stakeholders, and success criteria | [SKILL.md](skills/use-case-specification/SKILL.md) | |
15 | | -| 4 | `dataset-evaluation` | Dataset quality validation, format detection, and data requirements analysis | [SKILL.md](skills/dataset-evaluation/SKILL.md) | |
16 | | -| 5 | `dataset-transformation` | Dataset format conversion and preparation for SageMaker-compatible training formats | [SKILL.md](skills/dataset-transformation/SKILL.md) | |
17 | | -| 6 | `finetuning-setup` | Fine-tuning technique selection (SFT, DPO, RLVR, etc.) and base model selection | [SKILL.md](skills/finetuning-setup/SKILL.md) | |
18 | | -| 7 | `finetuning` | Hyperparameter configuration and training job execution | [SKILL.md](skills/finetuning/SKILL.md) | |
19 | | -| 8 | `model-evaluation` | Evaluation design, benchmark selection, LLM-as-a-judge, and model comparison | [SKILL.md](skills/model-evaluation/SKILL.md) | |
20 | | -| 9 | `model-deployment` | Deployment configuration and endpoint setup (SageMaker or Bedrock) | [SKILL.md](skills/model-deployment/SKILL.md) | |
21 | | -| 10 | `hyperpod-ssm` | Remote command execution and file transfer on HyperPod cluster nodes via SSM | [SKILL.md](skills/hyperpod-ssm/SKILL.md) | |
22 | | -| 11 | `hyperpod-version-checker` | Check and compare software component versions across HyperPod cluster nodes | [SKILL.md](skills/hyperpod-version-checker/SKILL.md) | |
23 | | -| 12 | `hyperpod-issue-report` | Generate diagnostic reports for HyperPod troubleshooting and support cases | [SKILL.md](skills/hyperpod-issue-report/SKILL.md) | |
| 10 | +| # | Skill | Description | Documentation | |
| 11 | +| -- | ------------------------------- | ------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------- | |
| 12 | +| 1 | `planning` | Builds a dynamic, step-by-step plan tailored to your intents | [SKILL.md](skills/planning/SKILL.md) | |
| 13 | +| 2 | `directory-management` | Manages project directory setup, artifact organization, and plan association for new or existing projects | [SKILL.md](skills/directory-management/SKILL.md) | |
| 14 | +| 3 | `use-case-specification` | Guided, conversational process to define your model customization use case goals, key stakeholders, and success criteria | [SKILL.md](skills/use-case-specification/SKILL.md) | |
| 15 | +| 4 | `dataset-evaluation` | Dataset quality validation, format detection, and data requirements analysis | [SKILL.md](skills/dataset-evaluation/SKILL.md) | |
| 16 | +| 5 | `dataset-transformation` | Dataset format conversion and preparation for SageMaker-compatible training formats | [SKILL.md](skills/dataset-transformation/SKILL.md) | |
| 17 | +| 6 | `finetuning-setup` | Fine-tuning technique selection (SFT, DPO, RLVR, etc.) and base model selection | [SKILL.md](skills/finetuning-setup/SKILL.md) | |
| 18 | +| 7 | `finetuning` | Hyperparameter configuration and training job execution | [SKILL.md](skills/finetuning/SKILL.md) | |
| 19 | +| 8 | `model-evaluation` | Evaluation design, benchmark selection, LLM-as-a-judge, and model comparison | [SKILL.md](skills/model-evaluation/SKILL.md) | |
| 20 | +| 9 | `model-deployment` | Deployment configuration and endpoint setup (SageMaker or Bedrock) | [SKILL.md](skills/model-deployment/SKILL.md) | |
| 21 | +| 10 | `hyperpod-ssm` | Remote command execution and file transfer on HyperPod cluster nodes via SSM | [SKILL.md](skills/hyperpod-ssm/SKILL.md) | |
| 22 | +| 11 | `hyperpod-version-checker` | Check and compare software component versions across HyperPod cluster nodes | [SKILL.md](skills/hyperpod-version-checker/SKILL.md) | |
| 23 | +| 12 | `hyperpod-issue-report` | Generate diagnostic reports for HyperPod troubleshooting and support cases | [SKILL.md](skills/hyperpod-issue-report/SKILL.md) | |
| 24 | +| 13 | `hyperpod-cluster-debugger` | Diagnose cluster-wide HyperPod problems — creation failures, EFA health, lifecycle scripts, capacity | [SKILL.md](skills/hyperpod-cluster-debugger/SKILL.md) | |
| 25 | +| 14 | `hyperpod-nccl` | Diagnose NCCL failures — training hangs, AllReduce timeouts, EFA errors, rendezvous failures | [SKILL.md](skills/hyperpod-nccl/SKILL.md) | |
| 26 | +| 15 | `hyperpod-node-debugger` | Diagnose per-node issues — GPU hardware, EFA, disk/memory pressure, container runtime | [SKILL.md](skills/hyperpod-node-debugger/SKILL.md) | |
| 27 | +| 16 | `hyperpod-performance-debugger` | Diagnose performance issues — uneven NCCL bandwidth, filesystem throughput, straggler nodes | [SKILL.md](skills/hyperpod-performance-debugger/SKILL.md) | |
| 28 | +| 17 | `hyperpod-slurm-debugger` | Diagnose Slurm scheduler issues — nodes stuck down/drain, jobs pending, GRES miscounts, auto-resume | [SKILL.md](skills/hyperpod-slurm-debugger/SKILL.md) | |
24 | 29 |
|
25 | 30 | ## MCP Servers |
26 | 31 |
|
@@ -99,12 +104,22 @@ The HyperPod skills provide operational tooling for Amazon SageMaker HyperPod AI |
99 | 104 | - **`hyperpod-ssm`** — Run commands and transfer files on cluster nodes via AWS Systems Manager (SSM), without needing direct SSH access. |
100 | 105 | - **`hyperpod-version-checker`** — Check and compare software component versions (drivers, libraries, frameworks) across cluster nodes to identify drift or incompatibilities. |
101 | 106 | - **`hyperpod-issue-report`** — Generate comprehensive issue reports that collect system state, logs, and configuration details for troubleshooting or support case submission. |
| 107 | +- **`hyperpod-cluster-debugger`** — Diagnose cluster-wide problems including creation/deployment failures, EFA health checks, lifecycle script errors, and capacity issues. |
| 108 | +- **`hyperpod-nccl`** — Diagnose NCCL failures and training-pod issues such as AllReduce timeouts, EFA/libfabric errors, rendezvous failures, and container OOM. |
| 109 | +- **`hyperpod-node-debugger`** — Diagnose per-node issues including GPU hardware faults (XID, ECC, NVLink), EFA, disk/memory pressure, and container runtime problems. |
| 110 | +- **`hyperpod-performance-debugger`** — Diagnose performance bottlenecks such as uneven NCCL bandwidth across nodes, filesystem throughput issues, and straggler nodes. |
| 111 | +- **`hyperpod-slurm-debugger`** — Diagnose Slurm scheduler and node-daemon issues including nodes stuck in down/drain, jobs pending, GRES miscounts, and auto-resume failures. |
102 | 112 |
|
103 | 113 | ### Examples |
104 | 114 |
|
105 | 115 | - "Check the GPU memory usage on all nodes in my HyperPod cluster using SSM" |
106 | 116 | - "Check driver versions on my HyperPod cluster" |
107 | 117 | - "Generate an issue report for my HyperPod cluster" |
| 118 | +- "My HyperPod cluster creation failed, help me debug it" |
| 119 | +- "Training is hanging with NCCL timeout errors" |
| 120 | +- "A node in my cluster is unhealthy, diagnose it" |
| 121 | +- "My training is slower than expected across nodes" |
| 122 | +- "Slurm jobs are stuck pending even though nodes show idle" |
108 | 123 |
|
109 | 124 | ## Supported Environments |
110 | 125 |
|
|