Commit 5aa4c85

zeel2104, akoumpa, and jgerh authored and committed

docs: add SkyPilot Kubernetes tutorial (#1667)

* docs: add SkyPilot Kubernetes tutorial
* Update docs/launcher/overview.md
* docs: apply tech pubs launcher copyedits

Signed-off-by: Zeel <desaizeel2128@gmail.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>

1 parent 1dcb225 commit 5aa4c85

8 files changed

Lines changed: 537 additions & 23 deletions

docs/index.md

Lines changed: 1 addition & 0 deletions
````diff
@@ -276,6 +276,7 @@ Local Workstation <launcher/local-workstation.md>
 SLURM Cluster <launcher/slurm.md>
 NeMo-Run <launcher/nemo-run.md>
 SkyPilot <launcher/skypilot.md>
+SkyPilot k8s <launcher/skypilot-kubernetes.md>
 ::::

 ::::{toctree}
````

docs/launcher/overview.md

Lines changed: 17 additions & 7 deletions
````diff
@@ -7,21 +7,21 @@ NeMo AutoModel provides several ways to launch training. The right choice depend
 | Launcher | Best for | GPUs | Guide |
 |---|---|---|---|
 | **Local Workstation** | Getting started, debugging, single-node training | 1-8 on one machine | [Local Workstation](./local-workstation.md) |
-| **Slurm** | Multi-node batch jobs on HPC clusters | 8+ across nodes | [Slurm](./slurm.md) |
 | **NeMo-Run** | Managed execution on Slurm, Kubernetes, Docker, local | 1+ | [NeMo-Run](./nemo-run.md) |
-| **SkyPilot** | Cloud training (AWS, GCP, Azure) with spot pricing | Any | [SkyPilot](./skypilot.md) |
+| **SkyPilot** | Cloud training or Kubernetes clusters | Any | [SkyPilot](./skypilot.md) |
+| **Slurm** | Multi-node batch jobs on HPC clusters | 8+ across nodes | [Slurm](./slurm.md) |

-### I have 1-2 GPUs on my workstation
+### I Have 1–2 GPUs on My Workstation

-Use the **interactive** launcher. No scheduler or cluster software needed:
+Use the **interactive** launcher. No scheduler or cluster software is needed:

 ```bash
 automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml
 ```

 See the [Local Workstation](./local-workstation.md) guide.

-### I have access to a Slurm cluster
+### I Have Access to a Slurm Cluster

 Add a `slurm:` section to your YAML config and submit with the same `automodel` command. The CLI generates the `torchrun` invocation and calls `sbatch` for you:

@@ -31,7 +31,7 @@ automodel config_with_slurm.yaml
 See the [Slurm](./slurm.md) guide.

-### I want managed job submission (Slurm, Kubernetes, Docker)
+### I Want Managed Job Submission (Slurm, Kubernetes, Docker)

 Add a `nemo_run:` section to your YAML config. NeMo-Run loads a pre-configured executor for your compute target and submits the job:

@@ -41,7 +41,7 @@ automodel config_with_nemo_run.yaml
 See the [NeMo-Run](./nemo-run.md) guide.

-### I want to train on the cloud
+### I Want to Train on the Cloud

 Add a `skypilot:` section to your YAML config. SkyPilot provisions VMs on any major cloud and handles spot-instance preemption automatically:

@@ -51,6 +51,16 @@ automodel config_with_skypilot.yaml
 See the [SkyPilot](./skypilot.md) guide.

+### I Want to Train on Kubernetes with SkyPilot
+
+Use the same `skypilot:` launcher, but set `cloud: kubernetes`. This is a good fit when your team already has a GPU-backed Kubernetes cluster and you want SkyPilot to handle job submission and multi-node orchestration:
+
+```bash
+automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml
+```
+
+See the [SkyPilot + Kubernetes tutorial](./skypilot-kubernetes.md).
+
 ## All Launchers Use the Same Config

 Every launcher shares the same YAML recipe format. The only difference is an optional launcher section (`slurm:`, `nemo_run:`, or `skypilot:`) that tells the CLI where to run. Without a launcher section, training runs interactively on the current machine.
````
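To make that shared format concrete, here is a minimal illustrative recipe; the file name and field values are placeholders, not a config shipped in the repo:

```yaml
# my_finetune.yaml (illustrative; values are placeholders)
model:
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

# Optional launcher section. Delete this block to run the same recipe
# interactively on the current machine.
skypilot:
  cloud: kubernetes
  accelerators: H100:1
```

The point is that switching launchers never changes the training fields, only this optional block.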
docs/launcher/skypilot-kubernetes.md

Lines changed: 235 additions & 0 deletions

# SkyPilot + Kubernetes Tutorial

This tutorial shows how to run NeMo AutoModel on a Kubernetes cluster through SkyPilot.

You will:

1. Check that SkyPilot can see your Kubernetes cluster and GPUs.
2. Launch a small NeMo AutoModel fine-tuning job on one GPU.
3. Scale the same job to two nodes.
4. Follow logs and clean everything up when you are done.

This guide is written for new AutoModel users, so it keeps the moving pieces as small as possible.

## Before you begin

You need:

- a working Kubernetes context in `kubectl`
- at least one GPU-backed node in the cluster
- SkyPilot installed with Kubernetes support
- a local NeMo AutoModel checkout
- a Hugging Face token in `HF_TOKEN` if you plan to use a gated model such as Llama

If you are setting up SkyPilot on Kubernetes for the first time, see the official SkyPilot Kubernetes setup guide:

- <https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html>

Install the SkyPilot Kubernetes client in your AutoModel environment:

```bash
uv pip install "skypilot[kubernetes]"
```

Set the token once in your shell:

```bash
export HF_TOKEN=hf_your_token_here
```

## Step 1: Verify the cluster

Start with three quick checks:

```bash
kubectl config current-context
kubectl get nodes
sky check kubernetes
```

You want `sky check kubernetes` to report that Kubernetes is enabled.

Next, ask SkyPilot which GPUs it can request from the cluster:

```bash
sky show-gpus --infra k8s
```

Example output:

```text
$ sky show-gpus --infra k8s
Kubernetes GPUs
GPU   REQUESTABLE_QTY_PER_NODE  UTILIZATION
L4    1, 2, 4                   8 of 8 free
H100  1, 2, 4, 8                8 of 8 free

Kubernetes per node GPU availability
NODE        GPU   UTILIZATION
gpu-node-a  H100  8 of 8 free
```

If you do not see any GPUs here, stop and fix the Kubernetes or SkyPilot setup first. AutoModel is ready, but SkyPilot still cannot place GPU jobs.

## Step 2: Run a single-node job

The easiest starting point is a one-GPU fine-tune using the existing Llama 3.2 1B SQuAD example.

This repository now includes a Kubernetes-flavored SkyPilot config at [`examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml`](../../examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml).

Launch it from the repo root:

```bash
automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml
```

The important part of that YAML is the `skypilot:` block:

```yaml
skypilot:
  cloud: kubernetes
  accelerators: H100:1
  use_spot: false
  disk_size: 200
  job_name: llama3-2-1b-k8s
  hf_token: ${HF_TOKEN}
```

What AutoModel does for you:

- writes a launcher-free copy of the training config to `skypilot_jobs/<timestamp>/job_config.yaml`
- syncs the repo to the SkyPilot workdir
- runs `torchrun` on the Kubernetes worker pod
- forwards your training config unchanged after removing the `skypilot:` section
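The first and last bullets boil down to dropping the launcher key and re-serializing what remains. Here is a hypothetical Python sketch of that idea; the function name and structure are invented for illustration and are not AutoModel's actual implementation:

```python
def strip_launcher_sections(config: dict) -> dict:
    """Return a copy of a training config without launcher-only keys.

    Hypothetical sketch of what a launcher could do before forwarding
    the config to the remote job; AutoModel's real code may differ.
    """
    # The launcher sections documented for AutoModel configs.
    launcher_keys = {"skypilot", "slurm", "nemo_run"}
    return {k: v for k, v in config.items() if k not in launcher_keys}
```

Everything the recipe actually trains with passes through untouched; only the "where to run" block is removed.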
Example submission output:

```text
$ automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml
INFO Config: /workspace/Automodel/examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml
INFO Recipe: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction
INFO Launching job via SkyPilot
INFO SkyPilot job artifacts in: /workspace/Automodel/skypilot_jobs/1712150400
```

Then watch the cluster come up:

```bash
sky status
sky logs llama3-2-1b-k8s
kubectl get pods
```

Example log snippet:

```text
$ sky status
Clusters
NAME             LAUNCHED  RESOURCES              STATUS
llama3-2-1b-k8s  1m ago    1x Kubernetes(H100:1)  UP

$ sky logs llama3-2-1b-k8s
...
torchrun --nproc_per_node=1 ~/sky_workdir/nemo_automodel/recipes/llm/train_ft.py -c /tmp/automodel_job_config.yaml
...
```

## Step 3: Scale to two nodes

Once the single-node job works, scaling out is just a small YAML change.

Use the two-node example at [`examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml`](../../examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml):

```bash
automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml
```

The launcher block looks like this:

```yaml
skypilot:
  cloud: kubernetes
  accelerators: H100:1
  num_nodes: 2
  use_spot: false
  disk_size: 200
  job_name: llama3-2-1b-k8s-2nodes
  hf_token: ${HF_TOKEN}
```

For multi-node jobs, AutoModel switches the generated command to a distributed `torchrun` launch that uses SkyPilot's node metadata:

```text
torchrun \
  --nproc_per_node=1 \
  --nnodes=$SKYPILOT_NUM_NODES \
  --node_rank=$SKYPILOT_NODE_RANK \
  --rdzv_backend=c10d \
  --master_addr=$(echo $SKYPILOT_NODE_IPS | head -n1) \
  --master_port=12375 \
  ~/sky_workdir/nemo_automodel/recipes/llm/train_ft.py \
  -c /tmp/automodel_job_config.yaml
```

That means you do not need to hand-build rendezvous arguments yourself.
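To make the wiring concrete, here is a hypothetical Python sketch of how a launcher could assemble that command from SkyPilot's environment variables. It is an illustration only; the function name is invented, and AutoModel's real launcher may build the command differently. It relies on SkyPilot exporting `SKYPILOT_NUM_NODES`, `SKYPILOT_NODE_RANK`, and a newline-separated `SKYPILOT_NODE_IPS` on every node:

```python
import os


def build_torchrun_cmd(script: str, config_path: str) -> list[str]:
    """Assemble a multi-node torchrun command from SkyPilot env vars.

    Hypothetical sketch; mirrors the generated command shown above.
    """
    num_nodes = os.environ["SKYPILOT_NUM_NODES"]
    node_rank = os.environ["SKYPILOT_NODE_RANK"]
    # SKYPILOT_NODE_IPS lists one IP per line; the head node comes first,
    # which is what the `head -n1` in the shell version extracts.
    master_addr = os.environ["SKYPILOT_NODE_IPS"].splitlines()[0]

    return [
        "torchrun",
        "--nproc_per_node=1",
        f"--nnodes={num_nodes}",
        f"--node_rank={node_rank}",
        "--rdzv_backend=c10d",
        f"--master_addr={master_addr}",
        "--master_port=12375",
        script,
        "-c",
        config_path,
    ]
```

Every node runs the same command; only `--node_rank` differs, which is exactly what the rendezvous backend needs.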
Use these commands while the job is starting:

```bash
sky status
sky logs llama3-2-1b-k8s-2nodes
kubectl get pods -o wide
```

What you want to see:

- two SkyPilot-managed worker pods
- both pods scheduled onto GPU nodes
- logs that include `--nnodes=$SKYPILOT_NUM_NODES`

## Step 4: Clean up

When the run is finished, tear the clusters down so they stop consuming resources:

```bash
sky down llama3-2-1b-k8s
sky down llama3-2-1b-k8s-2nodes
```

You can remove old local launcher artifacts too:

```bash
rm -rf skypilot_jobs
```

## Common first-run issues

### `sky check kubernetes` fails

Usually this means SkyPilot cannot use your current kubeconfig context yet. Re-check the context with `kubectl config current-context`, then compare it with SkyPilot's Kubernetes setup guide.

### `sky show-gpus --infra k8s` shows no GPUs

SkyPilot can only schedule GPUs that Kubernetes exposes. Make sure the GPU device plugin or operator is installed and the GPU nodes are healthy.

### The job starts, but model download fails

For gated models, make sure `HF_TOKEN` is exported in the shell that runs `automodel`. The SkyPilot launcher forwards it to the remote job.
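A quick local sanity check before submitting is to fail early when the token is missing. This is plain POSIX shell and nothing AutoModel-specific:

```shell
# Warn before launch if HF_TOKEN is missing or empty.
if [ -n "${HF_TOKEN:-}" ]; then
  echo "HF_TOKEN is set"
else
  echo "HF_TOKEN is not set; gated model downloads will fail" >&2
fi
```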
### Multi-node launch stalls during rendezvous

Start with the single-node example first. If that works, check that:

- your cluster has enough free GPU nodes for `num_nodes`
- worker pods can talk to each other over the cluster network
- the logs include the generated `torchrun` multi-node arguments shown above

## Which file should I edit?

If you want to adapt this tutorial for your own model, the quickest path is:

1. Copy [`examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml`](../../examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml).
2. Change the `model` and dataset sections.
3. Keep the `skypilot:` block small until the first run succeeds.

That way, when something goes wrong, you only have a few knobs to inspect.
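As an illustration of step 2, an adapted copy might look like the fragment below. The model name and job name are placeholders, and the dataset section is omitted because its fields depend on your recipe:

```yaml
# Adapted copy of the single-node example (values here are placeholders)
model:
  pretrained_model_name_or_path: meta-llama/Llama-3.2-3B  # swap in your model

# Keep the launcher block minimal until the first run succeeds.
skypilot:
  cloud: kubernetes
  accelerators: H100:1
  job_name: my-first-k8s-run  # hypothetical name
  hf_token: ${HF_TOKEN}
```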

docs/launcher/skypilot.md

Lines changed: 23 additions & 14 deletions
````diff
@@ -1,24 +1,24 @@
-# Run on Any Cloud with SkyPilot
+# Run with SkyPilot

-In this guide, you will learn how to launch NeMo AutoModel training jobs on any major cloud provider (AWS, GCP, Azure, Lambda, Kubernetes) using [SkyPilot](https://skypilot.readthedocs.io). For on-premises cluster usage, see [Run on a Cluster (Slurm)](./slurm.md). For single-node workstation usage, see [Run on Your Local Workstation](./local-workstation.md).
+In this guide, you will learn how to launch NeMo AutoModel training jobs with [SkyPilot](https://docs.skypilot.co/en/stable/docs/). SkyPilot can target public clouds such as AWS, GCP, Azure, and Lambda, and it can also submit jobs to Kubernetes clusters. For a beginner-friendly Kubernetes walkthrough, see [SkyPilot + Kubernetes tutorial](./skypilot-kubernetes.md). For on-premises cluster usage without SkyPilot, see [Run on a Cluster (Slurm)](./slurm.md). For single-node workstation usage, see [Run on Your Local Workstation](./local-workstation.md).

 SkyPilot is an open-source framework that abstracts cloud infrastructure so you can train on whichever cloud is cheapest or most available at launch time — including automatic spot-instance handling for significant cost savings.

 ## Before You Begin

 Complete the following setup steps before launching your first AutoModel job on a cloud provider.

-1. **Install SkyPilot** with the connector for your target cloud:
+1. **Install SkyPilot** with the connector for your target infrastructure:

    ```bash
-   pip install "skypilot[gcp]"        # Google Cloud
-   pip install "skypilot[aws]"        # Amazon Web Services
-   pip install "skypilot[azure]"      # Microsoft Azure
-   pip install "skypilot[lambda]"     # Lambda Cloud
-   pip install "skypilot[kubernetes]" # Any Kubernetes cluster
+   uv pip install "skypilot[gcp]"        # Google Cloud
+   uv pip install "skypilot[aws]"        # Amazon Web Services
+   uv pip install "skypilot[azure]"      # Microsoft Azure
+   uv pip install "skypilot[lambda]"     # Lambda Cloud
+   uv pip install "skypilot[kubernetes]" # Any Kubernetes cluster
    ```

-2. **Configure your cloud credentials** by following the SkyPilot credential setup guide for your cloud, then verify:
+2. **Configure access** for your target infrastructure, then verify:

    ```bash
    sky check
@@ -38,7 +38,7 @@ export WANDB_API_KEY=... # Optional: Weights & Biases logging
 Add a `skypilot:` section to any existing config YAML, then run the same `automodel` command you already know:

 ```bash
-automodel finetune llm -c your_config_with_skypilot.yaml
+automodel your_config_with_skypilot.yaml
 ```

 The CLI detects the `skypilot:` key, strips it from the training config, uploads the code and config to a cloud VM, and launches training — all in one command.
@@ -118,7 +118,7 @@ skypilot:
   hf_token: ${HF_TOKEN}
 ```

-### GCP — spot V100, 8 GPUs (single node)
+### GCP — Spot V100, 8 GPUs (Single Node)

 ```yaml
 skypilot:
@@ -130,7 +130,7 @@ skypilot:
   hf_token: ${HF_TOKEN}
 ```

-### Multi-node distributed training (2 × 8 × A100)
+### Multi-Node Distributed Training (2 x 8 x A100)

 ```yaml
 skypilot:
@@ -142,7 +142,7 @@ skypilot:
   hf_token: ${HF_TOKEN}
 ```

-For multi-node jobs the launcher automatically adds the SkyPilot rendezvous environment variables (`$SKYPILOT_NODE_RANK`, `$SKYPILOT_NUM_NODES`, `$SKYPILOT_NODE_IPS`) to the `torchrun` command.
+For multi-node jobs, the launcher automatically adds the SkyPilot rendezvous environment variables (`$SKYPILOT_NODE_RANK`, `$SKYPILOT_NUM_NODES`, `$SKYPILOT_NODE_IPS`) to the `torchrun` command.

 ## Monitor and Manage Jobs

@@ -172,10 +172,19 @@ sky down <cluster_name> # Terminate the cluster and stop billing
 Override any training parameter from the command line, same as with local runs:

 ```bash
-automodel finetune llm -c config_with_skypilot.yaml \
+automodel config_with_skypilot.yaml \
   --model.pretrained_model_name_or_path meta-llama/Llama-3.2-3B
 ```

+## Kubernetes Users
+
+If you want to run on a Kubernetes cluster, use `cloud: kubernetes` and follow the dedicated [SkyPilot + Kubernetes tutorial](./skypilot-kubernetes.md). That guide includes:
+
+- a copy-paste single-node config
+- a two-node example
+- sample `sky` and `kubectl` output to help you sanity-check your setup
+- a short troubleshooting section for common first-run issues
+
 ## When to Use SkyPilot vs. Slurm

 | | SkyPilot | Slurm |
````