Merged
1 change: 1 addition & 0 deletions docs/index.md
@@ -276,6 +276,7 @@ Local Workstation <launcher/local-workstation.md>
SLURM Cluster <launcher/slurm.md>
NeMo-Run <launcher/nemo-run.md>
SkyPilot <launcher/skypilot.md>
SkyPilot k8s <launcher/skypilot-kubernetes.md>
::::

::::{toctree}
24 changes: 17 additions & 7 deletions docs/launcher/overview.md
@@ -7,21 +7,21 @@ NeMo AutoModel provides several ways to launch training. The right choice depend
| Launcher | Best for | GPUs | Guide |
|---|---|---|---|
| **Local Workstation** | Getting started, debugging, single-node training | 1-8 on one machine | [Local Workstation](./local-workstation.md) |
| **Slurm** | Multi-node batch jobs on HPC clusters | 8+ across nodes | [Slurm](./slurm.md) |
| **NeMo-Run** | Managed execution on Slurm, Kubernetes, Docker, local | 1+ | [NeMo-Run](./nemo-run.md) |
| **SkyPilot** | Cloud training (AWS, GCP, Azure) with spot pricing | Any | [SkyPilot](./skypilot.md) |
| **SkyPilot** | Cloud training or Kubernetes clusters | Any | [SkyPilot](./skypilot.md) |
| **Slurm** | Multi-node batch jobs on HPC clusters | 8+ across nodes | [Slurm](./slurm.md) |

### I have 1-2 GPUs on my workstation
### I Have 1–2 GPUs on My Workstation

Use the **interactive** launcher. No scheduler or cluster software needed:
Use the **interactive** launcher. No scheduler or cluster software is needed:

```bash
automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml
```

See the [Local Workstation](./local-workstation.md) guide.

### I have access to a Slurm cluster
### I Have Access to a Slurm Cluster

Add a `slurm:` section to your YAML config and submit with the same `automodel` command. The CLI generates the `torchrun` invocation and calls `sbatch` for you:

@@ -31,7 +31,7 @@ automodel config_with_slurm.yaml

See the [Slurm](./slurm.md) guide.

### I want managed job submission (Slurm, Kubernetes, Docker)
### I Want Managed Job Submission (Slurm, Kubernetes, Docker)

Add a `nemo_run:` section to your YAML config. NeMo-Run loads a pre-configured executor for your compute target and submits the job:

Expand All @@ -41,7 +41,7 @@ automodel config_with_nemo_run.yaml

See the [NeMo-Run](./nemo-run.md) guide.

### I want to train on the cloud
### I Want to Train on the Cloud

Add a `skypilot:` section to your YAML config. SkyPilot provisions VMs on any major cloud and handles spot-instance preemption automatically:

@@ -51,6 +51,16 @@ automodel config_with_skypilot.yaml

See the [SkyPilot](./skypilot.md) guide.

### I Want to Train on Kubernetes with SkyPilot

Use the same `skypilot:` launcher, but set `cloud: kubernetes`. This is a good fit when your team already has a GPU-backed Kubernetes cluster and you want SkyPilot to handle job submission and multi-node orchestration:

```bash
automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml
```

See the [SkyPilot + Kubernetes tutorial](./skypilot-kubernetes.md).

## All Launchers Use the Same Config

Every launcher shares the same YAML recipe format. The only difference is an optional launcher section (`slurm:`, `nemo_run:`, or `skypilot:`) that tells the CLI where to run. Without a launcher section, training runs interactively on the current machine.
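For example, a minimal illustrative recipe (field names other than `model.pretrained_model_name_or_path` and the `skypilot:` keys shown elsewhere in these docs are hypothetical) runs interactively as-is, and on Kubernetes once the launcher block is added:

```yaml
# Without the skypilot: block, `automodel my_config.yaml` runs on this machine.
model:
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B

# Adding this block makes the exact same command submit the job via SkyPilot.
skypilot:
  cloud: kubernetes
  accelerators: H100:1
```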
235 changes: 235 additions & 0 deletions docs/launcher/skypilot-kubernetes.md
@@ -0,0 +1,235 @@
# SkyPilot + Kubernetes Tutorial

This tutorial shows how to run NeMo AutoModel on a Kubernetes cluster through SkyPilot.

You will:

1. Check that SkyPilot can see your Kubernetes cluster and GPUs.
2. Launch a small NeMo AutoModel fine-tuning job on one GPU.
3. Scale the same job to two nodes.
4. Follow logs and clean everything up when you are done.

This guide is written for new AutoModel users, so it keeps the number of moving pieces to a minimum.

## Before You Begin

You need:

- a working Kubernetes context in `kubectl`
- at least one GPU-backed node in the cluster
- SkyPilot installed with Kubernetes support
- a local NeMo AutoModel checkout
- a Hugging Face token in `HF_TOKEN` if you plan to use a gated model such as Llama

If you are setting up SkyPilot on Kubernetes for the first time, the official SkyPilot Kubernetes setup guide is here:

- <https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html>

Install the SkyPilot Kubernetes client in your AutoModel environment:

```bash
uv pip install "skypilot[kubernetes]"
```

Set the token once in your shell:

```bash
export HF_TOKEN=hf_your_token_here
```

## Step 1: Verify the Cluster

Start with three quick checks:

```bash
kubectl config current-context
kubectl get nodes
sky check kubernetes
```

You want `sky check kubernetes` to report that Kubernetes is enabled.

Next, ask SkyPilot which GPUs it can request from the cluster:

```bash
sky show-gpus --infra k8s
```

Example output:

```text
$ sky show-gpus --infra k8s
Kubernetes GPUs
GPU REQUESTABLE_QTY_PER_NODE UTILIZATION
L4 1, 2, 4 8 of 8 free
H100 1, 2, 4, 8 8 of 8 free

Kubernetes per node GPU availability
NODE GPU UTILIZATION
gpu-node-a H100 8 of 8 free
```

If you do not see any GPUs here, stop and fix the Kubernetes or SkyPilot setup first; until SkyPilot can see GPUs, it cannot place GPU jobs for AutoModel.

## Step 2: Run a Single-Node Job

The easiest starting point is a one-GPU fine-tune using the existing Llama 3.2 1B SQuAD example.

This repository now includes a Kubernetes-flavored SkyPilot config at [`examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml`](../../examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml).

Launch it from the repo root:

```bash
automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml
```

The important part of that YAML is the `skypilot:` block:

```yaml
skypilot:
cloud: kubernetes
accelerators: H100:1
use_spot: false
disk_size: 200
job_name: llama3-2-1b-k8s
hf_token: ${HF_TOKEN}
```

What AutoModel does for you:

- writes a launcher-free copy of the training config to `skypilot_jobs/<timestamp>/job_config.yaml`
- syncs the repo to the SkyPilot workdir
- runs `torchrun` on the Kubernetes worker pod
- forwards your training config unchanged after removing the `skypilot:` section
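
The "launcher-free copy" in the first bullet can be pictured as a tiny filter that drops the top-level `skypilot:` block before the config is handed to `torchrun`. This is a hypothetical sketch of the idea, not the actual AutoModel launcher code:

```python
def strip_launcher_section(text: str, key: str = "skypilot") -> str:
    """Remove a top-level launcher block (e.g. `skypilot:`) from a YAML string.

    Illustrative only: treats any indented or blank line after the key as part
    of the block, which is enough for simple two-level configs.
    """
    out, skipping = [], False
    for line in text.splitlines():
        if line.startswith(f"{key}:"):
            skipping = True  # start of the launcher block
            continue
        if skipping and line[:1] in (" ", "\t", ""):
            continue  # indented children (and blank lines) of the block
        skipping = False
        out.append(line)
    return "\n".join(out)
```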

Example submission output:

```text
$ automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml
INFO Config: /workspace/Automodel/examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml
INFO Recipe: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction
INFO Launching job via SkyPilot
INFO SkyPilot job artifacts in: /workspace/Automodel/skypilot_jobs/1712150400
```

Then watch the cluster come up:

```bash
sky status
sky logs llama3-2-1b-k8s
kubectl get pods
```

Example log snippet:

```text
$ sky status
Clusters
NAME LAUNCHED RESOURCES STATUS
llama3-2-1b-k8s 1m ago 1x Kubernetes(H100:1) UP

$ sky logs llama3-2-1b-k8s
...
torchrun --nproc_per_node=1 ~/sky_workdir/nemo_automodel/recipes/llm/train_ft.py -c /tmp/automodel_job_config.yaml
...
```

## Step 3: Scale to Two Nodes

Once the single-node job works, scaling out is just a small YAML change.

Use the two-node example at [`examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml`](../../examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml):

```bash
automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml
```

The launcher block looks like this:

```yaml
skypilot:
cloud: kubernetes
accelerators: H100:1
num_nodes: 2
use_spot: false
disk_size: 200
job_name: llama3-2-1b-k8s-2nodes
hf_token: ${HF_TOKEN}
```

For multi-node jobs, AutoModel switches the generated command to a distributed `torchrun` launch that uses SkyPilot's node metadata:

```text
torchrun \
--nproc_per_node=1 \
--nnodes=$SKYPILOT_NUM_NODES \
--node_rank=$SKYPILOT_NODE_RANK \
--rdzv_backend=c10d \
--master_addr=$(echo $SKYPILOT_NODE_IPS | head -n1) \
--master_port=12375 \
~/sky_workdir/nemo_automodel/recipes/llm/train_ft.py \
-c /tmp/automodel_job_config.yaml
```

That means you do not need to hand-build rendezvous arguments yourself.
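
The `--master_addr` line works because `$SKYPILOT_NODE_IPS` holds one IP per line, head node first. A quick shell sketch with hypothetical IP values standing in for what SkyPilot injects on each pod:

```shell
# Hypothetical values standing in for SkyPilot's injected node metadata.
SKYPILOT_NODE_IPS=$'10.0.0.1\n10.0.0.2'
SKYPILOT_NUM_NODES=2
SKYPILOT_NODE_RANK=0

# Same head-of-list trick the generated torchrun command uses.
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "master: $MASTER_ADDR, nodes: $SKYPILOT_NUM_NODES, rank: $SKYPILOT_NODE_RANK"
```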

Use these commands while the job is starting:

```bash
sky status
sky logs llama3-2-1b-k8s-2nodes
kubectl get pods -o wide
```

What you want to see:

- two SkyPilot-managed worker pods
- both pods scheduled onto GPU nodes
- logs that include `--nnodes=$SKYPILOT_NUM_NODES`

## Step 4: Clean Up

When the run is finished, tear the cluster down so it stops consuming resources:

```bash
sky down llama3-2-1b-k8s
sky down llama3-2-1b-k8s-2nodes
```

You can remove old local launcher artifacts too:

```bash
rm -rf skypilot_jobs
```

## Common First-Run Issues

### `sky check kubernetes` fails

Usually this means SkyPilot cannot use your current kubeconfig context yet. Re-check the context with `kubectl config current-context`, then compare it with SkyPilot's Kubernetes setup guide.

### `sky show-gpus --infra k8s` shows no GPUs

SkyPilot can only schedule GPUs that Kubernetes exposes. Make sure the GPU device plugin or operator is installed and the GPU nodes are healthy.

### The job starts, but model download fails

For gated models, make sure `HF_TOKEN` is exported in the shell that runs `automodel`. The SkyPilot launcher forwards it to the remote job.

### Multi-node launch stalls during rendezvous

Start with the single-node example first. If that works, check that:

- your cluster has enough free GPU nodes for `num_nodes`
- worker pods can talk to each other over the cluster network
- the logs include the generated `torchrun` multi-node arguments shown above

## Which File Should I Edit?

If you want to adapt this tutorial for your own model, the quickest path is:

1. Copy [`examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml`](../../examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml).
2. Change the `model` and dataset sections.
3. Keep the `skypilot:` block small until the first run succeeds.

That way, when something goes wrong, you only have a few knobs to inspect.
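
An adapted copy might change only these sections; the model id below is illustrative, and the `skypilot:` keys mirror the ones shown earlier in this tutorial:

```yaml
# Copied from llama3_2_1b_squad_skypilot_kubernetes.yaml with a different model.
model:
  pretrained_model_name_or_path: Qwen/Qwen2.5-0.5B  # illustrative model id

# Replace the dataset section with your own data; its schema depends on the recipe.

# Keep the launcher block minimal until the first run succeeds.
skypilot:
  cloud: kubernetes
  accelerators: H100:1
  job_name: my-first-k8s-run
  hf_token: ${HF_TOKEN}
```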
37 changes: 23 additions & 14 deletions docs/launcher/skypilot.md
@@ -1,24 +1,24 @@
# Run on Any Cloud with SkyPilot
# Run with SkyPilot

In this guide, you will learn how to launch NeMo AutoModel training jobs on any major cloud provider (AWS, GCP, Azure, Lambda, Kubernetes) using [SkyPilot](https://skypilot.readthedocs.io). For on-premises cluster usage, see [Run on a Cluster (Slurm)](./slurm.md). For single-node workstation usage, see [Run on Your Local Workstation](./local-workstation.md).
In this guide, you will learn how to launch NeMo AutoModel training jobs with [SkyPilot](https://docs.skypilot.co/en/stable/docs/). SkyPilot can target public clouds such as AWS, GCP, Azure, and Lambda, and it can also submit jobs to Kubernetes clusters. For a beginner-friendly Kubernetes walkthrough, see [SkyPilot + Kubernetes tutorial](./skypilot-kubernetes.md). For on-premises cluster usage without SkyPilot, see [Run on a Cluster (Slurm)](./slurm.md). For single-node workstation usage, see [Run on Your Local Workstation](./local-workstation.md).

SkyPilot is an open-source framework that abstracts cloud infrastructure so you can train on whichever cloud is cheapest or most available at launch time — including automatic spot-instance handling for significant cost savings.

## Before You Begin

Complete the following setup steps before launching your first AutoModel job on a cloud provider.

1. **Install SkyPilot** with the connector for your target cloud:
1. **Install SkyPilot** with the connector for your target infrastructure:

```bash
pip install "skypilot[gcp]" # Google Cloud
pip install "skypilot[aws]" # Amazon Web Services
pip install "skypilot[azure]" # Microsoft Azure
pip install "skypilot[lambda]" # Lambda Cloud
pip install "skypilot[kubernetes]" # Any Kubernetes cluster
uv pip install "skypilot[gcp]" # Google Cloud
uv pip install "skypilot[aws]" # Amazon Web Services
uv pip install "skypilot[azure]" # Microsoft Azure
uv pip install "skypilot[lambda]" # Lambda Cloud
uv pip install "skypilot[kubernetes]" # Any Kubernetes cluster
```

2. **Configure your cloud credentials** by following the SkyPilot credential setup guide for your cloud, then verify:
2. **Configure access** for your target infrastructure, then verify:

```bash
sky check
@@ -38,7 +38,7 @@ export WANDB_API_KEY=... # Optional: Weights & Biases logging
Add a `skypilot:` section to any existing config YAML, then run the same `automodel` command you already know:

```bash
automodel finetune llm -c your_config_with_skypilot.yaml
automodel your_config_with_skypilot.yaml
```

The CLI detects the `skypilot:` key, strips it from the training config, uploads the code and config to a cloud VM, and launches training — all in one command.
@@ -118,7 +118,7 @@ skypilot:
hf_token: ${HF_TOKEN}
```

### GCP — spot V100, 8 GPUs (single node)
### GCP — Spot V100, 8 GPUs (Single Node)

```yaml
skypilot:
@@ -130,7 +130,7 @@ skypilot:
hf_token: ${HF_TOKEN}
```

### Multi-node distributed training (2 × 8 × A100)
### Multi-Node Distributed Training (2 x 8 x A100)

```yaml
skypilot:
@@ -142,7 +142,7 @@ skypilot:
hf_token: ${HF_TOKEN}
```

For multi-node jobs the launcher automatically adds the SkyPilot rendezvous environment variables (`$SKYPILOT_NODE_RANK`, `$SKYPILOT_NUM_NODES`, `$SKYPILOT_NODE_IPS`) to the `torchrun` command.
For multi-node jobs, the launcher automatically adds the SkyPilot rendezvous environment variables (`$SKYPILOT_NODE_RANK`, `$SKYPILOT_NUM_NODES`, `$SKYPILOT_NODE_IPS`) to the `torchrun` command.

## Monitor and Manage Jobs

@@ -172,10 +172,19 @@ sky down <cluster_name> # Terminate the cluster and stop billing
Override any training parameter from the command line, same as with local runs:

```bash
automodel finetune llm -c config_with_skypilot.yaml \
automodel config_with_skypilot.yaml \
--model.pretrained_model_name_or_path meta-llama/Llama-3.2-3B
```

## Kubernetes Users

If you want to run on a Kubernetes cluster, use `cloud: kubernetes` and follow the dedicated [SkyPilot + Kubernetes tutorial](./skypilot-kubernetes.md). That guide includes:

- a copy-paste single-node config
- a two-node example
- sample `sky` and `kubectl` output to help you sanity-check your setup
- a short troubleshooting section for common first-run issues

## When to Use SkyPilot vs. Slurm

| | SkyPilot | Slurm |