# SkyPilot + Kubernetes Tutorial

This tutorial shows how to run NeMo AutoModel on a Kubernetes cluster through SkyPilot.

You will:

1. Check that SkyPilot can see your Kubernetes cluster and GPUs.
2. Launch a small NeMo AutoModel fine-tuning job on one GPU.
3. Scale the same job to two nodes.
4. Follow logs and clean everything up when you are done.

This guide is written for new AutoModel users, so it keeps the number of moving pieces to a minimum.

## Before you begin

You need:

- a working Kubernetes context in `kubectl`
- at least one GPU-backed node in the cluster
- SkyPilot installed with Kubernetes support
- a local NeMo AutoModel checkout
- a Hugging Face token in `HF_TOKEN` if you plan to use a gated model such as Llama

If you are setting up SkyPilot on Kubernetes for the first time, the official SkyPilot Kubernetes setup guide is here:

- <https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html>

Install the SkyPilot Kubernetes client in your AutoModel environment:

```bash
uv pip install "skypilot[kubernetes]"
```
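
As a quick sanity check, confirm the CLI landed on your `PATH` (if this prints a version, the install worked):

```bash
sky --version
```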

Set the token once in your shell:

```bash
export HF_TOKEN=hf_your_token_here
```

## Step 1: Verify the cluster

Start with three quick checks:

```bash
kubectl config current-context
kubectl get nodes
sky check kubernetes
```

You want `sky check kubernetes` to report that Kubernetes is enabled.

Next, ask SkyPilot which GPUs it can request from the cluster:

```bash
sky show-gpus --infra k8s
```

Example output:

```text
$ sky show-gpus --infra k8s
Kubernetes GPUs
GPU   REQUESTABLE_QTY_PER_NODE  UTILIZATION
L4    1, 2, 4                   8 of 8 free
H100  1, 2, 4, 8                8 of 8 free

Kubernetes per node GPU availability
NODE        GPU   UTILIZATION
gpu-node-a  H100  8 of 8 free
```

If you do not see any GPUs here, stop and fix the Kubernetes or SkyPilot setup first: AutoModel itself is ready, but SkyPilot cannot place GPU jobs until the cluster advertises them.
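
To see what Kubernetes itself is advertising, you can ask each node for its allocatable GPU count. The `nvidia.com/gpu` resource name assumes the NVIDIA device plugin; other vendors expose different resource names:

```bash
# Allocatable GPUs per node (assumes the NVIDIA device plugin).
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```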

## Step 2: Run a single-node job

The easiest starting point is a one-GPU fine-tune using the existing Llama 3.2 1B SQuAD example.

This repository now includes a Kubernetes-flavored SkyPilot config at [`examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml`](../../examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml).

Launch it from the repo root:

```bash
automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml
```

The important part of that YAML is the `skypilot:` block:

```yaml
skypilot:
  cloud: kubernetes
  accelerators: H100:1
  use_spot: false
  disk_size: 200
  job_name: llama3-2-1b-k8s
  hf_token: ${HF_TOKEN}
```

What AutoModel does for you (each item is easy to spot-check after a launch; see below):

- writes a launcher-free copy of the training config to `skypilot_jobs/<timestamp>/job_config.yaml`
- syncs the repo to the SkyPilot workdir
- runs `torchrun` on the Kubernetes worker pod
- forwards your training config unchanged after removing the `skypilot:` section
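
For example, after a launch you can inspect the generated artifacts locally to confirm the `skypilot:` block was stripped. This is a minimal spot-check; the timestamped directory name will differ on your machine:

```bash
# Launch artifacts are grouped in timestamped directories.
ls skypilot_jobs/

# The copied training config should no longer contain a skypilot: block.
grep "^skypilot:" skypilot_jobs/*/job_config.yaml || echo "skypilot block removed"
```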

Example submission output:

```text
$ automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml
INFO Config: /workspace/Automodel/examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml
INFO Recipe: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction
INFO Launching job via SkyPilot
INFO SkyPilot job artifacts in: /workspace/Automodel/skypilot_jobs/1712150400
```

Then watch the cluster come up:

```bash
sky status
sky logs llama3-2-1b-k8s
kubectl get pods
```

Example log snippet:

```text
$ sky status
Clusters
NAME             LAUNCHED  RESOURCES              STATUS
llama3-2-1b-k8s  1m ago    1x Kubernetes(H100:1)  UP

$ sky logs llama3-2-1b-k8s
...
torchrun --nproc_per_node=1 ~/sky_workdir/nemo_automodel/recipes/llm/train_ft.py -c /tmp/automodel_job_config.yaml
...
```

## Step 3: Scale to two nodes

Once the single-node job works, scaling out is just a small YAML change.

Use the two-node example at [`examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml`](../../examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml):

```bash
automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml
```

The launcher block looks like this:

```yaml
skypilot:
  cloud: kubernetes
  accelerators: H100:1
  num_nodes: 2
  use_spot: false
  disk_size: 200
  job_name: llama3-2-1b-k8s-2nodes
  hf_token: ${HF_TOKEN}
```

For multi-node jobs, AutoModel switches the generated command to a distributed `torchrun` launch that uses SkyPilot's node metadata:

```text
torchrun \
  --nproc_per_node=1 \
  --nnodes=$SKYPILOT_NUM_NODES \
  --node_rank=$SKYPILOT_NODE_RANK \
  --rdzv_backend=c10d \
  --master_addr=$(echo $SKYPILOT_NODE_IPS | head -n1) \
  --master_port=12375 \
  ~/sky_workdir/nemo_automodel/recipes/llm/train_ft.py \
  -c /tmp/automodel_job_config.yaml
```

That means you do not need to hand-build the rendezvous arguments yourself.
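
If you want to see where those rendezvous values come from, SkyPilot exports them as environment variables on every node. This is a minimal illustration you could drop into a task's run commands, not part of what AutoModel generates:

```bash
# Each SkyPilot node sees its own rank plus the full node IP list.
head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "node $SKYPILOT_NODE_RANK of $SKYPILOT_NUM_NODES, head node at $head_ip"
```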

Use these commands while the job is starting:

```bash
sky status
sky logs llama3-2-1b-k8s-2nodes
kubectl get pods -o wide
```

What you want to see (one quick filter is shown after this list):

- two SkyPilot-managed worker pods
- both pods scheduled onto GPU nodes
- logs that include `--nnodes=$SKYPILOT_NUM_NODES`
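
A simple way to check the first two items, assuming SkyPilot's pod names include the cluster name (naming can vary across SkyPilot versions):

```bash
# Show only this job's pods, with the nodes they landed on.
kubectl get pods -o wide | grep llama3-2-1b-k8s-2nodes
```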

## Step 4: Clean up

When the run is finished, tear the clusters down so they stop consuming resources:

```bash
sky down llama3-2-1b-k8s
sky down llama3-2-1b-k8s-2nodes
```
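
If these are the only SkyPilot clusters you have, `sky down --all` removes everything in one step; check `sky status` first so you do not tear down something you still need:

```bash
sky status
sky down --all
```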

You can remove old local launcher artifacts too:

```bash
rm -rf skypilot_jobs
```

## Common first-run issues

### `sky check kubernetes` fails

This usually means SkyPilot cannot use your current kubeconfig context yet. Re-check the context with `kubectl config current-context`, then work through SkyPilot's Kubernetes setup guide linked above.
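
If the failure is just a wrong context, switching contexts and re-running the check is often enough. The context name below is a placeholder:

```bash
kubectl config get-contexts
kubectl config use-context <your-gpu-cluster-context>  # placeholder name
sky check kubernetes
```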

### `sky show-gpus --infra k8s` shows no GPUs

SkyPilot can only schedule GPUs that Kubernetes exposes. Make sure the GPU device plugin or operator is installed and the GPU nodes are healthy.
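
Two quick checks, assuming the NVIDIA device plugin (namespaces and pod names differ if you installed the GPU Operator another way):

```bash
# Device plugin pods should be Running on every GPU node.
kubectl get pods -A | grep -i nvidia

# Nodes should report a non-empty GPU count.
kubectl describe nodes | grep -i "nvidia.com/gpu"
```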

### The job starts, but the model download fails

For gated models, make sure `HF_TOKEN` is exported in the shell that runs `automodel`; the SkyPilot launcher forwards it to the remote job.
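
You can confirm the token is set and valid before launching. The second command assumes `huggingface_hub` is installed in your environment:

```bash
# Fail fast if the token is missing (prints nothing on success).
: "${HF_TOKEN:?HF_TOKEN is not set}"

# Optional: verify the token against the Hub.
huggingface-cli whoami
```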

### Multi-node launch stalls during rendezvous

Start with the single-node example first. If that works, check that:

- your cluster has enough free GPU nodes for `num_nodes`
- worker pods can talk to each other over the cluster network (the probe after this list is one way to test)
- the logs include the generated `torchrun` multi-node arguments shown above
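
For the network check, one option is to probe the rendezvous port from one worker pod toward the head node using only bash built-ins. The pod name and IP below are placeholders; take the real values from `kubectl get pods -o wide`:

```bash
# From one worker pod, test whether the master port on the other node is reachable.
kubectl exec <worker-pod-name> -- bash -c 'timeout 3 bash -c "</dev/tcp/<head-pod-ip>/12375" && echo reachable || echo unreachable'
```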

## Which file should I edit?

If you want to adapt this tutorial for your own model, the quickest path is:

1. Copy [`examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml`](../../examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml).
2. Change the `model` and dataset sections (a sketch of that edit follows this list).
3. Keep the `skypilot:` block small until the first run succeeds.
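
As a sketch, an edited copy might differ from the original only in the lines below. The field names are illustrative; keep whatever keys the example file actually uses and swap in your own values:

```yaml
# Illustrative only -- copy the real keys from the example YAML.
model:
  pretrained_model_name_or_path: your-org/your-model  # hypothetical model id

# ...dataset section: point it at your own data...

skypilot:  # keep this block minimal for the first run
  cloud: kubernetes
  accelerators: H100:1
  job_name: my-first-automodel-k8s
  hf_token: ${HF_TOKEN}
```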

That way, when something goes wrong, you only have a few knobs to inspect.