diff --git a/docs/index.md b/docs/index.md index 214523055c..f4e8560d80 100644 --- a/docs/index.md +++ b/docs/index.md @@ -276,6 +276,7 @@ Local Workstation SLURM Cluster NeMo-Run SkyPilot +SkyPilot k8s :::: ::::{toctree} diff --git a/docs/launcher/overview.md b/docs/launcher/overview.md index 9bd583315f..f4c28c4661 100644 --- a/docs/launcher/overview.md +++ b/docs/launcher/overview.md @@ -7,13 +7,13 @@ NeMo AutoModel provides several ways to launch training. The right choice depend | Launcher | Best for | GPUs | Guide | |---|---|---|---| | **Local Workstation** | Getting started, debugging, single-node training | 1-8 on one machine | [Local Workstation](./local-workstation.md) | -| **Slurm** | Multi-node batch jobs on HPC clusters | 8+ across nodes | [Slurm](./slurm.md) | | **NeMo-Run** | Managed execution on Slurm, Kubernetes, Docker, local | 1+ | [NeMo-Run](./nemo-run.md) | -| **SkyPilot** | Cloud training (AWS, GCP, Azure) with spot pricing | Any | [SkyPilot](./skypilot.md) | +| **SkyPilot** | Cloud training or Kubernetes clusters | Any | [SkyPilot](./skypilot.md) | +| **Slurm** | Multi-node batch jobs on HPC clusters | 8+ across nodes | [Slurm](./slurm.md) | -### I have 1-2 GPUs on my workstation +### I Have 1–2 GPUs on My Workstation -Use the **interactive** launcher. No scheduler or cluster software needed: +Use the **interactive** launcher. No scheduler or cluster software is needed: ```bash automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml @@ -21,7 +21,7 @@ automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml See the [Local Workstation](./local-workstation.md) guide. -### I have access to a Slurm cluster +### I Have Access to a Slurm Cluster Add a `slurm:` section to your YAML config and submit with the same `automodel` command. The CLI generates the `torchrun` invocation and calls `sbatch` for you: @@ -31,7 +31,7 @@ automodel config_with_slurm.yaml See the [Slurm](./slurm.md) guide. 
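As a rough mental model, every launcher's core job is the same: turn the recipe config plus the compute topology into a `torchrun` invocation. A hypothetical sketch of that composition (illustrative only, not AutoModel's actual internals):

```python
# Illustrative sketch: compose the torchrun command line that a batch
# launcher (such as the Slurm path described above) would embed in its
# generated job script. Function and argument names are hypothetical.

def build_torchrun_cmd(script: str, config: str, nproc_per_node: int, nnodes: int) -> str:
    """Compose the torchrun invocation a batch script would execute."""
    parts = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        script,
        "-c",
        config,
    ]
    return " ".join(parts)

print(build_torchrun_cmd("train_ft.py", "config.yaml", nproc_per_node=8, nnodes=2))
```

The launcher-specific parts (queueing via `sbatch`, pod scheduling via SkyPilot) wrap around this same command.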
-### I want managed job submission (Slurm, Kubernetes, Docker) +### I Want Managed Job Submission (Slurm, Kubernetes, Docker) Add a `nemo_run:` section to your YAML config. NeMo-Run loads a pre-configured executor for your compute target and submits the job: @@ -41,7 +41,7 @@ automodel config_with_nemo_run.yaml See the [NeMo-Run](./nemo-run.md) guide. -### I want to train on the cloud +### I Want to Train on the Cloud Add a `skypilot:` section to your YAML config. SkyPilot provisions VMs on any major cloud and handles spot-instance preemption automatically: @@ -51,6 +51,16 @@ automodel config_with_skypilot.yaml See the [SkyPilot](./skypilot.md) guide. +### I Want to Train on Kubernetes with SkyPilot + +Use the same `skypilot:` launcher, but set `cloud: kubernetes`. This is a good fit when your team already has a GPU-backed Kubernetes cluster and you want SkyPilot to handle job submission and multi-node orchestration: + +```bash +automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml +``` + +See the [SkyPilot + Kubernetes tutorial](./skypilot-kubernetes.md). + ## All Launchers Use the Same Config Every launcher shares the same YAML recipe format. The only difference is an optional launcher section (`slurm:`, `nemo_run:`, or `skypilot:`) that tells the CLI where to run. Without a launcher section, training runs interactively on the current machine. diff --git a/docs/launcher/skypilot-kubernetes.md b/docs/launcher/skypilot-kubernetes.md new file mode 100644 index 0000000000..3b8692c252 --- /dev/null +++ b/docs/launcher/skypilot-kubernetes.md @@ -0,0 +1,235 @@ +# SkyPilot + Kubernetes Tutorial + +This tutorial shows how to run NeMo AutoModel on a Kubernetes cluster through SkyPilot. + +You will: + +1. Check that SkyPilot can see your Kubernetes cluster and GPUs. +2. Launch a small NeMo AutoModel fine-tuning job on one GPU. +3. Scale the same job to two nodes. +4. Follow logs and clean everything up when you are done. 
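The dispatch rule in the overview above (an optional `slurm:`, `nemo_run:`, or `skypilot:` key decides where the job runs; no key means interactive) can be sketched as follows. This is a simplification, not the actual CLI code:

```python
# Simplified sketch of launcher selection: the presence of a launcher key
# in the recipe config decides the execution path. The real CLI pops the
# section and hands it to the matching launcher class.

def pick_launcher(config: dict) -> str:
    for key in ("skypilot", "nemo_run", "slurm"):
        if key in config:
            return key
    return "interactive"

pick_launcher({"model": {}, "skypilot": {"cloud": "kubernetes"}})  # -> "skypilot"
pick_launcher({"model": {}})                                       # -> "interactive"
```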
+ +This guide is written for new AutoModel users, so it keeps the number of moving pieces to a minimum. + +## Before you begin + +You need: + +- a working Kubernetes context in `kubectl` +- at least one GPU-backed node in the cluster +- SkyPilot installed with Kubernetes support +- a local NeMo AutoModel checkout +- a Hugging Face token in `HF_TOKEN` if you plan to use a gated model such as Llama + +If you are setting up SkyPilot on Kubernetes for the first time, follow the official SkyPilot Kubernetes setup guide in the SkyPilot documentation first. + +Install the SkyPilot Kubernetes client in your AutoModel environment: + +```bash +uv pip install "skypilot[kubernetes]" +``` + +Set the token once in your shell: + +```bash +export HF_TOKEN=hf_your_token_here +``` + +## Step 1: Verify the cluster + +Start with three quick checks: + +```bash +kubectl config current-context +kubectl get nodes +sky check kubernetes +``` + +You want `sky check kubernetes` to report that Kubernetes is enabled. + +Next, ask SkyPilot which GPUs it can request from the cluster: + +```bash +sky show-gpus --infra k8s +``` + +Example output: + +```text +$ sky show-gpus --infra k8s +Kubernetes GPUs +GPU REQUESTABLE_QTY_PER_NODE UTILIZATION +L4 1, 2, 4 8 of 8 free +H100 1, 2, 4, 8 8 of 8 free + +Kubernetes per node GPU availability +NODE GPU UTILIZATION +gpu-node-a H100 8 of 8 free +``` + +If you do not see any GPUs here, stop and fix the Kubernetes or SkyPilot setup first. AutoModel is ready, but SkyPilot still cannot place GPU jobs. + +## Step 2: Run a single-node job + +The easiest starting point is a one-GPU fine-tune using the existing Llama 3.2 1B SQuAD example. + +This repository now includes a Kubernetes-flavored SkyPilot config at [`examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml`](../../examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml).
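The config you are about to launch references the token as `hf_token: ${HF_TOKEN}`. The `${VAR}` and `${VAR,default}` forms are resolved from your environment at launch time; the behavior is roughly this sketch (not the launcher's actual resolver):

```python
import os
import re

# Rough sketch of ${VAR} / ${VAR,default} substitution as exercised by the
# launcher tests in this change; the real resolver may differ in details.
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?:,([^}]*))?\}")

def resolve(value: str) -> str:
    """Substitute environment variables into a config value string."""
    def repl(m: re.Match) -> str:
        name, default = m.group(1), m.group(2)
        if name in os.environ:
            return os.environ[name]
        if default is not None:
            return default
        raise KeyError(f"environment variable {name} is not set")
    return _PATTERN.sub(repl, value)

os.environ["DEMO_TOKEN"] = "hf_demo"
resolve("${DEMO_TOKEN}")  # -> "hf_demo"
resolve("${UNSET_VAR,}")  # -> "" when UNSET_VAR is not set
```

This is why exporting `HF_TOKEN` in the shell that runs `automodel` is enough; you never put the raw token in the YAML.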
+ +Launch it from the repo root: + +```bash +automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml +``` + +The important part of that YAML is the `skypilot:` block: + +```yaml +skypilot: + cloud: kubernetes + accelerators: H100:1 + use_spot: false + disk_size: 200 + job_name: llama3-2-1b-k8s + hf_token: ${HF_TOKEN} +``` + +What AutoModel does for you: + +- writes a launcher-free copy of the training config to `skypilot_jobs//job_config.yaml` +- syncs the repo to the SkyPilot workdir +- runs `torchrun` on the Kubernetes worker pod +- forwards your training config unchanged after removing the `skypilot:` section + +Example submission output: + +```text +$ automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml +INFO Config: /workspace/Automodel/examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml +INFO Recipe: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction +INFO Launching job via SkyPilot +INFO SkyPilot job artifacts in: /workspace/Automodel/skypilot_jobs/1712150400 +``` + +Then watch the cluster come up: + +```bash +sky status +sky logs llama3-2-1b-k8s +kubectl get pods +``` + +Example log snippet: + +```text +$ sky status +Clusters +NAME LAUNCHED RESOURCES STATUS +llama3-2-1b-k8s 1m ago 1x Kubernetes(H100:1) UP + +$ sky logs llama3-2-1b-k8s +... +torchrun --nproc_per_node=1 ~/sky_workdir/nemo_automodel/recipes/llm/train_ft.py -c /tmp/automodel_job_config.yaml +... +``` + +## Step 3: Scale to two nodes + +Once the single-node job works, scaling out is just a small YAML change. 
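The launcher-free copy mentioned above is the key contract: the `skypilot:` block only controls placement, and the rest of the recipe reaches the worker pod unchanged. A minimal sketch of that split, with illustrative names:

```python
from copy import deepcopy

# Sketch of the strip-and-forward step: pop the skypilot: block from the
# recipe and keep everything else for the remote torchrun job. Names are
# illustrative; the real launcher also serializes the result back to YAML.

def split_recipe(config: dict) -> tuple[dict, dict]:
    """Return (launcher_config, training_config) without mutating the input."""
    config = deepcopy(config)
    launcher = config.pop("skypilot", {})
    return launcher, config

recipe = {
    "skypilot": {"cloud": "kubernetes", "accelerators": "H100:1"},
    "model": {"pretrained_model_name_or_path": "meta-llama/Llama-3.2-1B"},
}
launcher, training = split_recipe(recipe)
# training no longer contains "skypilot"; it is what the worker pod sees
```

The same split applies to the `slurm:` and `nemo_run:` sections in the other launchers.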
+ +Use the two-node example at [`examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml`](../../examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml): + +```bash +automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml +``` + +The launcher block looks like this: + +```yaml +skypilot: + cloud: kubernetes + accelerators: H100:1 + num_nodes: 2 + use_spot: false + disk_size: 200 + job_name: llama3-2-1b-k8s-2nodes + hf_token: ${HF_TOKEN} +``` + +For multi-node jobs, AutoModel switches the generated command to a distributed `torchrun` launch that uses SkyPilot's node metadata: + +```text +torchrun \ + --nproc_per_node=1 \ + --nnodes=$SKYPILOT_NUM_NODES \ + --node_rank=$SKYPILOT_NODE_RANK \ + --rdzv_backend=c10d \ + --master_addr=$(echo $SKYPILOT_NODE_IPS | head -n1) \ + --master_port=12375 \ + ~/sky_workdir/nemo_automodel/recipes/llm/train_ft.py \ + -c /tmp/automodel_job_config.yaml +``` + +That means you do not need to hand-build rendezvous arguments yourself. + +Use these commands while the job is starting: + +```bash +sky status +sky logs llama3-2-1b-k8s-2nodes +kubectl get pods -o wide +``` + +What you want to see: + +- two SkyPilot-managed worker pods +- both pods scheduled onto GPU nodes +- logs that include `--nnodes=$SKYPILOT_NUM_NODES` + +## Step 4: Clean up + +When the run is finished, tear the cluster down so it stops consuming resources: + +```bash +sky down llama3-2-1b-k8s +sky down llama3-2-1b-k8s-2nodes +``` + +You can remove old local launcher artifacts too: + +```bash +rm -rf skypilot_jobs +``` + +## Common first-run issues + +### `sky check kubernetes` fails + +Usually this means SkyPilot cannot use your current kubeconfig context yet. Re-check the context with `kubectl config current-context`, then compare it with SkyPilot's Kubernetes setup guide. + +### `sky show-gpus --infra k8s` shows no GPUs + +SkyPilot can only schedule GPUs that Kubernetes exposes. 
Make sure the GPU device plugin or operator is installed and the GPU nodes are healthy. + +### The job starts, but model download fails + +For gated models, make sure `HF_TOKEN` is exported in the shell that runs `automodel`. The SkyPilot launcher forwards it to the remote job. + +### Multi-node launch stalls during rendezvous + +Start with the single-node example first. If that works, check that: + +- your cluster has enough free GPU nodes for `num_nodes` +- worker pods can talk to each other over the cluster network +- the logs include the generated `torchrun` multi-node arguments shown above + +## Which file should I edit? + +If you want to adapt this tutorial for your own model, the quickest path is: + +1. Copy [`examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml`](../../examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml). +2. Change the `model` and dataset sections. +3. Keep the `skypilot:` block small until the first run succeeds. + +That way, when something goes wrong, you only have a few knobs to inspect. diff --git a/docs/launcher/skypilot.md b/docs/launcher/skypilot.md index 4e435e7263..55d4061fac 100644 --- a/docs/launcher/skypilot.md +++ b/docs/launcher/skypilot.md @@ -1,6 +1,6 @@ -# Run on Any Cloud with SkyPilot +# Run with SkyPilot -In this guide, you will learn how to launch NeMo AutoModel training jobs on any major cloud provider (AWS, GCP, Azure, Lambda, Kubernetes) using [SkyPilot](https://skypilot.readthedocs.io). For on-premises cluster usage, see [Run on a Cluster (Slurm)](./slurm.md). For single-node workstation usage, see [Run on Your Local Workstation](./local-workstation.md). +In this guide, you will learn how to launch NeMo AutoModel training jobs with [SkyPilot](https://docs.skypilot.co/en/stable/docs/). SkyPilot can target public clouds such as AWS, GCP, Azure, and Lambda, and it can also submit jobs to Kubernetes clusters. 
For a beginner-friendly Kubernetes walkthrough, see [SkyPilot + Kubernetes tutorial](./skypilot-kubernetes.md). For on-premises cluster usage without SkyPilot, see [Run on a Cluster (Slurm)](./slurm.md). For single-node workstation usage, see [Run on Your Local Workstation](./local-workstation.md). SkyPilot is an open-source framework that abstracts cloud infrastructure so you can train on whichever cloud is cheapest or most available at launch time — including automatic spot-instance handling for significant cost savings. @@ -8,17 +8,17 @@ SkyPilot is an open-source framework that abstracts cloud infrastructure so you Complete the following setup steps before launching your first AutoModel job on a cloud provider. -1. **Install SkyPilot** with the connector for your target cloud: +1. **Install SkyPilot** with the connector for your target infrastructure: ```bash -pip install "skypilot[gcp]" # Google Cloud -pip install "skypilot[aws]" # Amazon Web Services -pip install "skypilot[azure]" # Microsoft Azure -pip install "skypilot[lambda]" # Lambda Cloud -pip install "skypilot[kubernetes]" # Any Kubernetes cluster +uv pip install "skypilot[gcp]" # Google Cloud +uv pip install "skypilot[aws]" # Amazon Web Services +uv pip install "skypilot[azure]" # Microsoft Azure +uv pip install "skypilot[lambda]" # Lambda Cloud +uv pip install "skypilot[kubernetes]" # Any Kubernetes cluster ``` -2. **Configure your cloud credentials** by following the SkyPilot credential setup guide for your cloud, then verify: +2. **Configure access** for your target infrastructure, then verify: ```bash sky check @@ -38,7 +38,7 @@ export WANDB_API_KEY=... 
# Optional: Weights & Biases logging Add a `skypilot:` section to any existing config YAML, then run the same `automodel` command you already know: ```bash -automodel finetune llm -c your_config_with_skypilot.yaml +automodel your_config_with_skypilot.yaml ``` The CLI detects the `skypilot:` key, strips it from the training config, uploads the code and config to a cloud VM, and launches training — all in one command. @@ -118,7 +118,7 @@ skypilot: hf_token: ${HF_TOKEN} ``` -### GCP — spot V100, 8 GPUs (single node) +### GCP — Spot V100, 8 GPUs (Single Node) ```yaml skypilot: @@ -130,7 +130,7 @@ skypilot: hf_token: ${HF_TOKEN} ``` -### Multi-node distributed training (2 × 8 × A100) +### Multi-Node Distributed Training (2 x 8 x A100) ```yaml skypilot: @@ -142,7 +142,7 @@ skypilot: hf_token: ${HF_TOKEN} ``` -For multi-node jobs the launcher automatically adds the SkyPilot rendezvous environment variables (`$SKYPILOT_NODE_RANK`, `$SKYPILOT_NUM_NODES`, `$SKYPILOT_NODE_IPS`) to the `torchrun` command. +For multi-node jobs, the launcher automatically adds the SkyPilot rendezvous environment variables (`$SKYPILOT_NODE_RANK`, `$SKYPILOT_NUM_NODES`, `$SKYPILOT_NODE_IPS`) to the `torchrun` command. ## Monitor and Manage Jobs @@ -172,10 +172,19 @@ sky down # Terminate the cluster and stop billing Override any training parameter from the command line, same as with local runs: ```bash -automodel finetune llm -c config_with_skypilot.yaml \ +automodel config_with_skypilot.yaml \ --model.pretrained_model_name_or_path meta-llama/Llama-3.2-3B ``` +## Kubernetes Users + +If you want to run on a Kubernetes cluster, use `cloud: kubernetes` and follow the dedicated [SkyPilot + Kubernetes tutorial](./skypilot-kubernetes.md). That guide includes: + +- a copy-paste single-node config +- a two-node example +- sample `sky` and `kubectl` output to help you sanity-check your setup +- a short troubleshooting section for common first-run issues + ## When to Use SkyPilot vs. 
Slurm | | SkyPilot | Slurm | diff --git a/examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml b/examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml new file mode 100644 index 0000000000..c98110425d --- /dev/null +++ b/examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml @@ -0,0 +1,110 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Fine-tune Llama-3.2-1B on SQuAD using SkyPilot on Kubernetes. 
+# +# Usage: +# automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes.yaml +# +# Prerequisites: +# uv pip install "skypilot[kubernetes]" +# sky check kubernetes +# sky show-gpus --infra k8s + +skypilot: + cloud: kubernetes + accelerators: H100:1 + use_spot: false + disk_size: 200 + num_nodes: 1 + job_name: llama3-2-1b-k8s + hf_token: ${HF_TOKEN} + +recipe: + _target_: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction + +step_scheduler: + global_batch_size: 64 + local_batch_size: 8 + ckpt_every_steps: 1000 + val_every_steps: 10 + num_epochs: 1 + +dist_env: + backend: nccl + timeout_minutes: 1 + +rng: + _target_: nemo_automodel.components.training.rng.StatefulRNG + seed: 1111 + ranked: true + +model: + _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained + pretrained_model_name_or_path: meta-llama/Llama-3.2-1B + +compile: + enabled: false + mode: "default" + fullgraph: false + dynamic: true + backend: null + +clip_grad_norm: + max_norm: 1.0 + +distributed: + strategy: fsdp2 + dp_size: none + tp_size: 1 + cp_size: 1 + +loss_fn: + _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy + +dataset: + _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset + dataset_name: rajpurkar/squad + split: train + +packed_sequence: + packed_sequence_size: 0 + +dataloader: + _target_: torchdata.stateful_dataloader.StatefulDataLoader + collate_fn: + _target_: nemo_automodel.components.datasets.utils.default_collater + shuffle: false + +validation_dataset: + _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset + dataset_name: rajpurkar/squad + split: validation + limit_dataset_samples: 64 + +validation_dataloader: + _target_: torchdata.stateful_dataloader.StatefulDataLoader + collate_fn: + _target_: nemo_automodel.components.datasets.utils.default_collater + +optimizer: + _target_: torch.optim.Adam + betas: [0.9, 0.999] + eps: 1e-8 + lr: 1.0e-5 + weight_decay: 0 + 
+lr_scheduler: + lr_decay_style: cosine + min_lr: 1.0e-6 diff --git a/examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml b/examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml new file mode 100644 index 0000000000..3a167c18bf --- /dev/null +++ b/examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml @@ -0,0 +1,105 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Two-node SkyPilot + Kubernetes example for Llama-3.2-1B on SQuAD. 
+# +# Usage: +# automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad_skypilot_kubernetes_2nodes.yaml + +skypilot: + cloud: kubernetes + accelerators: H100:1 + num_nodes: 2 + use_spot: false + disk_size: 200 + job_name: llama3-2-1b-k8s-2nodes + hf_token: ${HF_TOKEN} + +recipe: + _target_: nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction + +step_scheduler: + global_batch_size: 64 + local_batch_size: 8 + ckpt_every_steps: 1000 + val_every_steps: 10 + num_epochs: 1 + +dist_env: + backend: nccl + timeout_minutes: 1 + +rng: + _target_: nemo_automodel.components.training.rng.StatefulRNG + seed: 1111 + ranked: true + +model: + _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained + pretrained_model_name_or_path: meta-llama/Llama-3.2-1B + +compile: + enabled: false + mode: "default" + fullgraph: false + dynamic: true + backend: null + +clip_grad_norm: + max_norm: 1.0 + +distributed: + strategy: fsdp2 + dp_size: none + tp_size: 1 + cp_size: 1 + +loss_fn: + _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy + +dataset: + _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset + dataset_name: rajpurkar/squad + split: train + +packed_sequence: + packed_sequence_size: 0 + +dataloader: + _target_: torchdata.stateful_dataloader.StatefulDataLoader + collate_fn: + _target_: nemo_automodel.components.datasets.utils.default_collater + shuffle: false + +validation_dataset: + _target_: nemo_automodel.components.datasets.llm.squad.make_squad_dataset + dataset_name: rajpurkar/squad + split: validation + limit_dataset_samples: 64 + +validation_dataloader: + _target_: torchdata.stateful_dataloader.StatefulDataLoader + collate_fn: + _target_: nemo_automodel.components.datasets.utils.default_collater + +optimizer: + _target_: torch.optim.Adam + betas: [0.9, 0.999] + eps: 1e-8 + lr: 1.0e-5 + weight_decay: 0 + +lr_scheduler: + lr_decay_style: cosine + min_lr: 1.0e-6 diff --git a/nemo_automodel/cli/app.py 
b/nemo_automodel/cli/app.py index ce6e0d6dda..6be82311aa 100644 --- a/nemo_automodel/cli/app.py +++ b/nemo_automodel/cli/app.py @@ -52,6 +52,7 @@ from pathlib import Path from nemo_automodel.cli.utils import load_yaml, resolve_recipe_name +from nemo_automodel.components.config.loader import resolve_yaml_env_vars # When launched via external torchrun each worker imports this module. # Suppress non-rank-0 CLI output before setup_logging installs RankFilter. @@ -135,7 +136,13 @@ def main(): logger.info("Launching job via SkyPilot") from nemo_automodel.components.launcher.skypilot.launcher import SkyPilotLauncher - return SkyPilotLauncher().launch(config, config_path, recipe_target, skypilot_config, extra) + return SkyPilotLauncher().launch( + config, + config_path, + recipe_target, + resolve_yaml_env_vars(skypilot_config), + extra, + ) elif nemo_run_config := config.pop("nemo_run", None): logger.info("Launching job via NeMo-Run") diff --git a/tests/unit_tests/_cli/test_app.py b/tests/unit_tests/_cli/test_app.py index 402e6a0f1e..6948db5a76 100644 --- a/tests/unit_tests/_cli/test_app.py +++ b/tests/unit_tests/_cli/test_app.py @@ -191,7 +191,7 @@ def test_build_parser_accepts_config(tmp_path): cfg = tmp_path / "test.yaml" cfg.write_text("foo: bar") parser = module.build_parser() - args, extra = parser.parse_known_args([str(cfg)]) + args, _ = parser.parse_known_args([str(cfg)]) assert args.config == cfg assert args.nproc_per_node is None @@ -258,6 +258,43 @@ def launch(self, config, config_path, recipe_target, launcher_config, extra): assert launched["launcher_config"]["executor"] == "local" +def test_main_resolves_skypilot_env_vars(monkeypatch, tmp_path): + cfg = tmp_path / "skypilot.yaml" + cfg.write_text( + yaml.dump( + { + "recipe": { + "_target_": "nemo_automodel.recipes.llm.train_ft.TrainFinetuneRecipeForNextTokenPrediction", + }, + "skypilot": { + "cloud": "kubernetes", + "hf_token": "${HF_TOKEN}", + "env_vars": {"WANDB_API_KEY": "${WANDB_API_KEY,}"}, + }, + } + ) 
+ ) + monkeypatch.setenv("HF_TOKEN", "hf_test_token") + monkeypatch.setenv("WANDB_API_KEY", "wandb_test_key") + monkeypatch.setattr("sys.argv", ["automodel", str(cfg)]) + + launched = {} + + class FakeSkyPilotLauncher: + def launch(self, config, config_path, recipe_target, launcher_config, extra): + launched["launcher_config"] = launcher_config + return 0 + + monkeypatch.setattr( + "nemo_automodel.components.launcher.skypilot.launcher.SkyPilotLauncher", + FakeSkyPilotLauncher, + ) + result = module.main() + assert result == 0 + assert launched["launcher_config"]["hf_token"] == "hf_test_token" + assert launched["launcher_config"]["env_vars"]["WANDB_API_KEY"] == "wandb_test_key" + + def test_main_passes_extra_args(monkeypatch, recipe_yaml): """Extra CLI args should be forwarded to the launcher.""" monkeypatch.setattr(