# LLM Fine-Tuning Examples

This directory holds YAML recipes for fine-tuning LLMs with NeMo AutoModel. Each recipe pairs a config (the YAML) with a recipe class (here, `TrainFinetuneRecipeForNextTokenPrediction`); you launch it with the `automodel` CLI.

Pick your path:

| Goal | Recipe variant | Launch |
| ------------------------ | -------------------------------------- | ------------------------------------- |
| Full SFT, single node | `<family>/<model>_<dataset>.yaml` | `automodel <yaml> --nproc-per-node N` |
| LoRA / PEFT, single node | `<family>/<model>_<dataset>_peft.yaml` | same as above |
| Multi-node on SLURM | any of the above | `sbatch` (see *Multi-Node Launches*) |

Subdirectories group recipes by model family (Llama, Mistral, Qwen, Gemma, Nemotron, and others).
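For orientation, a recipe YAML broadly declares the model, dataset, and training options consumed at launch. The sketch below is purely illustrative: every field name in it is hypothetical, so open any YAML in a family subdirectory for the real schema.

```yaml
# HYPOTHETICAL sketch of a recipe config; all field names are illustrative only.
model:
  pretrained_model_name_or_path: meta-llama/Llama-3.2-1B   # illustrative
dataset:
  name: squad                                              # illustrative
training:
  max_steps: 1000                                          # illustrative
```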


## Running a Recipe

Recipes are launched through the `automodel` CLI (or its short alias `am`); both are console scripts wrapping [`nemo_automodel/cli/app.py`](../../nemo_automodel/cli/app.py). For full setup and CLI options, see the [main README](../../README.md#getting-started); for end-to-end examples, see the [LLM SFT](../../README.md#llm-supervised-fine-tuning-sft) and [PEFT](../../README.md#llm-parameter-efficient-fine-tuning-peft) sections. Full reference docs: [docs.nvidia.com/nemo/automodel](https://docs.nvidia.com/nemo/automodel/latest/index.html).

Set up the environment with `uv`, then run a recipe:


```bash
uv venv
uv sync --frozen
automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml
```

To run on multiple GPUs on a single node:

```bash
automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml --nproc-per-node 8
```



> [!NOTE]
> A legacy `finetune.py` still exists in this directory but is deprecated: it prints a `DeprecationWarning` and tells you to launch recipes with `automodel <config.yaml> [--nproc-per-node N]` instead. Do not write new docs or examples around it.


## Multi-Node Launches

For SLURM-based multi-node runs, copy the reference `slurm.sub` script from the repository root, adapt it for your cluster, and submit it with `sbatch`:

```bash
cp ../../slurm.sub my_cluster.sub   # slurm.sub lives at the repo root
# edit my_cluster.sub: --nodes, --partition, container image, mounts, recipe path
sbatch my_cluster.sub
```
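The cluster-specific header of such a script typically looks like the sketch below. All values are placeholders and the launch line is hypothetical; the reference `slurm.sub` at the repo root is the authoritative template.

```shell
#!/bin/bash
#SBATCH --nodes=2                 # placeholder: number of nodes
#SBATCH --gpus-per-node=8         # placeholder: GPUs per node
#SBATCH --partition=batch         # placeholder: your cluster's partition
# container image and mount points are also set here on containerized clusters

# hypothetical launch line; copy the real one from slurm.sub
srun automodel examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml --nproc-per-node 8
```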

Cluster-specific settings (`--nodes`, `--gpus`, `--partition`, container image, mounts, recipe path) live in the sbatch script. For the NeMo-Run launcher, see [`docs/launcher/slurm.md`](../../docs/launcher/slurm.md).


## After Fine-Tuning

These recipes are focused on training. After fine-tuning completes, the resulting checkpoints can be used in downstream evaluation, inference, or deployment workflows. The main README also highlights checkpointing and interoperability with Hugging Face and other NeMo ecosystem components.

## Deployment

These examples are training recipes; this directory does not own a deployment path. See the [main README](../../README.md) and the [NeMo AutoModel docs](https://docs.nvidia.com/nemo/automodel/latest/index.html) for serving and export guidance.


## Development Notes

If you update documentation here, follow the contributing guide's documentation development guide and sign off your commits:

```bash
git commit -s -m "docs: add llm finetune README"
```

Unsigned commits are not accepted.
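To confirm your latest commit carries the `Signed-off-by:` trailer that `-s` appends, you can check before pushing:

```shell
# Check the most recent commit for the sign-off trailer added by `git commit -s`
git log -1 --format=%B | grep '^Signed-off-by:' \
  && echo "commit is signed off" \
  || echo "missing sign-off: amend with 'git commit --amend -s --no-edit'"
```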