Commit 393d889

docs: add MSC cloud storage documentation to checkpointing guide
Signed-off-by: Edison <edisonggacc@gmail.com>
1 parent db91b76 commit 393d889

1 file changed

Lines changed: 71 additions & 0 deletions

docs/guides/checkpointing.md

@@ -342,3 +342,74 @@ class NewState:
```

Inside your recipe class, define the new state as an instance attribute using `self.new_state = NewState(...)`.

## Cloud Storage Checkpoints (MSC)

NeMo Automodel supports saving and loading checkpoints directly to cloud object storage
using NVIDIA's [Multi-Storage Client (MSC)](https://nvidia.github.io/multi-storage-client/).
This is useful when training on cloud clusters where local disk is ephemeral or too small
to hold full distributed checkpoints.

### Installation

```bash
pip install multi-storage-client --index-url https://pypi.nvidia.com
```

### MSC Profile Configuration

MSC authenticates with your storage provider via a profile configuration file at
`~/.msc_config.yaml`. The profile name **must match the bucket name** in your
`msc://` path. A mismatch here is the most common source of errors when first setting up MSC.

For example, if your checkpoint path is `msc://my-bucket/checkpoints`, your config
must have a profile named `my-bucket`:
```yaml
profiles:
  my-bucket:
    storage_provider:
      type: s3
      options:
        region_name: us-east-1
    credentials:
      type: s3
      options:
        access_key: YOUR_ACCESS_KEY
        secret_key: YOUR_SECRET_KEY
```

MSC supports AWS S3, Azure Blob Storage, Google Cloud Storage, and NVIDIA AIStore.
See the [MSC documentation](https://nvidia.github.io/multi-storage-client/) for
provider-specific configuration.
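
Because a profile/bucket mismatch is easy to miss, a quick pre-flight check can help. The sketch below is illustrative only: `profile_from_msc_path` and `check_profile` are hypothetical helpers (not part of MSC or NeMo Automodel), and `example_config` stands in for the already-parsed contents of `~/.msc_config.yaml`. It assumes only that the profile is the segment immediately after `msc://`, as in the example above.

```python
from urllib.parse import urlparse


def profile_from_msc_path(path: str) -> str:
    """Return the segment right after msc:// (hypothetical helper)."""
    parsed = urlparse(path)
    if parsed.scheme != "msc" or not parsed.netloc:
        raise ValueError(f"not an msc:// path: {path!r}")
    return parsed.netloc


def check_profile(path: str, config: dict) -> str:
    """Raise KeyError if the path's profile is absent from the parsed config."""
    profile = profile_from_msc_path(path)
    profiles = config.get("profiles") or {}
    if profile not in profiles:
        raise KeyError(
            f"profile {profile!r} not found; available: {sorted(profiles)}"
        )
    return profile


# Stand-in for the parsed ~/.msc_config.yaml (assumption: loaded elsewhere,
# for example with a YAML parser, into a plain dict).
example_config = {"profiles": {"my-bucket": {"storage_provider": {"type": "s3"}}}}

print(check_profile("msc://my-bucket/checkpoints", example_config))
```

Running a check like this before launching a job surfaces a missing profile immediately rather than at the first checkpoint save.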

### Usage

Set `checkpoint_dir` to an `msc://` path; everything else works the same as local
checkpointing:
```yaml
checkpoint:
  checkpoint_dir: msc://my-bucket/checkpoints
```
```bash
automodel --nproc-per-node=2 examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
    --checkpoint.checkpoint_dir msc://my-bucket/checkpoints \
    --checkpoint.model_save_format safetensors \
    --checkpoint.save_consolidated true
```

After 20 steps you should see:

> Saving checkpoint to msc://my-bucket/checkpoints/epoch_0_step_20

The checkpoint layout in cloud storage is identical to the local layout described
above. Resuming works the same way. Rerunning the command picks up from the
`LATEST` symlink automatically:

> Loading checkpoint from msc://my-bucket/checkpoints/epoch_0_step_20
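
When scanning training logs, it can be handy to confirm that the checkpoint being loaded is the newest one saved. The following is a minimal sketch that assumes directory names follow the `epoch_<E>_step_<S>` pattern seen in the log lines above; `parse_checkpoint_name` and `newest_checkpoint` are hypothetical helpers, and this is not how NeMo Automodel itself resolves `LATEST`.

```python
import re

# Matches a trailing "epoch_<E>_step_<S>" path component (assumed naming pattern).
_STEP_DIR = re.compile(r"epoch_(\d+)_step_(\d+)$")


def parse_checkpoint_name(path: str) -> tuple[int, int]:
    """Return (epoch, step) parsed from the last component of a checkpoint path."""
    match = _STEP_DIR.search(path.rstrip("/"))
    if match is None:
        raise ValueError(f"unrecognized checkpoint path: {path!r}")
    return int(match.group(1)), int(match.group(2))


def newest_checkpoint(paths: list[str]) -> str:
    """Pick the path with the highest (epoch, step) pair."""
    return max(paths, key=parse_checkpoint_name)


saved = [
    "msc://my-bucket/checkpoints/epoch_0_step_20",
    "msc://my-bucket/checkpoints/epoch_0_step_40",
]
print(newest_checkpoint(saved))
```

Comparing the result against the path in the `Loading checkpoint from` line gives a quick sanity check that resume picked up the expected step.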

::: {note}
Asynchronous checkpointing (`is_async: true`) is supported with MSC paths and is
recommended for large models where synchronous cloud writes would stall training.
:::
