@@ -342,3 +342,74 @@ class NewState:
```

Inside your recipe class, define the new state as an instance attribute using `self.new_state = NewState(...)`.

## Cloud Storage Checkpoints (MSC)

NeMo Automodel supports saving and loading checkpoints directly to cloud object storage
using NVIDIA's [Multi-Storage Client (MSC)](https://nvidia.github.io/multi-storage-client/).
This is useful when training on cloud clusters where local disk is ephemeral or too small
to hold full distributed checkpoints.

### Installation

```bash
pip install multi-storage-client --index-url https://pypi.nvidia.com
```

### MSC Profile Configuration

MSC authenticates with your storage provider via a profile configuration file at
`~/.msc_config.yaml`. The profile name **must match the bucket name** in your
`msc://` path. A mismatch here is the most common source of errors when first setting up MSC.

For example, if your checkpoint path is `msc://my-bucket/checkpoints`, your config
must have a profile named `my-bucket`:

```yaml
profiles:
  my-bucket:
    storage_provider:
      type: s3
      options:
        region_name: us-east-1
    credentials:
      type: s3
      options:
        access_key: YOUR_ACCESS_KEY
        secret_key: YOUR_SECRET_KEY
```

MSC supports AWS S3, Azure Blob Storage, Google Cloud Storage, and NVIDIA AIStore.
See the [MSC documentation](https://nvidia.github.io/multi-storage-client/) for
provider-specific configuration.

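Because a profile/bucket mismatch is easy to miss, a small pre-flight check can catch it before a long training run starts. The following is a minimal sketch, not part of NeMo Automodel or MSC: `profile_for_path` and `check_profile` are hypothetical helper names, and the config is assumed to be already parsed into a dictionary (for example with a YAML library of your choice).

```python
from urllib.parse import urlsplit


def profile_for_path(checkpoint_dir: str) -> str:
    """Return the MSC profile name implied by an msc:// path.

    Per the rule above, this is the bucket segment right after "msc://".
    (Hypothetical helper, for illustration only.)
    """
    parts = urlsplit(checkpoint_dir)
    if parts.scheme != "msc" or not parts.netloc:
        raise ValueError(f"not an msc:// path: {checkpoint_dir!r}")
    return parts.netloc


def check_profile(checkpoint_dir: str, config: dict) -> None:
    """Raise KeyError if `config` has no profile matching the path's bucket.

    `config` is assumed to be the parsed content of ~/.msc_config.yaml.
    """
    name = profile_for_path(checkpoint_dir)
    profiles = config.get("profiles") or {}
    if name not in profiles:
        known = ", ".join(sorted(profiles)) or "<none>"
        raise KeyError(f"no MSC profile named {name!r}; found: {known}")


# Example: a config dict mirroring the YAML above (credentials omitted).
config = {"profiles": {"my-bucket": {"storage_provider": {"type": "s3"}}}}
check_profile("msc://my-bucket/checkpoints", config)  # passes silently
```

Calling `check_profile("msc://other-bucket/checkpoints", config)` with the same dictionary raises a `KeyError` that lists the profiles that were found.
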
### Usage

Set `checkpoint_dir` to an `msc://` path. Everything else works the same as local
checkpointing:

```yaml
checkpoint:
  checkpoint_dir: msc://my-bucket/checkpoints
```

You can also pass the path as a command-line override:

```bash
automodel --nproc-per-node=2 examples/llm_finetune/llama3_2/llama3_2_1b_squad.yaml \
    --checkpoint.checkpoint_dir msc://my-bucket/checkpoints \
    --checkpoint.model_save_format safetensors \
    --checkpoint.save_consolidated true
```

After 20 steps you should see:

> Saving checkpoint to msc://my-bucket/checkpoints/epoch_0_step_20

The checkpoint layout in cloud storage is identical to the local layout described
above. Resuming works the same way. Rerunning the command picks up from the
`LATEST` symlink automatically:

> Loading checkpoint from msc://my-bucket/checkpoints/epoch_0_step_20

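When inspecting a bucket by hand, it can help to sort checkpoint directories numerically rather than alphabetically. The sketch below assumes only the `epoch_<E>_step_<S>` naming visible in the log lines above; the helper names are hypothetical, and NeMo Automodel itself resolves the resume point through `LATEST`, not through this code.

```python
import re

# Pattern taken from the log lines above: epoch_<E>_step_<S>.
_CKPT_NAME = re.compile(r"^epoch_(\d+)_step_(\d+)$")


def parse_checkpoint_name(name: str) -> tuple[int, int]:
    """Split a name such as 'epoch_0_step_20' into (epoch, step)."""
    match = _CKPT_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def newest_checkpoint(names: list[str]) -> str:
    """Return the name with the highest (epoch, step), skipping other entries."""
    candidates = [n for n in names if _CKPT_NAME.match(n)]
    if not candidates:
        raise ValueError("no checkpoint directories found")
    return max(candidates, key=parse_checkpoint_name)


names = ["epoch_0_step_10", "epoch_0_step_20", "LATEST"]
print(newest_checkpoint(names))  # epoch_0_step_20
```

Comparing parsed integers avoids the pitfall where a plain string sort would rank `epoch_0_step_9` above `epoch_0_step_20`.
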
:::{note}
Asynchronous checkpointing (`is_async: true`) is supported with MSC paths and is
recommended for large models where synchronous cloud writes would stall training.
:::