feat: add MSC cloud storage support for dcp checkpoints #1709
edjson wants to merge 6 commits into NVIDIA-NeMo:main
@adil-a can you review?
/claude review
    "torchao",
    "mlflow",
    "flashoptim>=0.1.3",
    "localstack>=2026.3.0",
localstack is a local AWS emulator used for testing — it should not be a core runtime dependency. This pulls in a large dependency tree (Docker SDK, DNS libs, etc.) for all users. Move it to an optional [project.optional-dependencies] group (e.g., test or dev).
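A minimal sketch of the suggested change, assuming the dependency list lives in pyproject.toml and that the optional group is named test (the group name is a placeholder taken from the review's examples, not from the PR):

```toml
[project.optional-dependencies]
# Hypothetical group name; the review suggests "test" or "dev".
test = [
    "localstack>=2026.3.0",
]
```

With a layout like this, the emulator would only be installed when the extra is requested explicitly, for example with pip install ".[test]".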
    with patch("nemo_automodel.components.checkpoint.checkpointing.msc") as mock_msc, \
         patch("nemo_automodel.components.checkpoint.checkpointing.dcp"):
        Checkpointer._do_save(ckptr, state_dict, path)
        Checkpointer._do_load(ckptr, state_dict, path)
This test name is misleading. The path "msc://bucket/step-100" does not contain /model, so is_model will be False in _do_load, and the PEFT branch (self.config.is_peft and is_model) is never taken. This test actually exercises the normal DCP cloud load path, not PEFT loading.
To actually test PEFT + cloud, the path would need /model in it — but that would expose the bug where load_file(os.path.join(...)) doesn't work with msc:// paths.
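The branch condition described above can be sketched in isolation. This is an illustration only: takes_peft_branch is a hypothetical helper, and it assumes is_model is derived from a simple "/model" substring check, as the comment implies. It is not the actual _do_load implementation.

```python
def takes_peft_branch(path: str, is_peft: bool) -> bool:
    """Hypothetical stand-in for the condition `self.config.is_peft and is_model`."""
    # Assumption: is_model is True only when the path contains "/model".
    is_model = "/model" in path
    return is_peft and is_model


# Path used by the test under review: the PEFT branch is skipped.
print(takes_peft_branch("msc://bucket/step-100", is_peft=True))

# A path containing /model would reach the PEFT branch.
print(takes_peft_branch("msc://bucket/step-100/model", is_peft=True))
```

Under that assumption, only the second path would exercise PEFT loading, which is the case the review says is currently untested.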
398a3d9 to 5a56bbd
Hi @adil-a can you review? 🙇
jgerh left a comment
Completed a tech pubs review and added a few copyedits.
/ok to test ebffa7a
50b10f6 to eab36b0
Hi @edjson, I see the latest commit pulled in a lot of other commits. If it would be less time-consuming than fighting git, please feel free to open another PR.
0ae0ac4 to 2cbe848
Signed-off-by: Edison <edisonggacc@gmail.com>
2cbe848 to e509944
Signed-off-by: Edison <edisonggacc@gmail.com>
/ok to test 124c1b0
Signed-off-by: Edison <edisonggacc@gmail.com>
Signed-off-by: Edison <edisonggacc@gmail.com>
/ok to test c983bd6
What does this PR do?
Adds support for saving DCP checkpoints to, and loading them from, cloud storage using NVIDIA's Multi-Storage Client (MSC). Users can now specify "msc://" paths for checkpoint directories instead of being limited to local disk.
Changelog
- Added optional `multistorageclient` import with a fallback if not installed
- Added `is_cloud_path()` helper to detect MSC cloud paths
- Added `_ensure_msc_available()` to raise an error if MSC is not installed
- Modified `_ensure_dirs()` to skip `os.makedirs` for cloud paths
- Modified `save_config()` to use `msc.open()` for cloud paths
- Modified `_do_save()` to use `msc.torch.MultiStorageFileSystemWriter` for cloud paths
- Modified `_do_load()` to use `msc.torch.MultiStorageFileSystemReader` for cloud paths
- Added 15 unit tests covering all new functionality
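The first four changelog items can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's code: the bodies are guesses at the behavior described, the error message is invented, and the functions are written as free functions with names shortened from the private helpers listed above.

```python
import os

# Optional dependency: fall back to None when the package is not installed.
try:
    import multistorageclient as msc
except ImportError:
    msc = None

MSC_PREFIX = "msc://"  # scheme named in the PR description


def is_cloud_path(path) -> bool:
    """Return True when the path uses the msc:// scheme."""
    return str(path).startswith(MSC_PREFIX)


def ensure_msc_available() -> None:
    """Raise a clear error if a cloud path is used without MSC installed."""
    if msc is None:
        # Hypothetical message; the PR's wording may differ.
        raise ImportError("multistorageclient is required for msc:// paths")


def ensure_dirs(path: str) -> None:
    """Create local directories, skipping os.makedirs for cloud paths."""
    if is_cloud_path(path):
        return
    os.makedirs(path, exist_ok=True)
```

Keeping the scheme check in one helper means the save, load, and config-writing paths can all branch on the same predicate rather than repeating string checks.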
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
- Tested locally using MinIO as an S3 substitute