Skip to content

feat: add MSC cloud storage support for dcp checkpoints#1709

Open
edjson wants to merge 6 commits intoNVIDIA-NeMo:mainfrom
edjson:feat/s3-dcp-support
Open

feat: add MSC cloud storage support for dcp checkpoints#1709
edjson wants to merge 6 commits intoNVIDIA-NeMo:mainfrom
edjson:feat/s3-dcp-support

Conversation

@edjson
Copy link
Copy Markdown

@edjson edjson commented Apr 7, 2026

What does this PR do ?

Adds support for saving and loading DCP checkpoints to cloud storage using NVIDIA's multi storage client (msc). Users can now specify "msc://" paths for checkpoint directories instead of being limited to the local disks.

Changelog

  • Added optional multistorageclient import with a fallback if not installed

  • Added is_cloud_path() helper to detect MSC cloud paths

  • Added _ensure_msc_available() to raise an error if MSC is not installed

  • Modified '_ensure_dirs()' to skip 'os.makedirs' for cloud paths

  • Modified 'save_config()' to use 'msc.open()' for cloud paths

  • Modified '_do_save()' to use 'msc.torch.MultiStorageFileSystemWriter' for cloud paths

  • Modified '_do_load()' to use 'msc.torch.MultiStorageFileSystemReader' for cloud paths

  • Added 15 unit tests covering all new functionality

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items you can still open "Draft" PR.

-Tested locally using MinIO as the s3 substitute

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@edjson edjson changed the title Feat/s3 dcp support feat: add MSC cloud storage support for checkpoints and added test cases for the functionality Apr 7, 2026
@edjson edjson changed the title feat: add MSC cloud storage support for checkpoints and added test cases for the functionality feat: add MSC cloud storage support for dcp checkpoints Apr 7, 2026
@edjson edjson marked this pull request as draft April 7, 2026 07:44
@edjson edjson marked this pull request as ready for review April 7, 2026 07:59
@edjson edjson requested a review from jgerh as a code owner April 7, 2026 07:59
@akoumpa
Copy link
Copy Markdown
Contributor

akoumpa commented Apr 7, 2026

@adil-a can you review?

@akoumpa
Copy link
Copy Markdown
Contributor

akoumpa commented Apr 7, 2026

/claude review

Comment thread nemo_automodel/components/checkpoint/checkpointing.py Outdated
Comment thread pyproject.toml Outdated
"torchao",
"mlflow",
"flashoptim>=0.1.3",
"localstack>=2026.3.0",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

localstack is a local AWS emulator used for testing — it should not be a core runtime dependency. This pulls in a large dependency tree (Docker SDK, DNS libs, etc.) for all users. Move it to an optional [project.optional-dependencies] group (e.g., test or dev).

Comment thread nemo_automodel/components/checkpoint/checkpointing.py
Comment on lines +937 to +942

with patch("nemo_automodel.components.checkpoint.checkpointing.msc") as mock_msc, \
patch("nemo_automodel.components.checkpoint.checkpointing.dcp"):
Checkpointer._do_save(ckptr, state_dict, path)
Checkpointer._do_load(ckptr, state_dict, path)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test name is misleading. The path "msc://bucket/step-100" does not contain /model, so is_model will be False in _do_load, and the PEFT branch (self.config.is_peft and is_model) is never taken. This test actually exercises the normal DCP cloud load path, not PEFT loading.

To actually test PEFT + cloud, the path would need /model in it — but that would expose the bug where load_file(os.path.join(...)) doesn't work with msc:// paths.

@edjson edjson force-pushed the feat/s3-dcp-support branch from 398a3d9 to 5a56bbd Compare April 8, 2026 01:26
@akoumpa
Copy link
Copy Markdown
Contributor

akoumpa commented Apr 8, 2026

Hi @adil-a can you review? 🙇

Copy link
Copy Markdown
Contributor

@jgerh jgerh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completed a tech pubs review and added a few copyedits.

Comment thread docs/guides/checkpointing.md Outdated
Comment thread docs/guides/checkpointing.md
Comment thread docs/guides/checkpointing.md
Comment thread docs/guides/checkpointing.md
Comment thread docs/guides/checkpointing.md
Comment thread docs/guides/checkpointing.md Outdated
Comment thread docs/guides/checkpointing.md Outdated
Comment thread docs/guides/checkpointing.md Outdated
Comment thread docs/guides/checkpointing.md Outdated
Comment thread docs/guides/checkpointing.md Outdated
@edjson edjson requested a review from jgerh April 8, 2026 23:06
@akoumpa
Copy link
Copy Markdown
Contributor

akoumpa commented Apr 9, 2026

/ok to test ebffa7a

@akoumpa
Copy link
Copy Markdown
Contributor

akoumpa commented Apr 17, 2026

Hi @edjson , I see latest commit pulled in a lot other commits, if it's helpful to you/ less time-consuming git-fighting, please feel free to open another PR.

@edjson edjson force-pushed the feat/s3-dcp-support branch 2 times, most recently from 0ae0ac4 to 2cbe848 Compare April 18, 2026 00:16
Signed-off-by: Edison <edisonggacc@gmail.com>
@edjson edjson force-pushed the feat/s3-dcp-support branch from 2cbe848 to e509944 Compare April 18, 2026 00:16
Signed-off-by: Edison <edisonggacc@gmail.com>
@akoumpa
Copy link
Copy Markdown
Contributor

akoumpa commented Apr 18, 2026

/ok to test 124c1b0

@akoumpa
Copy link
Copy Markdown
Contributor

akoumpa commented Apr 19, 2026

/ok to test c983bd6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

S3 + DCP

5 participants