Add configurable checkpoint filenames and state-dict key remapping #1705

Open
Thabhelo wants to merge 2 commits into NVIDIA:main from Thabhelo:feat/checkpoint-filename-state-dict-compat

Conversation

@Thabhelo Thabhelo commented Jun 6, 2026

Summary

  • Add optional filename_format to save_checkpoint so model checkpoint basenames can use custom layouts (e.g. zero-padded epochs) while preserving legacy naming when unset.
  • Add Module._backward_compat_state_dict_mapper and apply it in from_checkpoint before load_state_dict, mirroring the existing constructor-arg backward-compat hook for refactored parameter names.
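For readers skimming the summary, the filename_format behaviour in the first bullet can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the helper name build_basename, the placeholder names {name} and {epoch}, and the legacy layout shown are all assumptions.

```python
import string

# Hypothetical placeholder names; the PR description does not list the real ones.
ALLOWED_FIELDS = {"name", "epoch"}


def build_basename(name, epoch, filename_format=None):
    """Return a checkpoint basename, using an assumed legacy layout when unset."""
    if filename_format is None:
        # Assumed legacy layout, for illustration only.
        return f"{name}.{epoch}"
    # Collect the replacement fields and reject any that are not allowed.
    fields = {
        field
        for _literal, field, _spec, _conv in string.Formatter().parse(filename_format)
        if field is not None
    }
    unknown = fields - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"Unknown placeholder(s) in filename_format: {sorted(unknown)}")
    return filename_format.format(name=name, epoch=epoch)


print(build_basename("model", 3))                             # model.3
print(build_basename("model", 3, "{name}_epoch{epoch:04d}"))  # model_epoch0003
```

Validating the placeholders up front gives a clear ValueError instead of a KeyError from str.format, which appears to be the case the invalid-placeholder test in the test plan targets.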

Closes #1175
Closes #1173

Test plan

  • pytest test/utils/test_checkpoint.py::test_save_checkpoint_filename_format
  • pytest test/utils/test_checkpoint.py::test_save_checkpoint_filename_format_invalid_placeholder
  • pytest test/models/test_from_checkpoint.py::test_from_checkpoint_state_dict_mapper
  • pre-commit on changed files

Expose filename_format on save_checkpoint for custom model checkpoint
basenames, and add Module._backward_compat_state_dict_mapper for loading
checkpoints after parameter renames. Closes NVIDIA#1175 and NVIDIA#1173.

Signed-off-by: Thabhelo <50872400+Thabhelo@users.noreply.github.com>

copy-pr-bot Bot commented Jun 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

greptile-apps Bot commented Jun 6, 2026

Greptile Summary

This PR adds two independent utilities: a filename_format parameter to save_checkpoint for custom model-checkpoint filenames (e.g., zero-padded epochs), and a _backward_compat_state_dict_mapper hook on Module for version-aware state-dict key remapping in from_checkpoint.

  • The state-dict mapper feature (module.py) is well-structured and correctly integrated into both the zip and tar from_checkpoint code paths using the existing mdlus_file_version metadata key.
  • The filename_format feature (checkpoint.py) works correctly when epoch is explicitly specified, but _resolve_checkpoint_index uses a legacy regex that cannot discover custom-formatted filenames; auto-indexing (epoch=None with a custom format) will always resolve to index 0 and silently overwrite the first checkpoint on every subsequent save.
  • _resolve_checkpoint_index is invoked unconditionally but its result is ignored on the legacy (filename_format=None) code path, triggering a redundant filesystem scan on every call.
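The auto-indexing problem described in the second bullet can be reproduced in isolation. The sketch below is an assumption-laden stand-in: LEGACY_PATTERN and next_index_legacy are hypothetical and do not reproduce the actual regex or helper in checkpoint.py; they only show why a scan that recognises legacy names alone keeps returning index 0 for custom-formatted files.

```python
import re

# Assumed legacy pattern (e.g. "model.3.mdlus"); not the real regex from checkpoint.py.
LEGACY_PATTERN = re.compile(r"^model\.(\d+)\.mdlus$")


def next_index_legacy(existing_files):
    """Return the next free index, scanning with the legacy pattern only."""
    indices = []
    for filename in existing_files:
        match = LEGACY_PATTERN.match(filename)
        if match:
            indices.append(int(match.group(1)))
    # With no recognised files, the scan falls back to index 0.
    return max(indices) + 1 if indices else 0


legacy_files = ["model.0.mdlus", "model.1.mdlus"]
custom_files = ["model_epoch0000.mdlus", "model_epoch0001.mdlus"]

print(next_index_legacy(legacy_files))  # 2
print(next_index_legacy(custom_files))  # 0, so the next save would reuse index 0
```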

Important Files Changed

| Filename | Overview |
| --- | --- |
| physicsnemo/utils/checkpoint.py | Introduces _resolve_checkpoint_index helper and filename_format parameter; auto-indexing is broken when filename_format is used without an explicit epoch because the helper's regex only matches legacy filenames, and the helper result is unused on the legacy code path, causing a double scan. |
| physicsnemo/core/module.py | Adds _backward_compat_state_dict_mapper classmethod (no-op base) and _apply_backward_compat_state_dict static method; correctly wired into both zip and tar from_checkpoint paths using the existing mdlus_file_version metadata key. |
| test/utils/test_checkpoint.py | Adds two tests for filename_format: happy-path with zero-padded epoch and invalid-placeholder error. Tests always pass an explicit epoch, so the broken auto-index path is not exercised. |
| test/models/test_from_checkpoint.py | Adds end-to-end test for state-dict key remapping across model versions; covers key rename and verifies weight values are preserved. |
| CHANGELOG.md | Adds changelog entries for both new features under the Unreleased section. |
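To make the module.py row concrete, here is a rough sketch of a version-aware key-remapping hook, using plain dicts in place of real tensors. Everything in it is hypothetical: BaseModule, MyModel, load_state, the (version, state_dict) signature, the version string "0.1.0", and the fc-to-linear rename are illustrative assumptions, not the API added by this PR.

```python
class BaseModule:
    """Stand-in for a module base class; not the real physicsnemo Module."""

    @classmethod
    def _backward_compat_state_dict_mapper(cls, version, state_dict):
        # No-op by default: checkpoints load unchanged. Signature is assumed.
        return state_dict


class MyModel(BaseModule):
    @classmethod
    def _backward_compat_state_dict_mapper(cls, version, state_dict):
        # Hypothetical rename: the "fc." prefix became "linear." after version 0.1.0.
        if version != "0.1.0":
            return state_dict
        remapped = {}
        for key, value in state_dict.items():
            if key.startswith("fc."):
                key = "linear." + key[len("fc."):]
            remapped[key] = value
        return remapped


def load_state(model_cls, file_version, state_dict):
    """Apply the mapper first; a real loader would call load_state_dict next."""
    return model_cls._backward_compat_state_dict_mapper(file_version, state_dict)


old = {"fc.weight": [1.0], "fc.bias": [0.0], "norm.weight": [1.0]}
print(sorted(load_state(MyModel, "0.1.0", old)))
# ['linear.bias', 'linear.weight', 'norm.weight']
```

Keying the remap on the stored file version means newer checkpoints pass through untouched, and values are carried over unchanged under the new keys.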

Reviews (1): Last reviewed commit: "Add checkpoint filename_format and state..."

Comment thread physicsnemo/utils/checkpoint.py
Comment thread physicsnemo/utils/checkpoint.py Outdated
Comment thread physicsnemo/utils/checkpoint.py

Resolve checkpoint indices from formatted basenames when epoch is omitted,
reuse _resolve_checkpoint_index for the legacy naming path, and add a test
for auto-incrementing custom checkpoint filenames.

Signed-off-by: Thabhelo <50872400+Thabhelo@users.noreply.github.com>

Thabhelo commented Jun 6, 2026

Greptile triage for 8e60df5:

  • P1 auto-indexing with custom filename_format: fixed. _filename_format_index_pattern parses formatted basenames, and test_save_checkpoint_filename_format_auto_index covers the epoch=None path.
  • P2 no-op ternary: fixed. It is now a plain return 0.
  • P2 redundant legacy scan: fixed. legacy naming reuses resolved_index directly.
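One way the P1 fix could work is to derive an index-capturing regex from the format string itself. The sketch below is a guess at the general technique, not the body of _filename_format_index_pattern: the function names, the {name} and {epoch} placeholders, the ".mdlus" suffix handling, and the requirement that {epoch} appears exactly once are all assumptions.

```python
import re
import string


def format_to_index_pattern(filename_format, name, ext=".mdlus"):
    """Compile a regex matching files produced by filename_format, capturing the epoch."""
    parts = []
    epoch_fields = 0
    for literal, field, _spec, _conv in string.Formatter().parse(filename_format):
        parts.append(re.escape(literal))  # literal text must match exactly
        if field is None:
            continue
        if field == "epoch":
            epoch_fields += 1
            parts.append(r"(?P<epoch>\d+)")  # ignores the format spec, digits only
        elif field == "name":
            parts.append(re.escape(name))
        else:
            raise ValueError(f"Unknown placeholder: {field!r}")
    if epoch_fields != 1:
        raise ValueError("filename_format must contain {epoch} exactly once")
    return re.compile("^" + "".join(parts) + re.escape(ext) + "$")


def next_index(existing_files, filename_format, name):
    """Return one past the highest epoch found, or 0 when nothing matches."""
    pattern = format_to_index_pattern(filename_format, name)
    indices = []
    for filename in existing_files:
        match = pattern.match(filename)
        if match:
            indices.append(int(match.group("epoch")))
    return max(indices) + 1 if indices else 0


files = ["model_epoch0000.mdlus", "model_epoch0001.mdlus", "notes.txt"]
print(next_index(files, "{name}_epoch{epoch:04d}", "model"))  # 2
```

Matching digits only keeps the sketch short; format specs that produce signs, separators, or non-decimal output would need a stricter translation from spec to regex.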
