A library for saving and loading distributed checkpoints. A "distributed checkpoint" can use various underlying formats (the current default is based on Zarr), but it has one distinctive property: a checkpoint saved in one parallel configuration (tensor/pipeline/data parallelism) can be loaded in a different parallel configuration.

Using the library requires defining sharded ``state_dict`` dictionaries with functions from the ``mapping`` and ``optimizer`` modules. These state dicts can then be saved or loaded with the ``serialization`` module, using strategies from the ``strategies`` module.
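
The resharding property can be illustrated with a toy sketch that is independent of this library. It does not use this package's API: ``Shard``, ``shard_for_rank``, ``save``, ``load`` and the in-memory ``store`` are hypothetical names introduced only for illustration, and the sketch assumes 1-D data whose length divides evenly by the number of ranks. The idea it shows is that each shard records its position in the global tensor, so a checkpoint written by 2 ranks can be read back by 4.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Shard:
    """Hypothetical stand-in for a sharded tensor: a 1-D slice of a global array."""
    key: str
    global_size: int
    offset: int  # start index of this shard within the global array
    data: List[float]


def shard_for_rank(key: str, full: List[float], rank: int, world_size: int) -> Shard:
    """Return the contiguous slice of `full` owned by `rank` (assumes even division)."""
    chunk = len(full) // world_size
    start = rank * chunk
    return Shard(key, len(full), start, full[start:start + chunk])


def save(shards: List[Shard], store: Dict[str, list]) -> None:
    """Toy 'save': place every shard into a global buffer addressed by key and offset."""
    for s in shards:
        buf = store.setdefault(s.key, [None] * s.global_size)
        buf[s.offset:s.offset + len(s.data)] = s.data


def load(key: str, rank: int, world_size: int, store: Dict[str, list]) -> Shard:
    """Toy 'load': read the slice `rank` owns under a possibly different world size."""
    return shard_for_rank(key, store[key], rank, world_size)


# Save with 2 "ranks", then load with 4 "ranks".
weights = [float(i) for i in range(8)]
store: Dict[str, list] = {}
save([shard_for_rank("layer.weight", weights, r, 2) for r in range(2)], store)
loaded = [load("layer.weight", r, 4, store) for r in range(4)]
```

Because the stored layout is keyed by global offsets rather than by the saving rank, the loading side only needs to describe which slice it wants. The real library handles multi-dimensional tensors, replication and on-disk formats, none of which this sketch attempts.
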

.. toctree::
   :maxdepth: 4

   dist_checkpointing.strategies

.. automodule:: core.dist_checkpointing.serialization
   :members:
   :undoc-members:
   :show-inheritance:

.. automodule:: core.dist_checkpointing.mapping
   :members:
   :undoc-members:
   :show-inheritance:

.. automodule:: core.dist_checkpointing.optimizer
   :members:
   :undoc-members:
   :show-inheritance:

.. automodule:: core.dist_checkpointing.core
   :members:
   :undoc-members:
   :show-inheritance:

.. automodule:: core.dist_checkpointing.dict_utils
   :members:
   :undoc-members:
   :show-inheritance:

.. automodule:: core.dist_checkpointing.utils
   :members:
   :undoc-members:
   :show-inheritance:

.. automodule:: core.dist_checkpointing
   :members:
   :undoc-members:
   :show-inheritance: