dist_checkpointing package

A library for saving and loading distributed checkpoints. A "distributed checkpoint" can have various underlying formats (the current default format is based on Zarr), but it has one distinctive property: a checkpoint saved in one parallel configuration (tensor/pipeline/data parallelism) can be loaded in a different parallel configuration.

Using the library requires defining sharded state_dict dictionaries with functions from the mapping and optimizer modules. Those state dicts can then be saved or loaded with the serialization module, using strategies from the strategies module.
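
The resharding property can be illustrated with a toy sketch. It does not use this library's API: ``ToyShard``, ``make_shards``, ``toy_save`` and ``toy_load`` are hypothetical names invented for the example, and a plain dict stands in for the on-disk format. The idea it tries to show is that each rank describes its local shard by a key and an offset into the global tensor, so a checkpoint written by 2 ranks can be read back by 4.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyShard:
    """Hypothetical stand-in for a sharded tensor: local data plus its global position."""
    key: str              # name of the global tensor this shard belongs to
    data: List[float]     # the local slice held by one rank
    global_length: int    # length of the full (unsharded) 1-D tensor
    offset: int           # where the local slice starts in the global tensor


def make_shards(key: str, full: List[float], world_size: int) -> List[ToyShard]:
    """Split a 1-D 'tensor' evenly across world_size ranks (assumes even divisibility)."""
    assert len(full) % world_size == 0
    chunk = len(full) // world_size
    return [
        ToyShard(key, full[r * chunk:(r + 1) * chunk], len(full), r * chunk)
        for r in range(world_size)
    ]


def toy_save(shards: List[ToyShard], store: Dict[str, list]) -> None:
    """Write every shard into the global buffer stored under its key."""
    for s in shards:
        buf = store.setdefault(s.key, [None] * s.global_length)
        buf[s.offset:s.offset + len(s.data)] = s.data


def toy_load(requests: List[ToyShard], store: Dict[str, list]) -> List[ToyShard]:
    """Fill each requested shard from the stored global buffer, whatever layout saved it."""
    for s in requests:
        s.data = store[s.key][s.offset:s.offset + len(s.data)]
    return requests


# Save with 2 "ranks", then load with 4 "ranks".
full = [float(i) for i in range(8)]
store: Dict[str, list] = {}
toy_save(make_shards("layer.weight", full, world_size=2), store)

# Placeholder shards describe the layout wanted at load time.
requests = make_shards("layer.weight", [0.0] * 8, world_size=4)
loaded = toy_load(requests, store)
```

Because the stored data is addressed by global offsets rather than by rank, the load side only needs to describe the layout it wants. The real library works with multi-dimensional tensors, replicas, and pluggable storage strategies, none of which are modelled here.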

Subpackages

.. toctree::
   :maxdepth: 4

   dist_checkpointing.strategies

Submodules

dist_checkpointing.serialization module

.. automodule:: core.dist_checkpointing.serialization
   :members:
   :undoc-members:
   :show-inheritance:

dist_checkpointing.mapping module

.. automodule:: core.dist_checkpointing.mapping
   :members:
   :undoc-members:
   :show-inheritance:

dist_checkpointing.optimizer module

.. automodule:: core.dist_checkpointing.optimizer
   :members:
   :undoc-members:
   :show-inheritance:

dist_checkpointing.core module

.. automodule:: core.dist_checkpointing.core
   :members:
   :undoc-members:
   :show-inheritance:

dist_checkpointing.dict_utils module

.. automodule:: core.dist_checkpointing.dict_utils
   :members:
   :undoc-members:
   :show-inheritance:

dist_checkpointing.utils module

.. automodule:: core.dist_checkpointing.utils
   :members:
   :undoc-members:
   :show-inheritance:

Module contents

.. automodule:: core.dist_checkpointing
   :members:
   :undoc-members:
   :show-inheritance: