Commit a47f657

Merge pull request #3538 from AI-Hypercomputer:autocheckpoint-doc
PiperOrigin-RevId: 892676949
2 parents 4a0b8cb + 052b650 commit a47f657

2 files changed

Lines changed: 11 additions & 0 deletions

docs/guides/checkpointing_solutions/emergency_checkpointing.md

Lines changed: 10 additions & 0 deletions
@@ -113,6 +113,16 @@ MaxText provides a set of configuration flags to control checkpointing options.
| `local_checkpoint_period` | The interval, in training steps, for how often a **local checkpoint** is saved. This should be set to a much smaller value than `checkpoint_period` for frequent, low-overhead saves. | `integer` | `0` |
| `checkpoint_period` | The interval, in training steps, for how often a checkpoint is saved to **persistent storage**. | `integer` | `10000` |
| `enable_single_replica_ckpt_restoring` | If `True`, one replica reads the checkpoint from storage and then broadcasts it to all other replicas. This can significantly speed up restoration on multi-host systems by reducing redundant reads from storage. | `boolean` | `False` |
| `enable_autocheckpoint` | If `True`, enables saving a checkpoint when a preemption signal (SIGTERM) is received. This is a reactive mechanism that saves to persistent storage. | `boolean` | `False` |

### Autocheckpoint vs. Emergency Checkpointing

While both features aim to protect against progress loss, they operate differently:

- **Autocheckpoint (`enable_autocheckpoint`)**: A **reactive** mechanism. When the infrastructure sends a `SIGTERM` signal (indicating imminent preemption or maintenance), MaxText immediately attempts to save a checkpoint to persistent storage (GCS). It is best for handling planned maintenance or preemptions where a short grace period is provided.
- **Emergency Checkpointing (`enable_emergency_checkpoint`)**: A **proactive** mechanism. It saves checkpoints very frequently to local, high-speed storage (ramdisk). If a failure occurs *without* warning, the job can recover from the most recent local checkpoint. It is best for handling sudden hardware failures.

For maximum reliability, both features can be enabled simultaneously.
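
The difference between the two mechanisms can be illustrated with a toy loop. This is only a minimal sketch and not MaxText's implementation: the class `PreemptionAwareLoop`, its methods, and the `preempt_at` argument are hypothetical names, and checkpoint writes are replaced by entries in a list. It assumes the handler is installed from the main thread, which Python's `signal` module requires.

```python
import signal


class PreemptionAwareLoop:
    """Toy loop: periodic local saves plus one save when SIGTERM arrives."""

    def __init__(self, local_checkpoint_period):
        self.local_checkpoint_period = local_checkpoint_period
        self.preempted = False
        # (step, destination) records stand in for real checkpoint writes.
        self.saves = []
        # The handler only sets a flag; the save itself happens at a step boundary.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.preempted = True

    def _train_step(self):
        # Placeholder for one real training step.
        pass

    def run(self, num_steps, preempt_at=None):
        for step in range(1, num_steps + 1):
            if preempt_at == step:
                # Simulate the infrastructure sending the preemption signal.
                signal.raise_signal(signal.SIGTERM)
            self._train_step()
            if self.preempted:
                # Reactive path: one save to persistent storage, then stop.
                self.saves.append((step, "persistent"))
                return step
            if step % self.local_checkpoint_period == 0:
                # Proactive path: frequent, cheap save to local storage.
                self.saves.append((step, "local"))
        return num_steps


loop = PreemptionAwareLoop(local_checkpoint_period=2)
last_step = loop.run(num_steps=10, preempt_at=5)
print(last_step, loop.saves)
```

In this sketch the proactive saves bound how much work is lost if the process dies without warning, while the reactive save captures the exact step at which a signalled preemption arrived.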

## Workload creation using XPK

docs/reference/core_concepts/checkpoints.md

Lines changed: 1 addition & 0 deletions
@@ -95,6 +95,7 @@ MaxText automatically saves checkpoints periodically during a training run. Thes
Furthermore, MaxText supports emergency checkpointing, which saves a local copy of the checkpoint that can be restored quickly after an interruption.

- `enable_emergency_checkpoint`: A boolean to enable or disable this feature.
- `enable_autocheckpoint`: A boolean to enable or disable saving a checkpoint when a preemption signal (SIGTERM) is received.
- `local_checkpoint_directory`: The local path for storing emergency checkpoints.
- `local_checkpoint_period`: The interval, in training steps, for saving local checkpoints.
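
As a rough illustration of how these flags might be set together, the sketch below renders a flag dictionary as `key=value` strings after checking that the local period is smaller than the persistent one. The helper `render_overrides`, the `key=value` rendering with lowercase booleans, and the example path `/local` are assumptions made for illustration, not part of MaxText; only the flag names come from the documentation.

```python
def render_overrides(flags):
    """Render flags as key=value strings after a basic period sanity check."""
    local_period = flags["local_checkpoint_period"]
    persistent_period = flags["checkpoint_period"]
    # Local saves are meant to be much more frequent than persistent ones.
    if not 0 < local_period < persistent_period:
        raise ValueError(
            "local_checkpoint_period must be positive and smaller than checkpoint_period"
        )
    rendered = []
    for key, value in flags.items():
        if isinstance(value, bool):
            # Assumed convention: booleans rendered as lowercase true/false.
            value = str(value).lower()
        rendered.append(f"{key}={value}")
    return rendered


flags = {
    "enable_emergency_checkpoint": True,
    "enable_autocheckpoint": True,
    "local_checkpoint_directory": "/local",  # hypothetical path
    "local_checkpoint_period": 20,
    "checkpoint_period": 10000,
}
overrides = render_overrides(flags)
print(" ".join(overrides))
```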
