Commit f619e36
fix(ckpt): master-only mkdir + barrier to avoid concurrent-mkdir race on parallel FS (#2699)
Path.mkdir(exist_ok=True) re-checks self.is_dir() after EEXIST and re-raises if
it returns False. On a parallel filesystem (BeeGFS) where every trainer rank
calls mkdir concurrently, a non-master rank can hit EEXIST before its metadata
cache has seen the new inode, so is_dir() returns False and mkdir raises
FileExistsError, crashing the trainer mid-save (observed at weights/step_180).
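
The failure mode described above can be reproduced in isolation. The sketch below is an approximation of the exist_ok=True branch of Path.mkdir, not the trainer's code; `mkdir_exist_ok` and `stale_is_dir` are hypothetical names, and the stale metadata cache is simulated by injecting an is_dir check that always returns False.

```python
import os
import tempfile
from pathlib import Path


def mkdir_exist_ok(path: Path, is_dir=Path.is_dir) -> None:
    """Approximation of Path.mkdir(exist_ok=True), for illustration only.

    `is_dir` is injectable so that a stale metadata cache can be simulated.
    """
    try:
        os.mkdir(path)
    except OSError:
        # The error is swallowed only if the path is visible as a directory.
        if not is_dir(path):
            raise


def stale_is_dir(path: Path) -> bool:
    # Hypothetical stand-in for a client that has not yet seen the new inode.
    return False


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "step_180"
    mkdir_exist_ok(target)  # first caller creates the directory
    mkdir_exist_ok(target)  # second caller: EEXIST, is_dir() is True, no error
    try:
        # Third caller: EEXIST, but the simulated cache says "not a directory".
        mkdir_exist_ok(target, is_dir=stale_is_dir)
        raced = False
    except FileExistsError:
        raced = True
```

With an accurate is_dir() the second call is a no-op; only the combination of EEXIST and a False is_dir() re-raises.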
Guard the three all-ranks mkdir sites behind world.is_master + a barrier so only
the master creates the dir and the others wait until it exists:
- CheckpointManager.save_to_path: dataloader_dir
- CheckpointManager.save: ckpt_path.parent
- WeightCheckpointManager.save: step_path
Mirrors the existing master-only guard in WeightCheckpointManager.save_to_path.
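
A minimal sketch of the master-only mkdir plus barrier pattern, assuming ranks can be simulated with threads and the barrier with threading.Barrier. In the trainer the guard is world.is_master and the barrier would be a distributed collective; `make_dir_master_only`, `run_ranks`, and `WORLD_SIZE` are hypothetical names used only for illustration.

```python
import tempfile
import threading
from pathlib import Path

WORLD_SIZE = 4  # hypothetical number of trainer ranks, simulated with threads


def make_dir_master_only(path: Path, is_master: bool, barrier: threading.Barrier) -> None:
    """Only the master creates the directory; every rank then waits at the barrier."""
    if is_master:
        path.mkdir(parents=True, exist_ok=True)
    # No rank proceeds until the master has finished creating the directory.
    barrier.wait()


def run_ranks(path: Path) -> list:
    barrier = threading.Barrier(WORLD_SIZE)
    seen = [False] * WORLD_SIZE

    def rank_fn(rank: int) -> None:
        make_dir_master_only(path, is_master=(rank == 0), barrier=barrier)
        # After the barrier, each rank records whether it can see the directory.
        seen[rank] = path.is_dir()

    threads = [threading.Thread(target=rank_fn, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


with tempfile.TemporaryDirectory() as tmp:
    results = run_ranks(Path(tmp) / "weights" / "step_180")
```

Because a single rank issues the mkdir, no rank can receive EEXIST from a concurrent creator, so the is_dir() re-check inside Path.mkdir is never reached on the other ranks.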
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 563b37a commit f619e36
1 file changed
Lines changed: 16 additions & 3 deletions
| Hunk | Original file lines | New file lines | Change |
|---|---|---|---|
| 1 | 174 | 174-179 | 1 deletion, 6 additions |
| 2 | 242 | 247-251 | 1 deletion, 5 additions |
| 3 | 393 | 402-406 | 1 deletion, 5 additions |