[feat] support keep_checkpoint_max with async checkpoint pruning #528
Changes from all commits
0917e22
a2bfa13
0273efb
d2c90ff
aebfb94
6a7c102
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -70,5 +70,7 @@ message TrainConfig { | |||||||||
| optional uint32 gradient_accumulation_steps = 18; | ||||||||||
| // dense gradient clipping config | ||||||||||
| optional GradClipping grad_clipping = 19; | ||||||||||
| // maximum number of recent checkpoints to keep; 0 keeps all. | ||||||||||
| optional uint32 keep_checkpoint_max = 20 [default = 0]; | ||||||||||
|
This comment reads as a hard cap, but when
| // TBD: qcomm config | ||||||||||
| } | ||||||||||
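For illustration, the "0 keeps all" semantics of `keep_checkpoint_max` could be sketched as a pure selection function. This is a minimal sketch, not the PR's implementation: the helper name `checkpoints_to_prune` and the `model.ckpt-<step>` directory naming are assumptions made for the example.

```python
import re


def checkpoints_to_prune(names, keep_checkpoint_max):
    """Return the checkpoint names to delete, oldest first.

    Hypothetical helper: assumes checkpoints are named `model.ckpt-<step>`.
    A value of 0 disables pruning, matching the proto comment.
    """
    if keep_checkpoint_max == 0:
        return []

    def step_of(name):
        match = re.fullmatch(r"model\.ckpt-(\d+)", name)
        return int(match.group(1)) if match else None

    # Ignore entries that do not look like checkpoints, then sort by step.
    stepped = sorted(
        (step_of(name), name) for name in names if step_of(name) is not None
    )
    excess = len(stepped) - keep_checkpoint_max
    return [name for _, name in stepped[: max(excess, 0)]]
```

Sorting by the numeric step rather than by name avoids treating `model.ckpt-10` as older than `model.ckpt-2`.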
`close()` only runs if the training loop completes normally. If any step between the first `save()` (which starts the daemon worker) and here raises, `close()` is skipped: the final coalesced prune pass is abandoned and the worker thread is leaked (harmless across process exit since it's a daemon, but it breaks the documented "on-disk state settled before export reads `model_dir`" contract if anything downstream runs in the same process). Consider wrapping the loop body so `close()` runs in a `finally`.
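A rough sketch of the suggested shape, under stated assumptions: `AsyncPruner` and `train` below are hypothetical stand-ins, not the classes in this PR. Only the `save()`/`close()` method names and the daemon worker come from the comment above; the worker here just records steps instead of deleting files.

```python
import queue
import threading


class AsyncPruner:
    """Hypothetical stand-in for the PR's async checkpoint pruner."""

    def __init__(self):
        self._queue = queue.Queue()
        self._worker = None
        self.pruned = []  # steps handled by the worker, for inspection

    def save(self, step):
        # The first save() lazily starts the daemon worker.
        if self._worker is None:
            self._worker = threading.Thread(target=self._run, daemon=True)
            self._worker.start()
        self._queue.put(step)

    def _run(self):
        while True:
            step = self._queue.get()
            if step is None:  # sentinel: drain is complete
                return
            self.pruned.append(step)

    def close(self):
        # Flush pending work and join so on-disk state is settled.
        if self._worker is not None:
            self._queue.put(None)
            self._worker.join()
            self._worker = None


def train(pruner, num_steps, fail_at=None):
    try:
        for step in range(num_steps):
            if step == fail_at:
                raise RuntimeError("step failed")
            pruner.save(step)
    finally:
        # Runs on both normal completion and failure.
        pruner.close()
```

With this shape, an exception raised mid-loop still propagates to the caller, but only after the worker has drained its queue and been joined.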