Add transformers-like checkpoint parameters (--save-total-limit, --save-strategy, and so on) by thechaos16 · Pull Request #547 · sgl-project/SpecForge

thechaos16 · 2026-04-27T14:41:41Z

Motivation

Storage Management
- Prevents disk overflow during long training sessions by strictly limiting the total number of saved checkpoints.
Optimal Model Selection
- Eliminates the manual effort of identifying the best-performing state by automatically tracking and preserving the checkpoint with the best evaluation metrics.

Modifications

Add arguments
- --save-strategy
  - step: same behavior as before
  - best: keep the best checkpoint
- --save-total-limit
  - maximum number of checkpoints to keep on disk
- --metric-for-best
  - metric to compare (default: acc_0)
- --greater-is-better
- --load-best-mode-at-end
Add rotate_checkpoints and sort_checkpoints
- almost the same implementation as in transformers
- slightly different because of the file naming convention
Keep the best model
- similar logic to transformers, but a different implementation: transformers uses the Trainer class, so it can pass the best checkpoint via self, whereas here an external variable is needed
Save best model at end
- unlike transformers, simply copy the best checkpoint to a predefined directory (best).
Documentation and example
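
The rotation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `step_<N>` directory naming, the function signatures, and the choice to protect the best checkpoint by moving it to the end of the sorted list are all assumptions made for the example.

```python
import re
import shutil
from pathlib import Path
from typing import List, Optional

# Hypothetical naming convention ("step_<N>"); the names used by the PR may differ.
_STEP_RE = re.compile(r"^step_(\d+)$")


def sort_checkpoints(output_dir: str, best_checkpoint: Optional[str] = None) -> List[Path]:
    """Return checkpoint directories ordered oldest to newest by step number."""
    found = []
    for path in Path(output_dir).iterdir():
        match = _STEP_RE.match(path.name)
        if path.is_dir() and match:
            found.append((int(match.group(1)), path))
    # Sort numerically so that step_1000 comes after step_300.
    ordered = [path for _, path in sorted(found, key=lambda item: item[0])]

    # Move the best checkpoint to the end so rotation considers it last.
    if best_checkpoint is not None:
        best = Path(best_checkpoint)
        if best in ordered:
            ordered.remove(best)
            ordered.append(best)
    return ordered


def rotate_checkpoints(
    output_dir: str,
    save_total_limit: Optional[int],
    best_checkpoint: Optional[str] = None,
) -> List[Path]:
    """Delete the oldest checkpoints beyond save_total_limit; return what was deleted."""
    if save_total_limit is None or save_total_limit <= 0:
        return []  # no limit configured, keep everything
    ordered = sort_checkpoints(output_dir, best_checkpoint)
    excess = max(len(ordered) - save_total_limit, 0)
    deleted = []
    for path in ordered[:excess]:
        shutil.rmtree(path)
        deleted.append(path)
    return deleted
```

Sorting on the parsed integer rather than the directory name avoids the lexicographic trap where `step_1000` would sort before `step_300`. Directories that do not match the pattern, such as a `best` directory, are ignored by the rotation.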

Accuracy Test

I believe this change is not related to model code.

Benchmark & Profiling

I believe this change is not related to benchmark performance.

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://sgl-fru7574.slack.com/archives/C09784E3EN6 to discuss your PR.

gemini-code-assist

Code Review

This pull request introduces fine-grained checkpoint management to the training pipeline, allowing users to automatically track and preserve the best-performing model based on evaluation metrics while limiting the total number of saved checkpoints to manage disk space. Key updates include new command-line arguments in train_eagle3.py, documentation updates, and the addition of checkpoint rotation utilities in specforge/utils.py. Review feedback identifies a critical typo in an attribute name that would cause a runtime error, significant indentation and logic issues in the sort_checkpoints function, and an unused variable initialization.

gemini-code-assist · 2026-04-27T14:45:44Z

+    best_metric = float("-inf") if args.greater_is_better else float("inf")
+    best_model_checkpoint = None
+    current_is_best = False
+    is_best = False


The variable is_best is initialized here but shadowed by a local variable of the same name inside the training loop (line 1038). It appears to be unused in this scope.
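
One way to avoid the shadowing is to define `is_best` only inside the loop where it is used. The sketch below is an assumption-laden illustration, not the PR's training loop: `track_best` and the `(checkpoint, metric)` pairs are hypothetical names, and only the `best_metric` initialization mirrors the quoted diff.

```python
from typing import Iterable, Optional, Tuple


def track_best(
    eval_results: Iterable[Tuple[str, float]],
    greater_is_better: bool = True,
) -> Tuple[Optional[str], float]:
    """Return the (checkpoint, metric) pair with the best metric seen."""
    # Same initialization as in the diff: any real metric beats the sentinel.
    best_metric = float("-inf") if greater_is_better else float("inf")
    best_checkpoint = None
    for checkpoint, metric in eval_results:
        # is_best lives only in the loop body, so there is no outer
        # variable of the same name to shadow.
        if greater_is_better:
            is_best = metric > best_metric
        else:
            is_best = metric < best_metric
        if is_best:
            best_metric = metric
            best_checkpoint = checkpoint
    return best_checkpoint, best_metric
```

With an empty sequence of results the function returns `None` together with the sentinel value, so callers should check the checkpoint before using it.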


thechaos16 requested review from FlamingoPg, shuaills and sleepcoo as code owners April 27, 2026 14:41

gemini-code-assist Bot reviewed Apr 27, 2026


feat: add save-total-limit with rotation, save-strategy with best, an…

10bb1c7

…d save the best parameter at end

thechaos16 force-pushed the add-save-tota-limit-and-save-strategy branch from e896c4f to 10bb1c7 Compare April 27, 2026 15:01

thechaos16 wants to merge 1 commit into sgl-project:main from thechaos16:add-save-tota-limit-and-save-strategy
