Add transformers-like checkpoint parameters (--save-total-limit, --save-strategy, and so on) #547

Open
thechaos16 wants to merge 1 commit into sgl-project:main from thechaos16:add-save-tota-limit-and-save-strategy

Conversation

@thechaos16

Motivation

  • Storage Management
    • Prevents disk overflow during long training sessions by strictly limiting the total number of saved checkpoints.
  • Optimal Model Selection
    • Eliminates the manual effort of identifying the best-performing state by automatically tracking and preserving the checkpoint with the best evaluation metrics.

Modifications

  • add arguments
    • --save-strategy
      • step: same as before
      • best: keep the best one
    • --save-total-limit
      • maximum number of checkpoints to keep on disk
    • --metric-for-best
      • metric to compare (by default, acc_0)
    • --greater-is-better
    • --load-best-mode-at-end
  • add rotate_checkpoints and sort_checkpoints
    • nearly the same implementation as in transformers
    • slightly different because of the file naming convention
  • keep the best model
    • logic similar to transformers, but implemented differently: transformers uses a Trainer class and can track the best checkpoint via self, whereas here it has to be held in an external variable
  • save best model at end
    • unlike transformers, simply copy the best checkpoint into a predefined directory (best)
  • documentation and an example
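
For readers unfamiliar with the rotation idea, here is a minimal sketch of how sort_checkpoints and rotate_checkpoints could fit together. It is not the code in this PR: the `epoch_{e}_step_{s}` directory naming, the function signatures, and the choice to skip the best checkpoint during deletion are all assumptions made for illustration.

```python
import re
import shutil
from pathlib import Path
from typing import List, Optional

# Hypothetical naming scheme; the convention used in the PR may differ.
_CKPT_PATTERN = re.compile(r"^epoch_(\d+)_step_(\d+)$")


def sort_checkpoints(output_dir: str) -> List[Path]:
    """Return checkpoint directories ordered oldest to newest.

    Ordering uses the numeric (epoch, step) pair, so step_1000 sorts
    after step_200 (a plain string sort would get this wrong).
    """
    found = []
    for path in Path(output_dir).iterdir():
        match = _CKPT_PATTERN.match(path.name)
        if path.is_dir() and match:
            key = (int(match.group(1)), int(match.group(2)))
            found.append((key, path))
    return [path for _, path in sorted(found, key=lambda item: item[0])]


def rotate_checkpoints(
    output_dir: str,
    save_total_limit: Optional[int],
    best_model_checkpoint: Optional[str] = None,
) -> List[Path]:
    """Delete the oldest checkpoints until at most save_total_limit remain.

    The best checkpoint, if given, is never deleted. Returns the deleted paths.
    """
    if save_total_limit is None or save_total_limit <= 0:
        return []  # No limit configured: keep everything.

    ordered = sort_checkpoints(output_dir)
    excess = len(ordered) - save_total_limit
    best = Path(best_model_checkpoint).resolve() if best_model_checkpoint else None

    deleted = []
    for path in ordered:
        if excess <= 0:
            break
        if best is not None and path.resolve() == best:
            continue  # Protect the best checkpoint from rotation.
        shutil.rmtree(path)
        deleted.append(path)
        excess -= 1
    return deleted
```

Directories that do not match the pattern (for example the `best` directory mentioned above) are ignored, so a copy of the best checkpoint would not count against the limit in this sketch.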

Accuracy Test

I believe this change does not affect model code.

Benchmark & Profiling

I believe this change does not affect benchmark performance.

Checklist

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces fine-grained checkpoint management to the training pipeline, allowing users to automatically track and preserve the best-performing model based on evaluation metrics while limiting the total number of saved checkpoints to manage disk space. Key updates include new command-line arguments in train_eagle3.py, documentation updates, and the addition of checkpoint rotation utilities in specforge/utils.py. Review feedback identifies a critical typo in an attribute name that would cause a runtime error, significant indentation and logic issues in the sort_checkpoints function, and an unused variable initialization.

Comment thread scripts/train_eagle3.py Outdated
Comment thread specforge/utils.py Outdated
Comment thread scripts/train_eagle3.py
best_metric = float("-inf") if args.greater_is_better else float("inf")
best_model_checkpoint = None
current_is_best = False
is_best = False
Contributor


medium

The variable is_best is initialized here but shadowed by a local variable of the same name inside the training loop (line 1038). It appears to be unused in this scope.
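
One possible way to avoid the redundant outer initialization is to compute the flag at the point of comparison and return it together with the updated state. The sketch below is only an illustration: `update_best` is a hypothetical helper, not a function from this PR, and it assumes the metric is a single float compared according to `greater_is_better`.

```python
from typing import Optional, Tuple


def update_best(
    metric_value: float,
    checkpoint_path: str,
    best_metric: float,
    best_model_checkpoint: Optional[str],
    greater_is_better: bool,
) -> Tuple[float, Optional[str], bool]:
    """Return (best_metric, best_model_checkpoint, is_best) after one evaluation."""
    if greater_is_better:
        is_best = metric_value > best_metric
    else:
        is_best = metric_value < best_metric
    if is_best:
        # The new checkpoint replaces the previous best.
        return metric_value, checkpoint_path, True
    # Otherwise the previous best is carried forward unchanged.
    return best_metric, best_model_checkpoint, False


# Usage: start from the same sentinel values as in the snippet under review.
greater_is_better = True
best_metric = float("-inf") if greater_is_better else float("inf")
best_model_checkpoint = None
best_metric, best_model_checkpoint, is_best = update_best(
    0.71, "ckpt_a", best_metric, best_model_checkpoint, greater_is_better
)
```

With this shape, `is_best` exists only where it is assigned, so there is nothing left to shadow.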

@thechaos16 thechaos16 force-pushed the add-save-tota-limit-and-save-strategy branch from e896c4f to 10bb1c7 Compare April 27, 2026 15:01
