
[Training Camp] Learning-rate scheduler implementation #113

Open
littleotherut wants to merge 8 commits into InfiniTensor:master from littleotherut:lr_scheduler

Conversation

@littleotherut

No description provided.


std::shared_ptr<Optimizer> optimizer_;
int64_t last_step_;
float current_lr_;
Contributor


current_lr_ also looks somewhat redundant. Semantically, current_lr_ and optimizer_->GetLearningRate() should be equal at all times, yet in the current design the two are stored separately and used interchangeably (on reading the code, current_lr_ turns out to be a copy of optimizer_->GetLearningRate()). Your handling is numerically correct as it stands, but this design is likely to cause ambiguity when others extend it later.

I suggest keeping a single source of truth for the current learning rate: either track it entirely via optimizer_->GetLearningRate() and drop current_lr_ from the scheduler, or have the scheduler track it and set it back into the optimizer after each computation. I think the former is the better fit.

Author


Fixed. Since the scheduler needs to support resuming training, and schedulers such as SequentialLR or ChainedScheduler do not support closed-form computation (the learning rate cannot be derived directly from base_lr and last_epoch), the interface is retained solely for learning-rate recovery and renamed to recover_lr to avoid confusion.

Copilot AI review requested due to automatic review settings March 20, 2026 20:34

Copilot AI left a comment


Pull request overview

This PR introduces a learning-rate scheduler system to infini_train, integrates it with optimizers (including distributed optimizer), and adds standalone C++ test executables plus example CLI wiring to exercise the new schedulers.

Changes:

  • Add LRScheduler base + concrete schedulers (ConstantLR/StepLR/LinearLR/LambdaLR/SequentialLR/ChainedScheduler) and a CreateLRScheduler factory.
  • Extend Optimizer with runtime-settable learning rate and initial learning rate tracking; propagate LR to DistributedOptimizer.
  • Add scheduler coverage tests and wire scheduler flags into example/gpt2 and example/llama3; register new test executables in CMake.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
infini_train/include/lr_scheduler.h Declares scheduler APIs, configs, and concrete scheduler types.
infini_train/src/lr_scheduler.cc Implements scheduler logic, factory creation, state save/load, sequential/chained behavior.
infini_train/include/optimizer.h Adds LR getters/setters + initial LR tracking to support schedulers.
infini_train/src/optimizer.cc Implements optimizer LR plumbing and updates SGD/Adam to use base LR storage.
infini_train/include/nn/parallel/ddp/distributed_optimizer.h Overrides LR get/set for distributed optimizer so schedulers affect the real base optimizer.
infini_train/src/nn/parallel/ddp/distributed_optimizer.cc Implements LR propagation to/from the wrapped base optimizer.
example/gpt2/main.cc Adds scheduler CLI flags and steps the scheduler during training.
example/llama3/main.cc Adds scheduler CLI flags and steps the scheduler during training.
test/lr_scheduler/test_helpers.h Shared minimal test helpers/macros for scheduler tests.
test/lr_scheduler/test_*.cc Adds functional + state + validation tests for schedulers.
CMakeLists.txt Adds new scheduler test executables to the build.


@Chamberlain0w0
Contributor

A few more development-convention issues to address:

  1. Use English comments; there are currently some Chinese comments under test/, please update them globally.
  2. Rebase onto master and resolve the conflicts.
  3. The commits are currently numerous and scattered; please squash them into a few main commits organized by the key features/changes implemented.

kinorw and others added 8 commits April 2, 2026 00:06
…r accessors, passthrough SetLearningRate/GetLearningRate, and add initial_learning_rate and its accessors
…base class, add factory method Create<T>() with two-phase init and update all tests to use Create<T>() factory method.

- Change Step() to virtual with default implementation
- Add pure virtual ComputeLR() for subclasses to implement.
- Adapt test helpers (IdentityScheduler, LinearDecayScheduler) to  implement ComputeLR() instead of Step().
- All existing tests pass without behavioral changes.

BREAKING CHANGE: Subclasses must implement ComputeLR() instead of Step().
…closed and chained form, adjust LinearLR and SequentialLR

- enhance LRScheduler with chained and closed form learning rate methods
- adapt methods (Step, InitialStep, GetClosedFormLR, GetChainedFormLR) to match PyTorch's design
- add tests for consistency
- refactor LinearLR: add end_factor, and rename this class
- add SequentialLR InitialStep and UndoChildInitialSteps

BREAKING CHANGE: Subclasses must implement GetClosedFormLR instead of ComputeLR(). Should use LinearLR instead of LinearwarmupLR.
- Add LRSchedulerConfig struct with parameters for all basic schedulers(constant, linear, step)
- Add CreateLRScheduler() factory function
- Support automatic warmup wrapping via SequentialLR when warmup_steps > 0
- Adapt test files
…ogs, and integrate scheduler into training loop
…s, add validation tests for learning rate schedulers

- it is now only used for learning-rate recovery when using loadstate