
feat: lr scheduler #151

Open
Chamberlain0w0 wants to merge 11 commits into master from feat/lr_scheduler

Conversation


@Chamberlain0w0 Chamberlain0w0 commented May 8, 2026

In the 2025 training-camp project topic "Learning Rate Scheduler Implementation", @littleotherut completed a basic implementation of the learning-rate scheduler module (PR #113). This PR builds on that work, revising the interfaces and related conventions so that the module fits the actual needs of our project.

The design document is available here: https://gxtctab8no8.feishu.cn/wiki/Bd6Pw1BeeiQ7QfktiT8cbSiTnFb?from=from_copylink.

Core changes:

  1. Interface changes: adopt Megatron-style learning-rate scheduling parameters; the corresponding gflags were added in main.cc:
DEFINE_double(learning_rate, ..., "Peak learning rate.");
DEFINE_double(min_lr, 0.0, "Minimum learning rate.");
DEFINE_string(lr_decay_style, "constant", "LR decay style: none|constant|linear|cosine|inverse-square-root");
DEFINE_int64(lr_warmup_iters, 0, "Number of linear warmup iterations.");
DEFINE_double(lr_warmup_init, 0.0, "Initial learning rate at the start of warmup.");
DEFINE_int64(lr_decay_iters, 0, "Number of iterations to decay LR over (0 = num_iteration).");

Once these parameters are parsed, the corresponding TrainingLRSchedulerConfig struct is constructed and passed, together with the optimizer, to CreateLRScheduler(), which builds the matching learning-rate scheduler.
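For illustration, a minimal sketch of that wiring (the TrainingLRSchedulerConfig field names and the exact CreateLRScheduler() signature below are assumptions inferred from the flags above, not copied from the repo):

TrainingLRSchedulerConfig lr_config;
lr_config.learning_rate   = FLAGS_learning_rate;   // peak LR (field names assumed)
lr_config.min_lr          = FLAGS_min_lr;
lr_config.lr_decay_style  = FLAGS_lr_decay_style;  // e.g. "cosine"
lr_config.lr_warmup_iters = FLAGS_lr_warmup_iters;
lr_config.lr_warmup_init  = FLAGS_lr_warmup_init;
lr_config.lr_decay_iters  = FLAGS_lr_decay_iters;  // 0 = fall back to num_iteration

// The factory takes the config together with the optimizer it will drive.
auto scheduler = CreateLRScheduler(lr_config, optimizer);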

  2. Design changes
    a. Added an LRScheduler base class plus a derived class for each scheduling policy, aligned with the torch implementation and its usage pattern: in the training loop, scheduler.step() is called after optimizer.step() to apply the learning-rate update (see the sketch after this list).
    b. The LRScheduler has to interact with the Optimizer to read, update, and synchronize the learning rate, so the corresponding setters and getters were added to the Optimizer base class.
    c. LRScheduler::State() is still a fairly naive implementation; it will be revised further once the checkpoint (ckpt) mechanism is in place.
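A minimal sketch of the resulting training-loop order (the model/batch helper names and the num_iteration flag are placeholders for illustration; only the optimizer-then-scheduler ordering is the point):

for (int64_t iter = 0; iter < FLAGS_num_iteration; ++iter) {
  auto loss = model.Forward(batch);  // placeholder forward/backward
  loss.Backward();
  optimizer.Step();                  // update parameters first
  scheduler.Step();                  // then advance the LR schedule
}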

kinorw and others added 9 commits April 2, 2026 00:06
…r accessors, passthrough SetLearningRate/GetLearningRate, and add initial_learning_rate and its accessors
…base class, add factory method Create<T>() with two-phase init, and update all tests to use the Create<T>() factory method.

- Change Step() to virtual with default implementation
- Add pure virtual ComputeLR() for subclasses to implement.
- Adapt test helpers (IdentityScheduler, LinearDecayScheduler) to implement ComputeLR() instead of Step().
- All existing tests pass without behavioral changes.

BREAKING CHANGE: Subclasses must implement ComputeLR() instead of Step().
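To make the shape of that change concrete, a hedged sketch of the pattern this commit describes (the Optimizer stub and the constructor are assumptions added for self-containedness; SetLearningRate/GetLearningRate mirror the accessors from the first commit; note a later commit replaces ComputeLR() with GetClosedFormLR()):

#include <cstdint>

class Optimizer {  // minimal stub; the real class lives in the repo
 public:
  void SetLearningRate(double lr) { lr_ = lr; }
  double GetLearningRate() const { return lr_; }
 private:
  double lr_ = 0.0;
};

class LRScheduler {
 public:
  explicit LRScheduler(Optimizer* opt) : optimizer_(opt) {}
  virtual ~LRScheduler() = default;
  // Default Step(): advance the counter, then push the new LR to the optimizer.
  virtual void Step() {
    ++step_count_;
    optimizer_->SetLearningRate(ComputeLR());
  }
 protected:
  virtual double ComputeLR() = 0;  // each scheduling policy supplies its formula
  Optimizer* optimizer_;
  int64_t step_count_ = 0;
};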
…closed and chained form, adjust LinearLR and SequentialLR

- enhance LRScheduler with chained and closed form learning rate methods
- adapt methods (Step, InitialStep, GetClosedFormLR, GetChainedFormLR) to match PyTorch's design
- add tests for consistency
- refactor LinearLR: add end_factor, and rename LinearwarmupLR to LinearLR
- add InitialStep and UndoChildInitialSteps to SequentialLR

BREAKING CHANGE: Subclasses must implement GetClosedFormLR() instead of ComputeLR(). Use LinearLR instead of LinearwarmupLR.
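For intuition on the closed form, here is the torch.optim.lr_scheduler.LinearLR formula that a LinearLR with start_factor/end_factor aligns with, written as a standalone function (a sketch; the repo's actual member and method names may differ):

#include <cstdint>

// Closed-form LinearLR: the multiplicative factor ramps linearly from
// start_factor to end_factor over total_iters steps, then stays at end_factor.
double LinearLRClosedForm(double base_lr, int64_t step, int64_t total_iters,
                          double start_factor, double end_factor) {
  if (step >= total_iters) return base_lr * end_factor;
  double t = static_cast<double>(step) / static_cast<double>(total_iters);
  return base_lr * (start_factor + (end_factor - start_factor) * t);
}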
- Add LRSchedulerConfig struct with parameters for all basic schedulers (constant, linear, step)
- Add CreateLRScheduler() factory function
- Support automatic warmup wrapping via SequentialLR when warmup_steps > 0
- Adapt test files
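A hedged sketch of the warmup-wrapping decision this commit describes (CreateDecayScheduler, MakeLinearWarmup, MakeSequentialLR, and the field names are hypothetical helpers invented here for illustration; only the warmup_steps > 0 branching mirrors the commit):

std::unique_ptr<LRScheduler> CreateLRScheduler(const LRSchedulerConfig& cfg,
                                               Optimizer* opt) {
  auto decay = CreateDecayScheduler(cfg, opt);  // constant | linear | step
  if (cfg.warmup_steps <= 0) {
    return decay;                               // no warmup requested
  }
  // warmup_steps > 0: run a linear warmup first, then hand over to the
  // decay scheduler, composed via SequentialLR at the warmup boundary.
  auto warmup = MakeLinearWarmup(cfg, opt);
  return MakeSequentialLR(std::move(warmup), std::move(decay),
                          /*milestone=*/cfg.warmup_steps);
}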
…ogs, and integrate scheduler into training loop
…s, add validation tests for learning rate schedulers

- it is now only used for learning-rate recovery when using loadstate
@Chamberlain0w0 Chamberlain0w0 self-assigned this May 8, 2026
@JYMiracle305 JYMiracle305 self-requested a review May 8, 2026 07:32
