Commit 156dc05

committed
doc
1 parent 7ca9655 commit 156dc05

File tree

1 file changed: +156 -5 lines changed


doc/train/learning-rate.md

Lines changed: 156 additions & 5 deletions
@@ -1,11 +1,12 @@
 # Learning rate
 
-DeePMD-kit supports two learning rate schedules:
+DeePMD-kit supports three learning rate schedules:
 
 - **`exp`**: Exponential decay with optional stepped or smooth mode
 - **`cosine`**: Cosine annealing for smooth decay curve
+- **`wsd`**: Warmup-stable-decay with configurable final decay rule
 
-Both schedules support an optional warmup phase where the learning rate gradually increases from a small initial value to the target {ref}`start_lr <learning_rate/start_lr>`.
+All schedules support an optional warmup phase where the learning rate gradually increases from a small initial value to the target {ref}`start_lr <learning_rate/start_lr>`.
 
 This page focuses on schedule behavior, examples, and formulas. For the canonical argument definitions, see {ref}`learning_rate <learning_rate>`.
 
@@ -32,14 +33,26 @@ This page focuses on schedule behavior, examples, and formulas. For the canonica
 }
 ```
 
+### Warmup-stable-decay
+
+```json
+"learning_rate": {
+    "type": "wsd",
+    "start_lr": 0.001,
+    "stop_lr": 1e-6,
+    "decay_phase_ratio": 0.1
+}
+```
+
 ## Common parameters
 
 Use {ref}`learning_rate <learning_rate>` as the canonical parameter reference. This page only highlights the argument combinations that matter when choosing a schedule:
 
-- Shared by both {ref}`exp <learning_rate[exp]>` and {ref}`cosine <learning_rate[cosine]>`: {ref}`start_lr <learning_rate/start_lr>` plus exactly one of {ref}`stop_lr <learning_rate/stop_lr>` or {ref}`stop_lr_ratio <learning_rate/stop_lr_ratio>`.
-- Optional warmup for both schedules: {ref}`warmup_steps <learning_rate/warmup_steps>` or {ref}`warmup_ratio <learning_rate/warmup_ratio>`, with optional {ref}`warmup_start_factor <learning_rate/warmup_start_factor>`.
-- Optional distributed scaling for both schedules: {ref}`scale_by_worker <learning_rate/scale_by_worker>`.
+- Shared by {ref}`exp <learning_rate[exp]>`, {ref}`cosine <learning_rate[cosine]>`, and {ref}`wsd <learning_rate[wsd]>`: {ref}`start_lr <learning_rate/start_lr>` plus exactly one of {ref}`stop_lr <learning_rate/stop_lr>` or {ref}`stop_lr_ratio <learning_rate/stop_lr_ratio>`.
+- Optional warmup for all schedules: {ref}`warmup_steps <learning_rate/warmup_steps>` or {ref}`warmup_ratio <learning_rate/warmup_ratio>`, with optional {ref}`warmup_start_factor <learning_rate/warmup_start_factor>`.
+- Optional distributed scaling for all schedules: {ref}`scale_by_worker <learning_rate/scale_by_worker>`.
 - Additional options for {ref}`exp <learning_rate[exp]>`: {ref}`decay_steps <learning_rate[exp]/decay_steps>`, {ref}`decay_rate <learning_rate[exp]/decay_rate>`, and {ref}`smooth <learning_rate[exp]/smooth>`.
+- Additional options for {ref}`wsd <learning_rate[wsd]>`: {ref}`decay_phase_ratio <learning_rate[wsd]/decay_phase_ratio>` and {ref}`decay_type <learning_rate[wsd]/decay_type>`.
 - {ref}`cosine <learning_rate[cosine]>` has no extra schedule-specific arguments beyond the shared ones.
 
 See [Mathematical Theory](#mathematical-theory) for complete formulas.
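The "exactly one of `stop_lr` or `stop_lr_ratio`" rule shared by all schedules can be sketched with a hypothetical helper (`resolve_stop_lr` is not a DeePMD-kit function; it only illustrates the resolution rule):

```python
# Hypothetical helper, not part of DeePMD-kit: illustrates how a ratio
# resolves to an absolute stop learning rate, and that exactly one of the
# two options must be given.
def resolve_stop_lr(start_lr, stop_lr=None, stop_lr_ratio=None):
    """Return the effective final learning rate."""
    if (stop_lr is None) == (stop_lr_ratio is None):
        raise ValueError("specify exactly one of stop_lr or stop_lr_ratio")
    return stop_lr if stop_lr is not None else start_lr * stop_lr_ratio
```

For example, `resolve_stop_lr(0.001, stop_lr_ratio=1e-3)` yields the same final learning rate as `stop_lr=1e-6`.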
@@ -163,6 +176,79 @@ After warmup, the learning rate follows a cosine curve from {ref}`start_lr <lear
 }
 ```
 
+## Warmup-Stable-Decay Schedule
+
+The warmup-stable-decay ({ref}`wsd <learning_rate[wsd]>`) schedule keeps the learning rate at {ref}`start_lr <learning_rate/start_lr>` for most of the post-warmup training steps and then applies a shorter final decay phase.
+
+The length of the final decay phase is controlled by {ref}`decay_phase_ratio <learning_rate[wsd]/decay_phase_ratio>`. The remaining post-warmup steps form the stable phase. The decay rule is selected by {ref}`decay_type <learning_rate[wsd]/decay_type>`, which supports `inverse_linear` (default), `cosine`, and `linear`.
+
+### Examples
+
+**Basic WSD with default inverse-linear decay:**
+
+```json
+"learning_rate": {
+    "type": "wsd",
+    "start_lr": 0.001,
+    "stop_lr": 1e-6,
+    "decay_phase_ratio": 0.1
+}
+```
+
+This configuration uses a stable phase for most of the post-warmup training and reserves the final 10% of total training steps for the decay phase.
+
+**Using {ref}`stop_lr_ratio <learning_rate/stop_lr_ratio>`:**
+
+```json
+"learning_rate": {
+    "type": "wsd",
+    "start_lr": 0.001,
+    "stop_lr_ratio": 1e-3,
+    "decay_phase_ratio": 0.1
+}
+```
+
+Equivalent to `stop_lr: 1e-6` (i.e., `0.001 * 1e-3`).
+
+**With warmup:**
+
+```json
+"learning_rate": {
+    "type": "wsd",
+    "start_lr": 0.001,
+    "stop_lr": 1e-6,
+    "decay_phase_ratio": 0.1,
+    "warmup_steps": 5000,
+    "warmup_start_factor": 0.0
+}
+```
+
+Warmup first increases the learning rate to {ref}`start_lr <learning_rate/start_lr>`. After warmup, the schedule enters the stable phase and then decays during the final WSD decay phase.
+
+**WSD with cosine decay phase:**
+
+```json
+"learning_rate": {
+    "type": "wsd",
+    "start_lr": 0.001,
+    "stop_lr": 1e-6,
+    "decay_phase_ratio": 0.1,
+    "decay_type": "cosine"
+}
+```
+
+**WSD with linear decay phase:**
+
+```json
+"learning_rate": {
+    "type": "wsd",
+    "start_lr": 0.001,
+    "stop_lr": 1e-6,
+    "decay_phase_ratio": 0.1,
+    "decay_type": "linear"
+}
+```
+
 ## Warmup Mechanism
 
 Warmup is a technique to stabilize training in early stages by gradually increasing the learning rate from {ref}`warmup_start_factor <learning_rate/warmup_start_factor>` `*` {ref}`start_lr <learning_rate/start_lr>` to {ref}`start_lr <learning_rate/start_lr>`.
@@ -185,6 +271,10 @@ The exact piecewise warmup formula is given in [Mathematical Theory](#mathematic
 | $f^{\text{warmup}}$ | {ref}`warmup_start_factor <learning_rate/warmup_start_factor>`: Initial warmup LR factor |
 | $s$ | {ref}`decay_steps <learning_rate[exp]/decay_steps>`: Decay period for exponential schedule |
 | $r$ | {ref}`decay_rate <learning_rate[exp]/decay_rate>`: Decay rate for exponential schedule |
+| $\rho^{\text{wsd}}$ | {ref}`decay_phase_ratio <learning_rate[wsd]/decay_phase_ratio>`: Ratio of WSD decay phase |
+| $\tau^{\text{wsd}}$ | Number of WSD decay-phase steps |
+| $\tau^{\text{stable}}$ | Number of WSD stable-phase steps |
+| $\hat{\tau}$ | Normalized progress within the WSD decay phase |
 
 ### Complete warmup formula
 
@@ -230,6 +320,67 @@ Equivalently, using $\alpha = \gamma^{\text{stop}} / \gamma^0$:
 \gamma(\tau) = \gamma^0 \cdot \left[\alpha + \frac{1 - \alpha}{2}\left(1 + \cos\left(\frac{\pi \cdot (\tau - \tau^{\text{warmup}})}{\tau^{\text{decay}}}\right)\right)\right]
 ```
 
+### Warmup-stable-decay
+
+For WSD, define the final decay-phase length as:
+
+```math
+\tau^{\text{wsd}} = \left\lfloor \rho^{\text{wsd}} \cdot \tau^{\text{stop}} \right\rfloor
+```
+
+and the stable-phase length as:
+
+```math
+\tau^{\text{stable}} = \tau^{\text{decay}} - \tau^{\text{wsd}}
+```
+
+For steps in the stable phase,
+
+```math
+\gamma(\tau) = \gamma^0, \qquad
+\tau^{\text{warmup}} \leq \tau < \tau^{\text{warmup}} + \tau^{\text{stable}}
+```
+
+For steps in the final decay phase, define the normalized decay progress:
+
+```math
+\hat{\tau} =
+\frac{
+\tau - \tau^{\text{warmup}} - \tau^{\text{stable}}
+}{
+\tau^{\text{wsd}}
+}
+```
+
+Then the decay-phase formulas are:
+
+**Inverse-linear decay (`decay_type: "inverse_linear"`):**
+
+```math
+\gamma(\tau) =
+\frac{1}{
+\hat{\tau} / \gamma^{\text{stop}} + (1 - \hat{\tau}) / \gamma^0
+}
+```
+
+**Cosine decay (`decay_type: "cosine"`):**
+
+```math
+\gamma(\tau) =
+\gamma^{\text{stop}} +
+\frac{\gamma^0 - \gamma^{\text{stop}}}{2}
+\left(1 + \cos\left(\pi \hat{\tau}\right)\right)
+```
+
+**Linear decay (`decay_type: "linear"`):**
+
+```math
+\gamma(\tau) =
+\gamma^0 + \left(\gamma^{\text{stop}} - \gamma^0\right)\hat{\tau}
+```
+
+For steps beyond the end of the decay phase, the learning rate stays at $\gamma^{\text{stop}}$.
+
 ## Migration from versions before 3.1.3
 
 In version 3.1.2 and earlier, {ref}`start_lr <learning_rate/start_lr>` and {ref}`stop_lr <learning_rate/stop_lr>` / {ref}`stop_lr_ratio <learning_rate/stop_lr_ratio>` had default values and could be omitted. Starting from version 3.1.3, these parameters are **required** and must be explicitly specified.
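Taken together, the WSD formulas documented in this commit can be sketched in Python. This is a minimal sketch of the math above, not DeePMD-kit's implementation: the function name, the step-based signature, and the linear warmup rule are assumptions.

```python
import math

# Minimal sketch of the warmup-stable-decay schedule described above.
# Not DeePMD-kit's implementation: name, signature, and the linear warmup
# rule are assumptions for illustration.
def wsd_lr(step, start_lr, stop_lr, stop_steps,
           warmup_steps=0, decay_phase_ratio=0.1, decay_type="inverse_linear"):
    decay_len = math.floor(decay_phase_ratio * stop_steps)   # tau_wsd
    stable_len = (stop_steps - warmup_steps) - decay_len     # tau_stable
    if step < warmup_steps:
        return start_lr * step / warmup_steps                # linear warmup (assumed)
    if step < warmup_steps + stable_len:
        return start_lr                                      # stable phase
    if step >= stop_steps:
        return stop_lr                                       # beyond the decay phase
    t = (step - warmup_steps - stable_len) / decay_len       # normalized progress
    if decay_type == "inverse_linear":
        return 1.0 / (t / stop_lr + (1.0 - t) / start_lr)
    if decay_type == "cosine":
        return stop_lr + 0.5 * (start_lr - stop_lr) * (1.0 + math.cos(math.pi * t))
    if decay_type == "linear":
        return start_lr + (stop_lr - start_lr) * t
    raise ValueError(f"unknown decay_type: {decay_type}")
```

With `stop_steps=100`, `warmup_steps=0`, and `decay_phase_ratio=0.1`, steps 0-89 stay at `start_lr` and steps 90-99 move toward `stop_lr` under the chosen decay rule.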
