File: `doc/train/learning-rate.md`
# Learning rate

DeePMD-kit supports three learning rate schedules:

- **`exp`**: Exponential decay with optional stepped or smooth mode
- **`cosine`**: Cosine annealing for a smooth decay curve
- **`wsd`**: Warmup-stable-decay with a configurable final decay rule

All schedules support an optional warmup phase where the learning rate gradually increases from a small initial value to the target {ref}`start_lr <learning_rate/start_lr>`.

This page focuses on schedule behavior, examples, and formulas. For the canonical argument definitions, see {ref}`learning_rate <learning_rate>`.

### Warmup-stable-decay

```json
"learning_rate": {
  "type": "wsd",
  "start_lr": 0.001,
  "stop_lr": 1e-6,
  "decay_phase_ratio": 0.1
}
```

## Common parameters

Use {ref}`learning_rate <learning_rate>` as the canonical parameter reference. This page only highlights the argument combinations that matter when choosing a schedule:

- Shared by {ref}`exp <learning_rate[exp]>`, {ref}`cosine <learning_rate[cosine]>`, and {ref}`wsd <learning_rate[wsd]>`: {ref}`start_lr <learning_rate/start_lr>` plus exactly one of {ref}`stop_lr <learning_rate/stop_lr>` or {ref}`stop_lr_ratio <learning_rate/stop_lr_ratio>`.
- Optional warmup for all schedules: {ref}`warmup_steps <learning_rate/warmup_steps>` or {ref}`warmup_ratio <learning_rate/warmup_ratio>`, with optional {ref}`warmup_start_factor <learning_rate/warmup_start_factor>`.
- Optional distributed scaling for all schedules: {ref}`scale_by_worker <learning_rate/scale_by_worker>`.
- Additional options for {ref}`exp <learning_rate[exp]>`: {ref}`decay_steps <learning_rate[exp]/decay_steps>`, {ref}`decay_rate <learning_rate[exp]/decay_rate>`, and {ref}`smooth <learning_rate[exp]/smooth>`.
- Additional options for {ref}`wsd <learning_rate[wsd]>`: {ref}`decay_phase_ratio <learning_rate[wsd]/decay_phase_ratio>` and {ref}`decay_type <learning_rate[wsd]/decay_type>`.
- {ref}`cosine <learning_rate[cosine]>` has no extra schedule-specific arguments beyond the shared ones.

See [Mathematical Theory](#mathematical-theory) for complete formulas.

## Warmup-Stable-Decay Schedule

The warmup-stable-decay ({ref}`wsd <learning_rate[wsd]>`) schedule keeps the learning rate at {ref}`start_lr <learning_rate/start_lr>` for most of the post-warmup training steps and then applies a shorter final decay phase.

The length of the final decay phase is controlled by {ref}`decay_phase_ratio <learning_rate[wsd]/decay_phase_ratio>`. The remaining post-warmup steps form the stable phase. The decay rule is selected by {ref}`decay_type <learning_rate[wsd]/decay_type>`, which supports `inverse_linear` (default), `cosine`, and `linear`.
185
+
### Examples
186
+
187
+
**Basic WSD with default inverse-linear decay:**

```json
"learning_rate": {
  "type": "wsd",
  "start_lr": 0.001,
  "stop_lr": 1e-6,
  "decay_phase_ratio": 0.1
}
```

This configuration uses a stable phase for most of the post-warmup training and reserves the final 10% of total training steps for the decay phase.

Equivalent to `stop_lr: 1e-6` (i.e., `0.001 * 1e-3`).

**With warmup:**

```json
"learning_rate": {
  "type": "wsd",
  "start_lr": 0.001,
  "stop_lr": 1e-6,
  "decay_phase_ratio": 0.1,
  "warmup_steps": 5000,
  "warmup_start_factor": 0.0
}
```

Warmup first increases the learning rate to {ref}`start_lr <learning_rate/start_lr>`. After warmup, the schedule enters the stable phase and then decays during the final WSD decay phase.

**WSD with cosine decay phase:**

```json
"learning_rate": {
  "type": "wsd",
  "start_lr": 0.001,
  "stop_lr": 1e-6,
  "decay_phase_ratio": 0.1,
  "decay_type": "cosine"
}
```

**WSD with linear decay phase:**

```json
"learning_rate": {
  "type": "wsd",
  "start_lr": 0.001,
  "stop_lr": 1e-6,
  "decay_phase_ratio": 0.1,
  "decay_type": "linear"
}
```

## Warmup Mechanism

Warmup is a technique to stabilize training in early stages by gradually increasing the learning rate from {ref}`warmup_start_factor <learning_rate/warmup_start_factor>` `*` {ref}`start_lr <learning_rate/start_lr>` to {ref}`start_lr <learning_rate/start_lr>`.
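This ramp can be sketched as a simple linear interpolation. The following is a minimal sketch assuming plain linear warmup; the function name is illustrative and the exact piecewise formula used by DeePMD-kit may differ:

```python
def warmup_lr(step, warmup_steps, start_lr, warmup_start_factor=0.0):
    """Linear warmup sketch: interpolate from warmup_start_factor * start_lr
    up to start_lr over warmup_steps (illustrative, not DeePMD-kit's code)."""
    if step >= warmup_steps:
        return start_lr  # warmup finished: hand off to the main schedule
    frac = step / warmup_steps
    return start_lr * (warmup_start_factor + (1.0 - warmup_start_factor) * frac)
```

With `warmup_start_factor: 0.0`, the learning rate starts at zero and reaches `start_lr` exactly at `warmup_steps`.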

The exact piecewise warmup formula is given in [Mathematical Theory](#mathematical-theory).

| $f^{\text{warmup}}$ | {ref}`warmup_start_factor <learning_rate/warmup_start_factor>`: Initial warmup LR factor |
| $s$ | {ref}`decay_steps <learning_rate[exp]/decay_steps>`: Decay period for exponential schedule |

For steps beyond the end of the decay phase, the learning rate stays at $\gamma^{\text{stop}}$.
## Migration from versions before 3.1.3
In version 3.1.2 and earlier, {ref}`start_lr <learning_rate/start_lr>` and {ref}`stop_lr <learning_rate/stop_lr>` / {ref}`stop_lr_ratio <learning_rate/stop_lr_ratio>` had default values and could be omitted. Starting from version 3.1.3, these parameters are **required** and must be explicitly specified.
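For example, an input file that previously omitted these keys must now spell them out explicitly. The values below are illustrative, not the old defaults:

```json
"learning_rate": {
  "type": "exp",
  "start_lr": 0.001,
  "stop_lr": 1e-8
}
```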