Commit 9cf1e50 ("rebase", 1 parent: 292c3b1)

1 file changed: doc/train/learning-rate.md
Lines changed: 36 additions & 98 deletions
@@ -5,7 +5,9 @@ DeePMD-kit supports two learning rate schedules:
 - **`exp`**: Exponential decay with optional stepped or smooth mode
 - **`cosine`**: Cosine annealing for smooth decay curve

-Both schedules support an optional warmup phase where the learning rate gradually increases from a small initial value to the target `start_lr`.
+Both schedules support an optional warmup phase where the learning rate gradually increases from a small initial value to the target {ref}`start_lr <learning_rate/start_lr>`.
+
+This page focuses on schedule behavior, examples, and formulas. For the canonical argument definitions, see {ref}`learning_rate <learning_rate>`.

 ## Quick Start


@@ -32,54 +34,25 @@ Both schedules support an optional warmup phase where the learning rate graduall

 ## Common parameters

-The following parameters are shared by both `exp` and `cosine` schedules.
-
-### Required parameters
-
-- `start_lr`: The learning rate at the start of training (after warmup).
-- `stop_lr` or `stop_lr_ratio` (must provide exactly one):
-  - `stop_lr`: The learning rate at the end of training.
-  - `stop_lr_ratio`: The ratio of `stop_lr` to `start_lr`. Computed as `stop_lr = start_lr * stop_lr_ratio`.
-
-### Optional parameters
-
-- `warmup_steps` or `warmup_ratio` (mutually exclusive):
-  - `warmup_steps`: Number of steps for warmup. Learning rate increases linearly from `warmup_start_factor * start_lr` to `start_lr`.
-  - `warmup_ratio`: Ratio of warmup steps to total training steps. `warmup_steps = int(warmup_ratio * numb_steps)`.
-- `warmup_start_factor`: Factor for initial warmup learning rate (default: 0.0). Warmup starts from `warmup_start_factor * start_lr`.
-- `scale_by_worker`: How to alter learning rate in parallel training. Options: `"linear"`, `"sqrt"`, `"none"` (default: `"linear"`).
-
-### Type-specific parameters
-
-**Exponential decay (`type: "exp"`):**
-
-- `decay_steps`: Interval (in steps) at which learning rate decays (default: 5000).
-- `decay_rate`: Explicit decay rate. If not provided, computed from `start_lr` and `stop_lr`.
-- `smooth`: If `true`, use smooth exponential decay at every step. If `false`, use stepped decay (default: `false`).
+Use {ref}`learning_rate <learning_rate>` as the canonical parameter reference. This page only highlights the argument combinations that matter when choosing a schedule:

-**Cosine annealing (`type: "cosine"`):**
+- Shared by both {ref}`exp <learning_rate[exp]>` and {ref}`cosine <learning_rate[cosine]>`: {ref}`start_lr <learning_rate/start_lr>` plus exactly one of {ref}`stop_lr <learning_rate/stop_lr>` or {ref}`stop_lr_ratio <learning_rate/stop_lr_ratio>`.
+- Optional warmup for both schedules: {ref}`warmup_steps <learning_rate/warmup_steps>` or {ref}`warmup_ratio <learning_rate/warmup_ratio>`, with optional {ref}`warmup_start_factor <learning_rate/warmup_start_factor>`.
+- Optional distributed scaling for both schedules: {ref}`scale_by_worker <learning_rate/scale_by_worker>`.
+- Additional options for {ref}`exp <learning_rate[exp]>`: {ref}`decay_steps <learning_rate[exp]/decay_steps>`, {ref}`decay_rate <learning_rate[exp]/decay_rate>`, and {ref}`smooth <learning_rate[exp]/smooth>`.
+- {ref}`cosine <learning_rate[cosine]>` has no extra schedule-specific arguments beyond the shared ones.

-No type-specific parameters. The decay follows a cosine curve from `start_lr` to `stop_lr`.
-
-See [Mathematical Theory](#mathematical-theory) section for complete formulas.
+See [Mathematical Theory](#mathematical-theory) for complete formulas.

 ## Exponential Decay Schedule

-The exponential decay schedule reduces the learning rate exponentially over training steps. It is the default schedule when `type` is omitted.
+The exponential decay schedule reduces the learning rate exponentially over training steps. It is the default schedule when {ref}`type <learning_rate/type>` is omitted.

 ### Stepped vs smooth mode

-By setting `smooth: true`, the learning rate decays smoothly at every step instead of in a stepped manner. This provides a more gradual decay curve similar to PyTorch's `ExponentialLR`, whereas the default stepped mode (`smooth: false`) is similar to PyTorch's `StepLR`.
-
-### Decay rate computation
+By setting {ref}`smooth <learning_rate[exp]/smooth>` to `true`, the learning rate decays smoothly at every step instead of in a stepped manner. This provides a more gradual decay curve similar to PyTorch's `ExponentialLR`, whereas the default stepped mode (`smooth: false`) is similar to PyTorch's `StepLR`.

-If `decay_rate` is not explicitly provided, it is computed from `start_lr` and `stop_lr` to ensure the learning rate reaches `stop_lr` at the end of training:
-
-```text
-decay_rate = (stop_lr / start_lr) ^ (decay_steps / (numb_steps - warmup_steps))
-```
-
-where `numb_steps` is the internal total number of training steps (derived from `training.numb_steps` in the training configuration).
+If {ref}`decay_rate <learning_rate[exp]/decay_rate>` is not explicitly provided, DeePMD-kit computes it from {ref}`start_lr <learning_rate/start_lr>` and the requested final learning rate so that the schedule reaches the target by {ref}`numb_steps <training/numb_steps>`. The exact expression is given in [Mathematical Theory](#mathematical-theory).

 ### Examples
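The derived decay rate and the stepped/smooth behavior described in this hunk can be sanity-checked with a short sketch (a minimal illustration, not DeePMD-kit's actual implementation; the function name `exp_lr` and its defaults are assumptions):

```python
def exp_lr(step, start_lr, stop_lr, numb_steps,
           decay_steps=5000, warmup_steps=0, smooth=False):
    """Exponential-decay learning rate at a given step (sketch)."""
    # Decay rate derived so the schedule reaches stop_lr at numb_steps
    # (the ```text expression removed from this page in this commit).
    decay_rate = (stop_lr / start_lr) ** (
        decay_steps / (numb_steps - warmup_steps)
    )
    progress = step - warmup_steps
    if not smooth:
        # Stepped mode: the rate only changes once every decay_steps steps.
        progress = (progress // decay_steps) * decay_steps
    return start_lr * decay_rate ** (progress / decay_steps)


# Smooth mode reaches stop_lr at the final step; stepped mode holds
# start_lr unchanged throughout the first decay_steps interval.
lr_end = exp_lr(1_000_000, 1e-3, 1e-6, 1_000_000, smooth=True)
```

With `start_lr: 1e-3`, `stop_lr: 1e-6`, and one million steps, the derived rate is `(1e-3) ** 0.005`, so after 200 decay intervals the schedule has decayed by exactly a factor of `1e-3`.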

@@ -94,7 +67,7 @@ where `numb_steps` is the internal total number of training steps (derived from
 }
 ```

-**Using `stop_lr_ratio`:**
+**Using {ref}`stop_lr_ratio <learning_rate/stop_lr_ratio>`:**

 ```json
 "learning_rate": {
@@ -107,7 +80,7 @@ where `numb_steps` is the internal total number of training steps (derived from

 Equivalent to `stop_lr: 1e-6` (i.e., `0.001 * 1e-3`).

-**With warmup (using `warmup_steps`):**
+**With warmup (using {ref}`warmup_steps <learning_rate/warmup_steps>`):**

 ```json
 "learning_rate": {
@@ -120,9 +93,9 @@ Equivalent to `stop_lr: 1e-6` (i.e., `0.001 * 1e-3`).
 }
 ```

-Learning rate starts from `0.0001` (i.e., `0.1 * 0.001`), increases linearly to `0.001` over 10,000 steps, then decays exponentially.
+Learning rate starts from `0.0001` (i.e., {ref}`warmup_start_factor <learning_rate/warmup_start_factor>` `*` {ref}`start_lr <learning_rate/start_lr>`), increases linearly to {ref}`start_lr <learning_rate/start_lr>` over 10,000 steps, then decays exponentially.

-**With warmup (using `warmup_ratio`):**
+**With warmup (using {ref}`warmup_ratio <learning_rate/warmup_ratio>`):**

 ```json
 "learning_rate": {
@@ -134,9 +107,9 @@ Learning rate starts from `0.0001` (i.e., `0.1 * 0.001`), increases linearly to
 }
 ```

-If `numb_steps` is 1,000,000, warmup lasts 50,000 steps. Learning rate starts from `0.0` (default `warmup_start_factor`) and increases to `0.001`.
+If {ref}`numb_steps <training/numb_steps>` is 1,000,000, warmup lasts 50,000 steps. Learning rate starts from `0.0` (default {ref}`warmup_start_factor <learning_rate/warmup_start_factor>`) and increases to {ref}`start_lr <learning_rate/start_lr>`.

-**Smooth exponential decay:**
+**Smooth exponential decay (with {ref}`smooth <learning_rate[exp]/smooth>`):**

 ```json
 "learning_rate": {
@@ -148,21 +121,13 @@ If `numb_steps` is 1,000,000, warmup lasts 50,000 steps. Learning rate starts fr
 }
 ```

-With `smooth: true`, the learning rate decays continuously at every step, similar to PyTorch's `ExponentialLR`. The default stepped mode (`smooth: false`) is similar to PyTorch's `StepLR`.
+With {ref}`smooth <learning_rate[exp]/smooth>` set to `true`, the learning rate decays continuously at every step, similar to PyTorch's `ExponentialLR`. The default stepped mode (`smooth: false`) is similar to PyTorch's `StepLR`.

 ## Cosine Annealing Schedule

 The cosine annealing schedule smoothly decreases the learning rate following a cosine curve. It often provides better convergence than exponential decay.

-### Formula
-
-During the decay phase (after warmup), the learning rate follows:
-
-```text
-lr(t) = stop_lr + (start_lr - stop_lr) / 2 * (1 + cos(pi * (t - warmup_steps) / (numb_steps - warmup_steps)))
-```
-
-At the middle of training (relative to decay phase), the learning rate is approximately `(start_lr + stop_lr) / 2`.
+After warmup, the learning rate follows a cosine curve from {ref}`start_lr <learning_rate/start_lr>` to {ref}`stop_lr <learning_rate/stop_lr>` or the value implied by {ref}`stop_lr_ratio <learning_rate/stop_lr_ratio>`. The exact expression is given in [Mathematical Theory](#mathematical-theory).

 ### Examples
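The cosine expression removed in this hunk traces half a cosine period from `start_lr` down to `stop_lr`; a minimal sketch (the function name `cosine_lr` is an assumption, not DeePMD-kit's code):

```python
import math


def cosine_lr(step, start_lr, stop_lr, numb_steps, warmup_steps=0):
    """Cosine-annealing learning rate after warmup (sketch)."""
    # Fraction of the decay phase completed, in [0, 1].
    frac = (step - warmup_steps) / (numb_steps - warmup_steps)
    # Half a cosine period: start_lr at frac=0, stop_lr at frac=1.
    return stop_lr + (start_lr - stop_lr) / 2 * (1 + math.cos(math.pi * frac))


# Midway through the decay phase the rate is (start_lr + stop_lr) / 2,
# matching the removed remark about the middle of training.
lr_mid = cosine_lr(500_000, 1e-3, 1e-6, 1_000_000)
```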

@@ -176,7 +141,7 @@ At the middle of training (relative to decay phase), the learning rate is approx
 }
 ```

-**Using `stop_lr_ratio`:**
+**Using {ref}`stop_lr_ratio <learning_rate/stop_lr_ratio>`:**

 ```json
 "learning_rate": {
@@ -200,53 +165,26 @@ At the middle of training (relative to decay phase), the learning rate is approx

 ## Warmup Mechanism

-Warmup is a technique to stabilize training in early stages by gradually increasing the learning rate from a small initial value.
-
-### Warmup formula
-
-During warmup phase ($0 \leq \tau < \tau^{\text{warmup}}$):
-
-```math
-\gamma(\tau) = \gamma^{\text{warmup}} + (\gamma^0 - \gamma^{\text{warmup}}) \cdot \frac{\tau}{\tau^{\text{warmup}}}
-```
-
-where:
-
-- $\tau$ is the current step index
-- $\tau^{\text{warmup}}$ is the number of warmup steps
-- $\gamma^0$ is `start_lr`
-- $\gamma^{\text{warmup}} = f^{\text{warmup}} \cdot \gamma^0$ is the initial warmup learning rate
-- $f^{\text{warmup}}$ is `warmup_start_factor`
-
-When `warmup_start_factor` is 0.0 (default), warmup starts from 0:
-
-```math
-\gamma(\tau) = \gamma^0 \cdot \frac{\tau}{\tau^{\text{warmup}}}
-```
-
-### Specifying warmup duration
-
-You can specify warmup duration using either `warmup_steps` (absolute) or `warmup_ratio` (relative):
+Warmup is a technique to stabilize training in early stages by gradually increasing the learning rate from {ref}`warmup_start_factor <learning_rate/warmup_start_factor>` `*` {ref}`start_lr <learning_rate/start_lr>` to {ref}`start_lr <learning_rate/start_lr>`.

-- `warmup_steps`: Explicit number of warmup steps
-- `warmup_ratio`: Ratio of total training steps. Computed as `int(warmup_ratio * numb_steps)`, where `numb_steps` is derived from `training.numb_steps`
+You can specify warmup duration using either {ref}`warmup_steps <learning_rate/warmup_steps>` (absolute) or {ref}`warmup_ratio <learning_rate/warmup_ratio>` (relative to {ref}`numb_steps <training/numb_steps>`). These are mutually exclusive.

-These are mutually exclusive.
+The exact piecewise warmup formula is given in [Mathematical Theory](#mathematical-theory).

 ## Mathematical Theory

 ### Notation

-| Symbol                 | Description                                          |
-| ---------------------- | ---------------------------------------------------- |
-| $\tau$                 | Global step index (0-indexed)                        |
-| $\tau^{\text{warmup}}$ | Number of warmup steps                               |
-| $\tau^{\text{decay}}$  | Number of decay steps = `numb_steps - warmup_steps`  |
-| $\gamma^0$             | `start_lr`: Learning rate at start of decay phase    |
-| $\gamma^{\text{stop}}$ | `stop_lr`: Learning rate at end of training          |
-| $f^{\text{warmup}}$    | `warmup_start_factor`: Initial warmup LR factor      |
-| $s$                    | `decay_steps`: Decay period for exponential schedule |
-| $r$                    | `decay_rate`: Decay rate for exponential schedule    |
+| Symbol                 | Description                                                                                |
+| ---------------------- | ------------------------------------------------------------------------------------------ |
+| $\tau$                 | Global step index (0-indexed)                                                              |
+| $\tau^{\text{warmup}}$ | Number of warmup steps                                                                     |
+| $\tau^{\text{decay}}$  | Number of decay steps = `numb_steps - warmup_steps`                                        |
+| $\gamma^0$             | {ref}`start_lr <learning_rate/start_lr>`: Learning rate at start of decay phase            |
+| $\gamma^{\text{stop}}$ | {ref}`stop_lr <learning_rate/stop_lr>`: Learning rate at end of training                   |
+| $f^{\text{warmup}}$    | {ref}`warmup_start_factor <learning_rate/warmup_start_factor>`: Initial warmup LR factor   |
+| $s$                    | {ref}`decay_steps <learning_rate[exp]/decay_steps>`: Decay period for exponential schedule |
+| $r$                    | {ref}`decay_rate <learning_rate[exp]/decay_rate>`: Decay rate for exponential schedule     |

 ### Complete warmup formula
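The linear warmup removed from the Warmup Mechanism section can be sketched as follows (an illustration under assumed names, not the library's code):

```python
def warmup_lr(step, start_lr, warmup_steps, warmup_start_factor=0.0):
    """Linearly ramp the learning rate up to start_lr (sketch)."""
    warmup_start = warmup_start_factor * start_lr
    # Linear interpolation from warmup_start to start_lr over warmup_steps;
    # with the default factor 0.0 the ramp starts from zero.
    return warmup_start + (start_lr - warmup_start) * step / warmup_steps


# When warmup_ratio is given instead of warmup_steps, the step count is
# int(warmup_ratio * numb_steps), per the removed bullet above.
lr_first = warmup_lr(0, 1e-3, 10_000, warmup_start_factor=0.1)
```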

@@ -294,7 +232,7 @@ Equivalently, using $\alpha = \gamma^{\text{stop}} / \gamma^0$:

 ## Migration from versions before 3.1.3

-In version 3.1.2 and earlier, `start_lr` and `stop_lr`/`stop_lr_ratio` had default values and could be omitted. Starting from version 3.1.3, these parameters are **required** and must be explicitly specified.
+In version 3.1.2 and earlier, {ref}`start_lr <learning_rate/start_lr>` and {ref}`stop_lr <learning_rate/stop_lr>` / {ref}`stop_lr_ratio <learning_rate/stop_lr_ratio>` had default values and could be omitted. Starting from version 3.1.3, these parameters are **required** and must be explicitly specified.

 **Configuration in version 3.1.2:**
