Commit 66ddc48
fix: raise RuntimeError when checkpoint step >= config.steps
When a user sets steps=x and a checkpoint already exists at step x or later,
the job should fail with a clear error message instead of performing no
computation or failing with a confusing profiling error.
We add an early check in setup_train_loop (train_utils.py) and a fallback check
in train_loop (train.py) so the job fails fast, either before the checkpoint is
loaded and the TPU is initialized or before the expensive TPU compilation step.
Both checks use a shared validation helper, and unit tests verify the
validation logic.
TAG=agy
CONV=88c01cb5-28b2-4b67-8895-4a290d332d3f
1 parent fe529ee, commit 66ddc48
3 files changed
Lines changed: 47 additions & 5 deletions
Diff (line content not preserved; hunk ranges only, file names not shown)

File 1 of 3 (3 additions, 2 deletions)
@@ -639,6 +639,9 @@
@@ -682,8 +685,6 @@

File 2 of 3 (17 additions, 0 deletions)
@@ -240,6 +240,11 @@
@@ -405,3 +410,15 @@

File 3 of 3 (27 additions, 3 deletions)
@@ -18,7 +18,11 @@
@@ -185,12 +189,32 @@
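
The commit message describes a shared validation helper that raises RuntimeError when the checkpoint step is >= config.steps. The following is a minimal sketch of that idea, not the code from this commit: the function name `validate_checkpoint_step`, its signature, and the error text are hypothetical, and the real helper may take a config object or a checkpoint manager instead of plain integers.

```python
from typing import Optional


def validate_checkpoint_step(checkpoint_step: Optional[int], total_steps: int) -> None:
    """Fail fast if a restored checkpoint leaves no training steps to run.

    Hypothetical helper for illustration only; the name, parameters, and
    message are assumptions, not taken from the commit.
    """
    if checkpoint_step is None:
        # No checkpoint was found, so training starts from step 0.
        return
    if checkpoint_step >= total_steps:
        raise RuntimeError(
            f"Checkpoint step ({checkpoint_step}) is >= config.steps "
            f"({total_steps}); no training steps remain. "
            "Increase steps or use a different checkpoint directory."
        )


# Usage: call once as early as possible, and again right before compilation.
validate_checkpoint_step(checkpoint_step=5, total_steps=10)  # passes silently
```

Keeping the comparison in one function means the early check and the fallback check cannot drift apart, and the unit tests only need to cover the three cases: no checkpoint, a checkpoint below the limit, and a checkpoint at or above it.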