Commit 08990d6
gh#9: stage_config CLI override fix + smart-init pair_proj_dist + strict=False
Three fixes from the LR-sweep instability (loss in the 1000s, grad norm
in the 30k-150k range):
1. StageConfig silently overrode CLI --lr. The training loop called
get_lr(step, config, stage.get("lr")), and the "stage_lr or config.lr"
fallback meant the hardcoded stage_config[lr]=1e-3 always won, so every
distinct --lr in the LR sweep actually ran at lr=1e-3.
Switch to get_lr(step, config): config.lr from the CLI is now the only
knob. Crop-size staging is unaffected.
2. pair_proj_dist starts from the default random init, so distogram-mode
training feeds garbage into the diffusion module from step 0.
Smart-init: copy pair_proj's relpe-column weights into
pair_proj_dist's relpe columns (so the relpe contribution matches
z-mode at step 0) and zero the distogram-input columns. The diffusion
module starts in a near-legacy state and learns to use the distogram
signal gradually. Smoketest: diff_loss drops from the 1000s to ~13 with
this init.
3. load_checkpoint used strict=True, so loading a legacy "z"-mode
checkpoint into a distogram-mode model failed on the missing
pair_proj_dist keys. Switch to strict=False and return the list of
missing keys. train() runs the smart init when pair_proj_dist was
missing from the loaded state (i.e., resuming from a "z" seed or
starting fresh); it is skipped when resuming an already-distogram run.
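
To make fix 1 concrete, here is a minimal sketch of the precedence bug. It
assumes get_lr reduced to its fallback logic only (any warmup or decay
schedule is omitted), and the Config class, the crop_size value, and the
sweep values are hypothetical stand-ins rather than the project's own code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    lr: float  # value parsed from the CLI --lr flag (hypothetical container)


def get_lr_old(step: int, config: Config, stage_lr: Optional[float] = None) -> float:
    # Pre-fix fallback as described in the message: any truthy stage value wins.
    return stage_lr or config.lr


def get_lr_new(step: int, config: Config) -> float:
    # Post-fix: the CLI value is the only knob.
    return config.lr


# Hypothetical stage entry: the hardcoded lr shadows whatever the CLI passed.
stage = {"lr": 1e-3, "crop_size": 256}
sweep = [1e-4, 3e-4, 1e-3]

old = [get_lr_old(0, Config(lr), stage.get("lr")) for lr in sweep]
new = [get_lr_new(0, Config(lr)) for lr in sweep]
print(old)  # every sweep point collapses to the stage value
print(new)  # each sweep point keeps its own lr
```

With the old call, every sweep point resolves to the stage value, which is
why the sweep runs were indistinguishable.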
Tests still green (39/39 in test_diffusion_pair_source + test_data).
load_checkpoint signature changed to return a (step, missing_keys) tuple;
callers in this file are already updated.
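
Fixes 2 and 3 together can be sketched in PyTorch as below. This is an
illustration under stated assumptions, not the code in this commit: the
layer sizes, the bias-free Linear layers, the column layout (relpe columns
last in both projections), the in-memory checkpoint dict, and the names
PairProj and smart_init_pair_proj_dist are all hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; relpe columns are assumed to come last in both inputs.
Z_DIM, DISTO_DIM, RELPE_DIM, OUT_DIM = 6, 5, 3, 4


class PairProj(nn.Module):
    def __init__(self, distogram_mode: bool):
        super().__init__()
        # Legacy "z"-mode projection over concat(z, relpe).
        self.pair_proj = nn.Linear(Z_DIM + RELPE_DIM, OUT_DIM, bias=False)
        if distogram_mode:
            # Distogram-mode projection over concat(distogram, relpe).
            self.pair_proj_dist = nn.Linear(DISTO_DIM + RELPE_DIM, OUT_DIM, bias=False)


def load_checkpoint(model: nn.Module, state: dict):
    """Load non-strictly and return (step, missing_keys)."""
    result = model.load_state_dict(state["model"], strict=False)
    return state["step"], list(result.missing_keys)


@torch.no_grad()
def smart_init_pair_proj_dist(model: PairProj) -> None:
    src = model.pair_proj.weight        # (OUT_DIM, Z_DIM + RELPE_DIM)
    dst = model.pair_proj_dist.weight   # (OUT_DIM, DISTO_DIM + RELPE_DIM)
    dst[:, :DISTO_DIM].zero_()                 # distogram input contributes nothing at step 0
    dst[:, DISTO_DIM:].copy_(src[:, Z_DIM:])   # relpe contribution matches z-mode


# Resume a distogram-mode model from a legacy "z"-mode checkpoint.
legacy = PairProj(distogram_mode=False)
ckpt = {"step": 1000, "model": legacy.state_dict()}

model = PairProj(distogram_mode=True)
step, missing = load_checkpoint(model, ckpt)
if any(key.startswith("pair_proj_dist.") for key in missing):
    smart_init_pair_proj_dist(model)
```

Returning the missing keys lets the caller decide whether the smart init is
needed, so a checkpoint that already contains pair_proj_dist is loaded
without being overwritten.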
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent a745e40 commit 08990d6
1 file changed
Lines changed: 76 additions & 8 deletions