Local finetuning crashes with AssertionError: No inf checks were recorded for this optimizer. Precision is chosen by the environment, not by config: check_cluster() (train.py:45-48) probes SLURM_JOB_ID, so local runs use fp16 (with the AMP GradScaler) while SLURM runs use bf16 (no scaler), and there is no config switch to select bf16 locally. When a gradient is NaN, the SkipNanGrad callback calls zero_grad(), which wipes the scaled gradients and breaks the fp16 scaler's scale → unscale → step bookkeeping, so the scaler's step raises the assertion.
Severity: Medium · Status: workaround verified; upstream open
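The failure mode can be illustrated with a toy model of the scaler's bookkeeping. This is a minimal pure-Python sketch, not PyTorch's implementation: ToyParam, ToyScaler, and run_step are hypothetical names, the device key is a placeholder, and it assumes that zero_grad() leaves gradients as None (the PyTorch default since 2.0) and that the callback runs before the scaler unscales.

```python
class ToyParam:
    """Stand-in for a model parameter; `grad` is a float or None."""

    def __init__(self, grad=None):
        self.grad = grad


class ToyScaler:
    """Toy model of the fp16 scaler's per-step bookkeeping (not PyTorch code)."""

    def __init__(self):
        # One inf/NaN flag per device that held at least one gradient.
        self.found_inf_per_device = {}

    def unscale_(self, params):
        # Only parameters that still hold a gradient get an inf/NaN check.
        for p in params:
            if p.grad is None:
                continue
            bad = p.grad != p.grad or abs(p.grad) == float("inf")
            prev = self.found_inf_per_device.get("cuda:0", False)
            self.found_inf_per_device["cuda:0"] = prev or bad

    def step(self, params):
        # Mirrors the message of the assertion seen in the crash.
        assert len(self.found_inf_per_device) > 0, (
            "No inf checks were recorded for this optimizer."
        )
        skipped = any(self.found_inf_per_device.values())
        return "skipped" if skipped else "stepped"


def run_step(grads, skip_nan_grad):
    """Simulate one optimizer step with or without the NaN-skipping callback."""
    params = [ToyParam(g) for g in grads]
    if skip_nan_grad and any(g != g for g in grads):
        # Assumed callback behavior: zero_grad() sets every grad to None.
        for p in params:
            p.grad = None
    scaler = ToyScaler()
    scaler.unscale_(params)
    return scaler.step(params)
```

Under these assumptions, run_step([0.1, float("nan")], skip_nan_grad=True) raises the assertion, while the same gradients with skip_nan_grad=False only skip the step, which is consistent with the toggling results under Steps to reproduce.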
Steps to reproduce
- Run python -m proteinfoundation.train --config-name training_local_latents +single=true ... locally (no SLURM_JOB_ID).
- The run crashes with the assertion on the first step that produces a NaN gradient.
- Confirmed by toggling: fp16 + skip_nan_grad=True ❌ · fp16 + skip_nan_grad=False ✅ · fp32 ✅.
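Why the absence of SLURM_JOB_ID matters for reproduction can be sketched as follows. This is a hypothetical re-creation, not the code at train.py:45-48: the body of check_cluster() is assumed, pick_precision is an invented helper, and the precision strings assume Lightning-style naming.

```python
import os


def check_cluster(env=None) -> bool:
    """Return True when a SLURM job ID is present (assumed behavior)."""
    env = os.environ if env is None else env
    return "SLURM_JOB_ID" in env


def pick_precision(env=None) -> str:
    """Hypothetical helper: map the environment probe to a precision setting."""
    # SLURM -> bf16 (no GradScaler); local -> fp16 (GradScaler active).
    return "bf16-mixed" if check_cluster(env) else "16-mixed"
```

With an empty environment the sketch selects fp16, which is the only path where the GradScaler, and therefore the assertion, is involved.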
Fix
Workaround: add ++opt.skip_nan_grad=False (keeps mixed precision) or ++force_precision_f32=True (falls back to fp32) to the command. Upstream: make SkipNanGrad scaler-aware (do nothing when an AMP scaler is active), and/or allow bf16-mixed locally (the A100 supports it).
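One possible shape for the scaler-aware variant is sketched below. It is not the repository's SkipNanGrad: the class name, the helper functions, and the choice of the on_after_backward hook are assumptions, and the class is left as a plain object so the snippet has no imports (in practice it would subclass Lightning's Callback). The scaler lookup relies on the trainer exposing precision_plugin and on the mixed-precision plugin carrying a scaler attribute that is None when no GradScaler is in use.

```python
def amp_scaler_active(trainer) -> bool:
    """True when the trainer's precision plugin holds an AMP GradScaler."""
    plugin = getattr(trainer, "precision_plugin", None)
    return getattr(plugin, "scaler", None) is not None


def grads_have_nan(module) -> bool:
    """True when any parameter gradient contains a NaN."""
    return any(
        p.grad is not None and bool(p.grad.isnan().any())
        for p in module.parameters()
    )


class ScalerAwareSkipNanGrad:
    """Hypothetical stand-in for SkipNanGrad that defers to the AMP scaler."""

    def on_after_backward(self, trainer, pl_module):
        if amp_scaler_active(trainer):
            # GradScaler already skips the optimizer step on inf/NaN grads,
            # so wiping the gradients here would only break its bookkeeping.
            return
        if grads_have_nan(pl_module):
            pl_module.zero_grad()
```

The design choice is to leave NaN handling to the GradScaler on fp16, since it already skips steps with non-finite gradients, and to keep the zero_grad() behavior only for bf16 and fp32 runs where no scaler exists.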
Environment
NVIDIA A100 80GB PCIe · driver 565.57.01 · CUDA 12.7 · repo branch dev @ 916eaae · UV runtime.