[Proteina-Complexa] Local finetune crashes: SkipNanGrad incompatible with the fp16 GradScaler #44

Local finetuning crashes with `AssertionError: No inf checks were recorded for this optimizer`. Precision is chosen by environment, not config: `check_cluster()` (`train.py:45-48`) probes `SLURM_JOB_ID`, so local runs use **fp16** (AMP `GradScaler`) while SLURM runs use **bf16** (no scaler). There is no config switch to select bf16 locally. When the `SkipNanGrad` callback sees a NaN gradient it calls `zero_grad()`, which wipes the scaled grads. The fp16 scaler then has nothing to inspect, its `scale → unscale → step` bookkeeping breaks, and the assertion fires.
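The failure mode can be illustrated with a toy, framework-free model of the scaler's bookkeeping. This is a sketch, not the PyTorch or repo code: `ToyParam`, `ToyScaler`, and `run_step` are hypothetical names, and it assumes `zero_grad()` sets gradients to `None` (the default in recent PyTorch releases), so the unscale pass finds nothing to check.

```python
import math


class ToyParam:
    """Hypothetical stand-in for a parameter: holds a scaled grad or None."""

    def __init__(self, grad):
        self.grad = grad


class ToyScaler:
    """Toy model of the per-step inf-check bookkeeping of an AMP grad scaler."""

    def __init__(self, scale=1024.0):
        self.scale = scale
        self.inf_checks = []  # one entry per gradient inspected in unscale()

    def unscale(self, params):
        self.inf_checks = []
        for p in params:
            if p.grad is None:  # wiped grads are skipped: no check recorded
                continue
            p.grad = p.grad / self.scale
            self.inf_checks.append(not math.isfinite(p.grad))

    def step(self, params):
        # Same message as the assertion quoted in this report.
        assert self.inf_checks, "No inf checks were recorded for this optimizer"
        if any(self.inf_checks):
            return "skipped"  # non-finite grads: the scaler skips the update
        return "stepped"


def run_step(grads, skip_nan_grad):
    """Run one toy optimizer step, optionally mimicking SkipNanGrad."""
    params = [ToyParam(g) for g in grads]
    if skip_nan_grad and any(math.isnan(p.grad) for p in params):
        for p in params:  # assumed effect of zero_grad(): grads become None
            p.grad = None
    scaler = ToyScaler()
    scaler.unscale(params)
    return scaler.step(params)


if __name__ == "__main__":
    nan_grads = [2048.0, float("nan")]
    print(run_step(nan_grads, skip_nan_grad=False))
    try:
        run_step(nan_grads, skip_nan_grad=True)
    except AssertionError as err:
        print("AssertionError:", err)
```

In this model, leaving the NaN gradients in place lets the scaler detect them and skip the update on its own, which is consistent with the fp16 + `skip_nan_grad=False` run succeeding.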

**Severity:** Medium · **Status:** workaround verified; upstream open

### Steps to reproduce
1. Run `python -m proteinfoundation.train --config-name training_local_latents +single=true ...` locally (no `SLURM_JOB_ID`).
2. Let training run until the first step that produces a NaN gradient; the run crashes with the assertion above.
3. Confirmed by toggling: fp16 + `skip_nan_grad=True` ❌ · fp16 + `skip_nan_grad=False` ✅ · fp32 ✅.

### Fix
Workaround: append either `++opt.skip_nan_grad=False` (keeps fp16 mixed precision) or `++force_precision_f32=True` (runs in fp32) to the command. Upstream: make `SkipNanGrad` scaler-aware (do nothing when an AMP scaler is active), and/or allow bf16-mixed locally (the A100 supports it).
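One possible shape for the "allow bf16-mixed locally" change is an explicit override that takes priority over the environment probe. The sketch below is an assumption, not the repo's code: `resolve_precision` and its `override` parameter are hypothetical, and the precision strings follow Lightning's naming (`"16-mixed"`, `"bf16-mixed"`, `"32"`), which may differ from what `train.py` actually passes.

```python
import os


def resolve_precision(env=None, override=None, force_f32=False):
    """Pick a precision string; hypothetical replacement for env-only selection.

    Priority: force_f32 flag, then an explicit override, then the
    SLURM_JOB_ID probe described in this report.
    """
    env = os.environ if env is None else env
    if force_f32:
        return "32"
    if override is not None:
        return override  # e.g. "bf16-mixed" on a local A100
    on_cluster = "SLURM_JOB_ID" in env  # same probe as check_cluster()
    return "bf16-mixed" if on_cluster else "16-mixed"


if __name__ == "__main__":
    print(resolve_precision(env={}))                         # local default
    print(resolve_precision(env={"SLURM_JOB_ID": "123"}))    # cluster default
    print(resolve_precision(env={}, override="bf16-mixed"))  # local bf16
```

With an override like this, a local bf16 run would have no `GradScaler`, so the `SkipNanGrad` conflict would not arise on that path.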

### Environment
NVIDIA A100 80GB PCIe · driver 565.57.01 · CUDA 12.7 · repo branch `dev` @ `916eaae` · UV runtime.

