
[Proteina-Complexa] Local finetune crashes: SkipNanGrad incompatible with the fp16 GradScaler #44

Description

@xinyu-dev

Local finetuning crashes with AssertionError: No inf checks were recorded for this optimizer. Precision is chosen by environment, not by config: check_cluster() (train.py:45-48) probes SLURM_JOB_ID, so local runs use fp16 (AMP GradScaler) while SLURM runs use bf16 (no scaler). There is no config switch to select bf16 locally. When a gradient is NaN, the SkipNanGrad callback calls zero_grad(), which wipes the scaled grads and breaks the fp16 scaler's scale→unscale→step bookkeeping, so the assertion fires.
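One way to picture the failure is with a toy model of the scaler's bookkeeping. The sketch below is a pure-Python stand-in, not the PyTorch GradScaler: ToyGradScaler and the free function zero_grad are hypothetical names, and it assumes that zero_grad() leaves every gradient as None (the default in recent PyTorch releases), so the unscale pass finds nothing to check.

```python
import math


class ToyGradScaler:
    """Simplified stand-in for an AMP grad scaler (hypothetical, not torch)."""

    def __init__(self):
        # One entry per gradient inspected during the unscale pass.
        self.inf_checks = []

    def unscale_(self, grads):
        # Only parameters that still hold a gradient are inspected.
        self.inf_checks = [not math.isfinite(g) for g in grads if g is not None]

    def step(self, grads):
        self.unscale_(grads)
        # The step refuses to run if the unscale pass inspected nothing.
        assert len(self.inf_checks) > 0, "No inf checks were recorded for this optimizer."
        # A non-finite gradient makes the scaler skip the update by itself.
        return "skipped" if any(self.inf_checks) else "stepped"


def zero_grad(grads):
    # Assumption: mirrors zero_grad(set_to_none=True), so every grad becomes None.
    return [None for _ in grads]


nan_grads = [0.5, float("nan")]

# Scaler alone: the NaN is detected and the update is skipped.
print(ToyGradScaler().step(nan_grads))

# Grads wiped before the scaler steps: nothing is left to check.
try:
    ToyGradScaler().step(zero_grad(nan_grads))
except AssertionError as err:
    print("AssertionError:", err)
```

In this model the scaler already skips the update when it sees a NaN, which would be consistent with fp16 + skip_nan_grad=False working in the toggle test.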

Severity: Medium · Status: workaround verified; upstream open

Steps to reproduce

  1. Run python -m proteinfoundation.train --config-name training_local_latents +single=true ... locally (no SLURM_JOB_ID).
  2. The run crashes with the assertion on the first step that produces a NaN gradient.
  3. Toggling confirms the cause: fp16 + skip_nan_grad=True ❌ · fp16 + skip_nan_grad=False ✅ · fp32 ✅.

Fix

Workaround: add ++opt.skip_nan_grad=False (keeps mixed precision) or ++force_precision_f32=True (runs in fp32). Upstream fix: make SkipNanGrad scaler-aware, so that it does nothing when an AMP scaler is active, and/or allow bf16-mixed for local runs (the A100 supports it).
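The two upstream suggestions could look roughly like the sketch below. It is not the repository's code: choose_precision, skip_nan_grad_should_zero, and the override argument are hypothetical, and the Lightning-style precision strings ("16-mixed", "bf16-mixed") are an assumption about how the trainer is configured.

```python
import os


def choose_precision(env=None, override=None):
    """Pick a precision string; an explicit override beats the environment probe."""
    env = os.environ if env is None else env
    if override is not None:
        # Hypothetical config switch, e.g. to request bf16-mixed on a local A100.
        return override
    # Assumed to mirror the check_cluster() behaviour described in the issue.
    on_cluster = "SLURM_JOB_ID" in env
    return "bf16-mixed" if on_cluster else "16-mixed"


def skip_nan_grad_should_zero(has_nan_grad, precision):
    """Zero the grads only when no fp16 grad scaler is managing the step."""
    scaler_active = precision == "16-mixed"
    return has_nan_grad and not scaler_active


local = choose_precision(env={})
print(local, skip_nan_grad_should_zero(True, local))

forced = choose_precision(env={}, override="bf16-mixed")
print(forced, skip_nan_grad_should_zero(True, forced))
```

Keeping the decision in two small pure functions makes both behaviours easy to unit-test without a GPU. A real callback would more likely ask the trainer whether a scaler is attached than compare precision strings.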

Environment

NVIDIA A100 80GB PCIe · driver 565.57.01 · CUDA 12.7 · repo branch dev @ 916eaae · UV runtime.
