Commit 0883c09

Fix Float8CurrentScaling NaN for CodonFM: init TE layers on CUDA (#1539)
## Summary

Fix nightly CI failure in `unit-tests-recipes.yml` ([run #23790357242](https://github.com/NVIDIA/bionemo-framework/actions/runs/23790357242)).

### Root Cause

`CodonFMEncoder` and `CodonFMLMHead` initialized TransformerEngine layers on `"cpu"` instead of `"cuda"` (unlike ESM2 and all other models). In `test_legacy_quantized_model_init_forward_and_backward`, the model is created inside a `quantized_model_init(Float8CurrentScaling)` context, then moved with `model.to("cuda")`. Moving FP8-quantized tensors from CPU→CUDA corrupts `Float8CurrentScaling`'s scale metadata, producing NaN loss.

### Fix

Changed CodonFM's TE layer init device from `"cpu"` to `"cuda"` (matching ESM2), which is a 2-line change in `modeling_codonfm_te.py`. The initial xfail approach (commit 1) was too broad: only codonfm was affected.

### Files Changed

- `bionemo-recipes/models/codonfm/modeling_codonfm_te.py`: Fix device init (`"cpu"` → `"cuda"`)
- `bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py`: Synced copy
- 5× `test_modeling_common.py`: Removed unnecessary xfail (net -1 line each)

---

*Automated fix by OpenClaw + Claude Code*

Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
Co-authored-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
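The device choice described under "Fix" can be illustrated with a minimal sketch. The helper name `select_init_device` and its `fallback` parameter are hypothetical and do not appear in the repository. The real code queries `torch.get_default_device()` inline; here the default device type is passed in as a plain string so that the sketch runs without PyTorch or a GPU.

```python
# Hypothetical helper mirroring the one-line device choice changed by this commit.
# Assumption: the default device type is supplied as a string instead of being
# read from torch.get_default_device(), so no PyTorch install is needed.


def select_init_device(default_device_type: str, fallback: str = "cuda") -> str:
    """Return "meta" when the default device is meta, otherwise the fallback."""
    return "meta" if default_device_type == "meta" else fallback


# Old behavior: layers were created on CPU and moved to CUDA afterwards.
before_fix = select_init_device("cpu", fallback="cpu")
# New behavior: layers are created directly on CUDA.
after_fix = select_init_device("cpu")
# Meta-device initialization is unaffected by the change.
under_meta = select_init_device("meta")

print(before_fix, after_fix, under_meta)  # prints: cpu cuda meta
```

With PyTorch available, the argument would presumably be `torch.get_default_device().type`. The point of the sketch is that only the non-meta branch changes, so meta-device initialization behaves as before.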
1 parent f0d4bfd commit 0883c09

2 files changed

Lines changed: 4 additions & 4 deletions

bionemo-recipes/models/codonfm/modeling_codonfm_te.py

Lines changed: 2 additions & 2 deletions
@@ -224,7 +224,7 @@ def __init__(
         if self.config.layer_precision is not None and "fp4" in self.config.layer_precision and fp4_recipe is None:
             raise RuntimeError("layer_precision contains 'fp4' entries but no fp4_recipe was provided.")
 
-        device = "meta" if torch.get_default_device() == torch.device("meta") else "cpu"
+        device = "meta" if torch.get_default_device() == torch.device("meta") else "cuda"
 
         layers: list[transformer_engine.pytorch.TransformerLayer] = []
         for i in range(config.num_hidden_layers):
@@ -362,7 +362,7 @@ def __init__(self, config: CodonFMConfig):
             config: Model configuration.
         """
         super().__init__()
-        device = "meta" if torch.get_default_device() == torch.device("meta") else "cpu"
+        device = "meta" if torch.get_default_device() == torch.device("meta") else "cuda"
         _act_fns = {
             "gelu": torch.nn.functional.gelu,
             "relu": torch.nn.functional.relu,

bionemo-recipes/recipes/codonfm_native_te/modeling_codonfm_te.py

Lines changed: 2 additions & 2 deletions
@@ -230,7 +230,7 @@ def __init__(
         if self.config.layer_precision is not None and "fp4" in self.config.layer_precision and fp4_recipe is None:
             raise RuntimeError("layer_precision contains 'fp4' entries but no fp4_recipe was provided.")
 
-        device = "meta" if torch.get_default_device() == torch.device("meta") else "cpu"
+        device = "meta" if torch.get_default_device() == torch.device("meta") else "cuda"
 
         layers: list[transformer_engine.pytorch.TransformerLayer] = []
         for i in range(config.num_hidden_layers):
@@ -368,7 +368,7 @@ def __init__(self, config: CodonFMConfig):
             config: Model configuration.
         """
         super().__init__()
-        device = "meta" if torch.get_default_device() == torch.device("meta") else "cpu"
+        device = "meta" if torch.get_default_device() == torch.device("meta") else "cuda"
         _act_fns = {
             "gelu": torch.nn.functional.gelu,
             "relu": torch.nn.functional.relu,
