fix(recipes): remove deprecated fully_shard() kwargs (NVIDIA-BioNeMo#1585)

svc-bionemo · pstjohn · web-flow · commit 40d64e01579f · 2026-06-02T12:10:24.000Z
## Summary

Remove deprecated `fully_shard()` keyword arguments that were removed in a recent megatron-core update.

### Changes

- **vit/train.py**: Remove `grad_reduce_in_fp32` and `preserve_fp32_weights` from the `fully_shard()` call
- **vit/config/defaults.yaml + vit_base_patch16_224.yaml**: Remove the orphaned `grad_reduce_in_fp32` and `preserve_fp32_weights` config keys
- **esm2_native_te/hydra_config/defaults.yaml**: Remove `check_for_nan_in_grad`, `grad_reduce_in_fp32`, `preserve_fp32_weights`
- **geneformer_native_te_mfsdp_fp8/hydra_config/defaults.yaml**: Remove the same three keys

### Root Cause

The megatron-core installed in CI has an updated `fully_shard()` API that no longer accepts these parameters:

- `grad_reduce_in_fp32`
- `preserve_fp32_weights`
- `check_for_nan_in_grad`

These were simple removals; no replacement mechanism is needed for the affected recipes.

Fixes NVIDIA-BioNeMo#1584 (partially: the `nvidia-resiliency-ext` version mismatch is a container image issue that requires an image rebuild)

### Testing

- Pre-commit passes on all changed files
- Import verification confirms `fully_shard` loads without the removed kwargs

Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
Co-authored-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
Co-authored-by: Peter St. John <pstjohn@nvidia.com>
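
### Illustration (not part of this change)

The failure mode described under Root Cause, config keys that outlive the parameters they were forwarded to, can be caught by comparing a kwargs dict against the callee's signature before the call. The following is a minimal sketch under stated assumptions: `filter_supported_kwargs` is a hypothetical helper, and `fully_shard_stub` is a stand-in with an invented signature so that the example runs without megatron-core. This commit deletes the keys outright and does not add any such check.

```python
import inspect
from typing import Any, Callable

# Parameter kinds that can be passed by keyword.
_KEYWORD_KINDS = (
    inspect.Parameter.POSITIONAL_OR_KEYWORD,
    inspect.Parameter.KEYWORD_ONLY,
)


def filter_supported_kwargs(
    func: Callable[..., Any], kwargs: dict[str, Any]
) -> tuple[dict[str, Any], list[str]]:
    """Split kwargs into those `func` accepts by name and those it does not.

    Hypothetical helper for illustration; it is not part of the recipes.
    """
    params = inspect.signature(func).parameters
    # A **kwargs catch-all accepts any name, so nothing can be ruled out.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs), []
    accepted = {name for name, p in params.items() if p.kind in _KEYWORD_KINDS}
    supported = {k: v for k, v in kwargs.items() if k in accepted}
    dropped = sorted(k for k in kwargs if k not in accepted)
    return supported, dropped


def fully_shard_stub(module, *, zero_dp_strategy="optim_grads_params", overlap_grad_reduce=True):
    """Stand-in with an assumed signature; NOT the real megatron-core API."""
    return module


# Keys as they appeared in the recipe configs before this change.
config_kwargs = {
    "zero_dp_strategy": "optim_grads_params",
    "overlap_grad_reduce": True,
    "check_for_nan_in_grad": True,
    "grad_reduce_in_fp32": False,
    "preserve_fp32_weights": True,
}

supported, dropped = filter_supported_kwargs(fully_shard_stub, config_kwargs)
print("stale keys:", dropped)
```

Reporting the stale keys rather than silently discarding them keeps the config files honest, which is the state this commit restores by hand.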
diff --git a/bionemo-recipes/recipes/esm2_native_te/hydra_config/defaults.yaml b/bionemo-recipes/recipes/esm2_native_te/hydra_config/defaults.yaml
@@ -34,9 +34,6 @@ fully_shard_kwargs:
   zero_dp_strategy: "optim_grads_params"
   calculate_per_token_loss: false
   init_model_with_meta_device: ${use_meta_device}
-  check_for_nan_in_grad: true
-  grad_reduce_in_fp32: false
-  preserve_fp32_weights: true
   overlap_grad_reduce: true
   overlap_param_gather: true
   sync_model_each_microbatch: true
diff --git a/bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/hydra_config/defaults.yaml b/bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/hydra_config/defaults.yaml
@@ -31,9 +31,6 @@ training:
     zero_dp_strategy: "optim_grads_params"
     calculate_per_token_loss: false
     init_model_with_meta_device: true
-    check_for_nan_in_grad: true
-    grad_reduce_in_fp32: false
-    preserve_fp32_weights: true
     overlap_grad_reduce: true
     overlap_param_gather: true
     sync_model_each_microbatch: true
diff --git a/bionemo-recipes/recipes/vit/config/defaults.yaml b/bionemo-recipes/recipes/vit/config/defaults.yaml
@@ -55,8 +55,6 @@ fsdp:
     - torch.nn.LayerNorm
     - torch.nn.Linear
   outer_dp_sharding_strategy: "optim"
-  grad_reduce_in_fp32: false
-  preserve_fp32_weights: true
 
 training:
   steps: 10
diff --git a/bionemo-recipes/recipes/vit/config/vit_base_patch16_224.yaml b/bionemo-recipes/recipes/vit/config/vit_base_patch16_224.yaml
@@ -53,8 +53,6 @@ fsdp:
     - torch.nn.LayerNorm
     - torch.nn.Linear
   outer_dp_sharding_strategy: 1
-  grad_reduce_in_fp32: false
-  preserve_fp32_weights: true
 
 training:
   steps: 500
diff --git a/bionemo-recipes/recipes/vit/train.py b/bionemo-recipes/recipes/vit/train.py
@@ -95,10 +95,6 @@ def main(cfg) -> None:
             hybrid_fsdp_group=device_mesh["hsdp"].get_group(),
             # Load the model on device in shards to avoid OOM. Requires device("meta")-init for model.
             init_model_with_meta_device=cfg.fsdp.init_model_with_meta_device,
-            # Reduce gradients in FP32.
-            grad_reduce_in_fp32=cfg.fsdp.grad_reduce_in_fp32,
-            # Store distributed optimization state in FP32.
-            preserve_fp32_weights=cfg.fsdp.preserve_fp32_weights,
             # Sync model parameters and gradients each step. Allows for param and gradient mods after BWD
             # pass, but deactivates compute-communication overlap going into the subsequent training step.
             sync_model_each_microbatch=True,