
Commit 40d64e0

svc-bionemo and pstjohn authored
fix(recipes): remove deprecated fully_shard() kwargs (NVIDIA-BioNeMo#1585)
## Summary

Remove `fully_shard()` keyword arguments that a recent megatron-core update dropped from the API.

### Changes

- **vit/train.py**: Remove `grad_reduce_in_fp32` and `preserve_fp32_weights` from the `fully_shard()` call
- **vit/config/defaults.yaml + vit_base_patch16_224.yaml**: Remove the orphaned config keys `grad_reduce_in_fp32` and `preserve_fp32_weights`
- **esm2_native_te/hydra_config/defaults.yaml**: Remove `check_for_nan_in_grad`, `grad_reduce_in_fp32`, `preserve_fp32_weights`
- **geneformer_native_te_mfsdp_fp8/hydra_config/defaults.yaml**: Remove the same three keys

### Root Cause

The megatron-core installed in CI has an updated `fully_shard()` API that no longer accepts these parameters:

- `grad_reduce_in_fp32`
- `preserve_fp32_weights`
- `check_for_nan_in_grad`

These were simple removals; the affected recipes need no replacement mechanism.

Fixes NVIDIA-BioNeMo#1584 (partially: the `nvidia-resiliency-ext` version mismatch is a container image issue that requires an image rebuild)

### Testing

- Pre-commit passes on all changed files
- Import verification confirms `fully_shard` loads without the removed kwargs

Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
Co-authored-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
Co-authored-by: Peter St. John <pstjohn@nvidia.com>
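The kind of mismatch described under Root Cause (config keys that the installed API no longer accepts) can be caught before the call with a signature check. The following is a minimal sketch, not part of this commit: `split_supported_kwargs` and `fully_shard_stub` are hypothetical names, and the stub's parameter list is an illustrative assumption, not the real megatron-core signature.

```python
import inspect
from typing import Any, Callable


def split_supported_kwargs(
    func: Callable[..., Any], kwargs: dict[str, Any]
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Partition kwargs into those func accepts by name and those it does not."""
    params = inspect.signature(func).parameters
    # A **kwargs catch-all means every keyword name is accepted.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs), {}
    accepted_names = {
        name
        for name, p in params.items()
        if p.kind
        in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    supported = {k: v for k, v in kwargs.items() if k in accepted_names}
    rejected = {k: v for k, v in kwargs.items() if k not in accepted_names}
    return supported, rejected


# Hypothetical stand-in for the updated API. This is NOT the real
# megatron-core signature; it only mimics "some kwargs were removed".
def fully_shard_stub(
    module,
    *,
    zero_dp_strategy="optim_grads_params",
    overlap_grad_reduce=True,
    overlap_param_gather=True,
    sync_model_each_microbatch=True,
):
    return module


# Config values as they looked before this commit (subset, for illustration).
config = {
    "zero_dp_strategy": "optim_grads_params",
    "check_for_nan_in_grad": True,
    "grad_reduce_in_fp32": False,
    "preserve_fp32_weights": True,
    "overlap_grad_reduce": True,
}

supported, rejected = split_supported_kwargs(fully_shard_stub, config)
print("orphaned keys:", sorted(rejected))
```

In a real recipe the stub would be replaced by the imported `fully_shard`, and a non-empty `rejected` dict could be turned into a warning or an early, descriptive error instead of a `TypeError` deep inside training setup. Whether that trade-off is worth it compared with simply deleting the keys, as this commit does, is a judgment call.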
1 parent 85a8dcc commit 40d64e0

5 files changed

Lines changed: 0 additions & 14 deletions


bionemo-recipes/recipes/esm2_native_te/hydra_config/defaults.yaml

Lines changed: 0 additions & 3 deletions
@@ -34,9 +34,6 @@ fully_shard_kwargs:
 zero_dp_strategy: "optim_grads_params"
 calculate_per_token_loss: false
 init_model_with_meta_device: ${use_meta_device}
-check_for_nan_in_grad: true
-grad_reduce_in_fp32: false
-preserve_fp32_weights: true
 overlap_grad_reduce: true
 overlap_param_gather: true
 sync_model_each_microbatch: true

bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/hydra_config/defaults.yaml

Lines changed: 0 additions & 3 deletions
@@ -31,9 +31,6 @@ training:
 zero_dp_strategy: "optim_grads_params"
 calculate_per_token_loss: false
 init_model_with_meta_device: true
-check_for_nan_in_grad: true
-grad_reduce_in_fp32: false
-preserve_fp32_weights: true
 overlap_grad_reduce: true
 overlap_param_gather: true
 sync_model_each_microbatch: true

bionemo-recipes/recipes/vit/config/defaults.yaml

Lines changed: 0 additions & 2 deletions
@@ -55,8 +55,6 @@ fsdp:
 - torch.nn.LayerNorm
 - torch.nn.Linear
 outer_dp_sharding_strategy: "optim"
-grad_reduce_in_fp32: false
-preserve_fp32_weights: true

 training:
 steps: 10

bionemo-recipes/recipes/vit/config/vit_base_patch16_224.yaml

Lines changed: 0 additions & 2 deletions
@@ -53,8 +53,6 @@ fsdp:
 - torch.nn.LayerNorm
 - torch.nn.Linear
 outer_dp_sharding_strategy: 1
-grad_reduce_in_fp32: false
-preserve_fp32_weights: true

 training:
 steps: 500

bionemo-recipes/recipes/vit/train.py

Lines changed: 0 additions & 4 deletions
@@ -95,10 +95,6 @@ def main(cfg) -> None:
 hybrid_fsdp_group=device_mesh["hsdp"].get_group(),
 # Load the model on device in shards to avoid OOM. Requires device("meta")-init for model.
 init_model_with_meta_device=cfg.fsdp.init_model_with_meta_device,
-# Reduce gradients in FP32.
-grad_reduce_in_fp32=cfg.fsdp.grad_reduce_in_fp32,
-# Store distributed optimization state in FP32.
-preserve_fp32_weights=cfg.fsdp.preserve_fp32_weights,
 # Sync model parameters and gradients each step. Allows for param and gradient mods after BWD
 # pass, but deactivates compute-communication overlap going into the subsequent training step.
 sync_model_each_microbatch=True,
