Commit 4bf3878

rename nvfsdp to mfsdp globally (#1137)
Rename folders and files so they no longer use the deprecated nvFSDP name; they now use 'megatron-fsdp' or 'mfsdp'.

## Summary by CodeRabbit

- New Features/Improvements
  - Migrate project references and workflows from NVFSDP to Megatron-FSDP (mfsdp); W&B run names and recipe paths updated; simplified setup for one recipe by removing an external tarball download.
- Refactor
  - Renamed config/API flags from use_nvfsdp → use_mfsdp and updated checkpointing and save/load flows to the mfsdp backend.
- Documentation
  - All NVFSDP mentions, examples, and readmes replaced with Megatron-FSDP terminology.
- Tests
  - Test names, configs, and orchestration updated to use mfsdp.
- Chores
  - Re-enabled linting for a previously excluded directory; disabled a small masking-ratio assertion in one test.

---------

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
1 parent 223b211 commit 4bf3878

53 files changed

Lines changed: 150 additions & 156 deletions
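
With a rename that touches this many files, a case-insensitive scan is one way to check for leftover references to the old name. The sketch below is an illustration only and is not part of this commit; the helper name `find_legacy_names` and the list of file suffixes are hypothetical.

```python
import re
from pathlib import Path

# Matches nvfsdp, nvFSDP, NVFSDP, etc.
LEGACY = re.compile(r"nvfsdp", re.IGNORECASE)


def find_legacy_names(root, suffixes=(".py", ".md", ".yaml", ".toml", ".json")):
    """Return (path, line_number, line) for every line that still mentions the old name."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if LEGACY.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for path, lineno, line in find_legacy_names("."):
        print(f"{path}:{lineno}: {line}")
```

This scan only inspects file contents; directory and file names that still carry the old name would need a separate check on `path.name`.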

bionemo-recipes.md

Lines changed: 4 additions & 4 deletions
@@ -8,7 +8,7 @@ The biological AI community is actively prototyping model architectures and need
 
 - **Flexible scaling**: Scale from single-GPU prototyping to multi-node training without complex parallelism configurations
 - **Framework compatibility**: Works with popular frameworks like HuggingFace Accelerate, PyTorch Lightning, and vanilla PyTorch
-- **Performance optimization**: Leverages TransformerEngine and nvFSDP for state-of-the-art training efficiency
+- **Performance optimization**: Leverages TransformerEngine and megatron-fsdp for state-of-the-art training efficiency
 - **Research-friendly**: Hackable, readable code that researchers can easily adapt for their experiments
 
 ### Use Cases
@@ -35,7 +35,7 @@ Example models include ESM-2, Geneformer, and AMPLIFY.
 Self-contained training examples demonstrating best practices for scaling biological foundation models. Each recipe is a complete Docker container with:
 
 - **Framework examples**: Vanilla PyTorch, HuggingFace Accelerate, PyTorch Lightning
-- **Feature demonstrations**: FP8 training, nvFSDP, context parallelism, sequence packing
+- **Feature demonstrations**: FP8 training, megatron-fsdp, context parallelism, sequence packing
 - **Scaling strategies**: Single-GPU to multi-node training patterns
 - **Benchmarked performance**: Validated throughput and convergence metrics
 
@@ -57,7 +57,7 @@ tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M")
 
 ```bash
 # Navigate to a recipe
-cd recipes/esm2_native_te_nvfsdp
+cd recipes/esm2_native_te_mfsdp
 
 # Build and run
 docker build -t esm2_recipe .
@@ -191,4 +191,4 @@ For technical support and questions:
 
 - Check existing issues before opening a new one
 - Review our training recipes for implementation examples
-- Consult the TransformerEngine and nvFSDP documentation for underlying technologies
+- Consult the TransformerEngine and megatron-fsdp documentation for underlying technologies

models/.ruff.toml

Lines changed: 0 additions & 1 deletion
@@ -46,7 +46,6 @@ exclude = [
 "dist",
 "node_modules",
 "venv",
-"packages/nvFSDP/",
 ]
 
 # Ignore import violations in all `__init__.py` files.

recipes/README.md

Lines changed: 8 additions & 8 deletions
@@ -2,7 +2,7 @@
 
 This directory contains self-contained training examples that demonstrate best practices for scaling
 biological foundation models using [TransformerEngine](https://github.com/NVIDIA/TransformerEngine)
-and [nvFSDP](https://github.com/NVIDIA-NeMo/nvFSDP). Each recipe is a complete Docker environment with
+and [megatron-fsdp](https://pypi.org/project/megatron-fsdp/). Each recipe is a complete Docker environment with
 benchmarked training scripts that users can learn from and adapt for their own research.
 
 ## Philosophy
@@ -49,7 +49,7 @@ Follow this naming pattern to clearly communicate what your recipe demonstrates:
 
 Examples:
 
-- `esm2_native_te_nvfsdp/` - ESM-2 with vanilla PyTorch, TransformerEngine, and nvFSDP
+- `esm2_native_te_mfsdp/` - ESM-2 with vanilla PyTorch, TransformerEngine, and megatron-fsdp
 - `amplify_accelerate_fp8/` - AMPLIFY with HuggingFace Accelerate and FP8 training
 - `geneformer_lightning_context_parallel/` - Geneformer with PyTorch Lightning and context parallelism
 
@@ -115,16 +115,16 @@ Your `train.py` should be educational and self-explanatory:
 ```python
 #!/usr/bin/env python3
 """
-ESM-2 training with TransformerEngine and nvFSDP.
+ESM-2 training with TransformerEngine and megatron-fsdp.
 
 This script demonstrates how to:
 1. Load and prepare biological sequence data
 2. Initialize ESM-2 with TransformerEngine layers
-3. Configure nvFSDP for memory-efficient multi-GPU training
+3. Configure megatron-fsdp for memory-efficient multi-GPU training
 4. Implement a training loop with proper checkpointing
 
 Key design decisions:
-- We use nvFSDP ZeRO-3 for maximum memory efficiency
+- We use megatron-fsdp ZeRO-3 for maximum memory efficiency
 - TransformerEngine FP8 is enabled for H100+ hardware
 - Context parallelism handles long biological sequences
 """
@@ -197,7 +197,7 @@ optimizer:
 # Distributed training
 distributed:
   backend: nccl
-  nvfsdp:
+  mfsdp:
     enable: true
     sharding_strategy: zero3
 
@@ -242,7 +242,7 @@ training:
   num_train_steps: 100 # Enough steps for stable metrics
 
 wandb:
-  name: "esm2_nvfsdp_benchmark"
+  name: "esm2_mfsdp_benchmark"
   tags: ["L1", "benchmark", "performance"]
 ```
 
@@ -411,7 +411,7 @@ docker run --rm -it --gpus all my_recipe pytest -v .
 
 For reference implementations, examine existing recipes:
 
-- **`esm2_native_te_nvfsdp/`**: Comprehensive example showing vanilla PyTorch with TE and nvFSDP
+- **`esm2_native_te_mfsdp/`**: Comprehensive example showing vanilla PyTorch with TE and megatron-fsdp
 - **`amplify_accelerate_fp8/`**: HuggingFace Accelerate integration with FP8 training
 - **`geneformer_lightning_context_parallel/`**: PyTorch Lightning with context parallelism for long sequences
 
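
The README hunks above rename the `nvfsdp:` config block to `mfsdp:`, and the commit message mentions the `use_nvfsdp` → `use_mfsdp` flag rename. For configs written before this commit, a small key-migration helper might look like the following sketch. It is a hypothetical illustration, not code from the repository; `migrate_keys` and `LEGACY_KEYS` are assumed names.

```python
from typing import Any

# Old key -> new key, taken from the renames shown in this commit.
LEGACY_KEYS = {"nvfsdp": "mfsdp", "use_nvfsdp": "use_mfsdp"}


def migrate_keys(cfg: Any) -> Any:
    """Return a copy of a nested dict/list config with legacy keys renamed."""
    if isinstance(cfg, dict):
        return {LEGACY_KEYS.get(key, key): migrate_keys(value) for key, value in cfg.items()}
    if isinstance(cfg, list):
        return [migrate_keys(item) for item in cfg]
    return cfg


old_cfg = {
    "distributed": {
        "backend": "nccl",
        "nvfsdp": {"enable": True, "sharding_strategy": "zero3"},
    }
}
new_cfg = migrate_keys(old_cfg)
```

Only keys are rewritten; string values such as W&B run names are left untouched, so those would still need to be updated by hand.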

recipes/esm2_native_te_mfsdp/hydra_config/L0_sanity.yaml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ num_train_steps: 250
 
 # WandB config
 wandb_init_args:
-  name: "esm2_t6_8M_UR50D_nvfsdp_sanity"
+  name: "esm2_t6_8M_UR50D_mfsdp_sanity"
   mode: "offline"
 
 # Learning rate scheduler config

recipes/esm2_native_te_mfsdp/hydra_config/L1_15B_perf_test.yaml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ num_train_steps: 500
 
 # WandB config
 wandb_init_args:
-  name: "esm2_t48_15B_UR50D_nvfsdp_L1_perf"
+  name: "esm2_t48_15B_UR50D_mfsdp_L1_perf"
   project: "bionemo-recipes"
 
 # Optimizer config

recipes/esm2_native_te_mfsdp/train_ddp.py

Lines changed: 2 additions & 2 deletions
@@ -51,7 +51,7 @@ def is_main_process(self) -> bool:
 
 @hydra.main(config_path="hydra_config", config_name="L0_sanity", version_base="1.2")
 def main(args: DictConfig) -> float | None:
-    """Train ESM-2 with TE layers using nvFSDP.
+    """Train ESM-2 with TE layers using mfsdp.
 
     Model names are valid ESM-2 model sizes, e.g.:
     - "esm2_t6_8M_UR50D"
@@ -63,7 +63,7 @@ def main(args: DictConfig) -> float | None:
     """
     # Initialize distributed training and create a device mesh for FSDP.
     # We have to create a dummy mesh dimension for context parallel and tensor parallel for things
-    # to work correctly with nvFSDP.
+    # to work correctly with mfsdp.
     dist.init_process_group(backend="nccl")
     dist_config = DistributedConfig()
     torch.cuda.set_device(dist_config.local_rank)

recipes/esm2_native_te_mfsdp/train_mfsdp.py

Lines changed: 2 additions & 2 deletions
@@ -55,7 +55,7 @@ def is_main_process(self) -> bool:
 
 @hydra.main(config_path="hydra_config", config_name="L0_sanity", version_base="1.2")
 def main(args: DictConfig) -> float | None:
-    """Train ESM-2 with TE layers using nvFSDP.
+    """Train ESM-2 with TE layers using mfsdp.
 
     Model names are valid ESM-2 model sizes, e.g.:
     - "esm2_t6_8M_UR50D"
@@ -67,7 +67,7 @@ def main(args: DictConfig) -> float | None:
     """
     # Initialize distributed training and create a device mesh for FSDP.
     # We have to create a dummy mesh dimension for context parallel and tensor parallel for things
-    # to work correctly with nvFSDP.
+    # to work correctly with mfsdp.
     dist.init_process_group(backend="nccl")
     dist_config = DistributedConfig()
     torch.cuda.set_device(dist_config.local_rank)
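
The comment in this hunk mentions creating dummy mesh dimensions for context parallel and tensor parallel. The code that builds the mesh is outside the diff, so the following is only a rough sketch of the idea, assuming PyTorch's `init_device_mesh`. The helper names and the dimension names `dp`, `cp`, and `tp` are hypothetical, not taken from the recipe.

```python
def fsdp_mesh_shape(world_size: int, cp_size: int = 1, tp_size: int = 1) -> tuple:
    """Compute a (dp, cp, tp) mesh shape; cp and tp default to size-1 dummy dimensions."""
    if world_size % (cp_size * tp_size) != 0:
        raise ValueError("world_size must be divisible by cp_size * tp_size")
    return (world_size // (cp_size * tp_size), cp_size, tp_size)


def build_device_mesh(world_size: int):
    """Build a 3-D device mesh (requires an initialized process group and CUDA)."""
    # Imported lazily so the shape helper above can be used without torch installed.
    from torch.distributed.device_mesh import init_device_mesh

    return init_device_mesh(
        "cuda",
        fsdp_mesh_shape(world_size),
        mesh_dim_names=("dp", "cp", "tp"),  # assumed names, for illustration only
    )
```

With 8 ranks and no context or tensor parallelism, `fsdp_mesh_shape(8)` gives `(8, 1, 1)`: all ranks sit on the data-parallel axis and the other two axes exist only as placeholders.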

recipes/esm2_native_te_nvfsdp_thd/.devcontainer/devcontainer.json renamed to recipes/esm2_native_te_mfsdp_thd/.devcontainer/devcontainer.json

File renamed without changes.
