Commit ccbe766

rename nvfsdp to mfsdp globally
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
1 parent a114094 commit ccbe766

53 files changed

Lines changed: 142 additions & 149 deletions

bionemo-recipes.md

Lines changed: 4 additions & 4 deletions
@@ -8,7 +8,7 @@ The biological AI community is actively prototyping model architectures and need
 
 - **Flexible scaling**: Scale from single-GPU prototyping to multi-node training without complex parallelism configurations
 - **Framework compatibility**: Works with popular frameworks like HuggingFace Accelerate, PyTorch Lightning, and vanilla PyTorch
-- **Performance optimization**: Leverages TransformerEngine and nvFSDP for state-of-the-art training efficiency
+- **Performance optimization**: Leverages TransformerEngine and megatron-fsdp for state-of-the-art training efficiency
 - **Research-friendly**: Hackable, readable code that researchers can easily adapt for their experiments
 
 ### Use Cases
@@ -35,7 +35,7 @@ Example models include ESM-2, Geneformer, and AMPLIFY.
 Self-contained training examples demonstrating best practices for scaling biological foundation models. Each recipe is a complete Docker container with:
 
 - **Framework examples**: Vanilla PyTorch, HuggingFace Accelerate, PyTorch Lightning
-- **Feature demonstrations**: FP8 training, nvFSDP, context parallelism, sequence packing
+- **Feature demonstrations**: FP8 training, megatron-fsdp, context parallelism, sequence packing
 - **Scaling strategies**: Single-GPU to multi-node training patterns
 - **Benchmarked performance**: Validated throughput and convergence metrics
 
@@ -57,7 +57,7 @@ tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M")
 
 ```bash
 # Navigate to a recipe
-cd recipes/esm2_native_te_nvfsdp
+cd recipes/esm2_native_te_mfsdp
 
 # Build and run
 docker build -t esm2_recipe .
@@ -191,4 +191,4 @@ For technical support and questions:
 
 - Check existing issues before opening a new one
 - Review our training recipes for implementation examples
-- Consult the TransformerEngine and nvFSDP documentation for underlying technologies
+- Consult the TransformerEngine and megatron-fsdp documentation for underlying technologies

models/.ruff.toml

Lines changed: 0 additions & 1 deletion
@@ -46,7 +46,6 @@ exclude = [
     "dist",
     "node_modules",
     "venv",
-    "packages/nvFSDP/",
 ]
 
 # Ignore import violations in all `__init__.py` files.

recipes/README.md

Lines changed: 5 additions & 5 deletions
@@ -2,7 +2,7 @@
 
 This directory contains self-contained training examples that demonstrate best practices for scaling
 biological foundation models using [TransformerEngine](https://github.com/NVIDIA/TransformerEngine)
-and [nvFSDP](https://github.com/NVIDIA-NeMo/nvFSDP). Each recipe is a complete Docker environment with
+and [megatron-fsdp](https://pypi.org/project/megatron-fsdp/). Each recipe is a complete Docker environment with
 benchmarked training scripts that users can learn from and adapt for their own research.
 
 ## Philosophy
@@ -49,7 +49,7 @@ Follow this naming pattern to clearly communicate what your recipe demonstrates:
 
 Examples:
 
-- `esm2_native_te_nvfsdp/` - ESM-2 with vanilla PyTorch, TransformerEngine, and nvFSDP
+- `esm2_native_te_mfsdp/` - ESM-2 with vanilla PyTorch, TransformerEngine, and megatron-fsdp
 - `amplify_accelerate_fp8/` - AMPLIFY with HuggingFace Accelerate and FP8 training
 - `geneformer_lightning_context_parallel/` - Geneformer with PyTorch Lightning and context parallelism
 
@@ -115,16 +115,16 @@ Your `train.py` should be educational and self-explanatory:
 ```python
 #!/usr/bin/env python3
 """
-ESM-2 training with TransformerEngine and nvFSDP.
+ESM-2 training with TransformerEngine and megatron-fsdp.
 
 This script demonstrates how to:
 1. Load and prepare biological sequence data
 2. Initialize ESM-2 with TransformerEngine layers
-3. Configure nvFSDP for memory-efficient multi-GPU training
+3. Configure megatron-fsdp for memory-efficient multi-GPU training
 4. Implement a training loop with proper checkpointing
 
 Key design decisions:
-- We use nvFSDP ZeRO-3 for maximum memory efficiency
+- We use megatron-fsdp ZeRO-3 for maximum memory efficiency
 - TransformerEngine FP8 is enabled for H100+ hardware
 - Context parallelism handles long biological sequences
 """

recipes/esm2_native_te_mfsdp/hydra_config/L0_sanity.yaml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ num_train_steps: 250
 
 # WandB config
 wandb_init_args:
-  name: "esm2_t6_8M_UR50D_nvfsdp_sanity"
+  name: "esm2_t6_8M_UR50D_mfsdp_sanity"
   mode: "offline"
 
 # Learning rate scheduler config

recipes/esm2_native_te_mfsdp/hydra_config/L1_15B_perf_test.yaml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ num_train_steps: 500
 
 # WandB config
 wandb_init_args:
-  name: "esm2_t48_15B_UR50D_nvfsdp_L1_perf"
+  name: "esm2_t48_15B_UR50D_mfsdp_L1_perf"
   project: "bionemo-recipes"
 
 # Optimizer config

recipes/esm2_native_te_mfsdp/train_ddp.py

Lines changed: 2 additions & 2 deletions
@@ -51,7 +51,7 @@ def is_main_process(self) -> bool:
 
 @hydra.main(config_path="hydra_config", config_name="L0_sanity", version_base="1.2")
 def main(args: DictConfig) -> float | None:
-    """Train ESM-2 with TE layers using nvFSDP.
+    """Train ESM-2 with TE layers using mfsdp.
 
     Model names are valid ESM-2 model sizes, e.g.:
     - "esm2_t6_8M_UR50D"
@@ -63,7 +63,7 @@ def main(args: DictConfig) -> float | None:
     """
     # Initialize distributed training and create a device mesh for FSDP.
     # We have to create a dummy mesh dimension for context parallel and tensor parallel for things
-    # to work correctly with nvFSDP.
+    # to work correctly with mfsdp.
     dist.init_process_group(backend="nccl")
     dist_config = DistributedConfig()
     torch.cuda.set_device(dist_config.local_rank)

recipes/esm2_native_te_mfsdp/train_mfsdp.py

Lines changed: 2 additions & 2 deletions
@@ -55,7 +55,7 @@ def is_main_process(self) -> bool:
 
 @hydra.main(config_path="hydra_config", config_name="L0_sanity", version_base="1.2")
 def main(args: DictConfig) -> float | None:
-    """Train ESM-2 with TE layers using nvFSDP.
+    """Train ESM-2 with TE layers using mfsdp.
 
     Model names are valid ESM-2 model sizes, e.g.:
     - "esm2_t6_8M_UR50D"
@@ -67,7 +67,7 @@ def main(args: DictConfig) -> float | None:
     """
     # Initialize distributed training and create a device mesh for FSDP.
     # We have to create a dummy mesh dimension for context parallel and tensor parallel for things
-    # to work correctly with nvFSDP.
+    # to work correctly with mfsdp.
     dist.init_process_group(backend="nccl")
     dist_config = DistributedConfig()
     torch.cuda.set_device(dist_config.local_rank)
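
The comment in both training scripts mentions a "dummy mesh dimension" for context parallel and tensor parallel, but the lines that build the mesh fall outside the hunks shown here. As a rough sketch of what that could look like, the helper below computes a three-dimensional mesh layout where only the FSDP axis spans the ranks. The function name `fsdp_mesh_spec` and the dimension names `"fsdp"`, `"cp"`, and `"tp"` are assumptions for illustration and are not taken from this commit.

```python
def fsdp_mesh_spec(world_size: int) -> tuple[tuple[int, int, int], tuple[str, str, str]]:
    """Return (mesh_shape, mesh_dim_names) for FSDP-only training.

    All ranks are placed on the FSDP axis; the context-parallel and
    tensor-parallel axes are size-1 placeholders ("dummy" dimensions).
    """
    if world_size < 1:
        raise ValueError("world_size must be a positive integer")
    mesh_shape = (world_size, 1, 1)
    # Hypothetical dimension names; the recipe may use different ones.
    mesh_dim_names = ("fsdp", "cp", "tp")
    return mesh_shape, mesh_dim_names


if __name__ == "__main__":
    shape, names = fsdp_mesh_spec(8)
    print(shape, names)
    # Inside an initialized distributed job, the spec could be handed to PyTorch:
    #   from torch.distributed.device_mesh import init_device_mesh
    #   mesh = init_device_mesh("cuda", mesh_shape=shape, mesh_dim_names=names)
```

Because the placeholder axes have size 1, the product of the mesh shape still equals the world size, so no ranks are left unassigned.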

recipes/esm2_native_te_nvfsdp_thd/.devcontainer/devcontainer.json renamed to recipes/esm2_native_te_mfsdp_thd/.devcontainer/devcontainer.json

File renamed without changes.
File renamed without changes.
File renamed without changes.
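
A global rename across 53 files is easy to leave incomplete, particularly since the old name appears in several casings (`nvFSDP`, `nvfsdp`). The following is a minimal sketch of a leftover check, not the tooling used for this commit: it walks a directory tree and reports file names and lines that still contain the old name, matched case-insensitively. The function name `find_leftovers` and the list of file suffixes are assumptions.

```python
import re
from pathlib import Path

# Match the old name in any casing (nvFSDP, nvfsdp, NVFSDP, ...).
OLD_NAME = re.compile(r"nvfsdp", re.IGNORECASE)


def find_leftovers(root: Path, suffixes=(".py", ".md", ".yaml", ".toml", ".json")):
    """Return (path, line_number, text) tuples for remaining uses of the old name.

    A line_number of 0 means the file or directory name itself matched.
    """
    hits = []
    for path in sorted(root.rglob("*")):
        if OLD_NAME.search(path.name):
            hits.append((str(path), 0, path.name))
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if OLD_NAME.search(line):
                    hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for hit in find_leftovers(Path(".")):
        print(hit)
```

An empty result means no occurrences were found in the scanned file types; binary files and other suffixes would need a separate check.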
