Commit b727e51

Merge branch 'main' into pbinder/changes_for_scdl_profile
2 parents 497a66b + 4bf3878 commit b727e51

73 files changed

Lines changed: 2577 additions & 204 deletions

.devcontainer/recipes/requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -7,6 +7,6 @@ megatron-fsdp==0.1.0rc0
  torchmetrics
  tqdm
  transformer_engine
- transformers @ git+https://github.com/huggingface/transformers.git
+ transformers
  typer
  wandb

bionemo-recipes.md

Lines changed: 4 additions & 4 deletions
@@ -8,7 +8,7 @@ The biological AI community is actively prototyping model architectures and need

  - **Flexible scaling**: Scale from single-GPU prototyping to multi-node training without complex parallelism configurations
  - **Framework compatibility**: Works with popular frameworks like HuggingFace Accelerate, PyTorch Lightning, and vanilla PyTorch
- - **Performance optimization**: Leverages TransformerEngine and nvFSDP for state-of-the-art training efficiency
+ - **Performance optimization**: Leverages TransformerEngine and megatron-fsdp for state-of-the-art training efficiency
  - **Research-friendly**: Hackable, readable code that researchers can easily adapt for their experiments

  ### Use Cases
@@ -35,7 +35,7 @@ Example models include ESM-2, Geneformer, and AMPLIFY.
  Self-contained training examples demonstrating best practices for scaling biological foundation models. Each recipe is a complete Docker container with:

  - **Framework examples**: Vanilla PyTorch, HuggingFace Accelerate, PyTorch Lightning
- - **Feature demonstrations**: FP8 training, nvFSDP, context parallelism, sequence packing
+ - **Feature demonstrations**: FP8 training, megatron-fsdp, context parallelism, sequence packing
  - **Scaling strategies**: Single-GPU to multi-node training patterns
  - **Benchmarked performance**: Validated throughput and convergence metrics

@@ -57,7 +57,7 @@ tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M")

  ```bash
  # Navigate to a recipe
- cd recipes/esm2_native_te_nvfsdp
+ cd recipes/esm2_native_te_mfsdp

  # Build and run
  docker build -t esm2_recipe .
@@ -191,4 +191,4 @@ For technical support and questions:

  - Check existing issues before opening a new one
  - Review our training recipes for implementation examples
- - Consult the TransformerEngine and nvFSDP documentation for underlying technologies
+ - Consult the TransformerEngine and megatron-fsdp documentation for underlying technologies

ci/benchmarks/partial-conv/evo2_finetuning.yaml

Lines changed: 3 additions & 2 deletions
@@ -89,10 +89,11 @@ script: |-
  --devices=${gpus} \
  --num-nodes=${nodes} \
  --val-check-interval=${val_check_interval} \
- --wandb-project=${wandb_project_name} \
- --wandb-group=${model}_${variant}_${config_name}_${task}_${target} \
  --create-tensorboard-logger \
  --activation-checkpoint-recompute-num-layers=${activation_checkpoint_layers} \
  --disable-checkpointing \
  --early-stop-on-step=${stop_steps} \
+ --wandb-project=${wandb_project_name} \
+ --wandb-group=${model}_${variant}_${config_name}_${task}_${target} \
+ --wandb-job-type=${pipeline_label} \
  --garbage-collect-at-inference;

ci/benchmarks/partial-conv/evo2_pretrain.yaml

Lines changed: 3 additions & 3 deletions
@@ -7,15 +7,15 @@ key_segments:
  lr: False
  min_lr: False
  wu_steps: False
- artefacts_url: False
+ pckg_url: False
  file_name_wheel: False
  script_args:
  # All arguments referenced in the script string must be specified here.
  # Arguments not referenced in the script string must have the 'arg' field specified.
  # See jet/core/configs.py for the specification of the configuration class
  workspace: /workspace/bionemo2
  data_path: /data/evo2
- artefacts_url: https://__token__:${{JET_GITLAB_TOKEN}}@gitlab-master.nvidia.com/api/v4/projects/180496/packages/pypi/simple
+ pckg_url: gitlab-master.nvidia.com/api/v4/projects/180496/packages/pypi/simple/
  file_name_wheel: subquadratic-ops
  model: evo2
  variant: train
@@ -40,7 +40,7 @@ script_args:
  script: |-
  INSTALL_FLAG="/tmp/install_done_${{SLURMD_NODENAME}}";
  if [ "$SLURM_LOCALID" = "0" ]; then
- pip install ${file_name_wheel} --index-url ${artefacts_url}
+ pip install ${file_name_wheel} --index-url https://oauth2:$JET_GITLAB_TOKEN@${pckg_url} --extra-index-url https://pypi.org/simple/
  touch $INSTALL_FLAG
  fi
  # All ranks wait until install flag file appears
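
The updated install line composes an authenticated index URL from `pckg_url` and the `JET_GITLAB_TOKEN` variable, and adds PyPI as an extra index. A minimal Python sketch of the same argument construction is shown below. The helper names `build_pip_install_args` and `redact` are hypothetical and not part of this commit, and the token in the usage example is a placeholder.

```python
from typing import List

PYPI_SIMPLE = "https://pypi.org/simple/"


def build_pip_install_args(package: str, pckg_url: str, token: str) -> List[str]:
    """Mirror the shell command: private index first, public PyPI as an extra index."""
    # Hypothetical helper: interpolates the token the same way the script does.
    index_url = f"https://oauth2:{token}@{pckg_url}"
    return [
        "pip", "install", package,
        "--index-url", index_url,
        "--extra-index-url", PYPI_SIMPLE,
    ]


def redact(args: List[str], token: str) -> List[str]:
    """Return a copy of the arguments with the token masked, for printing in logs."""
    return [arg.replace(token, "***") for arg in args]


if __name__ == "__main__":
    args = build_pip_install_args(
        "subquadratic-ops",
        "gitlab-master.nvidia.com/api/v4/projects/180496/packages/pypi/simple/",
        "example-token",  # placeholder, not a real credential
    )
    print(" ".join(redact(args, "example-token")))
```

With `--extra-index-url`, pip considers both indexes when resolving, so dependencies of the wheel that are not hosted on the private index can still be found on PyPI.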

ci/benchmarks/perf/esm2_pretrain.yaml

Lines changed: 16 additions & 4 deletions
@@ -41,11 +41,23 @@ script_args:
  tp: 1
  dfpnl: ""
  script: |-
+ COPY_FLAG="/tmp/copy_done_${{SLURMD_NODENAME}}";
+ NEW_DATA_PATH="/dev/shm/data_path_${{SLURMD_NODENAME}}";
+ if [ "$SLURM_LOCALID" = "0" ]; then
+ df -h;
+ echo $NEW_DATA_PATH;
+ time cp -r ${data_path}/ $NEW_DATA_PATH;
+ touch $COPY_FLAG
+ fi
+ # All ranks wait until install flag file appears
+ while [ ! -f $COPY_FLAG ]; do
+ sleep 1
+ done
  WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
- --train-cluster-path=${data_path}/train_clusters.parquet \
- --train-database-path=${data_path}/train.db \
- --valid-cluster-path=${data_path}/valid_clusters.parquet \
- --valid-database-path=${data_path}/validation.db \
+ --train-cluster-path=$NEW_DATA_PATH/train_clusters.parquet \
+ --train-database-path=$NEW_DATA_PATH/train.db \
+ --valid-cluster-path=$NEW_DATA_PATH/valid_clusters.parquet \
+ --valid-database-path=$NEW_DATA_PATH/validation.db \
  --micro-batch-size=${batch_size} \
  --num-nodes=${nodes} \
  --num-gpus=${gpus} \
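
The added script lines implement a per-node staging pattern: the process with local rank 0 copies the dataset to node-local shared memory and touches a flag file, while every rank polls for that flag before training starts. The Python sketch below illustrates the same pattern under stated assumptions. The function name `stage_data` and the timeout are hypothetical additions; the shell loop in the commit polls without a timeout.

```python
import shutil
import tempfile
import time
from pathlib import Path


def stage_data(src: Path, dst: Path, flag: Path, local_rank: int,
               poll_seconds: float = 1.0, timeout_seconds: float = 3600.0) -> Path:
    """Copy src to dst once per node, then block until the copy is flagged as done."""
    if local_rank == 0:
        # Only one process per node performs the copy (cf. SLURM_LOCALID == 0).
        shutil.copytree(src, dst, dirs_exist_ok=True)
        flag.touch()  # counterpart of `touch $COPY_FLAG`
    # Every rank, including rank 0, waits here until the flag file exists.
    deadline = time.monotonic() + timeout_seconds
    while not flag.exists():
        if time.monotonic() > deadline:  # assumption: the original has no timeout
            raise TimeoutError(f"{flag} did not appear within {timeout_seconds}s")
        time.sleep(poll_seconds)
    return dst


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    source = root / "data"
    source.mkdir()
    (source / "train.db").write_text("placeholder")
    staged = stage_data(source, root / "shm_copy", root / "copy_done", local_rank=0)
    print(sorted(p.name for p in staged.iterdir()))
```

Because the flag is created only after the copy returns, ranks that pass the wait loop see a complete copy. A flag left over from an earlier job on the same node would let ranks proceed early, so the flag path should be unique per job or removed at startup.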

ci/benchmarks/perf/geneformer_pretrain.yaml

Lines changed: 14 additions & 2 deletions
@@ -27,8 +27,20 @@ script_args:
  batch_size: 32

  script: |-
- WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
- --data-dir ${data_path} \
+ COPY_FLAG="/tmp/copy_done_${{SLURMD_NODENAME}}";
+ NEW_DATA_PATH="/dev/shm/data_path_${{SLURMD_NODENAME}}";
+ if [ "$SLURM_LOCALID" = "0" ]; then
+ df -h;
+ echo $NEW_DATA_PATH;
+ time cp -r ${data_path}/ $NEW_DATA_PATH;
+ touch $COPY_FLAG
+ fi
+ # All ranks wait until install flag file appears
+ while [ ! -f $COPY_FLAG ]; do
+ sleep 1
+ done
+ WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
+ --data-dir $NEW_DATA_PATH \
  --experiment-name ${batch_size}bs_${nodes}node_${gpus}gpu_${max_steps}s_${precision}prec \
  --num-gpus ${gpus} \
  --save-last-checkpoint \

docs/docs/index.md

Lines changed: 3 additions & 3 deletions
@@ -22,7 +22,7 @@ hide:
  </span>
  </div>
  <div class="card-title" style="margin: 0;">
- <strong>Datasets</strong>
+ <strong>User Guide</strong>
  </div>
  </div>
  <hr />
@@ -42,7 +42,7 @@ hide:
  </span>
  </div>
  <div class="card-title" style="margin: 0;">
- <strong>Datasets</strong>
+ <strong>API Reference</strong>
  </div>
  </div>
  <hr />
@@ -62,7 +62,7 @@ hide:
  </span>
  </div>
  <div class="card-title" style="margin: 0;">
- <strong>Datasets</strong>
+ <strong>Models</strong>
  </div>
  </div>
  <hr />

models/.ruff.toml

Lines changed: 0 additions & 1 deletion
@@ -46,7 +46,6 @@ exclude = [
  "dist",
  "node_modules",
  "venv",
- "packages/nvFSDP/",
  ]

  # Ignore import violations in all `__init__.py` files.

recipes/README.md

Lines changed: 8 additions & 8 deletions
@@ -2,7 +2,7 @@

  This directory contains self-contained training examples that demonstrate best practices for scaling
  biological foundation models using [TransformerEngine](https://github.com/NVIDIA/TransformerEngine)
- and [nvFSDP](https://github.com/NVIDIA-NeMo/nvFSDP). Each recipe is a complete Docker environment with
+ and [megatron-fsdp](https://pypi.org/project/megatron-fsdp/). Each recipe is a complete Docker environment with
  benchmarked training scripts that users can learn from and adapt for their own research.

  ## Philosophy
@@ -49,7 +49,7 @@ Follow this naming pattern to clearly communicate what your recipe demonstrates:

  Examples:

- - `esm2_native_te_nvfsdp/` - ESM-2 with vanilla PyTorch, TransformerEngine, and nvFSDP
+ - `esm2_native_te_mfsdp/` - ESM-2 with vanilla PyTorch, TransformerEngine, and megatron-fsdp
  - `amplify_accelerate_fp8/` - AMPLIFY with HuggingFace Accelerate and FP8 training
  - `geneformer_lightning_context_parallel/` - Geneformer with PyTorch Lightning and context parallelism

@@ -115,16 +115,16 @@ Your `train.py` should be educational and self-explanatory:
  ```python
  #!/usr/bin/env python3
  """
- ESM-2 training with TransformerEngine and nvFSDP.
+ ESM-2 training with TransformerEngine and megatron-fsdp.

  This script demonstrates how to:
  1. Load and prepare biological sequence data
  2. Initialize ESM-2 with TransformerEngine layers
- 3. Configure nvFSDP for memory-efficient multi-GPU training
+ 3. Configure megatron-fsdp for memory-efficient multi-GPU training
  4. Implement a training loop with proper checkpointing

  Key design decisions:
- - We use nvFSDP ZeRO-3 for maximum memory efficiency
+ - We use megatron-fsdp ZeRO-3 for maximum memory efficiency
  - TransformerEngine FP8 is enabled for H100+ hardware
  - Context parallelism handles long biological sequences
  """
@@ -197,7 +197,7 @@ optimizer:
  # Distributed training
  distributed:
  backend: nccl
- nvfsdp:
+ mfsdp:
  enable: true
  sharding_strategy: zero3

@@ -242,7 +242,7 @@ training:
  num_train_steps: 100 # Enough steps for stable metrics

  wandb:
- name: "esm2_nvfsdp_benchmark"
+ name: "esm2_mfsdp_benchmark"
  tags: ["L1", "benchmark", "performance"]
  ```

@@ -411,7 +411,7 @@ docker run --rm -it --gpus all my_recipe pytest -v .

  For reference implementations, examine existing recipes:

- - **`esm2_native_te_nvfsdp/`**: Comprehensive example showing vanilla PyTorch with TE and nvFSDP
+ - **`esm2_native_te_mfsdp/`**: Comprehensive example showing vanilla PyTorch with TE and megatron-fsdp
  - **`amplify_accelerate_fp8/`**: HuggingFace Accelerate integration with FP8 training
  - **`geneformer_lightning_context_parallel/`**: PyTorch Lightning with context parallelism for long sequences

Lines changed: 7 additions & 4 deletions
@@ -1,14 +1,17 @@
  defaults:
  - defaults
+ - _self_

  model_tag: "nvidia/esm2_t6_8M_UR50D"
- stop_after_n_steps: 4
+ stop_after_n_steps: 250
+
  trainer:
  run_name: "esm2_t6_8M_UR50D_sanity"
  per_device_train_batch_size: 2
  per_device_eval_batch_size: 2
- save_steps: 2
- eval_steps: 2
- logging_steps: 1
+ save_steps: 1000
+ eval_steps: 1000
+ logging_steps: 10
  report_to: "none"
  dataloader_num_workers: 0
+ warmup_steps: 0
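
Assuming this file is a Hydra config, the added `_self_` entry marks where the file's own keys are merged relative to the other entries in the defaults list. Entries are merged in order, so placing `_self_` last lets the values in this file override those from `defaults`. The sketch below models that ordering with plain dictionaries and does not use Hydra itself. The `compose` and `deep_merge` helpers and the values in `BASE` are hypothetical; only the values in `OWN` come from the diff.

```python
from typing import Any, Dict, List

Config = Dict[str, Any]


def deep_merge(base: Config, override: Config) -> Config:
    """Recursively merge two configs; values from `override` win on conflict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def compose(defaults_list: List[str], named: Dict[str, Config], own: Config) -> Config:
    """Merge layers in list order; `_self_` stands for the current file's own keys."""
    result: Config = {}
    for entry in defaults_list:
        layer = own if entry == "_self_" else named[entry]
        result = deep_merge(result, layer)
    return result


# Hypothetical contents of the base `defaults` config (not shown in the diff).
BASE: Config = {"stop_after_n_steps": 10, "trainer": {"logging_steps": 1, "report_to": "wandb"}}
# Values taken from the new side of the diff.
OWN: Config = {"stop_after_n_steps": 250, "trainer": {"logging_steps": 10, "report_to": "none"}}

if __name__ == "__main__":
    print(compose(["defaults", "_self_"], {"defaults": BASE}, OWN))
```

Reversing the list to `["_self_", "defaults"]` in this model makes the base values win instead, which is why the position of `_self_` matters.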
