Commit 579705f

Merge remote-tracking branch 'origin' into dorotat/update-jet-tests
2 parents 7000320 + cd74c2b commit 579705f

99 files changed

Lines changed: 1676 additions & 624 deletions

README.md

Lines changed: 2 additions & 0 deletions
@@ -82,6 +82,8 @@ With a locally cloned repository and initialized submodules, build the BioNeMo c
 docker buildx build . -t my-container-tag
 ```

+If you see an error message like `No file descriptors available (os error 24)`, add the option `--ulimit nofile=65535:65535` to the docker build command.
+
 #### VSCode Devcontainer for Interactive Debugging

 We distribute a [development container](https://devcontainers.github.io/) configuration for vscode

bionemo-recipes.md

Lines changed: 4 additions & 4 deletions
@@ -8,7 +8,7 @@ The biological AI community is actively prototyping model architectures and need

 - **Flexible scaling**: Scale from single-GPU prototyping to multi-node training without complex parallelism configurations
 - **Framework compatibility**: Works with popular frameworks like HuggingFace Accelerate, PyTorch Lightning, and vanilla PyTorch
-- **Performance optimization**: Leverages TransformerEngine and nvFSDP for state-of-the-art training efficiency
+- **Performance optimization**: Leverages TransformerEngine and megatron-fsdp for state-of-the-art training efficiency
 - **Research-friendly**: Hackable, readable code that researchers can easily adapt for their experiments

 ### Use Cases
@@ -35,7 +35,7 @@ Example models include ESM-2, Geneformer, and AMPLIFY.
 Self-contained training examples demonstrating best practices for scaling biological foundation models. Each recipe is a complete Docker container with:

 - **Framework examples**: Vanilla PyTorch, HuggingFace Accelerate, PyTorch Lightning
-- **Feature demonstrations**: FP8 training, nvFSDP, context parallelism, sequence packing
+- **Feature demonstrations**: FP8 training, megatron-fsdp, context parallelism, sequence packing
 - **Scaling strategies**: Single-GPU to multi-node training patterns
 - **Benchmarked performance**: Validated throughput and convergence metrics

@@ -57,7 +57,7 @@ tokenizer = AutoTokenizer.from_pretrained("nvidia/AMPLIFY_120M")

 ```bash
 # Navigate to a recipe
-cd recipes/esm2_native_te_nvfsdp
+cd recipes/esm2_native_te_mfsdp

 # Build and run
 docker build -t esm2_recipe .
@@ -191,4 +191,4 @@ For technical support and questions:

 - Check existing issues before opening a new one
 - Review our training recipes for implementation examples
-- Consult the TransformerEngine and nvFSDP documentation for underlying technologies
+- Consult the TransformerEngine and megatron-fsdp documentation for underlying technologies

ci/benchmarks/partial-conv/evo2_finetuning.yaml

Lines changed: 3 additions & 2 deletions
@@ -89,10 +89,11 @@ script: |-
     --devices=${gpus} \
     --num-nodes=${nodes} \
     --val-check-interval=${val_check_interval} \
-    --wandb-project=${wandb_project_name} \
-    --wandb-group=${model}_${variant}_${config_name}_${task}_${target} \
     --create-tensorboard-logger \
     --activation-checkpoint-recompute-num-layers=${activation_checkpoint_layers} \
     --disable-checkpointing \
     --early-stop-on-step=${stop_steps} \
+    --wandb-project=${wandb_project_name} \
+    --wandb-group=${model}_${variant}_${config_name}_${task}_${target} \
+    --wandb-job-type=${pipeline_label} \
     --garbage-collect-at-inference;

ci/benchmarks/perf/esm2_pretrain.yaml

Lines changed: 16 additions & 4 deletions
@@ -41,11 +41,23 @@ script_args:
       tp: 1
       dfpnl: ""
 script: |-
+  COPY_FLAG="/tmp/copy_done_${{SLURMD_NODENAME}}";
+  NEW_DATA_PATH="/dev/shm/data_path_${{SLURMD_NODENAME}}";
+  if [ "$SLURM_LOCALID" = "0" ]; then
+    df -h;
+    echo $NEW_DATA_PATH;
+    time cp -r ${data_path}/ $NEW_DATA_PATH;
+    touch $COPY_FLAG
+  fi
+  # All ranks wait until install flag file appears
+  while [ ! -f $COPY_FLAG ]; do
+    sleep 1
+  done
   WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
-    --train-cluster-path=${data_path}/train_clusters.parquet \
-    --train-database-path=${data_path}/train.db \
-    --valid-cluster-path=${data_path}/valid_clusters.parquet \
-    --valid-database-path=${data_path}/validation.db \
+    --train-cluster-path=$NEW_DATA_PATH/train_clusters.parquet \
+    --train-database-path=$NEW_DATA_PATH/train.db \
+    --valid-cluster-path=$NEW_DATA_PATH/valid_clusters.parquet \
+    --valid-database-path=$NEW_DATA_PATH/validation.db \
     --micro-batch-size=${batch_size} \
     --num-nodes=${nodes} \
     --num-gpus=${gpus} \
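
The staging logic added to `script` follows a flag-file pattern: the process with `SLURM_LOCALID` 0 copies the dataset to node-local `/dev/shm` and then touches a flag file, while every local rank polls for that file before launching training. A minimal Python sketch of the same idea is shown here. The function name `stage_data`, its parameters, and the optional timeout are illustrative assumptions and are not part of this commit; the shell version polls indefinitely.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def stage_data(
    src: Path,
    dst: Path,
    flag: Path,
    local_rank: int,
    poll_seconds: float = 1.0,
    timeout: Optional[float] = None,
) -> Path:
    """Copy src to dst on local rank 0; every rank returns once the flag exists."""
    if local_rank == 0:
        # Only one process per node performs the copy.
        shutil.copytree(src, dst, dirs_exist_ok=True)
        # The flag is created only after the copy finishes,
        # so waiting ranks never start on partially copied data.
        flag.touch()

    start = time.monotonic()
    while not flag.exists():
        if timeout is not None and time.monotonic() - start > timeout:
            raise TimeoutError(f"{flag} did not appear within {timeout} seconds")
        time.sleep(poll_seconds)
    return dst
```

Because the flag is the only synchronization signal, a stale flag left over from an earlier run on the same node would let waiting ranks proceed immediately; a real job may want to include a job identifier in the flag name.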

ci/benchmarks/perf/geneformer_pretrain.yaml

Lines changed: 14 additions & 2 deletions
@@ -27,8 +27,20 @@ script_args:
       batch_size: 32

 script: |-
-  WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
-    --data-dir ${data_path} \
+  COPY_FLAG="/tmp/copy_done_${{SLURMD_NODENAME}}";
+  NEW_DATA_PATH="/dev/shm/data_path_${{SLURMD_NODENAME}}";
+  if [ "$SLURM_LOCALID" = "0" ]; then
+    df -h;
+    echo $NEW_DATA_PATH;
+    time cp -r ${data_path}/ $NEW_DATA_PATH;
+    touch $COPY_FLAG
+  fi
+  # All ranks wait until install flag file appears
+  while [ ! -f $COPY_FLAG ]; do
+    sleep 1
+  done
+  WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
+    --data-dir $NEW_DATA_PATH \
     --experiment-name ${batch_size}bs_${nodes}node_${gpus}gpu_${max_steps}s_${precision}prec \
     --num-gpus ${gpus} \
     --save-last-checkpoint \

models/.ruff.toml

Lines changed: 0 additions & 1 deletion
@@ -46,7 +46,6 @@ exclude = [
     "dist",
     "node_modules",
     "venv",
-    "packages/nvFSDP/",
 ]

 # Ignore import violations in all `__init__.py` files.

models/amplify/.devcontainer/devcontainer.json

Lines changed: 2 additions & 5 deletions
@@ -2,14 +2,11 @@
 // README at: https://github.com/devcontainers/templates/tree/main/src/docker-existing-dockerfile
 {
   "name": "Existing Dockerfile",
-  "build": {
-    "context": "..",
-    "dockerfile": "Dockerfile.dev"
-  },
+  "image": "svcbionemo023/bionemo-framework:amplify-model-devcontainer-082025",
   "mounts": [
     "source=${localEnv:HOME}/.cache,target=/home/ubuntu/.cache,type=bind,consistency=cached"
   ],
-  "postCreateCommand": "pip install -e .[convert,test]",
+  "postCreateCommand": "PIP_CONSTRAINT= pip install -e .",
   "remoteUser": "ubuntu",
   "runArgs": [
     "--gpus=all",

models/amplify/Dockerfile

Lines changed: 1 addition & 2 deletions
@@ -4,8 +4,7 @@ FROM nvcr.io/nvidia/pytorch:25.01-py3
 RUN MAX_JOBS=4 pip --disable-pip-version-check --no-cache-dir install -v git+https://github.com/facebookresearch/xformers.git@v0.0.29.post1#egg=xformers
 RUN PIP_CONSTRAINT= NVTE_FRAMEWORK=pytorch MAX_JOBS=4 pip --disable-pip-version-check --no-cache-dir install -v git+https://github.com/nvidia/TransformerEngine.git@v2.4

-COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
 WORKDIR /workspace/bionemo
 COPY . .
-RUN --mount=type=cache,target=/root/.cache/uv \
+RUN --mount=type=cache,target=/root/.cache/pip \
     PIP_CONSTRAINT= pip install -e .

models/amplify/export.py

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@
     # Smoke test that the model can be loaded.
     model_te = AutoModelForMaskedLM.from_pretrained(
         f"./checkpoint_export/{tag}",
-        torch_dtype=torch.bfloat16,
+        dtype=torch.bfloat16,
         trust_remote_code=True,
     )
     del model_te

models/amplify/src/amplify/amplify_te.py

Lines changed: 9 additions & 13 deletions
@@ -147,17 +147,15 @@ def __init__(self, config: AMPLIFYConfig, **kwargs):
             config.padded_vocab_size,
             config.hidden_size,
             padding_idx=config.pad_token_id,
-            dtype=config.torch_dtype,
+            dtype=config.dtype,
         )

         if config.layer_norm_after_embedding:
             self.layer_norm_1 = (
-                transformer_engine.pytorch.RMSNorm(
-                    config.hidden_size, config.norm_eps, params_dtype=config.torch_dtype
-                )
+                transformer_engine.pytorch.RMSNorm(config.hidden_size, config.norm_eps, params_dtype=config.dtype)
                 if config.rms_norm
                 else transformer_engine.pytorch.LayerNorm(
-                    config.hidden_size, config.norm_eps, params_dtype=config.torch_dtype
+                    config.hidden_size, config.norm_eps, params_dtype=config.dtype
                 )
             )

@@ -169,6 +167,9 @@ def __init__(self, config: AMPLIFYConfig, **kwargs):
             intermediate_size = int(2 * config.intermediate_size / 3)
             intermediate_size = multiple_of * ((intermediate_size + multiple_of - 1) // multiple_of)

+        else:
+            intermediate_size = config.intermediate_size
+
         self.transformer_encoder = nn.ModuleList()
         for layer_num in range(config.num_hidden_layers):
             self.transformer_encoder.append(
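
The added `else` branch gives `intermediate_size` a value on both paths; in the lines shown, it is otherwise assigned only inside the preceding branch. The sketch below reproduces the resulting computation in isolation. It assumes the branch is selected by a boolean, called `scale_for_gated_mlp` here (a hypothetical name, since the actual condition lies outside the hunk), and that `multiple_of` is a positive integer whose value the hunk does not show.

```python
def resolve_intermediate_size(
    intermediate_size: int,
    scale_for_gated_mlp: bool,  # hypothetical flag; the real condition is not in the diff
    multiple_of: int,           # assumed positive; its value is not in the diff
) -> int:
    """Return the feed-forward width used when building the encoder layers."""
    if scale_for_gated_mlp:
        # Scale to two thirds, then round up to the next multiple of `multiple_of`.
        scaled = int(2 * intermediate_size / 3)
        return multiple_of * ((scaled + multiple_of - 1) // multiple_of)
    # Path added by this commit: use the configured size unchanged.
    return intermediate_size


print(resolve_intermediate_size(2560, True, 8))   # 1712
print(resolve_intermediate_size(2560, False, 8))  # 2560
```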
@@ -194,7 +195,7 @@ def __init__(self, config: AMPLIFYConfig, **kwargs):
                     window_size=(-1, -1),
                     rotary_pos_interleaved=True,
                     seq_length=config.max_length,
-                    params_dtype=config.torch_dtype,
+                    params_dtype=config.dtype,
                 )
             )

@@ -212,7 +213,6 @@ def forward(
         output_hidden_states=False,
         output_attentions=False,
         labels=None,
-        **kwargs,
     ) -> BaseModelOutput:
         """Forward pass of the AMPLIFY model.

@@ -222,7 +222,6 @@ def forward(
             output_hidden_states (bool): Whether to output the hidden states.
             output_attentions (bool): Whether to output the attention weights.
             labels (torch.Tensor): The labels.
-            **kwargs: Additional arguments.

         Returns:
             BaseModelOutput: The output of the model.
@@ -277,7 +276,7 @@ def __init__(self, config: AMPLIFYConfig, **kwargs):
                 config.hidden_size,
                 config.padded_vocab_size,
                 config.norm_eps,
-                params_dtype=config.torch_dtype,
+                params_dtype=config.dtype,
                 normalization="RMSNorm" if config.rms_norm else "LayerNorm",
                 init_method=lambda x: torch.nn.init.uniform_(
                     x, -self.config.decoder_init_range, self.config.decoder_init_range
@@ -286,7 +285,7 @@ def __init__(self, config: AMPLIFYConfig, **kwargs):

         else:
             self.decoder = transformer_engine.pytorch.Linear(
-                config.hidden_size, config.vocab_size, params_dtype=config.torch_dtype
+                config.hidden_size, config.vocab_size, params_dtype=config.dtype
             )

     def forward(
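
The hunks above replace every read of `config.torch_dtype` with `config.dtype`. Code that must work with config objects exposing either attribute could resolve the value with a fallback. The helper below is a hypothetical sketch and not part of this commit; it uses a plain `SimpleNamespace` stand-in instead of a real config class, and strings in place of `torch` dtypes, so that it runs without extra dependencies.

```python
from types import SimpleNamespace
from typing import Any


def resolve_dtype(config: Any, default: Any = None) -> Any:
    """Return config.dtype if set, else config.torch_dtype, else default."""
    for name in ("dtype", "torch_dtype"):
        value = getattr(config, name, None)
        if value is not None:
            return value
    return default


new_style = SimpleNamespace(dtype="bfloat16")       # attribute name used after this commit
old_style = SimpleNamespace(torch_dtype="float16")  # attribute name used before this commit

print(resolve_dtype(new_style))
print(resolve_dtype(old_style))
print(resolve_dtype(SimpleNamespace(), default="float32"))
```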
@@ -296,7 +295,6 @@ def forward(
         output_hidden_states=False,
         output_attentions=False,
         labels=None,
-        **kwargs,
     ) -> MaskedLMOutput:
         """Forward pass of the AMPLIFYForMaskedLM model.

@@ -306,7 +304,6 @@ def forward(
             output_hidden_states (bool): Whether to output the hidden states.
             output_attentions (bool): Whether to output the attention weights.
             labels (torch.Tensor): The labels.
-            **kwargs: Additional arguments.

         Returns:
             MaskedLMOutput: The output of the model.
@@ -317,7 +314,6 @@ def forward(
             output_hidden_states,
             output_attentions,
             labels,
-            **kwargs,
         )

         # Classification head with layer norm
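
Removing `**kwargs` from both `forward` signatures changes how unexpected keyword arguments are handled: a catch-all parameter accepts them, while a strict signature raises a `TypeError` at the call site. The toy functions below are hypothetical stand-ins, not the model code, and only illustrate that difference.

```python
def forward_with_kwargs(input_ids, labels=None, **kwargs):
    # Extra keyword arguments are accepted; this toy version ignores them.
    return input_ids


def forward_strict(input_ids, labels=None):
    # No catch-all parameter: unknown keyword arguments are rejected by Python.
    return input_ids


forward_with_kwargs([1, 2, 3], position_ids=None)  # accepted

try:
    forward_strict([1, 2, 3], position_ids=None)
except TypeError as exc:
    print(f"rejected: {exc}")
```

Callers that previously passed extra keyword arguments to these methods would need to drop them after this change.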
