
Commit a29272f

nvmvle, jwilber, and My Le authored
feat: Add Evo2 fine-tuning partial-conv benchmarking (#1028)
### Description

This PR adds a benchmarking configuration for Evo2 fine-tuning with partial convolution support. The changes include:

- A new benchmarking YAML configuration file for CI/CD pipeline to test Evo2 fine-tuning performance
- Updates to the Evo2 training script to support the benchmarking workflow

### Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

* If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
* If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

### Usage

```python
# Example usage of the benchmarking configuration
# The benchmarking can be triggered through CI/CD pipeline
# using the new YAML configuration at:
# ci/benchmarks/partial-conv/evo2_finetuning.yaml
```

### Pre-submit Checklist

- [x] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully

Signed-off-by: My Le <mvle@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added an optional command-line flag to clean up GPU memory before validation/inference, improving training stability on CUDA devices; the mechanism to perform this cleanup is now included.
* **Chores**
  * Added benchmark configurations to run Evo2 finetuning variants on partial-conv with preset training parameters and logging.
  * Fixed a pretraining benchmark URL token placeholder to use the correct syntax for artifact access.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: nvmvle <mvle@nvidia.com>
Signed-off-by: Jared Wilber <jwilber@nvidia.com>
Co-authored-by: Jared Wilber <jwilber@nvidia.com>
Co-authored-by: My Le <mvle@login-eos01.eos.clusters.nvidia.com>
1 parent 345d3cc commit a29272f

4 files changed

Lines changed: 145 additions & 1 deletion


ci/benchmarks/partial-conv/evo2_finetuning.yaml

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
+scope: partial-conv
+time_limit: 14400
+key_segments:
+  # Modify keys to be renamed (str) or excluded (False) from run identifier. By default, all args under script_args are included.
+  dataset_config: False
+  dataset_dir: False
+  data_base_path: False
+  num_workers: False
+  limit_val_batches: False
+  val_check_interval: False
+  experiment_name: False
+  workspace: False
+  restore_from_checkpoint_path: False
+  activation_checkpoint_layers: False
+  lora_enabled: False
+  lr: False
+  min_lr: False
+  warmup_steps: False
+  accumulate_grad_batches: False
+  clip_grad: False
+  weight_decay: False
+  attention_dropout: False
+  hidden_dropout: False
+  precision: False
+  seq_length: False
+script_args:
+  # All arguments referenced in the script string must be specified here.
+  # Arguments not referenced in the script string must have the 'arg' field specified.
+  # See jet/core/configs.py for the specification of the configuration class
+  workspace: /workspace/bionemo2
+  data_base_path: /data/evo2
+  restore_from_checkpoint_path: checkpoints/nemo2_evo2_1b_8k
+  nodes: 1
+  model: evo2
+  config_name: 1b
+  num_workers: 1
+  limit_val_batches: 20
+  dataset_config: training_data_config.yaml
+  dataset_dir: preprocessed_data
+  val_check_interval: 5
+  seq_length: 8192
+  warmup_steps: 10
+  activation_checkpoint_layers: 2
+  lr: 0.000015
+  min_lr: 0.0000149
+  accumulate_grad_batches: 4
+  max_steps: 1000
+  gpus: 1
+  clip_grad: 250
+  weight_decay: 0.001
+  attention_dropout: 0.01
+  hidden_dropout: 0.01
+  stop_steps: 100
+  batch_size: 2
+  variant: finetune
+  precision: fp8
+  products:
+    - variant: finetune
+      lora_enabled: ""
+      task: finetune_from_ckpt
+      experiment_name: evo2-finetune
+    - variant: lora_finetune
+      lora_enabled: "--lora-finetune"
+      task: lora_finetune_from_ckpt
+      experiment_name: evo2-lora-finetune
+script: |-
+  WANDB_API_KEY=$BIONEMO_WANDB_API_KEY train_${model} \
+    -d ${data_base_path}/${dataset_config} \
+    --dataset-dir=${data_base_path}/${dataset_dir} \
+    --ckpt-dir=${data_base_path}/${restore_from_checkpoint_path} \
+    ${lora_enabled} \
+    --model-size=${config_name} \
+    --max-steps=${max_steps} \
+    --experiment-name=${experiment_name}_${batch_size}bs_${nodes}node_${gpus}gpu_${max_steps}s \
+    --lr=${lr} \
+    --min-lr=${min_lr} \
+    --warmup-steps=${warmup_steps} \
+    --result-dir=${tensorboard_dir} \
+    --micro-batch-size=${batch_size} \
+    --grad-acc-batches=${accumulate_grad_batches} \
+    --limit-val-batches=${limit_val_batches} \
+    --seq-length=${seq_length} \
+    --clip-grad=${clip_grad} \
+    --wd=${weight_decay} \
+    --attention-dropout=${attention_dropout} \
+    --hidden-dropout=${hidden_dropout} \
+    --num-layers 4 \
+    --hybrid-override-pattern 'SDH*' \
+    --devices=${gpus} \
+    --num-nodes=${nodes} \
+    --val-check-interval=${val_check_interval} \
+    --wandb-project=${wandb_project_name} \
+    --wandb-group=${model}_${variant}_${config_name}_${task}_${target} \
+    --create-tensorboard-logger \
+    --activation-checkpoint-recompute-num-layers=${activation_checkpoint_layers} \
+    --disable-checkpointing \
+    --early-stop-on-step=${stop_steps} \
+    --garbage-collect-at-inference;
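
To make the role of `products` and the `${...}` placeholders concrete, here is a minimal Python sketch of how one command per product entry could be rendered from this configuration. It is not the pipeline's actual implementation: `render_commands` is a hypothetical helper, the template is a shortened stand-in for the real `script` string, and the use of `string.Template` is an assumption chosen only because its `${name}` syntax matches the YAML. Placeholders such as `${tensorboard_dir}`, `${wandb_project_name}`, and `${target}` are not defined under `script_args` and are presumably supplied by the runner, so they are left out here.

```python
from string import Template

# Subset of the scalar values under script_args in the YAML above.
base_args = {
    "model": "evo2",
    "config_name": "1b",
    "batch_size": 2,
    "nodes": 1,
    "gpus": 1,
    "max_steps": 1000,
}

# The two entries listed under `products`.
products = [
    {
        "variant": "finetune",
        "lora_enabled": "",
        "task": "finetune_from_ckpt",
        "experiment_name": "evo2-finetune",
    },
    {
        "variant": "lora_finetune",
        "lora_enabled": "--lora-finetune",
        "task": "lora_finetune_from_ckpt",
        "experiment_name": "evo2-lora-finetune",
    },
]

# Shortened stand-in for the `script` string (assumption: same ${name} syntax).
script = Template(
    "train_${model} ${lora_enabled} --model-size=${config_name} "
    "--experiment-name=${experiment_name}_${batch_size}bs_${nodes}node_${gpus}gpu_${max_steps}s"
)


def render_commands(base, product_list, template):
    """Hypothetical helper: render one command per product entry."""
    commands = []
    for product in product_list:
        merged = {**base, **product}  # product values override the shared ones
        rendered = template.substitute(merged)
        # An empty lora_enabled leaves a double space behind; normalize whitespace.
        commands.append(" ".join(rendered.split()))
    return commands


commands = render_commands(base_args, products, script)
for command in commands:
    print(command)
```

Under these assumptions the full fine-tuning run is named `evo2-finetune_2bs_1node_1gpu_1000s` and the LoRA run `evo2-lora-finetune_2bs_1node_1gpu_1000s`, with `--lora-finetune` present only in the second command.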

ci/benchmarks/partial-conv/evo2_pretrain.yaml

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ script_args:
   # See jet/core/configs.py for the specification of the configuration class
   workspace: /workspace/bionemo2
   data_path: /data/evo2
-  artefacts_url: https://__token__:${JET_GITLAB_TOKEN}@gitlab-master.nvidia.com/api/v4/projects/180496/packages/pypi/simple
+  artefacts_url: https://__token__:${{JET_GITLAB_TOKEN}}@gitlab-master.nvidia.com/api/v4/projects/180496/packages/pypi/simple
   file_name_wheel: subquadratic-ops
   model: evo2
   variant: train

sub-packages/bionemo-evo2/src/bionemo/evo2/run/train.py

Lines changed: 10 additions & 0 deletions
@@ -48,6 +48,7 @@
 
 from bionemo.evo2.models.mamba import MAMBA_MODEL_OPTIONS, MambaModel, mamba_no_weight_decay_cond_with_embeddings
 from bionemo.evo2.run.peft import Evo2LoRA
+from bionemo.evo2.utils.callbacks import GarbageCollectAtInferenceTime
 from bionemo.evo2.utils.config import hyena_no_weight_decay_cond_with_embeddings
 from bionemo.evo2.utils.logging.callbacks import TEVCallback
 from bionemo.llm.utils.datamodule_utils import infer_global_batch_size
@@ -506,6 +507,12 @@ def parse_args(args: Optional[List[str]] = None) -> argparse.Namespace:
         default=False,
         help="Skip checking for NaNs in gradients. Only use this for debugging purposes.",
     )
+    parser.add_argument(
+        "--garbage-collect-at-inference",
+        action="store_true",
+        default=False,
+        help="Enable CUDA memory cleanup before validation to prevent initialization errors.",
+    )
 
     recompute_group = parser.add_mutually_exclusive_group(required=False)
     recompute_group.add_argument("--no-activation-checkpointing", action="store_true", default=False)
@@ -645,6 +652,9 @@ def train(args: argparse.Namespace) -> nl.Trainer:
         TEVCallback(),
     ]
 
+    if args.garbage_collect_at_inference:
+        callbacks.append(GarbageCollectAtInferenceTime())
+
     if args.lora_finetune:
         callbacks.append(ModelTransform())
     if args.enable_preemption:
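
The new flag follows the usual `argparse` pattern: a `store_true` option whose dashes become underscores on the parsed namespace, which then gates whether the callback is appended. The sketch below isolates that pattern with the standard library only. `build_callback_names` is a hypothetical stand-in that collects class names as strings so the example runs without the bionemo or Lightning packages; it is not the code in `train.py`.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse only the flag added in this commit."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--garbage-collect-at-inference",
        action="store_true",
        default=False,
        help="Enable CUDA memory cleanup before validation to prevent initialization errors.",
    )
    return parser.parse_args(argv)


def build_callback_names(args: argparse.Namespace) -> List[str]:
    """Hypothetical stand-in for the callback list assembled in train()."""
    names = ["TEVCallback"]
    # argparse maps --garbage-collect-at-inference to garbage_collect_at_inference.
    if args.garbage_collect_at_inference:
        names.append("GarbageCollectAtInferenceTime")
    return names


print(build_callback_names(parse_args([])))
print(build_callback_names(parse_args(["--garbage-collect-at-inference"])))
```

Because the flag defaults to `False`, existing invocations of the training script are unaffected unless they opt in, as the fine-tuning benchmark script does.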
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: LicenseRef-Apache2
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import gc
+
+import torch
+from lightning.pytorch import Callback
+
+
+class GarbageCollectAtInferenceTime(Callback):
+    """Callback to clean up CUDA memory before validation to prevent initialization errors."""
+
+    def on_validation_start(self, trainer, pl_module) -> None:
+        """Clean up CUDA memory before validation to prevent initialization errors."""
+        if torch.cuda.is_available():
+            try:
+                torch.cuda.empty_cache()
+                torch.cuda.synchronize()
+                current_device = torch.cuda.current_device()
+                torch.cuda.set_device(current_device)
+                torch.cuda.synchronize()
+                gc.collect()
+            except Exception as e:
+                print(f"Warning: CUDA cleanup failed: {e}")
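
For readers adapting this callback, the following is one possible variant of the cleanup step, written as a plain function so it runs even where PyTorch is not installed. It is a sketch under stated assumptions, not the committed implementation: `cleanup_before_validation` is a hypothetical name, and the choices to run `gc.collect()` first, to log through `logging` instead of `print`, and to catch only `RuntimeError` are illustrative. The ordering rests on the fact that `torch.cuda.empty_cache()` can only release cached blocks that no live tensor still occupies, so collecting unreachable Python objects beforehand gives it more to release.

```python
import gc
import logging

try:
    import torch
except ImportError:  # PyTorch is optional for this sketch
    torch = None

logger = logging.getLogger(__name__)


def cleanup_before_validation() -> bool:
    """Hypothetical helper: return True if CUDA cleanup ran, False if it was skipped or failed."""
    # Drop unreachable Python objects first so their tensors stop occupying cached blocks.
    gc.collect()
    if torch is None or not torch.cuda.is_available():
        return False
    try:
        torch.cuda.synchronize()  # wait for pending kernels before releasing memory
        torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
    except RuntimeError as err:
        logger.warning("CUDA cleanup failed: %s", err)
        return False
    return True


print(cleanup_before_validation())
```

Inside a Lightning callback, a function like this could be called from `on_validation_start`, which is the hook the committed class uses.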
