Commit 223b211

[BENCHMARKS] fixing evo2 finetune config (#1097)
### Description

1. Adding job type to wandb experiment for evo2 to improve grouping of experiments in wandb UI
2. Standardising how data is handled for esm2 and geneformer

### Type of changes

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a `pull-request/`-prefixed branch in the source repository (e.g. `pull-request/123`).
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

### Usage

```python
# TODO: Add code snippet
```

### Pre-submit Checklist

- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully

## Summary by CodeRabbit

* **Performance**
  * Stage datasets to node-local in-memory storage with per-node synchronization to reduce I/O contention; training reads from the staged path and root task emits basic diagnostics.
* **Chores**
  * Add a coordinated pre-install step to ensure a required wheel package is installed once per node before training.
  * Enhance experiment tracking by adding a job-type tag to run metadata.
1 parent 0d162d5 commit 223b211

3 files changed

Lines changed: 33 additions & 8 deletions

ci/benchmarks/partial-conv/evo2_finetuning.yaml

Lines changed: 3 additions & 2 deletions
@@ -89,10 +89,11 @@ script: |-
  --devices=${gpus} \
  --num-nodes=${nodes} \
  --val-check-interval=${val_check_interval} \
- --wandb-project=${wandb_project_name} \
- --wandb-group=${model}_${variant}_${config_name}_${task}_${target} \
  --create-tensorboard-logger \
  --activation-checkpoint-recompute-num-layers=${activation_checkpoint_layers} \
  --disable-checkpointing \
  --early-stop-on-step=${stop_steps} \
+ --wandb-project=${wandb_project_name} \
+ --wandb-group=${model}_${variant}_${config_name}_${task}_${target} \
+ --wandb-job-type=${pipeline_label} \
  --garbage-collect-at-inference;
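
The hunk above adds a job type next to the existing W&B project and group. As a rough sketch of how the three values relate, the helper below assembles keyword arguments under the names `project`, `group`, and `job_type`, which are parameters that `wandb.init` accepts. Whether the evo2 training script forwards the flags to `wandb.init` in exactly this way is an assumption, and `build_wandb_kwargs` and all sample values are hypothetical.

```python
def build_wandb_kwargs(project, model, variant, config_name, task, target, pipeline_label):
    """Mirror the YAML template: the group is a joined identifier, the job type is the pipeline label."""
    # Same field order as ${model}_${variant}_${config_name}_${task}_${target} in the config.
    group = "_".join([model, variant, config_name, task, target])
    return {"project": project, "group": group, "job_type": pipeline_label}


# Hypothetical values, chosen only to show the resulting shape.
kwargs = build_wandb_kwargs(
    project="benchmarks",
    model="evo2",
    variant="finetune",
    config_name="1b",
    task="train",
    target="a100",
    pipeline_label="nightly",
)
print(kwargs)
```

With the `wandb` package installed, `wandb.init(**kwargs)` would start a run carrying these fields, so runs that share a group can be split further by job type in the UI.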

ci/benchmarks/perf/esm2_pretrain.yaml

Lines changed: 16 additions & 4 deletions
@@ -41,11 +41,23 @@ script_args:
  tp: 1
  dfpnl: ""
  script: |-
+ COPY_FLAG="/tmp/copy_done_${{SLURMD_NODENAME}}";
+ NEW_DATA_PATH="/dev/shm/data_path_${{SLURMD_NODENAME}}";
+ if [ "$SLURM_LOCALID" = "0" ]; then
+ df -h;
+ echo $NEW_DATA_PATH;
+ time cp -r ${data_path}/ $NEW_DATA_PATH;
+ touch $COPY_FLAG
+ fi
+ # All ranks wait until install flag file appears
+ while [ ! -f $COPY_FLAG ]; do
+ sleep 1
+ done
  WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
- --train-cluster-path=${data_path}/train_clusters.parquet \
- --train-database-path=${data_path}/train.db \
- --valid-cluster-path=${data_path}/valid_clusters.parquet \
- --valid-database-path=${data_path}/validation.db \
+ --train-cluster-path=$NEW_DATA_PATH/train_clusters.parquet \
+ --train-database-path=$NEW_DATA_PATH/train.db \
+ --valid-cluster-path=$NEW_DATA_PATH/valid_clusters.parquet \
+ --valid-database-path=$NEW_DATA_PATH/validation.db \
  --micro-batch-size=${batch_size} \
  --num-nodes=${nodes} \
  --num-gpus=${gpus} \
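
The staging logic in this hunk follows a copy-once-per-node pattern: the task with local ID 0 copies the dataset to node-local storage and creates a flag file, while the other tasks poll for that flag. The Python sketch below restates the pattern under a few assumptions and is not the benchmark's implementation. `stage_dataset`, its parameters, and the `.partial` scratch name are hypothetical. It also adds two behaviours the shell version does not have: the flag is only written after a successful copy, and waiting tasks give up after a timeout.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path


def stage_dataset(src, dst, flag, local_rank, poll_seconds=1.0, timeout_seconds=3600.0):
    """Copy src to dst once per node; every other rank blocks until the flag exists."""
    src, dst, flag = Path(src), Path(dst), Path(flag)
    if local_rank == 0:
        # Copy under a scratch name first so a half-finished copy is never read as the dataset.
        partial = dst.with_name(dst.name + ".partial")
        shutil.rmtree(partial, ignore_errors=True)
        shutil.rmtree(dst, ignore_errors=True)
        shutil.copytree(src, partial)  # raises on failure, so the flag below is skipped
        os.rename(partial, dst)
        flag.touch()
        return dst
    deadline = time.monotonic() + timeout_seconds
    while not flag.exists():
        if time.monotonic() > deadline:
            raise TimeoutError(f"{flag} did not appear within {timeout_seconds} s")
        time.sleep(poll_seconds)
    return dst


# Small self-contained demonstration using temporary directories instead of /dev/shm.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src = tmp / "shared" / "data"
    src.mkdir(parents=True)
    (src / "train.db").write_text("rows")
    dst = tmp / "shm" / "data_path_node0"
    flag = tmp / "copy_done_node0"

    staged = stage_dataset(src, dst, flag, local_rank=0)
    staged_files = sorted(p.name for p in staged.iterdir())
    flag_exists = flag.exists()
    # A non-zero local rank returns as soon as it sees the flag.
    waiter_path = stage_dataset(src, dst, flag, local_rank=1, timeout_seconds=5.0)
    print(staged_files, flag_exists, waiter_path == dst)
```

In a Slurm job the local rank would come from `int(os.environ["SLURM_LOCALID"])`. One thing to watch in any flag-file scheme like this: a flag left in `/tmp` by an earlier job on the same node would satisfy the wait immediately, so including a job identifier in the flag name, or removing the flag at job end, may be worth considering.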

ci/benchmarks/perf/geneformer_pretrain.yaml

Lines changed: 14 additions & 2 deletions
@@ -27,8 +27,20 @@ script_args:
  batch_size: 32

  script: |-
- WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
- --data-dir ${data_path} \
+ COPY_FLAG="/tmp/copy_done_${{SLURMD_NODENAME}}";
+ NEW_DATA_PATH="/dev/shm/data_path_${{SLURMD_NODENAME}}";
+ if [ "$SLURM_LOCALID" = "0" ]; then
+ df -h;
+ echo $NEW_DATA_PATH;
+ time cp -r ${data_path}/ $NEW_DATA_PATH;
+ touch $COPY_FLAG
+ fi
+ # All ranks wait until install flag file appears
+ while [ ! -f $COPY_FLAG ]; do
+ sleep 1
+ done
+ WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
+ --data-dir $NEW_DATA_PATH \
  --experiment-name ${batch_size}bs_${nodes}node_${gpus}gpu_${max_steps}s_${precision}prec \
  --num-gpus ${gpus} \
  --save-last-checkpoint \
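
The script prints `df -h` before copying but does not act on the output. Because `/dev/shm` is typically a RAM-backed tmpfs on Linux, a dataset that does not fit would fail partway through the copy. The sketch below shows one way such a pre-flight check could look, using the standard-library `shutil.disk_usage`. The helpers `tree_size_bytes` and `fits_on` and the `headroom` margin are hypothetical and not part of this commit.

```python
import os
import shutil
import tempfile
from pathlib import Path


def tree_size_bytes(root):
    """Sum the sizes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def fits_on(target_dir, required_bytes, headroom=0.1):
    """Return True if target_dir's filesystem has required_bytes free plus a safety margin."""
    free = shutil.disk_usage(target_dir).free
    return required_bytes * (1.0 + headroom) <= free


# Demonstration on a temporary directory; a real job would pass the dataset path and /dev/shm.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "a.bin").write_bytes(b"\0" * 1024)
    (data / "b.bin").write_bytes(b"\0" * 2048)

    dataset_bytes = tree_size_bytes(data)
    zero_fits = fits_on(tmp, 0)
    huge_fits = fits_on(tmp, 2**70)
    print(dataset_bytes, zero_fits, huge_fits)
```

A staging step could call `fits_on("/dev/shm", tree_size_bytes(data_path))` and fall back to reading from the shared path when the check fails.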
