[BENCHMARKS] fixing evo2 finetune config (#1097)

dorotat-nv · web-flow · commit 223b2111d93f · 2025-09-09T11:08:34.000Z
### Description

1. Adding job type to wandb experiment for evo2 to improve grouping of experiments in wandb UI
2. Standardising how data is handled for esm2 and geneformer

### Type of changes

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

### Usage

```python
# TODO: Add code snippet
```

### Pre-submit Checklist

- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully

## Summary by CodeRabbit

* **Performance**
  * Stage datasets to node-local in-memory storage with per-node synchronization to reduce I/O contention; training reads from the staged path and root task emits basic diagnostics.
* **Chores**
  * Add a coordinated pre-install step to ensure a required wheel package is installed once per node before training.
  * Enhance experiment tracking by adding a job-type tag to run metadata.

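The per-node staging described in the summary can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `stage_dataset` and its arguments are hypothetical names, and the demo uses temporary directories in place of `/dev/shm` and the Slurm variables (`SLURM_LOCALID`, `SLURMD_NODENAME`) that the benchmark scripts rely on.

```shell
# Hypothetical helper (not part of this PR): the rank with local ID 0 copies
# the dataset to node-local storage and then creates a flag file; every rank,
# including rank 0, polls until the flag file exists.
stage_dataset() {
    local src="$1" dest="$2" flag="$3" local_rank="$4"
    if [ "$local_rank" = "0" ]; then
        mkdir -p "$dest"
        # "src/." copies the contents of src into dest instead of nesting
        # a src directory inside dest when dest already exists.
        cp -r "$src"/. "$dest"/
        touch "$flag"
    fi
    # Wait for the copy flag before reading from the staged path.
    while [ ! -f "$flag" ]; do
        sleep 1
    done
}

# Demo with temporary directories standing in for the shared dataset
# and the node-local target.
src="$(mktemp -d)"
echo "sample" > "$src/train.db"
work="$(mktemp -d)"
stage_dataset "$src" "$work/data" "$work/copy_done" 0
ls "$work/data"
```

In a Slurm job the last argument would presumably be `$SLURM_LOCALID`, with the destination and flag paths keyed by node name as in the diff below. One thing that may be worth checking with this pattern is whether a flag file left in `/tmp` by an earlier job on the same node could release the waiting ranks before a fresh copy has finished; including a job identifier in the flag name is one possible way to avoid that.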
diff --git a/ci/benchmarks/partial-conv/evo2_finetuning.yaml b/ci/benchmarks/partial-conv/evo2_finetuning.yaml
@@ -89,10 +89,11 @@ script: |-
     --devices=${gpus} \
     --num-nodes=${nodes} \
     --val-check-interval=${val_check_interval} \
-    --wandb-project=${wandb_project_name} \
-    --wandb-group=${model}_${variant}_${config_name}_${task}_${target} \
     --create-tensorboard-logger \
     --activation-checkpoint-recompute-num-layers=${activation_checkpoint_layers} \
     --disable-checkpointing \
     --early-stop-on-step=${stop_steps} \
+    --wandb-project=${wandb_project_name} \
+    --wandb-group=${model}_${variant}_${config_name}_${task}_${target} \
+    --wandb-job-type=${pipeline_label} \
     --garbage-collect-at-inference;
diff --git a/ci/benchmarks/perf/esm2_pretrain.yaml b/ci/benchmarks/perf/esm2_pretrain.yaml
@@ -41,11 +41,23 @@ script_args:
       tp: 1
       dfpnl: ""
 script: |-
+  COPY_FLAG="/tmp/copy_done_${{SLURMD_NODENAME}}";
+  NEW_DATA_PATH="/dev/shm/data_path_${{SLURMD_NODENAME}}";
+  if [ "$SLURM_LOCALID" = "0" ]; then
+      df -h;
+      echo $NEW_DATA_PATH;
+      time cp -r ${data_path}/ $NEW_DATA_PATH;
+      touch $COPY_FLAG
+  fi
+  # All ranks wait until install flag file appears
+  while [ ! -f $COPY_FLAG ]; do
+      sleep 1
+  done
   WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
-    --train-cluster-path=${data_path}/train_clusters.parquet \
-    --train-database-path=${data_path}/train.db \
-    --valid-cluster-path=${data_path}/valid_clusters.parquet \
-    --valid-database-path=${data_path}/validation.db \
+    --train-cluster-path=$NEW_DATA_PATH/train_clusters.parquet \
+    --train-database-path=$NEW_DATA_PATH/train.db \
+    --valid-cluster-path=$NEW_DATA_PATH/valid_clusters.parquet \
+    --valid-database-path=$NEW_DATA_PATH/validation.db \
     --micro-batch-size=${batch_size} \
     --num-nodes=${nodes} \
     --num-gpus=${gpus} \
diff --git a/ci/benchmarks/perf/geneformer_pretrain.yaml b/ci/benchmarks/perf/geneformer_pretrain.yaml
@@ -27,8 +27,20 @@ script_args:
       batch_size: 32
 
 script: |-
-   WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
-    --data-dir ${data_path} \
+  COPY_FLAG="/tmp/copy_done_${{SLURMD_NODENAME}}";
+  NEW_DATA_PATH="/dev/shm/data_path_${{SLURMD_NODENAME}}";
+  if [ "$SLURM_LOCALID" = "0" ]; then
+      df -h;
+      echo $NEW_DATA_PATH;
+      time cp -r ${data_path}/ $NEW_DATA_PATH;
+      touch $COPY_FLAG
+  fi
+  # All ranks wait until install flag file appears
+  while [ ! -f $COPY_FLAG ]; do
+      sleep 1
+  done
+  WANDB_API_KEY=$BIONEMO_WANDB_API_KEY ${variant}_${model} \
+    --data-dir $NEW_DATA_PATH \
     --experiment-name ${batch_size}bs_${nodes}node_${gpus}gpu_${max_steps}s_${precision}prec \
     --num-gpus ${gpus} \
     --save-last-checkpoint \