Commit f873667

Initial version porting the eden configs over to the new evo2 recipe (#1502)
### Description

This PR adds Eden (Llama 3.1) model support, Savanna/Vortex checkpoint converters, and a standardized model naming convention to the Megatron Bridge-based Evo2 recipe (`bionemo-recipes/recipes/evo2_megatron/`).

**Eden (Llama 3.1) model support**

- New `eden_provider.py` defining `EdenModelProvider` and size-specific subclasses (`eden_7b` through `eden_35b`) that inherit from `Llama31ModelProvider`.
- `train.py` now dispatches to `gpt_forward_step` for Eden models and automatically disables `fp32_residual_connection` (incompatible with standard TE `LayerNormLinear` layers; Hyena handles this via manual dtype casting, but GPT/Llama does not).
- `infer.py` now initializes `ProcessGroupCollection` for non-Hyena providers (required by `GPTModelProvider.provide()`) and uses `StaticInferenceContext` instead of `HyenaInferenceContext` for Eden models. The `flash_decode` attribute is guarded to Hyena-only.
- `predict.py` already worked architecture-agnostically via dynamic model loading; no changes required.

**Checkpoint converters**

- `savanna_to_mbridge.py`: converts ARC Savanna `.pt` checkpoints (local or downloaded from Hugging Face via `hf_hub_download`) into MBridge distributed checkpoint format.
- `mbridge_to_vortex.py`: exports MBridge checkpoints to ARC's single-file Vortex inference format, handling MLP weight splitting, Hyena filter pole/residue computation, and TE layernorm key remapping.
- Both are registered as console scripts (`evo2_convert_savanna_to_mbridge`, `evo2_export_mbridge_to_vortex`).

**Model naming convention**

The previous model size keys (`1b`, `7b`, `40b`, `7b_arc_longcontext`, …) were ambiguous: `7b` referred to Striped Hyena while `7B` referred to Llama. This PR replaces them with explicit, architecture-prefixed keys:

- `evo2_*` for models matching public ARC checkpoints (e.g. `evo2_1b_base`, `evo2_7b`, `evo2_40b_base`). `_base` = 8K context, without it = 1M context.
- `striped_hyena_*_nv` for NVIDIA-modified Hyena variants.
- `eden_*` for Llama 3.1 variants.
- Added `evo2_20b` config based on `arcinstitute/savanna_evo2_20b`.

**Documentation updates**

- `README.md`: added model naming convention tables, Vortex export section with round-trip example, updated all CLI examples to new model keys.
- `checkpoint/README.md`: updated `--model-size` documentation.
- Both Jupyter notebooks (`zeroshot_brca1.ipynb`, `fine-tuning-tutorial.ipynb`): updated `MODEL_SIZE` and `--model-size` references.

#### Usage

Training an Eden model:

```bash
torchrun --nproc-per-node 1 --no-python train_evo2 \
  --model-size eden_7b --num-layers 2 --max-steps 5 \
  --mock-data --seq-length 64 --mixed-precision-recipe bf16_mixed \
  --no-activation-checkpointing
```

Converting Savanna checkpoint to MBridge:

```bash
evo2_convert_savanna_to_mbridge \
  --savanna-ckpt-path arcinstitute/savanna_evo2_1b_base \
  --mbridge-ckpt-dir /tmp/mbridge_1b \
  --model-size evo2_1b_base \
  --tokenizer-path tokenizers/nucleotide_fast_tokenizer_256
```

Exporting MBridge to Vortex:

```bash
evo2_export_mbridge_to_vortex \
  --mbridge-ckpt-dir /tmp/mbridge_1b/iter_0000001 \
  --output-path /tmp/evo2_1b_vortex.pt \
  --model-size evo2_1b_base
```

### Type of changes

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [x] Refactor
- [x] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

- [ciflow:skip](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:skip) - Skip all CI tests for this PR
- [ciflow:notebooks](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:notebooks) - Run Jupyter notebooks execution tests for bionemo2
- [ciflow:slow](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:slow) - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
- [ciflow:all](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all) - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
- [ciflow:all-recipes](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all-recipes) - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline.

For more details, see [CONTRIBUTING](CONTRIBUTING.md)

> [!NOTE]
> By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

#### Triggering Code Rabbit AI Review

To trigger a code review from code rabbit, comment on a pull request with one of these commands:

- @coderabbitai review - Triggers a standard review
- @coderabbitai full review - Triggers a comprehensive review

See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.

### Pre-submit Checklist

- [x] I have tested these changes locally
- [x] I have updated the documentation accordingly
- [x] I have added/updated tests as needed
- [x] All existing tests pass successfully

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added Eden (Llama 3.1) model family support alongside existing Hyena models (11B–35B variants).
  * Added checkpoint conversion utilities: Savanna-to-MBridge and MBridge-to-Vortex exporters with CLI tools.
* **Documentation**
  * Updated model naming convention with Evo2 prefixes (e.g., `evo2_1b_base`, `evo2_7b`).
  * Expanded documentation for checkpoint conversion workflows and available models.
* **Tests**
  * Added comprehensive test coverage for model providers, checkpoint conversions, and Eden inference/prediction workflows.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: John St. John <jstjohn@nvidia.com>
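
For readers skimming the naming convention in the commit message, the prefix rules can be sketched roughly as follows. This is only an illustration of the convention as described, not code from the recipe: `ModelKeyInfo` and `describe_model_key` are hypothetical names, and the recipe itself may resolve `--model-size` keys quite differently (for example through a lookup table of provider classes).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelKeyInfo:
    """Hypothetical summary of what a model key implies (illustration only)."""

    family: str
    context: Optional[str]  # the commit message states context only for evo2_* keys


def describe_model_key(key: str) -> ModelKeyInfo:
    """Classify a model key by its architecture prefix, per the commit message."""
    if key.startswith("evo2_"):
        # `_base` = 8K context, without it = 1M context.
        context = "8K" if key.endswith("_base") else "1M"
        return ModelKeyInfo("evo2 (public ARC checkpoint)", context)
    if key.startswith("striped_hyena_") and key.endswith("_nv"):
        return ModelKeyInfo("striped_hyena (NVIDIA-modified)", None)
    if key.startswith("eden_"):
        return ModelKeyInfo("eden (Llama 3.1)", None)
    # Old-style keys such as `7b` carry no architecture prefix.
    raise ValueError(f"model key without a known architecture prefix: {key!r}")


if __name__ == "__main__":
    for key in ("evo2_1b_base", "evo2_7b", "eden_7b"):
        print(key, "->", describe_model_key(key))
```

Under this sketch, a bare key such as `7b` is rejected rather than guessed at, which mirrors the ambiguity the commit message sets out to remove.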
1 parent 52310f6 commit f873667

26 files changed

Lines changed: 5135 additions & 1812 deletions

bionemo-recipes/recipes/evo2_megatron/README.md

Lines changed: 435 additions & 44 deletions
Large diffs are not rendered by default.

bionemo-recipes/recipes/evo2_megatron/examples/.dockerignore

Lines changed: 3 additions & 0 deletions
@@ -7,7 +7,10 @@
 *.yaml
 
 # directories created during these notebook runs.
+evo2_20b_finetune/
+savanna_20b_download/
 nemo2_evo2_1b_8k/
+evo2_1b_bf16_mbridge/
 preprocessed_data/
 pretraining_demo/
 brca1_fasta_files/

bionemo-recipes/recipes/evo2_megatron/examples/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,8 @@
 *.yaml
 
 # directories created during these notebook runs.
+evo2_20b_finetune/
+savanna_20b_download/
 nemo2_evo2_1b_8k/
 evo2_1b_bf16_mbridge/
 preprocessed_data/

bionemo-recipes/recipes/evo2_megatron/examples/fine-tuning-tutorial.ipynb

Lines changed: 1145 additions & 1155 deletions
Large diffs are not rendered by default.
