Commit f873667
Initial version porting the eden configs over to the new evo2 recipe (#1502)
### Description
This PR adds Eden (Llama 3.1) model support, Savanna/Vortex checkpoint
converters, and a standardized model naming convention to the Megatron
Bridge–based Evo2 recipe (`bionemo-recipes/recipes/evo2_megatron/`).
**Eden (Llama 3.1) model support**
- New `eden_provider.py` defining `EdenModelProvider` and size-specific
subclasses (`eden_7b` through `eden_35b`) that inherit from
`Llama31ModelProvider`.
- `train.py` now dispatches to `gpt_forward_step` for Eden models and
automatically disables `fp32_residual_connection` (incompatible with
standard TE `LayerNormLinear` layers — Hyena handles this via manual
dtype casting, but GPT/Llama does not).
- `infer.py` now initializes `ProcessGroupCollection` for non-Hyena
providers (required by `GPTModelProvider.provide()`) and uses
`StaticInferenceContext` instead of `HyenaInferenceContext` for Eden
models. Access to the `flash_decode` attribute is now guarded so that it
applies only to Hyena models.
- `predict.py` already worked architecture-agnostically via dynamic
model loading; no changes required.
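The `train.py` dispatch described above can be pictured with a minimal, self-contained sketch. The stub classes and the names `select_forward_step` and `hyena_forward_step` are hypothetical stand-ins for illustration; only `gpt_forward_step` and `fp32_residual_connection` are named in this PR, and the real code may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class ProviderStub:
    """Hypothetical stand-in for a model provider config."""

    fp32_residual_connection: bool = True


@dataclass
class HyenaProviderStub(ProviderStub):
    """Stand-in for a Hyena-based provider."""


@dataclass
class EdenProviderStub(ProviderStub):
    """Stand-in for an Eden (Llama 3.1) provider."""


def gpt_forward_step() -> str:
    # Placeholder for the GPT/Llama forward step named in the PR.
    return "gpt"


def hyena_forward_step() -> str:
    # Assumed name for the Hyena forward step; not taken from the PR.
    return "hyena"


def select_forward_step(provider: ProviderStub):
    """Pick a forward step and adjust the provider for Eden models."""
    if isinstance(provider, EdenProviderStub):
        # Eden uses standard TE layers, so fp32 residuals are turned off.
        provider.fp32_residual_connection = False
        return gpt_forward_step
    return hyena_forward_step


eden = EdenProviderStub()
eden_step = select_forward_step(eden)
hyena = HyenaProviderStub()
hyena_step = select_forward_step(hyena)
```

In this sketch the Hyena provider keeps `fp32_residual_connection` enabled, mirroring the note that Hyena handles the dtype casting manually.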
**Checkpoint converters**
- `savanna_to_mbridge.py` — converts ARC Savanna `.pt` checkpoints
(local or downloaded from Hugging Face via `hf_hub_download`) into
MBridge distributed checkpoint format.
- `mbridge_to_vortex.py` — exports MBridge checkpoints to ARC's
single-file Vortex inference format, handling MLP weight splitting,
Hyena filter pole/residue computation, and TE layernorm key remapping.
- Both are registered as console scripts
(`evo2_convert_savanna_to_mbridge`, `evo2_export_mbridge_to_vortex`).
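As a rough illustration of the "MLP weight splitting" step, the sketch below splits a fused weight matrix into two equal halves along its output rows. It assumes the fused layout stacks the two projections row-wise, which is an assumption and not something this PR states; `split_fused_mlp_weight` is a hypothetical helper, and plain nested lists stand in for tensors.

```python
def split_fused_mlp_weight(fused):
    """Split a fused [2 * ffn, hidden] weight into two [ffn, hidden] halves.

    Assumes (not confirmed by the PR) that the two projections are stacked
    along the first dimension, first half followed by second half.
    """
    rows = len(fused)
    if rows % 2 != 0:
        raise ValueError("fused weight must have an even number of rows")
    half = rows // 2
    return fused[:half], fused[half:]


# Toy 4x2 "weight": the first two rows form one projection, the last two the other.
fused_weight = [[1, 2], [3, 4], [5, 6], [7, 8]]
first_half, second_half = split_fused_mlp_weight(fused_weight)
```

The actual exporter also handles filter pole/residue computation and key remapping, which this sketch does not attempt to reproduce.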
**Model naming convention**
The previous model size keys (`1b`, `7b`, `40b`, `7b_arc_longcontext`,
…) were ambiguous — `7b` referred to Striped Hyena while `7B` referred
to Llama. This PR replaces them with explicit, architecture-prefixed
keys:
- `evo2_*` for models matching public ARC checkpoints (e.g.
`evo2_1b_base`, `evo2_7b`, `evo2_40b_base`). A `_base` suffix denotes
the 8K-context checkpoint; keys without it denote the 1M-context
checkpoint.
- `striped_hyena_*_nv` for NVIDIA-modified Hyena variants.
- `eden_*` for Llama 3.1 variants.
- Added `evo2_20b` config based on `arcinstitute/savanna_evo2_20b`.
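The prefix rules above can be expressed as a small classifier. This is a sketch of the convention as described here, not code from the recipe: `describe_model_key` and the returned dictionary fields are hypothetical, and context lengths are only filled in for `evo2_*` keys because the description gives no rule for the other families.

```python
def describe_model_key(key: str) -> dict:
    """Classify a model-size key by its architecture prefix (illustrative)."""
    if key.startswith("evo2_"):
        # Per the convention: `_base` suffix = 8K context, otherwise 1M.
        context = "8K" if key.endswith("_base") else "1M"
        return {"family": "evo2", "architecture": "hyena", "context": context}
    if key.startswith("striped_hyena_") and key.endswith("_nv"):
        # NVIDIA-modified Hyena variants; context is not specified here.
        return {"family": "striped_hyena_nv", "architecture": "hyena", "context": None}
    if key.startswith("eden_"):
        # Llama 3.1 variants; context is not specified here.
        return {"family": "eden", "architecture": "llama3.1", "context": None}
    raise ValueError(f"unrecognized model key: {key!r}")


examples = {k: describe_model_key(k) for k in ("evo2_1b_base", "evo2_7b", "eden_7b")}
```

An old-style key such as `7b` raises `ValueError` in this sketch, which reflects the ambiguity the new prefixes are meant to remove.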
**Documentation updates**
- `README.md` — added model naming convention tables, Vortex export
section with round-trip example, updated all CLI examples to new model
keys.
- `checkpoint/README.md` — updated `--model-size` documentation.
- Both Jupyter notebooks (`zeroshot_brca1.ipynb`,
`fine-tuning-tutorial.ipynb`) — updated `MODEL_SIZE` and `--model-size`
references.
#### Usage
Training an Eden model:
```bash
torchrun --nproc-per-node 1 --no-python train_evo2 \
--model-size eden_7b --num-layers 2 --max-steps 5 \
--mock-data --seq-length 64 --mixed-precision-recipe bf16_mixed \
--no-activation-checkpointing
```
Converting a Savanna checkpoint to MBridge:
```bash
evo2_convert_savanna_to_mbridge \
--savanna-ckpt-path arcinstitute/savanna_evo2_1b_base \
--mbridge-ckpt-dir /tmp/mbridge_1b \
--model-size evo2_1b_base \
--tokenizer-path tokenizers/nucleotide_fast_tokenizer_256
```
Exporting an MBridge checkpoint to Vortex:
```bash
evo2_export_mbridge_to_vortex \
--mbridge-ckpt-dir /tmp/mbridge_1b/iter_0000001 \
--output-path /tmp/evo2_1b_vortex.pt \
--model-size evo2_1b_base
```
### Type of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [x] Refactor
- [x] Documentation update
- [ ] Other (please describe):
### CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only
basic unit tests are run.
- [ciflow:skip](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:skip) - Skip all CI tests for this PR
- [ciflow:notebooks](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:notebooks) - Run Jupyter notebooks execution tests for bionemo2
- [ciflow:slow](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:slow) - Run slow single GPU integration tests marked as `@pytest.mark.slow` for bionemo2
- [ciflow:all](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all) - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
- [ciflow:all-recipes](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all-recipes) - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as `@pytest.mark.multi_gpu` or
`@pytest.mark.distributed` are not run in the PR pipeline.
For more details, see [CONTRIBUTING](CONTRIBUTING.md).
> [!NOTE]
> By default, only basic unit tests are run. Add appropriate labels to
> enable additional test coverage.
#### Authorizing CI Runs
We use
[copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation)
to manage authorization of CI
runs on NVIDIA's compute resources.
- If a pull request is opened by a trusted user and contains only
trusted changes, the pull request's code will
automatically be copied to a pull-request/ prefixed branch in the source
repository (e.g. pull-request/123)
- If a pull request is opened by an untrusted user or contains untrusted
changes, an NVIDIA org member must leave an
`/ok to test` comment on the pull request to trigger CI. This will need
to be done for each new commit.
#### Triggering CodeRabbit AI Review
To trigger a code review from CodeRabbit, comment on a pull request
with one of these commands:
- `@coderabbitai review` - Triggers a standard review
- `@coderabbitai full review` - Triggers a comprehensive review
See https://docs.coderabbit.ai/reference/review-commands for a full list
of commands.
### Pre-submit Checklist
- [x] I have tested these changes locally
- [x] I have updated the documentation accordingly
- [x] I have added/updated tests as needed
- [x] All existing tests pass successfully
## Summary by CodeRabbit
* **New Features**
  * Added Eden (Llama 3.1) model family support alongside existing Hyena
    models (11B–35B variants).
  * Added checkpoint conversion utilities: Savanna-to-MBridge and
    MBridge-to-Vortex exporters with CLI tools.
* **Documentation**
  * Updated model naming convention with Evo2 prefixes (e.g.,
    `evo2_1b_base`, `evo2_7b`).
  * Expanded documentation for checkpoint conversion workflows and
    available models.
* **Tests**
  * Added comprehensive test coverage for model providers, checkpoint
    conversions, and Eden inference/prediction workflows.
---------
Signed-off-by: John St. John <jstjohn@nvidia.com>

1 parent 52310f6 commit f873667
26 files changed
Lines changed: 5135 additions & 1812 deletions
File tree
- bionemo-recipes/recipes/evo2_megatron
  - examples
  - src/bionemo/evo2
    - models
    - run
    - utils/checkpoint
  - tests/bionemo/evo2
    - run