Add Evo2 1B SAE recipe + fix Savanna weights_only loading#1579
Open
polinabinder1 wants to merge 4 commits into
torch 2.6 changed the default of `weights_only` in `torch.load` to True. The Savanna checkpoint pickle includes numpy globals (`numpy.core.multiarray._reconstruct`), which the safer loader rejects. The converter then exits 0 with no output written and the error gets buried in stderr, so the failure is silent. The Savanna repos under arcinstitute/* are trusted sources, so load with `weights_only=False`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
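A minimal sketch of the two behaviours this commit message describes: loading a trusted pickle-based checkpoint with `weights_only=False`, and failing loudly when a step writes nothing. The helper names `load_trusted_checkpoint` and `require_nonempty_output` are hypothetical and are not taken from `savanna_to_mbridge.py`.

```python
from pathlib import Path


def load_trusted_checkpoint(path):
    """Load a pickle-based checkpoint from a source that is already trusted.

    Hypothetical helper, not the converter's actual code. weights_only=False
    permits arbitrary unpickling, so use it only for files you trust.
    """
    import torch  # imported lazily so the guard below works without torch

    return torch.load(path, map_location="cpu", weights_only=False)


def require_nonempty_output(out_dir):
    """Raise instead of exiting 0 when a step produced no files."""
    out = Path(out_dir)
    files = [p for p in out.rglob("*") if p.is_file()] if out.is_dir() else []
    if not files:
        raise RuntimeError(f"no output written to {out}")
    return files
```

Calling a guard like `require_nonempty_output` right after the conversion step would turn the "exit 0, empty directory" case into a visible error rather than a silent one.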
Mirrors the existing esm2 / codonfm SAE recipes. Pipeline:
chunk -> convert (Savanna->MBridge) -> predict_evo2 -> pt_to_parquet -> train
Differences from esm2/codonfm are forced by Evo2 specifics:
- Hyena/Megatron-Core model, no HF AutoModel path => reuses the
existing `predict_evo2` CLI for inference instead of writing
a custom extract.py
- `pt_to_parquet.py` shim bridges predict_evo2's .pt output to
the universal `sae.activation_store` parquet contract
- `chunk_fasta.py` preprocessor keeps inputs within the model's
trained context length (8192 bp for 1B); Hyena fftconv OOMs
on long sequences even at micro-batch=1
- `train.py` is the same as codonfm's, copied verbatim per
bionemo-recipes' KISS-over-DRY convention
Validated end-to-end on 100 organelle sequences (Evo2 1B layer 12):
loss 0.67 -> 0.045, FVU 0.90 -> 0.10, var_exp 0.10 -> 0.90, 2m14s wall.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
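The chunking step described in this commit can be sketched as follows. This is an illustration only, not the contents of `chunk_fasta.py`: it assumes non-overlapping windows and a hypothetical `|chunkN` header suffix, while the 8192 bp default comes from the commit message.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:].strip(), []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def chunk_records(records, max_len=8192):
    """Split each sequence into consecutive windows of at most max_len bases.

    Assumption: windows do not overlap and the last one may be shorter.
    """
    for header, seq in records:
        for i, start in enumerate(range(0, len(seq), max_len)):
            yield f"{header}|chunk{i}", seq[start:start + max_len]


def write_fasta(records, handle):
    """Write (header, sequence) pairs back out, one sequence per line."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")
```

Usage would look like `write_fasta(chunk_records(read_fasta(src)), dst)` with two open text files, so every record handed to inference stays within the context length.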
The recipe currently has no model-specific Python module: the extractor is upstream (`predict_evo2`) and the two scripts are simple CLIs in `scripts/`. Drop the empty package and adjust `pyproject.toml` so setuptools doesn't try to discover anything. Will reintroduce when there's actual library code to put there (eval, dashboard, dataloaders).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Evo2 1B SAE — working on Lepton
TL;DR: New SAE recipe at `bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/`. ~200 lines of original code plus a one-line bug fix in the `evo2_megatron` converter.
Pipeline: `chunk → convert (Savanna→MBridge) → predict_evo2 → pt_to_parquet shim → train`. Longer than esm2's `extract → train` because Evo2 is a Hyena model in Megatron-Core (no HF AutoModel path), ships as Savanna checkpoints needing conversion, and takes unbounded-length genomes that have to be chunked. The shim and chunker are the only model-specific code; `train.py` is reused verbatim from codonfm.
Results on 100 organelle sequences (558 chunks at 8192 bp, Evo2 1B layer 12): SAE trained in 2m 14s on one H100. Loss 0.67 → 0.045 (15× reduction). FVU 0.90 → 0.10, variance explained 0.10 → 0.90, monotonic. Dead latents 5.4% at end (normal range; auxk revival is working). Encoder/decoder shapes [1920 ↔ 15360].
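The reported FVU and variance-explained figures are complementary (0.90 with 0.10, then 0.10 with 0.90). A small pure-Python sketch of the standard definition shows why; it assumes FVU is the squared reconstruction error divided by the total variance around the per-dimension mean, which may differ in detail from how the recipe's `train.py` computes it.

```python
def fvu(inputs, reconstructions):
    """Fraction of variance unexplained: residual SS / total SS.

    inputs and reconstructions are equal-shaped lists of rows (lists of floats).
    """
    n, dim = len(inputs), len(inputs[0])
    mean = [sum(row[j] for row in inputs) / n for j in range(dim)]
    residual = sum(
        (x - y) ** 2
        for x_row, y_row in zip(inputs, reconstructions)
        for x, y in zip(x_row, y_row)
    )
    total = sum((x - m) ** 2 for x_row in inputs for x, m in zip(x_row, mean))
    if total == 0:
        raise ValueError("inputs have zero variance; FVU is undefined")
    return residual / total


def variance_explained(inputs, reconstructions):
    """Complement of FVU, so the two always sum to 1."""
    return 1.0 - fvu(inputs, reconstructions)
```

Under this definition, predicting the mean activation for every token gives an FVU of 1, and a perfect reconstruction gives 0.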
Three gotchas worth flagging:
- `weights_only=True` (torch ≥ 2.6) silently kills Savanna HF checkpoint loads: exit 0, empty dir. One-line patch at `savanna_to_mbridge.py:138`.
- `dead_tokens_threshold` (10M tokens): smoke tests always show 0%. Trust it only after the window fills.
Reusable for the next model: the three-stage pattern (extractor → ActivationStore parquet → universal `train.py`) holds for esm2, codonfm, and evo2. Only the extractor changes per model.
Test plan
- `FASTA=<small.fasta> bash scripts/1b.sh` should run end-to-end and produce `checkpoint_final.pt` with the expected encoder/decoder shapes
- `weights_only=False` patch resolves the converter failure on torch ≥ 2.6
🤖 Generated with Claude Code
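On the `dead_tokens_threshold` gotcha: the sketch below shows one way such a counter can behave, assuming a latent counts as dead only once it has gone more than `dead_tokens_threshold` tokens without firing. That convention is common in top-k SAE training code but is an assumption here, not confirmed against the recipe's `train.py`; the `DeadLatentTracker` class is hypothetical.

```python
class DeadLatentTracker:
    """Track how many tokens have passed since each latent last fired."""

    def __init__(self, n_latents, dead_tokens_threshold=10_000_000):
        self.threshold = dead_tokens_threshold
        self.tokens_since_fired = [0] * n_latents

    def update(self, fired, n_tokens):
        """Record one batch.

        fired: set of latent indices active at least once in the batch.
        n_tokens: number of tokens in the batch.
        """
        for i in range(len(self.tokens_since_fired)):
            if i in fired:
                self.tokens_since_fired[i] = 0
            else:
                self.tokens_since_fired[i] += n_tokens

    def dead_fraction(self):
        """Fraction of latents silent for more than the threshold."""
        dead = sum(c > self.threshold for c in self.tokens_since_fired)
        return dead / len(self.tokens_since_fired)
```

With a counter like this, no latent can exceed the threshold until the run itself has processed more tokens than the threshold, which would explain a 0% reading on short smoke tests.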