Commit a7d0463

polinabinder1 and claude committed
evo2 dashboard: scope dashboard.py to a small corpus; clarify label provenance
--max-sequences caps the corpus (default 4000): the 7B pass exists only to give the example cards sequence-aligned activations, not to re-derive the full atlas.

Labels are joined from --feature-annotations (the label-producer pipeline, #1630), not computed here.

Stats stay local to the SAE codes (compute_feature_stats wants raw activations that won't fit for long DNA); UMAP still reuses compute_feature_umap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 35954d9 commit a7d0463

1 file changed

Lines changed: 19 additions & 3 deletions

bionemo-recipes/interpretability/sparse_autoencoders/recipes/evo2/scripts/dashboard.py

@@ -26,7 +26,17 @@
 per-base activation track -> the example cards

 The heavy lifting is reused, not reimplemented: ``encode_batch`` (engine) for the
-activations and ``sae.analysis.compute_feature_umap`` for the 2-D layout.
+activations and ``sae.analysis.compute_feature_umap`` for the 2-D layout. (Per-feature
+stats are computed locally from the SAE codes rather than via ``compute_feature_stats``,
+which wants raw pre-SAE activations — holding those for long DNA would not fit in memory.)
+
+This runs over a SMALL representative corpus (``--max-sequences``), not the full atlas
+corpus: the 7B pass is here only because the example cards need sequence-aligned
+activations, which the anonymous token-level training cache cannot provide.
+
+Feature *labels* are not produced here — they come from ``--feature-annotations`` (the
+feature-probing / label-producer pipeline, PR #1630) and are joined into the ``label``
+column; unlabeled features fall back to ``Feature N``. Users can further rename in-UI.

 Memory is bounded by a two-pass scheme (mirrors the codonfm generator): pass 1 keeps
 only the per-(sequence, feature) max to pick top examples; pass 2 re-encodes just the
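
Note on the label join described in the docstring hunk above: the commit does not show the join itself or the format of the ``--feature-annotations`` file, so the following is only a minimal sketch of the stated behaviour (annotated features get their label, everything else falls back to ``Feature N``). The function name ``attach_labels``, the ``feature_id`` key, the in-memory dict of annotations, and the choice to treat blank labels as missing are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def attach_labels(
    feature_ids: List[int],
    annotations: Dict[int, Optional[str]],
) -> List[dict]:
    """Join externally produced labels onto features (hypothetical helper).

    `annotations` stands in for whatever --feature-annotations loads; its
    real on-disk format is not shown in this commit.
    """
    rows = []
    for fid in feature_ids:
        # Assumption: a missing, None, or whitespace-only label counts as unlabeled.
        label = (annotations.get(fid) or "").strip()
        rows.append({"feature_id": fid, "label": label or f"Feature {fid}"})
    return rows


rows = attach_labels([0, 1, 2], {0: "tRNA", 2: "   "})
for row in rows:
    print(row["feature_id"], row["label"])
```

Keeping the fallback in one place means the dashboard always has a non-empty ``label`` to display, whether or not the annotation pipeline covered a given feature.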
@@ -59,9 +69,13 @@ def parse_args():
     p.add_argument("--layer", type=int, default=int(os.environ.get("EMBEDDING_LAYER", "26")))
     p.add_argument("--device", default=os.environ.get("DEVICE", "cuda"))
     p.add_argument("--max-seq-len", type=int, default=int(os.environ.get("MAX_SEQ_LEN", "8192")))
-    # Corpus + output.
-    p.add_argument("--fasta", required=True, help="FASTA corpus to characterize features over")
+    # Corpus + output. This is meant to be a SMALL, representative corpus (a few thousand seqs):
+    # we re-run the 7B over it only because the example cards need sequence-aligned activations,
+    # which the (anonymous, token-level) training activation cache can't provide. It is NOT the
+    # full atlas corpus — stats/UMAP need only a representative sample.
+    p.add_argument("--fasta", required=True, help="SMALL representative FASTA (a few thousand seqs)")
     p.add_argument("--output-dir", required=True, help="Directory to write the 3 parquets into")
+    p.add_argument("--max-sequences", type=int, default=4000, help="Cap sequences read from --fasta (keep it small)")
     p.add_argument("--organism", default="None (raw DNA)", help="Phylo-tag preset to prepend (default: raw DNA)")
     p.add_argument("--batch-size", type=int, default=8)
     p.add_argument("--n-examples", type=int, default=6, help="Top examples per feature")
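
Note on the new flag: with ``type=int``, ``--max-sequences 0`` or a negative value is accepted by the parser, and the cap check in ``main()`` then stops before any sequence is read. A possible hardening, not part of this commit, is to reject non-positive values at parse time. The sketch below assumes a hypothetical ``positive_int`` helper and a cut-down ``build_parser`` that contains only the two corpus flags.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse type callable that rejects zero and negative caps (hypothetical)."""
    n = int(value)
    if n <= 0:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {value!r}")
    return n


def build_parser() -> argparse.ArgumentParser:
    # Only the corpus-related flags from the hunk above; the real parser has more.
    p = argparse.ArgumentParser(description="sketch of the corpus-cap flag")
    p.add_argument("--fasta", required=True, help="SMALL representative FASTA")
    p.add_argument("--max-sequences", type=positive_int, default=4000,
                   help="Cap sequences read from --fasta (keep it small)")
    return p


args = build_parser().parse_args(["--fasta", "corpus.fa"])
print(args.max_sequences)
```

Whether a zero cap should be an error or a deliberate "read nothing" is a judgment call for the authors; the sketch just shows one way to make the choice explicit.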
@@ -230,6 +244,8 @@ def main():

     ids, seqs = [], []
     for sid, seq in read_fasta(args.fasta):
+        if len(seqs) >= args.max_sequences:
+            break
         ids.append(sid)
         seqs.append(clean_dna(seq))
     if not seqs:
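
Note on the cap in ``main()``: the same early stop can be written with ``itertools.islice``. This is an equivalent formulation for comparison, not what the commit does. The ``read_fasta_records`` generator below is a hypothetical stand-in for the script's ``read_fasta``, whose implementation is not shown here, and the ``clean_dna`` step is left out for the same reason.

```python
import io
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def read_fasta_records(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Minimal FASTA reader yielding (id, sequence); a stand-in for read_fasta."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def load_capped(handle: Iterable[str], max_sequences: int) -> Tuple[List[str], List[str]]:
    """Read at most max_sequences records, then stop consuming the input."""
    ids, seqs = [], []
    for sid, seq in islice(read_fasta_records(handle), max_sequences):
        ids.append(sid)
        seqs.append(seq)  # the real script applies clean_dna(seq) at this point
    return ids, seqs


demo = io.StringIO(">a first\nACGT\nAC\n>b\nGG\n>c\nTT\n")
ids, seqs = load_capped(demo, 2)
print(ids, seqs)
```

Because the reader is a generator, either form stops parsing once the cap is reached, so a large FASTA is never fully loaded just to be truncated.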
