Official code release for the ICLR 2026 paper: Automatic Image-Level Morphological Trait Annotation for Organismal Images.
🌐 Website: osu-nlp-group.github.io/sae-trait-annotation
🤗 Model: osunlp/sae-trait-annotation
🤗 Dataset: osunlp/bioscan-traits, images from bioscan-ml/BIOSCAN-5M
This repository provides an end-to-end pipeline to:
- preprocess BIOSCAN-5M into an
ImageFolderlayout, - train a Sparse Autoencoder (SAE) on DINOv2 activations,
- identify species-level prominent latents, and
- generate natural-language morphological trait annotations using MLLMs (Qwen2.5-VL).
The SAE training/inference stack is adapted from the public SAEV repository and is included here for convenience.
.
|-- preprocess_bioscan.py
|-- create_trait_dataset_mllm_sae.py
|-- create_trait_dataset_mllm.py
|-- saev/ # SAEV codebase (vendored)
`-- utils/
|-- create_train_json.py
`-- convert_trait_wds.py
This project uses Python 3.11 and uv.
pip install uvDependencies are installed automatically when running commands via uv run.
pip install bioscan-dataset
python - <<'PY'
from bioscan_dataset import BIOSCAN5M
_ = BIOSCAN5M("~/Datasets/bioscan-5m", download=True)
PYcreate_trait_dataset_* scripts expect a train/ subdirectory under --data-dir.
python preprocess_bioscan.py \
--csv-file /path/to/bioscan-5m/metadata.csv \
--image-dir /path/to/bioscan-5m/images \
--out-dir /path/to/processed_bioscan/trainuv run python -m saev activations \
--vit-family dinov2 \
--vit-ckpt dinov2_vitb14 \
--vit-batch-size 1024 \
--d-vit 768 \
--n-patches-per-img 256 \
--vit-layers -2 \
--dump-to /path/to/activations \
--n-patches-per-shard 2_4000_000 \
data:image-folder-dataset \
--data.root /path/to/processed_bioscan/trainuv run python -m saev train \
--data.shard-root /path/to/activations \
--data.layer -2 \
--data.patches patches \
--data.scale-mean False \
--data.scale-norm False \
--sae.d-vit 768 \
--sae.exp-factor 32 \
--ckpt-path /path/to/sae_ckpt \
--lr 1e-3 > LOG.txt 2>&1vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--gpu-memory-utilization 0.9 \
--host 0.0.0.0 \
--port <PORT>Single-image prompts (n-img-input=1):
uv run python -u create_trait_dataset_mllm_sae.py \
--data-dir /path/to/processed_bioscan \
--sae-ckpt-path /path/to/sae_ckpt/sae.pt \
--thresh 0.9 \
--trait-thresh 3e-3 \
--out-dir /path/to/output_mllm_sae \
--serve-choice qwen_72b \
--api-url http://0.0.0.0:<PORT>/v1/chat/completions > LOG.txt 2>&1Multi-image prompts (n-img-input=3):
uv run python -u create_trait_dataset_mllm_sae.py \
--data-dir /path/to/processed_bioscan \
--sae-ckpt-path /path/to/sae_ckpt/sae.pt \
--thresh 0.9 \
--trait-thresh 3e-3 \
--out-dir /path/to/output_mllm_sae \
--serve-choice qwen_72b \
--api-url http://0.0.0.0:<PORT>/v1/chat/completions \
--n-img-input 3 > LOG.txt 2>&1Single-image prompts:
uv run python -u create_trait_dataset_mllm.py \
--data-dir /path/to/processed_bioscan \
--sae-ckpt-path /path/to/sae_ckpt/sae.pt \
--thresh 0.9 \
--trait-thresh 3e-3 \
--out-dir /path/to/output_mllm \
--serve-choice qwen_72b \
--api-url http://0.0.0.0:<PORT>/v1/chat/completions \
--n-img-input 1 > LOG.txt 2>&1Multi-image prompts:
uv run python -u create_trait_dataset_mllm.py \
--data-dir /path/to/processed_bioscan \
--sae-ckpt-path /path/to/sae_ckpt/sae.pt \
--thresh 0.9 \
--trait-thresh 3e-3 \
--out-dir /path/to/output_mllm \
--serve-choice qwen_72b \
--api-url http://0.0.0.0:<PORT>/v1/chat/completions \
--n-img-input 3 > LOG.txt 2>&1In --out-dir, key artifacts include:
latent_to_patch_map.json(MLLM+SAE pipeline),species_latents_prominent/latent_response.jsonl(model responses),- per-species annotated patch visualizations under
species_latents_prominent/<species_name>/.
Use --debug and --n-debug-ex in generation scripts for small-scale dry runs.
For downstream classifier training, we build on BioCLIP.
Preprocessing helpers in this repo:
utils/create_train_json.py: build train JSONs from CSV metadata.utils/convert_trait_wds.py: convert trait annotations to WebDataset format.
If you use this repository, please cite the paper:
@inproceedings{pahuja2026automatic,
title={Automatic Image-Level Morphological Trait Annotation for Organismal Images},
author={Vardaan Pahuja and Samuel Stevens and Alyson East and Sydne Record and Yu Su},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=oFRbiaib5Q}
}Please also cite the source dataset:
@inproceedings{gharaee2024bioscan5m,
title={{BIOSCAN-5M}: A Multimodal Dataset for Insect Biodiversity},
booktitle={Advances in Neural Information Processing Systems},
author={Zahra Gharaee and Scott C. Lowe and ZeMing Gong and Pablo Millan Arias and Nicholas Pellegrino and Austin T. Wang and Joakim Bruslund Haurum and Iuliia Zarubiieva and Lila Kari and Dirk Steinke and Graham W. Taylor and Paul Fieguth and Angel X. Chang},
editor={A. Globerson and L. Mackey and D. Belgrave and A. Fan and U. Paquet and J. Tomczak and C. Zhang},
pages={36285--36313},
publisher={Curran Associates, Inc.},
year={2024},
volume={37},
url={https://proceedings.neurips.cc/paper_files/paper/2024/file/3fdbb472813041c9ecef04c20c2b1e5a-Paper-Datasets_and_Benchmarks_Track.pdf},
}Code
- SAEV for sparse autoencoder training infrastructure.
- BioCLIP for downstream training/evaluation tooling.
Funding
This research was supported in part by NSF CAREER #2443149, NSF OAC 2118240, and an Alfred P. Sloan Foundation Fellowship. Computational resources were provided by the Ohio Supercomputer Center.
S. Record and A. East were additionally supported by NSF Award No. 242918 (EPSCOR Research Fellows: Advancing NEON-Enabled Science and Workforce Development at the University of Maine with AI) and Hatch project Award #MEO-022425 from the USDA National Institute of Food and Agriculture. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the US Department of Agriculture.
People
We thank colleagues in the OSU NLP group for valuable feedback. This work was in part conceived at Funcapalooza.