SpectraLoRA

Post-training compact molecular GNNs for surrogate Raman prediction and post-hoc spectral alignment via structural prompting and derivative-free optimization via evolution strategies.

Overview

SpectraLoRA adapts a 22.5M-parameter equivariant molecular GNN model (DetaNet) to predict Hessians, vibrational frequencies, and Raman spectra from 3D atomic coordinates. The model is pre-trained on ~2M molecules from SPICE, NABLA2DFT, QM9, and QM7, then adapted with SO(3)-equivariant LoRA (AdaLoRA for invariant layers, ELoRA for equivariant tensor-product layers). A two-phase post-hoc spectral alignment strategy, supervised pre-training of a 1D U-Net with structural prompting (FiLM conditioning on Morgan fingerprints), followed by Natural Evolution Strategies (NES) to hill-climb the non-differentiable F1 metric, pushes fingerprint F1@15 from 0.426 to 0.532 and peak recall to 0.703, exceeding the in-distribution recall of Mol2Raman (0.634) on a harder out-of-distribution benchmark.

Reproducibility: Demonstration of Prediction & Figures

To run the prediction system:
To reproduce figures please run the cells in the Figures Notebook note that you may need git-lfs to get all artifact csv files.

Results Figure

Repository Layout

capsule-3259363/          DetaNet model code, weights, LoRA adapters
apps/api/                 Async FastAPI for dataset browsing + inference
apps/model/               Model service (Python 3.8, loads DetaNet)
db/init/                  Postgres schema + partitions
scripts/ingest/           Build Parquet shards + load DB
data-gen-pipeline/        DFT data generation (DeePMD imputation)
ramanchembl_pipeline/     Evaluation, stats, alignment, ES refinement
  artifacts/stats_v2/     Pre-computed CSVs + figures
  alignment_results/      ES refinement checkpoints + eval
  stats_notebook_lib.py   Plotting + analysis library
  stats_v2.ipynb          Main stats notebook (generates paper figures)
paper/                    Manuscript (LaTeX)
  figures/                Paper figures (PNG/PDF/SVG)

Cloning

This repo uses Git LFS for large model checkpoints and data files.

Full clone (requires git-lfs):

git lfs install
https://github.com/Arcadia-Science/spectralora-molecular-alignment

Code-only clone (skip LFS binaries, ~50 MB):

GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/Arcadia-Science/spectralora-molecular-alignment

LFS pointer stubs are checked out in place of the actual files. Model weights for inference are available separately on HuggingFace.

Local Setup

Install

pip install -r requirements.txt -f https://data.pyg.org/whl/torch-2.8.0+cpu.html

Build Parquet shards

python scripts/ingest/build_parquet.py --dataset qm9s --zip capsule-3259363/data/qm9s_csv.zip --out data/processed --shard-size 1000

Start some services

docker-compose up --build

Run local inference

curl -X POST "http://localhost:8000/predict/raman" \
  -H "Content-Type: application/json" \
  -d '{"dataset":"qm9s","molecule_id":1}'

Hardware Requirements

Minimal requirements: Laptop with 16GB of RAM + Docker for local inference on small dataset
Data Processing: AWS Batch or 3-5x c7a.48xlarge + EKS
Small Training: 7xp5e.48xlarge + AWS EKS Node group see : AWS DOCS
Medium Training: 12-15x 5e.48xlarge + AWS EKS
Large Training: Hyperpod AWS DOCS

Key Results

Metric	Baseline (raw, OOD)	+ES Refined OOD	Mol2Raman (in-dist)
FP F1@15	0.426	0.532	0.631
FP Recall@15	0.438	0.703	0.634
FP Precision@15	0.444	0.440	0.629
Cosine (full)	0.216	0.486	0.689

Full Distributed System

The pipeline is deployed as a four-plane distributed architecture. An Amazon FSx for Lustre parallel file system serves as the shared storage backbone. Ray orchestrates both the offline data-engineering phase (CPU workers featurize heterogeneous molecular sources into sharded PyTorch Geometric graphs using lock-free, shared-nothing SQLite3 indexing) and the distributed training phase (DDP/FSDP jobs with NCCL-based gradient synchronization). Data loading streams pre-sharded, pre-randomized chunks assigned deterministically by GPU rank, trading perfect global shuffle for sustained sequential I/O. Online serving splits into a public FastAPI service (dataset browsing, inference orchestration, Postgres metadata, Redis caching) and a dedicated DetaNet model service with geometries stored in Parquet shards for random access.

Name		Name	Last commit message	Last commit date
Latest commit History 159 Commits
.codex		.codex
analysis		analysis
apps		apps
artifacts		artifacts
capsule-3259363		capsule-3259363
data-gen-pipeline		data-gen-pipeline
db/init		db/init
figures		figures
inference		inference
notebooks		notebooks
paper		paper
picking_analysis		picking_analysis
ramanchembl_pipeline		ramanchembl_pipeline
scripts		scripts
tests		tests
third_party/ELoRA/e3nn		third_party/ELoRA/e3nn
train		train
trained_param/latest		trained_param/latest
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
inference-docker-compose.yml		inference-docker-compose.yml
inference.ipynb		inference.ipynb
pytest.ini		pytest.ini
requirements.txt		requirements.txt
run_inference.sh		run_inference.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SpectraLoRA

Overview

Reproducibility: Demonstration of Prediction & Figures

Results Figure

Repository Layout

Cloning

Local Setup

Install

Build Parquet shards

Start some services

Run local inference

Hardware Requirements

Key Results

Full Distributed System

About

Uh oh!

Releases 4

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SpectraLoRA

Overview

Reproducibility: Demonstration of Prediction & Figures

Results Figure

Repository Layout

Cloning

Local Setup

Install

Build Parquet shards

Start some services

Run local inference

Hardware Requirements

Key Results

Full Distributed System

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages