Commit d6cd678

New Feature - API Framework (#6)
* Introducing initial inference API. Subject to change; will likely move to separate concerns and disentangle the main project from the HEST1k data, treating HEST1k more as a developmental dataset for the project.

* feat: restructure project into a robust framework with Trainer API and Data Loader class template.

  This commit refactors the codebase from an experiment-focused structure into a general-purpose framework in order to encourage users to develop their own models and pipelines. Key changes include the definition of a clear data contract, the introduction of a high-level Trainer class, and the isolation of HEST-specific logic into a recipe.

  Core Changes:
  - Introduced SpatialDataset abstract base class to define a standard data contract (features, gene_counts, rel_coords) for all spatial transcriptomics datasets.
  - Implemented a high-level Trainer class to orchestrate the training lifecycle, including LR scheduling (warmup + cosine), AMP, and checkpointing.
  - Added a flexible callback system to the Trainer, including a built-in EarlyStoppingCallback.
  - Created a recipes/hest namespace to isolate HEST-specific dataset logic and utilities, maintaining backward compatibility through re-export facades.
  - Added a "Bring Your Own Data" (BYOD) guide and template for custom datasets.

  API & DX:
  - Exposed Trainer and SpatialDataset in the top-level package for easier access.
  - Standardized training engine functions (train_one_epoch, validate) to be agnostic to specific data sources.
  - Added comprehensive unit tests for the Trainer lifecycle, callbacks, and resumption.
  - Updated documentation (API.md) with detailed Training API and BYOD sections.

  Verified with 166 passing tests across the full suite.

* docs: clarify licensing and add third-party attributions
  - Add code-level attribution in backbones.py for the foundation models.

* Minor update to the docs for clarification on the API and Dataset Recipes.

* Reformatted the 6 files identified for the CI/CD check (black formatter). Resolves the formatting failures.
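
The commit message names "LR scheduling (warmup + cosine)" without showing the schedule. As an illustration only, and not the Trainer's actual implementation, a linear-warmup plus cosine-decay schedule might be sketched as below. The function name `lr_at_step` and all of its parameters are hypothetical.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               base_lr: float, min_lr: float = 0.0) -> float:
    """Illustrative learning rate: linear warmup, then cosine decay.

    Hypothetical helper; the real Trainer may compute its schedule differently.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, (step - warmup_steps) / decay_steps))
    # Cosine decay from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With 10 warmup steps out of 100, this ramps the rate up over the first 10 steps, holds the peak at step 10, and decays smoothly toward `min_lr` by step 100.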
1 parent ae68c5f commit d6cd678

41 files changed

Lines changed: 2874 additions & 209 deletions

README.md

Lines changed: 55 additions & 44 deletions
@@ -1,17 +1,46 @@
1-
# SpatialTranscriptFormer
1+
# SpatialTranscriptFormer Framework
22

33
> [!WARNING]
44
> **Work in Progress**: This project is under active development. Core architectures, CLI flags, and data formats are subject to major changes.
55
6-
**SpatialTranscriptFormer** bridges histology and biological pathways through a high-performance transformer architecture. By modeling the dense interplay between morphological features and gene expression signatures, it provides an interpretable and spatially-coherent mapping of the tissue microenvironment.
6+
<!-- -->
7+
8+
> [!TIP]
9+
> **Framework Release**: SpatialTranscriptFormer has been restructured from a research codebase into a robust framework. You can now use the Python API to train on your own spatial transcriptomics data with custom backbones and architectures.
10+
11+
**SpatialTranscriptFormer** is a modular deep learning framework designed to bridge histology and biological pathways. It leverages transformer architectures to model the interplay between morphological features and gene expression signatures, providing interpretable mapping of the tissue microenvironment.
12+
13+
## Python API: Quick Start
14+
15+
The framework is designed to be integrated programmatically into your scanpy/AnnData workflows:
16+
17+
```python
18+
from spatial_transcript_former import SpatialTranscriptFormer, Predictor, FeatureExtractor
19+
from spatial_transcript_former.predict import inject_predictions
20+
21+
# 1. Initialize model and backbone
22+
model = SpatialTranscriptFormer.from_pretrained("./checkpoints/stf_small/")
23+
extractor = FeatureExtractor(backbone="phikon", device="cuda")
24+
predictor = Predictor(model, device="cuda")
25+
26+
# 2. Predict from features
27+
predictions = predictor.predict_wsi(features, coords) # (1, G)
28+
29+
# 3. Integrate with Scanpy
30+
inject_predictions(adata, coords, predictions[0], gene_names=model.gene_names)
31+
```
32+
33+
For more details, see the **[Python API Reference](docs/API.md)**.
734

835
## Key Technical Pillars
936

37+
- **Modular Architecture**: Decoupled backbones, interaction modules, and output heads.
1038
- **Quad-Flow Interaction**: Configurable attention between Pathways and Histology patches (`p2p`, `p2h`, `h2p`, `h2h`).
1139
- **Pathway Bottleneck**: Interpretable gene expression prediction via 50 MSigDB Hallmark tokens.
12-
- **Spatial Pattern Coherence**: Optimized using a composite **MSE + PCC (Pearson Correlation) loss** to prevent spatial collapse and ensure accurate morphology-expression mapping.
40+
- **Spatial Pattern Coherence**: Optimized using a composite **MSE + PCC (Pearson Correlation) loss**.
1341
- **Foundation Model Ready**: Native support for **CTransPath**, **Phikon**, **Hibou**, and **GigaPath**.
14-
- **Biologically Informed Initialization**: Gene reconstruction weights derived from known hallmark memberships.
42+
43+
---
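
The "MSE + PCC" bullet in the list above does not say how the two terms are combined. One plausible reading, shown here as a sketch rather than the repository's loss, adds a `1 - PCC` penalty to the mean squared error. The function names and the `pcc_weight` parameter are assumptions introduced for illustration.

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pearson(pred: Sequence[float], target: Sequence[float], eps: float = 1e-8) -> float:
    """Pearson correlation; eps keeps the division finite for constant inputs."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / (math.sqrt(var_p * var_t) + eps)


def composite_loss(pred: Sequence[float], target: Sequence[float],
                   pcc_weight: float = 1.0) -> float:
    """Hypothetical composite: MSE plus a weighted (1 - PCC) penalty."""
    return mse(pred, target) + pcc_weight * (1.0 - pearson(pred, target))
```

Under this formulation a constant prediction has zero correlation with the target, so it pays the full correlation penalty even when its MSE is modest. That is consistent with the removed README line's stated goal of preventing spatial collapse.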
1544

1645
## License
1746

@@ -28,76 +57,58 @@ This project is protected by a **Proprietary Source Code License**. See the [LIC
2857

2958
The core architectural innovations, including the **SpatialTranscriptFormer** interaction logic and spatial masking strategies, are the unique Intellectual Property of the author. For a detailed breakdown, see the [IP Statement](docs/IP_STATEMENT.md).
3059

60+
---
61+
3162
## Installation
3263

3364
This project requires [Conda](https://docs.conda.io/en/latest/).
3465

3566
1. Clone the repository.
3667
2. Run the automated setup script:
37-
3. On Windows: `.\setup.ps1`
68+
- On Windows: `.\setup.ps1`
3869
- On Linux/HPC: `bash setup.sh`
3970

40-
## Usage
71+
## Exemplar Recipe: HEST-1k Benchmark
4172

42-
### Dataset Access
73+
The `SpatialTranscriptFormer` repository includes a complete, out-of-the-box CLI pipeline as an exemplar for reproducing our benchmarks on the [HEST-1k dataset](https://huggingface.co/datasets/MahmoodLab/hest).
4374

44-
The model uses the **HEST1k** dataset. You can download specific subsets (by organ, technology, etc.) or the entire dataset using the `stf-download` utility:
75+
### 1. Dataset Access & Preprocessing
4576

4677
```bash
47-
# List available filtering options
48-
stf-download --list-options
49-
50-
# Download a specific subset (e.g., Breast Cancer samples from Visium)
78+
# Download a specific subset
5179
stf-download --organ Breast --disease Cancer --tech Visium --local_dir hest_data
52-
53-
# Download all human samples
54-
stf-download --species "Homo sapiens" --local_dir hest_data
5580
```
5681

57-
> [!NOTE]
58-
> The HEST dataset is gated on Hugging Face. Ensure you have accepted the terms at [MahmoodLab/hest](https://huggingface.co/datasets/MahmoodLab/hest) and are logged in via `huggingface-cli login`.
59-
60-
### Train Models
61-
62-
We provide presets for baseline models and scaled versions of the SpatialTranscriptFormer.
82+
### 2. Training with Presets
6383

6484
```bash
6585
# Recommended: Run the Interaction model (Small)
6686
python scripts/run_preset.py --preset stf_small
67-
68-
# Run the lightweight Tiny version
69-
python scripts/run_preset.py --preset stf_tiny
70-
71-
# Run baselines
72-
python scripts/run_preset.py --preset he2rna_baseline
7387
```
7488

75-
For a complete list of configurations, see the [Training Guide](docs/TRAINING_GUIDE.md).
76-
77-
### Real-Time Monitoring
78-
79-
Monitor training progress, loss curves, and **prediction variance (collapse detector)** via the web dashboard:
89+
### 3. Inference & Visualization
8090

8191
```bash
82-
python scripts/monitor.py --run-dir runs/stf_interaction_l4
92+
stf-predict --data-dir A:\hest_data --sample-id MEND29 --model-path checkpoints/best_model.pth --model-type interaction
8393
```
8494

85-
### Inference & Visualization
95+
Visualization plots and spatial expression maps will be saved to the `./results` directory. For the full guide, see the **[HEST Recipe Docs](src/spatial_transcript_former/recipes/hest/README.md)**.
8696

87-
Generate spatial maps comparing Ground Truth vs Predictions:
97+
## Documentation
8898

89-
```bash
90-
stf-predict --data-dir A:\hest_data --sample-id MEND29 --model-path checkpoints/best_model.pth --model-type interaction
91-
```
99+
### Framework APIs & Usage
92100

93-
Visualization plots will be saved to the `./results` directory.
101+
- **[Python API Reference](docs/API.md)**: Full documentation for `Trainer`, `Predictor`, and `SpatialDataset`.
102+
- **[Bring Your Own Data Guide](src/spatial_transcript_former/recipes/custom/README.md)**: Templates and examples for training on your own non-HEST spatial transcriptomics data.
103+
- **[HEST Recipe Docs](src/spatial_transcript_former/recipes/hest/README.md)**: Detailed documentation for the included HEST-1k dataset recipe.
104+
- **[Training Guide](docs/TRAINING_GUIDE.md)**: Complete list of configuration flags and preset configurations for HEST models.
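
The commit message describes `SpatialDataset` as an abstract base class whose data contract has three fields: `features`, `gene_counts`, and `rel_coords`. The sketch below shows what such a contract could look like in plain Python. It is not the framework's class: `SpatialDatasetSketch`, `load_item`, and `ToyDataset` are hypothetical names, and the real API (see `docs/API.md`) may differ in method names and tensor types.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SpatialDatasetSketch(ABC):
    """Hypothetical stand-in for a dataset contract with three required keys."""

    REQUIRED_KEYS = ("features", "gene_counts", "rel_coords")

    @abstractmethod
    def __len__(self) -> int:
        """Number of items (e.g. spots or patches) in the dataset."""

    @abstractmethod
    def load_item(self, index: int) -> Dict[str, List[float]]:
        """Return one item as a dict; subclasses implement the actual loading."""

    def __getitem__(self, index: int) -> Dict[str, List[float]]:
        if not 0 <= index < len(self):
            raise IndexError(index)
        item = self.load_item(index)
        # Enforce the contract so downstream code can rely on the keys.
        missing = [key for key in self.REQUIRED_KEYS if key not in item]
        if missing:
            raise KeyError(f"item {index} is missing keys: {missing}")
        return item


class ToyDataset(SpatialDatasetSketch):
    """Minimal in-memory example that fills every field with zeros."""

    def __init__(self, n_items: int, n_features: int, n_genes: int) -> None:
        self.n_items = n_items
        self.n_features = n_features
        self.n_genes = n_genes

    def __len__(self) -> int:
        return self.n_items

    def load_item(self, index: int) -> Dict[str, List[float]]:
        return {
            "features": [0.0] * self.n_features,   # per-patch embedding
            "gene_counts": [0.0] * self.n_genes,   # expression targets
            "rel_coords": [0.0, 0.0],              # relative (x, y) position
        }
```

Validating the keys in the base class means a custom dataset fails at the first lookup instead of deep inside a training loop.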
94105

95-
## Documentation
106+
### Theory & Interpretability
96107

97-
- [Models](docs/MODELS.md): Detailed model architectures and scaling parameters.
98-
- [Data Structure](docs/DATA_STRUCTURE.md): Organization of HEST data on disk.
99-
- [Pathway Mapping](docs/PATHWAY_MAPPING.md): Clinical interpretability and pathway integration.
100-
- [Gene Analysis](docs/GENE_ANALYSIS.md): Modeling strategies for high-dimensional gene space.
108+
- **[Models & Architecture](docs/MODELS.md)**: Deep dive into the quad-flow interaction logic and network scaling.
109+
- **[Pathway Mapping](docs/PATHWAY_MAPPING.md)**: Clinical interpretability, pathway bottleneck design, and MSigDB integration.
110+
- **[Gene Analysis](docs/GENE_ANALYSIS.md)**: Modeling strategies for mapping morphology to high-dimensional gene spaces.
111+
- **[Data Structure](docs/DATA_STRUCTURE.md)**: Detailed breakdown of the HEST data structure on disk, metadata conventions, and preprocessing invariants.
101112

102113
## Development
103114
