* - Introducing initial inference API. Subject to change; will likely move to separate concerns and disentangle the main project from the HEST1k data, treating HEST1k more as a development dataset for the project.
* feat: restructure project into a robust framework with Trainer API and Data Loader class template.
This commit refactors the codebase from an experiment-focused structure into a general-purpose framework, to encourage users to develop their own models and pipelines. Key changes include the definition of a clear data contract, the introduction of a high-level Trainer class, and the isolation of HEST-specific logic into a recipe.
Core Changes:
- Introduced SpatialDataset abstract base class to define a standard data contract (features, gene_counts, rel_coords) for all spatial transcriptomics datasets.
- Implemented a high-level Trainer class to orchestrate the training lifecycle, including LR scheduling (warmup + cosine), AMP, and checkpointing.
- Added a flexible callback system to the Trainer, including a built-in EarlyStoppingCallback.
- Created a recipes/hest namespace to isolate HEST-specific dataset logic and utilities, maintaining backward compatibility through re-export facades.
- Added a "Bring Your Own Data" (BYOD) guide and template for custom datasets.
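As a rough illustration of the data contract described above, the sketch below shows one way a dataset could expose `features`, `gene_counts`, and `rel_coords` per spot. The names `SpatialDatasetSketch` and `InMemorySpots` are hypothetical, plain lists stand in for tensors, and treating `rel_coords` as centroid-relative coordinates is an assumption; the project's actual `SpatialDataset` may differ.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple

# One sample: (features, gene_counts, rel_coords). Plain lists stand in for tensors.
Sample = Tuple[List[float], List[float], Tuple[float, float]]


class SpatialDatasetSketch(ABC):
    """Hypothetical stand-in for the SpatialDataset contract; not the real class."""

    @abstractmethod
    def __len__(self) -> int:
        """Number of spots in the dataset."""

    @abstractmethod
    def __getitem__(self, index: int) -> Sample:
        """Return (features, gene_counts, rel_coords) for one spot."""


class InMemorySpots(SpatialDatasetSketch):
    """Toy implementation backed by Python lists, for illustration only."""

    def __init__(self, features, gene_counts, coords):
        if not coords:
            raise ValueError("dataset must contain at least one spot")
        if not (len(features) == len(gene_counts) == len(coords)):
            raise ValueError("features, gene_counts and coords must have equal length")
        self._features = [list(f) for f in features]
        self._gene_counts = [list(g) for g in gene_counts]
        # Assumption: rel_coords are offsets from the slide centroid.
        cx = sum(x for x, _ in coords) / len(coords)
        cy = sum(y for _, y in coords) / len(coords)
        self._rel_coords = [(x - cx, y - cy) for x, y in coords]

    def __len__(self) -> int:
        return len(self._features)

    def __getitem__(self, index: int) -> Sample:
        return self._features[index], self._gene_counts[index], self._rel_coords[index]


dataset = InMemorySpots(
    features=[[0.1, 0.2], [0.3, 0.4]],
    gene_counts=[[5.0, 0.0, 1.0], [2.0, 3.0, 0.0]],
    coords=[(10.0, 20.0), (30.0, 40.0)],
)
```

Because the abstract base only fixes what each sample contains, training code written against it does not need to know whether the data came from HEST or from a custom source.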
API & DX:
- Exposed Trainer and SpatialDataset in the top-level package for easier access.
- Standardized training engine functions (train_one_epoch, validate) to be agnostic to specific data sources.
- Comprehensive unit tests added for the Trainer lifecycle, callbacks, and resumption.
- Updated documentation (API.md) with detailed Training API and BYOD sections.
Verified with 166 passing tests across the full suite.
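For readers unfamiliar with the "warmup + cosine" schedule named in the Trainer bullet, a minimal, dependency-free sketch follows. The function name, its parameters, and the exact ramp shape are assumptions for illustration; the Trainer's own scheduler may be implemented differently.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: 100 steps in total, the first 10 of which are warmup.
schedule = [warmup_cosine_lr(s, 100, 10, base_lr=1e-3) for s in range(101)]
```

The warmup phase avoids large updates while optimizer statistics are still noisy, and the cosine phase lowers the rate smoothly toward `min_lr` by the final step.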
* docs: clarify licensing and add third-party attributions
- Add code-level attribution in backbones.py for the foundation models.
* - Minor update to the docs for clarification on the API and Dataset Recipes.
* - Reformatted the 6 files flagged by the CI/CD check (black formatter). Resolves the formatting failures.
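The callback system mentioned in the Trainer commit above can be pictured with a small patience-based early-stopping sketch. The class name `EarlyStoppingSketch`, the `on_epoch_end` hook, and the `patience`/`min_delta` parameters are hypothetical; they are not taken from the project's `EarlyStoppingCallback`.

```python
class EarlyStoppingSketch:
    """Hypothetical patience-based early stopping; not the project's callback."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def on_epoch_end(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the patience counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, callback):
    """Simulate a training loop and report how many epochs ran before stopping."""
    for epoch, loss in enumerate(val_losses, start=1):
        if callback.on_epoch_end(loss):
            return epoch
    return len(val_losses)


stopped_at = epochs_run([1.0, 0.8, 0.9, 0.85, 0.81, 0.7], EarlyStoppingSketch(patience=3))
```

In this toy run the loss fails to beat 0.8 for three consecutive epochs, so the loop stops before reaching the final value.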
README.md: 55 additions & 44 deletions
@@ -1,17 +1,46 @@
-# SpatialTranscriptFormer
+# SpatialTranscriptFormer Framework

 > [!WARNING]
 > **Work in Progress**: This project is under active development. Core architectures, CLI flags, and data formats are subject to major changes.

-**SpatialTranscriptFormer** bridges histology and biological pathways through a high-performance transformer architecture. By modeling the dense interplay between morphological features and gene expression signatures, it provides an interpretable and spatially-coherent mapping of the tissue microenvironment.
+<!---->
+
+> [!TIP]
+> **Framework Release**: SpatialTranscriptFormer has been restructured from a research codebase into a robust framework. You can now use the Python API to train on your own spatial transcriptomics data with custom backbones and architectures.
+
+**SpatialTranscriptFormer** is a modular deep learning framework designed to bridge histology and biological pathways. It leverages transformer architectures to model the interplay between morphological features and gene expression signatures, providing interpretable mapping of the tissue microenvironment.
+
+## Python API: Quick Start
+
+The framework is designed to be integrated programmatically into your scanpy/AnnData workflows:
+
+```python
+from spatial_transcript_former import SpatialTranscriptFormer, Predictor, FeatureExtractor
+from spatial_transcript_former.predict import inject_predictions
+
+# 1. Initialize model and backbone
+model = SpatialTranscriptFormer.from_pretrained("./checkpoints/stf_small/")
-- **Spatial Pattern Coherence**: Optimized using a composite **MSE + PCC (Pearson Correlation) loss** to prevent spatial collapse and ensure accurate morphology-expression mapping.
+- **Spatial Pattern Coherence**: Optimized using a composite **MSE + PCC (Pearson Correlation) loss**.
 - **Foundation Model Ready**: Native support for **CTransPath**, **Phikon**, **Hibou**, and **GigaPath**.
-- **Biologically Informed Initialization**: Gene reconstruction weights derived from known hallmark memberships.
+
+---

 ## License

@@ -28,76 +57,58 @@ This project is protected by a **Proprietary Source Code License**. See the [LIC

 The core architectural innovations, including the **SpatialTranscriptFormer** interaction logic and spatial masking strategies, are the unique Intellectual Property of the author. For a detailed breakdown, see the [IP Statement](docs/IP_STATEMENT.md).

+---
+
 ## Installation

 This project requires [Conda](https://docs.conda.io/en/latest/).

 1. Clone the repository.
 2. Run the automated setup script:
-3. On Windows: `.\setup.ps1`
+- On Windows: `.\setup.ps1`
 - On Linux/HPC: `bash setup.sh`

-## Usage
+## Exemplar Recipe: HEST-1k Benchmark

-### Dataset Access
+The `SpatialTranscriptFormer` repository includes a complete, out-of-the-box CLI pipeline as an exemplar for reproducing our benchmarks on the [HEST-1k dataset](https://huggingface.co/datasets/MahmoodLab/hest).

-The model uses the **HEST1k** dataset. You can download specific subsets (by organ, technology, etc.) or the entire dataset using the `stf-download` utility:
+### 1. Dataset Access & Preprocessing

 ```bash
-# List available filtering options
-stf-download --list-options
-
-# Download a specific subset (e.g., Breast Cancer samples from Visium)
+# Download a specific subset
 stf-download --organ Breast --disease Cancer --tech Visium --local_dir hest_data
> The HEST dataset is gated on Hugging Face. Ensure you have accepted the terms at [MahmoodLab/hest](https://huggingface.co/datasets/MahmoodLab/hest) and are logged in via `huggingface-cli login`.
-
-### Train Models
-
-We provide presets for baseline models and scaled versions of the SpatialTranscriptFormer.
Visualization plots and spatial expression maps will be saved to the `./results` directory. For the full guide, see the **[HEST Recipe Docs](src/spatial_transcript_former/recipes/hest/README.md)**.

-Generate spatial maps comparing Ground Truth vs Predictions:
Visualization plots will be saved to the `./results` directory.
+- **[Python API Reference](docs/API.md)**: Full documentation for `Trainer`, `Predictor`, and `SpatialDataset`.
+- **[Bring Your Own Data Guide](src/spatial_transcript_former/recipes/custom/README.md)**: Templates and examples for training on your own non-HEST spatial transcriptomics data.
+- **[HEST Recipe Docs](src/spatial_transcript_former/recipes/hest/README.md)**: Detailed documentation for the included HEST-1k dataset recipe.
+- **[Training Guide](docs/TRAINING_GUIDE.md)**: Complete list of configuration flags and preset configurations for HEST models.

-##Documentation
+### Theory & Interpretability

-- [Models](docs/MODELS.md): Detailed model architectures and scaling parameters.
-- [Data Structure](docs/DATA_STRUCTURE.md): Organization of HEST data on disk.
-- [Pathway Mapping](docs/PATHWAY_MAPPING.md): Clinical interpretability and pathway integration.
-- [Gene Analysis](docs/GENE_ANALYSIS.md): Modeling strategies for high-dimensional gene space.
+- **[Models & Architecture](docs/MODELS.md)**: Deep dive into the quad-flow interaction logic and network scaling.
+- **[Pathway Mapping](docs/PATHWAY_MAPPING.md)**: Clinical interpretability, pathway bottleneck design, and MSigDB integration.
+- **[Gene Analysis](docs/GENE_ANALYSIS.md)**: Modeling strategies for mapping morphology to high-dimensional gene spaces.
+- **[Data Structure](docs/DATA_STRUCTURE.md)**: Detailed breakdown of the HEST data structure on disk, metadata conventions, and preprocessing invariants.
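
The composite MSE + PCC loss named in the README feature list above can be illustrated with a small, dependency-free sketch for a single gene measured across spots. The function names, the `1 - PCC` form of the correlation term, and the `pcc_weight` parameter are assumptions; the repository's actual loss (presumably tensor-based) may weight or aggregate the terms differently.

```python
import math


def pearson(pred, target, eps=1e-8):
    """Pearson correlation between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    # eps keeps the ratio finite when either input is constant.
    return cov / (math.sqrt(var_p * var_t) + eps)


def mse_pcc_loss(pred, target, pcc_weight=1.0):
    """MSE plus pcc_weight * (1 - PCC); this weighting is an assumption."""
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return mse + pcc_weight * (1.0 - pearson(pred, target))


target = [1.0, 2.0, 3.0]
perfect = mse_pcc_loss([1.0, 2.0, 3.0], target)    # matches values and pattern
collapsed = mse_pcc_loss([2.0, 2.0, 2.0], target)  # predicts the mean everywhere
```

A prediction that outputs the mean at every spot keeps the MSE moderate but has zero correlation with the target, so the correlation term adds a penalty on top; this is the "spatial collapse" failure that the removed README wording says the composite loss is meant to prevent.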