refactor: remove redundant gene vocabulary loading and alignment

Transitioned the SpatialTranscriptFormer pipeline to a strictly
pathway-exclusive architecture. This change eliminates the memory
and I/O overhead associated with loading and aligning high-dimensional
gene expression matrices during training.

Key Changes:
- Dataset: Removed all gene-level loading logic and `num_genes`
  parameters from HEST_Dataset and HEST_FeatureDataset.
- Alignment: Replaced gene matrix decoding with get_h5ad_valid_mask
  for high-performance barcode-based patch filtering.
- Engine: Updated batch unpacking to treat genes as None, focusing
  model training entirely on pre-computed pathway activity.
- CLI: Removed the --num-genes argument from arguments.py and
  cleaned up the train.py entry point.
- Cleanup: Deleted redundant gene_vocab.py, build_vocab.py, and
  associated CLI entry points in pyproject.toml.

Results:
- Reduced RAM usage and initialization time.
- Standardized dataloading across all HEST recipes.
- Verified stability with updated dataset test suite.
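
The barcode-based patch filtering mentioned under "Alignment" can be pictured as a set-membership test between patch barcodes and the barcodes that have pre-computed pathway scores. The sketch below is only an illustration under that assumption: `valid_patch_mask` and the example barcodes are hypothetical, and the actual signature and behaviour of `get_h5ad_valid_mask` are not shown in this commit.

```python
import numpy as np


def valid_patch_mask(patch_barcodes, scored_barcodes):
    """Return a boolean mask over patches whose barcode has a score.

    Hypothetical helper for illustration; not the repository's
    get_h5ad_valid_mask.
    """
    patch_barcodes = np.asarray(patch_barcodes, dtype=str)
    scored_barcodes = np.asarray(scored_barcodes, dtype=str)
    # Membership test only: no expression matrix is decoded or aligned.
    return np.isin(patch_barcodes, scored_barcodes)


# Made-up barcodes: the second patch has no pathway score and is dropped.
patches = ["AAAC-1", "AAAG-1", "AAAT-1"]
scored = ["AAAT-1", "AAAC-1"]
mask = valid_patch_mask(patches, scored)
kept = [b for b, keep in zip(patches, mask) if keep]
```

Because only barcode strings are compared, a filter of this kind never needs the gene expression matrix in memory, which is consistent with the reduced RAM usage reported under "Results".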
README.md: 42 additions & 26 deletions
@@ -8,7 +8,7 @@
 > [!TIP]
 > **Framework Release**: SpatialTranscriptFormer has been restructured from a research codebase into a robust framework. You can now use the Python API to train on your own spatial transcriptomics data with custom backbones and architectures.

-**SpatialTranscriptFormer** is a modular deep learning framework designed to bridge histology and biological pathways. It leverages transformer architectures to model the interplay between morphological features and gene expression signatures, providing interpretable mapping of the tissue microenvironment.
+**SpatialTranscriptFormer** is a modular deep learning framework designed to bridge histology and biological pathways. It leverages transformer architectures to directly predict spatially-resolved **biological pathway activity scores** from H&E image patches, providing interpretable maps of the tissue microenvironment.
+- **Offline Pathway Targets**: Ground-truth pathway activities are pre-computed offline (`stf-compute-pathways`) from raw gene expression using QC → CP10k normalisation → z-score → mean pathway aggregation. This eliminates the circular auxiliary loss used in previous versions.
+- **Spatial Pattern Coherence**: Optimised using a composite **MSE + PCC (Pearson Correlation) loss**.
+- **Foundation Model Ready**: Native support for **CTransPath**, **Phikon**, **Hibou**, **PLIP**, and **GigaPath**.

 ---

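
The composite **MSE + PCC** objective named in the feature list above can be pictured with a small NumPy sketch. Everything specific here is an assumption made for illustration: the function name `mse_pcc_loss`, the `pcc_weight` parameter, and the choice to compute the Pearson correlation per pathway across spots. The repository's loss may weight or reduce the terms differently.

```python
import numpy as np


def mse_pcc_loss(pred, target, pcc_weight=1.0, eps=1e-8):
    """MSE plus (1 - mean per-pathway Pearson correlation).

    pred, target: arrays of shape (n_spots, n_pathways).
    Illustrative sketch only; names and weighting are assumptions.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)

    # Pointwise term: penalises errors in absolute pathway scores.
    mse = np.mean((pred - target) ** 2)

    # Pattern term: correlation across spots, one value per pathway.
    pc = pred - pred.mean(axis=0, keepdims=True)
    tc = target - target.mean(axis=0, keepdims=True)
    cov = (pc * tc).sum(axis=0)
    denom = np.sqrt((pc ** 2).sum(axis=0) * (tc ** 2).sum(axis=0)) + eps
    pcc = cov / denom

    return mse + pcc_weight * (1.0 - pcc.mean())


# Toy data: 50 spots, 4 pathways.
rng = np.random.default_rng(0)
target = rng.normal(size=(50, 4))
perfect = mse_pcc_loss(target, target)
inverted = mse_pcc_loss(-target, target)
```

The MSE term anchors the scale of the predictions, while the correlation term rewards reproducing the spatial pattern of each pathway even when the scale is off, which is presumably why the README groups the loss under "Spatial Pattern Coherence".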
@@ -61,6 +62,7 @@ This project is protected by a **Proprietary Source Code License**. See the [LIC
 ## Intellectual Property

 The core architectural innovations, including the **SpatialTranscriptFormer** interaction logic and spatial masking strategies, are the unique Intellectual Property of the author. For a detailed breakdown, see the [IP Statement](docs/IP_STATEMENT.md).
+
 ---

 ## Installation
@@ -83,20 +85,30 @@ The `SpatialTranscriptFormer` repository includes a complete, out-of-the-box CLI
 stf-download --organ Breast --disease Cancer --tech Visium --local_dir hest_data
 ```

-### 2. Training with Presets
+### 2. Pre-Compute Pathway Activity Targets
+
+Before training, compute the offline pathway activity matrix for each sample. This step applies per-spot QC, CP10k normalisation, and z-scoring before aggregating gene expression into MSigDB Hallmark pathway scores.
+
+```bash
+stf-compute-pathways --data-dir hest_data
+```
+
+See the **[Pathway Mapping docs](docs/PATHWAY_MAPPING.md)** for a full description of the scoring methodology and available CLI options.
-Visualization plots and spatial expression maps will be saved to the `./results` directory. For the full guide, see the **[HEST Recipe Docs](src/spatial_transcript_former/recipes/hest/README.md)**.
+Visualization plots and spatial pathway activation maps will be saved to the `./results` directory. For the full guide, see the **[HEST Recipe Docs](src/spatial_transcript_former/recipes/hest/README.md)**.

 ## Documentation

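
The scoring recipe described under "Pre-Compute Pathway Activity Targets" (per-spot QC, CP10k normalisation, z-scoring, then mean aggregation per pathway) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the `stf-compute-pathways` implementation: the function name `pathway_scores`, the `min_counts` QC threshold, and the toy gene sets are all hypothetical.

```python
import numpy as np


def pathway_scores(counts, gene_names, gene_sets, min_counts=1):
    """Score pathways per spot: QC -> CP10k -> z-score -> mean over genes.

    counts: (n_spots, n_genes) raw counts.
    gene_sets: dict mapping pathway name -> list of gene names.
    Illustrative sketch; the QC rule and names are assumptions.
    """
    counts = np.asarray(counts, dtype=float)

    # Per-spot QC: drop spots with too few total counts (assumed rule).
    keep = counts.sum(axis=1) >= min_counts
    x = counts[keep]

    # CP10k: rescale every spot to 10,000 total counts.
    x = x / x.sum(axis=1, keepdims=True) * 1e4

    # Z-score each gene across spots; constant genes map to zero.
    sd = x.std(axis=0)
    z = (x - x.mean(axis=0)) / np.where(sd > 0, sd, 1.0)

    # Mean aggregation over the member genes present in the matrix.
    index = {g: i for i, g in enumerate(gene_names)}
    names = list(gene_sets)
    scores = np.zeros((x.shape[0], len(names)))
    for j, name in enumerate(names):
        cols = [index[g] for g in gene_sets[name] if g in index]
        if cols:
            scores[:, j] = z[:, cols].mean(axis=1)
    return scores, keep, names


# Toy input: 4 spots, 3 genes; the third spot is empty and fails QC.
counts = [[10, 0, 10], [0, 10, 10], [0, 0, 0], [5, 5, 10]]
genes = ["A", "B", "C"]
sets = {"P1": ["A", "B"], "P2": ["C", "Z"]}  # "Z" is absent and ignored
scores, keep, names = pathway_scores(counts, genes, sets)
```

Since the targets depend only on the raw counts and the gene sets, they can be computed once per sample and reused for every training run, which is what makes the gene-level loading removed in this commit redundant.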
@@ -109,10 +121,9 @@ Visualization plots and spatial expression maps will be saved to the `./results`

 ### Theory & Interpretability

-- **[Models & Architecture](docs/MODELS.md)**: Deep dive into the quad-flow interaction logic and network scaling.
-- **[Pathway Mapping](docs/PATHWAY_MAPPING.md)**: Clinical interpretability, pathway bottleneck design, and MSigDB integration.
-- **[Gene Analysis](docs/GENE_ANALYSIS.md)**: Modeling strategies for mapping morphology to high-dimensional gene spaces.
-- **[Data Structure](docs/DATA_STRUCTURE.md)**: Detailed breakdown of the HEST data structure on disk, metadata conventions, and preprocessing invariants.
+- **[Models & Architecture](docs/MODELS.md)**: Deep dive into the pathway-exclusive prediction architecture, quad-flow interaction logic, and network scaling.
+- **[Data Structure](docs/DATA_FORMAT.md)**: Detailed breakdown of the HEST data structure on disk, metadata conventions, and preprocessing invariants.

 ## Development

@@ -123,28 +134,33 @@ Visualization plots and spatial expression maps will be saved to the `./results`
 .\test.ps1
 ```

+The test suite is organised into a hierarchical directory structure under `tests/`:
+
+| Directory | Coverage Area |
+| :--- | :--- |
+| `tests/data/` | Data integrity, pathway scoring, augmentation |
+| `tests/models/` | Backbone loading, interaction logic, model compilation |
+| `tests/training/` | Loss functions, trainer loop, checkpoints, config |
+| `tests/recipes/hest/` | HEST dataset loading and splitting |
+| `tests/test_api.py` | End-to-end Python API integration |
+
 ## Development Roadmap

 Active research and development is tracked in the **[Research & Improvement Roadmap](docs/SC_BEST_PRACTICES.md)**. Key directions are summarised below.

 ### Near-term

-- **Vocabulary quality** — mitochondrial gene filtering (`MT-*` exclusion) and a rebuild of the gene vocabulary using SVG-weighted ranking (Moran's I), ensuring training targets are spatially informative rather than dominated by housekeeping genes.
-- **Moran's I weighted loss** — weight each gene's contribution to the training loss by its spatial variability score, so that the gradient is driven by spatially coherent genes rather than high-expression noise.
-
-### Medium-term: Architectural Reframing
+- **Extended knowledge base integration** — The offline pathway scoring step currently supports MSigDB Hallmarks via GMT files. The architecture is designed to be database-agnostic; future work will add first-class support for [decoupleR](https://decoupler-py.readthedocs.io) + [PROGENy](https://saezlab.github.io/progeny/) (Saez lab) and [LIANA+](https://liana-py.readthedocs.io) ligand-receptor databases as alternative scoring backends.
+- **Visium HD & Xenium support** — Architecturally trivial; blocked only by data availability.

-The current model predicts ~1000 individual gene expression values as its primary task, with pathway activity as a secondary interpretability output. Based on a review of the ST literature and the [Saezlab ecosystem](https://saezlab.org) (PROGENy, decoupleR, LIANA+), we are shifting toward:
+### Medium-term

-- **Pathway activity as the primary prediction target.** Spatial pathway activity maps pre-computed offline via [decoupleR](https://decoupler-py.readthedocs.io) + [PROGENy](https://saezlab.github.io/progeny/) are spatially cleaner, clinically interpretable, and directly supervised — avoiding the circular regularisation issue of the current `AuxiliaryPathwayLoss`.
-- **Gene expression as a secondary imputation head**, weighted by Moran's I.
-- **Pluggable prior knowledge.** The offline preprocessing step accepts any biological network (PROGENy signalling pathways, MSigDB Hallmarks, LIANA+ ligand-receptor pairs, CollecTRI TF regulons) without changing the model architecture.
+- **Evaluation on the 2025 Nat. Comms. benchmark suite** (11 methods, 28 metrics, 5 datasets).
+- **Pluggable scoring backends** — Allow `stf-compute-pathways` to accept any biological network (CollecTRI TF regulons, custom GMT files) without changing the model architecture.

 ### Longer-term

-- Evaluation on the 2025 Nat. Comms. benchmark suite (11 methods, 28 metrics, 5 datasets).
-- Support for higher-resolution platforms (Visium HD, Xenium) — architecturally trivial, blocked only by data availability.
-- **Clinical integration** — using predicted spatial pathway activation maps as features for patient risk assessment and prognosis tracking in an end-to-end pipeline.
+- **Clinical integration** — Using predicted spatial pathway activation maps as features for patient risk assessment and prognosis tracking in an end-to-end pipeline.

 > [!NOTE]
 > **Call for Collaborators:** Rigorous risk assessment models require large clinical cohorts with spatial transcriptomics and survival outcomes, which we currently lack access to. We are open to investigating *any* disease of interest. If you have access to such cohorts and are interested in exploring how spatially-resolved pathway activation correlates with patient prognosis, we would love to partner with you.