Commit 9f28809

refactor: remove redundant gene vocabulary loading and alignment
Transitioned the SpatialTranscriptFormer pipeline to a strictly pathway-exclusive architecture. This change eliminates the memory and I/O overhead associated with loading and aligning high-dimensional gene expression matrices during training.

Key Changes:
- Dataset: Removed all gene-level loading logic and `num_genes` parameters from HEST_Dataset and HEST_FeatureDataset.
- Alignment: Replaced gene matrix decoding with get_h5ad_valid_mask for high-performance barcode-based patch filtering.
- Engine: Updated batch unpacking to treat genes as None, focusing model training entirely on pre-computed pathway activity.
- CLI: Removed the --num-genes argument from arguments.py and cleaned up the train.py entry point.
- Cleanup: Deleted redundant gene_vocab.py, build_vocab.py, and associated CLI entry points in pyproject.toml.

Results:
- Reduced RAM usage and initialization time.
- Standardized dataloading across all HEST recipes.
- Verified stability with updated dataset test suite.
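As a rough illustration of the barcode-based patch filtering named in the commit message, the sketch below builds a boolean mask that keeps only patches whose barcode also appears among the expression file's spots. This is not the repository's `get_h5ad_valid_mask`: the function name `barcode_valid_mask`, its signature, and the sample barcodes are hypothetical, and the real helper presumably reads barcodes from the `.h5ad` file instead of taking them as arrays.

```python
import numpy as np


def barcode_valid_mask(patch_barcodes, spot_barcodes):
    """Return a boolean mask over patches whose barcode has a matching spot.

    Hypothetical helper for illustration only; only barcodes are compared,
    so no expression matrix has to be loaded or decoded.
    """
    patch_barcodes = np.asarray(patch_barcodes, dtype=str)
    spot_barcodes = np.asarray(spot_barcodes, dtype=str)
    # np.isin tests each patch barcode for membership in the spot barcodes.
    return np.isin(patch_barcodes, spot_barcodes)


# Made-up barcodes: two of the four patches have a matching spot.
patch_barcodes = ["AAAC-1", "AAAG-1", "AAAT-1", "AACC-1"]
spot_barcodes = ["AAAT-1", "AAAC-1", "GGGG-1"]

mask = barcode_valid_mask(patch_barcodes, spot_barcodes)
kept = [b for b, keep_patch in zip(patch_barcodes, mask) if keep_patch]
print(mask.tolist())
print(kept)
```

The mask follows patch order, so it can index patch features and coordinates directly.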
1 parent bf8e51c commit 9f28809

26 files changed

Lines changed: 967 additions & 739 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -283,3 +283,4 @@ global_genes_stats.csv
 *.sqlite
 HEST_v1_3_0.csv
 global_genes.json
+test_out.txt

README.md

Lines changed: 42 additions & 26 deletions
@@ -8,7 +8,7 @@
 > [!TIP]
 > **Framework Release**: SpatialTranscriptFormer has been restructured from a research codebase into a robust framework. You can now use the Python API to train on your own spatial transcriptomics data with custom backbones and architectures.
 
-**SpatialTranscriptFormer** is a modular deep learning framework designed to bridge histology and biological pathways. It leverages transformer architectures to model the interplay between morphological features and gene expression signatures, providing interpretable mapping of the tissue microenvironment.
+**SpatialTranscriptFormer** is a modular deep learning framework designed to bridge histology and biological pathways. It leverages transformer architectures to directly predict spatially-resolved **biological pathway activity scores** from H&E image patches, providing interpretable maps of the tissue microenvironment.
 
 ## Python API: Quick Start
 
@@ -28,22 +28,23 @@ predictor = Predictor(model, device="cuda")
 # coords: (N, 2) tensor of spatial coordinates (from your WSI tiling)
 features = extractor.extract_batch(image_patches, batch_size=64) # → (N, 768)
 
-# 3. Predict gene expression from extracted features
-predictions = predictor.predict_wsi(features, coords) # → (1, G)
+# 3. Predict pathway activity scores from extracted features
+predictions = predictor.predict_wsi(features, coords) # → (1, P)
 
 # 4. Integrate with Scanpy
-inject_predictions(adata, coords, predictions[0], gene_names=model.gene_names)
+inject_predictions(adata, coords, predictions[0], pathway_names=model.pathway_names)
 ```
 
 For more details, see the **[Python API Reference](docs/API.md)**.
 
 ## Key Technical Pillars
 
-- **Modular Architecture**: Decoupled backbones, interaction modules, and output heads.
+- **Modular Architecture**: Decoupled backbones, interaction modules, and pathway output heads.
 - **Quad-Flow Interaction**: Configurable attention between Pathways and Histology patches (`p2p`, `p2h`, `h2p`, `h2h`).
-- **Pathway Bottleneck**: Interpretable gene expression prediction via 50 MSigDB Hallmark tokens.
-- **Spatial Pattern Coherence**: Optimized using a composite **MSE + PCC (Pearson Correlation) loss**.
-- **Foundation Model Ready**: Native support for **CTransPath**, **Phikon**, **Hibou**, and **GigaPath**.
+- **Pathway-Exclusive Prediction**: Directly predicts biological pathway activity scores (e.g., 50 MSigDB Hallmark pathways) — no intermediate gene reconstruction step.
+- **Offline Pathway Targets**: Ground-truth pathway activities are pre-computed offline (`stf-compute-pathways`) from raw gene expression using QC → CP10k normalisation → z-score → mean pathway aggregation. This eliminates the circular auxiliary loss used in previous versions.
+- **Spatial Pattern Coherence**: Optimised using a composite **MSE + PCC (Pearson Correlation) loss**.
+- **Foundation Model Ready**: Native support for **CTransPath**, **Phikon**, **Hibou**, **PLIP**, and **GigaPath**.
 
 ---
 
@@ -61,6 +62,7 @@ This project is protected by a **Proprietary Source Code License**. See the [LIC
 ## Intellectual Property
 
 The core architectural innovations, including the **SpatialTranscriptFormer** interaction logic and spatial masking strategies, are the unique Intellectual Property of the author. For a detailed breakdown, see the [IP Statement](docs/IP_STATEMENT.md).
+
 ---
 
 ## Installation
@@ -83,20 +85,30 @@ The `SpatialTranscriptFormer` repository includes a complete, out-of-the-box CLI
 stf-download --organ Breast --disease Cancer --tech Visium --local_dir hest_data
 ```
 
-### 2. Training with Presets
+### 2. Pre-Compute Pathway Activity Targets
+
+Before training, compute the offline pathway activity matrix for each sample. This step applies per-spot QC, CP10k normalisation, and z-scoring before aggregating gene expression into MSigDB Hallmark pathway scores.
+
+```bash
+stf-compute-pathways --data-dir hest_data
+```
+
+See the **[Pathway Mapping docs](docs/PATHWAY_MAPPING.md)** for a full description of the scoring methodology and available CLI options.
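The scoring steps named in this README change (per-spot QC, CP10k normalisation, per-gene z-scoring, mean aggregation over pathway genes) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the code behind `stf-compute-pathways`: the function name `pathway_scores`, the `min_counts` QC rule, the use of the population standard deviation, and the toy genes and pathways are all hypothetical, and the real implementation may add steps (for example GMT parsing) that are not shown here.

```python
import numpy as np


def pathway_scores(counts, gene_names, pathways, min_counts=1):
    """Sketch: QC -> CP10k -> per-gene z-score -> mean over pathway genes.

    counts:   (spots, genes) raw count matrix.
    pathways: dict mapping pathway name -> list of gene symbols.
    Returns (keep, scores); scores has shape (kept_spots, n_pathways).
    """
    counts = np.asarray(counts, dtype=float)

    # QC (assumed rule): drop spots whose total count is below min_counts.
    keep = counts.sum(axis=1) >= min_counts
    x = counts[keep]

    # CP10k: rescale every spot so its counts sum to 10,000.
    x = x / x.sum(axis=1, keepdims=True) * 1e4

    # Z-score each gene across the retained spots.
    mu = x.mean(axis=0)
    sd = x.std(axis=0)
    sd[sd == 0] = 1.0  # constant genes get z = 0 instead of NaN
    z = (x - mu) / sd

    # Mean aggregation: average the z-scores of each pathway's member genes.
    index = {g: i for i, g in enumerate(gene_names)}
    columns = []
    for members in pathways.values():
        idx = [index[g] for g in members if g in index]  # skip unknown genes
        columns.append(z[:, idx].mean(axis=1) if idx else np.zeros(len(z)))
    return keep, np.stack(columns, axis=1)


# Toy data: 4 spots x 4 genes; the third spot is empty and fails QC.
counts = [[10, 0, 10, 0], [0, 10, 0, 10], [0, 0, 0, 0], [5, 5, 5, 5]]
genes = ["A", "B", "C", "D"]
pathways = {"P1": ["A", "C"], "P2": ["B", "D", "MISSING"]}

keep, scores = pathway_scores(counts, genes, pathways)
print(keep.tolist())
print(scores.shape)
```

Returning the QC mask alongside the scores lets the caller apply the same filter to patches and coordinates.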
+
+### 3. Training with Presets
 
 ```bash
 # Recommended: Run the Interaction model (Small)
 python scripts/run_preset.py --preset stf_small
 ```
 
-### 3. Inference & Visualization
+### 4. Inference & Visualization
 
 ```bash
 stf-predict --data-dir A:\hest_data --sample-id MEND29 --model-path checkpoints/best_model.pth --model-type interaction
 ```
 
-Visualization plots and spatial expression maps will be saved to the `./results` directory. For the full guide, see the **[HEST Recipe Docs](src/spatial_transcript_former/recipes/hest/README.md)**.
+Visualization plots and spatial pathway activation maps will be saved to the `./results` directory. For the full guide, see the **[HEST Recipe Docs](src/spatial_transcript_former/recipes/hest/README.md)**.
 
 ## Documentation
 
@@ -109,10 +121,9 @@ Visualization plots and spatial expression maps will be saved to the `./results`
 
 ### Theory & Interpretability
 
-- **[Models & Architecture](docs/MODELS.md)**: Deep dive into the quad-flow interaction logic and network scaling.
-- **[Pathway Mapping](docs/PATHWAY_MAPPING.md)**: Clinical interpretability, pathway bottleneck design, and MSigDB integration.
-- **[Gene Analysis](docs/GENE_ANALYSIS.md)**: Modeling strategies for mapping morphology to high-dimensional gene spaces.
-- **[Data Structure](docs/DATA_STRUCTURE.md)**: Detailed breakdown of the HEST data structure on disk, metadata conventions, and preprocessing invariants.
+- **[Models & Architecture](docs/MODELS.md)**: Deep dive into the pathway-exclusive prediction architecture, quad-flow interaction logic, and network scaling.
+- **[Pathway Mapping](docs/PATHWAY_MAPPING.md)**: Offline pathway scoring methodology, QC pipeline, and MSigDB integration.
+- **[Data Structure](docs/DATA_FORMAT.md)**: Detailed breakdown of the HEST data structure on disk, metadata conventions, and preprocessing invariants.
 
 ## Development
 
@@ -123,28 +134,33 @@ Visualization plots and spatial expression maps will be saved to the `./results`
 .\test.ps1
 ```
 
+The test suite is organised into a hierarchical directory structure under `tests/`:
+
+| Directory | Coverage Area |
+| :--- | :--- |
+| `tests/data/` | Data integrity, pathway scoring, augmentation |
+| `tests/models/` | Backbone loading, interaction logic, model compilation |
+| `tests/training/` | Loss functions, trainer loop, checkpoints, config |
+| `tests/recipes/hest/` | HEST dataset loading and splitting |
+| `tests/test_api.py` | End-to-end Python API integration |
+
 ## Development Roadmap
 
 Active research and development is tracked in the **[Research & Improvement Roadmap](docs/SC_BEST_PRACTICES.md)**. Key directions are summarised below.
 
 ### Near-term
 
-- **Vocabulary quality** — mitochondrial gene filtering (`MT-*` exclusion) and a rebuild of the gene vocabulary using SVG-weighted ranking (Moran's I), ensuring training targets are spatially informative rather than dominated by housekeeping genes.
-- **Moran's I weighted loss** — weight each gene's contribution to the training loss by its spatial variability score, so that the gradient is driven by spatially coherent genes rather than high-expression noise.
-
-### Medium-term: Architectural Reframing
+- **Extended knowledge base integration** — The offline pathway scoring step currently supports MSigDB Hallmarks via GMT files. The architecture is designed to be database-agnostic; future work will add first-class support for [decoupleR](https://decoupler-py.readthedocs.io) + [PROGENy](https://saezlab.github.io/progeny/) (Saez lab) and [LIANA+](https://liana-py.readthedocs.io) ligand-receptor databases as alternative scoring backends.
+- **Visium HD & Xenium support** — Architecturally trivial; blocked only by data availability.
 
-The current model predicts ~1000 individual gene expression values as its primary task, with pathway activity as a secondary interpretability output. Based on a review of the ST literature and the [Saezlab ecosystem](https://saezlab.org) (PROGENy, decoupleR, LIANA+), we are shifting toward:
+### Medium-term
 
-- **Pathway activity as the primary prediction target.** Spatial pathway activity maps pre-computed offline via [decoupleR](https://decoupler-py.readthedocs.io) + [PROGENy](https://saezlab.github.io/progeny/) are spatially cleaner, clinically interpretable, and directly supervised — avoiding the circular regularisation issue of the current `AuxiliaryPathwayLoss`.
-- **Gene expression as a secondary imputation head**, weighted by Moran's I.
-- **Pluggable prior knowledge.** The offline preprocessing step accepts any biological network (PROGENy signalling pathways, MSigDB Hallmarks, LIANA+ ligand-receptor pairs, CollecTRI TF regulons) without changing the model architecture.
+- **Evaluation on the 2025 Nat. Comms. benchmark suite** (11 methods, 28 metrics, 5 datasets).
+- **Pluggable scoring backends** — Allow `stf-compute-pathways` to accept any biological network (CollecTRI TF regulons, custom GMT files) without changing the model architecture.
 
 ### Longer-term
 
-- Evaluation on the 2025 Nat. Comms. benchmark suite (11 methods, 28 metrics, 5 datasets).
-- Support for higher-resolution platforms (Visium HD, Xenium) — architecturally trivial, blocked only by data availability.
-- **Clinical integration** — using predicted spatial pathway activation maps as features for patient risk assessment and prognosis tracking in an end-to-end pipeline.
+- **Clinical integration** — Using predicted spatial pathway activation maps as features for patient risk assessment and prognosis tracking in an end-to-end pipeline.
 
 > [!NOTE]
 > **Call for Collaborators:** Rigorous risk assessment models require large clinical cohorts with spatial transcriptomics and survival outcomes, which we currently lack access to. We are open to investigating *any* disease of interest. If you have access to such cohorts and are interested in exploring how spatially-resolved pathway activation correlates with patient prognosis, we would love to partner with you.
