Commit da5fe57

Feat/primary pathway loss (#9)
* README.md:
  - Remove duplicate [Research & Improvement Roadmap] link from the Theory & Interpretability section; the single canonical reference is in the Development Roadmap section.
  - Fold the standalone "Future Directions & Clinical Collaborations" section into the Development Roadmap's Longer-term bullet list and attach the collaborator callout there, eliminating the separate section and the content overlap.

  docs/SC_BEST_PRACTICES.md:
  - Simplify the two-point numbered intro to a single sentence.
  - Collapse six functional ### category headers into three groups (Vocabulary & Preprocessing / Training & Supervision / Evaluation, Scale & Tooling) so items of related priority sit together instead of being scattered across thin categories.
  - Remove the #### "Architectural direction" sub-sub-header; promote it to a bold inline lead paragraph so it reads as context rather than a nested section.
  - Remove the #### "Coverage trade-offs" sub-sub-header; replace it with a plain lead sentence before the table.
  - Update items 7–14 and the quick-reference table to reflect the decoupleR + PROGENy reframing agreed in the March 2026 research session (pathway activity as primary task, gene expression as secondary head).

  Signed-off-by: BenjaminIsaac0111 <12176376+BenjaminIsaac0111@users.noreply.github.com>

* feat(model): transition to pathway-exclusive architecture

  Redesign the SpatialTranscriptFormer to directly learn from and predict biological pathway activity scores from histology, completely removing gene-level reconstruction and auxiliary loss components.
  - Architectural Shift: The model now operates exclusively in pathway space, utilizing learnable pathway tokens as bottleneck predictors via cosine similarity with patch features.
  - Simplified Objective: Dropped the auxiliary gene reconstruction task and associated losses (ZINB, etc.) in favor of a direct pathway-level objective.
  - Refactored Loss System: Introduced a clean `CompositeLoss` (MSE + PCC) for direct pathway supervision, optimized for whole-slide spatial data.
  - Cleanup: Removed legacy PrimaryPathwayLoss and unused MaskedHuberLoss to streamline the training pipeline.

* Cleaned up and adapted tests to the new model pipeline. God this took a long time...

* chore: update .gitattributes for artefact data handling
  - Applying black formatting.

* Skipping the pathway data integrity tests, as they are typically only used for local testing.

* refactor: remove redundant gene vocabulary loading and alignment

  Transitioned the SpatialTranscriptFormer pipeline to a strictly pathway-exclusive architecture. This change eliminates the memory and I/O overhead associated with loading and aligning high-dimensional gene expression matrices during training.

  Key Changes:
  - Dataset: Removed all gene-level loading logic and `num_genes` parameters from HEST_Dataset and HEST_FeatureDataset.
  - Alignment: Replaced gene matrix decoding with get_h5ad_valid_mask for high-performance barcode-based patch filtering.
  - Engine: Updated batch unpacking to treat genes as None, focusing model training entirely on pre-computed pathway activity.
  - CLI: Removed the --num-genes argument from arguments.py and cleaned up the train.py entry point.
  - Cleanup: Deleted redundant gene_vocab.py, build_vocab.py, and associated CLI entry points in pyproject.toml.

  Results:
  - Reduced RAM usage and initialization time.
  - Standardized dataloading across all HEST recipes.
  - Verified stability with updated dataset test suite.

* Fixed the CI collection errors and implemented a local pre-commit hook to automate testing.
  - scripts/diagnose_collapse.py was still attempting to import the missing function, causing a collection error during the pytest run.
  - src/spatial_transcript_former/recipes/hest/dataset.py contained a docstring reference that was confusing the documentation/test collection.

---------

Signed-off-by: BenjaminIsaac0111 <12176376+BenjaminIsaac0111@users.noreply.github.com>
1 parent 5581172 commit da5fe57
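The commit message describes learnable pathway tokens that act as bottleneck predictors via cosine similarity with patch features. The following is a minimal NumPy sketch of that scoring step only, not the repository's implementation: the function name `cosine_pathway_scores`, the array shapes, and the random inputs are all assumptions made for illustration.

```python
import numpy as np

def cosine_pathway_scores(patch_features, pathway_tokens, eps=1e-8):
    """Cosine similarity between every patch and every pathway token.

    patch_features: (N, D) array, one row per histology patch (assumed shape).
    pathway_tokens: (P, D) array, one row per pathway token (assumed shape).
    Returns an (N, P) array of scores in [-1, 1].
    """
    # L2-normalise each row; eps guards against division by zero.
    f = patch_features / (np.linalg.norm(patch_features, axis=1, keepdims=True) + eps)
    t = pathway_tokens / (np.linalg.norm(pathway_tokens, axis=1, keepdims=True) + eps)
    # Dot products of unit vectors are cosine similarities.
    return f @ t.T

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 768))  # 6 patches with 768-dim features (toy data)
tokens = rng.normal(size=(50, 768))   # 50 pathway tokens (toy data)
scores = cosine_pathway_scores(features, tokens)
print(scores.shape)
```

In a trained model the tokens would be learned parameters rather than random draws; the sketch only shows how a patch-by-pathway score matrix falls out of the similarity computation.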

48 files changed

Lines changed: 2177 additions & 4192 deletions

.gitattributes

Lines changed: 20 additions & 0 deletions
@@ -1,2 +1,22 @@
+# Force LF line endings for all text files
 * text=auto eol=lf

+# Explicitly handle Python files
+*.py text eol=lf
+
+# Handle configuration files
+*.yaml text eol=lf
+*.json text eol=lf
+*.toml text eol=lf
+*.md text eol=lf
+
+# Mark data artifacts as binary to prevent corruption
+*.csv binary
+*.sqlite binary
+*.h5 binary
+*.pth binary
+*.pt binary
+*.pkl binary
+
+# Large data and logs should definitely be binary
+*.log text eol=lf

.gitignore

Lines changed: 67 additions & 2 deletions
@@ -182,9 +182,9 @@ cython_debug/
 .abstra/

 # Visual Studio Code
-# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
+# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
 # that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
-# and can be added to the global gitignore or merged into this file. However, if you prefer,
+# and can be added to the global gitignore or merged into this file. However, if you prefer,
 # you could uncomment the following to ignore the entire vscode folder
 # .vscode/

@@ -213,9 +213,74 @@ checkpoints/
 checkpoints_full/
 results/
 results_long_run/
+runs/
+vscode/
+.agent/

 # Images
 *.png
 *.jpg
 *.jpeg
 *.svg
+docs/research_proposal.tex
+.vscode/settings.json
+hest_data/.gitattributes
+hest_data/HEST_v1_3_0.csv
+hest_data/README.md
+hest_data/cellvit_seg/INT1_cellvit_seg.geojson.zip
+hest_data/cellvit_seg/INT1_cellvit_seg.parquet
+hest_data/cellvit_seg/INT10_cellvit_seg.geojson.zip
+hest_data/cellvit_seg/INT10_cellvit_seg.parquet
+hest_data/cellvit_seg/INT11_cellvit_seg.geojson.zip
+hest_data/cellvit_seg/INT11_cellvit_seg.parquet
+hest_data/cellvit_seg/INT12_cellvit_seg.geojson.zip
+hest_data/cellvit_seg/INT12_cellvit_seg.parquet
+hest_data/cellvit_seg/INT13_cellvit_seg.geojson.zip
+hest_data/cellvit_seg/INT13_cellvit_seg.parquet
+hest_data/cellvit_seg/INT16_cellvit_seg.geojson.zip
+hest_data/cellvit_seg/INT16_cellvit_seg.parquet
+hest_data/cellvit_seg/INT19_cellvit_seg.geojson.zip
+hest_data/cellvit_seg/INT19_cellvit_seg.parquet
+hest_data/cellvit_seg/INT20_cellvit_seg.geojson.zip
+hest_data/cellvit_seg/INT20_cellvit_seg.parquet
+hest_data/cellvit_seg/INT21_cellvit_seg.geojson.zip
+hest_data/cellvit_seg/INT21_cellvit_seg.parquet
+hest_data/cellvit_seg/INT1_cellvit_seg.geojson
+hest_data/cellvit_seg/INT10_cellvit_seg.geojson
+hest_data/cellvit_seg/INT11_cellvit_seg.geojson
+hest_data/cellvit_seg/INT12_cellvit_seg.geojson
+hest_data/cellvit_seg/INT13_cellvit_seg.geojson
+hest_data/cellvit_seg/INT16_cellvit_seg.geojson
+hest_data/cellvit_seg/INT19_cellvit_seg.geojson
+hest_data/cellvit_seg/INT20_cellvit_seg.geojson
+hest_data/cellvit_seg/INT21_cellvit_seg.geojson
+hest_data/cellvit_seg/TENX175_cellvit_seg.geojson
+hest_data/cellvit_seg/TENX175_cellvit_seg.geojson.zip
+hest_data/cellvit_seg/TENX175_cellvit_seg.parquet
+hest_data/metadata/TENX175.json
+hest_data/patches/TENX175.h5
+hest_data/st/TENX175.h5ad
+hest_data/tissue_seg/TENX175_contours.geojson
+hest_data/wsis/TENX175.tif
+.idea/.gitignore
+.idea/csv-editor.xml
+.idea/deployment.xml
+.idea/jupyter-settings.xml
+.idea/misc.xml
+.idea/modules.xml
+.idea/SpatialTranscriptFormer.iml
+.idea/vcs.xml
+.idea/inspectionProfiles/profiles_settings.xml
+.idea/inspectionProfiles/Project_Default.xml
+.idea/runConfigurations/STF_Compute_Pathways.xml
+.idea/runConfigurations/STF_Train_PrimaryPathway.xml
+.gemini/settings.json
+.gemini/agents/literature-search.md
+.gemini/agents/test-triage.md
+
+# Large Data Artifacts
+global_genes_stats.csv
+*.sqlite
+HEST_v1_3_0.csv
+global_genes.json
+test_out.txt

.pre-commit-config.yaml

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+# See https://pre-commit.com for more information
+# See https://pre-commit.com/hooks.html for more hooks
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v4.5.0
+    hooks:
+      - id: trailing-whitespace
+      - id: end-of-file-fixer
+      - id: check-yaml
+      - id: check-added-large-files
+
+  - repo: https://github.com/psf/black
+    rev: 24.2.0
+    hooks:
+      - id: black
+        language_version: python3
+
+  - repo: local
+    hooks:
+      - id: pytest
+        name: pytest
+        entry: conda run -n SpatialTranscriptFormer --no-capture-output python -m pytest
+        language: system
+        pass_filenames: false
+        always_run: true

README.md

Lines changed: 42 additions & 26 deletions
@@ -8,7 +8,7 @@
 > [!TIP]
 > **Framework Release**: SpatialTranscriptFormer has been restructured from a research codebase into a robust framework. You can now use the Python API to train on your own spatial transcriptomics data with custom backbones and architectures.

-**SpatialTranscriptFormer** is a modular deep learning framework designed to bridge histology and biological pathways. It leverages transformer architectures to model the interplay between morphological features and gene expression signatures, providing interpretable mapping of the tissue microenvironment.
+**SpatialTranscriptFormer** is a modular deep learning framework designed to bridge histology and biological pathways. It leverages transformer architectures to directly predict spatially-resolved **biological pathway activity scores** from H&E image patches, providing interpretable maps of the tissue microenvironment.

 ## Python API: Quick Start

@@ -28,22 +28,23 @@ predictor = Predictor(model, device="cuda")
 # coords: (N, 2) tensor of spatial coordinates (from your WSI tiling)
 features = extractor.extract_batch(image_patches, batch_size=64) # → (N, 768)

-# 3. Predict gene expression from extracted features
-predictions = predictor.predict_wsi(features, coords) # → (1, G)
+# 3. Predict pathway activity scores from extracted features
+predictions = predictor.predict_wsi(features, coords) # → (1, P)

 # 4. Integrate with Scanpy
-inject_predictions(adata, coords, predictions[0], gene_names=model.gene_names)
+inject_predictions(adata, coords, predictions[0], pathway_names=model.pathway_names)
 ```

 For more details, see the **[Python API Reference](docs/API.md)**.

 ## Key Technical Pillars

-- **Modular Architecture**: Decoupled backbones, interaction modules, and output heads.
+- **Modular Architecture**: Decoupled backbones, interaction modules, and pathway output heads.
 - **Quad-Flow Interaction**: Configurable attention between Pathways and Histology patches (`p2p`, `p2h`, `h2p`, `h2h`).
-- **Pathway Bottleneck**: Interpretable gene expression prediction via 50 MSigDB Hallmark tokens.
-- **Spatial Pattern Coherence**: Optimized using a composite **MSE + PCC (Pearson Correlation) loss**.
-- **Foundation Model Ready**: Native support for **CTransPath**, **Phikon**, **Hibou**, and **GigaPath**.
+- **Pathway-Exclusive Prediction**: Directly predicts biological pathway activity scores (e.g., 50 MSigDB Hallmark pathways) — no intermediate gene reconstruction step.
+- **Offline Pathway Targets**: Ground-truth pathway activities are pre-computed offline (`stf-compute-pathways`) from raw gene expression using QC → CP10k normalisation → z-score → mean pathway aggregation. This eliminates the circular auxiliary loss used in previous versions.
+- **Spatial Pattern Coherence**: Optimised using a composite **MSE + PCC (Pearson Correlation) loss**.
+- **Foundation Model Ready**: Native support for **CTransPath**, **Phikon**, **Hibou**, **PLIP**, and **GigaPath**.

 ---
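The README names a composite MSE + PCC loss without spelling out its form. One plausible reading, sketched below in NumPy, adds mean squared error to one minus the mean per-pathway Pearson correlation across spots. The function name `composite_loss`, the `pcc_weight` parameter, and the `1 - PCC` formulation are assumptions; the repository's `CompositeLoss` may weight or reduce the terms differently.

```python
import numpy as np

def composite_loss(pred, target, pcc_weight=1.0, eps=1e-8):
    """MSE plus (1 - mean per-pathway Pearson correlation across spots).

    pred, target: (N, P) arrays of pathway scores for N spots and P pathways.
    pcc_weight is a hypothetical knob balancing the two terms.
    """
    mse = np.mean((pred - target) ** 2)
    # Centre each pathway column, then correlate predictions with targets across spots.
    pc = pred - pred.mean(axis=0, keepdims=True)
    tc = target - target.mean(axis=0, keepdims=True)
    cov = (pc * tc).sum(axis=0)
    denom = np.sqrt((pc ** 2).sum(axis=0) * (tc ** 2).sum(axis=0)) + eps
    pcc = cov / denom  # one correlation value per pathway
    return mse + pcc_weight * (1.0 - pcc.mean())

rng = np.random.default_rng(0)
target = rng.normal(size=(100, 5))
loss_perfect = composite_loss(target, target)    # identical maps
loss_inverted = composite_loss(-target, target)  # anti-correlated maps
print(loss_perfect, loss_inverted)
```

The correlation term rewards getting the spatial pattern right even when the scale is off, while the MSE term anchors absolute values; that is the usual motivation for combining the two.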

@@ -61,6 +62,7 @@ This project is protected by a **Proprietary Source Code License**. See the [LIC
 ## Intellectual Property

 The core architectural innovations, including the **SpatialTranscriptFormer** interaction logic and spatial masking strategies, are the unique Intellectual Property of the author. For a detailed breakdown, see the [IP Statement](docs/IP_STATEMENT.md).
+
 ---

 ## Installation
@@ -83,20 +85,30 @@ The `SpatialTranscriptFormer` repository includes a complete, out-of-the-box CLI
 stf-download --organ Breast --disease Cancer --tech Visium --local_dir hest_data
 ```

-### 2. Training with Presets
+### 2. Pre-Compute Pathway Activity Targets
+
+Before training, compute the offline pathway activity matrix for each sample. This step applies per-spot QC, CP10k normalisation, and z-scoring before aggregating gene expression into MSigDB Hallmark pathway scores.
+
+```bash
+stf-compute-pathways --data-dir hest_data
+```
+
+See the **[Pathway Mapping docs](docs/PATHWAY_MAPPING.md)** for a full description of the scoring methodology and available CLI options.
+
+### 3. Training with Presets

 ```bash
 # Recommended: Run the Interaction model (Small)
 python scripts/run_preset.py --preset stf_small
 ```

-### 3. Inference & Visualization
+### 4. Inference & Visualization

 ```bash
 stf-predict --data-dir A:\hest_data --sample-id MEND29 --model-path checkpoints/best_model.pth --model-type interaction
 ```

-Visualization plots and spatial expression maps will be saved to the `./results` directory. For the full guide, see the **[HEST Recipe Docs](src/spatial_transcript_former/recipes/hest/README.md)**.
+Visualization plots and spatial pathway activation maps will be saved to the `./results` directory. For the full guide, see the **[HEST Recipe Docs](src/spatial_transcript_former/recipes/hest/README.md)**.

 ## Documentation

@@ -109,10 +121,9 @@ Visualization plots and spatial expression maps will be saved to the `./results`

 ### Theory & Interpretability

-- **[Models & Architecture](docs/MODELS.md)**: Deep dive into the quad-flow interaction logic and network scaling.
-- **[Pathway Mapping](docs/PATHWAY_MAPPING.md)**: Clinical interpretability, pathway bottleneck design, and MSigDB integration.
-- **[Gene Analysis](docs/GENE_ANALYSIS.md)**: Modeling strategies for mapping morphology to high-dimensional gene spaces.
-- **[Data Structure](docs/DATA_STRUCTURE.md)**: Detailed breakdown of the HEST data structure on disk, metadata conventions, and preprocessing invariants.
+- **[Models & Architecture](docs/MODELS.md)**: Deep dive into the pathway-exclusive prediction architecture, quad-flow interaction logic, and network scaling.
+- **[Pathway Mapping](docs/PATHWAY_MAPPING.md)**: Offline pathway scoring methodology, QC pipeline, and MSigDB integration.
+- **[Data Structure](docs/DATA_FORMAT.md)**: Detailed breakdown of the HEST data structure on disk, metadata conventions, and preprocessing invariants.

 ## Development

@@ -123,28 +134,33 @@ Visualization plots and spatial expression maps will be saved to the `./results`
 .\test.ps1
 ```

+The test suite is organised into a hierarchical directory structure under `tests/`:
+
+| Directory | Coverage Area |
+| :--- | :--- |
+| `tests/data/` | Data integrity, pathway scoring, augmentation |
+| `tests/models/` | Backbone loading, interaction logic, model compilation |
+| `tests/training/` | Loss functions, trainer loop, checkpoints, config |
+| `tests/recipes/hest/` | HEST dataset loading and splitting |
+| `tests/test_api.py` | End-to-end Python API integration |
+
 ## Development Roadmap

 Active research and development is tracked in the **[Research & Improvement Roadmap](docs/SC_BEST_PRACTICES.md)**. Key directions are summarised below.

 ### Near-term

-- **Vocabulary quality** — mitochondrial gene filtering (`MT-*` exclusion) and a rebuild of the gene vocabulary using SVG-weighted ranking (Moran's I), ensuring training targets are spatially informative rather than dominated by housekeeping genes.
-- **Moran's I weighted loss** — weight each gene's contribution to the training loss by its spatial variability score, so that the gradient is driven by spatially coherent genes rather than high-expression noise.
-
-### Medium-term: Architectural Reframing
+- **Extended knowledge base integration** — The offline pathway scoring step currently supports MSigDB Hallmarks via GMT files. The architecture is designed to be database-agnostic; future work will add first-class support for [decoupleR](https://decoupler-py.readthedocs.io) + [PROGENy](https://saezlab.github.io/progeny/) (Saez lab) and [LIANA+](https://liana-py.readthedocs.io) ligand-receptor databases as alternative scoring backends.
+- **Visium HD & Xenium support** — Architecturally trivial; blocked only by data availability.

-The current model predicts ~1000 individual gene expression values as its primary task, with pathway activity as a secondary interpretability output. Based on a review of the ST literature and the [Saezlab ecosystem](https://saezlab.org) (PROGENy, decoupleR, LIANA+), we are shifting toward:
+### Medium-term

-- **Pathway activity as the primary prediction target.** Spatial pathway activity maps pre-computed offline via [decoupleR](https://decoupler-py.readthedocs.io) + [PROGENy](https://saezlab.github.io/progeny/) are spatially cleaner, clinically interpretable, and directly supervised — avoiding the circular regularisation issue of the current `AuxiliaryPathwayLoss`.
-- **Gene expression as a secondary imputation head**, weighted by Moran's I.
-- **Pluggable prior knowledge.** The offline preprocessing step accepts any biological network (PROGENy signalling pathways, MSigDB Hallmarks, LIANA+ ligand-receptor pairs, CollecTRI TF regulons) without changing the model architecture.
+- **Evaluation on the 2025 Nat. Comms. benchmark suite** (11 methods, 28 metrics, 5 datasets).
+- **Pluggable scoring backends** — Allow `stf-compute-pathways` to accept any biological network (CollecTRI TF regulons, custom GMT files) without changing the model architecture.

 ### Longer-term

-- Evaluation on the 2025 Nat. Comms. benchmark suite (11 methods, 28 metrics, 5 datasets).
-- Support for higher-resolution platforms (Visium HD, Xenium) — architecturally trivial, blocked only by data availability.
-- **Clinical integration** — using predicted spatial pathway activation maps as features for patient risk assessment and prognosis tracking in an end-to-end pipeline.
+- **Clinical integration** — Using predicted spatial pathway activation maps as features for patient risk assessment and prognosis tracking in an end-to-end pipeline.

 > [!NOTE]
 > **Call for Collaborators:** Rigorous risk assessment models require large clinical cohorts with spatial transcriptomics and survival outcomes, which we currently lack access to. We are open to investigating *any* disease of interest. If you have access to such cohorts and are interested in exploring how spatially-resolved pathway activation correlates with patient prognosis, we would love to partner with you.
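The README describes the offline target computation as CP10k normalisation, then z-scoring, then mean pathway aggregation. A minimal NumPy sketch of those three steps follows, assuming QC has already been applied upstream. The function name `pathway_scores`, the toy gene names, and the toy gene sets are hypothetical; the actual `stf-compute-pathways` command may order or parameterise the steps differently.

```python
import numpy as np

def pathway_scores(counts, gene_names, gene_sets, eps=1e-8):
    """Aggregate raw counts into per-spot pathway scores.

    counts: (spots, genes) raw UMI counts.
    gene_sets: dict mapping a pathway name to a list of gene symbols.
    Returns (pathway_names, scores) with scores of shape (spots, pathways).
    """
    counts = np.asarray(counts, dtype=float)
    # CP10k: rescale every spot to a total of 10,000 counts.
    totals = counts.sum(axis=1, keepdims=True)
    cp10k = counts / np.maximum(totals, 1.0) * 1e4
    # z-score every gene across spots.
    z = (cp10k - cp10k.mean(axis=0)) / (cp10k.std(axis=0) + eps)
    index = {g: i for i, g in enumerate(gene_names)}
    names, cols = [], []
    for name, genes in gene_sets.items():
        idx = [index[g] for g in genes if g in index]  # skip genes absent from the matrix
        if idx:
            names.append(name)
            cols.append(z[:, idx].mean(axis=1))  # mean z-score over member genes
    if not cols:
        return names, np.empty((counts.shape[0], 0))
    return names, np.stack(cols, axis=1)

# Toy example: 3 spots, 4 genes, 2 hypothetical gene sets.
counts = [[10, 0, 5, 5], [0, 10, 5, 5], [5, 5, 10, 0]]
genes = ["A", "B", "C", "D"]
sets = {"SET_AB": ["A", "B"], "SET_CD": ["C", "D", "NOT_MEASURED"]}
names, scores = pathway_scores(counts, genes, sets)
print(names, scores.shape)
```

Because every gene is z-scored across spots before averaging, each pathway column is centred on zero for the sample, which keeps pathways with very different expression levels on a comparable scale.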

config.yaml

Lines changed: 9 additions & 2 deletions
@@ -11,9 +11,16 @@ training:
   num_genes: 1000
   batch_size: 8
   learning_rate: 0.0001
-  output_dir: "./checkpoints"
-
+  output_dir: "./runs"
+
 # MSigDB Pathway Settings
 pathways:
   default_collection: "hallmarks"
   cache_dir: ".cache"
+
+# Quality Control Defaults
+qc:
+  min_umis: 500
+  min_genes: 200
+  max_mt: 0.15
+  min_pathways: 25
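The new `qc` block reads like a set of per-spot filtering thresholds. A hedged NumPy sketch of how such thresholds might be applied is shown below. The interpretation is an assumption: `min_umis` as minimum total counts per spot, `min_genes` as minimum detected genes per spot, and `max_mt` as the maximum fraction of counts from `MT-` prefixed genes. `min_pathways` is left out because the diff does not say what it measures, and the function name `qc_mask` is hypothetical.

```python
import numpy as np

# Thresholds copied from the qc block; their semantics here are assumed.
QC = {"min_umis": 500, "min_genes": 200, "max_mt": 0.15}

def qc_mask(counts, gene_names, qc=QC):
    """Return a boolean mask of spots that pass all three thresholds."""
    counts = np.asarray(counts)
    umis = counts.sum(axis=1)                    # total UMIs per spot
    genes_detected = (counts > 0).sum(axis=1)    # genes with at least one count
    is_mt = np.array([g.upper().startswith("MT-") for g in gene_names])
    mt_frac = counts[:, is_mt].sum(axis=1) / np.maximum(umis, 1)
    return (
        (umis >= qc["min_umis"])
        & (genes_detected >= qc["min_genes"])
        & (mt_frac <= qc["max_mt"])
    )

# Toy data: one mitochondrial gene plus 299 ordinary genes.
gene_names = ["MT-CO1"] + [f"G{i}" for i in range(299)]
counts = np.zeros((3, 300), dtype=int)
counts[0, 1:] = 2    # 598 UMIs, 299 genes, no mitochondrial counts
counts[1, 1:] = 1    # only 299 UMIs
counts[2, 1:] = 2
counts[2, 0] = 400   # roughly 40% mitochondrial counts
mask = qc_mask(counts, gene_names)
print(mask)
```

Only the first toy spot clears all three thresholds; the second has too few UMIs and the third exceeds the mitochondrial fraction limit.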
