Problem & Motivation
Spatial transcriptomics datasets (Visium, Xenium, MERFISH) use the same H5AD format as scRNA-seq, so SingleCellMemMapDataset already handles the expression matrix correctly. However, per-cell spatial coordinates stored in .obsm['spatial'] are silently dropped during H5AD → SCDL conversion. This makes it impossible to use SCDL as the data backbone for spatial FM training (e.g. Nicheformer) without maintaining a separate coordinate lookup outside the dataset.
More broadly, .obsm can hold other useful per-cell arrays (e.g. embeddings, batch corrections) that downstream models may need. None of these are currently preserved.
BioNeMo Framework Version
All
Category
API/Interface
Proposed Solution
Add an optional obsm_keys parameter to SingleCellMemMapDataset:
For each key, extract adata.obsm[key] (a dense [n_cells, d] array) and persist it as an additional memmap file alongside the existing CSR arrays. On __getitem__, the corresponding row is returned with the expression vector. Strictly opt-in — no behavior change for users who do not pass obsm_keys.
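To make the persistence step concrete, here is a minimal sketch using only NumPy. It is not SCDL code: the helper names `write_obsm` and `load_obsm`, the `obsm_<key>.npy` file naming, and the use of the `.npy` memmap format are all assumptions for illustration, and a plain dict of arrays stands in for `adata.obsm`. The real implementation would presumably follow SCDL's existing memmap and header conventions instead.

```python
import tempfile
from pathlib import Path

import numpy as np


def write_obsm(obsm, obsm_keys, out_dir):
    """Persist each requested per-cell array as its own memmap-backed .npy file.

    Hypothetical helper: file naming and format are illustrative assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for key in obsm_keys:
        arr = np.asarray(obsm[key])  # expected to be dense [n_cells, d]
        if arr.ndim != 2:
            raise ValueError(f"obsm[{key!r}] must be 2-D, got shape {arr.shape}")
        mm = np.lib.format.open_memmap(
            str(out_dir / f"obsm_{key}.npy"),
            mode="w+",
            dtype=arr.dtype,
            shape=arr.shape,
        )
        mm[:] = arr   # copy the array into the file-backed buffer
        mm.flush()    # make sure the data reaches disk
        del mm


def load_obsm(out_dir, obsm_keys):
    """Reopen the saved arrays read-only without loading them into memory."""
    return {
        key: np.load(str(Path(out_dir) / f"obsm_{key}.npy"), mmap_mode="r")
        for key in obsm_keys
    }


# Toy data: 5 cells with 2-D spatial coordinates.
rng = np.random.default_rng(0)
obsm = {"spatial": rng.random((5, 2)).astype(np.float32)}

with tempfile.TemporaryDirectory() as tmp:
    write_obsm(obsm, ["spatial"], tmp)
    loaded = load_obsm(tmp, ["spatial"])
    row = np.array(loaded["spatial"][3])  # copy one cell's row out of the memmap
    del loaded  # release the memmap before the temp directory is removed
```

Only the requested row is read from disk on access, which is the property that matters for large datasets. Keys containing characters that are unsafe in file names would need sanitizing, which this sketch skips.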
A few open questions I'd want to align on before writing a PR:
__getitem__ return type: should multi-modal output use a dict/NamedTuple, or follow the pattern from the existing return_features discussion in [Feature] SCDL: Return additional metadata for item #1083?
Schema versioning: new arrays would require a version bump (currently v0.0.9). Is there a preferred pattern for backward-compatible extensions in the header spec?
SingleCellCollection: should obsm_keys propagate when aggregating multiple datasets?
Happy to implement this if it fits the roadmap.
Expected Benefits
Enables SCDL to serve as the data layer for spatial transcriptomics foundation model training without requiring workarounds outside the dataset class. Aligns SCDL more closely with the full AnnData API surface.
Code Example
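The following is a rough sketch of one option for the `__getitem__` return type raised in the open questions: a NamedTuple carrying the sparse expression row plus the requested obsm rows. `ToyDataset` and `CellSample` are hypothetical stand-ins written for this example, not the real SingleCellMemMapDataset, and the CSR layout (data, indices, indptr) is assumed for illustration only.

```python
from typing import Dict, NamedTuple, Optional, Sequence

import numpy as np


class CellSample(NamedTuple):
    """Hypothetical per-cell return type (one of the options under discussion)."""
    values: np.ndarray        # non-zero expression values for the cell
    col_indices: np.ndarray   # gene (column) indices of those values
    obsm: Dict[str, np.ndarray]  # one row per requested obsm key


class ToyDataset:
    """Toy stand-in that stores a CSR matrix plus opt-in per-cell arrays."""

    def __init__(self, data, indices, indptr,
                 obsm: Optional[Dict[str, np.ndarray]] = None,
                 obsm_keys: Optional[Sequence[str]] = None):
        self.data, self.indices, self.indptr = data, indices, indptr
        self.obsm_keys = list(obsm_keys) if obsm_keys else []
        self.obsm = {k: np.asarray(obsm[k]) for k in self.obsm_keys}
        n_cells = len(indptr) - 1
        for key, arr in self.obsm.items():
            if arr.shape[0] != n_cells:
                raise ValueError(f"obsm[{key!r}] has {arr.shape[0]} rows, expected {n_cells}")

    def __len__(self):
        return len(self.indptr) - 1

    def __getitem__(self, idx):
        start, end = self.indptr[idx], self.indptr[idx + 1]
        extras = {k: arr[idx] for k, arr in self.obsm.items()}
        return CellSample(self.data[start:end], self.indices[start:end], extras)


# 3 cells x 4 genes in CSR form; cell 1 has no non-zero entries.
data = np.array([1.0, 2.0, 3.0])
indices = np.array([0, 2, 1])
indptr = np.array([0, 2, 2, 3])
spatial = np.array([[0.0, 0.0], [1.0, 5.0], [2.0, 7.0]])

ds = ToyDataset(data, indices, indptr, obsm={"spatial": spatial}, obsm_keys=["spatial"])
sample = ds[2]

# Without obsm_keys the extras dict is simply empty.
plain = ToyDataset(data, indices, indptr)[0]
```

Note that adding a third field changes how the tuple unpacks, so if existing callers unpack the result positionally, the real implementation would need a different shape (for example, returning the extra field only when obsm_keys is passed) to stay strictly opt-in.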