[Feature] Support .obsm arrays in SingleCellMemMapDataset for spatial transcriptomics #1577

@YonghaoZhao722

Problem & Motivation

Spatial transcriptomics datasets (Visium, Xenium, MERFISH) use the same H5AD format as scRNA-seq, so SingleCellMemMapDataset already handles the expression matrix correctly. However, per-cell spatial coordinates stored in .obsm['spatial'] are silently dropped during H5AD → SCDL conversion. As a result, SCDL cannot serve as the data backbone for spatial foundation model (FM) training (e.g. Nicheformer) unless a separate coordinate lookup is maintained outside the dataset.

More broadly, .obsm can hold other useful per-cell arrays (e.g. embeddings, batch corrections) that downstream models may need. None of these are currently preserved.

BioNeMo Framework Version

All

Category

API/Interface

Proposed Solution

Add an optional obsm_keys parameter to SingleCellMemMapDataset:

SingleCellMemMapDataset(
    data_path="...",
    h5ad_path="visium.h5ad",
    obsm_keys=["spatial"],
)

For each key, extract adata.obsm[key] (a dense [n_cells, d] array) and persist it as an additional memmap file alongside the existing CSR arrays. On __getitem__, the matching row of each requested array is returned together with the expression vector. The feature is strictly opt-in: behavior is unchanged for users who do not pass obsm_keys.
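To make the persistence step concrete, here is a minimal sketch of writing and reopening per-key dense arrays with np.memmap. It is not SCDL code: the helper names write_obsm_memmaps and open_obsm_memmaps, the obsm_<key>.bin file naming, and the obsm_meta.json sidecar are all hypothetical, and a plain dict stands in for adata.obsm.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def write_obsm_memmaps(obsm, keys, out_dir):
    """Write each requested obsm array to a raw memmap file plus a JSON sidecar.

    Hypothetical helper: file names and sidecar layout are illustrative only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    meta = {}
    for key in keys:
        arr = np.asarray(obsm[key])
        if arr.ndim != 2:
            raise ValueError(f"obsm[{key!r}] must be 2-D, got shape {arr.shape}")
        mm = np.memmap(out_dir / f"obsm_{key}.bin", dtype=arr.dtype, mode="w+", shape=arr.shape)
        mm[:] = arr
        mm.flush()
        # A raw memmap stores no header, so dtype and shape are recorded separately.
        meta[key] = {"dtype": arr.dtype.str, "shape": list(arr.shape)}
    (out_dir / "obsm_meta.json").write_text(json.dumps(meta))


def open_obsm_memmaps(out_dir):
    """Reopen the arrays read-only, using the sidecar for dtype and shape."""
    out_dir = Path(out_dir)
    meta = json.loads((out_dir / "obsm_meta.json").read_text())
    return {
        key: np.memmap(
            out_dir / f"obsm_{key}.bin",
            dtype=np.dtype(info["dtype"]),
            mode="r",
            shape=tuple(info["shape"]),
        )
        for key, info in meta.items()
    }


# A plain dict stands in for adata.obsm in this sketch.
obsm = {"spatial": np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], dtype=np.float32)}
with tempfile.TemporaryDirectory() as tmp:
    write_obsm_memmaps(obsm, ["spatial"], tmp)
    arrays = open_obsm_memmaps(tmp)
    row = np.array(arrays["spatial"][1])  # copy the row for cell 1 out of the memmap
    del arrays  # release the memmaps before the directory is removed
print(row)  # [3. 4.]
```

Because the arrays are dense with a fixed row width, a per-cell lookup is a single row index, with no CSR pointer arithmetic needed. How dtype and shape would actually be recorded depends on the header spec raised in question 1 below.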

A few open questions I'd want to align on before writing a PR:

  1. Schema versioning — new arrays would require a version bump (currently v0.0.9). Is there a preferred pattern for backward-compatible extensions in the header spec?
  2. __getitem__ return type — should multi-modal output use a dict/NamedTuple, or follow the pattern from the existing return_features discussion in [Feature] SCDL: Return additional metadata for item #1083?
  3. SingleCellCollection — should obsm_keys propagate when aggregating multiple datasets?
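For question 2, one of the options could look like the sketch below. CellItem and make_item are hypothetical names, and NumPy arrays stand in for tensors. The point is only that a two-field NamedTuple still supports positional unpacking while also exposing named fields.

```python
from typing import Dict, NamedTuple

import numpy as np


class CellItem(NamedTuple):
    """Hypothetical return type: expression values plus per-key obsm rows."""

    expression: np.ndarray
    obsm: Dict[str, np.ndarray]


def make_item(expression, obsm_arrays, idx):
    """Bundle one cell's expression vector with row idx of each obsm array."""
    rows = {key: np.asarray(arr[idx]) for key, arr in obsm_arrays.items()}
    return CellItem(expression=expression, obsm=rows)


spatial = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
item = make_item(np.array([0.0, 5.0, 0.0]), {"spatial": spatial}, idx=1)

# Positional unpacking and named access both work on the same object.
expression, metadata = item
print(metadata["spatial"])  # [3. 4.]
print(item.obsm["spatial"] is metadata["spatial"])  # True
```

A plain dict return would be more flexible if further fields are added later, at the cost of losing positional unpacking.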

Happy to implement this if it fits the roadmap.

Expected Benefits

Enables SCDL to serve as the data layer for spatial transcriptomics foundation model training without requiring workarounds outside the dataset class. Aligns SCDL more closely with the full AnnData API surface.

Code Example

from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset

ds = SingleCellMemMapDataset(
    data_path="visium_scdl",
    h5ad_path="visium.h5ad",
    obsm_keys=["spatial"],
)

# returns (expression_tensor, {"spatial": tensor([x, y])})
expression, metadata = ds[0]
