BayraktarLab
diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml
Lines changed: 7 additions & 3 deletions
diff --git a/.github/ISSUE_TEMPLATE/question.md b/.github/ISSUE_TEMPLATE/question.md
Lines changed: 4 additions & 2 deletions
diff --git a/README.md b/README.md
Lines changed: 143 additions & 8 deletions
diff --git a/cell2location/__init__.py b/cell2location/__init__.py
Lines changed: 2 additions & 4 deletions
diff --git a/cell2location/cluster_averages/cluster_averages.py b/cell2location/cluster_averages/cluster_averages.py
Lines changed: 38 additions & 15 deletions
@@ -1,5 +1,9 @@
 blank_issues_enabled: false
 contact_links:
-  - name: Cell2location Community Discussions
-    url: https://github.com/BayraktarLab/cell2location/discussions
-    about: Ask how to solve your problem using cell2location.
+  - name: scverse Discourse
+    url: https://discourse.scverse.org/c/ecosytem/cell2location/
+    about: Ask usage questions about cell2location and other scvi-tools packages.
+
+  - name: cell2location Community Discussions [deprecated]
+    url: https://github.com/BayraktarLab/cell2location/discussions
+    about: Find previous answers/issues. For new questions please use the link above.
@@ -1,11 +1,13 @@
 ---
 name: Usage Question
-about: Ask how to solve your problem using cell2location.
+about: Template for posting a question to scverse Discourse.
 title: ''
 labels: question
 assignees: ''
 ---
 
+## Please use the template below to post a question to https://discourse.scverse.org/c/ecosytem/cell2location/
+
 ### Problem
 
 <!-- Please describe your problem below: -->
@@ -14,7 +16,7 @@ assignees: ''
 - [ ] I follow the instructions from the [cell2location tutorial (using on scvi-tools)](https://cell2location.readthedocs.io/en/latest/notebooks/cell2location_tutorial.html).
 - [ ] I have adjusted required hyperparameters to my dataset and tissue `N_cells_per_location` and `detection_alpha`.
 - [ ] I have provided 10X reaction/inlet as `batch_key` for reference NB regression.
-- [ ] I have checked [Cell2location Community Forum](https://github.com/BayraktarLab/cell2location/discussions), [scvi-tools forum](https://discourse.scvi-tools.org/) and did not find a solution
+- [ ] I have checked [scverse Discourse](https://discourse.scverse.org/c/ecosytem/cell2location/) and [old Cell2location Community Forum](https://github.com/BayraktarLab/cell2location/discussions), and did not find a solution.
 
 
 ### Description of the data input and hyperparameters
 
@@ -15,6 +15,8 @@ If you use cell2location please cite our paper:
 Kleshchevnikov, V., Shmatko, A., Dann, E. et al. Cell2location maps fine-grained cell types in spatial transcriptomics. Nat Biotechnol (2022). https://doi.org/10.1038/s41587-021-01139-4
 https://www.nature.com/articles/s41587-021-01139-4
 
+Please note that cell2location requires two user-provided hyperparameters (`N_cells_per_location` and `detection_alpha`). For detailed guidance on setting these hyperparameters and their impact see [the flow diagram and the note](https://github.com/BayraktarLab/cell2location/blob/master/docs/images/Note_on_selecting_hyperparameters.pdf). Many real datasets (especially human) show within-slide variability in RNA detection sensitivity, so you may need to try both recommended settings of the `detection_alpha` parameter: `detection_alpha=200` for low within-slide technical variability and `detection_alpha=20` for high within-slide technical variability.
+
 Cell2location is a principled Bayesian model that can resolve fine-grained cell types in spatial transcriptomic data and create comprehensive cellular maps of diverse tissues. Cell2location accounts for technical sources of variation and borrows statistical strength across locations, thereby enabling the integration of single cell and spatial transcriptomics with higher sensitivity and resolution than existing tools. This is achieved by estimating which combination of cell types in which cell abundance could have given the mRNA counts in the spatial data, while modelling technical effects (platform/technology effect, contaminating RNA, unexplained variance).
 
 <p align="center">
@@ -24,11 +26,9 @@ Overview of the spatial mapping approach and the workflow enabled by cell2locati
 
 ## Usage and Tutorials
 
-The tutorial covering the estimation of expresson signatures of reference cell types, spatial mapping with cell2location and the downstream analysis can be found here: https://cell2location.readthedocs.io/en/latest/
-
-You can also try cell2location on [Google Colab](https://colab.research.google.com/github/BayraktarLab/cell2location/blob/master/docs/notebooks/cell2location_tutorial.ipynb) on a smaller data subset containing somatosensory cortex.
+The tutorial covering the estimation of expression signatures of reference cell types, spatial mapping with cell2location and the downstream analysis can be found at https://cell2location.readthedocs.io/en/latest/ and tried on [Google Colab](https://colab.research.google.com/github/BayraktarLab/cell2location/blob/master/docs/notebooks/cell2location_tutorial.ipynb).
 
-Please report bugs via https://github.com/BayraktarLab/cell2location/issues and ask any usage questions in https://github.com/BayraktarLab/cell2location/discussions.
+Please report bugs via https://github.com/BayraktarLab/cell2location/issues and ask any usage questions about [cell2location](https://discourse.scverse.org/c/ecosytem/cell2location/42), [scvi-tools](https://discourse.scverse.org/c/help/scvi-tools/7) or [Visium data](https://discourse.scverse.org/c/general/visium/32) on the scverse community Discourse.
 
 Cell2location package is implemented in a general way (using https://pyro.ai/ and https://scvi-tools.org/) to support multiple related models - both for spatial mapping, estimating reference cell type signatures and downstream analysis.
 
@@ -61,10 +61,10 @@ bash Miniconda3-latest-Linux-x86_64.sh
 # use prefix /path/to/software/miniconda3
 ```
 
-Before installing cell2location and it's dependencies, it could be necessary to make sure that you are creating a fully isolated conda environment by telling python to NOT use user site for installing packages, ideally by adding this line to your `~/.bashrc` file , but this would also work during a terminal session:
+Before installing cell2location and its dependencies, it may be necessary to make sure that you are creating a fully isolated conda environment by telling python NOT to use the user site for installing packages. Run this line before creating the conda environment and every time before activating the conda environment in a new terminal session:
 
 ```bash
-export PYTHONNOUSERSITE="someletters"
+export PYTHONNOUSERSITE="literallyanyletters"
 ```
 
 
@@ -79,12 +79,147 @@ Cell2location architecture is designed to simplify extended versions of the mode
 We thank all paper authors for their contributions:
 Vitalii Kleshchevnikov, Artem Shmatko, Emma Dann, Alexander Aivazidis, Hamish W King, Tong Li, Artem Lomakin, Veronika Kedlian, Mika Sarkin Jain, Jun Sung Park, Lauma Ramona, Liz Tuck, Anna Arutyunyan, Roser Vento-Tormo, Moritz Gerstung, Louisa James, Oliver Stegle, Omer Ali Bayraktar
 
-We also thank Krzysztof Polanski, Luz Garcia Alonso, Carlos Talavera-Lopez, Ni Huang for feedback on the package, Martin Prete for dockerising cell2location and other software support.
+We also thank Pyro developers (Fritz Obermeyer, Martin Jankowiak), Krzysztof Polanski, Luz Garcia Alonso, Carlos Talavera-Lopez, Ni Huang for feedback on the package, Martin Prete for dockerising cell2location and other software support.
 
 ## FAQ
 
 See https://github.com/BayraktarLab/cell2location/discussions
 
 ## Future development and experimental features
+Future developments of cell2location are focused on 1) scalability to 100k to 1 million+ locations using amortised inference of cell abundance (the same idea as used in VAEs), 2) extending cell2location to related spatial analysis tasks that require modification of the model (such as using cell type hierarchy information), and 3) incorporating features presented by more recently proposed methods (such as CAR spatial proximity modelling). We are also experimenting with Numpyro and JAX (https://github.com/vitkl/cell2location_numpyro).
+
+## Tips
+
+### Conda environment for A100 GPUs
+
+```bash
+export PYTHONNOUSERSITE="literallyanyletters"
+conda create -y -n test_scvi16_cuda113 python=3.9
+conda activate test_scvi16_cuda113
+conda install -y -c anaconda hdf5 pytables git
+pip install scvi-tools
+pip install git+https://github.com/BayraktarLab/cell2location.git#egg=cell2location[tutorials]
+pip3 install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 -f https://download.pytorch.org/whl/torch_stable.html
+conda activate test_scvi16_cuda113
+python -m ipykernel install --user --name=test_scvi16_cuda113 --display-name='Environment (test_scvi16_cuda113)'
+```
+
+### Issues with package version mismatches often originate from python user site rather than conda environment being used to install a subset of packages
+
+Before installing cell2location and its dependencies, it may be necessary to make sure that you are creating a fully isolated conda environment by telling python NOT to use the user site for installing packages. Run this line before creating the conda environment and every time before activating the conda environment in a new terminal session:
 
-We also provide an experimental numpyro translation of the model which has improved memory efficiency (allowing analysis of multiple Visium samples on Google Colab) and minor improvements in speed - https://github.com/vitkl/cell2location_numpyro. You can try it on Google Colab [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vitkl/cell2location_numpyro/blob/main/docs/notebooks/cell2location_short_demo_colab.ipynb) - however note that both numpyro itself and cell2location_numpyro are in very active development. Numpyro+JAX are being introduces into scvi-tools so follow updates on that.
+```bash
+export PYTHONNOUSERSITE="literallyanyletters"
+```
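To confirm that the setting took effect, the interpreter can be inspected from inside the activated environment. The sketch below uses only the Python standard library; the helper name `user_site_status` is hypothetical and not part of cell2location.

```python
import os
import site
import sys


def user_site_status():
    """Report whether this interpreter will import packages from the user site."""
    return {
        # value of the environment variable, or None when it is not set
        "PYTHONNOUSERSITE": os.environ.get("PYTHONNOUSERSITE"),
        # True when the user site was disabled (-s flag or PYTHONNOUSERSITE)
        "no_user_site_flag": bool(sys.flags.no_user_site),
        # True, False or None, as reported by the site module
        "enable_user_site": site.ENABLE_USER_SITE,
        # directory that user-level installs would go to
        "user_site_dir": site.getusersitepackages(),
    }


if __name__ == "__main__":
    for key, value in user_site_status().items():
        print(f"{key}: {value}")
```

With the variable exported, `no_user_site_flag` is expected to be `True`; if it is `False`, packages from the user site directory can still shadow those in the conda environment.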
+
+### Useful code for reading and combining multiple Visium sections
+
+Keeping info on distinct sections in a csv file (Google Sheet).
+
+```python
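+import os
+from glob import glob
+from re import sub
+
+import pandas as pd
+
+# sp_data_folder is the path prefix of the folder with Visium outputs (defined by the user)
+sample_annot = pd.read_csv('./sample_annot.csv')
+
+# map each Sample_ID to its folder by stripping the prefix and suffix from the folder name
+paths = glob(f'{sp_data_folder}*')
+sample_annot['path'] = pd.Series(
+    paths,
+    index=[sub('^.+WTSI_', '', sub('_GRCh38-2020-A$', '', i)) for i in paths]
+)[sample_annot['Sample_ID']].values
+sample_annot['file'] = [os.path.basename(i) for i in sample_annot['path']]
+
+sample_annot['Sample_ID'].unique()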
+
+Reading and concatenating samples.
+
+```python
+import scanpy as sc
+
+def read_and_qc(sample_name, file, path=sp_data_folder,
+                count_file='filtered_feature_bc_matrix.h5'):
+    """
+    Read one Visium file and add minimum metadata and QC metrics to adata.obs
+    NOTE: var_names is ENSEMBL ID as it should be, you can always plot with sc.pl.scatter(gene_symbols='SYMBOL')
+    """
+    
+    adata = sc.read_visium(path + str(file) +'/',
+                           count_file=count_file,
+                           load_images=True)
+    adata.obs['sample'] = sample_name
+    adata.var['SYMBOL'] = adata.var_names
+    adata.var.rename(columns={'gene_ids': 'ENSEMBL'}, inplace=True)
+    adata.var_names = adata.var['ENSEMBL']
+    adata.var.drop(columns='ENSEMBL', inplace=True)
+    
+    # just in case there are non-unique ENSEMBL IDs
+    adata.var_names_make_unique()
+
+    # Calculate QC metrics
+    sc.pp.calculate_qc_metrics(adata, inplace=True)
+    # case-insensitive match covers both human ('MT-') and mouse ('mt-') gene symbols
+    adata.var['mt'] = [gene.upper().startswith('MT-') for gene in adata.var['SYMBOL']]
+    adata.obs['mt_frac'] = adata[:, adata.var['mt'].tolist()].X.sum(1).A.squeeze()/adata.obs['total_counts']
+    
+    # add sample name to obs names
+    adata.obs["sample"] = [str(i) for i in adata.obs['sample']]
+    adata.obs_names = 's' + adata.obs["sample"] \
+                          + '_' + adata.obs_names
+    adata.obs.index.name = 'spot_id'
+    
+    # store the spatial metadata under the sample name instead of the library id
+    file = list(adata.uns['spatial'].keys())[0]
+    adata.uns['spatial'][sample_name] = adata.uns['spatial'][file].copy()
+    del adata.uns['spatial'][file]
+    print(adata.uns['spatial'].keys())
+    
+    return adata
+
+def read_all_and_qc(
+    sample_annot, Sample_ID_col, file_col, sp_data_folder, 
+    count_file='filtered_feature_bc_matrix.h5',
+):
+    """
+    Read and concatenate all Visium files.
+    """
+    # read first sample
+    adata = read_and_qc(
+        sample_annot[Sample_ID_col][0], sample_annot[file_col][0], 
+        path=sp_data_folder, count_file=count_file
+    ) 
+
+    # read the remaining samples; start=1 keeps each sample aligned with its own file
+    slides = {}
+    for i, s in enumerate(sample_annot[Sample_ID_col][1:], start=1):
+        adata_1 = read_and_qc(s, sample_annot[file_col][i], path=sp_data_folder, count_file=count_file) 
+        slides[str(s)] = adata_1
+
+    adata_0 = adata.copy()
+
+    # combine individual samples
+    #adata = adata.concatenate(list(slides.values()), index_unique=None)
+    adata = adata.concatenate(
+        list(slides.values()),
+        batch_key="sample",
+        uns_merge="unique",
+        batch_categories=sample_annot[Sample_ID_col], 
+        index_unique=None
+    )
+
+    sample_annot.index = sample_annot[Sample_ID_col]
+    for c in sample_annot.columns:
+        sample_annot.loc[:, c] = sample_annot[c].astype(str)
+    adata.obs[sample_annot.columns] = sample_annot.reindex(index=adata.obs['sample']).values
+    
+    return adata
+    
+adata = read_all_and_qc(
+    sample_annot=sample_annot, 
+    Sample_ID_col='Sample_ID', 
+    file_col='file', 
+    sp_data_folder=sp_data_folder, 
+    count_file='filtered_feature_bc_matrix.h5',
+)
+
+adata_incl_nontissue = read_all_and_qc(
+    sample_annot=sample_annot, 
+    Sample_ID_col='Sample_ID', 
+    file_col='file', 
+    sp_data_folder=sp_data_folder, 
+    count_file='raw_feature_bc_matrix.h5',
+)
+```
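When several sections are read in a loop, each sample ID has to stay aligned with its own file entry. A minimal sketch of one way to keep that pairing explicit, using `zip` over the two columns; the sample IDs, folder names and the `pair_samples_with_files` helper are made up for illustration, and plain lists stand in for the `sample_annot` columns.

```python
def pair_samples_with_files(sample_ids, files):
    """Return (sample_id, file) pairs, refusing mismatched column lengths."""
    if len(sample_ids) != len(files):
        raise ValueError("sample_ids and files must have the same length")
    return list(zip(sample_ids, files))


# hypothetical annotation columns
# (stand-ins for sample_annot['Sample_ID'] and sample_annot['file'])
sample_ids = ["A1", "B1", "C1"]
files = ["run_A1", "run_B1", "run_C1"]

pairs = pair_samples_with_files(sample_ids, files)

# the first sample seeds the combined object, the rest are concatenated onto it
first, rest = pairs[0], pairs[1:]

print(first)  # ('A1', 'run_A1')
print(rest)   # [('B1', 'run_B1'), ('C1', 'run_C1')]
```

Iterating over pairs avoids indexing one column with a counter derived from a slice of the other, which is where off-by-one mismatches tend to appear.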
@@ -2,13 +2,11 @@
 from pyro.distributions.transforms import SoftplusTransform
 from torch.distributions import biject_to, transform_to
 
-from .run_c2l import run_cell2location
+from . import models
 from .run_colocation import run_colocation
-from .run_regression import run_regression
 
 __all__ = [
-    "run_cell2location",
-    "run_regression",
+    "models",
     "run_colocation",
 ]
 
 
@@ -14,9 +14,9 @@ def compute_cluster_averages(adata, labels, use_raw=True, layer=None):
     labels
         Name of adata.obs column containing cluster labels
     use_raw
-        Use raw slow in adata?
+        Use raw slot in adata.
     layer
-        use layer in adata? provide layer name
+        Use layer in adata, provide layer name.
 
     Returns
     -------
@@ -38,7 +38,7 @@ def compute_cluster_averages(adata, labels, use_raw=True, layer=None):
             var_names = adata.raw.var_names
 
     if sum(adata.obs.columns == labels) != 1:
-        raise ValueError("cluster_col is absent in adata_ref.obs or not unique")
+        raise ValueError("`labels` is absent in adata.obs or not unique")
 
     all_clusters = np.unique(adata.obs[labels])
     averages_mat = np.zeros((1, x.shape[1]))
@@ -53,29 +53,52 @@ def compute_cluster_averages(adata, labels, use_raw=True, layer=None):
     return averages_df
 
 
-def get_cluster_variances(adata_ref, cluster_col):
+def get_cluster_variances(adata, labels, use_raw=True, layer=None):
     """
-    :param adata_ref: AnnData object of reference single-cell dataset
-    :param cluster_col: Name of adata_ref.obs column containing cluster labels
-    :returns: pd.DataFrame of within cluster variance of each gene
+    Compute variance of each gene in each cluster
+
+    Parameters
+    ----------
+    adata
+        AnnData object of reference single-cell dataset
+    labels
+        Name of adata.obs column containing cluster labels
+    use_raw
+        Use raw slot in adata.
+    layer
+        Use layer in adata, provide layer name.
+
+    Returns
+    -------
+    pd.DataFrame of within cluster variance of each gene
     """
-    if not adata_ref.raw:
-        raise ValueError("AnnData object has no raw data")
-    if sum(adata_ref.obs.columns == cluster_col) != 1:
-        raise ValueError("cluster_col is absent in adata_ref.obs or not unique")
+    if layer is not None:
+        x = adata.layers[layer]
+        var_names = adata.var_names
+    else:
+        if not use_raw:
+            x = adata.X
+            var_names = adata.var_names
+        else:
+            if not adata.raw:
+                raise ValueError("AnnData object has no raw data, set `use_raw=False` or provide `layer`, or fix your object")
+            x = adata.raw.X
+            var_names = adata.raw.var_names
+
+    if sum(adata.obs.columns == labels) != 1:
+        raise ValueError("`labels` is absent in adata.obs or not unique")
 
-    all_clusters = np.unique(adata_ref.obs[cluster_col])
-    var_mat = np.zeros((1, adata_ref.raw.X.shape[1]))
+    all_clusters = np.unique(adata.obs[labels])
+    var_mat = np.zeros((1, x.shape[1]))
 
     for c in all_clusters:
-        sparse_subset = csr_matrix(adata_ref.raw.X[np.isin(adata_ref.obs[cluster_col], c), :])
+        sparse_subset = csr_matrix(x[np.isin(adata.obs[labels], c), :])
         c = sparse_subset.copy()
         c.data **= 2
         var = c.mean(0) - (np.array(sparse_subset.mean(0)) ** 2)
         del c
         var_mat = np.concatenate((var_mat, var))
     var_mat = var_mat[1:, :].T
-    var_df = pd.DataFrame(data=var_mat, index=adata_ref.raw.var_names, columns=all_clusters)
+    var_df = pd.DataFrame(data=var_mat, index=var_names, columns=all_clusters)
 
     return var_df
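
The loop in `get_cluster_variances` relies on the identity Var(x) = E[x^2] - (E[x])^2, which lets the per-gene variance be computed from two sparse means without densifying the matrix. A small sketch on made-up counts, assuming NumPy and SciPy are installed; `sparse_column_variance` is a hypothetical helper name, and the result is the population variance (`ddof=0`).

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparse_column_variance(x):
    """Per-column population variance of a sparse matrix via E[x^2] - E[x]^2."""
    x = csr_matrix(x, dtype=np.float64)
    squared = x.copy()
    squared.data **= 2  # square only the stored non-zero entries
    mean = np.asarray(x.mean(axis=0))
    mean_sq = np.asarray(squared.mean(axis=0))
    return (mean_sq - mean**2).ravel()


# toy counts: 4 cells x 3 genes
counts = np.array([[0, 2, 1], [0, 0, 3], [4, 0, 1], [0, 2, 3]])
variance = sparse_column_variance(counts)

# agrees with the dense NumPy result
assert np.allclose(variance, counts.var(axis=0))
```

This one-pass formula can lose floating-point precision when the mean is large relative to the spread, so comparing against a dense computation on a small subset is a reasonable sanity check.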