About the method
CDState is designed to:
- identify distinct transcriptional cell states within the malignant compartment, and
- estimate their proportions in every input tumor sample.
Unlike many deconvolution tools, CDState does not require predefined marker genes or reference single‑cell profiles. It uses a constrained nonnegative matrix factorization (NMF) model and tumor purity estimates to separate malignant states from the tumor microenvironment (TME).
CDState factorizes the bulk expression matrix into:
- a set of cell state profiles (gene expression vectors), and
- a proportion matrix (how much each state is present in each sample).
It extends standard NMF with two key ideas that focus the factorization on malignant cell states:
- Sum‑to‑one proportions: for each sample, CDState constrains state proportions to sum to one so that they can be interpreted as cell fractions.
- Cosine similarity penalty: CDState encourages inferred states to be as distinct as possible by penalizing states that have very similar gene expression profiles (high cosine similarity).
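To make the two ideas concrete, here is a minimal NumPy sketch. It is not CDState's implementation: the function names are hypothetical, and the objective form `alpha * reconstruction + beta * cosine penalty` is an assumption based only on the description of alpha and beta in this README.

```python
import numpy as np

def normalize_proportions(P):
    """Clip to nonnegative values and rescale each sample (row) to sum to one."""
    P = np.clip(P, 0.0, None)
    return P / P.sum(axis=1, keepdims=True)

def mean_pairwise_cosine(W):
    """Mean cosine similarity over all pairs of state profiles (columns of W)."""
    unit = W / np.linalg.norm(W, axis=0, keepdims=True)
    sim = unit.T @ unit
    rows, cols = np.triu_indices(W.shape[1], k=1)  # each pair counted once
    return float(sim[rows, cols].mean())

def objective(X, W, P, alpha=1.0, beta=0.0):
    """Assumed objective: weighted reconstruction error plus similarity penalty."""
    recon = np.mean((X - W @ P.T) ** 2)
    return alpha * recon + beta * mean_pairwise_cosine(W)

# Toy data: 50 genes, 10 samples, 3 states.
rng = np.random.default_rng(0)
W = rng.random((50, 3))                          # genes x states
P = normalize_proportions(rng.random((10, 3)))   # samples x states
X = W @ P.T                                      # genes x samples
```

With `beta = 0` only the reconstruction term matters. Raising `beta` makes similar state profiles more costly, which is the intuition behind the penalty.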
CDState operates in two steps. First, it finds an initial solution whose source proportions sum to one and assigns each source to the malignant or nonmalignant compartment. Then, it uses the input tumor purity to improve the disentanglement of malignant sources while keeping the malignant and TME compartments separate. The two steps are controlled by two parameters in the loss function, alpha and beta.
Using tumor purity to separate malignant and TME
Step 1 – Initial decomposition (alpha = 1, beta = 0)
CDState focuses on reconstructing the bulk RNA‑seq data as accurately as possible.
- It finds a set of components and their proportions that best reconstruct each input bulk expression profile.
- It then classifies components as malignant (MAL) or nonmalignant (TME) by selecting the combination of sources whose combined proportions correlate best with the input tumor purity.
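The classification in Step 1 can be sketched as an exhaustive search over subsets of sources. This is an illustration under assumptions, not CDState's code: `assign_malignant` is a hypothetical name, the proportions of a subset are combined by summing, Pearson correlation is used, and any significance test is omitted.

```python
from itertools import combinations

import numpy as np

def assign_malignant(P, purity):
    """Return the subset of states whose summed proportions correlate best
    with purity. P has shape samples x states."""
    k = P.shape[1]
    best_combo, best_r = None, -np.inf
    # Subsets of size 1..k-1, so at least one state is left for the TME.
    for size in range(1, k):
        for combo in combinations(range(k), size):
            summed = P[:, list(combo)].sum(axis=1)
            r = np.corrcoef(summed, purity)[0, 1]
            if r > best_r:
                best_combo, best_r = combo, r
    return best_combo, best_r

# Toy example: states 0 and 1 together make up the malignant fraction.
rng = np.random.default_rng(0)
purity = rng.uniform(0.2, 0.9, size=30)
split = rng.uniform(0.2, 0.8, size=30)
P = np.column_stack([purity * split, purity * (1 - split), 1 - purity])
combo, r = assign_malignant(P, purity)
```

The search grows exponentially with the number of states, which is acceptable for the small K values discussed in this README.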
In the second step, CDState gradually increases the weight of the state‑separation term (the cosine similarity penalty) by increasing beta while decreasing alpha, and monitors how well the inferred malignant proportions match the input purity.
Once the mean squared error between the input tumor purity and the CDState‑inferred purity starts to increase, beta is gradually decreased back to zero (and alpha increased back to 1), and a final optimization focused on improving the reconstruction is run.
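The control flow of this schedule can be illustrated with a toy loop. Everything here is an assumption made for illustration: the step size, the coupling `alpha = 1 - beta`, and the `optimize_round` callback, which stands in for one round of optimization and returns the purity mean squared error. The actual schedule used by CDState may differ.

```python
def run_schedule(optimize_round, step=0.1, max_beta=1.0):
    """Ramp beta up until the purity MSE worsens, then ramp it back to zero.

    optimize_round(alpha, beta) is a hypothetical callback returning the MSE
    between input and inferred purity after one optimization round.
    """
    history = []

    def record(beta):
        alpha = round(1.0 - beta, 10)  # assumed coupling of alpha and beta
        mse = optimize_round(alpha, beta)
        history.append((alpha, beta, mse))
        return mse

    beta = 0.0
    prev_mse = record(beta)
    # Ramp up: strengthen the state-separation term.
    while beta < max_beta:
        beta = round(min(beta + step, max_beta), 10)
        mse = record(beta)
        if mse > prev_mse:  # purity agreement started to degrade
            break
        prev_mse = mse
    # Ramp down: return to pure reconstruction (alpha = 1, beta = 0).
    while beta > 0.0:
        beta = round(max(beta - step, 0.0), 10)
        record(beta)
    return history

def toy_round(alpha, beta):
    """Stand-in for an optimization round; MSE is lowest at beta = 0.3."""
    return (beta - 0.3) ** 2

history = run_schedule(toy_round)
```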
CDState requires:
- Bulk RNA‑seq expression matrix: normalized counts or expression values in a matrix of shape:
- rows = genes
- columns = samples.
- Tumor purity estimates: one purity value per sample.
- Range of cell state numbers: a range of values for the number of states K to infer. For example, K = 2, 3, 4 makes CDState identify 2, 3, and 4 sources. We recommend setting K to no more than half the number of samples.
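Before running, it can help to sanity-check the inputs against these requirements. The helper below is a hypothetical sketch and not part of CDState. It assumes purity is given as a fraction in [0, 1] and that expression values are nonnegative, as NMF requires.

```python
import numpy as np

def check_inputs(X, purity, k_values):
    """Validate shapes and ranges; return the K values above the recommended limit."""
    X = np.asarray(X, dtype=float)
    purity = np.asarray(purity, dtype=float)
    if X.ndim != 2:
        raise ValueError("expression matrix must be 2-D (genes x samples)")
    n_genes, n_samples = X.shape
    if purity.shape != (n_samples,):
        raise ValueError("need exactly one purity value per sample (column)")
    if np.any(X < 0):
        raise ValueError("expression values must be nonnegative for NMF")
    if np.any((purity < 0) | (purity > 1)):
        raise ValueError("purity values are assumed to be fractions in [0, 1]")
    # Recommendation from this README: K at most half the number of samples.
    return [k for k in k_values if k > n_samples // 2]

X = np.ones((100, 10))      # 100 genes x 10 samples
purity = np.full(10, 0.5)
too_large = check_inputs(X, purity, [2, 3, 4, 6])
```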
For each chosen number of states K, CDState returns:
- Cell state expression profiles: a matrix with one expression profile per inferred state (genes × states).
- Cell state proportions: a matrix with one column per state and one row per sample, summing to one per sample.
- Malignant state assignment: an assignment of each state to either the malignant compartment or the TME, based on correlation with tumor purity.
Before running the factorization, CDState applies built‑in gene filtering:
- it removes genes with extremely low or extremely high overall expression,
- and, among the remaining genes, keeps only the most variable genes within expression bins.
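A filter of this kind can be sketched as follows. The quantile cutoffs, number of bins, and genes kept per bin are illustrative assumptions, not CDState's defaults, and `filter_genes` is a hypothetical name.

```python
import numpy as np

def filter_genes(X, low_q=0.05, high_q=0.95, n_bins=10, top_per_bin=50):
    """Return sorted row indices of genes kept. X has shape genes x samples."""
    mean_expr = X.mean(axis=1)
    lo, hi = np.quantile(mean_expr, [low_q, high_q])
    # Drop genes with extremely low or extremely high mean expression.
    kept = np.where((mean_expr > lo) & (mean_expr < hi))[0]
    # Order the remaining genes by mean expression and cut into equal-size bins.
    ordered = kept[np.argsort(mean_expr[kept])]
    selected = []
    for bin_idx in np.array_split(ordered, n_bins):
        variances = X[bin_idx].var(axis=1)
        # Keep the most variable genes within this expression bin.
        top = bin_idx[np.argsort(variances)[::-1][:top_per_bin]]
        selected.extend(top.tolist())
    return np.sort(np.array(selected, dtype=int))

rng = np.random.default_rng(2)
X = rng.gamma(2.0, 1.0, size=(1000, 20))   # 1000 genes x 20 samples
selected = filter_genes(X, n_bins=10, top_per_bin=5)
```

Selecting variable genes per bin rather than globally keeps highly expressed genes, which tend to have larger variance, from dominating the selection.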
After deconvolution, CDState can recover state expression profiles across all genes using a separate regression step, which is useful when comparing states across datasets or in downstream analyses. For use-case examples, please see the notebooks.
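One way such a regression step could work is to hold the inferred proportions fixed and solve for every gene's state-specific expression by least squares. This is a sketch under assumptions: the README does not specify the regression type, so ordinary least squares followed by clipping at zero is used here, and `recover_full_profiles` is a hypothetical name.

```python
import numpy as np

def recover_full_profiles(X_full, P):
    """Estimate genes x states profiles from X_full (genes x samples)
    and fixed proportions P (samples x states)."""
    # Solve P @ W.T ≈ X_full.T for all genes at once.
    solution, *_ = np.linalg.lstsq(P, X_full.T, rcond=None)
    # Clip small negative estimates so the profiles stay nonnegative.
    return np.clip(solution.T, 0.0, None)

# Toy check: noise-free data generated from known profiles.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(3), size=30)   # 30 samples x 3 states, rows sum to one
W_true = rng.random((200, 3))            # 200 genes x 3 states
X_full = W_true @ P.T
W_rec = recover_full_profiles(X_full, P)
```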
Choosing the best number of sources is part of the analysis. CDState supports running multiple values and multiple random initializations per value.
Typical workflow:
- Run CDState for a range of K (for example 2–6), with 10–20 runs per K.
- For each K, summarize:
- the reconstruction error,
- the number of malignant states identified, and
- the correlation between input and inferred purity, to compare across all K.
We recommend choosing the solution based on the median number of malignant states identified for a given K: among the runs that identified this median number of states, select the one with the best purity correlation.
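The selection rule for a single K can be written in a few lines. The record fields (`run_id`, `n_malignant`, `purity_corr`) are hypothetical names for illustration, not CDState output keys. With an even number of runs the sketch uses the lower median, so the target is always a count that at least one run produced. That tie-breaking choice is an assumption.

```python
from statistics import median_low

def pick_run(runs):
    """Choose the run with the best purity correlation among runs that found
    the median number of malignant states."""
    target = median_low([r["n_malignant"] for r in runs])
    candidates = [r for r in runs if r["n_malignant"] == target]
    return max(candidates, key=lambda r: r["purity_corr"])

# Toy summaries of five runs for one value of K.
runs = [
    {"run_id": 0, "n_malignant": 2, "purity_corr": 0.81},
    {"run_id": 1, "n_malignant": 3, "purity_corr": 0.93},
    {"run_id": 2, "n_malignant": 2, "purity_corr": 0.88},
    {"run_id": 3, "n_malignant": 1, "purity_corr": 0.70},
    {"run_id": 4, "n_malignant": 2, "purity_corr": 0.85},
]
best = pick_run(runs)
```

Run 1 has the highest correlation overall but is not selected, because it found three malignant states rather than the median of two.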
For practical examples, please see the notebooks.
Because CDState relies on correlation with tumor purity to label malignant states, datasets with few samples can be limiting. In small cohorts, truly malignant states may show only weak correlations with purity that do not reach the statistical significance threshold CDState requires, so these states may not be classified as malignant. In addition, when the number of samples is very low, the method has less power to detect patterns shared across tumors and may instead learn sample‑specific components rather than robust, shared malignant signals. Based on our validation experiments with different numbers of input samples, we recommend applying CDState to datasets with at least 20 samples.