About the method

CDState is an unsupervised deconvolution method for tumor bulk RNA‑seq data.

It is designed to:

identify distinct transcriptional cell states within the malignant compartment, and
estimate their proportions in every input tumor sample.

Unlike many deconvolution tools, CDState does not require predefined marker genes or reference single‑cell profiles. It uses a constrained nonnegative matrix factorization (NMF) model and tumor purity estimates to separate malignant states from the tumor microenvironment (TME).

How CDState works

CDState factorizes the bulk expression matrix into:

a set of cell state profiles (gene expression vectors), and
a proportion matrix (how much each state is present in each sample).

It extends standard NMF with two key ideas that focus the factorization on malignant cell states:

Sum‑to‑one proportions: for each sample, CDState constrains state proportions to sum to one so that they can be interpreted as cell fractions.
Cosine similarity penalty: CDState encourages inferred states to be as distinct as possible by penalizing states that have very similar gene expression profiles (high cosine similarity).
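The two ideas above can be illustrated with a minimal NumPy sketch. It is not CDState's implementation: the function names are hypothetical, and the sketch assumes proportions are stored as samples × states and profiles as genes × states, as described under Outputs. How CDState enforces the sum‑to‑one constraint during optimization is not shown here; the rescaling below only illustrates the property.

```python
import numpy as np

def normalize_proportions(C):
    """Rescale each sample's nonnegative state weights so they sum to one."""
    C = np.clip(np.asarray(C, dtype=float), 0.0, None)
    totals = C.sum(axis=1, keepdims=True)
    # Leave all-zero rows untouched instead of dividing by zero.
    return C / np.where(totals > 0, totals, 1.0)

def cosine_penalty(S):
    """Mean pairwise cosine similarity between state profiles (columns of S)."""
    S = np.asarray(S, dtype=float)
    k = S.shape[1]
    if k < 2:
        return 0.0  # a single state has nothing to be similar to
    norms = np.linalg.norm(S, axis=0)
    unit = S / np.where(norms > 0, norms, 1.0)
    sim = unit.T @ unit  # K x K matrix of cosine similarities
    return float(sim[~np.eye(k, dtype=bool)].mean())
```

A penalty close to 1 means the states are nearly identical in direction; a penalty close to 0 means they are nearly orthogonal.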

Two-step optimization with tumor purity

CDState operates in two steps. First, it finds an initial solution whose source proportions sum to one and assigns each source to the malignant or nonmalignant compartment. Then, it uses the input tumor purity to improve the disentanglement of malignant sources while keeping the malignant and TME compartments separate. The two steps are controlled by two parameters, alpha and beta, in the loss function.

Step 1 – Initial decomposition (alpha = 1, beta = 0)

CDState focuses on reconstructing the bulk RNA‑seq data as accurately as possible.

It finds a set of components and their proportions that best reconstruct each input bulk expression profile.
It then classifies components as malignant (MAL) or nonmalignant (TME): it selects the combination of sources whose proportions give the best correlation with the input tumor purity.
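The classification step can be sketched as an exhaustive search over component subsets. This is an illustration under stated assumptions, not CDState's code: the function name is hypothetical, the proportions of a candidate subset are summed per sample, Pearson correlation is used, and no significance test is applied.

```python
from itertools import combinations

import numpy as np

def pick_malignant_components(C, purity):
    """Return the component subset whose summed proportions best track purity."""
    C = np.asarray(C, dtype=float)  # samples x components
    purity = np.asarray(purity, dtype=float)
    n_components = C.shape[1]
    best_subset, best_r = None, -np.inf
    # Only proper subsets are tried, so at least one component remains TME.
    for size in range(1, n_components):
        for subset in combinations(range(n_components), size):
            summed = C[:, list(subset)].sum(axis=1)
            if np.std(summed) == 0:
                continue  # correlation is undefined for a constant vector
            r = np.corrcoef(summed, purity)[0, 1]
            if r > best_r:
                best_subset, best_r = subset, r
    return best_subset, best_r
```

The search grows exponentially with the number of components, which is manageable for the small values of K discussed in this document.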

Step 2 – Refining malignant states

CDState gradually increases the weight of the state‑separation term (the cosine similarity penalty) by raising beta and lowering alpha, while monitoring how well the inferred malignant proportions match the input purity.

Once the mean squared error between the input tumor purity and the CDState-inferred purity starts to increase, beta is gradually decreased back to zero (and alpha is increased back to 1), and a final optimization focused on improving the reconstruction is run.
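The ramp‑up and ramp‑down of the weights can be pictured with the toy sketch below. Several details are assumptions made only for illustration and are not taken from CDState: alpha is set to 1 − beta, beta moves in fixed steps of 0.1, and `mse_at_beta` is a hypothetical callback that returns the purity error obtained after optimizing with a given beta.

```python
import numpy as np

def purity_mse(C, malignant, purity):
    """Mean squared error between summed malignant proportions and input purity."""
    inferred = np.asarray(C, dtype=float)[:, list(malignant)].sum(axis=1)
    return float(np.mean((inferred - np.asarray(purity, dtype=float)) ** 2))

def weight_schedule(mse_at_beta, step=0.1, beta_max=1.0):
    """List of (alpha, beta) pairs: raise beta until purity MSE rises, then lower it."""
    schedule = [(1.0, 0.0)]
    prev = mse_at_beta(0.0)
    beta = 0.0
    # Ramp up: stop at the first step where the purity error gets worse.
    while beta + step <= beta_max + 1e-12:
        beta = round(beta + step, 10)
        schedule.append((round(1.0 - beta, 10), beta))
        current = mse_at_beta(beta)
        if current > prev:
            break
        prev = current
    # Ramp down: return to pure reconstruction (alpha = 1, beta = 0).
    while beta > 0:
        beta = round(max(beta - step, 0.0), 10)
        schedule.append((round(1.0 - beta, 10), beta))
    return schedule
```

For example, with a hypothetical error curve that is lowest at beta = 0.2, the schedule climbs to beta = 0.3, where the error first rises, and then steps back down to zero.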

Inputs

CDState requires:

Bulk RNA‑seq expression matrix: normalized counts or expression values in a matrix of shape:

rows = genes
columns = samples.

Tumor purity estimates: one purity value per sample.
Range of cell state numbers: the values of K (the number of states) to infer; for example, K = 2, 3, 4 makes CDState identify 2, 3, and 4 sources. We recommend that K not exceed half the number of samples.
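Before running the method, it can help to check that the inputs match these requirements. The helper below is a hypothetical pre‑flight check, not part of CDState; it assumes purity is given as a fraction between 0 and 1 and flags any K above half the number of samples.

```python
import numpy as np

def check_inputs(expr, purity, k_values):
    """Validate input shapes and return the K values above the recommended limit."""
    expr = np.asarray(expr, dtype=float)
    purity = np.asarray(purity, dtype=float)
    if expr.ndim != 2:
        raise ValueError("expression must be a 2-D genes x samples matrix")
    n_genes, n_samples = expr.shape
    if purity.shape != (n_samples,):
        raise ValueError("expected one purity value per sample (column)")
    if (expr < 0).any():
        raise ValueError("NMF needs nonnegative expression values")
    if ((purity < 0) | (purity > 1)).any():
        raise ValueError("purity is assumed to be a fraction in [0, 1]")
    limit = n_samples // 2  # recommended upper bound for K
    return [k for k in k_values if k > limit]
```

An empty return value means every requested K is within the recommended range.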

Outputs

For each chosen number of states K, CDState returns:

Cell state expression profiles: a matrix with one expression profile per inferred state (genes × states).
Cell state proportions: a matrix with one column per state and one row per sample, summing to one per sample.
Malignant state assignment: an assignment of each state to either the malignant compartment or the TME, based on correlation with tumor purity.

Gene filtering and full‑gene recovery

Before running the factorization, CDState applies a built‑in gene filtering:

it removes genes with extremely low or extremely high overall expression,
and, among the remaining genes, keeps only the most variable genes within expression bins.
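The general shape of such a filter is sketched below. This is not CDState's built‑in filter: the function name, the quantile cut‑offs, the number of bins, and the number of genes kept per bin are all hypothetical values chosen for illustration.

```python
import numpy as np

def filter_genes(expr, low_q=0.05, high_q=0.99, n_bins=10, top_per_bin=100):
    """Return sorted row indices of genes kept by an expression/variance filter."""
    expr = np.asarray(expr, dtype=float)  # genes x samples
    mean = expr.mean(axis=1)
    var = expr.var(axis=1)
    # Drop genes whose mean expression is extremely low or extremely high.
    lo, hi = np.quantile(mean, [low_q, high_q])
    candidates = np.flatnonzero((mean >= lo) & (mean <= hi))
    # Order the remaining genes by mean expression and split them into bins.
    ordered = candidates[np.argsort(mean[candidates])]
    keep = []
    for bin_genes in np.array_split(ordered, n_bins):
        # Within each bin, keep the genes with the highest variance.
        ranked = bin_genes[np.argsort(var[bin_genes])[::-1]]
        keep.extend(ranked[:top_per_bin])
    return np.sort(np.array(keep, dtype=int))
```

Binning by expression level before ranking by variance keeps the selection from being dominated by highly expressed genes, whose variance tends to be larger.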

After deconvolution, CDState can recover state expression profiles across all genes using a separate regression step, which is useful when comparing states across datasets or in downstream analyses. For use-case examples, please see the notebooks.
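One way to picture the recovery step: with the proportions fixed, each gene's bulk expression across samples is regressed on the state proportions. The sketch below uses ordinary least squares followed by clipping negative values to zero; this choice of regression and the function name are assumptions for illustration, since the regression CDState uses is not specified here.

```python
import numpy as np

def recover_full_profiles(X_full, C):
    """Estimate genes x states profiles from bulk data and fixed proportions."""
    X_full = np.asarray(X_full, dtype=float)  # all genes x samples
    C = np.asarray(C, dtype=float)            # samples x states
    # Solve C @ S.T ~= X_full.T in the least-squares sense, all genes at once.
    S_T, *_ = np.linalg.lstsq(C, X_full.T, rcond=None)
    # Expression cannot be negative, so clip small negative estimates to zero.
    return np.clip(S_T.T, 0.0, None)
```

Because the proportions are held fixed, genes that were filtered out before the factorization can still be assigned a value in every state.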

Choosing the number of states

Choosing the best number of sources is part of the analysis. CDState supports running multiple values and multiple random initializations per value.

Typical workflow:

Run CDState for a range of K (for example 2–6), with 10–20 runs per K.
For each K, summarize:

reconstruction error
how many malignant states are identified

Compare the correlation between input and inferred purity across all values of K.

We recommend choosing the solution as follows: for a given K, take the median number of malignant states identified across runs, then select the run with the best purity correlation among the runs that identified this median number of states.
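This selection rule can be written down in a few lines. The sketch assumes each run for a given K is summarized as a dictionary with the hypothetical keys `n_malignant` and `purity_corr`; it uses the low median so that, with an even number of runs, the target is always a value that some run actually produced.

```python
from statistics import median_low

def choose_run(runs):
    """Pick the best-correlated run among those with the median malignant count."""
    # Median number of malignant states across all runs for this K.
    target = median_low([run["n_malignant"] for run in runs])
    eligible = [run for run in runs if run["n_malignant"] == target]
    # Among those runs, prefer the highest correlation with input purity.
    return max(eligible, key=lambda run: run["purity_corr"])
```

For example, if five runs report 1, 2, 2, 2, and 3 malignant states, the three runs with 2 states are eligible and the one with the highest purity correlation is returned.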

For practical examples please see notebooks.

Minimum number of input samples

Because CDState relies on correlation with tumor purity to label malignant states, datasets with few samples can be limiting. In small cohorts, truly malignant states may show only weak correlations with purity that do not reach the statistical significance threshold CDState requires, so these states may not be classified as malignant. In addition, when the number of samples is very low, the method has less power to detect patterns shared across tumors and may learn sample‑specific components instead of robust, shared malignant signals. Based on our validation experiments with different numbers of input samples, we recommend applying CDState to datasets with at least 20 samples.
