Example in Section 5.2 - Over-representation analysis #37

@thegrebe

There is an issue in the example for section 5.2, over-representation analysis:

Example: Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differentially expressed genes, 28 are annotated to a gene set^[example adopted from https://guangchuangyu.github.io/2012/04/enrichment-analysis/].

d <- data.frame(gene.not.interest=c(2613, 15310), gene.in.interest=c(28, 29))
row.names(d) <- c("In_category", "not_in_category")
d

Whether the overlap(s) of 25 genes are significantly over represented in the gene set can be assessed using a hypergeometric distribution. This corresponds to a one-sided version of Fisher's exact test.

fisher.test(d, alternative = "greater")

In over-representation analysis, the question is: "what is the probability of observing at least as many DE genes annotated to the gene set as we actually did?"
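As a sanity check, this "at least as many" probability can be computed directly from the hypergeometric distribution. The following is only a minimal sketch using the counts from the example; the variable names are my own and are not part of the book.

```r
# Counts taken from the example above
n_total    <- 17980       # genes detected
n_de       <- 57          # differentially expressed (DE) genes
n_overlap  <- 28          # DE genes annotated to the gene set
n_category <- 28 + 2613   # all genes annotated to the gene set

# P(X >= 28): with lower.tail = FALSE, phyper(q, ...) returns P(X > q),
# so q must be n_overlap - 1 to include the observed count itself.
p_hyper <- phyper(n_overlap - 1,
                  m = n_category,            # "successes" in the population
                  n = n_total - n_category,  # "failures" in the population
                  k = n_de,                  # number of genes drawn
                  lower.tail = FALSE)
p_hyper
```

This should agree with the p-value reported by fisher.test when the DE genes in the category sit in the top-left cell and alternative = "greater" is used.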

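However, the alternative argument of fisher.test refers to the top-left cell of the supplied 2x2 matrix: "greater" asks whether that count is larger than expected. In the current example the top-left cell holds the genes in the category that are not DE (2613), so the test answers the wrong question.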
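To illustrate the effect of the cell order, here is a small sketch that runs the test on the layout used in the current example. The object names d_current and p_current are only illustrative.

```r
# Layout from the current example: the top-left cell holds the
# non-DE genes in the category (2613), not the DE genes (28).
d_current <- matrix(c(2613, 15310, 28, 29), nrow = 2,
                    dimnames = list(c("In_category", "not_in_category"),
                                    c("gene.not.interest", "gene.in.interest")))

# "greater" now tests whether the non-DE genes are enriched in the category
p_current <- fisher.test(d_current, alternative = "greater")$p.value
p_current
```

With this layout the p-value comes out close to 1, which is the opposite of what the over-representation question is looking for.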

The data.frame should thus be (columns are permuted):

                 gene in interest  gene not in interest
in_category                    28                  2613
not_in_category                29                 15310

Correct code:

Example: Suppose we have 17,980 genes detected in a Microarray study and 57 genes were differentially expressed. Among the differentially expressed genes, 28 are annotated to a gene set^[example adopted from https://guangchuangyu.github.io/2012/04/enrichment-analysis/].

d <- data.frame(gene.in.interest=c(28, 29), gene.not.interest=c(2613, 15310))
row.names(d) <- c("In_category", "not_in_category")
d

Whether the overlap of 28 genes is significantly over-represented in the gene set can be assessed using a hypergeometric distribution. This corresponds to a one-sided version of Fisher's exact test.

fisher.test(d, alternative = "greater")

Alternatively, one can keep the current data.frame and use alternative = "less", but I find that harder to link to the original question ("at least as many"). Both versions give the same p-value:

d1 <- data.frame(gene.in.interest=c(28, 29), gene.not.interest=c(2613, 15310))
row.names(d1) <- c("In_category", "not_in_category")
fisher.test(d1, alternative = "greater")
	Fisher's Exact Test for Count Data

data:  d1
p-value = 7.879e-10
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 3.524092      Inf
sample estimates:
odds ratio
   5.65631 

d2 <- data.frame(gene.not.interest=c(2613, 15310), gene.in.interest=c(28, 29))
row.names(d2) <- c("In_category", "not_in_category")
fisher.test(d2, alternative = "less")
	Fisher's Exact Test for Count Data

data:  d2
p-value = 7.879e-10
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.000000 0.283761
sample estimates:
odds ratio 
 0.1767937
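
The equivalence of the two calls can also be checked programmatically. This is just a sketch; res1, res2, same_p and reciprocal_or are names I made up for the check.

```r
# Corrected layout: DE genes in the category in the top-left cell
d1 <- data.frame(gene.in.interest = c(28, 29),
                 gene.not.interest = c(2613, 15310))
row.names(d1) <- c("In_category", "not_in_category")

# Layout of the current example: same counts, columns swapped
d2 <- d1[, 2:1]

res1 <- fisher.test(d1, alternative = "greater")
res2 <- fisher.test(d2, alternative = "less")

# Swapping the columns and flipping the alternative should give the same
# p-value, with odds ratio estimates that are reciprocals of each other
same_p <- isTRUE(all.equal(res1$p.value, res2$p.value, tolerance = 1e-6))
reciprocal_or <- isTRUE(all.equal(unname(res1$estimate),
                                  1 / unname(res2$estimate),
                                  tolerance = 1e-3))
c(same_p = same_p, reciprocal_or = reciprocal_or)
```

So the two forms are interchangeable numerically; the corrected layout just makes "greater" match the wording of the question.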
