BST430 Lecture 15

Testing and High-throughput count data

Seong-Hwan Jun

2025-10-30

Hypothesis testing

  • Statistical hypothesis testing is a method used to make inferences or draw conclusions about a population based on sample data.
  • It involves formulating two competing hypotheses: the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_a\)).

Collect data

Example: coin flip.

  • \(X_i \sim \text{Bernoulli}(p)\), where \(p\) is the probability of getting heads.
  • Flip the coin \(n\) times and record the outcomes \(X_i,\ i = 1, \ldots, n\); the observed number of heads is \(\sum_{i=1}^{n} X_i\).
  • How would you test if the coin is fair (\(H_0: p = 0.5\)) or biased (\(H_a: p \neq 0.5\))?

Test statistic

  • A test statistic is a numerical summary of the data that is used to evaluate the evidence against the null hypothesis.
  • It is calculated from the sample data and is used to determine the likelihood of observing the data under the null hypothesis.

If \(p = 0.5\) were true, how likely is it to observe the number of heads we did?

Generate some data

[1] "T" "T" "H" "T" "H" "H"
coinFlips
 H  T 
59 41 
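A minimal sketch of how such flips could be simulated (the seed is an assumption, so the exact counts will differ):

set.seed(2025)  # assumed seed, not from the original slides
numFlips <- 100
coinFlips <- sample(c("H", "T"), size = numFlips, replace = TRUE)
head(coinFlips)
table(coinFlips)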

Binomial distribution under null
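A sketch of how the Binomial(100, 0.5) null distribution could be plotted against the observed count (the plotting choices are assumptions):

# PMF of the number of heads under H0: p = 0.5 with n = 100 flips.
k <- 0:100
plot(k, dbinom(k, size = 100, prob = 0.5), type = "h",
     xlab = "Number of heads", ylab = "Probability under H0")
abline(v = sum(coinFlips == "H"), col = "red")  # observed number of heads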

Rejection region

  • The rejection region is the range of values for the test statistic that leads to the rejection of the null hypothesis.
  • It is determined based on the significance level (\(\alpha\)) chosen for the test.
  • For a two-tailed test with \(\alpha = 0.05\), the rejection region would include values of the test statistic that are in the extreme 2.5% on either side of the distribution under the null hypothesis.

Rejection region

So we do not reject the null hypothesis. Does that mean the coin is fair?
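A minimal sketch of how the rejection boundaries could be computed for this example (assuming n = 100 and alpha = 0.05):

alpha <- 0.05
lower <- qbinom(alpha / 2, size = 100, prob = 0.5)      # 40
upper <- qbinom(1 - alpha / 2, size = 100, prob = 0.5)  # 60
# Reject H0 only if the observed count falls below lower or above upper
# (slightly conservative due to discreteness); 59 lies inside [40, 60],
# so we do not reject.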

Binomial test

binom.test(x = sum(coinFlips == "H"), n = numFlips, p = 0.5)

    Exact binomial test

data:  sum(coinFlips == "H") and numFlips
number of successes = 59, number of trials = 100, p-value = 0.08863
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.4871442 0.6873800
sample estimates:
probability of success 
                  0.59 

Sample size

What if we increased the sample size to 1000 flips with the same probability of heads?

coinFlips
  H   T 
604 396 

Binomial distribution under null

Binomial test

binom.test(x = sum(coinFlips == "H"), n = numFlips, p = 0.5)

    Exact binomial test

data:  sum(coinFlips == "H") and numFlips
number of successes = 604, number of trials = 1000, p-value = 5.062e-11
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.5729182 0.6344677
sample estimates:
probability of success 
                 0.604 

Testing set up

  • Suppose we are interested in testing the efficacy of a treatment (drug, procedure, etc.).
  • Two groups: A (control) and B (treatment).
  • We collect a sample of size n from each group and measure some outcomes of interest (e.g., blood pressure, gene expression levels, etc).
  • We want to test whether there is a significant difference in the outcomes between the two groups.

High throughput sequencing data

High-throughput sequencing

  • High-throughput sequencing (HTS) technologies have revolutionized genomics and molecular biology.
  • Rapid sequencing of large volumes of DNA and RNA molecules.

Experimental set up

  • Biological samples are collected under different conditions (e.g., treatment vs. control or healthy vs disease).
  • The resulting data consist of counts of reads mapped to each feature of interest, e.g., gene, transcript, or microbial taxon.
  • We aim to identify genes that are differentially expressed between conditions to characterize disease progression or differences brought forth by treatment (efficacy).

Examples

  • Bulk RNA sequencing
    • Measure gene expression levels in a population of cells.
    • The resulting data consists of counts of transcripts for each gene aggregated over all cells captured.

Examples

  • Single-cell RNA sequencing
    • Profile gene expression at the single-cell level.
    • Cells are isolated, and their RNA is sequenced individually.
    • The resulting data consists of counts of transcripts for each gene in each cell.
    • Enables the study of cellular heterogeneity and identification of distinct cell types or states.

Examples

  • Genetic screening using RNAi or CRISPR-Cas9
    • Identify genes essential for cell survival or proliferation.
    • Cells are transduced with a library of RNAi or CRISPR constructs targeting different genes.
    • After a period of growth, the abundance of each construct is measured using HTS.
    • Counts reflect the impact of gene knockdown or knockout on cell viability.

Examples

  • Microbiome studies
    • Study the abundance of microbial communities (e.g., gut or oral microbiota) and their role in maintaining health or contributing to disease.
    • Samples are collected from different environments or conditions.

Statistical challenges

  • We observe a random sample of genetic molecules that are present in the sample (sequencing can be biased).
  • Data is count-based, often with many zeros (sparsity).
  • Noise and variability due to technical and biological factors, e.g., batch effects and other confounders.
  • High dimensionality with many genes and relatively few samples.
  • Multiple testing issues when identifying differentially expressed genes.
  • Complex dependencies among genes.

Bulk RNA-seq counts

library(pasilla)
fn = system.file("extdata", "pasilla_gene_counts.tsv",
                 package = "pasilla", mustWork = TRUE)
counts = as.matrix(read.csv(fn, sep = "\t", row.names = "gene_id"))
print(dim(counts))
[1] 14599     7
print(colnames(counts))
[1] "untreated1" "untreated2" "untreated3" "untreated4" "treated1"  
[6] "treated2"   "treated3"  
head(rownames(counts))
[1] "FBgn0000003" "FBgn0000008" "FBgn0000014" "FBgn0000015" "FBgn0000017"
[6] "FBgn0000018"

Quiz

  • Compute the gene-wise (row-wise) log of the means.
  • Plot the log of the means. What plot type is appropriate for visualizing their distribution? (One possible answer is sketched below.)
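One possible answer, using a histogram to show the distribution of the gene-wise means (the use of log1p to handle genes with mean zero is an assumption):

genewise_means <- rowMeans(counts)
hist(log1p(genewise_means), breaks = 50,
     xlab = "log(1 + mean count)", main = "Gene-wise mean counts")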

Quiz

library(matrixStats)  # assumed source of rowVars()
genewise_vars <- rowVars(counts)
boxplot(log(genewise_vars))
# Focus on 2000 most variable genes.
num_genes <- 2000
genes_to_analyze <- names(sort(genewise_vars, decreasing = TRUE))[1:num_genes]
cts_subset <- counts[genes_to_analyze,]

Testing differential expression

  • We want to find genes that are differentially expressed between two conditions (treated vs untreated).
  • Testing over 10,000 genes (multiple testing) makes it highly probable that some genes will appear significant purely by chance (false discoveries).
  • What test should we use?

Quiz

  • Try t.test on each gene between the two conditions.
  • Adjust for multiple testing using Benjamini-Hochberg procedure p.adjust(..., method="BH").
  • How many genes are significant at FDR 5%?
  • Are the assumptions of the t-test satisfied?
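A minimal sketch of one way to carry this out, assuming cts_subset from the earlier slide (working on log1p-transformed counts is an assumption):

# Per-gene Welch t-tests comparing untreated vs. treated, then BH adjustment.
untreated_cols <- grepl("untreated", colnames(cts_subset))
log_cts <- log1p(cts_subset)
t_pvals <- apply(log_cts, 1, function(x)
  t.test(x[untreated_cols], x[!untreated_cols])$p.value)
t_adj <- p.adjust(t_pvals, method = "BH")
sum(t_adj < 0.05)  # number of genes significant at FDR 5%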

Permutation test

  • A permutation test is a non-parametric statistical test used to determine the significance of an observed effect.
  • It involves randomly shuffling the data to create a null distribution of the test statistic.
  • The p-value is calculated by comparing the observed test statistic to the null distribution.

When to use permutation tests

Why (or when) would we use this?

  • When the assumptions of traditional parametric tests (e.g., t-test, ANOVA) are not met.
  • When dealing with small sample sizes where parametric tests may not be reliable.

Quiz

  • How many random shufflings of the group labels (permutations) are possible?
  • Write code to perform permutation test for one of the genes. Generate 10, 20, and all 35 permutations.
    • What statistic should we compute?

Quiz

library(tidyverse)  # str_detect() and other tidyverse functions used below
log_counts <- log1p(cts_subset)
gene <- genes_to_analyze[1]
# All choose(7, 4) = 35 ways to assign 4 of the 7 samples to "untreated".
shuffles <- combn(7, 4)
diffs <- apply(shuffles, 2, function(untreated_idx) {
  mean(log_counts[gene, untreated_idx]) - mean(log_counts[gene, -untreated_idx])
})
untreated_mean <- mean(log_counts[gene, str_detect(colnames(log_counts), "untreated")])
treated_mean <- mean(log_counts[gene, str_detect(colnames(log_counts), "treated")])
observed <- treated_mean - untreated_mean
print(observed)
[1] -0.05359491
p_value <- mean(abs(diffs) >= abs(observed))
print(p_value)
[1] 0.8571429

Quiz

hist(diffs)
abline(v=observed, col="red")

GLM for count-valued variables.

  • In the simple setting where you only have an experimental condition, non-parametric tests such as the permutation test or others (e.g., Wilcoxon rank sum test) may be fine.
  • If you have multiple conditions or covariates to adjust for (e.g., batch effects), you may want to consider a regression-based approach.
  • Since we have counts data, we can consider a Poisson or Negative Binomial Generalized Linear Model (GLM).

Fitting Poisson GLM

  • We will use glm(..., family = "poisson") to fit a Poisson GLM.
  • To facilitate modeling, we will group the data by gene and then apply the nest() function.

Fitting Poisson GLM

Prepare the data…

dat <- data.frame(genes = rownames(cts_subset), cts_subset)
colnames(dat)
[1] "genes"      "untreated1" "untreated2" "untreated3" "untreated4"
[6] "treated1"   "treated2"   "treated3"  
dat_long <- pivot_longer(dat, cols = -genes, names_to = "sample", values_to = "count") %>% 
  mutate(condition = if_else(str_detect(pattern = "untreated", sample), "untreated", "treated"))

# Nest data by genes
nested_data <- dat_long %>%
  group_by(genes) %>%
  nest()
nested_data
# A tibble: 2,000 × 2
# Groups:   genes [2,000]
   genes       data            
   <chr>       <list>          
 1 FBgn0000556 <tibble [7 × 3]>
 2 FBgn0064225 <tibble [7 × 3]>
 3 FBgn0040813 <tibble [7 × 3]>
 4 FBgn0002526 <tibble [7 × 3]>
 5 FBgn0000559 <tibble [7 × 3]>
 6 FBgn0026562 <tibble [7 × 3]>
 7 FBgn0000042 <tibble [7 × 3]>
 8 FBgn0027571 <tibble [7 × 3]>
 9 FBgn0001219 <tibble [7 × 3]>
10 FBgn0003517 <tibble [7 × 3]>
# ℹ 1,990 more rows

Fitting Poisson GLM

poisson_glm_nested <- nested_data %>% 
  mutate(glm = map(data, ~glm(count ~ condition, data = .x, family = "poisson")),
         tidy_glm = map(glm, broom::tidy))
poisson_glm_nested
# A tibble: 2,000 × 4
# Groups:   genes [2,000]
   genes       data             glm    tidy_glm        
   <chr>       <list>           <list> <list>          
 1 FBgn0000556 <tibble [7 × 3]> <glm>  <tibble [2 × 5]>
 2 FBgn0064225 <tibble [7 × 3]> <glm>  <tibble [2 × 5]>
 3 FBgn0040813 <tibble [7 × 3]> <glm>  <tibble [2 × 5]>
 4 FBgn0002526 <tibble [7 × 3]> <glm>  <tibble [2 × 5]>
 5 FBgn0000559 <tibble [7 × 3]> <glm>  <tibble [2 × 5]>
 6 FBgn0026562 <tibble [7 × 3]> <glm>  <tibble [2 × 5]>
 7 FBgn0000042 <tibble [7 × 3]> <glm>  <tibble [2 × 5]>
 8 FBgn0027571 <tibble [7 × 3]> <glm>  <tibble [2 × 5]>
 9 FBgn0001219 <tibble [7 × 3]> <glm>  <tibble [2 × 5]>
10 FBgn0003517 <tibble [7 × 3]> <glm>  <tibble [2 × 5]>
# ℹ 1,990 more rows

Fitting Poisson GLM

p_values <- poisson_glm_nested %>%
  unnest(tidy_glm) %>% 
  filter(term == "conditionuntreated") %>%
  dplyr::pull(p.value)
# Adjust for multiple testing.
adj_pvalues <- p.adjust(p_values, method = "BH")
# Number of differential genes.
sum(adj_pvalues < 0.05)
[1] 1704

A bit too many?

Diagnostics

  • Poisson distribution assumes that the mean and variance are equal.

Diagnostics
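A sketch of a mean-variance diagnostic for the 2000 selected genes (assumes cts_subset from earlier; the plotting choices are assumptions):

# Under a Poisson model, gene-wise variances should track the means (the y = x line).
gene_means <- rowMeans(cts_subset)
gene_vars <- apply(cts_subset, 1, var)
plot(log1p(gene_means), log1p(gene_vars), pch = 20, cex = 0.5,
     xlab = "log(1 + mean)", ylab = "log(1 + variance)")
abline(a = 0, b = 1, col = "red")  # variance = mean line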

Negative Binomial GLM

  • To account for overdispersion, we can use a Negative Binomial GLM.
  • The Negative Binomial distribution has two parameters; the additional dispersion parameter allows the variance to exceed the mean.
  • We will use glm.nb() from the MASS package (the negative binomial family with estimated dispersion is not offered by glm()).

Negative Binomial GLM

library(MASS)  # glm.nb()
nb_glm_nested <- nested_data %>%
  filter(genes %in% genes_to_analyze) %>% 
  mutate(glm = map(data, ~glm.nb(count ~ condition, data = .x)),
         tidy_glm = map(glm, broom::tidy))
p_values <- nb_glm_nested %>%
  unnest(tidy_glm) %>% 
  filter(term == "conditionuntreated") %>%
  dplyr::pull(p.value)
adj_pvalues <- p.adjust(p_values, method = "BH")
sum(adj_pvalues < 0.05)
[1] 11

Quiz

  • Compare the counts (or log counts) of untreated vs treated for the genes that are selected as differentially expressed by Negative Binomial GLM.
  • Do you think these are truly differential?
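One way to inspect them (a sketch; assumes nb_glm_nested, adj_pvalues, and dat_long from the previous slides, and that ggplot2 is loaded):

# Genes flagged at FDR 5% by the NB GLM, with log counts shown by condition.
sig_genes <- nb_glm_nested$genes[adj_pvalues < 0.05]
dat_long %>%
  filter(genes %in% sig_genes) %>%
  ggplot(aes(x = condition, y = log1p(count))) +
  geom_jitter(width = 0.1) +
  facet_wrap(~ genes, scales = "free_y")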

Note

  • We actually did not account for library size differences between samples here.
  • In practice, we would want to include an offset term in the GLM to account for library size differences.
  • More sophisticated methods such as edgeR and DESeq2 place a prior on the dispersion parameter and pool information across genes to obtain moderated (shrunken) dispersion estimates.
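A sketch of how a library-size offset could be included, illustrated for a single gene (using total counts per sample as the library size is a simplification; edgeR and DESeq2 compute more robust normalization factors):

# Library sizes (total counts per sample) enter the NB GLM as a log-scale offset.
lib_sizes <- tibble(sample = colnames(counts), lib_size = colSums(counts))
dat_offset <- dat_long %>% left_join(lib_sizes, by = "sample")
fit <- glm.nb(count ~ condition + offset(log(lib_size)),
              data = filter(dat_offset, genes == genes_to_analyze[1]))
broom::tidy(fit)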