Project: Likelihood Ratio Test for Competing Biological Models

Background

In biology, we often compare two alternative explanations for the same data.

Examples:

  • Is a DNA sequence better explained by a motif model or by random background?
  • Does adding an extra parameter significantly improve a model?
  • Is one experimental condition better explained by a different distribution?

The Likelihood Ratio Test (LRT) provides a principled statistical framework to compare two nested models using probability theory.

In this project, you will implement an LRT from scratch in F#.


Learning Objectives

After completing this project, you should be able to:

  • Explain what a statistical model and likelihood are
  • Implement likelihood functions for simple probabilistic models
  • Compare nested models using a likelihood ratio
  • Compute a test statistic and p-value
  • Interpret statistical evidence in a biological context

Problem Description

You are given numerical observations and two competing statistical models:

  • Null model $M_0$: a simpler model with fewer parameters
  • Alternative model $M_1$: a more complex model that extends $M_0$

Your task is to determine whether the more complex model provides a significantly better explanation of the data.


Key Idea

The Likelihood Ratio Test asks:

Does the increase in likelihood justify the additional model complexity?

This is answered by comparing the maximum likelihoods of the two models.


Statistical Framework

Likelihood

Given data $x_1, \dots, x_n$ and a model with parameters $\theta$, the likelihood is:

$$ L(\theta) = \prod_{i=1}^n p(x_i \mid \theta) $$

In practice, we work with the log-likelihood:

$$ \ell(\theta) = \sum_{i=1}^n \log p(x_i \mid \theta) $$
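
The reason for preferring the log form can be seen in a minimal F# sketch. The density values below are hypothetical and chosen only to trigger underflow; they do not come from any model in this project.

```fsharp
// Hypothetical per-observation densities p(x_i | theta), for illustration only.
let densities = List.replicate 1000 0.1

// The direct product 0.1^1000 is far below the smallest positive double,
// so it underflows to 0.0.
let likelihood = densities |> List.fold (fun acc p -> acc * p) 1.0

// The sum of logs carries the same information in a representable range.
let logLikelihood = densities |> List.sumBy (fun p -> log p)

printfn "L = %g, log L = %.3f" likelihood logLikelihood
```

Once the product has collapsed to zero, two models can no longer be distinguished, whereas their log-likelihoods can still be subtracted.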


Likelihood Ratio Test Statistic

Let:

  • $\ell_0$ be the maximum log-likelihood under the null model
  • $\ell_1$ be the maximum log-likelihood under the alternative model

The test statistic is:

$$ \Lambda = 2(\ell_1 - \ell_0) $$

If the null model is true and standard regularity conditions hold, $\Lambda$ approximately follows a $\chi^2$ distribution for large $n$ (Wilks' theorem), with degrees of freedom equal to the difference in the number of free parameters between the two models.


Models Used in This Project

You will use Gaussian models with known structure.

Null Model $M_0$

All observations come from a single normal distribution:

$$ x_i \sim \mathcal{N}(\mu, \sigma^2) $$

Parameters: $\mu, \sigma$


Alternative Model $M_1$

Observations come from two groups, each with its own mean but a shared standard deviation:

$$ x_i \sim \mathcal{N}(\mu_{g(i)}, \sigma^2) $$

where $g(i) \in \{A, B\}$ denotes the group of observation $i$.

Parameters:

  • $\mu_A$
  • $\mu_B$
  • $\sigma$

This models a biological scenario such as:

  • control vs treatment
  • two experimental conditions

Input

  • A list of observations: values
  • A list of group labels: labels (e.g. "A" or "B")

Example:

values = [4.8; 5.1; 5.0; 6.2; 6.4; 6.1]
labels = ["A"; "A"; "A"; "B"; "B"; "B"]
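
One possible way to pair the two lists in F# is sketched below. The helper name `groupByLabel` and the choice to raise an exception on mismatched lengths are assumptions for illustration, not requirements of the project.

```fsharp
let values = [4.8; 5.1; 5.0; 6.2; 6.4; 6.1]
let labels = ["A"; "A"; "A"; "B"; "B"; "B"]

/// Groups observations by label (hypothetical helper).
/// Fails early if the two lists do not line up.
let groupByLabel (values: float list) (labels: string list) : Map<string, float list> =
    if List.length values <> List.length labels then
        invalidArg "labels" "values and labels must have the same length"
    List.zip labels values
    |> List.groupBy fst
    |> List.map (fun (label, pairs) -> label, List.map snd pairs)
    |> Map.ofList

let groups = groupByLabel values labels
printfn "%A" groups
```

Checking the lengths up front avoids the less descriptive exception that `List.zip` would otherwise raise.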

Output

  • Log-likelihood under $M_0$
  • Log-likelihood under $M_1$
  • Likelihood ratio statistic $\Lambda$
  • Degrees of freedom
  • p-value
  • Final interpretation

Starting Tasks

Task 1: Log-Likelihood for a Gaussian Model

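Implement the log-density of a single observation under a normal distribution; the log-likelihood of a sample is the sum of these terms over all observations: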

$$ \log p(x \mid \mu, \sigma) = -\frac{1}{2}\log(2\pi\sigma^2) -\frac{(x - \mu)^2}{2\sigma^2} $$
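
A minimal sketch of this formula in F#, assuming $\sigma > 0$. The function names `gaussianLogPdf` and `gaussianLogLikelihood` are illustrative choices.

```fsharp
/// Log-density of a single observation x under N(mu, sigma^2); assumes sigma > 0.
let gaussianLogPdf (mu: float) (sigma: float) (x: float) : float =
    let variance = sigma * sigma
    let normalizer = -0.5 * log (2.0 * System.Math.PI * variance)
    let quadratic = (x - mu) ** 2.0 / (2.0 * variance)
    normalizer - quadratic

/// Log-likelihood of a sample: the sum of the per-observation log-densities.
let gaussianLogLikelihood (mu: float) (sigma: float) (xs: float list) : float =
    xs |> List.sumBy (gaussianLogPdf mu sigma)

// At x = mu with sigma = 1 the quadratic term vanishes, leaving -0.5 * log(2 * pi).
printfn "%.6f" (gaussianLogPdf 0.0 1.0 0.0)
```

Partially applying `gaussianLogPdf mu sigma` gives a function of `x` alone, which is why it can be passed directly to `List.sumBy`.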


Task 2: Parameter Estimation

Estimate parameters by maximum likelihood.

For a normal distribution:

  • $\hat{\mu} = \text{mean}(x)$
  • $\hat{\sigma}^2 = \frac{1}{n} \sum (x_i - \hat{\mu})^2$

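Apply this:

  • once to all data (null model), giving $\hat{\mu}$ and $\hat{\sigma}_0^2$
  • under the alternative model, estimate one mean per group ($\hat{\mu}_A$, $\hat{\mu}_B$) and a single pooled variance $\hat{\sigma}_1^2 = \frac{1}{n} \sum (x_i - \hat{\mu}_{g(i)})^2$, because $M_1$ shares $\sigma$ across groups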
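
The estimators could be sketched as follows. Because $M_1$ as defined above has a single $\sigma$, this sketch estimates one mean per group but pools the squared residuals into one variance. The names `fitNull` and `fitAlternative` are hypothetical.

```fsharp
/// MLE of (mean, variance) for a single normal distribution (null model).
let fitNull (values: float list) : float * float =
    let n = float (List.length values)
    let mu = List.average values
    let variance = (values |> List.sumBy (fun x -> (x - mu) ** 2.0)) / n
    mu, variance

/// MLE for the two-group model: one mean per label, one variance pooled over all residuals.
let fitAlternative (values: float list) (labels: string list) : Map<string, float> * float =
    let pairs = List.zip labels values
    let means =
        pairs
        |> List.groupBy fst
        |> List.map (fun (label, group) -> label, group |> List.averageBy snd)
        |> Map.ofList
    let n = float (List.length values)
    let variance =
        (pairs |> List.sumBy (fun (label, x) -> (x - means.[label]) ** 2.0)) / n
    means, variance
```

Both functions divide by $n$ rather than $n - 1$: the maximum-likelihood variance is the biased estimator, and the test statistic is defined in terms of it.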

Task 3: Compute Log-Likelihoods

Compute:

  • $\ell_0$: log-likelihood under the null model
  • $\ell_1$: log-likelihood under the alternative model
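
Both log-likelihoods can be evaluated with one function if the mean is looked up per observation. The sketch below uses a small hypothetical dataset (not the project data) whose estimates are easy to verify by hand: overall mean 4 with variance 5, group means 2 and 6 with pooled variance 1.

```fsharp
/// Log-density of x under N(mu, variance); assumes variance > 0.
let gaussianLogPdf (mu: float) (variance: float) (x: float) : float =
    -0.5 * log (2.0 * System.Math.PI * variance) - (x - mu) ** 2.0 / (2.0 * variance)

/// Log-likelihood when the mean of each observation is looked up from its label.
let logLikelihood (meanOf: string -> float) (variance: float) (labels: string list) (values: float list) : float =
    List.zip labels values
    |> List.sumBy (fun (label, x) -> gaussianLogPdf (meanOf label) variance x)

// Hypothetical toy data, chosen so the estimates are round numbers.
let values = [1.0; 3.0; 5.0; 7.0]
let labels = ["A"; "A"; "B"; "B"]

// Null model: every observation shares the overall mean 4.0, variance 20/4 = 5.0.
let ell0 = logLikelihood (fun _ -> 4.0) 5.0 labels values

// Alternative model: group means 2.0 and 6.0, pooled variance 4/4 = 1.0.
let ell1 = logLikelihood (fun label -> if label = "A" then 2.0 else 6.0) 1.0 labels values

printfn "l0 = %.4f, l1 = %.4f" ell0 ell1
```

Since $M_0$ is a special case of $M_1$, `ell1` can never be smaller than `ell0` when both are evaluated at their maximum-likelihood estimates; a violation indicates a bug.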

Task 4: Likelihood Ratio Statistic

Compute:

$$ \Lambda = 2(\ell_1 - \ell_0) $$
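
For the Gaussian models in this project the statistic also has a closed form, which can serve as a sanity check on an implementation. Write $\hat{\sigma}_0^2$ and $\hat{\sigma}_1^2$ for the maximum-likelihood variance estimates under $M_0$ and $M_1$ (the latter pooled over both groups). Substituting the estimates into the log-likelihood collapses the quadratic term to $n/2$ under either model:

$$ \ell(\hat{\mu}, \hat{\sigma}^2) = -\frac{n}{2}\log(2\pi\hat{\sigma}^2) - \frac{n}{2} \quad\Longrightarrow\quad \Lambda = 2(\ell_1 - \ell_0) = n \log\frac{\hat{\sigma}_0^2}{\hat{\sigma}_1^2} $$

Because splitting the data by group can only reduce the residual sum of squares, $\hat{\sigma}_1^2 \le \hat{\sigma}_0^2$ and therefore $\Lambda \ge 0$.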


Task 5: p-value Calculation

  • Degrees of freedom: $df = (\text{number of parameters in } M_1) - (\text{number of parameters in } M_0) = 3 - 2 = 1$
  • Compute the p-value using the $\chi^2(df)$ distribution

(You may implement the $\chi^2$ CDF numerically or approximate it.)
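
For $df = 1$ the upper tail of the $\chi^2$ distribution reduces to the complementary error function: $p = \operatorname{erfc}(\sqrt{\Lambda / 2})$. The core F# library does not provide `erfc`, so the sketch below approximates it by composite Simpson integration. This is one of several reasonable numerical approaches, and the step count and integration width are assumptions that suit this use case.

```fsharp
/// Complementary error function for x >= 0, computed as
/// (2 / sqrt(pi)) * integral of exp(-t^2) over [x, x + 8] with composite Simpson's rule.
/// The neglected tail beyond x + 8 is smaller than exp(-64).
let erfc (x: float) : float =
    let f (t: float) = exp (-(t * t))
    let steps = 2000                      // must be even for Simpson's rule
    let h = 8.0 / float steps
    let interior =
        [ 1 .. steps - 1 ]
        |> List.sumBy (fun i ->
            let weight = if i % 2 = 1 then 4.0 else 2.0
            weight * f (x + float i * h))
    let integral = h / 3.0 * (f x + interior + f (x + 8.0))
    2.0 / sqrt System.Math.PI * integral

/// Upper-tail probability of a chi-square distribution with 1 degree of freedom.
let chiSquare1PValue (lambda: float) : float =
    erfc (sqrt (max lambda 0.0 / 2.0))

// 3.841459 is the familiar 5% critical value of chi-square with 1 degree of freedom.
printfn "%.4f" (chiSquare1PValue 3.841459)
```

The `max lambda 0.0` guard protects against a slightly negative statistic caused by rounding when the two models fit almost equally well.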


Task 6: Interpretation

Decide whether the alternative model is significantly better than the null model at a chosen significance level (e.g. $\alpha = 0.05$).
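
The decision rule itself is a single comparison. A sketch, with the `Decision` type and function names chosen for illustration:

```fsharp
type Decision =
    | RejectNull
    | FailToRejectNull

/// Reject the null model when the p-value falls below the significance level alpha.
let decide (alpha: float) (pValue: float) : Decision =
    if pValue < alpha then RejectNull else FailToRejectNull

/// Turns the decision into a sentence for the final report.
let interpret (alpha: float) (pValue: float) : string =
    match decide alpha pValue with
    | RejectNull ->
        sprintf "p = %g < alpha = %g: the two-mean model fits significantly better" pValue alpha
    | FailToRejectNull ->
        sprintf "p = %g >= alpha = %g: no significant improvement over the single-mean model" pValue alpha

printfn "%s" (interpret 0.05 0.2)
```

Failing to reject $M_0$ is not evidence that the group means are equal; it only means the data do not provide enough evidence of a difference at the chosen level.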


Example

Input

values: [5.0; 5.1; 4.9; 6.2; 6.3; 6.1]
labels: ["A"; "A"; "A"; "B"; "B"; "B"]

Output

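Log-likelihood (null): -5.50
Log-likelihood (alternative): 6.52
LRT statistic: 24.04
Degrees of freedom: 1
p-value: 9.4e-07
Conclusion: significant difference between groups

A positive log-likelihood is not an error: the density of a continuous variable can exceed 1 when the estimated variance is small.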

Implementation Notes

  • Do not use built-in statistical test functions
  • Focus on numerical correctness and clarity
  • Work in log-space throughout: sum log-densities instead of multiplying densities, to avoid numerical underflow
  • Structure your code so models are clearly separated

Extension Tasks

  • Implement a one-sided alternative
  • Visualize fitted distributions
  • Apply the test to real biological data
  • Extend to more than two groups

Submission

Submit:

  • F# source code
  • Documentation describing your approach
  • One example dataset with interpretation