Skip to content

Latest commit

 

History

History
226 lines (135 loc) · 4.34 KB

File metadata and controls

226 lines (135 loc) · 4.34 KB

Project: Gene Finding with a Hidden Markov Model and the Viterbi Algorithm

Background

In genomics, one fundamental task is to identify genes within a long DNA sequence. Genes are regions that are translated into proteins, while the surrounding DNA is non-coding.

A classical computational approach to this problem models DNA as being generated by a Hidden Markov Model (HMM), where:

  • the observed sequence is the DNA bases
  • the hidden states represent biological regions such as coding and non-coding DNA

In this project, you will implement a tiny gene finder using:

  • a simple HMM
  • the Viterbi algorithm to decode the most likely sequence of hidden states

Learning Objectives

After completing this project, you should be able to:

  • Explain what a Hidden Markov Model is
  • Define states, transitions, and emissions in a biological context
  • Implement the Viterbi dynamic programming algorithm
  • Decode the most likely hidden state sequence
  • Interpret algorithmic output biologically

Problem Description

You are given a DNA sequence consisting of the bases A, C, G, T.

Your task is to:

  1. Define a small HMM that models coding and non-coding regions
  2. Implement the Viterbi algorithm
  3. Predict which positions in the DNA sequence belong to genes

You will focus on exact decoding, not on parameter learning.


Hidden Markov Model Used

Hidden States

You will use the following states:

  • I – Intergenic (non-coding DNA)
  • C – Coding DNA

(Extensions with more states are optional.)


Observations

The observed symbols are DNA bases:

A, C, G, T

Transition Probabilities

Transitions model how likely it is to move between regions.

Example (illustrative):

From → To I C
I 0.95 0.05
C 0.10 0.90

Emission Probabilities

Different regions emit bases with different probabilities.

Example:

State A C G T
I 0.30 0.20 0.20 0.30
C 0.15 0.35 0.35 0.15

(These values may be provided or estimated from data.)


Input

  • A DNA sequence as a string
  • Transition probabilities
  • Emission probabilities
  • Initial state probabilities

Output

  • The most likely sequence of hidden states (one per base)
  • Predicted coding regions as intervals
  • The log-probability of the best path

Key Idea: Viterbi Algorithm

The Viterbi algorithm finds the most likely path of hidden states that explains the observed sequence.

It uses dynamic programming to compute:

  • the best score for each state at each position
  • the previous state that led to that score

Starting Tasks

Task 1: Model Representation

Define data structures for:

  • states
  • transition probabilities
  • emission probabilities

Task 2: Initialization

For the first base in the sequence:

  • compute the initial log-probability for each state

Task 3: Viterbi Recursion

For each position i in the sequence and each state s:

$$ V_i(s) = \max_{s'} \left[ V_{i-1}(s') + \log P(s \mid s') + \log P(x_i \mid s) \right] $$

Store:

  • the best score
  • the state s' that achieved it

Task 4: Termination and Traceback

  • Identify the state with the highest score at the final position
  • Trace back through stored pointers to recover the full state path

Task 5: Gene Prediction

Convert the state path into predicted gene regions:

  • contiguous runs of C correspond to coding regions

Example

Input DNA

ATGCGCGTTATATATGCGC

Output (states)

IIIIICCCCCIIIIIIIII

Predicted gene

Start: 5, End: 9

(Indices are illustrative.)


Implementation Notes

  • Work in log-space to avoid numerical underflow
  • Do not use existing HMM or Viterbi libraries
  • Focus on clarity rather than speed
  • Use arrays or matrices for dynamic programming tables

Tasks extension

  • Add start and stop codon states
  • Use three coding states to model codon periodicity
  • Estimate emission probabilities from labeled sequences
  • Compare results to a GC-content sliding window
  • Visualize predicted genes along the sequence

Submission

Submit:

  • F# source code
  • A short documentation describing your model and assumptions
  • One example input sequence with interpretation