Project: Gene Finding with a Hidden Markov Model and the Viterbi Algorithm
In genomics, one fundamental task is to identify genes within a long DNA sequence. Genes are regions that are translated into proteins, while the surrounding DNA is non-coding.
A classical computational approach to this problem models DNA as being generated by a Hidden Markov Model (HMM), where:
- the observed sequence is the DNA bases
- the hidden states represent biological regions such as coding and non-coding DNA
In this project, you will implement a tiny gene finder using:
- a simple HMM
- the Viterbi algorithm to decode the most likely sequence of hidden states
After completing this project, you should be able to:
- Explain what a Hidden Markov Model is
- Define states, transitions, and emissions in a biological context
- Implement the Viterbi dynamic programming algorithm
- Decode the most likely hidden state sequence
- Interpret algorithmic output biologically
You are given a DNA sequence consisting of the bases A, C, G, T.
Your task is to:
- Define a small HMM that models coding and non-coding regions
- Implement the Viterbi algorithm
- Predict which positions in the DNA sequence belong to genes
You will focus on exact decoding, not on parameter learning.
Hidden Markov Model Used
Hidden States
You will use the following states:
I– Intergenic (non-coding DNA)C– Coding DNA
(Extensions with more states are optional.)
The observed symbols are DNA bases:
A, C, G, T
Transitions model how likely it is to move between regions.
Example (illustrative):
| From → To | I | C |
|---|---|---|
| I | 0.95 | 0.05 |
| C | 0.10 | 0.90 |
Different regions emit bases with different probabilities.
Example:
| State | A | C | G | T |
|---|---|---|---|---|
| I | 0.30 | 0.20 | 0.20 | 0.30 |
| C | 0.15 | 0.35 | 0.35 | 0.15 |
(These values may be provided or estimated from data.)
- A DNA sequence as a string
- Transition probabilities
- Emission probabilities
- Initial state probabilities
- The most likely sequence of hidden states (one per base)
- Predicted coding regions as intervals
- The log-probability of the best path
The Viterbi algorithm finds the most likely path of hidden states that explains the observed sequence.
It uses dynamic programming to compute:
- the best score for each state at each position
- the previous state that led to that score
Define data structures for:
- states
- transition probabilities
- emission probabilities
For the first base in the sequence:
- compute the initial log-probability for each state
For each position i in the sequence and each state s:
Store:
- the best score
- the state
s'that achieved it
- Identify the state with the highest score at the final position
- Trace back through stored pointers to recover the full state path
Convert the state path into predicted gene regions:
- contiguous runs of
Ccorrespond to coding regions
Input DNA
ATGCGCGTTATATATGCGC
Output (states)
IIIIICCCCCIIIIIIIII
Predicted gene
Start: 5, End: 9
(Indices are illustrative.)
- Work in log-space to avoid numerical underflow
- Do not use existing HMM or Viterbi libraries
- Focus on clarity rather than speed
- Use arrays or matrices for dynamic programming tables
- Add start and stop codon states
- Use three coding states to model codon periodicity
- Estimate emission probabilities from labeled sequences
- Compare results to a GC-content sliding window
- Visualize predicted genes along the sequence
Submit:
- F# source code
- A short documentation describing your model and assumptions
- One example input sequence with interpretation