Project: Gene Finding with a Hidden Markov Model and the Viterbi Algorithm

Background

In genomics, one fundamental task is to identify genes within a long DNA sequence. Genes are regions that are translated into proteins, while the surrounding DNA is non-coding.

A classical computational approach to this problem models DNA as being generated by a Hidden Markov Model (HMM), where:

the observed sequence is the DNA bases
the hidden states represent biological regions such as coding and non-coding DNA

In this project, you will implement a tiny gene finder using:

a simple HMM
the Viterbi algorithm to decode the most likely sequence of hidden states

Learning Objectives

After completing this project, you should be able to:

Explain what a Hidden Markov Model is
Define states, transitions, and emissions in a biological context
Implement the Viterbi dynamic programming algorithm
Decode the most likely hidden state sequence
Interpret algorithmic output biologically

Problem Description

You are given a DNA sequence consisting of the bases A, C, G, T.

Your task is to:

Define a small HMM that models coding and non-coding regions
Implement the Viterbi algorithm
Predict which positions in the DNA sequence belong to genes

You will focus on exact decoding, not on parameter learning.

Hidden Markov Model Used

Hidden States

You will use the following states:

I – Intergenic (non-coding DNA)
C – Coding DNA

(Extensions with more states are optional.)

Observations

The observed symbols are DNA bases:

A, C, G, T

Transition Probabilities

Transitions model how likely it is to move between regions.

Example (illustrative):

From → To	I	C
I	0.95	0.05
C	0.10	0.90

Emission Probabilities

Different regions emit bases with different probabilities.

Example:

State	A	C	G	T
I	0.30	0.20	0.20	0.30
C	0.15	0.35	0.35	0.15

(These values may be provided or estimated from data.)

Input

A DNA sequence as a string
Transition probabilities
Emission probabilities
Initial state probabilities

Output

The most likely sequence of hidden states (one per base)
Predicted coding regions as intervals
The log-probability of the best path

Key Idea: Viterbi Algorithm

The Viterbi algorithm finds the most likely path of hidden states that explains the observed sequence.

It uses dynamic programming to compute:

the best score for each state at each position
the previous state that led to that score

Starting Tasks

Task 1: Model Representation

Define data structures for:

states
transition probabilities
emission probabilities

Task 2: Initialization

For the first base in the sequence:

compute the initial log-probability for each state

Task 3: Viterbi Recursion

For each position i in the sequence and each state s:

$$ V_i(s) = \max_{s'} \left[ V_{i-1}(s') + \log P(s \mid s') + \log P(x_i \mid s) \right] $$

Store:

the best score
the state s' that achieved it

Task 4: Termination and Traceback

Identify the state with the highest score at the final position
Trace back through stored pointers to recover the full state path

Task 5: Gene Prediction

Convert the state path into predicted gene regions:

contiguous runs of C correspond to coding regions

Example

Input DNA

ATGCGCGTTATATATGCGC

Output (states)

IIIIICCCCCIIIIIIIII

Predicted gene

Start: 5, End: 9

(Indices are illustrative.)

Implementation Notes

Work in log-space to avoid numerical underflow
Do not use existing HMM or Viterbi libraries
Focus on clarity rather than speed
Use arrays or matrices for dynamic programming tables

Tasks extension

Add start and stop codon states
Use three coding states to model codon periodicity
Estimate emission probabilities from labeled sequences
Compare results to a GC-content sliding window
Visualize predicted genes along the sequence

Submission

Submit:

F# source code
A short documentation describing your model and assumptions
One example input sequence with interpretation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Project: Gene Finding with a Hidden Markov Model and the Viterbi Algorithm

Background

Learning Objectives

Problem Description

Hidden Markov Model Used

Hidden States

Observations

Transition Probabilities

Emission Probabilities

Input

Output

Key Idea: Viterbi Algorithm

Starting Tasks

Task 1: Model Representation

Task 2: Initialization

Task 3: Viterbi Recursion

Task 4: Termination and Traceback

Task 5: Gene Prediction

Example

Implementation Notes

Tasks extension

Submission

FilesExpand file tree

Gene_Finding_with_Viterbi_Algorithm.md

Latest commit

History

Gene_Finding_with_Viterbi_Algorithm.md

File metadata and controls

Project: Gene Finding with a Hidden Markov Model and the Viterbi Algorithm

Background

Learning Objectives

Problem Description

Hidden Markov Model Used

Hidden States

Observations

Transition Probabilities

Emission Probabilities

Input

Output

Key Idea: Viterbi Algorithm

Starting Tasks

Task 1: Model Representation

Task 2: Initialization

Task 3: Viterbi Recursion

Task 4: Termination and Traceback

Task 5: Gene Prediction

Example

Implementation Notes

Tasks extension

Submission