Hidden Markov Models (HMMs) have many applications, including the identification of protein-coding genes within a sequence of DNA. My research investigated whether Recurrent Neural Networks can infer the parameters of HMMs from observation sequences more accurately than a standard approach, the Baum-Welch algorithm, with promising results.
DNA is a sequence of 4 different building blocks, called nucleotides. Within a DNA sequence, exons are segments which contain the instructions for protein synthesis, while introns are segments which do not. Exons can be further divided into codons: groups of 3 nucleotides that each code for a single amino acid. Within each codon, the likelihood of each nucleotide type depends on the position it occupies, producing a pattern that repeats every 3 nucleotides. This periodicity does not occur in introns, as they are not used to create amino acids (Yoon, 2009).
The identification of protein-coding genes within a given DNA sequence is one of many real-world applications of Hidden Markov Models (HMMs). In an HMM, a hidden state changes randomly between time steps, and the probability of each observation depends on the hidden state at that time step. In this application, the four hidden states represent an intron and the three positions within an exon codon. We can use an HMM to estimate the most likely sequence of hidden states given an observation sequence, thereby identifying introns and exons.
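To make this concrete, below is a minimal sketch of such a model together with Viterbi decoding, which recovers the most likely hidden state path. The state names, parameter values, and example sequence are hypothetical illustrations, not values from my project.

```python
import numpy as np

# Hypothetical 4-state gene model: an intron state plus the 3 codon positions of an exon.
states = ["intron", "exon_pos1", "exon_pos2", "exon_pos3"]
nucleotides = {"A": 0, "C": 1, "G": 2, "T": 3}

pi = np.array([0.7, 0.3, 0.0, 0.0])        # initial distribution over hidden states
A = np.array([[0.9, 0.1, 0.0, 0.0],        # transition probabilities; exon positions
              [0.0, 0.0, 1.0, 0.0],        # cycle deterministically 1 -> 2 -> 3
              [0.0, 0.0, 0.0, 1.0],
              [0.1, 0.9, 0.0, 0.0]])
B = np.array([[0.25, 0.25, 0.25, 0.25],    # emission probabilities: uniform in introns,
              [0.30, 0.20, 0.30, 0.20],    # position-specific in exons (period-3 pattern)
              [0.20, 0.30, 0.20, 0.30],
              [0.35, 0.15, 0.35, 0.15]])

def viterbi(obs, pi, A, B):
    """Most likely sequence of hidden states for an observation sequence."""
    T, N = len(obs), len(pi)
    log_delta = np.log(pi + 1e-12) + np.log(B[:, obs[0]] + 1e-12)
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = log_delta[:, None] + np.log(A + 1e-12)   # scores[i, j]: from state i to j
        backptr[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(B[:, obs[t]] + 1e-12)
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):                         # trace the best path backwards
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

obs = [nucleotides[c] for c in "ATGGCATTA"]
print([states[s] for s in viterbi(obs, pi, A, B)])
```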
However, HMMs have parameters which determine the initial distribution of hidden states, the probabilities of transitioning between hidden states at each time step, and the probabilities of each observation from each hidden state. In protein-coding gene identification, it is unknown beforehand whether each nucleotide belongs to an intron or an exon, so we wish to “learn” the parameters that maximise the probability of the given DNA sequences.
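Concretely, this probability can be computed with the forward algorithm, which sums over all possible hidden state paths. A sketch, reusing the hypothetical parameters above and rescaling at each step to avoid numerical underflow:

```python
def log_likelihood(obs, pi, A, B):
    """Log-probability of an observation sequence under an HMM (forward algorithm)."""
    alpha = pi * B[:, obs[0]]             # joint probability of first obs and each state
    log_prob = 0.0
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # propagate one time step forward
        scale = alpha.sum()
        log_prob += np.log(scale)         # accumulate log-scale to avoid underflow
        alpha = alpha / scale
    return log_prob + np.log(alpha.sum())

print(log_likelihood(obs, pi, A, B))      # fitted parameters should maximise this value
```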
The Baum-Welch algorithm is a classic approach to inferring these parameters, but I investigated whether deep learning techniques are more accurate; namely, I used Recurrent Neural Networks (RNNs). In an RNN, a list of numbers is carried between time steps and updated at each step: a weighted sum of the previous list and the current observation is taken, then a non-linear function is applied. At the final time step, the list is aggregated and transformed so that it can be used as the parameters of an HMM.
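A minimal sketch of this architecture, assuming PyTorch and illustrative design choices (hidden size, tanh activation, softmax normalisation) rather than my exact model:

```python
import torch
import torch.nn as nn

class HMMParamRNN(nn.Module):
    """Illustrative RNN that maps an observation sequence to HMM parameters."""
    def __init__(self, n_obs=4, n_states=4, hidden=32):
        super().__init__()
        self.rnn = nn.RNN(n_obs, hidden, nonlinearity="tanh", batch_first=True)
        self.to_pi = nn.Linear(hidden, n_states)
        self.to_A = nn.Linear(hidden, n_states * n_states)
        self.to_B = nn.Linear(hidden, n_states * n_obs)
        self.n_states, self.n_obs = n_states, n_obs

    def forward(self, x):                 # x: (batch, T, n_obs), one-hot nucleotides
        _, h_n = self.rnn(x)              # weighted sum + tanh at every time step
        h = h_n.squeeze(0)                # the list of numbers at the final time step
        pi = torch.softmax(self.to_pi(h), dim=-1)
        A = torch.softmax(self.to_A(h).view(-1, self.n_states, self.n_states), dim=-1)
        B = torch.softmax(self.to_B(h).view(-1, self.n_states, self.n_obs), dim=-1)
        return pi, A, B                   # softmax makes each row a valid distribution
```

The softmax at the end is one simple way to guarantee that the transformed outputs are valid probabilities, so they can be plugged directly into an HMM.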
An integral part of deep learning is the loss function; within an RNN, it is used to determine reasonable values for the weights via a technique called Gradient Descent. Within my project, I investigated suitable loss functions based on either the likelihood of the observation sequences or the distribution-wise difference between the true and fitted HMM parameters. Although my derived loss functions all have limitations, the RNN still produced results comparable to the Baum-Welch algorithm. My research therefore indicates that a deep learning approach has potential in HMM parameter inference, which may have real-world impact, including improved protein-coding gene identification.
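For instance, one natural loss is the negative log-likelihood of the observation sequences under the RNN's output parameters: a differentiable version of the forward algorithm sketched earlier, so that Gradient Descent can update the weights. A sketch, reusing the hypothetical model and sequence from above:

```python
def nll_loss(obs, pi, A, B):
    """Negative log-likelihood of one sequence (differentiable forward algorithm)."""
    alpha = pi * B[:, obs[0]]
    log_prob = torch.tensor(0.0)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        scale = alpha.sum()
        log_prob = log_prob + torch.log(scale)
        alpha = alpha / scale
    return -(log_prob + torch.log(alpha.sum()))

model = HMMParamRNN()
x = torch.eye(4)[torch.tensor(obs)].unsqueeze(0)   # one-hot encode the sequence
pi, A, B = model(x)
loss = nll_loss(obs, pi[0], A[0], B[0])            # drop the batch dimension
loss.backward()                                    # gradients for Gradient Descent
```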
References
Yoon, B.-J., 2009. Hidden Markov Models and their applications in biological sequence analysis. Current Genomics, 10(6), pp.402–415.
Kyan Percevault
The University of Adelaide
