Mini Review
Gene Finding Using Hidden Markov Model

Srabanti Maji and Deepak Garg
ABSTRACT
The objective of this study is to present a mini review of Hidden Markov Models (HMMs), which have become important and popular among bioinformatics researchers and on which a large number of software tools are based. The mathematical foundations of HMMs are considered first, in brief, followed by their application to gene identification. In gene identification, an HMM addresses three basic problems. The first is the evaluation problem: computing the probability that a particular HMM generates a given sequence of observations. The second is the decoding problem: uncovering the most likely sequence of hidden states. The third is the learning problem: adjusting the model parameters, i.e., training the HMM to find an optimal model. The evaluation problem is solved with the forward and backward algorithms, the decoding problem with the Viterbi and posterior decoding algorithms and the learning problem with the Viterbi training and Baum-Welch algorithms. Finally, some limitations of the current approaches and future directions are also reported.
Received: May 17, 2012;
Accepted: July 06, 2012;
Published: August 02, 2012
INTRODUCTION
A widely applied and powerful technique is that of Markov processes and Markov chains (De Fonzo et al., 2007; Nasiri et al., 2011). A Markov process is a stochastic process satisfying a certain property, called the Markov property: one can make predictions for the future of the process based exclusively on its present state just as well as one could knowing the process's full history (Parent et al., 2004; Cawley and Pachter, 2003). A Markov chain is a first-order Markov process for which the probability distribution of a state at a given time depends explicitly only on the previous state and not on all the earlier ones (Garg, 2007a; Lifshits et al., 2009). More specifically, there is a finite set of possible states and the transitions among them are governed by a set of conditional probabilities of the next state given the present one, called transition probabilities (Garg, 2007b; Kumar and Raghava, 2009). The transition probabilities are implicitly (unless declared otherwise) independent of time, in which case one speaks of homogeneous, or stationary, Markov chains. In the case of a DNA sequence, time means the position along the sequence (Yoon and Vaidyanathan, 2004). Starting from a given initial state, the consecutive transitions from one state to the next produce a time evolution of the chain that is completely represented by a sequence of states which, a priori, are to be considered random (Lunter, 2007; Dosay-Akbulut, 2006).
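As a minimal illustration of a homogeneous first-order Markov chain over DNA (a sketch with made-up transition probabilities, not values from any cited study), the following code samples a sequence in which each base depends only on the previous one:

```python
import random

# States of the chain: the four DNA bases.
BASES = "ACGT"

# Illustrative (made-up) transition probabilities P(next base | current base);
# each row must sum to 1.
TRANSITIONS = {
    "A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
    "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "T": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}

def sample_chain(length, start="A"):
    """Generate a sequence by repeatedly sampling the next base
    given only the current one (the Markov property)."""
    seq = [start]
    for _ in range(length - 1):
        probs = TRANSITIONS[seq[-1]]
        seq.append(random.choices(list(probs), weights=list(probs.values()))[0])
    return "".join(seq)

print(sample_chain(60))
```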
A Hidden Markov Model consists of two stochastic processes. The first stochastic process is a Markov chain that is characterized by states and transition probabilities (Ahmad et al., 2011; Tran et al., 2009). The states of the chain are not externally visible, hence hidden. The second stochastic process generates the emissions observable at each instant, according to a state-dependent probability distribution (Meyer and Durbin, 2002). In a Hidden Markov Model, the term hidden refers not to the parameters of the model but to the states of the Markov chain (El-Sayed and Khedr, 2007).
A Hidden Markov Model is thus a generalization of a Markov chain, in which each (internal) state is not directly observable (hence the term hidden) but produces (emits) an observable random output (external) state, also called emission, according to a given stationary probability law (Oron et al., 2008). In this case, the time evolution of the internal states can be inferred only through the sequence of the observed output states.
If the number of internal states is N, the transition probability law is described by an N×N matrix; if the number of possible emissions is M, the emission probability law is described by an N×M matrix. A model is considered defined once these two matrices and the initial distribution of the internal states are given. The study by Rabiner (Jing et al., 2006; Frikha et al., 2007; Rabiner, 1989) is widely appreciated for the clarity with which it explains HMMs. HMMs are a powerful class of models used in many fields, including gene finding, profile searches, error-correction coding, multiple sequence alignment, speech recognition and regulatory site identification.
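To make these dimensions concrete, the sketch below sets up, with arbitrary illustrative numbers (not parameters from any cited gene finder), the three ingredients that define an HMM: an N×N transition matrix, an N×M emission matrix and the initial distribution over the N internal states; it also checks that every row is a valid probability distribution.

```python
import numpy as np

# N = 2 internal (hidden) states, M = 4 possible emissions (DNA bases).
states = ["coding", "noncoding"]          # hidden states
emissions = ["A", "C", "G", "T"]          # observable symbols

# Illustrative values only: each row must sum to 1.
A = np.array([[0.9, 0.1],                 # N x N transition probabilities
              [0.2, 0.8]])
B = np.array([[0.20, 0.30, 0.30, 0.20],   # N x M emission probabilities
              [0.35, 0.15, 0.15, 0.35]])
pi = np.array([0.5, 0.5])                 # initial distribution over states

# A model is defined once A, B and pi are given; sanity-check them.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```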
BIOLOGICAL BACKGROUND
The central dogma of molecular biology pertains to DNA, RNA and proteins. DNA is converted to RNA by a process known as transcription and RNA to proteins by a process known as translation. DNA can be represented by four characters, A, C, T and G, which are called nucleotides. RNA can be represented by the same characters, except that the nucleotide T is replaced by U (Cheng and Zou, 2007; Kim, 2006; Pop and Salzberg, 2008). In the case of proteins, the representation consists of 20 characters, corresponding to the 20 amino acids of which they are composed (Sur et al., 2008). A one-to-one letter mapping exists between a DNA molecule and its associated RNA molecule and a three-to-one letter mapping between the RNA molecule and its associated protein molecule (Nath and Bhattacharjee, 2011; Tran et al., 2008). A protein sequence folds into a defined three-dimensional structure, whose coordinates are known only in rare cases; it is this structure that actually provides the molecular function of the protein. Determining DNA and RNA sequences is relatively cheaper than determining protein sequences. One of the most successful classes of techniques for analyzing biological sequences has been the use of HMMs (Larik and Soomro, 2003; Yang and Yang, 2008). HMMs are mainly used for predicting protein sequences, since they provide a good probabilistic framework for this task.
Table 1: HMM notations
Considering the work done, there are mainly two everyday tasks in bioinformatics: deducing the protein sequence from the DNA sequence and comparing protein sequences against an existing database of protein sequences, both of which can utilize Hidden Markov Models.
Mathematical basics and elements of HMM: A Hidden Markov Model is a finite, learnable stochastic automaton. It can be summarized as a kind of doubly stochastic process with the two following aspects: (a) the first stochastic process is a finite set of states, each of which is generally associated with a multidimensional probability distribution (Wang et al., 2008); the transitions between the different states are governed by a set of probabilities called transition probabilities; (b) in the second stochastic process, in any state an event can be observed; since we only analyze what we observe, without seeing in which state it occurred, the states are "hidden" to the observer, hence the name "Hidden Markov Model".
Each Hidden Markov Model is defined by states, state probabilities, transition probabilities, emission probabilities and initial probabilities (Bhardwaj et al., 2007; Lee et al., 2008). For the sake of simplicity, in the following notation we consider only one sequence of internal states and one sequence of associated emissions. Table 1 shows the notation.
BASICS OF HMM IN STOCHASTIC MODELING
There are important aspects of modeling Hidden Markov Models in order to solve real problems. Stochastic modeling with an HMM automaton involves two steps: first, defining the model structure and second, defining the learning and operating algorithms.
Definition of HMM architecture: The generalized computational architecture of an operating HMM λ with two integrated stochastic processes is shown in Fig. 1.
Fig. 1: Generalized architecture of an operating Hidden Markov Model. Here λ denotes the HMM; the model contains two kinds of variables, the hidden state variable St and the observation variable Et, and four time points (t-2, t-1, t and t+1) are shown
Each shape represents a random variable that can adopt any of a number of values. The random variable St is the hidden state at time t and the random variable Et is the observation at time t. The conditional probability distribution of the hidden variable St at time t, given the values of the hidden variables at all previous times, depends only on the value of the hidden variable St-1 at time t-1; all earlier values are irrelevant, so the Markov property defined above is satisfied. Through the second stochastic process, the value of the observed variable Et depends on the value of the hidden variable St at the same time t.
Definition of the learning and operating algorithms-three basic problems of HMMs: The main types of problems occurring in the use of Hidden Markov Models are the following.
Evaluation problem: Compute the probability that a given model generates a given sequence of observations. The most used algorithms are (a) the forward algorithm, which computes the probability of the observed emissions (given the model) starting from the beginning of the sequence and (b) the backward algorithm, which computes the probability of the emissions (given the model) starting from the end of the sequence.
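A minimal sketch of the forward recursion for the evaluation problem is given below; the parameter values are made up for illustration and, in practice, the probabilities would be scaled or computed in log space to avoid numerical underflow. The backward algorithm mirrors this recursion, starting from the end of the sequence.

```python
import numpy as np

def forward(obs, A, B, pi):
    """Return P(observations | model) via the forward algorithm.
    obs is a list of emission indices; A is N x N, B is N x M, pi has length N."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # initialisation
    for t in range(1, T):
        # alpha[t, j] = sum_i alpha[t-1, i] * A[i, j] * B[j, obs[t]]
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum()

# Toy usage with made-up parameters (2 states, 4 symbols).
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.20, 0.30, 0.30, 0.20], [0.35, 0.15, 0.15, 0.35]])
pi = np.array([0.5, 0.5])
print(forward([0, 2, 3, 1], A, B, pi))
```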
Decoding problem: Given a model and a sequence of observations, infer the most likely hidden states. At this stage we attempt to uncover the hidden part of the model, i.e., to find the correct state sequence. More specifically: (a) find the sequence of internal states that has, as a whole, the highest probability; the most used algorithm is the Viterbi algorithm; (b) find, for each position, the internal state that has the highest probability; the most used algorithm is the posterior decoding algorithm.
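A correspondingly minimal sketch of the Viterbi algorithm for decoding problem (a) is shown below, working in log space with made-up parameters; posterior decoding, for decoding problem (b), would instead combine the forward and backward probabilities at each position.

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Return the single most probable hidden-state path for obs (emission indices)."""
    N, T = len(pi), len(obs)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))            # best log-probability ending in each state
    psi = np.zeros((T, N), dtype=int)   # back-pointers to the best predecessor
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA     # N x N: from-state x to-state
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    # Trace back the best path from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.20, 0.30, 0.30, 0.20], [0.35, 0.15, 0.15, 0.35]])
pi = np.array([0.5, 0.5])
print(viterbi([0, 2, 3, 1, 0, 0, 3], A, B, pi))   # prints the state indices of the best path
```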
Learning problem: Given a sequence of observations, find an optimal model. The observation sequence used to adjust the model parameters is called a training sequence, because it is used to train the HMM. The training problem is crucial for most applications of HMMs. The most used algorithms start from an initial guessed model and iteratively adjust the model parameters. More specifically: (a) find the optimal model based on the single most probable state sequence (as in decoding problem (a)); the most used algorithm is Viterbi training, which applies the Viterbi algorithm recursively; (b) find the optimal model based on the positionwise most probable internal states (as in decoding problem (b)); the most used algorithm is the Baum-Welch algorithm, which applies the posterior (forward-backward) probabilities recursively.
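To make the Viterbi-training idea concrete, the sketch below shows just the re-estimation step: given observations and a hard state assignment (for instance the path returned by the viterbi() sketch above), the transition and emission probabilities are recomputed from counts with a small pseudocount. This is an illustrative simplification: iterating decode and re-estimate until the parameters stop changing gives Viterbi training, while Baum-Welch would replace the hard assignments with posterior probabilities from the forward-backward pass.

```python
import numpy as np

def reestimate(obs, path, N, M, pseudocount=1.0):
    """One re-estimation step of Viterbi training: given observations and a hard
    state assignment, recompute transition and emission matrices from counts."""
    A_counts = np.full((N, N), pseudocount)
    B_counts = np.full((N, M), pseudocount)
    for t in range(len(obs)):
        B_counts[path[t], obs[t]] += 1
        if t > 0:
            A_counts[path[t - 1], path[t]] += 1
    A = A_counts / A_counts.sum(axis=1, keepdims=True)
    B = B_counts / B_counts.sum(axis=1, keepdims=True)
    return A, B

# Toy usage: observations (base indices) and a guessed state path.
obs  = [0, 2, 3, 1, 0, 0, 3]
path = [0, 0, 0, 1, 1, 1, 1]
A, B = reestimate(obs, path, N=2, M=4)
print(A)
print(B)
```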
Gene finding: The term gene finding indicates the action
of finding genes within a DNA sequence, but is often used with a more general
meaning of labeling DNA tracts (Burge and Karlin, 1997),
for example labeling them as coding, intergenic, introns, etc. In this last
sense gene finding can be considered a special case (the most important in bioinformatics)
of the more general action known as sequence labeling (also for non-DNA sequences).
HMMs provide a good probabilistic framework for such sequence-labeling tasks (Birney, 2001; Nath and Bhattacharjee, 2011). In the early 1990s, Krogh et al. (1994) introduced the use of HMMs for discriminating coding and intergenic regions in the E. coli genome. The program GeneMark (Borodovsky and McIninch, 1993) finds genes in E. coli DNA using a Markov model for the coding region and acts as a parser with a complex intergenic model. In the more complex HMM (Fig. 2), the intergenic model consists of several parts in addition to the start and stop codon models. After generating the stop codon, the model chooses either the transition to the long intergenic HMM or the short intergenic HMM, with appropriate probabilities. The short intergenic HMM tends to generate intergenic regions of lengths from 1 to about 14, with statistics determined from examples of such short intergenic regions in actual E. coli contigs. Similarly, the parameters of the long intergenic model are adjusted to capture the statistics of longer intergenic regions. The parameters of the two intergenic models were estimated from a set of known intergenic regions by a learning procedure known as the forward-backward algorithm. As a result of the training process, the long intergenic model develops patterns without these having to be encoded explicitly. The complex parser achieves better accuracy.
Fig. 2: HMM architecture of a parser for E. coli DNA with a complex intergenic model. The gene model lies above the central state, which contains the 61 coding triplets; A, T, G and C are the four characters of the DNA sequence. The intergenic model consists of several parts in addition to the start and stop codon models. After generating the stop codon, the model chooses either the transition to the long intergenic HMM or the short intergenic HMM, with appropriate probabilities. The short intergenic HMM tends to generate intergenic regions of lengths from 1 to about 14, while the parameters of the long intergenic model are adjusted to capture the statistics of longer intergenic regions
Many extensions of the original pure HMM have been developed for gene finding. For example, Henderson et al. (1997) designed separate HMM modules, each appropriate for a specific region of DNA: exons, introns, intergenic regions and splice sites. The modules were then coupled to form a biologically viable topology. The integrated HMM was trained on a set of eukaryotic DNA sequences and then tested on unknown DNA sequences. Kulp et al. (1996) and Burge and Karlin (1998) used a Generalized HMM (GHMM, or hidden semi-Markov model) that allows more than one emission for each internal state.
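The distinguishing feature of a GHMM can be sketched as follows: each visit to an internal state first draws a duration from a state-specific length distribution and then emits a whole segment of that length, instead of emitting exactly one symbol per visit. The state names, length model and base compositions below are illustrative placeholders, not the values used by the cited gene finders.

```python
import random

BASES = "ACGT"

# Illustrative state-specific length distributions and emission compositions.
STATE_PARAMS = {
    "exon":       {"mean_len": 150, "base_probs": [0.25, 0.25, 0.25, 0.25]},
    "intergenic": {"mean_len": 500, "base_probs": [0.30, 0.20, 0.20, 0.30]},
}

def emit_segment(state):
    """Generalized-HMM style emission: draw a duration, then emit a whole segment."""
    params = STATE_PARAMS[state]
    length = max(1, int(random.expovariate(1.0 / params["mean_len"])))  # toy duration model
    return "".join(random.choices(BASES, weights=params["base_probs"], k=length))

# Alternate between the two states and concatenate the emitted segments.
sequence = "".join(emit_segment(s) for s in ["intergenic", "exon", "intergenic"])
print(len(sequence), sequence[:60])
```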
A pair Hidden Markov Model (Durbin et al., 1998) has a great deal of similarity with the standard HMM; the basic difference is that it emits a pairwise alignment, whereas a standard HMM emits a single sequence. By itself this method provides only alignments between two sequences but, with suitable enhancements, it is also applied to gene finding. For example, Meyer and Durbin (2002) presented a new method that predicts the gene structure starting from two homologous DNA sequences, identifying the conserved subsequences. A useful open-source implementation is described by Majoros et al. (2005). Lukashin and Borodovsky (1998) proposed a new algorithm (GeneMark.hmm) that improves the gene-finding performance of the old GeneMark algorithm by coupling it with an HMM. Pedersen and Hein (2003) introduced an Evolutionary Hidden Markov Model (EHMM), based on a suitable coupling of an HMM with a set of evolutionary models built on a phylogenetic tree.
We propose an oversimplified biological example of an HMM, inspired by the toy example of Eddy (2004), with only two internal states (although the number of possible state paths still grows exponentially with sequence length). The model is detailed in Fig. 3. The set of internal states is U ≡ {c, n}, where 'c' and 'n' stand for the coding and non-coding internal states, and the set of emissions is the set of the four DNA bases: X ≡ {A, T, G, C}
Fig. 3: An example of an HMM. (a) The circular boxes represent the internal states 'c' (coding) and 'n' (non-coding, or intron); inside the boxes are the probabilities of each emission ('A', 'T', 'C' and 'G') for each state; outside the boxes, four arrows are labelled with the corresponding transition probabilities. (b) The 66 observed emissions represent a sample sequence; the most likely sequence of internal states is given in the second row; the dotted part is depicted in (c) and (d). (c) The right-hand column represents the boxed tract of bases in (b); the other columns represent the two possible alternatives (for each squared base) for the internal state ('c' or 'n') that emitted the base; each row refers to the same position along the sequence; the arrows represent all possible transitions and the emissions. (d) A likely sequence of choices between the alternative internal states producing the sequence in (b); such a sequence of choices of internal state transitions amounts to choosing a path in (c)
We are given a DNA sequence that begins in an exon, contains one 5' splice site and ends in an intron. The problem is to identify where the 5' splice site (5'SS) is located. Imagine that exons have a uniform base composition on average (25% each base), that introns are A/T rich (say, 40% each for A/T and 10% each for C/G) and that the 5'SS consensus nucleotide is almost always a G (say, 95% G and 5% A). Figure 4 illustrates the action, on the same tract of the sequence as in Fig. 3b, of the Viterbi algorithm used to decode the whole sequence by means of the model described in Fig. 3a.
Fig. 4: Illustration of the action (on the same sequence) of the Viterbi algorithm used to decode the whole sequence by means of the model described in Fig. 3a; specifically, 4a and 4b illustrate the transition from Fig. 3c, d. (a) In each circular box is the value of the Ψ pointer computed in the first phase, as illustrated in 4c; specifically, Ψ = c means that we discard the hypothesis of a transition from the previous state 'n' (also indicated by dashing the corresponding incoming arrow). (b) In each circular box, the logarithmic value of the probability γ is shown, which is calculated and used in the first phase; dashed lines represent the transitions discarded in the second phase; note that logarithms of the probabilities are used in order to avoid problems with very small numbers, which is necessary in practical applications. (c) A zoom of the marked zone in 4b, illustrating the computation of one recursion step of the Viterbi algorithm
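As a rough numerical companion to the splice-site example above (in the spirit of Eddy, 2004), the sketch below scores one candidate placement of the 5'SS by computing the joint log-probability of the sequence and a state path, using the emission compositions quoted above; the transition probabilities, the example sequence and the candidate path are assumptions added here purely for illustration.

```python
import math

# Emission probabilities taken from the compositions quoted in the text:
# exons uniform, introns A/T-rich, the 5' splice site almost always G.
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.00, "G": 0.95, "T": 0.00},
    "I": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}
# Transition probabilities are assumptions for illustration only.
TRANS = {("E", "E"): 0.9, ("E", "5"): 0.1, ("5", "I"): 1.0, ("I", "I"): 1.0}

def log_joint(seq, path):
    """log P(sequence, state path), assuming the path starts in its first state.
    Zero-probability emissions or transitions would need special handling."""
    logp = math.log(EMIT[path[0]][seq[0]])
    for t in range(1, len(seq)):
        logp += math.log(TRANS[(path[t - 1], path[t])])
        logp += math.log(EMIT[path[t]][seq[t]])
    return logp

seq  = "CTTCATGTGAAAGCAGACGTAAGTCA"   # made-up example sequence
path = "EEEEEEEEEEEEEEEEEE5IIIIIII"   # one candidate placement of the 5' splice site
assert len(seq) == len(path)
print(log_joint(seq, path))
```

Scoring every possible placement of the 5'SS in this way and keeping the best one is exactly what the Viterbi algorithm accomplishes efficiently by dynamic programming.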
CONCLUSIONS
Considering the present scenario of HMMs in bioinformatics, from the time of their introduction and from the wealth of available applications, it may be concluded that the concept has reached a mature state. In addition, since the beginning of the field, novel applications have been fostered by many different extensions and modifications of the technique, producing models that are still known today as Hidden Markov Models.
The Hidden Markov Model can be used in various machine learning settings. The main advantages of HMMs are ease of use, the smaller training sets required and the deeper understanding of the phenomenon afforded by observing the inner structure of the model. Various gene prediction tools based on HMMs are available. The gene prediction tool Genscan can predict genes with an accuracy of about 80%. Several improvements have been reported in HMM-based gene prediction tools such as VEIL (Viterbi Exon-Intron Locator), which has an overall accuracy of 92%; its correlation coefficient, measuring how well individual bases are labeled correctly, is 0.73. To improve predictive accuracy, hybrid models are frequently designed by combining HMMs with techniques such as Support Vector Machines and Artificial Neural Networks.
REFERENCES
1: Parent, A., I. Benzaghou, I. Bougie and M. Bisaillon, 2004. Transcription and mRNA processing events: The importance of coordination. J. Biol. Sci., 4: 624-627.
2: Larik, A.S. and Z.A. Soomro, 2003. Elucidation of gene structure and function of prokaryotes and eukaryotes through DNA information technology. Asian J. Plant Sci., 2: 551-560.
3: Birney, E., 2001. Hidden Markov models in biological sequence analysis. IBM J. Res. Dev., 45: 449-454.
4: Borodovsky, M. and J. McIninch, 1993. GENMARK: Parallel gene recognition for both DNA strands. Comp. Chem., 17: 123-133.
5: Burge, C. and S. Karlin, 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268: 78-94.
6: Burge, C.B. and S. Karlin, 1998. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol., 8: 346-354.
7: Cawley, S.L. and L. Pachter, 2003. HMM sampling and applications to gene finding and alternative splicing. Bioinformatics, 19: ii36-ii41.
8: De Fonzo, V., F. Aluffi-Pentini and V. Parisi, 2007. Hidden Markov models in bioinformatics. Curr. Bioinformatics, 2: 49-61.
9: Nath, D.C. and A. Bhattacharjee, 2011. A Bayesian approach for autoregressive models in longitudinal data analysis: An application to type 2 diabetes drug comparison. Asian J. Applied Sci., 4: 640-648.
10: Durbin, R., S. Eddy, A. Krogh and G. Mitchison, 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge.
11: Eddy, S.R., 2004. What is a hidden Markov model? Nat. Biotechnol., 22: 1315-1316.
12: Garg, D., 2007. Efficient algorithm for pattern discovery in bioinformatic sequences. Ph.D. Thesis, Thapar University, India.
13: Garg, D., 2007. Multi hit, dropoff percentage and NCM-2: Three improvements in BLAST. J. Clin. Diagn. Res., 1: 339-346.
14: Henderson, J., S. Salzberg and K.H. Fasman, 1997. Finding genes in DNA with a hidden Markov model. J. Comput. Biol., 4: 127-141.
15: Nasiri, J., A. Haghnazari and M. Alavi, 2011. Evaluation of prediction accuracy of genefinders using mouse genomic DNA. Trends Bioinform., 4: 10-22.
16: Cheng, K. and C.H. Zou, 2007. Informatics models of recognitions of protein synthesis. Asian J. Biochem., 2: 432-436.
17: Kim, J.S., 2006. Characterization of a gene encoding DNA repair protein XRCC3 from Arabidopsis thaliana. Int. J. Bot., 2: 23-28.
18: Krogh, A., I.S. Mian and D. Haussler, 1994. A hidden Markov model that finds genes in Escherichia coli DNA. Nucleic Acids Res., 22: 4768-4778.
19: Kulp, D., D. Haussler, M.G. Reese and F.H. Eeckman, 1996. A generalized hidden Markov model for the recognition of human genes in DNA. Proc. Int. Conf. Intell. Syst. Mol. Biol., 4: 134-142.
20: Kumar, M. and G.P.S. Raghava, 2009. Prediction of nuclear proteins using SVM and HMM models. BMC Bioinformatics, Vol. 10.
21: Lee, S.A., C.H. Chan, C.H. Tsai, J.M. Lai, F.S. Wang, C.Y. Kao and C.Y.F. Huang, 2008. Ortholog-based protein-protein interaction prediction and its application to inter-species interactions. BMC Bioinformatics, Vol. 9.
22: Jing, L., H. Hua and L. Sufang, 2006. Voice identification based on HMMs. Trends Applied Sci. Res., 1: 79-82.
23: Lifshits, Y., S. Mozes, O. Weimann and M. Ziv-Ukelson, 2009. Speeding up HMM decoding and training by exploiting sequence repetitions. Algorithmica, 54: 379-399.
24: Lukashin, A.V. and M. Borodovsky, 1998. GeneMark.hmm: New solutions for gene finding. Nucleic Acids Res., 26: 1107-1115.
25: Lunter, G., 2007. HMMoC: A compiler for hidden Markov models. Bioinformatics, 23: 2485-2487.
26: Frikha, M., Z.B. Messaoud and A.B. Hamida, 2007. Towards discriminative training estimators for HMM speech recognition system. J. Applied Sci., 7: 3891-3899.
27: Majoros, W.H., M. Pertea and S.L. Salzberg, 2005. Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. Bioinformatics, 21: 1782-1788.
28: Meyer, I.M. and R. Durbin, 2002. Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics, 18: 1309-1318.
29: Dosay-Akbulut, M., 2006. Group I introns and splicing mechanism and their present possibilities in elasmobranches. J. Biol. Sci., 6: 921-925.
30: El-Sayed, M.H. and A.M. Khedr, 2007. Improving the performance of an HMM for protein family modelling. J. Applied Sci., 7: 1626-1632.
31: Ahmad, M., A. Abdullah and K. Buragga, 2011. A novel optimized approach for gene identification in DNA sequences. J. Applied Sci., 11: 806-814.
32: Oron, A.P., Z. Jiang and R. Gentleman, 2008. Gene set enrichment analysis using linear models and diagnostics. Bioinformatics, 24: 2586-2591.
33: Pop, M. and S.L. Salzberg, 2008. Bioinformatics challenges of new sequencing technology. Trends Genet., 24: 142-149.
34: Rabiner, L., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77: 257-286.
35: Sur, S., M. Bhattacharya, A.K. Bothra, L.S. Tisa and A. Sen, 2008. Bioinformatic analysis of codon usage patterns in a free-living diazotroph, Azotobacter vinelandii. Biotechnology, 7: 242-249.
36: Tran, D.H., K. Satou and T.B. Ho, 2008. Finding microRNA regulatory modules in human genome using rule induction. BMC Bioinformatics, Vol. 9.
37: Tran, T.T., F. Zhou, S. Marshburn, M. Stead, S.R. Kushner and Y. Xu, 2009. De novo computational prediction of non-coding RNA genes in prokaryotic genomes. Bioinformatics, 25: 2897-2905.
38: Wang, X., D. Wu, S. Zheng, J. Sun, L. Tao, Y. Li and Z. Cao, 2008. Ab-origin: An enhanced tool to identify the sourcing gene segments in germline for rearranged antibodies. BMC Bioinformatics, Vol. 9.
39: Yang, J.Y. and M.Q. Yang, 2008. Identification of intrinsically unstructured proteins using hierarchical classifier. Int. J. Data Mining Bioinformatics, 2: 121-133.
40: Yoon, B.J. and P.P. Vaidyanathan, 2004. HMM with auxiliary memory: A new tool for modeling RNA structures. Asilomar Conf. Signals Syst. Comput., 2: 1651-1655.
41: Pedersen, J.S. and J. Hein, 2003. Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics, 19: 219-227.
42: Bhardwaj, L.M., D. Garg and S.C. Saxena, 2007. NCM-2 for normalizing blast values for simple regions. J. Bioinfor. Trends, 2: 29-36.