INTRODUCTION
As proteomics projects evolve, automated annotation of protein sequences becomes increasingly important. One of the difficult tasks is modelling protein families. Several groups apply probabilistic modelling using HMMs, neural networks, comparative modelling, Bayesian networks and more. HMMs, however, have gained particular respect because of the strong results they achieved in speech recognition. The two central lines of work in this field are to integrate models or to improve them. In this paper, we improve the performance of the profile HMM in the application of protein family modelling by applying a new training technique, conditional maximum likelihood (CML), instead of the traditional one, maximum likelihood (ML).
Protein family: A protein family is a group of evolutionarily related protein sequences. Although the sequences diverge at the level of primary structure, they all share the same general function and a common three-dimensional structure. A family can also be considered as a large collection of multiple sequence alignments. Figure 1 shows an example of a protein family (L-asparaginases).
Hidden Markov models are usually estimated by maximum likelihood (ML) and this is also true for the HMM-based protein modellers reported previously (Anders et al., 1993). The standard ML method optimizes the probability of the observed sequences (i.e., the protein sequences in a family), but for prediction (i.e., the decoding process) one would rather maximize the probability of the correct prediction. In this study a method is described for estimating an HMM from labelled sequences that maximizes the probability of the correct labelling instead of the observed sequences only; this substantially improves the performance of protein modelling with HMMs. The method is an integral part of our HMM protein modeller for protein sequence prediction. Here we focus on the description of the method; the results of HMM protein modelling will be used to illustrate its power.
Profile HMM (PHMM): An HMM can be considered as a structure that generates protein sequences by a random process. This structure and the corresponding random process can be described as follows. The HMM contains a sequence of M match states, corresponding to positions in proteins or columns in a multiple alignment. Each match state m_{k} (k = 1,..., M) can emit a letter x (an amino acid) from the 20 amino acids according to the probability distribution p(x/m_{k}) (called the emission probability E_{m_k}(x)); each match state has its own distinct emission distribution. For each match state m_{k} there is a delete state d_{k} that emits no amino acid (a dummy state used to skip m_{k}) and there are a total of M + 1 insert states to either side of the match states, which emit amino acids in exactly the same way as the match states but using the probability distributions p(x/i_{k}) at state i_{k}. For convenience, there are dummy BEGIN and END states, denoted m_{0} and m_{M+1}, respectively.

Fig. 1: Sequence alignment of the type II L-asparaginases. NCBI accession numbers for the L-asparaginases are in brackets: seq1 E. carotovora (AY560097); seq2 E. chrysanthemi (CAA32884); seq3 Wolinella succinogenes (P50286); seq4 Pseudomonas 7A (AAB31750); seq5 Escherichia coli K12 (NP417432); seq6 Pombe (NP590286)
For each state, there are three possible transitions to other states: transitions into match or delete states always move forward in the model, whereas transitions into insert states do not. Note that multiple insertions between match states can occur, since the self-loop on an insert state allows transitions from the state to itself. The transition probability from state q to state r is called Γ(r/q), or a_{qr} for short.
A protein sequence can be generated by a random walk through the model as follows: commencing at state m_{0}, choose a transition to m_{1}, i_{0} or d_{1} randomly according to the transition probabilities Γ(m_{1}/m_{0}), Γ(i_{0}/m_{0}) and Γ(d_{1}/m_{0}). If m_{1} is chosen, emit the first amino acid x_{1} from the probability distribution p(x_{1}/m_{1}) and choose a transition to the next state according to the transition probabilities Γ(*/m_{1}), where * denotes any possible next state. If the next state is the insert state i_{1}, generate amino acid x_{2} from the emission probabilities p(x_{2}/i_{1}) and select the next state from Γ(*/i_{1}); if the next state is the delete state d_{2}, emit no amino acid and choose the next state from the transition probabilities Γ(*/d_{2}). Continue in this manner all the way to the END state (Fig. 3) to generate the sequence of amino acids (x_{1}, x_{2},..., x_{L}) along the path of states (s = q_{0}, q_{1},..., q_{N+1}). Because the delete states emit no amino acids, N ≥ L. The probability of the event that the path q_{0}, q_{1},..., q_{N+1} is taken and the sequence x_{1}, x_{2},..., x_{L} is generated can be obtained using Eq. 1.
Where N is the number of states in the path; we set p(x_{i}/q_{i}) = 1 if q_{i} is a delete state. The probability of any sequence x_{1}, x_{2},..., x_{L} of amino acids is the sum over all possible paths that could generate that sequence, given by Eq. 2.
In this way, a probability distribution over the space of sequences is defined. The goal is to find a model (i.e., a model length and probability parameters) that accurately describes a family of proteins by assigning large probabilities to sequences in that family.
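The random walk described above can be sketched in a few lines of Python. This is a minimal sketch only: the state names follow the notation above, but the model is a toy with two match states and every probability below is invented for illustration, not taken from a trained profile HMM.

```python
import random

random.seed(0)

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Toy profile HMM with M = 2 match states. States: m0 (BEGIN), m1, m2,
# m3 (END), inserts i0..i2 and deletes d1, d2. All numbers are made up.
transitions = {
    "m0": {"m1": 0.8, "i0": 0.1, "d1": 0.1},
    "i0": {"m1": 0.7, "i0": 0.2, "d1": 0.1},
    "m1": {"m2": 0.8, "i1": 0.1, "d2": 0.1},
    "i1": {"m2": 0.7, "i1": 0.2, "d2": 0.1},
    "d1": {"m2": 0.8, "i1": 0.1, "d2": 0.1},
    "m2": {"m3": 0.9, "i2": 0.1},
    "i2": {"m3": 0.8, "i2": 0.2},
    "d2": {"m3": 0.9, "i2": 0.1},
}

# Uniform emissions for simplicity; a real model has a distinct
# distribution p(x/m_k) per match state and p(x/i_k) per insert state.
emissions = {s: {x: 1.0 / len(AMINO_ACIDS) for x in AMINO_ACIDS}
             for s in ("m1", "m2", "i0", "i1", "i2")}

def choose(dist):
    """Draw one key from a {key: probability} distribution."""
    r, acc = random.random(), 0.0
    for key, p in dist.items():
        acc += p
        if r < acc:
            return key
    return key  # guard against floating-point round-off

def sample_sequence():
    """Random walk from BEGIN (m0) to END (m3); delete states emit nothing."""
    state, seq, path = "m0", [], ["m0"]
    while state != "m3":
        state = choose(transitions[state])
        path.append(state)
        if state in emissions:
            seq.append(choose(emissions[state]))
    return "".join(seq), path

seq, path = sample_sequence()
print(seq, path)
```

Note how the self-loops on the insert states allow runs of insertions of any length, which is why the path length N can exceed the emitted sequence length L.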
Estimating the parameters of a profile HMM from training sequences: The most traditional approach is to estimate the parameters entirely automatically from a set of unaligned primary sequences using the EM algorithm (i.e., a learning process) (Anders et al., 1993). This approach can in principle find the model that best describes a given set of sequences. So, given a set of training sequences s_{1}, s_{2},..., s_{n}, one can see how well a model fits them by calculating the probability that it generates them. This probability is simply the product of terms of the form given by Eq. 2, which yields Eq. 3:
Where each term p(s^{(a)}/Model) is calculated by substituting x_{1}, x_{2},..., x_{L} with s^{(a)} in Eq. 2. This quantity is called the likelihood of the model; one would like the value in Eq. 3 to be high, which is the maximum likelihood (ML) criterion. There are algorithms that, from an arbitrary starting point, find a local maximum by iteratively re-estimating the model in such a way that the likelihood increases in each iteration. The most common is the Baum-Welch (forward-backward) algorithm (Rabiner, 1989), which is a version of the general EM algorithm.
Baum-Welch algorithm: This is the standard algorithm for estimating the HMM parameters when the paths through the model for each training sequence are unknown. The algorithm has two main steps: in the first step, the expected number of times each transition and emission occurs in the training sequences is computed (expectation step); in the second step, the transition and emission probabilities are updated using re-estimation formulas. The main steps of the Baum-Welch algorithm can be briefly described as follows:
• Shuffle the training protein sequences to avoid hill climbing,
• Initialize the transition and emission probabilities,
• Apply the forward algorithm to each sequence to obtain the forward probability values,
• Apply the backward algorithm to each sequence to obtain the backward probability values,
• Use the forward and backward values to calculate the transition and emission counts (expected values), the expectation step,
• Use the transition and emission counts to calculate the transition and emission probabilities (θ),
• Calculate the target function log p(training sequences/θ), where the training sequences are s_{1}, s_{2},..., s_{n},
• Repeat the preceding steps iteratively and after each iteration replace the old parameters θ_{i-1} by the new θ_{i}.
The iteration is terminated when the target function values converge, not when the θ values converge (Rabiner, 1989).
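The target function in the steps above, log p(training sequences/θ), is computed with the forward algorithm. The following sketch shows the scaled forward recursion for a toy two-state HMM; all parameters are invented purely to keep the recursion visible and are not those of our protein model.

```python
import math

# Toy 2-state HMM over a 3-letter alphabet (illustrative numbers only).
states = [0, 1]
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [{"A": 0.7, "B": 0.2, "C": 0.1},
        {"A": 0.1, "B": 0.3, "C": 0.6}]

def forward_loglik(seq):
    """Scaled forward algorithm: returns log p(seq | theta).

    Rescaling alpha at each position avoids numerical underflow on long
    sequences; the log of the scaling factors sums to the log-likelihood.
    """
    alpha = [init[s] * emit[s][seq[0]] for s in states]
    scale = sum(alpha)
    log_lik = math.log(scale)
    alpha = [a / scale for a in alpha]
    for x in seq[1:]:
        alpha = [sum(alpha[q] * trans[q][r] for q in states) * emit[r][x]
                 for r in states]
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik

print(forward_loglik("AABCC"))
```

Summing this value over all training sequences gives the target function monitored at each Baum-Welch iteration; the backward recursion is the mirror image and together they yield the expected transition and emission counts of the expectation step.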
Class HMM: The protein modelling problem is part of a general class of problems in which a class label is associated with each observation or sequence. In protein modelling the observations are amino acids and the class labels are, for instance, alpha-helix, beta-sheet, loop and the rest. The traditional approach to this problem is to estimate models for the sequences representing each class separately and later merge them into one large model. However, in the HMM framework it is possible to estimate the whole combined model directly from the labelled sequences if the states of the model are labelled; such a model is called a Class Hidden Markov Model (CHMM) (Anders, 1994). In our work, we label protein sequences as follows: a for alpha-helix, b for beta-sheet, l for coil and c for the rest, as shown in Fig. 2.
We limit our work to one label for each state, which is probably the most useful approach in protein modelling. This means that:
• Some states are labelled with a for alpha-helix,
• Some states are labelled with b for beta-sheet,
• Some states are labelled with l for coil and
• Some states are labelled with c for the rest.
and the emission probability of the label at each state will be one or zero. Therefore, the emission probability of a symbol (any amino acid) together with its label (e.g., a) will be:
The result is analogous for states with the other labels (b, c or l).
The aim of the CHMM is that decoding a sequence of amino acids yields the correct labels. That is, the objective of training is to optimize the probability of the labelling, [Probability (label sequence/amino acid sequence, model)], rather than the probability of the amino acid sequence, [Probability (amino acid sequence/model)], which is what profile HMM training optimizes.
A CHMM has two modes: the decoding mode, in which an unlabelled sequence of amino acids has to be labelled and the model works exactly like a profile HMM and the training mode, in which the labels of the amino acid sequences are taken into account.

Fig. 2: A segment of one amino acid sequence in the family and its label sequence
One should notice that in a profile HMM a state emits a symbol with a probability distribution over the amino acids, but in a CHMM it emits a symbol together with an associated label, as in Fig. 2.
To estimate the parameters of a CHMM, a set of labelled sequences is needed. Let us call a sequence X = x_{1}, x_{2},..., x_{l} and the corresponding label sequence Y = y_{1}, y_{2},..., y_{l}. For estimation of the parameters by standard ML, if the model parameters are called θ and we assume there is only one sequence in the training set, the ML solution is:
The probability p(x, y/θ) can be calculated by a trivial modification of the standard forward algorithm (Anders, 1994) in which only the valid paths through the model are allowed. A valid path is one in which the state labels agree with the observed labels. For instance, a part of the sequence labelled a for alpha-helix can only match a state also labelled a. The same holds for the backward algorithm and thus the forward-backward (Baum-Welch) re-estimation procedure can be used to maximize Eq. 5.
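The label-constrained forward computation can be sketched as follows. This is a toy two-state CHMM with invented parameters; the only change from the standard forward algorithm is that a state's emission term is set to zero wherever its label disagrees with the observed label, so invalid paths contribute nothing to the sum.

```python
# Toy CHMM: state 0 labelled "a" (alpha-helix), state 1 labelled "b"
# (beta-sheet). All numbers are illustrative only.
labels = ["a", "b"]
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [{"A": 0.7, "G": 0.3}, {"A": 0.2, "G": 0.8}]

def labeled_forward(seq, lab):
    """p(x, y | theta): forward recursion restricted to valid paths.

    A state contributes nothing at positions where its label disagrees
    with the observed label, so only label-consistent paths survive.
    """
    def e(s, x, y):
        return emit[s][x] if labels[s] == y else 0.0

    n = len(labels)
    alpha = [init[s] * e(s, seq[0], lab[0]) for s in range(n)]
    for x, y in zip(seq[1:], lab[1:]):
        alpha = [sum(alpha[q] * trans[q][r] for q in range(n)) * e(r, x, y)
                 for r in range(n)]
    return sum(alpha)

print(labeled_forward("AG", "ab"))
```

The backward recursion is restricted in exactly the same way, which is why the unmodified Baum-Welch update formulas can still be applied to maximize Eq. 5.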
The model obtained in this way is in most cases identical to a model combined from submodels estimated from each class independently, which is the usual approach; the main advantage is that the combined model can be estimated in one go from scratch. Another advantage is that the transitions between the submodels are also estimated automatically in the CHMM; on the other hand, estimating the several transitions between submodels for the individual classes is not straightforward when the submodels are trained independently.
In the estimated CHMM, the labels for a sequence X = x_{1}, x_{2},..., x_{l} can be predicted by using the Viterbi algorithm (Rabiner, 1989) to find the most probable state sequence satisfying the following equation:
then by reading off the label of each state in the path, we obtain the sequence labels.
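Viterbi decoding followed by reading off the state labels might look like the following sketch. The model is a toy with three labelled states and invented parameters; the dynamic program is run in log space, as is usual, to avoid underflow.

```python
import math

# Toy labelled model: two states labelled "a", one labelled "b".
# All parameters are illustrative only.
labels = ["a", "a", "b"]
init = [0.6, 0.2, 0.2]
trans = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
emit = [{"A": 0.8, "G": 0.2}, {"A": 0.5, "G": 0.5}, {"A": 0.1, "G": 0.9}]

def viterbi_labels(seq):
    """Most probable state path (log space), then strip the state labels."""
    n = len(labels)
    delta = [math.log(init[s]) + math.log(emit[s][seq[0]]) for s in range(n)]
    backptrs = []
    for x in seq[1:]:
        ptr, new = [], []
        for r in range(n):
            best = max(range(n), key=lambda q: delta[q] + math.log(trans[q][r]))
            ptr.append(best)
            new.append(delta[best] + math.log(trans[best][r])
                       + math.log(emit[r][x]))
        delta = new
        backptrs.append(ptr)
    # Trace back from the best final state, then read off the labels.
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return "".join(labels[s] for s in path)

print(viterbi_labels("AAGG"))
```

The returned string is the predicted label sequence; in the full model the same traceback over match, insert and delete states yields both the alignment and the labelling.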
Conditional maximum likelihood estimation: In the ML approach, the probability that a submodel emits the protein sequences in a certain class (e.g., alpha-helix) is maximized. Even after maximizing this probability, it may well happen that these sequences have a higher probability under a model estimated for a different class (e.g., beta-sheet). ML is not discriminative: the probability of the observed protein sequence under the model is not the relevant quantity for prediction; one really only cares about getting the predicted labelling correct. Therefore, maximizing the probability of the labelling instead seems more appropriate; this is called conditional maximum likelihood (CML) (Rabiner, 1991). Hence, instead of Eq. 5, the CML estimate is:
This probability can be rewritten as:
The numerator is the same probability as in Eq. 5, which can be calculated as described above. The denominator is even simpler: it is the probability of the unlabelled protein sequence, i.e., the probability usually calculated in an HMM disregarding any labelling, computed by the standard forward algorithm.
The probability p(x, y/θ) is the sum over all valid paths through the model, while p(x/θ) is the sum over all paths, valid and invalid. Therefore, maximizing Eq. 8 corresponds to bringing p(x, y/θ) as close as possible to p(x/θ), i.e., to minimizing the total probability of the invalid paths.
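Since both terms of Eq. 8 are forward-algorithm quantities, the CML objective for one sequence can be sketched as a difference of two log-probabilities: one forward pass restricted to valid paths and one unrestricted. The toy model and the name `cml_objective` below are our own, for illustration only.

```python
import math

# Toy CHMM: state 0 labelled "a", state 1 labelled "b" (invented numbers).
labels = ["a", "b"]
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [{"A": 0.7, "G": 0.3}, {"A": 0.2, "G": 0.8}]

def forward(seq, lab=None):
    """Sum over paths; if lab is given, only label-consistent paths count."""
    def e(s, x, i):
        if lab is not None and labels[s] != lab[i]:
            return 0.0
        return emit[s][x]

    n = len(labels)
    alpha = [init[s] * e(s, seq[0], 0) for s in range(n)]
    for i, x in enumerate(seq[1:], start=1):
        alpha = [sum(alpha[q] * trans[q][r] for q in range(n)) * e(r, x, i)
                 for r in range(n)]
    return sum(alpha)

def cml_objective(seq, lab):
    """log p(y | x, theta) = log p(x, y | theta) - log p(x | theta) (Eq. 8)."""
    return math.log(forward(seq, lab)) - math.log(forward(seq))

print(cml_objective("AG", "ab"))
```

The objective is always ≤ 0 and reaches 0 exactly when the invalid paths carry no probability, which is the intuition stated above; its gradient, however, is not amenable to plain Baum-Welch, which is why the extended Baum-Welch updates are needed.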
Maximizing Eq. 8 is trickier than maximizing Eq. 5, because no expectation-maximization (EM) algorithm like Baum-Welch exists for it. In this study, the extended Baum-Welch algorithm (Normandin and Morgan, 1991) is used. The CML method is obviously not limited to HMMs; it has proven successful in a neural network and HMM hybrid called the hidden neural network, or HNN (Riis and Krogh, 1997).
Simulation class HMM for protein family modelling: We constructed a model structure with 2n match states, 2n + 2 insert states and n delete states.

Fig. 3: The combined two submodels (CHMM): (1) the top submodel represents alpha-helix and (2) the bottom submodel represents beta-sheet
The model consists of two submodels, one corresponding to alpha-helix and the other to beta-sheet regions. We assigned n match states to class label a, which represents alpha-helix and n match states to class label b, which represents beta-sheet; the same applies to the 2n + 2 insert states. The two submodels share the n delete states and the BEGIN and END states, as shown in Fig. 3.
Notice that the model length is n because the length of the sequences in the multiple alignment is n, or the average length of the unaligned sequences is n. This model is oversimplified; to be scientifically useful it should also include match and insert states corresponding to the coils, loops and the rest.
The main characteristics of our model are the following:
• Transition probabilities in our model:
• Each match state has the following four transition probabilities:
a) Forward transition to the next match state with the same label,
b) Forward transition to the next match state with a different label,
c) Forward transition to the insert state in its layer and
d) Forward transition to the next delete state.
• Each insert state has the following four transition probabilities:
a) Forward transition to the next match state with the same label,
b) Forward transition to the next match state with a different label,
c) Forward transition to the next delete state and
d) Forward transition to itself (the self-loop).
• Each delete state has the following five transition probabilities:
a) Forward transition to the next match state with the first label,
b) Forward transition to the next match state with the second label,
c) Forward transition to the next insert state in its layer with the first label,
d) Forward transition to the next insert state in its layer with the second label,
e) Forward transition to the next delete state.
Therefore the total number of transition probabilities in the model is 21n + 1.
• At each match and insert state there are two emission probabilities: one corresponding to the amino acid, denoted by φ and the other to the class label a or b, called the membership probability and denoted by Ψ. Therefore the total emission probability for any state i will be:
We note that the membership probability for a state i with class label a is Ψ_{i}(a) = 1 and Ψ_{i}(b) = 0 and the converse for a state with class label b. Delete states have no emission probabilities.
RESULTS AND DISCUSSION
In this section, the conditional maximum likelihood estimation method is tested on the model constructed above. Our data sets are the following three protein families: L-asparaginase, Micrococcus luteus K-3-type glutaminase from A. oryzae and D-aminoacylase (Georgia and Labrou, 2005; Masuo and Yoshimune, 2004; Yoshimune et al., 2005). All of these subsets (protein families) are of approximately equal size, almost 40 amino acid sequences each. The tests were done using 3-fold cross-validation: the model was estimated on two subsets and tested on the remaining one.
The main steps of the experiment are the following:
Step 1: The model was estimated in the usual way using the Baum-Welch algorithm to satisfy ML and the performance was tested using the Viterbi algorithm; the results are shown in Fig. 4 and the second column of Table 1.
Step 2: To speed up training, starting from the ML-estimated model of the previous step, training was continued using the extended Baum-Welch algorithm to satisfy CML and the performance was again tested using the Viterbi algorithm; the results of this step are shown in Fig. 5 and the third column of Table 1.
The following ten numbers are calculated and taken into account for evaluating the performance:
• Amino acid sensitivity: The percentage of amino acids in alpha-helix regions that are correctly predicted as alpha-helix (i.e., labelled with a).
• Amino acid determination: The percentage of amino acids predicted as alpha-helix (i.e., labelled with a) that actually are alpha-helix.
• Alpha-helix sensitivity: The percentage of alpha-helix regions that are correctly predicted.
• Alpha-helix determination: The percentage of predicted alpha-helix regions that are correct.
• Missing alpha-helix: The number of real alpha-helix regions that do not overlap a predicted one.
• Wrong alpha-helix: The number of predicted alpha-helix regions that do not overlap a real one.
• Beta-sheet sensitivity: The percentage of beta-sheet regions that are correctly predicted.
• Beta-sheet determination: The percentage of predicted beta-sheet regions that are correct.
• Missing beta-sheet: The number of real beta-sheet regions that do not overlap a predicted one.
• Wrong beta-sheet: The number of predicted beta-sheet regions that do not overlap a real one.
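The per-residue and per-segment scores listed above can be sketched as follows. The function names and the toy label strings `true` and `pred` are our own; we assume here that "determination" is computed as precision and that a region counts as found when it overlaps a predicted region in at least one position.

```python
def residue_scores(true, pred, label="a"):
    """Per-residue sensitivity and determination (precision) for one label."""
    tp = sum(1 for t, p in zip(true, pred) if t == label and p == label)
    real = true.count(label)
    called = pred.count(label)
    sens = tp / real if real else 0.0
    det = tp / called if called else 0.0
    return sens, det

def segments(lab, label="a"):
    """Half-open (start, end) runs of `label` in a label string."""
    runs, start = [], None
    for i, c in enumerate(lab + "\x00"):      # sentinel closes a final run
        if c == label and start is None:
            start = i
        elif c != label and start is not None:
            runs.append((start, i))
            start = None
    return runs

def segment_scores(true, pred, label="a"):
    """Missing = real regions with no overlapping prediction; wrong = converse."""
    real = segments(true, label)
    called = segments(pred, label)

    def overlaps(seg, others):
        return any(seg[0] < e and s < seg[1] for s, e in others)

    missing = sum(1 for seg in real if not overlaps(seg, called))
    wrong = sum(1 for seg in called if not overlaps(seg, real))
    return missing, wrong

# Invented example: two helices and one sheet in the true labelling.
true = "aaaccbbbccaaa"
pred = "aaccccbbccaac"
print(residue_scores(true, pred), segment_scores(true, pred))
```

In this toy example the residue sensitivity is 4/6 while the determination is 1.0, which illustrates the caveat discussed with Table 1: determination alone can look perfect even when many residues of the class are missed.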
Table 1: Performance of our CHMM for protein modelling trained by the ML and CML methods


Fig. 4: The score of the model parameters at each iteration in the ML estimation case

Fig. 5: The score of the model parameters at each iteration in the CML estimation case
From Table 1, a significant improvement is observed, especially in the missing and wrong alpha-helix and missing and wrong beta-sheet counts, as well as in their sensitivity and determination. Note that the decrease in amino acid determination does not always mean a worse prediction: if, for example, there are 500 alpha-helices in the test sets and only a single one is predicted, correctly, then the determination is perfect even though almost everything was missed.
CONCLUSIONS
The PHMM has been used for protein family identification. In this paper we proposed a method to improve PHMM identification of protein families. The performance of the PHMM in prediction, classification and multiple sequence alignment is also improved by our method. The method uses a CHMM, which requires labelling and estimates the parameters (i.e., trains the model) to satisfy CML, which is more accurate than ML. The CML estimation is designed to optimize prediction accuracy, as clearly shown in our results, where the missing and wrong numbers of alpha-helices and beta-sheets decreased and their determination and sensitivity increased.