American Journal of Applied Sciences

Year: 2009  |  Volume: 6  |  Issue: 2  |  Page No.: 290 - 295

Robust Speech Recognition Using Fusion Techniques and Adaptive Filtering

S.A.R. Al-Haddad, S.A. Samad, A. Hussain, K.A. Ishak and A.O.A. Noor


This study proposes an algorithm that performs noise cancellation using recursive least squares (RLS) and pattern recognition using a fusion of Dynamic Time Warping (DTW) and Hidden Markov Model (HMM). Speech signals are often corrupted by background noise, and their characteristics can change quickly. These issues are especially important for robust speech recognition. The algorithm is tested on speech samples that are part of a Malay corpus. It is shown that the fusion technique can be used to fuse the pattern recognition outputs of DTW and HMM. Furthermore, refinement normalization was introduced by using a weight mean vector to obtain better performance. An accuracy of 94% on pattern recognition was obtained using the fusion of HMM and DTW, compared to 80.5% using DTW and 90.7% using HMM separately. The accuracy of the proposed algorithm is increased further to 98% by utilizing RLS adaptive noise cancellation.

The flow of the proposed system is shown in Fig. 1. In this paper, Mel-Frequency Cepstral Coefficient (MFCC) is chosen as the feature because of the sensitivity of the low-order cepstral coefficients to overall spectral slope and the sensitivity properties of the high-order cepstral coefficients[13].

Fig. 1: Flowchart for fusion of Dynamic Time Warping (DTW) and Hidden Markov Model (HMM) with RLS noise canceller

Fig. 2: Adaptive noise canceling

WAV files were recorded from 60 speakers. Each speaker says the Malay digits KOSONG, SATU, DUA, TIGA, EMPAT, LIMA, ENAM, TUJUH, LAPAN and SEMBILAN, with a one-second pause after each number.

The RLS was used in preprocessing for noise cancellation as shown in Fig. 2[14]. The explanation for Fig. 2 is as follows:

n = Background noise of any type
= Noise correlated to n
s = Speech signal
d = Desired signal
W = Optimum filter weight matrix
y = Output of adaptive process
e = Error signal in ideal case (clean speech)
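The adaptive noise canceller of Fig. 2 can be illustrated with a minimal RLS sketch. It is not the authors' implementation: the function name, the filter order, the forgetting factor `lam`, the initialisation constant `delta` and the synthetic test signal are all assumptions chosen for illustration, and `ref` stands for the reference noise correlated to n. The filter output y estimates the noise in d, so the error e approximates the clean speech s.

```python
import math
import random


def rls_noise_canceller(d, ref, order=4, lam=0.99, delta=100.0):
    """Return e = d - y, where y is the RLS estimate of the noise in d
    predicted from the reference noise ref (all parameters are assumed)."""
    w = [0.0] * order  # adaptive filter weights W
    # Inverse correlation matrix, initialised to delta * I
    P = [[delta if i == j else 0.0 for j in range(order)] for i in range(order)]
    x = [0.0] * order  # most recent reference samples, newest first
    e_out = []
    for d_k, r_k in zip(d, ref):
        x = [r_k] + x[:-1]  # shift in the new reference sample
        Px = [sum(P[i][j] * x[j] for j in range(order)) for i in range(order)]
        denom = lam + sum(x[i] * Px[i] for i in range(order))
        k = [v / denom for v in Px]  # gain vector
        y = sum(w[i] * x[i] for i in range(order))  # noise estimate
        e = d_k - y  # a priori error, i.e. the speech estimate
        w = [w[i] + k[i] * e for i in range(order)]
        # P <- (P - k x^T P) / lam; x^T P equals Px^T because P is symmetric
        P = [[(P[i][j] - k[i] * Px[j]) / lam for j in range(order)]
             for i in range(order)]
        e_out.append(e)
    return e_out


# Synthetic demonstration: a sine "speech" signal plus filtered reference noise.
random.seed(0)
num = 2000
s = [math.sin(2 * math.pi * 0.05 * t) for t in range(num)]
ref = [random.gauss(0.0, 1.0) for _ in range(num)]
n = [0.8 * ref[t] + (0.3 * ref[t - 1] if t > 0 else 0.0) for t in range(num)]
d = [s[t] + n[t] for t in range(num)]
e = rls_noise_canceller(d, ref)
```

After the weights converge, `e` should track `s` much more closely than `d` does. A forgetting factor below 1 lets the filter follow noise whose characteristics change over time, at the cost of a slightly noisier estimate.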

Figure 3 shows the results of applying RLS adaptive filtering to the noisy signal. Figure 3a shows the amplitude of the noisy speech and Fig. 3b shows the amplitude after processing with RLS.

Fig. 3: (a): Noisy speech, (b): Signal processed by adaptive noise canceller

After the noisy speech sample has been filtered, the first process is endpoint detection. For detection, two basic parameters are used: Zero Crossing Rate (ZCR) and short-time energy. The energy parameter has been used in endpoint detection since the 1970s[15]. When it is combined with the ZCR, the speech detection process can be made very accurate[16].

To label the segmented speech frames, the zero crossing rate and energy were applied to each frame. Unfortunately, the frames still contained some level of background noise, because the energy of breath and surrounding sounds can quite easily be confused with the energy of a fricative sound[17].

As a result, this algorithm performs almost perfect segmentation for voices recorded by male speakers. For recordings made in noisy places, segmentation problems occur because in some cases background noise causes the algorithm to produce different values. The cut-off for silence then has to be raised, since the parameters may not be quite zero when noise is interpreted as speech. For clean speech, on the other hand, both the zero crossing rate and the short-time energy should be zero in silent regions.
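A minimal sketch of frame-based endpoint detection with these two parameters follows. The frame length, both thresholds and the decision rule (a frame counts as speech if it is loud, or if it has a high ZCR with at least a little energy, as a fricative would) are assumptions for illustration, not values from the paper.

```python
import math


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def detect_endpoints(signal, frame_len=200, energy_thr=0.01, zcr_thr=0.3):
    """Return (start, end) sample indices of the speech region, or None.

    Thresholds are hypothetical; in noisy recordings they would have to be
    raised above the background level.
    """
    speech_starts = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = short_time_energy(frame)
        zcr = zero_crossing_rate(frame)
        loud = energy > energy_thr
        fricative_like = zcr > zcr_thr and energy > 0.1 * energy_thr
        if loud or fricative_like:
            speech_starts.append(start)
    if not speech_starts:
        return None
    return speech_starts[0], speech_starts[-1] + frame_len


# Silence, a tone burst standing in for a spoken digit, then silence again.
burst = [0.5 * math.sin(2 * math.pi * 0.05 * t) for t in range(1000)]
signal = [0.0] * 1000 + burst + [0.0] * 1000
endpoints = detect_endpoints(signal)
```

With perfectly silent padding, both parameters are zero outside the burst, which matches the clean-speech case described above.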

Feature extraction: Mel Frequency Cepstral Coefficients (MFCC) are chosen because of the sensitivity of the low-order cepstral coefficients to overall spectral slope and the sensitivity properties of the high-order cepstral coefficients[18]. MFCC is currently the most popular feature extraction method[18,19]. The MFCC is produced after the recorded signal is pre-emphasized, framed and Hamming windowed. The signal is then normalized and lowpass filtered. The lowpass filter removes potential artificial high frequencies that appear in the modulation spectrum due to transmission errors.

The Hamming window was calculated after getting the results from the endpoint process. The equation used is as follows:


where αw is equal to 0.54 and βw normalizes the energy through the operation so that the energy of the signal does not change. To obtain the desired frequency resolution on a Mel scale in front-end processing, the simple Fourier Transform (FT) is used. The average spectral magnitude for each amplitude coefficient is calculated as:


where N denotes the number of samples used to obtain the average value, wFB(n) denotes the weighting function and |s(f)| denotes the magnitude of the frequency component computed by the Fourier transform.
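The windowing and averaging steps can be sketched as follows. The sketch assumes the generalized Hamming form w(n) = βw[αw − (1 − αw)cos(2πn/(L − 1))] for a frame of L samples, and a weighted average of N spectral magnitudes divided by N. Both forms, the choice of βw that gives the window unit mean-square value, and all function names are assumptions, since the original equations are not reproduced here.

```python
import math


def hamming_window(length, alpha=0.54, beta=1.0):
    """Generalized Hamming window (assumed form), scaled by beta."""
    return [beta * (alpha - (1 - alpha) * math.cos(2 * math.pi * n / (length - 1)))
            for n in range(length)]


def energy_normalised_hamming(length, alpha=0.54):
    """Choose beta so that the window has unit mean-square value, so that
    windowing leaves the average frame energy roughly unchanged (assumed
    interpretation of the normalization)."""
    raw = hamming_window(length, alpha)
    beta = math.sqrt(length / sum(v * v for v in raw))
    return [beta * v for v in raw]


def average_spectral_magnitude(magnitudes, weights):
    """Weighted average of N spectral magnitudes |s(f)| with weights wFB(n)."""
    n_samples = len(weights)
    return sum(w * m for w, m in zip(weights, magnitudes)) / n_samples


window = energy_normalised_hamming(256)
frame = [math.sin(2 * math.pi * 0.05 * t) for t in range(256)]
windowed = [w * x for w, x in zip(window, frame)]
```

A triangular `weights` list centred on a Mel-spaced frequency would turn `average_spectral_magnitude` into one channel of a Mel filter bank.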

The cepstral coefficient is computed to minimize the non-information bearing variability from that amplitude via the following calculations:


where Savg denotes the average signal value in the kth channel.
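As an illustration of this step, the sketch below computes cepstral coefficients from a list of average channel values using the widely used cosine-transform form c(n) = Σk log(Savg(k)) cos(n(k − 1/2)π/K) over K channels. This particular form, the natural logarithm and the function name are assumptions; the paper's exact expression may differ.

```python
import math


def cepstral_coefficients(s_avg, n_coeffs):
    """Cosine transform of the log channel averages (assumed DCT-II form).

    s_avg    : positive average signal values, one per channel
    n_coeffs : number of cepstral coefficients to return (c0 first)
    """
    num_channels = len(s_avg)
    log_s = [math.log(v) for v in s_avg]
    return [
        sum(log_s[k] * math.cos(n * (k + 0.5) * math.pi / num_channels)
            for k in range(num_channels))
        for n in range(n_coeffs)
    ]


# Hypothetical channel averages for one frame.
example = cepstral_coefficients([1.2, 3.4, 2.5, 0.9, 0.4, 0.3, 0.2, 0.1], 4)
```

Taking the logarithm before the transform turns an overall gain change into a shift of c0 only, which is one way the variability that carries no information is reduced.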

Dynamic time warping (DTW): DTW is one of the main recognition algorithms in this system, alongside HMM. Because speech varies widely between different instances, even from the same speaker, some type of non-linear time warping must be applied before two speech instances are compared. DTW is the preferred method for this: the principles of dynamic programming are applied to align the speech signals optimally. DTW has also been used to calculate a more robust distance for time series data when detecting similar shapes with different phases, and it can measure similarity between sequences of different lengths. Because of these advantages, many researchers use DTW for tasks such as generic analysis and mining of time series data, voice recognition and signature verification[20]. The distance metric used is the Euclidean distance between the cepstral coefficients over all frames, after DTW is applied to align the frames optimally. The distance between frame i of the test word TMFCC and frame j of the reference word RMFCC is calculated as:


This DTW algorithm has been tested with 80.5% correctness[21]. For this fusion system, however, the distance is calculated as:


for the purpose of processing one digit at a time. This distance will be used by decision fusion to process the weight mean vector for one digit.
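A minimal DTW sketch using the Euclidean distance between MFCC frames is given below. It uses the standard dynamic-programming recursion with three allowed predecessor cells; the step pattern, the nearest-template rule and all names are assumptions, and the modified distance used by the fusion system is not reproduced.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two cepstral feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(test, ref):
    """Accumulated distance along the optimal warping path between two
    sequences of feature vectors, which may have different lengths."""
    n, m = len(test), len(ref)
    inf = float("inf")
    # D[i][j] = cost of the best alignment of test[:i] with ref[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_distance(test[i - 1], ref[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


def nearest_template(test, templates):
    """Return the label of the reference template closest to the test word."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label]))


# Toy one-dimensional "features"; real inputs would be MFCC vectors per frame.
templates = {"SATU": [[0.0], [1.0], [2.0]], "DUA": [[2.0], [1.0], [0.0]]}
label = nearest_template([[0.0], [1.0], [1.0], [2.0]], templates)
```

The test word here is a time-stretched copy of the first template, so warping aligns it with zero accumulated distance even though the two sequences differ in length.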

Hidden Markov Model (HMM): An HMM is typically an interconnected group of states, each of which is assumed to emit a new feature vector for each frame according to an emission probability density function associated with that state. The Viterbi algorithm is the most suitable for estimating the HMM parameters under the maximum likelihood criterion[22]. The HMM is defined as λ = (A, B, π), where A denotes the state transition probability matrix, B denotes the output probability matrix and π denotes the initial state probability. p(o|λ) is the probability of the multidimensional observation sequence o, known as the feature vectors, given the model λ.

For word-level HMM, the recognizer computes and compares all the p(o|λv), where v = 1,2,…,W and W is the number of digit word models. For left-to-right HMMs, p(o|λv) is computed using the Log-Viterbi algorithm as follows[23]:

for initialization,


for recursion,


and for termination,


The symbols used in the algorithm are:

N = Number of states
T = Number of frames for feature vectors o = [o1, o2,…,oT]
aij = State transition probability from state i to state j
A = {aij} is the N-by-N state transition matrix
B = {logbj (ot)} is an N-by-T matrix of log output probabilities
δt (j) = Likelihood value at the time index t and state j
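The textbook form of the Log-Viterbi recursion can be sketched with these symbols as follows. It assumes initialization δ1(j) = log πj + log bj(o1), recursion δt(j) = max over i of [δt−1(i) + log aij] + log bj(ot), and termination by taking the maximum of δT(j). The paper's left-to-right variant may constrain the start and end states differently, and the function names and toy model are hypothetical.

```python
import math


def safe_log(p):
    """Natural log that maps zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def log_viterbi(log_pi, log_A, log_B):
    """Log-likelihood of the best state path.

    log_pi : length-N list of log initial state probabilities
    log_A  : N-by-N list of log transition probabilities log aij
    log_B  : N-by-T list of log output probabilities log bj(ot)
    """
    num_states = len(log_pi)
    num_frames = len(log_B[0])
    # Initialization (t = 1)
    delta = [log_pi[j] + log_B[j][0] for j in range(num_states)]
    # Recursion (t = 2..T)
    for t in range(1, num_frames):
        delta = [
            max(delta[i] + log_A[i][j] for i in range(num_states)) + log_B[j][t]
            for j in range(num_states)
        ]
    # Termination
    return max(delta)


# Toy two-state left-to-right model observed over two frames.
log_pi = [safe_log(1.0), safe_log(0.0)]
log_A = [[safe_log(0.5), safe_log(0.5)], [safe_log(0.0), safe_log(1.0)]]
log_B = [[safe_log(0.9), safe_log(0.1)], [safe_log(0.2), safe_log(0.8)]]
score = log_viterbi(log_pi, log_A, log_B)
```

A recognizer would evaluate this score once per word model λv and pick the model with the largest value. Working in the log domain replaces products of small probabilities with sums, which avoids numerical underflow on long utterances.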

Fusion HMM and DTW: The pattern recognition fusion method used to fuse the results of DTW and HMM is the weight mean vector. DTW measures the distance between the recorded speech and a template, expanding or shrinking the temporal axis of the target to find the path, or warping function, that maximizes the similarity between the two speech signals. The distance between the signals is computed at each instant along the warping function. Meanwhile, HMM trains clusters and iteratively moves between clusters based on their likelihoods given by the various models. The weight mean vector equation used is as follows:


which expands to,



w1 = Query recognition rate in HMM test phase
w2 = Query recognition rate in DTW test phase
xn = Real time value of recorded speeches
= Weight mean vector
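The weight mean defined by these symbols can be sketched as a weighted average, Σ wn xn / Σ wn. This form and the numbers below are assumptions for illustration only: the weights reuse the overall HMM and DTW accuracies reported in this paper, and the two scores are hypothetical values for a single query.

```python
def weighted_mean(weights, values):
    """Weighted mean of the values, with one weight per value."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total


w1, w2 = 0.907, 0.805   # assumed weights: HMM and DTW recognition rates
x1, x2 = 0.90, 0.60     # hypothetical per-query scores from HMM and DTW
fused = weighted_mean([w1, w2], [x1, x2])
```

Because the weights are recognition rates, the recognizer that performed better in the test phase pulls the fused score further towards its own output.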

For example, if the recognition percentage for one digit is h for HMM and d for DTW, then in the fusion model, after the query has been recognized by DTW and HMM individually, the final percentage is calculated as follows:


Results and Discussion

We have evaluated the algorithm using the data described in the methodology section. The recognition algorithms HMM, DTW and DTW-HMM pattern recognition fusion are then tested for their percentage accuracy. The test is limited to the Malay digits from 0-9. Digits are uttered in random order and the accuracy over 100 samples is analyzed. The accuracy test gives about 80.5% for DTW, 90.7% for HMM and 94% for pattern recognition fusion. The results obtained are shown in Table 1.

Table 1: Comparison of digit recognition accuracy without using noise canceller.

Table 2: Comparison digit recognition accuracy with noise canceller

For robustness, the speech is first filtered using RLS noise cancellation; the results obtained are shown in Table 2. Noise cancellation increases the accuracy for HMM, DTW and Fusion to 94.2, 91.4 and 98.1%, respectively.


This research has shown a speech recognition algorithm that uses MFCC vectors to provide an estimate of the vocal tract filter. DTW and HMM are the two recognition algorithms used. DTW is used to detect the nearest recorded voice, while HMM is used to emit a new feature vector for each frame according to an emission probability density function associated with each state. The results show a promising speech recognition module, as tested on a Malay digit database. This paper has shown that the fusion technique can be used to fuse the pattern recognition outputs of DTW and HMM. Furthermore, it introduced refinement normalization by using a weight mean vector to get better performance, with an accuracy of 94% for the pattern recognition fusion of HMM and DTW. This can be compared to the accuracies for DTW and HMM, which are 80.5 and 90.7%, respectively. The accuracy is further increased to 98.1% for the fusion technique after RLS noise cancellation.


This research is supported by the following research grants: Fundamental Research Grant Scheme, Malaysian Ministry of Higher Education, FRGS UKM-KK-02-FRGS-0036-2006 and E-Science 01-01-02-SF0374, Malaysian Ministry of Science, Technology and Innovation.
