INTRODUCTION
Biometrics is the study of automated methods of recognizing a person based
on measurable physiological or behavioral characteristics (NSTC,
2006a). Biometric recognition systems are in demand today because they rely
on human features that are unique to each person and cannot be forged
easily, such as face, fingerprint, hand geometry, handwriting, iris, retina,
vein and voice. Speaker recognition or verification is a biometric modality
that uses an individual's voice for recognition or verification purposes.
It is a different technology from speech recognition, which recognizes words
as they are articulated (NSTC, 2006b). Speech contains
many characteristics that are specific to each individual. For this reason,
listeners are often able to recognize a speaker's identity fairly quickly
even without looking at the speaker. Speaker verification is the process of determining
whether a person is who he or she claims to be by using his or her voice (Campbell,
1997; Naik, 1990; Rabiner and
Juang, 1993; NSTC, 2006b).
Many classification techniques have been proposed for speaker verification
systems, including Dynamic Time Warping (DTW) (Al-Haddad
et al., 2008), Hidden Markov Models (HMMs) (Han et
al., 2003; Rabiner, 1989; Yoshizawa
et al., 2004), Artificial Neural Networks (ANNs), Vector Quantization
(VQ) (Linde et al., 1980; Soong
et al., 1985; Vasuki and Vanathi, 2006) and
Support Vector Machines (SVMs) (Wan and Renals, 2005).
The most popular classification technique in speaker verification is HMMs (NSTC,
2006b; Furui, 1997; Rabiner and
Juang, 1993). Recognition or verification systems based on HMMs are effective
under many circumstances, but they suffer from major limitations that restrict
their applicability. In general, the performance of any single technique
is limited. The high accuracy required by applications such as internet banking
and high-security access control can be obtained by using hybrid techniques.
This study presents a hybrid VQ and HMMs classification technique. The goal
of a hybrid VQ and HMMs speaker verification system is
to exploit the properties of both VQ and HMMs to improve flexibility
and verification performance. Other hybrid architectures found in
the literature are hybrid ANN/HMM (Trenti and Gori, 2000),
hybrid TDNN/HMM (Jang and Un, 1996), hybrid MMI-connectionist/HMM
(Neukirchen and Rigoll, 1997) and combined SDTW
and independent classification (Lichtenaur et al.,
2008). However, this study focuses on improving HMMs by hybridizing them
with VQ for speaker verification; the performance of the other hybrid techniques
is not compared because of differences in the data sets used.
In most real-world applications, speech is captured in non-ideal
situations, such as noisy environments, which may seriously reduce system performance
(Fujimoto and Ariki, 2000; Hussain
et al., 2007). Over several decades, a significant amount of research
attention has been focused on signal processing techniques that can
extract a desired speech signal and reduce the effects of unwanted noise.
Depending on the number of sensors used, these approaches can
be classified into three basic categories: temporal filtering techniques
using only a single microphone, adaptive noise cancellation utilizing a primary
sensor to pick up the noisy signal and a reference sensor to measure the noise
field, and beamforming techniques exploiting an array of sensors (Yiteng
et al., 2006).
In this study, we propose a novel approach that uses a hybrid VQ and HMMs classification technique together with adaptive noise cancellation based on LMS, NLMS and RLS adaptive filters. We also compare the performance of the adaptive filters to select the best filter for the system. The objective is to improve the performance of a speaker verification system in both clean and noisy environments. The technique is evaluated using a Malay spoken digit database for the clean environment; Gaussian white noise is added to the data to evaluate system performance in noisy environments.
SPEAKER VERIFICATION
Hidden Markov Models (HMMs): A speaker verification system consists of two phases: training and verification. In the training phase, the speaker's voice is recorded and processed to generate a model that is stored in the database. In the verification phase, the existing reference templates are compared with the unknown voice input. In this study, the HMM method is used as the training algorithm.
The most flexible and successful approach to speech recognition so far has
been HMMs. The goal of HMM parameter estimation is to maximize the likelihood
of the data under the given parameter setting. The general theory of HMMs has been
given by Rabiner and Juang (1986, 1993).
There are three basic parameters in HMMs:
• π: the initial state distribution
• a: the state-transition probability matrix
• b: the observation probability distribution
In the training phase, an HMM model for each speaker is generated. Each model
is optimized for the word it represents. For example, a model for the
Malay word "Satu" (number one) has its a, b and π parameters adjusted so
as to give the highest probability score whenever the word "Satu"
is uttered and lower scores for other words.
A training set is needed to build a model for each speaker. This training set consists of sequences of discrete symbols, such as the codebook indices obtained from the VQ stage. Here, an example is given of how HMMs are used to build models for a given training set. Assuming that N speakers are to be verified, we must first have a training set of L token words and an independent testing set. The steps in the speaker verification process are:
• First, we build an HMM model for each speaker. The L training tokens for each speaker are used to find the optimum parameters for each word model. This is done using the re-estimation formula
• Then, for each unknown speaker in the testing set, we first characterize the speech utterance as an observation sequence. This means analyzing the speech utterance to obtain feature vectors, which are then quantized using VQ. We thus obtain a sequence of symbols, with each symbol representing the speech features at each discrete time step
• We calculate the a, b and π parameters for the observation sequence using one of the speaker models in the vocabulary, then repeat for every speaker model in the database
After N models have been created, the HMM engine is ready for speaker
verification. A test observation sequence from an unknown speech utterance,
produced by vector quantization of the cepstral coefficient vectors, is evaluated
using the Viterbi algorithm. The log-Viterbi algorithm is used to avoid precision
underflow. For each speaker model, a probability score for the unknown observation
sequence is computed. The speaker whose model produces the highest probability
score and matches the claimed ID is then selected as the client speaker.
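As a concrete illustration, the following is a minimal numpy sketch of log-domain Viterbi scoring against a set of speaker models; the model sizes, the random toy parameters and the symbol sequence are illustrative assumptions rather than values from the experiments.

    import numpy as np

    def log_viterbi_score(obs, log_pi, log_A, log_B):
        """Best-path log-probability of a discrete observation sequence."""
        delta = log_pi + log_B[:, obs[0]]              # initialization
        for o in obs[1:]:                              # recursion over time
            delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o]
        return float(np.max(delta))                    # termination

    # Toy illustration: score a VQ symbol sequence against each speaker
    # model; the claim is accepted only if the claimed model scores best.
    rng = np.random.default_rng(0)

    def random_model(n_states=3, n_symbols=8):
        pi = rng.dirichlet(np.ones(n_states))
        A = rng.dirichlet(np.ones(n_states), size=n_states)
        B = rng.dirichlet(np.ones(n_symbols), size=n_states)
        return np.log(pi), np.log(A), np.log(B)

    models = {name: random_model() for name in ("spk1", "spk2", "spk3")}
    obs = [0, 3, 2, 7, 1, 4]                           # made-up VQ indices
    scores = {n: log_viterbi_score(obs, *m) for n, m in models.items()}
    accepted = max(scores, key=scores.get) == "spk1"   # claimed ID: spk1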
Speaker verification means making a decision on whether to accept or reject
a speaker. To decide, a threshold is used for each client speaker. If the unknown
speaker's maximum probability score exceeds this threshold, then the unknown
speaker is verified to be the client speaker. However, if the unknown speaker's
maximum probability score is lower than this threshold, the unknown speaker
is rejected. The relationship is shown in Fig. 1.
Fig. 1: Speaker verification decision
The threshold is determined as follows:
• For each speaker, evaluate all samples spoken by him using his own HMM models and find the probability scores. From the scores, find the mean, μ1 and standard deviation, σ1, of the distribution
• For each speaker, evaluate all samples spoken by a large number of impostors using the speaker's HMM models and find the probability scores. From the scores, find the mean, μ2 and standard deviation, σ2, of the distribution
• For each speaker, calculate the threshold as a combination of μ1, σ1, μ2 and σ2 (an illustrative sketch follows this list)
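The exact equation combining μ1, σ1, μ2 and σ2 is not reproduced here. Purely as an assumption for illustration, the sketch below places the per-speaker threshold at the score whose standardized distances to the genuine and impostor score distributions are equal.

    import numpy as np

    def speaker_threshold(genuine_scores, impostor_scores):
        """Per-speaker threshold from genuine/impostor score statistics.

        Assumed combining rule: pick theta such that
        (mu1 - theta)/sigma1 = (theta - mu2)/sigma2.
        """
        mu1, s1 = np.mean(genuine_scores), np.std(genuine_scores)
        mu2, s2 = np.mean(impostor_scores), np.std(impostor_scores)
        return (mu1 * s2 + mu2 * s1) / (s1 + s2)

    def accept(max_score, threshold):
        """Fig. 1 decision: verify the claim only above the threshold."""
        return max_score > threshold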
Vector quantization (VQ): VQ is a process of mapping vectors from a large
vector space to a finite number of regions in that space. Each region is called
a cluster and is represented by its centre, called a centroid (Soong
et al., 1985; Vasuki and Vanathi, 2006).
A collection of all the centroids makes up a codebook. The amount of data is
significantly reduced, since the number of centroids is at least ten times smaller
than the number of vectors in the original sample. This reduces the amount
of computation needed for comparison in subsequent stages. Even though the codebook
is smaller than the original sample, it still accurately represents a person's
voice characteristics. The only difference is that there will be some spectral
distortion.
In the feature extraction stage, we calculate the LPC cepstrum: the entire speech signal is represented as LPC-to-cepstrum parameters and a large sample of these parameters is generated as the training vectors. During the training process of VQ, a codebook is obtained from these sets of training vectors. The training vectors are thus compressed, reducing the storage requirement. An element of the finite set of spectra in a codebook is called a codevector. The codebooks are used to generate the indices, or discrete symbols, that are used by the discrete HMMs. Hence, data compression of speech is accomplished by VQ in the training phase and in the encoding phase, which finds the best codevectors for the input vectors.
To implement VQ, we must first obtain the codebook. A large set of spectral analysis vectors (or speech feature vectors) is required to form the training set. If we denote the size of the VQ codebook as M = 2^N codewords, then we require L training vectors (with L >> M). It has been found that L should be at least 10M in order to train a VQ codebook that works well. For this research, we use the LBG algorithm, also known as the binary-split algorithm.
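The sketch below shows the binary-split procedure in numpy. It uses plain Euclidean distortion for brevity (the study uses the LPC likelihood ratio distortion described next) and the split perturbation and iteration count are illustrative choices.

    import numpy as np

    def lbg_codebook(train, n_bits, eps=0.01, n_iter=20):
        """LBG (binary-split) training of an M = 2**n_bits codebook.

        train : (L, d) array of training feature vectors, L >> M.
        """
        codebook = train.mean(axis=0, keepdims=True)   # 1-vector codebook
        for _ in range(n_bits):
            # Split every centroid into a perturbed pair, then refine.
            codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
            for _ in range(n_iter):                    # k-means refinement
                d = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
                nearest = d.argmin(axis=1)             # nearest-centroid label
                for k in range(len(codebook)):
                    members = train[nearest == k]
                    if len(members):
                        codebook[k] = members.mean(axis=0)
        return codebook

    def encode(vectors, codebook):
        """Map each vector to the index of its nearest codevector."""
        d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return d.argmin(axis=1)

A codebook of 256 codevectors, as used later in the experiments, corresponds to n_bits = 8.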
VQ codebook generation for speaker verification, as proposed by Soong et al. (1985), can be summarized as follows. Given a set of I training feature vectors, (a1, a2, …, aI), characterizing the variability of a speaker, we want to find a partitioning of the feature vector space, (S1, S2, …, SM), for that particular speaker, where the whole feature space S is represented as S = S1 ∪ S2 ∪ … ∪ SM. Each partition, Si, forms a convex, non-overlapping region and every vector inside Si is represented by the corresponding centroid vector, bi, of Si. The partitioning is done in such a way that the average distortion over the whole training set is minimized:

D = (1/I) Σ_{i=1}^{I} min_{1≤j≤M} d(ai, bj)

The distortion between the vectors ai and bj is denoted d(ai, bj). Short-time LPC vectors are used as feature vectors. The corresponding distortion measure used to gauge the similarity between any two feature vectors is the LPC likelihood ratio distortion. The likelihood ratio distortion between two LPC vectors a and b is defined as:

d(a, b) = (b^T Ra b)/(a^T Ra a) − 1
where Ra is the autocorrelation matrix of the speech input data associated
with the vector a. Using this distortion measure and the VQ codebook training
algorithm proposed by Linde, Buzo and Gray (LBG) (Linde
et al., 1980), we generated speaker-based VQ codebooks. The input
speech signal is sampled, segmented and LPC-analyzed, giving a sequence of vectors
a1, a2, …, aI. The resultant LPC vectors are
vector quantized using the N codebooks corresponding to the N different speakers.
The quantization errors (distortions) with respect to each codebook are individually
accumulated across the whole test token. The average distortion with respect
to the ith codebook (speaker) is:

Di = (1/I) Σ_{j=1}^{I} min_k d(aj, bk(i))

where bk(i) are the codevectors of the ith codebook.
The N resultant average distortions are compared to find the minimum. The final speaker recognition decision is given by:

i* = arg min_{1≤i≤N} Di
A speaker verification system has a similar structure, except that only the codebook of the claimed identity is used and the resultant average distortion is compared with a preset threshold to reject or accept the identity claim made by the unknown speaker.
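A sketch of these decision rules follows, with the likelihood ratio distortion written out explicitly; it assumes each test frame carries its own autocorrelation matrix Ra alongside its LPC vector.

    import numpy as np

    def lpc_likelihood_ratio(a, b, Ra):
        """d(a, b) = (b' Ra b) / (a' Ra a) - 1, where Ra is the
        autocorrelation matrix of the frame that produced LPC vector a."""
        return (b @ Ra @ b) / (a @ Ra @ a) - 1.0

    def avg_distortion(lpc_vecs, autocorrs, codebook):
        """Average, over a test token, of the distortion to the best codevector."""
        return np.mean([min(lpc_likelihood_ratio(a, b, Ra) for b in codebook)
                        for a, Ra in zip(lpc_vecs, autocorrs)])

    def identify(lpc_vecs, autocorrs, codebooks):
        """Recognition: the speaker whose codebook gives minimum distortion."""
        return int(np.argmin([avg_distortion(lpc_vecs, autocorrs, cb)
                              for cb in codebooks]))

    def verify(lpc_vecs, autocorrs, claimed_codebook, threshold):
        """Verification: accept iff the claimed codebook's distortion is low."""
        return avg_distortion(lpc_vecs, autocorrs, claimed_codebook) < threshold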
ADAPTIVE NOISE CANCELLATION (ANC)
Conventional frequency-selective digital filters with fixed coefficients are
designed to have a given frequency response chosen to alter the spectrum of
the input signal in a desired manner. However, there are many practical application
problems that cannot be successfully solved by using fixed digital filters because
either we do not have sufficient information to design a digital filter with
fixed coefficients or the design criteria change during the normal operation
of the filter (Manolakis et al., 2000).
The principle of general noise cancellation is illustrated in Fig. 2. The signal s(n) is corrupted by uncorrelated additive noise v1(n) and the combined signal s(n)+v1(n) provides the primary input. A second sensor, located at a different point, acquires a noise v2(n) that is uncorrelated with the signal s(n) but correlated with the noise v1(n). If we can design a filter that provides a good estimate y(n) of the noise v1(n), by exploiting the correlation between v1(n) and v2(n), then we can recover the desired signal by subtracting y(n) ≈ v1(n) from the primary input. The filtered signal is given by the estimation error:

e(n) = s(n) + v1(n) − y(n)

where y(n) depends on the filter structure and parameters. The mean square error (MSE) is given by:

E[e²(n)] = E[s²(n)] + E[(v1(n) − y(n))²]

since s(n) is uncorrelated with both v1(n) and y(n); minimizing the MSE therefore minimizes the noise residual v1(n) − y(n).
Fig. 2: Adaptive noise cancellation using reference input
However, the performance of the ANC is highly dependent on the quality of
the noise reference. The noise in the reference sensor and the noisy speech
sensor must be sufficiently correlated to obtain substantial noise reduction.
Any leakage from the primary speech signal into the noise reference signal must
be avoided, since it results in distortion of the primary speech signal and poor
noise cancellation.
Least mean square (LMS) adaptive filter: The LMS algorithm is an important
member of the family of stochastic gradient algorithms. A significant feature
of the LMS algorithm is its simplicity. Moreover, it does not require measurements
of the pertinent correlation functions, nor does it require matrix inversion.
Indeed, it is the simplicity of the LMS algorithm that has made it the standard
against which other linear adaptive filtering algorithms are benchmarked (Haykin,
2002). A block diagram of the adaptive transversal filter is illustrated in Fig.
3.
The output of the LMS adaptive filter is given by:

y(n) = ŵ^H(n) û(n)

where ŵ(n) is the current estimate of the tap-weight vector and û(n) is the tap-input vector. The superscript H stands for Hermitian, or equivalently conjugate, transpose. The estimation error signal is found as:

e(n) = d(n) − ŵ^H(n) û(n)

where d(n) is the desired response. The LMS algorithm is defined as:

ŵ(n+1) = ŵ(n) + μ û(n) e*(n)

where ŵ(n+1) is the estimate of the tap-weight vector at time (n+1) and μ is a constant step size:
Fig. 3: Block diagram of adaptive transversal filter
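To make the update concrete, here is a minimal real-valued LMS noise canceller in numpy, wired as in Fig. 2: the reference noise drives the filter and the primary input acts as the desired response. For real signals the Hermitian transpose reduces to an ordinary transpose and the conjugation drops out; the filter length and step size are illustrative.

    import numpy as np

    def lms_anc(primary, reference, M=32, mu=0.01):
        """LMS adaptive noise cancellation (real-valued signals).

        primary   : s(n) + v1(n), the desired response d(n)
        reference : v2(n), correlated with v1(n) but not with s(n)
        Returns e(n) = d(n) - w(n)'u(n), the recovered speech estimate.
        """
        w = np.zeros(M)
        e = np.zeros(len(primary))
        for n in range(M - 1, len(primary)):
            u = reference[n - M + 1:n + 1][::-1]   # tap-input vector u(n)
            y = w @ u                              # filter output y(n)
            e[n] = primary[n] - y                  # error = cleaned sample
            w = w + mu * u * e[n]                  # LMS weight update
        return e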
Normalized least mean square (NLMS) adaptive filter: The NLMS filter belongs
to the LMS family. The NLMS filter differs from the conventional LMS filter
in the way the step size for controlling the adjustment to the filter's
tap-weight vector is defined. In structural terms, the NLMS adaptive filter is
exactly the same as the standard LMS filter, as shown in
Fig. 3. Both adaptive filters are built around a transversal
filter and differ only in the way the weight controller is mechanized.
The M-by-1 tap-input vector û(n) produces an output ŷ(n) that is subtracted
from the desired response d(n) to produce the estimation error e(n). In response
to the combined action of the input vector û(n)
and the error signal e(n), the weight controller applies a weight adjustment to
the transversal filter. This sequence of events is repeated for a number of
iterations until the filter reaches steady state. The output of the NLMS
adaptive filter is given by:

ŷ(n) = ŵ^H(n) û(n)

where ŵ(n) is the current estimate of the tap-weight vector and û(n) is the tap-input vector; the superscript H again denotes the Hermitian transpose. The estimation error signal is found as:

e(n) = d(n) − ŷ(n)

where d(n) is the desired response. The NLMS algorithm is defined as:

ŵ(n+1) = ŵ(n) + (μ̃ / (δ + ||û(n)||²)) û(n) e*(n)

where ŵ(n+1) is the estimate of the tap-weight vector at time (n+1), and μ̃
and δ are constants (δ>0) (Haykin, 2002; Manolakis
et al., 2000).
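Only the weight-update line changes relative to the LMS sketch above; mu and delta are again illustrative values.

    import numpy as np

    def nlms_anc(primary, reference, M=32, mu=0.5, delta=1e-6):
        """NLMS variant: step size normalized by the tap-input energy."""
        w = np.zeros(M)
        e = np.zeros(len(primary))
        for n in range(M - 1, len(primary)):
            u = reference[n - M + 1:n + 1][::-1]        # tap-input vector u(n)
            e[n] = primary[n] - w @ u                   # estimation error
            w = w + (mu / (delta + u @ u)) * u * e[n]   # normalized update
        return e

Because the effective step size shrinks when the input is strong and grows when it is weak, NLMS is far less sensitive to input scaling than plain LMS.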
Recursive Least-Squares (RLS) adaptive filter: The RLS algorithm is based on the least-squares (LS) estimate of the filter coefficients ŵ(n−1) at iteration (n−1), computing its estimate at iteration n using the newly arrived data.

Fig. 4: Block diagram of RLS filter

The filter coefficients at time n are chosen to minimize the cost function:

J(n) = Σ_{i=1}^{n} λ^{n−i} |e(i)|²
where the error signal e(i) is computed for all times 1 ≤ i ≤ n using the current filter coefficients ŵ(n), e(i) = d(i) − ŵ^H(n) û(i), and λ is called the forgetting factor. Using the matrix inversion lemma, a recursive update equation for the inverse correlation matrix P(n) is found as:

P(n) = λ⁻¹ P(n−1) − λ⁻¹ k(n) û^H(n) P(n−1)    (15)

with the gain vector:

k(n) = (λ⁻¹ P(n−1) û(n)) / (1 + λ⁻¹ û^H(n) P(n−1) û(n))    (16)

Finally, the weight update equation is:

ŵ(n) = ŵ(n−1) + k(n) ξ*(n)    (17)

where ξ(n) is the a priori estimation error, given by:

ξ(n) = d(n) − ŵ^H(n−1) û(n)    (18)
Equation 18 describes the filtering operation of the algorithm,
whereby the transversal filter is excited to compute the a priori estimation
error ξ(n). Equation 17 describes the adaptive operation
of the algorithm, whereby the tap-weight vector is updated by incrementing its
old value by an amount equal to the product of the complex conjugate of the
a priori estimation error ξ(n) and the time-varying gain vector k(n).
Equations 15 and 16 enable the computation of
the gain vector itself. Figure 4 shows the block diagram of
the RLS adaptive filter (Haykin, 2002; Manolakis
et al., 2000; Poularikas and Ramadan, 2006).
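A corresponding real-valued RLS sketch follows; the forgetting factor and the P(0) = I/δ initialization are illustrative choices.

    import numpy as np

    def rls_anc(primary, reference, M=32, lam=0.999, delta=0.01):
        """RLS adaptive noise cancellation (real-valued signals).

        lam : forgetting factor; P tracks the inverse correlation matrix.
        """
        w = np.zeros(M)
        P = np.eye(M) / delta                      # P(0) = I / delta
        e = np.zeros(len(primary))
        for n in range(M - 1, len(primary)):
            u = reference[n - M + 1:n + 1][::-1]   # tap-input vector u(n)
            Pu = P @ u
            k = Pu / (lam + u @ Pu)                # gain vector, Eq. 16
            xi = primary[n] - w @ u                # a priori error, Eq. 18
            w = w + k * xi                         # weight update, Eq. 17
            P = (P - np.outer(k, Pu)) / lam        # recursion for P, Eq. 15
            e[n] = xi                              # recovered speech estimate
        return e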
EXPERIMENTS
Experimental conditions in clean environment: Speaker verification experiments
were carried out using a Malay spoken digit database (Ilyas
et al., 2007) which contains 100 speakers. To evaluate performance
of the system in noisy environments, experiments using added Gaussian white
noise at 4 levels (30, 20, 10 and 0 dB) were carried out with and without adaptive
filtering. For the experiments, 100 speakers were selected, each with 10
repetitions of the Malay digits. All of the Malay digits, from 0 to 9,
were used to build the speaker models. Feature vectors composed of 14 linear
predictive coding cepstral (LPCC) coefficients were used. The analyzed frame
was windowed by a 15 msec Hamming window with 5 msec overlap. The samples
were pre-segmented automatically using the start-end detection module to remove
the silent parts. For speaker modeling, all samples were selected from each
speaker's training set. This procedure was used to build the global codebook
to be used later for the HMMs. Then, for each speaker, a codebook was built using
the Linde-Buzo-Gray (LBG) VQ method. The size of each codebook was 256 codevectors,
the same as the global codebook. For testing we used a workstation equipped with
a Pentium D processor, with 1 GB of memory and running on the Windows XP operating
system.
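A sketch of this LPCC front end is given below. The Levinson-Durbin and LPC-to-cepstrum recursions are standard, but the sampling rate (not restated in this section) and all parameter defaults are assumptions.

    import numpy as np

    def levinson_durbin(r, order):
        """Levinson-Durbin recursion: predictor coefficients a_k such that
        x[n] is approximated by sum_k a_k * x[n - k]."""
        a = np.zeros(order)
        err = r[0]
        for i in range(order):
            acc = r[i + 1] - a[:i] @ r[i:0:-1]
            k = acc / err
            a[:i + 1] = np.r_[a[:i] - k * a[:i][::-1], k]
            err *= (1.0 - k * k)
        return a

    def lpc_cepstrum(a, n_ceps):
        """LPC-to-cepstrum recursion: c_n = a_n + sum_k (k/n) c_k a_{n-k}."""
        p = len(a)
        c = np.zeros(n_ceps)
        for n in range(1, n_ceps + 1):
            s = a[n - 1] if n <= p else 0.0
            for k in range(max(1, n - p), n):
                s += (k / n) * c[k - 1] * a[n - k - 1]
            c[n - 1] = s
        return c

    def lpcc_features(signal, fs=16000, win_ms=15, hop_ms=10, order=14):
        """Frame (15 ms Hamming window, 5 ms overlap -> 10 ms hop) and
        compute 14 LPCC coefficients per frame; fs is an assumed rate."""
        win = int(fs * win_ms / 1000)
        hop = int(fs * hop_ms / 1000)
        window = np.hamming(win)
        feats = []
        for start in range(0, len(signal) - win, hop):
            frame = signal[start:start + win] * window
            r = np.correlate(frame, frame, "full")[win - 1:win + order]
            feats.append(lpc_cepstrum(levinson_durbin(r, order), order))
        return np.array(feats)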
Figure 5 shows the flow chart of the speaker verification experiment. First, clean speech signals pass through end point detection and feature extraction without adaptive filtering. During training, individual codebooks for the VQ models and a global codebook for the HMM models are generated. Otherwise, the combined VQ and HMMs and the standalone HMMs are evaluated based on the individual models and the claimed ID.
Experimental conditions in noisy environment: Experiments in noisy environments were carried out using the same approach as in the clean environment. However, Gaussian white noise was added to the clean speech signals to produce noisy speech signals. Figure 6 shows an example of an original clean signal and a noisy speech signal mixed with Gaussian white noise at an SNR of 0 dB.
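A short sketch of how the noise can be mixed in at a target SNR; matching average powers is the assumed scaling convention.

    import numpy as np

    def add_white_noise(clean, snr_db, rng=None):
        """Add zero-mean Gaussian white noise so that the signal-to-noise
        power ratio equals snr_db decibels."""
        rng = rng or np.random.default_rng(0)
        p_signal = np.mean(clean ** 2)                   # average signal power
        p_noise = p_signal / (10.0 ** (snr_db / 10.0))   # required noise power
        return clean + rng.normal(0.0, np.sqrt(p_noise), size=len(clean))

    # The four experimental conditions:
    # noisy = {snr: add_white_noise(clean, snr) for snr in (30, 20, 10, 0)}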
Experimental conditions in noisy environments using adaptive filter:
Adaptive filters can be used to cancel the added noise and clean the noisy signal.
Filtered signals were tested using the same procedure as in the clean and noisy
environments to evaluate speaker verification performance. However, the noisy
speech signals went through adaptive filtering (Fig. 5) before end point
detection and feature extraction. Figure 7a-c
shows examples of the filtered signal obtained from LMS, NLMS and
RLS adaptive filtering of the speech signal at an SNR of 0 dB (Fig. 6a, b).
All of the filtered signals are very similar to the original clean signal.
Fig. 5: Speaker verification experiment flow chart

Fig. 6: (a, b) Original and noisy speech signal of the word "Satu"

Fig. 7: Filtered speech signal of the word "Satu" with (a) LMS, (b) NLMS and (c) RLS filtering (0 dB)
RESULTS AND DISCUSSION
Clean environment: Table 1 shows a summary of the verification results for the experiments performed. An Equal Error Rate (EER) of 11.72% is achieved using the combination technique, compared to 16.66% for the standalone HMMs. Using the combination technique, the true speaker rejection rate is 0.06% while the impostor acceptance rate is 0.03%. Figure 8 shows a ROC plot of False Rejection Rate (FRR) vs. False Acceptance Rate (FAR). It shows that the hybrid VQ and HMMs technique outperforms the HMM-only technique.
Noisy environments (without adaptive filtering): Table 2 shows the verification results of HMMs in noisy environments (mixed with Gaussian white noise) without adaptive filtering. EERs of between 41.01 and 49.94% are achieved for SNRs between 0 and 30 dB. Higher noise levels worsen system performance in all cases. Table 3 shows the verification results of the hybrid VQ+HMMs in noisy environments (mixed with Gaussian white noise) without adaptive filtering. EERs of between 37.14 and 49.11% are achieved for SNRs between 0 and 30 dB. Using the hybrid technique, relative improvements in EER of between 0.83 and 3.87%, in FAR of between 19.92 and 26.43% and in FRR of between 4.44 and 24.55% are obtained compared to the HMMs technique.
Fig. 8: ROC plot of False Rejection Rate (FRR) vs. False Acceptance Rate (FAR)

Table 1: Verification results for the clean environment (%)

Table 2: Verification results of HMMs in noisy environments (%)

Table 3: Verification results of VQ+HMMs in noisy environments (%)

Table 4: Verification results using the LMS adaptive filter (%)

Table 5: Verification results using the NLMS adaptive filter (%)
Noisy environments (with adaptive filtering): Table 4-6
show the verification results using the hybrid VQ and HMMs in noisy environments
(mixed with Gaussian white noise) with LMS, NLMS and RLS adaptive filtering,
respectively. Figure 9 shows the ROC plot of FAR vs. FRR using the
hybrid VQ and HMMs, with and without adaptive filtering, at the noisiest (0 dB) condition.
Improvements of 26.41, 29.63 and 31.5% are achieved using LMS, NLMS and RLS,
respectively. It can be seen that the RLS adaptive filter gives the best result
for the noisiest (0 dB) condition in terms of speed of adaptation and speech
tracking behavior. However, as far as computational complexity is concerned,
the complexity of the RLS algorithm increases with the square of the filter order,
O(N²), where N is the filter order. On the other hand, the LMS algorithm
has the lowest computational requirements, since its complexity is
directly proportional to the filter order N. The NLMS algorithm
has a variable step size for adaptation, which gives better tracking characteristics
at the same computational complexity as the LMS version.
Table 6: Verification results using the RLS adaptive filter (%)

Fig. 9: ROC plot of False Rejection Rate (FRR) vs. False Acceptance Rate (FAR) with different adaptive filters at 0 dB SNR
The traditional HMMs perform worse in both clean and noisy environments compared
to the hybrid VQ and HMMs. Although the hybrid VQ and HMMs perform better in both clean
and noisy environments, their performance still degrades under noisy conditions.
Implementation of adaptive filtering improves the hybrid VQ and HMMs under
noisy conditions. The method proposed in this study is reasonable, with the following
justifications:
• Recognition systems based on HMMs are effective under many circumstances, but do suffer from some major limitations that limit the applicability of automatic speech recognition (ASR) technology in real-world environments (Trenti and Gori, 2000)
• The goal of hybrid systems in ASR (speaker verification) is to take advantage of the properties of both HMMs and ANNs (VQ) (Trenti and Gori, 2000)
• The adaptive filter relies for its operation on a recursive algorithm, which makes it possible for the filter to perform satisfactorily in an environment where complete knowledge of the relevant signal characteristics is not available (Haykin, 2002)
CONCLUSION
This study has presented two approaches to improving speaker verification in clean and noisy environments. The first approach shows how the hybrid VQ and HMMs improves HMM speaker verification performance in clean and noisy environments and the second shows how adaptive filtering improves the hybrid technique in noisy environments. Experimental results have shown that, using this hybrid classification technique, EER, FAR and FRR are improved in both clean and noisy environments compared to HMMs alone. However, both techniques showed degradation in noisy environments. To address this problem, an Adaptive Noise Cancellation (ANC) technique using adaptive filtering was implemented, due to its ability to separate overlapping speech frequency bands. Investigations using Least-Mean-Square (LMS), Normalized Least-Mean-Square (NLMS) and Recursive Least-Squares (RLS) adaptive filtering were conducted to find the best solution for the system. It has been shown that the RLS adaptive filter gives the best result for the noisiest (0 dB) condition. However, considering computational complexity and overall results, the NLMS adaptive filter is identified as the best filter. Further work will concentrate on real-time noisy conditions.
ACKNOWLEDGMENTS
This research was supported by the following research grants: Fundamental Research Grant Scheme, Malaysian Ministry of Higher Education, FRGS UKM-KK-02-FRGS-0036-2006 and E-science, Ministry of Science, Technology and Innovation, e-science 01-01-02-SF0374.