Research Article
Speaker Identification Using Bayesian Algorithm
Khalid Daqrouq
Rami Al-Hmouz
King Abdulaziz University, Jeddah, Saudi Arabia
Biometrics refers to the measurement of human behavioral and physiological characteristics. Behavioral biometrics are based on the way human beings do things, such as gait, signature, blinking and lip movement, while physiological biometrics are based on measuring a person's physical characteristics, such as fingerprints, iris patterns, retina patterns, palm prints, facial features and hand geometry. Speaker recognition is one of the vital biometric recognition tasks: the process of recognizing a person by his/her voice. There are six different types of speaker recognition: speaker verification, speaker identification, speaker classification, speaker segmentation, speaker detection and speaker tracking (Campbell, 1997).
The speaker identification process may be either text-dependent or text-independent. In the text-dependent process, the speaker trains the system by uttering specific keywords and repeats them at the recognition phase, whereas the text-independent process does not depend on specific keywords.
The main focus of this study is speaker identification (recognition), which can be classified as either physiological or behavioral depending on the concentration of the system. If the concentration is on the vocal tract, it is classified as physiological; if the concentration is on the way of speaking, it is considered behavioral.
Speaker identification may be split into two additional categories: closed-set and open-set. In the closed-set mode, the speaker utterance is compared against all speaker models in the database and the system returns the ID of the closest match. No speaker is rejected in the closed-set mode.
Fig. 1: Process of speaker identification
Open-set identification is basically closed-set identification with a speaker verification capability added. The closest match is then verified and, if it matches, the speaker is granted rights to the system.
In the identification task (Fig. 1), the system has trained models for a certain number of client speakers and the task is to determine which of these models best matches the current speaker. Speaker identification could be used in adaptive user interfaces. For example, a car shared by a couple could recognize the driver by his/her voice and adjust the seat accordingly. This particular application concept belongs to the more general group of speaker adaptation methods that are already employed in speech recognition systems (Kuhn et al., 2000).
The preprocessing, feature extraction, speaker modeling and decision logic modules comprise the speaker identification system. In the preprocessing step, the input signal is adjusted and prepared with the appropriate characteristics for the next step. Preprocessing covers the conversion of the analog speech signal into digital form, filtering, framing, endpoint detection and windowing. In the feature extraction part, the input utterance is converted into a set of vectors that characterize the speaker's speech properties, and irrelevant features are removed in this front-end process. The relevant features highlighted during the front-end process are then used to create a speaker model; this process is called speaker modeling and the resulting models are stored in a database for future use. Matching the unknown speaker sample against the reference model database is done by the decision logic, the last step in the speaker identification system: the decision is made based on the maximum a posteriori probability calculated from the unknown speaker's feature vectors and the best-matching speaker reference model (Farhood and Abdughafour, 2010).
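As a minimal sketch of the framing and windowing part of preprocessing, the following Python fragment splits a digitized signal into overlapping Hamming-windowed frames. The frame length (25 ms), hop (10 ms) and window choice are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D speech signal into overlapping frames (25 ms frames
    with a 10 ms hop at 16 kHz) and apply a Hamming window to each."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames

# Toy example: one second of a synthetic tone at 16 kHz
sig = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)
frames = frame_signal(sig)
print(frames.shape)  # (98, 400)
```

Each row of the output is one windowed frame, ready for feature extraction.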
Naive Bayesian classification is one of the simplest classification algorithms, yet it is effective and efficient. It is used when the dimensionality of the inputs is high. The use of Bayesian methods in speech processing has gained attention (Zweig and Russel, 1998; Maina and Walsh, 2011a, b; Jiang and Deng, 2001; Vogt and Sridharan, 2004; Meuwly and Drygajlo, 2001; Shiota et al., 2012; Jiang et al., 1999; Gauvain and Lee, 1996; Khanteymoori et al., 2008).
Maina and Walsh (2011a, b) presented a comparison of two variations of Bayesian algorithms for joint speech enhancement and speaker identification. Both algorithms make use of speaker-dependent speech priors, which allowed speech enhancement and speaker identification to be performed jointly.
The application of Bayesian factor scoring to speaker verification was presented in (Jiang and Deng, 2001). Bayes' theorem has also been implemented as a robust methodology for forensic automatic speaker recognition (Meuwly and Drygajlo, 2001); it has been shown that this approach gives an adequate solution for the interpretation of recorded speech as evidence in the judicial process. In (Shiota et al., 2012), a method for integrating model structures based on the Bayesian framework for speech recognition was proposed. The speech recognition experiments demonstrated that the proposed method could automatically estimate reliable posterior distributions of model parameters and an adequate posterior distribution of model structures.
In this study, different feature extraction methods are implemented, mostly based on Discrete Wavelet Transform (DWT) and Linear Prediction Coefficient (LPC) techniques, for text-independent speaker identification systems. The investigation procedure is based on feature extraction and voice classification. In the classification phase, a Bayesian algorithm is proposed and its speaker identification performance is measured with different feature extraction methods for clean and noisy voice signals.
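To illustrate the general shape of a wavelet-plus-LPC feature extractor, the following sketch decomposes a signal with a Haar DWT and concatenates LPC coefficients from each sub-signal. The Haar filters, the five-level depth and LPC order 8 are illustrative assumptions; the paper's AFLPC features are derived differently, so this only shows the structure of the idea:

```python
import numpy as np

def haar_dwt(x):
    """One level of a Haar DWT: approximation and detail sub-signals."""
    x = x[:len(x) // 2 * 2]                 # drop an odd trailing sample
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

def lpc(x, order=8):
    """LPC coefficients via the autocorrelation (Yule-Walker) method."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def dwt_lpc_features(signal, levels=5, order=8):
    """Concatenate LPC coefficients of each detail sub-signal plus the
    final approximation over `levels` DWT levels."""
    feats, approx = [], signal
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        feats.append(lpc(detail, order))
    feats.append(lpc(approx, order))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
vec = dwt_lpc_features(rng.standard_normal(4000))
print(vec.shape)  # (48,) -> 6 sub-signals x 8 LPC coefficients
```

The resulting fixed-length vector is what the classifier in the next section consumes.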
NAIVE BAYESIAN CLASSIFIER
Observing the features of the AFLPC method reported in (Daqrouq and Al-Azzawi, 2012) and using these features' statistics, the likelihood of each feature as belonging to a verified speaker can be determined.
Let the class Ci be a speaker class and let D be data that provides information about Ci, then:

P(Ci|D) = P(D|Ci) P(Ci) / P(D)
The first step is to estimate P(D|Ci), usually referred to as the likelihood function. This is achieved using training speaker-feature samples. Features are collected by applying the Discrete Wavelet Transform (DWT) and the Forward Discrete Wavelet Transform (FDWT) on speaker signals. Figure 2 shows histograms of features for three different speakers. It is clearly shown that using whole-vector speaker-feature statistics does not give abundant distinction among speaker classes. Therefore, applying a GMM leads to a low classification rate, because each speaker-feature distribution can be modeled as a single Gaussian distribution but with no discrimination among classes.
Accordingly, we build a likelihood function for each feature per speaker; it was found that most features can be modeled as a Gaussian distribution, as shown in Fig. 3.
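A per-feature Gaussian likelihood of this kind can be estimated from training samples as in the sketch below; the function names, the variance floor and the toy data are assumptions made for illustration:

```python
import numpy as np

def fit_feature_gaussians(train):
    """Estimate per-feature Gaussian parameters (mean, std) for each
    speaker class. train: dict class_id -> (n_samples, n_features) array."""
    params = {}
    for cid, X in train.items():
        mu = X.mean(axis=0)
        sigma = X.std(axis=0) + 1e-6   # small floor avoids zero std
        params[cid] = (mu, sigma)
    return params

def log_likelihood(f, mu, sigma):
    """log P(f1..fN | Ci): sum of per-feature Gaussian log-densities,
    i.e. the log of the conditional-independence product."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                        - (f - mu) ** 2 / (2 * sigma ** 2)))

rng = np.random.default_rng(0)
train = {1: rng.normal(0.0, 1.0, (200, 16)),   # synthetic class 1 features
         2: rng.normal(3.0, 1.0, (200, 16))}   # synthetic class 2 features
params = fit_feature_gaussians(train)
f = np.full(16, 3.0)                           # test vector near class 2
print(log_likelihood(f, *params[2]) > log_likelihood(f, *params[1]))  # True
```

Working in the log domain keeps the N-way product of small probabilities numerically stable.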
For each feature, there would be a probability score for each speaker class. Therefore, let fk be a feature in the feature vector. Then:

P(fk|Ci) = (1 / (σik √(2π))) exp(-(fk - μik)² / (2σik²))

and:

P(f1 f2 ... fN|Ci) = P(f1|Ci) P(f2|Ci) ... P(fN|Ci)

where N is the total number of features in AFLPC and μik, σik are the mean and standard deviation of feature fk for class Ci. Under the assumption of conditional independence, we reach:

P(Ci|f1 f2 ... fN) = P(f1 f2 ... fN|Ci) P(Ci) / Σj P(f1 f2 ... fN|Cj) P(Cj)
Fig. 2(a-c): Speaker-feature distributions, (a) Feature distribution of class 1, (b) Feature distribution of class 12 and (c) Feature distribution of class 32
The posterior P(Ci|f1 f2 ... fN) is computed through Bayesian fusion of all feature probabilities in the speaker signal's AFLPC. P(Ci) is the prior probability of the speaker class Ci; it was assumed that all classes are equally likely (P(Ci) = 1/50). Σj P(f1 f2 ... fN|Cj)P(Cj) is a normalization term. The Maximum A Posteriori (MAP) rule is used to estimate the speaker class Ci that maximizes P(Ci|f1 f2 ... fN):

argmaxi {P(Ci|f1 f2 ... fN)}
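The MAP decision above can be sketched as follows. Because the normalization term is shared by all classes, it can be dropped from the argmax; names and the toy parameters are illustrative assumptions:

```python
import numpy as np

def map_classify(f, params, priors=None):
    """Return the class maximizing P(Ci | f1..fN). Works in the log
    domain for numerical stability; the shared normalization term is
    omitted since it does not affect the argmax. Equal priors by default."""
    best, best_score = None, -np.inf
    for cid, (mu, sigma) in params.items():
        score = np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                       - (f - mu) ** 2 / (2 * sigma ** 2))
        if priors is not None:
            score += np.log(priors[cid])
        if score > best_score:
            best, best_score = cid, score
    return best

# Toy example with two well-separated speaker classes
params = {0: (np.zeros(8), np.ones(8)),
          1: (np.full(8, 5.0), np.ones(8))}
print(map_classify(np.full(8, 5.0), params))  # 1
print(map_classify(np.zeros(8), params))      # 0
```

With equal priors (as assumed in the paper, P(Ci) = 1/50), the MAP rule reduces to maximum likelihood over the classes.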
Fig. 3(a-f): Feature distributions, (a) Feature number 4 distribution of class 1, (b) Feature number 156 distribution of class 10, (c) Feature number 190 distribution of class 24, (d) Feature number 156 distribution of class 41, (e) Feature number 190 distribution of class 24 and (f) Feature number 190 distribution of class 2
Similarly, features from different approaches can be combined to produce probability scores for each speaker. The essence of this approach is a way to combine different methods in a probabilistic manner: each method produces features that differ from those of the other methods, and the features from different methods suffer from independent noise. As a result of combining methods, the features become more descriptive and noise can be eliminated in the fusion process.
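Under the same conditional-independence assumption, fusing methods amounts to multiplying their per-class probabilities, i.e. summing log-scores. A minimal sketch (the method names and scores are invented for illustration):

```python
def fuse_log_scores(method_scores):
    """Combine per-method log-probability scores for each class by
    summation (a product of probabilities), assuming the methods'
    features are conditionally independent given the speaker."""
    classes = method_scores[0].keys()
    return {c: sum(s[c] for s in method_scores) for c in classes}

# Toy example: two feature extraction methods, three speaker classes.
# Method 2 is noisier, but fusion still prefers class "B".
m1 = {"A": -10.0, "B": -2.0, "C": -8.0}
m2 = {"A": -5.0, "B": -4.0, "C": -6.0}
fused = fuse_log_scores([m1, m2])
print(max(fused, key=fused.get))  # B
```

A method whose features are corrupted by noise pulls all its scores down roughly uniformly, so the fused decision is dominated by the more reliable method.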
RESULTS AND DISCUSSION
To examine the presented text-independent speaker identification system, a testing database was created from the Arabic language. The recording environment is a normal office setting via a PC sound card, with an original frequency of 4 kHz and a sampling frequency of 16 kHz. The utterances are Arabic spoken digits from 0 to 14. Each speaker also distinctly reads 30 sec worth of different Arabic texts ten separate times.
In the experiments, several feature extraction methods were analyzed to expose the efficacy of the proposed system. The following experiment investigates the proposed method in terms of the recognition rate.
Table 1: Comparison between different feature extraction methods
Table 2: Recognition rate with white Gaussian noise
Table 3: Recognition rate with real babble noise
Table 1 shows a comparative study of different feature extraction methods, with the Bayesian algorithm used as the classifier. The best recognition rate obtained was 90.93% for WPLPCF.
Another experiment was conducted to assess the performance of the system in noisy environments. Tables 2 and 3 summarize the speaker identification results for White Gaussian Noise (WGN) and real noise (restaurant noise, which sounds like babble), at Signal-to-Noise Ratios (SNR) of 0 and 5 dB, respectively.
In both noisy conditions, the best Bayesian recognition rate was obtained for DWTLPCF. The reason for DWT's success over WP is that its feature vector is obtained from level 5 (depth 5), where the sub-signals are filtered at a different depth than in WPLPCF at level 2. It is shown that the use of the eigenvector in conjunction with WPLPCF can improve the robustness of an identification system.
The Bayesian algorithm for speaker identification with different feature extraction methods has been successfully implemented. Seven methods are used to extract the essential speaker features, based on the Discrete Wavelet Transform and Linear Prediction Coefficients (LPC). Experimental results showed that both DWT and WP linked with AFLPC are suitable feature extraction methods; however, WP resulted in better performance in terms of recognition rate. In comparison with the other methods, WPLPCF produced a higher recognition rate. The experimental results revealed that the proposed AFLPC technique with DWT can accomplish better results for a speaker identification system in an AWGN environment: 67.3% at 0 dB and 77.8% at 5 dB. It can also be concluded that the Bayesian algorithm gives excellent performance in the case of minimum signal fluctuations.
This study was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah. The authors, therefore, acknowledge with thanks DSR technical and financial support.