INTRODUCTION
Automatic Speaker Verification (ASV) refers to the task of verifying a speaker's
identity by means of the speaker-specific information contained in the speech signal.
Speaker verification methods are broadly divided into text-dependent and
text-independent applications. When the same text is used for both training
and testing, the system is said to be text-dependent; in the text-independent
case, the text used to train and test the ASV system is totally unconstrained.
Text-independent speaker verification therefore places no restriction on the
type of input speech. On the other hand, text-independent speaker verification
generally performs worse than text-dependent speaker verification, which
requires the test input to be the same utterances as the training data (Nemati
and Basiri, 2010; Xiang and Berger, 2003).
Speaker verification has been the topic of active research for many years
and has many important applications where privacy of information is a concern
(Lamel and Gauvain, 2000). Applications of speaker verification
can be found in biometric person authentication, such as an extra identity check
in credit card payments over the Internet, while potential applications
of speaker identification lie in multi-user systems. For example,
in speaker tracking the task is to locate the segments of a given speaker(s)
in an audio stream (Kwon and Narayanan, 2002; Lapidot
et al., 2002; Martin and Przybocki, 2001).
It also has potential applications in the automatic segmentation of teleconferences
and in helping to transcribe courtroom discussion.
There has been a wide spectrum of proposed approaches to speaker verification, starting with very simplistic models such as those based on long-term statistics. The most sophisticated methods rely on large-vocabulary speech recognition with phone-based HMMs.
Feature extraction is a key stage in speaker verification systems. The speech
features used in a speaker verification system fall into two classes based
on their associated space. One class includes features defined in an unconditional,
or absolute, space, while the other includes features defined
in a relative space. For the first class, the depiction of a speaker in the feature
space is not related to any reference speaker (Naini et
al., 2010). While there is a substantial body of literature on features
in the absolute space, very little research has been conducted on
the properties of features extracted in the relative space. Mel-Frequency Cepstral
Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC) and wavelet coefficients
(Afshari, 2011; Avci and Akpolat,
2006) are among the most common speech features in the absolute space.
More recently, Campbell et al. (2006) used
Maximum A Posteriori (MAP) adapted GMM mean supervectors as an absolute feature
with a Support Vector Machine (SVM) as a discriminative model for speaker verification.
For features defined in a relative space, each speaker in the feature space is described relative
to some reference speakers. Moreover, features extracted in the relative space
can be applied in conjunction with any other set of techniques from the verification-phase
menu that are deemed more suitable. Kuhn et al.
(2000) introduced the eigenvoices concept and represented each new speaker
relative to the eigenvoices (Kuhn et al., 2000; Thyes
et al., 2000). Afterwards, other researchers used a different approach
in which they introduced the idea of a space of anchor models to represent enrolled
speakers in verification systems and to verify a test speaker in a relative
feature space (Naini et al., 2010; Mami
and Charlet, 2002, 2003, 2006).
Speech features are often extracted with the Fourier Transform (FT) and the Short-Time
Fourier Transform (STFT). Unfortunately, these assume that signals are stationary within
a given time frame and may therefore lack the ability to represent localized
events properly. Recently, the Wavelet Transform (WT) has been proposed for feature
extraction. The particular benefit of wavelet analysis is its ability to characterize
signals at different localization levels in both the time and frequency domains
(Derbel et al., 2008; Wu
and Ye, 2009; Zheng et al., 2002). Furthermore,
the WT is well suited to the analysis of non-stationary signals (Daqrouq
et al., 2010; Daqrouq, 2011). It provides an
alternative to classical linear time-frequency representations with better time
and frequency localization characteristics. In earlier studies, these properties
were applied to speaker recognition, particularly via the Wavelet Packet Transform (WPT)
(Wu and Lin, 2009; Lung, 2004;
Avci, 2009).
Artificial neural network performance depends mainly on the size and quality
of the training samples (Visser et al., 2003). When
the amount of training data is small and not representative of the possibility
space, standard neural network results are poor. Incorporating neuro-fuzzy
or wavelet techniques can improve performance in this case, particularly by
decreasing the dimensionality of the input matrix. Artificial Neural Networks (ANN) are
known to be excellent classifiers, but their performance can be limited by
the size and quality of the training set (Qureshi and Jalil,
2002).
The specific aim of the present study was to develop and evaluate the number of repetitions of the remainder (obtained by modular arithmetic) as a feature for speech-based, text-independent verification of a speaker against imposters. For comparison, two other verification methods are proposed: a Gaussian Mixture Model method and a K-Means clustering method.
FEATURE EXTRACTION BY WAVELET PACKET
Wavelet packet: The wavelet packet method is a generalization of wavelet decomposition that offers a richer signal analysis. Wavelet packet atoms are waveforms indexed by three naturally interpreted parameters: position and scale, as in wavelet transform decomposition, and frequency. In the following, the wavelet transform is defined as the inner product of a signal x (t) with the mother wavelet ψ (t):

W (a, b) = (1/√|a|) ∫ x (t) ψ* ((t − b)/a) dt

where, a and b are the scale and shift parameters, respectively. The mother wavelet may be dilated or translated by modulating a and b.
The wavelet packet transform performs recursive decomposition of the speech signal via a recursive binary tree (Fig. 1). Basically, the WPT is very similar to the Discrete Wavelet Transform (DWT), but the WPT decomposes both details and approximations instead of performing the decomposition process on the approximations only. The principle of WP is that, given a signal, a pair of low-pass and high-pass filters is used to yield two sequences that capture the frequency sub-band features of the original signal. The two wavelet orthogonal bases generated from a previous node are defined as:

ψ_{j+1}^{2p} [n] = Σ_{m} h [m] ψ_{j}^{p} [n − 2^{j}m]    (3)

ψ_{j+1}^{2p+1} [n] = Σ_{m} g [m] ψ_{j}^{p} [n − 2^{j}m]    (4)

where, h [n] and g [n] denote the low-pass and high-pass filters, respectively.

Fig. 1: 
Wavelet packet at depth 3 
In Eq. 3 and 4, ψ [n] is the wavelet
function. Parameters j and p are the decomposition level and the node index
within the previous level, respectively (Wu and Lin, 2009).
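The recursive splitting of both approximation and detail branches can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: it uses unnormalized Haar filters (h = (0.5, 0.5), g = (0.5, −0.5)) and hypothetical function names; a real system would use longer orthogonal filters (e.g., Daubechies) with proper boundary handling.

```python
def wp_split(signal, h, g):
    """One filter-bank step: convolve with low/high-pass filters, downsample by 2."""
    approx, detail = [], []
    for i in range(0, len(signal) - 1, 2):
        pair = signal[i:i + 2]
        approx.append(sum(c * f for c, f in zip(pair, h)))
        detail.append(sum(c * f for c, f in zip(pair, g)))
    return approx, detail

def wavelet_packet(signal, depth, h=(0.5, 0.5), g=(0.5, -0.5)):
    """Recursively split BOTH the approximation and the detail branch (the WPT
    property, unlike the DWT), returning the 2**depth leaf subbands of the
    binary tree (Fig. 1)."""
    if depth == 0:
        return [signal]
    a, d = wp_split(signal, h, g)
    return wavelet_packet(a, depth - 1, h, g) + wavelet_packet(d, depth - 1, h, g)

# A 3-level tree over an 8-sample signal yields 2**3 = 8 leaf subbands.
leaves = wavelet_packet([4, 2, 6, 8, 1, 3, 5, 7], depth=3)
```

The full binary tree (rather than the one-sided DWT tree) is what gives the wavelet packet its uniform frequency resolution across subbands.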
Modular arithmetic: Modular arithmetic appears in number theory,
group theory, ring theory, knot theory, abstract algebra, cryptography, computer
science, chemistry and the visual and musical arts. Modular arithmetic has
low computational complexity (Cormen et al., 2001)
and performs efficiently on large numbers. Wong and Blow
(2006) presented a logical design of an all-optical processor that performs
modular arithmetic. Modular arithmetic can also be used to process the headers of
all-optical packets on the fly (Wessing et al., 2002).
In mathematics, modular arithmetic is a system of arithmetic for integers in which numbers wrap around after they reach a certain value. Timekeeping on a clock provides an example of modular arithmetic (if the time is 7:00 now, then 8 h later it will be 3:00). The Swiss mathematician Leonhard Euler pioneered the modern approach to congruence around 1750, when he explicitly introduced the idea of congruence modulo a number N. Modular arithmetic was further developed by Carl Friedrich Gauss in his book Disquisitiones Arithmeticae, published in 1801 (Encyclopedia Britannica, 2010).
In fact, the notion of modular arithmetic is related to that of the remainder
in division. The operation of extracting the remainder is sometimes referred
to as the modulo operation. Modular arithmetic can be handled mathematically by
introducing a congruence relation on the integers that is compatible with
the operations of the ring of integers: addition, subtraction and multiplication.
For a positive integer n, two integers a and b are said to be congruent modulo n,
written:

a ≡ b (mod n)

if their difference a − b is an integer multiple of n. The number n is called the modulus of the congruence.
In the computer science discipline, the remainder operator is usually indicated by either "%" (e.g., in C, Java, JavaScript, Perl and Python) or "Mod" (e.g., in BASIC, SQL, Haskell and Matlab). These operators are commonly pronounced "mod"; however, it is specifically a remainder that is computed. The remainder operation can be represented using the floor function.
If a ≡ b (mod n), where n > 0, the remainder b is calculated as:

b = a − n ⌊a/n⌋

where ⌊a/n⌋ is the largest integer less than or equal to a/n; then a ≡ b (mod n) and 0 ≤ b < n. If instead a remainder b in the range −n ≤ b < 0 is required, then:

b = a − n ⌈a/n⌉

where ⌈a/n⌉ is the smallest integer greater than or equal to a/n.
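The two remainder conventions above can be checked with a short Python sketch (the function names are ours, not the paper's):

```python
import math

def mod_floor(a, n):
    """Remainder b = a - n*floor(a/n), giving 0 <= b < n.
    This is the convention Python's % operator uses."""
    return a - n * math.floor(a / n)

def mod_ceil(a, n):
    """Remainder b = a - n*ceil(a/n), giving -n <= b < 0
    when a is not divisible by n."""
    return a - n * math.ceil(a / n)

# Clock example from the text: 7:00 plus 8 hours is (7 + 8) mod 12 = 3:00.
three = mod_floor(7 + 8, 12)
```

Note that for negative a the floor convention still yields a non-negative remainder (e.g., −7 mod 3 = 2), which is what makes it suitable for histogram-style feature counting later in the paper.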
In this work, the WP tree consists of three stages; counting the original signal node, it has G_{High} = 2^{3} high-pass nodes and G_{Low} = 2^{3} low-pass nodes. More generally, for a q-stage tree there are G = G_{High} + G_{Low} − 1 = 2^{q+1} − 1 nodes in total, the original signal node being shared between the two counts.
For a given orthogonal wavelet function, a library of wavelet packet bases
is generated. Each of these bases offers a particular way of coding signals,
preserving global energy and reconstructing exact features. The wavelet packet
is used to extract additional features to guarantee a higher recognition rate.
In this study, the WPT is applied at the feature extraction stage, but the resulting
data are not suitable for a classifier because of their great length. Thus,
it is essential to seek a better representation of the speech features.
Previous studies showed that using the entropy of WP coefficients as features in recognition
tasks is efficient. Avci and Akpolat (2006) proposed
a method to calculate the entropy value of the wavelet norm in digital modulation
recognition. In the biomedical field, Behroozmand and Almasganj
(2007) presented a combination of a genetic algorithm and the wavelet packet transform
used in pathological evaluation, where the energy features are determined from
a group of wavelet packet coefficients. Kotnik et al.
(2003) proposed a robust speech recognition scheme for noisy environments
by means of wavelet-based energy as a threshold for denoising estimation. In
the study of Wu and Lin (2009), the energy indexes of
the WP were proposed for speaker identification. Sure entropy is calculated for
the waveforms at the terminal-node signals obtained from the DWT (Avci,
2009) for speaker identification. Avci (2007) proposed
a feature extraction method for speaker recognition based on a combination of
three entropy types (sure, logarithmic energy and norm). In this study,
modular arithmetic is proposed to decrease the number of features in each WP node.
For a speech signal in a WP node,
{u (t)} = {u (t_{1}), u (t_{2}), ..., u (t_{M})}, where M is the length of u (t), the remainder b is written as mod (n)_{u (t)} = 0, 1, 2, ..., n − 1, for n > 2. In this study, the number of repetitions of each remainder mod (n)_{u (t)} is utilized as a modular arithmetic wavelet speech signal feature vector:

WR (n) = r (0), r (1), r (2), ..., r (n − 1)

where r is the number of repetitions of the remainder mod (n) computed on the WP node signal and R denotes the same count computed on the speech signal without WP (Fig. 2).
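The WR (n) feature vector amounts to a histogram of remainders. A minimal sketch follows, with the assumption (not spelled out in the text) that the real-valued WP coefficients are first quantized to integers before the modulo operation:

```python
def mwm_features(coeffs, n):
    """WR(n) = [r(0), ..., r(n-1)]: r(k) counts how often remainder k occurs
    among the integer-quantized WP node coefficients."""
    counts = [0] * n
    for c in coeffs:
        counts[int(c) % n] += 1  # int() quantization is an assumption here
    return counts
```

For example, `mwm_features([3, 7, 10, 14, 21], 7)` counts three coefficients with remainder 0 and two with remainder 3. The resulting vector length equals the modulus n regardless of the node signal's length, which is how the method compresses an arbitrarily long WP node into only n features.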
Verification: Depending on the application, the general area of speaker
recognition is divided into two specific tasks: identification and verification.
In speaker identification, the goal is to decide which one of a group of known
voices best matches the input voice sample. This is also referred to as closed-set
speaker identification. Applications of pure closed-set identification are limited
to cases where only enrolled speakers will be encountered, but it is a useful
means of studying the separability of speakers' voices or finding similar-sounding
speakers, which has applications in speaker-adaptive speech recognition.
In verification, the task is to decide from a voice sample whether a person is who
he or she claims to be. This is sometimes referred to as the open-set problem,
because this task requires distinguishing a claimed speaker's voice known
to the system from a potentially large group of voices unknown to the system
(i.e., imposter speakers) (Reynolds et al., 2000).
The basis of the presented verification systems is the modular arithmetic wavelet method (MWM) used to represent speakers. More specifically, the feature vectors extracted from a person's speech are modeled by a speaker model. For a D-dimensional feature vector denoted x, the model for a speaker is defined as the average vector calculated over fifteen different utterances of the same speaker. This average vector is used to represent the hypothesized speaker, to be compared with the background speakers, which represent the imposter speakers. The background speaker model is defined as the average vector calculated over fifteen different utterances of many speakers.
Hypothesized speaker model: The basis of the verification systems is to create a feature vector pattern used to represent speakers. Therefore, in this study, WR is utilized to represent the hypothesized speaker model. More specifically, the average of the WR feature vectors extracted from fifteen different utterances of a person is used to represent each speaker pattern. This feature vector is compared with the claimed speaker's WR using the verification functions.
Background speaker model: The background speakers should be selected
to signify the population of expected imposters, which is in general application
specific. Two issues that come up with the use of background speakers are the
selection of the speakers and the number of speakers to use. Ideally the number
of background speakers should be as large as possible to better model the imposter
population but practical considerations of computation and storage state a small
set of background speakers.

Fig. 2: 
First 150 coefficients of the feature vector extracted by
MWM for two speakers 
In the verification experiments, the number of background speakers is set
to three. In the proposed scenario, it is assumed that imposters will attempt to
gain access only as similar-sounding or at least same-sex speakers.
The single-speaker detection task can be defined as a basic hypothesis test
between:

H_{0}: Y is from the hypothesized speaker S and
H_{1}: Y is not from the hypothesized speaker S

The optimum test for choosing between these two hypotheses is the following ratio test:

V (Y, H_{0}) / V (Y, H_{1}) ≥ θ: accept H_{0}; otherwise: accept H_{1}

where, V (Y, H_{i}), i = 0, 1, is the verification function for the hypothesis H_{i} evaluated for the observed speech segment Y and θ is the decision threshold.
Verification functions: In this study three verification functions are
proposed:
• 
Percentage Root Mean Square Difference Score (PRDS): This verification
function is based on the distance concept given by: 
where, MW_{Y} is the WR taken for the observed speech segment Y at level three of the WP and MWM_{Hi} is the average of the WR vectors calculated for fifteen different utterances of the same speaker (hypothesized speaker, H_{0}) or the background model (H_{1}). The decision threshold for accepting or rejecting is determined by:
• 
Logarithmic Claimed-to-Signal Ratio Score (CSRS): 
The decision threshold for accepting or rejecting is determined by:
• 
Correlation Coefficient Function: This function, denoted
CC, is calculated between MW_{Y} and MWM_{Hi}. The decision
threshold is obtained by: 
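Two of the three verification functions can be sketched in Python. The paper's exact PRDS equation is not recoverable from this copy of the text, so the sketch below uses the standard percentage root-mean-square difference, with the assumption that the denominator normalizes by the test vector; CC is the ordinary Pearson correlation. Function names are ours.

```python
import math

def prds(mw_y, mwm_h):
    """Percentage RMS difference between the test vector MW_Y and a stored
    model MWM_H (normalizing by the test vector is an assumption)."""
    num = sum((y - h) ** 2 for y, h in zip(mw_y, mwm_h))
    den = sum(y ** 2 for y in mw_y)
    return 100.0 * math.sqrt(num / den)

def corr_coeff(x, y):
    """Pearson correlation coefficient between MW_Y and MWM_Hi."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

A PRDS of 0 and a CC near +1 both indicate a close match between the claimed speaker's WR and the stored model.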
Verification by Gaussian mixture model: The Gaussian Mixture Model (GMM) has recently become the dominant approach in text-independent speaker identification and verification. One of the powerful attributes of GMMs is their capability to form smooth approximations to arbitrarily shaped densities. As a typical model-based approach, the GMM has been utilized to characterize a speaker's voice in the form of a probabilistic model. It has been reported that the GMM approach outperforms other classical methods for text-independent speaker recognition. The mathematical development of the GMM-based speaker verification scheme is briefly introduced in the following:
Given Y = {Y_{1}, Y_{2}, ..., Y_{K}}, where Y_{j} = {y_{1}, y_{2}, ..., y_{Tj}} is a sequence of T_{j} feature vectors in the jth cluster R^{j}, the complete GMM speaker model λ is characterized by the mean vectors, covariance matrices and mixture weights of all component densities. The parameters of the speaker model are denoted by:

λ_{j} = {p_{j,i}, u_{j,i}, Σ_{j,i}}, i = 1, 2, ..., M_{j} and j = 1, 2, ..., K

Then, the GMM likelihood can be written as in Eq. 14, where the Gaussian mixture density for the jth cluster is defined as a weighted sum of M_{j} component densities (Lung, 2007). The Expectation-Maximization (EM) algorithm is used to obtain maximum-likelihood estimates of the parameters of a Gaussian mixture model with k components for data in the n-by-d matrix X, where n is the number of observations and d is the dimension of the data. In this study, verification is performed by building a Gaussian mixture model by EM with 2 components from the MWM feature vectors of two speakers (GMMW). The GMM likelihood is then used for the accept/reject verification decision, which is accomplished by determining an empirical decision threshold.
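The weighted-sum-of-Gaussians density can be illustrated in one dimension with fixed parameters. This is a hedged sketch: the paper's model is multivariate and fitted by EM, both of which are omitted here, and the function names are ours.

```python
import math

def gaussian(x, mu, var):
    """Univariate Gaussian density N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gmm_density(x, weights, means, variances):
    """Mixture density: a weighted sum of component Gaussian densities."""
    return sum(w * gaussian(x, m, v)
               for w, m, v in zip(weights, means, variances))

def gmm_log_likelihood(samples, weights, means, variances):
    """Log-likelihood of a sample sequence under the mixture; in verification
    this score is compared against an empirical threshold to accept or reject."""
    return sum(math.log(gmm_density(x, weights, means, variances))
               for x in samples)
```

The smooth-approximation property mentioned above comes from the weighted sum: with enough components, the mixture can approximate an arbitrarily shaped density.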
Verification by K-Means clustering method: In this section, a brief outline
of the K-Means clustering algorithm and of verification by this method is presented.
Clustering in N-dimensional Euclidean space R^{N} is the process of
partitioning a given set of n points into a number, say K, of clusters based
on some similarity metric that establishes a rule for assigning patterns to
the domain of a particular cluster centroid, as seen in Fig. 3.
Let the set of n points {x_{1}, x_{2}, ..., x_{n}} be
represented by the set S and the K clusters be represented by C_{1},
C_{2}, ..., C_{K} (Bandyopadhyay and Maulik,
2002). Then C_{i} ≠ φ for i = 1, ..., K; C_{i} ∩
C_{j} = φ for i ≠ j, i = 1, ..., K, j = 1, ..., K; and the union of all C_{i} is S.
K-Means (Hemalatha and Vivekanandan, 2008; Wagstaff
et al., 2001) is one of the most commonly used clustering techniques
and is an iterative hill-climbing algorithm. It consists of the following
steps:
• 
Choose K initial cluster centroids Z_{1}, Z_{2},
..., Z_{K} randomly from the n points {x_{1}, x_{2},
..., x_{n}} 
• 
Assign each point x_{i}, i = 1, 2, ..., n to the cluster C_{j},
j ∈ {1, 2, ..., K}, for which ||x_{i} − Z_{j}|| ≤
||x_{i} − Z_{p}||, p = 1, 2, ..., K and j ≠ p 
• 
Calculate the new cluster centroids Z_{i}* = (1/n_{i}) Σ_{x_{j} ∈ C_{i}} x_{j},
i = 1, 2, ..., K, where n_{i} is the number of elements belonging to
cluster C_{i} 
• 
If Z_{i}* = Z_{i} for i = 1, 2, ..., K,
then end. Otherwise continue from step 2 
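The four steps above can be sketched for one-dimensional points in plain Python. This is an illustrative sketch with assumed names, not the study's implementation:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means on 1-D points: random initial centroids (step 1),
    nearest-centroid assignment (step 2), centroid update (step 3),
    stop when the centroids no longer move (step 4)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:  # step 2: assign each point to its nearest centroid
            j = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[j].append(p)
        new = [sum(c) / len(c) if c else centroids[i]  # step 3: new centroids
               for i, c in enumerate(clusters)]
        if new == centroids:  # step 4: convergence test
            break
        centroids = new
    return sorted(centroids)

# Two well-separated groups: the loop converges to the two group means.
cents = kmeans([1.0, 1.2, 0.8, 10.0, 10.5, 9.5], k=2)
```

The hill-climbing character is visible in the loop: each iteration can only decrease (or leave unchanged) the total point-to-centroid distance, so the procedure converges to a local optimum that depends on the initialization.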
K-Means is a common clustering algorithm that has been used in a variety of
application disciplines, such as image clustering and information retrieval,
as well as speech and speaker recognition.

Fig. 3: 
K-Means data clustering with K = 4 
Different types of clustering algorithms based on K-Means are mentioned
by Wagstaff et al. (2001), such as a modified
version incorporating background knowledge, a genetic algorithm and the classification
of the syllable contour into several linear loci that serve as candidates for the
tone nucleus using the segmental K-Means segmentation algorithm.
Here, a new speaker verification system is investigated, based on a
K-Means feature extraction method (KMM) applied to speech signals. More specifically,
the presented verification method by K-Means clustering consists of two main
stages:
• 
Partition the points in the N-by-P data matrix X (two original
signals for two speakers) into two clusters. The two cluster centroid
locations in the 2-by-P matrix are then extracted; for each speaker, four columns
(8 coefficients) are preserved 
• 
Partition the points in the N-by-P data matrix X into four clusters.
Coefficients are then extracted as follows: 
• 
From the distances from each point to every centroid in the N-by-4 matrix D,
four coefficients are determined: the mean value, standard deviation, maximum
and variance 
• 
The four cluster centroid locations in the 4-by-P matrix C (16 coefficients) 
• 
The sums of point-to-centroid distances in the 1-by-4 vector M (4 coefficients) 
• 
The first 32 elements of the N-by-1 vector I containing the cluster index
of each point 
In total, a 64-coefficient vector is extracted by this method for each speaker
from five signal frames of 1 sec duration (320 coefficients in total
for each speaker). The verification decision is taken based on the method presented
by Eq. 10, 12 and 13.
RESULTS AND DISCUSSION
A testing database was created for the Arabic language. The recording environment
is a normal office environment via a PC sound card, with a spectral frequency of 4000
Hz and a sampling frequency of 16000 Hz. The Arabic utterances are the spoken
digits from 0 to 15. In addition, each speaker read ten separate, different 30-sec
Arabic texts. In total, 84 individual speakers (19 to 40 years old), 54
male and 30 female, spoke these Arabic words and texts for training
the method (creating a hypothesized speaker model for each individual speaker
and the background speaker model).

Fig. 4: 
Speaker verification rate results for 84 speakers 
The total number of tokens considered for training was 2100.
Experiments were conducted on a subset of the recorded database consisting of 54 male and 30 female speakers with different spoken words and texts. First, the feature vector was created by extracting MWM from silence-removed data for each frame. Finally, the verification process was performed using the three verification functions: for acceptance, at least two of the three verification function scores should be more than the threshold.
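The acceptance rule, at least two of the three verification scores above their thresholds, is a simple majority vote. A sketch with assumed names follows; the thresholds themselves are empirical, and a greater-than comparison is used for all three scores, as the text states:

```python
def accept_claim(scores, thresholds):
    """Accept the claimed speaker when at least 2 of the 3 verification
    function scores (PRDS, CSRS, CC) exceed their respective thresholds."""
    votes = sum(1 for s, t in zip(scores, thresholds) if s > t)
    return votes >= 2
```

The 2-of-3 vote makes the decision robust to a single verification function misfiring on a given utterance.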
Speaker verification is a binary decision task that states whether a test utterance belongs to a claimed speaker or not (hence, to an outside imposter). Evaluations were carried out on the pool of 84 speakers, with the individual speaker features constructed using 100% of the data and the imposter speaker model obtained from 100% of the utterances belonging to all speakers. For individual speaker versus same-speaker verification with different utterances (the speaker-speaker system), 25 trials were run for each speaker. For individual imposter versus speaker verification (the imposter-speaker system), 25 trials were also run for each speaker. The verification rates for the 84 speakers are illustrated in Fig. 4. All experiments followed the text-independent setting.
A single run of the speaker verification task consists of scoring test files against
the speaker model and the background model. If the score ratio is greater than a
threshold for at least two of the three verification functions specified
by Eq. 10, 12 and 13,
the test file is classified as the target speaker; otherwise, when the score
is less than this threshold for at least two of the three verification
functions, the test file is classified as an outside imposter. The performance
of the presented verification system for the speaker-speaker verification system
and the imposter-speaker verification system on the text-independent platform,
taken for different arithmetic moduli (denoted Mod) of seven, eleven and nineteen,
is reported in Table 1. The best result (91.61%) was achieved
for the speaker-speaker verification system with Mod 19.
In the next experiment, the performance of the MWVS system in the speaker-speaker and speaker-imposter platforms was compared with that of the combined GMM and MWM method (GMMW) and the K-Means method (KMM), both presented above, on the recorded database. The results of these experiments are summarized in Table 2. These results indicate that, under similar conditions, MWVS and GMMW provide a better platform for speaker verification than KMM. Moreover, the speaker-speaker system provides more accurate results than the imposter-speaker system for all three methods.
Following the assessment in the normal condition, experiments were conducted
to assess the speaker verification system in the speaker-imposter platform
under abnormal, noisy conditions.
Table 1: 
Verification rate results for different modular arithmetic 

Table 2: 
Speaker verification rate results for GMMW, KMM and MWVS 


Fig. 5: 
Verification rates in the speaker-imposter platform under
AWGN using the DWT for 84 speakers 
To implement this experiment, additive white Gaussian noise (AWGN) was added
to the verified signals only, with an SNR from 5 to 10 dB. The noisy signals
were then passed to the presented method, where no satisfactory recognition
rate was obtained. In this experiment, the performance of the speaker verification
system in the speaker-imposter platform was improved by using the approximation
sub-signals of the DWT at levels (j) 1, 2 and
3. The results of this experiment are illustrated in Fig. 5.
These results indicate that the proposed verification system can handle the additive
white Gaussian noise condition more robustly if the approximation sub-signal of the DWT
at level 3 is used instead of the original signal.
CONCLUSION
A modular arithmetic and wavelet packet based speaker verification system is
proposed in this study. The system was developed using three verification
methods. An effective feature extraction method for a text-independent
system is developed, taking into consideration that computational complexity
is a crucial issue. The experimental results on a subset of the recorded database
showed that the feature extraction method proposed in this paper is appropriate
for a text-independent verification system. Two other verification methods,
GMMW and KMM, are also proposed. The results of the experiments conducted in this study
demonstrated the better performance of MWVS in the text-independent verification task.
Finally, the developed speaker verification system was evaluated with data obtained
under abnormal conditions, where AWGN was added. In this case, it was observed
that the MWVS system can handle the additive white Gaussian noise condition more
robustly if the approximation sub-signal of the DWT at level 3 is used instead of the
original signal. Another major contribution of this research is the development
of a speaker verification system with lower computational complexity, based on modular
arithmetic, capable of dealing with abnormal conditions to a relatively good degree.