INTRODUCTION
Demand for effective identity verification methods is growing rapidly across an
increasing range of applications. Meeting this demand has proved considerably
problematic because of the inherent limitations of traditional authentication
methods (Bubeck, 2003). Keys are easily misplaced, copied, stolen or mechanically
bypassed through various forms of stealth and artifice. PINs slip too readily from
memory or, if recorded for later reference, become accessible to impostors.
Biometric systems supersede these knowledge-based verification controls because
they rely on traits that belong to the legitimate user, for example the face,
fingerprints, voice, hand geometry or retina. However, verifying identity from
biometric data is replete with operational problems of its own, such as imperfect
imaging conditions, conditions that change from one sample to another, changes in
the user's physiological or behavioral characteristics, the user's interaction with
the sensor and ambient conditions such as temperature and humidity fluctuations.
In recent years, the use of multiple modalities has become an area of considerable
interest in biometric recognition. The main attraction of multimodal biometrics is
that it offers the opportunity to raise recognition accuracy beyond that
achievable with unimodal biometrics.
A multimodal biometric system requires an integration scheme to fuse the information
obtained from the individual modalities. This information has been shown to be
complementary (Jain et al., 2004; Indovina et al., 2003). Fusion can take place at
various levels, from the sensor level to the decision level (Jain et al., 2004).
Fusion at the matching score level is the most popular and most frequently used
method because of its better performance, intuitiveness and simplicity (Indovina
et al., 2003).
In this study, a neural network approach is proposed for combining data obtained from the face and voice modalities. The resilient backpropagation training algorithm was used for this purpose. The main aim was to explore the potential usefulness of the neural network technique and to investigate whether its properties of training and adaptability can enhance accuracy in a multimodal biometrics scenario. To assess the effectiveness of the suggested approach, it was compared with two well-known fusion schemes, namely Brute Force Search (BFS) and Genetic Algorithms (GA). The Mean Squared Error (MSE) was used as the performance indicator in these experiments.
Fusion techniques
Brute Force Search (BFS): This fusion technique can be used only when there are
two matcher types. The approach is based on the following equation
(Ma et al., 2005):
u = w x_{1} + (1 - w) x_{2}
where u is the fused score, x_{m} is the normalized score of the m^{th} matcher, m = 1 or 2 and w is a weighting (combination) factor in the range 0 to 1. The weight w is determined heuristically, by exhaustive search over this range, so as to minimize the Mean Squared Error (MSE) on the given development data.
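The exhaustive search described above can be sketched in a few lines. This is only an illustration under stated assumptions: the fused score is taken to be the weighted sum u = w*x1 + (1 - w)*x2, the MSE is measured against targets of 1 for clients and 0 for impostors, and the grid resolution (`steps`), the function names and the toy scores are all hypothetical rather than taken from the study.

```python
def mse(scores, labels):
    """Mean squared error between fused scores and 0/1 target labels."""
    return sum((s - t) ** 2 for s, t in zip(scores, labels)) / len(labels)


def brute_force_weight(x1, x2, labels, steps=100):
    """Scan w over a uniform grid on [0, 1] and keep the MSE-minimizing value."""
    best_w, best_err = 0.0, float("inf")
    for k in range(steps + 1):
        w = k / steps
        # Weighted-sum fusion of the two normalized matcher scores.
        fused = [w * a + (1.0 - w) * b for a, b in zip(x1, x2)]
        err = mse(fused, labels)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


# Toy development data (hypothetical): matcher 1 is reliable, matcher 2 is noisy.
x1 = [0.9, 0.8, 0.2, 0.1]
x2 = [0.6, 0.4, 0.5, 0.7]
labels = [1, 1, 0, 0]          # 1 = client trial, 0 = impostor trial
w, err = brute_force_weight(x1, x2, labels)
print(w, err)
```

On this toy data the search pushes all of the weight onto the reliable matcher, which illustrates why a single shared weight can do no better than the best single modality when the other modality is badly degraded.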
Genetic algorithms (GA): Genetic algorithms, with their elite preservation
strategy, have proved capable of searching for optimum solutions in a
multidimensional space while being less prone to becoming trapped in local minima.
They have been investigated intensively for optimization problems over recent
decades and several variants have been proposed in the literature (Kamepalli, 2001;
Jia et al., 2003; Castillo et al., 2007; Sun and Wan, 1995). They rely mainly on
biologically inspired operations such as reproduction, crossover, mutation and
selection, applied according to a predefined fitness or cost function.
Reproduction generates a population of candidates in some region of the space
(exploration), crossover gives birth to the offspring of the next generation,
mutation simulates small random variations of the genotype and selection preserves
only the elite, or 'best candidate', solutions (exploitation).
The algorithm starts by generating an initial population of candidates W_{i}^{0} = (w_{1}, w_{2}) with a uniform distribution. Then, for each candidate W_{i}, MSE(u) is computed, where:
u = w_{1} x_{1} + w_{2} x_{2}
where u is the fused score, x_{m} is the normalized score of the m^{th} matcher, m = 1 or 2 and w_{m} is the corresponding weight (obtained on some development data) in the interval 0 to 1, with no constraints. The best candidates W_{i}^{0}, i.e., those for which MSE(u) is minimal, are then selected and the others are discarded; the selected candidates are the elites on which the mutation and crossover operations are performed. At this stage a new population W_{i}^{j} is created for the next generation j, where i = 1 to N, N is the population size, j = 1 to M and M is the maximum number of generations. The process then iterates, returning to the computation of MSE(u_{i}^{j}), where:
u_{i}^{j} = w_{1} x_{1} + w_{2} x_{2}, with (w_{1}, w_{2}) = W_{i}^{j}
where u_{i}^{j} is the fused score for the i^{th} candidate of the population in the j^{th} generation.
In this technique, the population size remains unchanged across all generations. Once the predefined number of generations M has been reached, the best candidate
W, for which MSE(u_{i}^{j}) is minimal, is selected. However, no formal method yet exists in the GA literature for predefining the best GA parameters. Several runs are needed to find best-fit GA parameters, i.e., the population size, the number of generations, the reproduction, crossover and mutation schemes and, most importantly, the selection criterion, which here is that the MSE of the fused scores be minimal.
In this study, the crossover probability is 0.90, the mutation rate is 0.10,
the population size is 100, the number of generations is 10 and the fitness
function favours the weight vector for which the error rate of the fused
scores is minimal.
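The procedure above can be sketched as follows, using the parameter values quoted in the text (crossover probability 0.90, mutation rate 0.10, population size 100, 10 generations). Everything else is an assumption made for illustration: the fused score is taken as u = w1*x1 + w2*x2, the better half of each generation is kept as the elite set, crossover is a random blend of two parents, mutation is a small Gaussian perturbation clipped to [0, 1], and the toy scores are hypothetical. The study does not specify these operators.

```python
import random


def fused_mse(weights, x1, x2, labels):
    """Fitness: MSE of u = w1*x1 + w2*x2 against 0/1 targets (lower is better)."""
    w1, w2 = weights
    return sum((w1 * a + w2 * b - t) ** 2
               for a, b, t in zip(x1, x2, labels)) / len(labels)


def ga_weights(x1, x2, labels, pop_size=100, generations=10,
               p_cross=0.90, p_mut=0.10, seed=0):
    rng = random.Random(seed)
    # Initial population: weight pairs drawn uniformly from [0, 1] x [0, 1].
    pop = [(rng.random(), rng.random()) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection (assumed scheme): rank by fitness, keep the better half.
        pop.sort(key=lambda w: fused_mse(w, x1, x2, labels))
        elites = pop[: pop_size // 2]
        children = []
        while len(elites) + len(children) < pop_size:
            p, q = rng.sample(elites, 2)
            child = list(p)
            if rng.random() < p_cross:
                # Crossover (assumed scheme): random blend of the two parents.
                a = rng.random()
                child = [a * p[m] + (1 - a) * q[m] for m in range(2)]
            for m in range(2):
                if rng.random() < p_mut:
                    # Mutation (assumed scheme): Gaussian noise, clipped to [0, 1].
                    child[m] = min(1.0, max(0.0, child[m] + rng.gauss(0, 0.05)))
            children.append(tuple(child))
        pop = elites + children   # the population size stays constant
    return min(pop, key=lambda w: fused_mse(w, x1, x2, labels))


# Toy development data (hypothetical).
x1 = [0.9, 0.8, 0.2, 0.1]
x2 = [0.6, 0.4, 0.5, 0.7]
labels = [1, 1, 0, 0]
best = ga_weights(x1, x2, labels)
print(best, fused_mse(best, x1, x2, labels))
```

Because the elites are carried over unchanged, the best fitness in the population can never get worse from one generation to the next, which is the elite preservation property mentioned in the text.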
Neural network: The most popular neural network in pattern recognition
and decision making is the multilayer feedforward type. Backpropagation is the most
useful training algorithm for feedforward networks; it is used to calculate
the gradient of the network error with respect to the network's modifiable
weights. This gradient is then used in a simple stochastic gradient descent
algorithm to find weights that minimize the system error rate. Faster
algorithms exist that use heuristic techniques; two such heuristic modifications
are variable learning rate backpropagation and resilient backpropagation. In this
study, the resilient backpropagation training algorithm was used (Freeman
et al., 1998; Zainuddin and AbuHassan, 2005).
The purpose of using this algorithm is to eliminate the harmful effects of the
magnitudes of the partial derivatives: only the sign of each derivative determines
the direction of the weight update, while the size of the update is adapted
separately for each weight.
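To make the role of the derivative sign concrete, the following is a minimal sketch of one common resilient backpropagation variant (the one without weight backtracking). The increase and decrease factors and the step bounds are the values usually quoted for the algorithm, not values reported in this study, and the two-parameter error function is a hypothetical example chosen so that its partial derivatives differ in magnitude by a factor of a million.

```python
def rprop_step(weights, grads, prev_grads, steps,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One in-place Rprop update (variant without weight backtracking)."""
    for i, g in enumerate(grads):
        if g * prev_grads[i] > 0:
            # Same sign as last time: accelerate by growing the step size.
            steps[i] = min(steps[i] * eta_plus, step_max)
        elif g * prev_grads[i] < 0:
            # Sign flipped, so a minimum was overshot: shrink the step size.
            steps[i] = max(steps[i] * eta_minus, step_min)
        # Move against the gradient; only the sign of g is used, not its size.
        if g > 0:
            weights[i] -= steps[i]
        elif g < 0:
            weights[i] += steps[i]
        prev_grads[i] = g


def grad(w):
    # Gradient of the toy error E(w) = 1000*(w[0] - 3)**2 + 0.001*(w[1] + 2)**2.
    return [2000.0 * (w[0] - 3.0), 0.002 * (w[1] + 2.0)]


w = [0.0, 0.0]
prev = [0.0, 0.0]
steps = [0.1, 0.1]          # initial per-weight step sizes (hypothetical)
for _ in range(100):
    rprop_step(w, grad(w), prev, steps)
print(w)
```

Both coordinates approach their minima (3 and -2) at a comparable pace even though one partial derivative is a million times larger than the other, whereas plain gradient descent with a single learning rate would either diverge along the steep direction or barely move along the flat one.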
The proposed structure for fusing biometrics is shown in Fig. 1.
Fig. 1: Structure of the fusion neural network of scores from the individual biometric modalities
In this technique, a feedforward multilayer neural network (Freeman et al., 1998; Zainuddin and AbuHassan, 2005) is used with the following parameters:
An input layer with 2N nodes, N nodes for the face modality and N nodes for the voice modality.
One hidden layer of M nodes, used for fusing the biometrics, with M less than 2N.
An output layer of N nodes, each corresponding to a particular client to be recognized. For each client, represented by a vector concatenating both modalities, the output node assigned to that client is given the value 1 for the sake of recognition; the remaining output nodes, which correspond to impostors, are forced to zero.
In this technique, the scores for the face and voice modalities are each of size NxN. The combined scores for fusion are therefore of size Nx(2N), where each row is the score vector of a particular client.
The fusion process in the proposed scheme is achieved as follows:
where θ_{j} is the output of node j in the hidden layer, F is a simple transfer function (sigmoid) used for all nodes, S_{i}^{face} is a score vector for a particular client from the face modality, S_{i}^{voice} is a score vector for a particular client from the voice modality and w_{ij} is the synaptic weight connecting node i to node j. The term θ_{k} in Eq. 2 represents the output of node k in the output layer and w_{jk} is the synaptic weight between node j in the hidden layer and node k in the output layer.
In the development stage, the weight matrices w_{ij} and w_{jk} are adjusted in such a way that the network recognizes a particular client over impostors. As stated earlier, the accelerated resilient backpropagation algorithm used in this study aims to minimize the mean squared error using the Delta rule [9,10].
where η is the momentum of the gradient and E is the sum squared error between the actual output and the desired one, as follows:
Finally, the fusion is made explicitly within the network structure at the individual score level in the hidden layer. Therefore, the computation-based search
for best-fit synaptic weights for fusion is carried out simply by adapting the neural network for best-decision scores.
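The network structure described in this section (2N inputs, M hidden nodes, N outputs, sigmoid transfer function) can be sketched as a forward pass. This is an illustration only: the toy sizes, the random scores and weights and the absence of bias terms are assumptions, and training of the weights is not shown.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights):
    """Sigmoid layer: weights[j][i] connects input node i to output node j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def forward(face_row, voice_row, w_hidden, w_out):
    x = face_row + voice_row       # concatenated input vector of length 2N
    hidden = layer(x, w_hidden)    # M hidden activations (the fused scores)
    return layer(hidden, w_out)    # N outputs, one per client


N, M = 4, 3                        # toy sizes (hypothetical), with M < 2N
rng = random.Random(0)
face = [[rng.random() for _ in range(N)] for _ in range(N)]    # N x N scores
voice = [[rng.random() for _ in range(N)] for _ in range(N)]   # N x N scores
w_hidden = [[rng.uniform(-1, 1) for _ in range(2 * N)] for _ in range(M)]
w_out = [[rng.uniform(-1, 1) for _ in range(M)] for _ in range(N)]

# Desired output for client c: 1 at output node c, 0 at every other node.
targets = [[1.0 if k == c else 0.0 for k in range(N)] for c in range(N)]
outputs = [forward(face[c], voice[c], w_hidden, w_out) for c in range(N)]

# Sum squared error between the actual and desired outputs (untrained network).
sse = sum((o - t) ** 2
          for out, tgt in zip(outputs, targets) for o, t in zip(out, tgt))
print(sse)
```

Training would then adjust `w_hidden` and `w_out` to drive this error down, so that the fusion weights are learned inside the network rather than searched for separately.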
RESULTS AND DISCUSSION
The experimental studies are concerned with the score-level fusion of face and voice biometrics in the recognition mode of verification. The investigations were initially performed using scores for clean face images together with scores for degraded utterances.
In each experiment, the individual biometric score types involved were subjected
to range equalization using Min-Max normalization (Indovina
et al., 2003). In this study, score-level fusion is
based on brute force search, genetic algorithms and the neural
network. The procedures followed for speech feature extraction and speaker classification
were those of Fortuna et al. (2004). The
face recognition scores were based on the approach described by Zafeiriou
et al. (2006).
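The range equalization step mentioned above maps each matcher's raw scores onto a common [0, 1] scale. A minimal sketch follows; the function name and the raw score values are hypothetical, and in practice the minimum and maximum would normally be estimated on development data and passed in explicitly.

```python
def min_max_normalize(scores, lo=None, hi=None):
    """Map raw scores to [0, 1] using (s - min) / (max - min)."""
    lo = min(scores) if lo is None else lo
    hi = max(scores) if hi is None else hi
    if hi == lo:
        raise ValueError("scores have zero range; cannot normalize")
    return [(s - lo) / (hi - lo) for s in scores]


face_raw = [12.0, 30.0, 21.0]     # hypothetical raw scores on one scale
voice_raw = [-3.5, -1.0, -2.0]    # hypothetical raw scores on another scale
face_norm = min_max_normalize(face_raw)
voice_norm = min_max_normalize(voice_raw)
print(face_norm, voice_norm)
```

After normalization the two score types are directly comparable, which is what allows a single weight or a single network input layer to combine them.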
Fusion under varied data quality conditions: The datasets considered
for the face and voice modalities were extracted from the XM2VTS database (clean images)
(Zafeiriou et al., 2006) and from the 1-speaker
detection task of the NIST Speaker Recognition Evaluation 2003 (degraded speech),
respectively (Fortuna et al., 2004).
Using these datasets, a total of 140 client tests and 19460 (i.e., 140x[140-1])
non-client tests from the development data were used to investigate the performance
of the proposed schemes.
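The trial counts quoted above follow directly from an NxN score matrix with N = 140. The sketch below assumes, purely for illustration, that entry (i, j) holds the score of test sample i against the model of client j, so that diagonal entries are client tests and off-diagonal entries are non-client tests; the function name and the placeholder scores are hypothetical.

```python
def split_trials(score_matrix):
    """Split a square score matrix into client (diagonal) and non-client scores."""
    client = [row[i] for i, row in enumerate(score_matrix)]
    non_client = [s for i, row in enumerate(score_matrix)
                  for j, s in enumerate(row) if i != j]
    return client, non_client


N = 140
placeholder = [[0.0] * N for _ in range(N)]   # placeholder scores, not real data
client, non_client = split_trials(placeholder)
print(len(client), len(non_client))           # 140 and 140 * (140 - 1) = 19460
```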
The results of the verification experiments are presented as Mean Squared Errors (MSEs) (Table 1). The proposed neural network approach was compared with two well-known fusion schemes, namely Brute Force Search (BFS) and the Genetic Algorithm (GA). All three fusion approaches are mainly concerned with adjusting the balance of weighting in fusion in favour of the modality of better quality. Therefore, a method that introduces an appropriate weighting scheme should lead to better verification results.
In the first fusion technique (BFS), the weights are calculated heuristically,
by exhaustive search, in order to minimize the MSE on the given development data.
Nevertheless, the use of BFS increased the MSE relative to that obtained with the best
single modality. This increase could be due to the highly degraded speech
database involved in this study. In contrast, the use
of GA reduced the MSE of the fused biometrics by up to 86%, which results
from the ability of GA to assign best-fit weights to the biometric
scores.
Table 1: Effectiveness of BFS and GA in biometric verification based on mixed-quality data

It was also clear from the results that the neural network approach leads to the
best performance compared with BFS and GA (Table 1). In this
case, the reduction in MSE achieved with this fusion scheme was in excess
of 99% relative to the better single modality. These results confirm the usefulness
of neural networks in enhancing accuracy in multimodal biometric systems. This
could be because, during the training stage, the neural network successfully
tunes the connection weights to minimize the system error rates.
CONCLUSION
This study introduced the resilient backpropagation training algorithm for combining data obtained from the face and voice modalities. Among the three fusion methods considered, the neural network approach appears to offer considerable improvements in the accuracy of multimodal biometrics under varied data conditions, which seems to be related to the individual characteristics of the proposed approach. The fusion is made explicitly within the network structure at the individual score level, and the network connection weights can be tuned successfully during the training stage for best-decision scores. These preliminary results provide motivation for further research to exploit the training and adaptability properties of neural networks, along with fuzzy systems, in multimodal fusion.