INTRODUCTION
Support Vector Machine (SVM) (Cortes and Vapnik, 1995)
is a supervised learning algorithm that is widely used in statistical classification
and regression analysis. It maps an input vector into a higher-dimensional
space, where a hyperplane with the largest margin can be constructed
for linear classification. Based on statistical learning theory (Vapnik,
1998), the SVM is constructed by minimizing an upper bound on the expected
risk and follows the principle of Structural Risk Minimization (SRM) rather than the
classic Empirical Risk Minimization (ERM). Its design is elegant and it is
particularly well suited to problems with small samples,
nonlinearity and high dimensionality. Long-standing difficulties in the machine
learning field, such as model selection, the curse of dimensionality and local
minima, can be addressed within SVM theory. Moreover, many
classic learning models can be shown to be equivalent to SVM learning algorithms.
Therefore, the SVM and Statistical Learning Theory (SLT) are regarded as an excellent
basic framework for current machine learning theory.
Since its introduction, the SVM has been widely used as an effective
machine learning method in image analysis and processing, especially for
face detection and identification (Zhang et al.,
2011; Zeng and Ye, 2008; Zhang
and Zhang, 2009). Zhao et al. (2012) constructed a face identification system
based on the SVM in their research
experiments (Zeng and Ye, 2008). The Gabor wavelet
was used to perform a multi-scale transformation that extracts the texture features
of face images (Zhang and Zhang, 2009) and principal
component analysis was then applied before SVM-based face identification.
However, the face image features form a high-dimensional vector, so
the computational complexity of SVM-based learning algorithms
grows with the number of training samples (Shu et
al., 2011). At the same time, a large number of training samples is needed for
a face identification algorithm to reach adequate accuracy. Therefore, a random
sampling method based on Monte Carlo and the Bayesian support vector machine are combined
here to alleviate the problems of high learning dimension and long learning time in
face detection and identification.
PROBLEM DESCRIPTION
In order to find the people in an image, the face information should
be located first. The common method is to segment the image and then
check whether each segmented area contains face features.
In practice, the diversity of face shapes makes it impossible to describe
a face by a single modality function f, and faces with different shapes
are represented by different face feature values. If a variable x is used
to describe the face features, the modality function f is a function of
x. Clearly, if enough sample face images are available, the value ranges of the
face feature vector x and of the face modality function f can be determined
by statistical approaches. Thus, a statistical relation between the face
feature vector x and the face modality function f can be set up. Here, the
relation between the feature vector x and the modality function f
is assumed to be nonlinear and it becomes linear in φ(x) after x is
transformed by a nonlinear mapping φ. Thus, the relation between
the modality function f and the face features can be described by the following
linear regression problem (Eq. 1):
Here, the weight vector w and the threshold b are determined by the following unconstrained optimization problem (Eq. 2):
Here, the symbol L is the loss function. The linear regression problem can thus be solved by the SVM, where the loss function can be chosen from, e.g., the Laplace, Huber and ε-insensitive functions.
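As a minimal illustration of the loss functions named above, the following Python sketch evaluates the Laplace, Huber and ε-insensitive losses together with a regularized risk of the standard form 0.5·||w||² + C·Σ L. The function names, the default values of delta and eps and the identity feature map φ(x) = x are assumptions made for illustration and are not taken from the original method.

```python
def laplace_loss(residual):
    """Laplace (absolute) loss: |r|."""
    return abs(residual)


def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond (delta is assumed)."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


def eps_insensitive_loss(residual, eps=0.1):
    """Epsilon-insensitive loss: zero inside the tube |r| <= eps (eps is assumed)."""
    return max(0.0, abs(residual) - eps)


def regularized_risk(w, b, xs, ts, C, loss):
    """0.5 * ||w||^2 + C * sum_n loss(t_n - f(x_n)) for f(x) = w . x + b.

    The identity feature map phi(x) = x is an illustrative assumption.
    """
    reg = 0.5 * sum(wi * wi for wi in w)
    fit = 0.0
    for x, t in zip(xs, ts):
        prediction = sum(wi * xi for wi, xi in zip(w, x)) + b
        fit += loss(t - prediction)
    return reg + C * fit
```

The factor C trades the norm of w against the data fit: a larger C penalizes residuals more strongly relative to the size of w.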
Eq. 2 expresses the SRM principle approximately:
the first term is the VC term, the second term is the empirical
risk term and the regularization factor C balances the two terms. In order
to solve Eq. 2, the Lagrange multipliers α_{n}
and α*_{n} are introduced and the problem is transformed into a quadratic
programming problem, for which the reader is referred to Pereira et
al. (2011). The resulting SVM regression function is given by
Eq. 3:
THEORIES OF BAYESIAN SUPPORT VECTOR MACHINE
The Bayesian Support Vector Machine (BSVM) is an effective form of
the classic support vector machine. In one approach to BSVM theory, the weight
vector w is treated as the weight parameter within the evidence framework (Zhang
and Zhang, 2009). In another approach, the vector of modality function values f_{N}
= (f(x_{1}), f(x_{2}),…, f(x_{N}))^{T}
is used as the weight parameter. Here, the support vector machine that uses
f_{N} as the weight parameter is called the Bayesian support vector machine,
i.e., GSVM. The theory of the BSVM is detailed below.
Deduction with w as weight parameters: Following the evidence
framework (Zhang and Zhang, 2009), suppose the prior
probability of w is given by Eq. 4:
Here, the symbol ∝ is used instead of = because the normalization factor is ignored. The likelihood is then given by Eq. 5:
According to Bayes' formula (Shu et al.,
2011), the posterior distribution of the weight vector w can be obtained
from Eq. 4 and 5:
Here:
Maximizing the posterior distribution P(w|D) is equivalent to minimizing -log P(w|D). Since -log P(w|D) equals M(w) up to an additive constant, maximizing P(w|D) is equivalent to minimizing M(w), i.e.:
Comparing Eq. 8 with Eq. 2 shows that
they are equivalent. This means that the optimal value w_{MP}
obtained from the BSVM deduction is equivalent to the solution of the quadratic
programming problem in SVM theory.
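The equivalence can be illustrated numerically under simple assumptions. The sketch below assumes a scalar weight w, a Gaussian prior with precision α, a squared loss with noise precision β and a linear model f(x) = w·x; none of these choices is prescribed by the text, and the data values are made up. Under these assumptions M(w) = α·E_W + β·E_D is α times a regularized risk with C = β/α, so both objectives share the same minimizer.

```python
def squared_error(w, xs, ts):
    """E_D: half the sum of squared residuals of the model f(x) = w * x."""
    return 0.5 * sum((t - w * x) ** 2 for x, t in zip(xs, ts))


def neg_log_posterior(w, xs, ts, alpha, beta):
    """M(w) = alpha * E_W + beta * E_D, with E_W = w^2 / 2 (assumed Gaussian prior)."""
    return alpha * 0.5 * w * w + beta * squared_error(w, xs, ts)


def regularized_risk(w, xs, ts, C):
    """SVM-style objective 0.5 * w^2 + C * E_D (squared loss assumed)."""
    return 0.5 * w * w + C * squared_error(w, xs, ts)


def argmin_on_grid(objective, lo=-5.0, hi=5.0, steps=10001):
    """Return the grid point in [lo, hi] with the smallest objective value."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=objective)


# Made-up training data and hyperparameters, for illustration only.
xs = [0.5, 1.0, 1.5, 2.0]
ts = [1.1, 1.9, 3.2, 3.9]
alpha, beta = 2.0, 8.0

# Minimizer of the negative log posterior (the MAP estimate w_MP).
w_map = argmin_on_grid(lambda w: neg_log_posterior(w, xs, ts, alpha, beta))
# Minimizer of the regularized risk with C = beta / alpha.
w_svm = argmin_on_grid(lambda w: regularized_risk(w, xs, ts, beta / alpha))
# Closed-form minimizer of M(w) for this one-dimensional quadratic case.
w_closed = beta * sum(x * t for x, t in zip(xs, ts)) / (
    alpha + beta * sum(x * x for x in xs)
)
```

The grid search is used only to keep the sketch dependency-free; for the ε-insensitive loss the minimizer would instead be found by quadratic programming.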
Accordingly, the distribution of the expected output t of the BSVM can be given as Eq. 9:
From Eq. 5, the term P(t|x, w) in Eq. 9 takes the form of Eq. 10:
Substituting this and Eq. 6 into Eq. 9 gives P(t|x, D).
Deduction with f_{N} as weight parameters: When f_{N} is used as the weight parameter, the output f(x, w) of the BSVM is written as f(x). First, f(x) is assumed to be a Gaussian process with zero mean, so that its sample vector f_{N} = (f(x_{1}), f(x_{2}),…, f(x_{N}))^{T} also follows a Gaussian distribution, i.e.:
Here, K_{N} is the kernel matrix and the corresponding likelihood
function is given by Eq. 12:
Then, with f_{N} as the weight parameter, the posterior probability is given by Eq. 13:
Where:
According to the result of Zhang et al. (2011),
f^{T}_{N}K^{-1}_{N}f_{N} = ||w||^{2},
so Eq. 14 is equivalent to Eq. 7.
This means that the two deduction methods are consistent with each other. Thus,
when f_{N} is used as the weight parameter, the distribution of the
expected output t of the BSVM is given by Eq. 15:
IMPLEMENTING BSVM BY MONTE CARLO RANDOM SAMPLING
Hybrid Monte Carlo (HMC) random sampling offers a simple way to implement the BSVM algorithm. To this end, Eq. 9 is rewritten as Eq. 16:
Here, w_{n} (n = 1, 2,…, N_{c}) are N_{c} samples drawn from the posterior distribution P(w|D). When the squared loss function is assumed, P(t|x, w_{n}) is given by Eq. 17:
With the two equations above, the expected output distribution P(t|x, D) of the BSVM can be evaluated approximately. In contrast, the three other numerical methods, e.g., the variational method, must first obtain an approximation of the posterior distribution P(w|D) before the expected output distribution P(t|x, D) can be approximated. Avoiding this intermediate step is the main strength of the hybrid Monte Carlo method.
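A minimal sketch of the Monte Carlo approximation described above, assuming a squared loss so that each P(t|x, w_n) is a Gaussian centred on the model output with variance 1/β. The scalar linear model, the value of β and the synthetic draws that stand in for posterior samples are illustrative assumptions only.

```python
import math
import random


def gaussian_pdf(t, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (t - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def linear_model(x, w):
    """Hypothetical scalar model f(x, w) = w * x."""
    return w * x


def predictive_density(t, x, w_samples, beta, model=linear_model):
    """Approximate P(t | x, D) by averaging P(t | x, w_n) over the samples w_n."""
    var = 1.0 / beta  # noise variance implied by the assumed squared loss
    total = sum(gaussian_pdf(t, model(x, w), var) for w in w_samples)
    return total / len(w_samples)


rng = random.Random(0)
# Synthetic stand-in for N_c = 500 draws from the posterior P(w | D).
w_samples = [rng.gauss(2.0, 0.1) for _ in range(500)]
beta = 25.0

p_near = predictive_density(2.0, 1.0, w_samples, beta)  # t close to the typical output
p_far = predictive_density(3.0, 1.0, w_samples, beta)   # t far from the typical output
```

No closed-form approximation of P(w|D) is needed here: the samples are used directly, which is the advantage the text attributes to the sampling approach.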
Now, the key problem is how to draw samples from the posterior distribution P(w|D). Its expression is given by Eq. 6, which contains two undetermined parameters α and β. The optimal values α_{MP} and β_{MP} of these two parameters should therefore be evaluated before sampling. The following describes how α_{MP} and β_{MP} are obtained and how the sampling is carried out in practice:
• Optimal hyperparameters for the posterior distribution:
First, α and β are treated as hyperparameters; the
values α_{MP} and β_{MP} are then computed by the
second level of inference in the evidence framework, following the flow chart
in Fig. 1. As Fig. 1 shows,
the optimal values α_{MP} and β_{MP} are
obtained iteratively, where M_{b} is a constant
threshold. The definitions of E_{W}(w_{i}) and E_{D}(w_{i})
are given by the following two equations:
• Monte Carlo sampling process: Once the optimal values
α_{MP} and β_{MP} of the hyperparameters α
and β are available, samples can be drawn from the posterior distribution
P(w|D). The Metropolis algorithm could be used for the sampling,
but its random-walk behavior degrades performance. To
avoid this issue, the hybrid Monte Carlo method is used for
sampling. The hybrid Monte Carlo method can be regarded as a
special Monte Carlo method.

Fig. 1: Process to evaluate the optimal values α_{MP} and β_{MP} of the hyperparameters of the posterior distribution
Compared with the Metropolis algorithm, the HMC method makes two improvements. First, an additional ‘leapfrog’ step is used to avoid the random-walk phenomenon. Second, an auxiliary variable u is introduced and the posterior distribution P(w|D) is extended to P(w, u|D). Sampling from P(w, u|D) is easier than sampling from P(w|D). Thus, the samples {(w_{i}, u_{i})}^{N}_{i = 1} are drawn from P(w, u|D) and the components u_{i} are then discarded.
The HMC sampling flow chart is shown in Fig. 2a; it is similar to that of the Metropolis algorithm except for the additional ‘leapfrog’ step.
Given the starting point of the ‘leapfrog’ step, the values x_{M}
and u_{M} are generated after M leapfrog steps
and are accepted as the (i+1)-th sample, i.e.,
x_{i+1} and u_{i+1}, with probability A. The step length r of the leapfrog should
also be chosen carefully. If r is too large, the acceptance
probability A becomes too small, whereas if r is small, the number of leapfrog steps must
be increased correspondingly. According to
Gualdi et al. (2012), the leapfrog
step length r should change every time i is incremented, i.e.,
a varying rather than constant step length should be adopted. Here, a
constant step length is used in order to simplify the computations.
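The sampling procedure described above can be sketched as follows. This is a generic HMC sampler with a constant step length, not the authors' implementation; the Gaussian target energy, the step length, the number of leapfrog steps and all function names are assumptions chosen for illustration.

```python
import math
import random


def leapfrog(w, u, grad_energy, step, n_steps):
    """Run n_steps leapfrog steps of length `step` from position w and momentum u."""
    w, u = list(w), list(u)
    grad = grad_energy(w)
    u = [ui - 0.5 * step * gi for ui, gi in zip(u, grad)]  # initial half step in u
    for m in range(n_steps):
        w = [wi + step * ui for wi, ui in zip(w, u)]       # full step in w
        grad = grad_energy(w)
        scale = step if m < n_steps - 1 else 0.5 * step    # final step in u is a half step
        u = [ui - scale * gi for ui, gi in zip(u, grad)]
    return w, u


def hmc_sample(energy, grad_energy, w0, n_samples, step=0.2, n_steps=10, seed=0):
    """Draw samples from a density proportional to exp(-energy(w))."""
    rng = random.Random(seed)
    w = list(w0)
    samples, accepted = [], 0
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in w]               # fresh auxiliary variable
        h_old = energy(w) + 0.5 * sum(ui * ui for ui in u)
        w_new, u_new = leapfrog(w, u, grad_energy, step, n_steps)
        h_new = energy(w_new) + 0.5 * sum(ui * ui for ui in u_new)
        # Accept with probability A = min(1, exp(h_old - h_new)).
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            w = w_new
            accepted += 1
        samples.append(list(w))                            # u is discarded
    return samples, accepted / n_samples


# Hypothetical target: an axis-aligned Gaussian standing in for P(w | D).
MU = [1.0, -2.0]
VAR = [1.0, 0.25]


def energy(w):
    return 0.5 * sum((wi - m) ** 2 / v for wi, m, v in zip(w, MU, VAR))


def grad_energy(w):
    return [(wi - m) / v for wi, m, v in zip(w, MU, VAR)]


samples, accept_rate = hmc_sample(energy, grad_energy, [0.0, 0.0], n_samples=3000)
```

A longer step length lowers `accept_rate`, while a shorter one requires more leapfrog steps to travel the same distance, which is the trade-off noted in the text.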
NUMERICAL RESULTS
A training database of about 200 face images was constructed. The images were preprocessed so that only the facial features were included, i.e., eyes, brows, nose, mouth and so on. Once a given image is segmented into different areas, faces can be identified on the basis of these basic features. Twenty face images from the constructed database are displayed in Fig. 3. These typical face images differ in color, expression and pose (i.e., frontal, profile and inclined) while sharing the typical facial features.
The effect of face color can be removed by converting the images
to grayscale. However, the pivotal problem is how to account for the effects
of expression and shape on face feature extraction. In practice,
the face features extracted by digital image processing algorithms
contain face shape information in addition to the face features themselves, and
separating the two remains an open issue for any algorithm. Therefore,
the circularity of the face image is used to measure the amount of face shape
information. Here, the Gabor texture features and Hu invariant moments
are used as the face features, and the hybrid Monte Carlo random sampling
and the BSVM algorithm are combined to learn the statistical rule relating
the face features to the face shape. These features can then be used
to judge whether a given image fits the learned statistical
rule. If it does, the image is considered a
face image; otherwise it is not. Subsequently, for about 40 images randomly drawn
from the training database, the Gabor texture features and the seven Hu
invariant moments are extracted as the description. The Gabor texture features
are obtained from a filter bank with about three different scales and four
orientations: each filter is applied to the face
image and the mean and variance of its response are computed.
The Hu invariant moments are obtained by combining the zeroth-, first-,
second- and third-order moments.
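As an illustrative sketch of the Gabor texture description above, the following code builds a bank of three scales and four orientations and records the mean and variance of each filter response, giving 24 values per image. The kernel size, the wavelengths, the ratio of sigma to wavelength and the use of the real (cosine) part only are assumptions, since the text does not specify them; the striped test image is synthetic.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma):
    """Real part of a Gabor kernel: Gaussian envelope times a cosine carrier."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            x_rot = x * math.cos(theta) + y * math.sin(theta)  # axis of the carrier
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * x_rot / wavelength))
        kernel.append(row)
    return kernel


def filter_valid(image, kernel):
    """Correlate image with kernel at every position where the kernel fits fully."""
    k = len(kernel)
    responses = []
    for i in range(len(image) - k + 1):
        for j in range(len(image[0]) - k + 1):
            responses.append(
                sum(kernel[a][b] * image[i + a][j + b] for a in range(k) for b in range(k))
            )
    return responses


def gabor_features(image, wavelengths=(3.0, 5.0, 7.0), n_orient=4, size=7):
    """Return [mean, variance] of each filter response, scale-major order."""
    features = []
    for wavelength in wavelengths:
        for k in range(n_orient):
            theta = k * math.pi / n_orient
            kernel = gabor_kernel(size, wavelength, theta, sigma=0.5 * wavelength)
            resp = filter_valid(image, kernel)
            mean = sum(resp) / len(resp)
            var = sum((r - mean) ** 2 for r in resp) / len(resp)
            features.extend([mean, var])
    return features


# Synthetic 16x16 image of vertical stripes with a period of 5 pixels.
image = [[math.cos(2.0 * math.pi * j / 5.0) for j in range(16)] for _ in range(16)]
features = gabor_features(image)
```

For this striped image the filter whose carrier matches the stripe direction and period responds far more strongly than the perpendicular one, which is what makes the response statistics usable as texture features.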

Fig. 2(a-c): HMC flow chart and leapfrog sketch maps, (a) Flow chart, (b) M-step leapfrog generating w_{M} from w_{0} and (c) M-step leapfrog generating u_{M} from u_{0}

Fig. 3: Typical face images in the constructed training database
To verify the proposed face identification method, about 50, 75,
100, 125, 150, 175 and 200 face images were used to train the corresponding
BSVM and to extract the statistical rule relating the face features to the face
shape. Two training methods are compared, i.e., the hybrid
Monte Carlo method and the data regression method. The data regression
method relies on the support vector machine alone.

Fig. 4: Time cost comparison of hybrid Monte Carlo random sampling and the common training method

Fig. 5(a-b): Face identification results for (a) Hybrid Monte Carlo random sampling and (b) Common training method
If about n×N images are used to train the Bayesian support vector
machine, the data regression method uses all n×N face images in a single training
process, whose computational complexity increases exponentially with the
number of training images. The hybrid Monte Carlo random sampling method, in contrast,
draws about N images at random from the n×N images and trains the Bayesian
support vector machine on them, repeating this about n times. The results of the
n training runs are then averaged to obtain the final Bayesian support vector machine,
and the sum of the n training times is the total time of the hybrid
Monte Carlo random sampling method. Finally, the time needed to train the BSVM
with about 50, 75, 100, 125, 150, 175 and 200 face images is plotted in
Fig. 4; the experiments were run on a computer with an Intel(R) Core(TM)2 Duo CPU T5800
at 2.00 GHz and 1 GB of memory.
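The random-subset training scheme described above can be sketched as follows. A one-dimensional ridge regression stands in for the BSVM training step, and the synthetic data, the function names and the seeds are hypothetical; only the structure (draw N of the n×N samples, train, repeat n times, average) follows the text.

```python
import random


def fit_ridge_1d(xs, ts, alpha=1e-3):
    """Stand-in learner: w minimizing alpha/2 * w^2 + 1/2 * sum (t - w * x)^2."""
    return sum(x * t for x, t in zip(xs, ts)) / (alpha + sum(x * x for x in xs))


def train_by_random_subsets(xs, ts, subset_size, n_rounds, fit, seed=0):
    """Train on n_rounds random subsets of size subset_size and average the models."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    models = []
    for _ in range(n_rounds):
        chosen = rng.sample(indices, subset_size)  # N samples drawn without replacement
        models.append(fit([xs[i] for i in chosen], [ts[i] for i in chosen]))
    return sum(models) / len(models)


# Synthetic data: 200 samples (n = 4, N = 50) from t = 3 * x plus small noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
ts = [3.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]

w_avg = train_by_random_subsets(xs, ts, subset_size=50, n_rounds=4, fit=fit_ridge_1d)
w_full = fit_ridge_1d(xs, ts)  # single run on all n * N samples, for comparison
```

Each round only ever sees N samples, so when the cost of a single training run grows faster than linearly in the sample count, n small runs can be cheaper than one run on all n×N samples.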
As Fig. 4 indicates, the hybrid Monte
Carlo random sampling method reduces the training time of the Bayesian support
vector machine as the number of training face images increases. To compare
the training results of the two Bayesian support vector machine algorithms,
two pictures of people were segmented beforehand and then used for
face identification. The identification results are shown in
Fig. 5 and 6, respectively. Images with
a large face and low image complexity are identified with good accuracy.
For images with small faces and several people, however, the hybrid
Monte Carlo Bayesian SVM method still achieves a good identification result.

Fig. 6(a-b): Face identification results for people based on (a) Hybrid Monte Carlo random sampling and (b) Common training method
The common training method could not detect the small faces.
CONCLUSION
Owing to its particular advantages on problems with small samples, nonlinearity and high dimensionality, the SVM algorithm has been widely used in face detection and identification. However, a large number of face images is needed to train the support vector machine, and the training complexity increases exponentially with the number of training samples. Therefore, statistical learning theory and the support vector machine are integrated here for face image identification. The hybrid Monte Carlo random sampling is used to train the Bayesian support vector machine, and good identification performance is achieved with low training time.