INTRODUCTION
The field of face detection has made significant progress in the past decade. Numerous approaches to face detection have been reported. Early works (before 2003) were nicely surveyed by Yang et al. (2002), who divided the various approaches into four categories: knowledge-based methods, feature invariant approaches (Tareeq et al., 2007), template matching methods (Park and Park, 2006) and appearance-based methods (Ishak et al., 2006; Pakazad et al., 2011). In general, appearance-based methods have shown performance superior to the others. Recently, more attention has been given to boosting-based face detection schemes, which have evolved into the de facto standard of real-world face detection.
The real-time face detector based on the AdaBoost algorithm proposed by Viola and Jones (2004) can be regarded as a milestone in face detection research. Real-time performance was achieved by learning a sequence of Haar-like features. Haar-like features encode differences in intensity between rectangular regions and can be calculated rapidly through the integral image. The AdaBoost algorithm selects a small number of distinct weak classifiers from the large family to which they belong and combines them into a strong classifier. Viola and Jones also adopted a cascaded structure to further speed up detection. Lienhart and Maydt (2002) used a set of rotated Haar-like features which significantly enriches the basic set of simple Haar-like features; the resulting detector had better accuracy than the Viola-Jones face detector. Mita et al. (2005) proposed the joint Haar-like feature to capture structural similarities within the face. It can be calculated very quickly and is robust against additive noise and changes in illumination. Yan et al. (2008) proposed the Locally Assembled Binary (LAB) Haar feature, in which several neighboring binary Haar features are assembled to capture their co-occurrence, an idea similar to LBP. They showed that the LAB feature is more efficient than Haar-like and LBP features in both discriminant power and computational cost. However, because of the large feature set and the weak-classifier selection procedure, AdaBoost training with these features for face detection still takes a long time.
In this study, we propose the so-called Neighboring-Block Binary Pattern (NBBP for short) features and develop a feature selection strategy based on the entropy of NBBP features to improve the learning efficiency of the AdaBoost algorithm. The idea is motivated by the information gain developed for Iterative Dichotomizer version 3 (ID3) (Quinlan, 1986; Christy and Thambidurai, 2006). Instead of exhaustively searching for the feature with minimal classification error in a brute-force manner, we compute the feature entropy for each NBBP and then directly choose the one with minimal entropy as the most capable for classification. Compared with Haar-like features, the total number of candidate NBBP features is much smaller, which accelerates the learning process. Furthermore, the NBBP feature is illumination invariant and more robust in reflecting the structure of the underlying patterns. This helps the face detection system learn the true facial structure from a training set of face and non-face images. The proposed entropy-based feature selection methodology provides a precise criterion for measuring the discrimination ability of the available features and can serve as a 'referee' to improve the efficiency of the AdaBoost algorithm.
NEIGHBORING-BLOCK BINARY PATTERN FEATURES
Traditional Haar-like rectangle features measure the difference between the sums of the pixels within rectangular regions. These features can capture the intensity gradient at different locations by varying the position, size, shape or arrangement of the rectangular regions. Viola and Jones (2004) applied three types of rectangle features for frontal face detection (Fig. 1). With the help of the integral image, these features can be computed in constant time, so Haar-like features allow real-time detection. However, Haar-like features are rather simple and have some limitations.
NBBP is a synonym of the Multi-Block Local Binary Pattern (MB-LBP) proposed by Zhang et al. (2007). The basic idea of encoding the NBBP features is to apply the LBP operator to neighboring rectangular blocks. Specifically, the original LBP operator was introduced by Ojala et al. (1996). It labels each pixel of an image by thresholding the values of the 8 surrounding pixels in its 3x3 neighborhood with the center pixel's value and then encoding the results as a binary number. The NBBP extends the LBP operator by applying it to 3x3 neighboring blocks. Some possible layouts of 3x3 neighboring blocks are shown in Fig. 2, where the number inside each block denotes its corresponding index. All blocks have the same size and the value of each block is the sum of all pixels in the block, which can be computed quickly through the integral image. The values of the surrounding blocks are compared with the value of the central block and encoded into a non-metric value, just like LBP.
We will use NBBP_{j} to denote the NBBP operator for the j^{th} NBBP feature. Its encoding scheme is described as:

NBBP_{j} (x^{j}) = Σ_{k=0}^{7} s (val (x_{k}^{j})−val (x_{c}^{j})) 2^{k}
(1)

Fig. 1: 
Examples of Haar-like features 

Fig. 2: 
Some possible layouts of NBBP features 

where, x^{j} represents the subimage cropped by the j^{th} NBBP, x_{k}^{j} represents the k^{th} block in x^{j} indexed using the scheme depicted in Fig. 2, x_{c}^{j} denotes the central block, val (x_{k}^{j}) represents the value of block x_{k}^{j} and:

s (u) = 1 if u ≥ 0, s (u) = 0 otherwise
(2)
An example of the NBBP feature value encoding scheme is shown in Fig. 3. The encoded binary value represents the feature value of an NBBP. Apparently, an NBBP feature covering a larger area carries information about global texture, while one with a smaller area captures local texture.
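To make the encoding concrete, the block sums and the thresholding step can be sketched in plain Python. This is only an illustrative sketch: the clockwise neighbour ordering used below is our own assumption, since the actual block indexing is fixed by the layouts in Fig. 2.

```python
def integral_image(img):
    """Integral image: ii[r][c] = sum of img[0:r][0:c]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def block_sum(ii, top, left, h, w):
    """Sum of the h-by-w block whose top-left pixel is (top, left):
    four lookups in the integral image."""
    return (ii[top + h][left + w] - ii[top][left + w]
            - ii[top + h][left] + ii[top][left])

def nbbp_code(img, top, left, bh, bw):
    """NBBP value of the 3x3 grid of bh-by-bw blocks anchored at (top, left).

    The neighbour ordering (clockwise from the top-left block) is an
    illustrative choice; the paper's Fig. 2 fixes the real indexing.
    """
    ii = integral_image(img)
    val = [[block_sum(ii, top + r * bh, left + c * bw, bh, bw)
            for c in range(3)] for r in range(3)]
    center = val[1][1]
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for k, (r, c) in enumerate(order):
        if val[r][c] >= center:        # s(u) = 1 when u >= 0
            code |= 1 << k
    return code
```

Because only the signs of block-sum differences are kept, adding a constant to every pixel or scaling all intensities by a positive factor leaves the code unchanged, which is exactly the illumination invariance discussed below.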

Fig. 3: 
A demonstration of NBBP feature value encoding scheme 
Thus NBBP features can detect diverse image structures such as edges, lines, spots, flat areas and corners. Besides, NBBP is encoded by comparison, so it does not need any lighting normalization; in other words, NBBP features are illumination invariant. Another advantage of the NBBP feature is that the total number of NBBP feature candidates is much smaller than that of the Haar-like ones. Considering an 18x18 subwindow, for example, the ratio of the total number of NBBP candidates to Haar-like ones is about 1 to 20 (2,601 : 51,705). This greatly reduces the time needed in the feature selection stage of the AdaBoost algorithm, to be detailed shortly. The experimental results of Zhang et al. (2007) also reveal that NBBP features are more effective than Haar-like features and the original LBP features.
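The 2,601 figure can be reproduced by a short enumeration. The sketch below assumes a square sub-window and counts every integer block size whose whole 3x3 grid fits inside it:

```python
def count_nbbp_candidates(win):
    """Count every (top, left, block_h, block_w) whose 3x3 block grid
    fits entirely inside a win-by-win sub-window."""
    count = 0
    for bh in range(1, win // 3 + 1):
        for bw in range(1, win // 3 + 1):
            # number of valid top-left anchors for this block size
            count += (win - 3 * bh + 1) * (win - 3 * bw + 1)
    return count

# count_nbbp_candidates(18) -> 2601
```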
NBBP ENTROPY
In thermodynamics, entropy is a measurement characterizing disorder, alias randomness (Laurendeau, 2005). Shannon (2001) connected such a measurement to the uncertainty of an information environment (Wang, 2011). What these two have in common is that the more uncertainty a system possesses, the larger its entropy. Pattern classification aims at categorizing things in order, i.e., with less uncertainty. Therefore, a machine-learning algorithm, e.g., AdaBoost, in fact seeks ways to minimize the system entropy effectively. In this section, we define the feature entropies of NBBPs associated with a given weighted training set, i.e., one in which each training sample is assigned a weight, and then take advantage of the feature entropies to make a clever feature selection for the AdaBoost algorithm.
Consider a training set T = {(x_{i}, y_{i}) : y_{i} ∈ {−1, +1}} = P∪N, where x_{i} represents the i^{th} gray-level image, y_{i} = −1 labels x_{i} as a negative example, y_{i} = +1 labels x_{i} as a positive example and P and N partition T into the subsets containing all positive and negative examples, respectively. Furthermore, we assume that all images in T have the same size. Each training example is assigned a weight (probability density) defined by D = {D (i)}, with D (i) being the weight associated with (x_{i}, y_{i}) ∈ T. Since D (i) is a probability density, it is obvious that Σ_{i} D (i) = 1.
Let us now focus on a particular NBBP_{j}. Applying NBBP_{j} to P and N, we obtain the corresponding encoded feature values; two histograms, h^{+}_{j} and h^{−}_{j}, associated with P and N, respectively, can then be constructed by accumulating the probability densities associated with the examples in the corresponding training subsets (Fig. 4). Specifically, the elements in h^{+}_{j} and h^{−}_{j} can be obtained by:

h^{+}_{j,v} = Σ_{(x_{i}, y_{i}) ∈ P} D (i) δ (NBBP_{j} (x_{i}^{j})−v)
(3)

h^{−}_{j,v} = Σ_{(x_{i}, y_{i}) ∈ N} D (i) δ (NBBP_{j} (x_{i}^{j})−v)
(4)

where, v = 0, . . . , 255 and:

δ (u) = 1 if u = 0, δ (u) = 0 otherwise
(5)

With h^{+}_{j} and h^{−}_{j}, the (expected) entropy ε_{j} for NBBP_{j} is defined by:

ε_{j} = Σ_{v=0}^{255} (h^{+}_{j,v}+h^{−}_{j,v}) e_{j,v}
(6)

where,

e_{j,v} = −p^{+}_{j,v} log_{2} p^{+}_{j,v}−p^{−}_{j,v} log_{2} p^{−}_{j,v}, with p^{±}_{j,v} = h^{±}_{j,v}/(h^{+}_{j,v}+h^{−}_{j,v})
(7)

Based on the theory of entropy, one can easily see that the minimal and maximal possible values of ε_{j} are 0 and 1, respectively. ε_{j} = 0 corresponds to the case that e_{j,v} = 0 for all v = 0, . . . , 255, i.e., h^{+}_{j,v} = 0 and/or h^{−}_{j,v} = 0 for all v = 0, . . . , 255.

Fig. 4: 
NBBPhistogram construction from a training set 
This implies that each bin contains patterns of a single parity only. Therefore, NBBP_{j}, considered as a feature, can separate the positive and negative examples in T completely. On the other hand, ε_{j} = 1 corresponds to the case that p^{+}_{j,v} = p^{−}_{j,v} for all v = 0, . . . , 255. Such a case implies that NBBP_{j} is totally incapable of classification. The above observation reveals that the smaller the entropy of the feature selected for classification, the less misclassification should result.
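The histogram construction and the expected entropy defined above can be transcribed directly. The sketch below is our own rendering; the variable names are illustrative and empty bins contribute nothing, following the convention 0·log 0 = 0:

```python
from math import log2

def nbbp_entropy(values, labels, weights):
    """Expected entropy of one NBBP feature over a weighted training set.

    values[i]  -- NBBP code (0..255) of example i under this feature
    labels[i]  -- +1 (face) or -1 (non-face)
    weights[i] -- probability density D(i), summing to 1
    """
    h_pos = [0.0] * 256
    h_neg = [0.0] * 256
    for v, y, d in zip(values, labels, weights):
        (h_pos if y == 1 else h_neg)[v] += d
    entropy = 0.0
    for v in range(256):
        total = h_pos[v] + h_neg[v]
        if total == 0:
            continue                  # empty bin: no contribution
        for p in (h_pos[v] / total, h_neg[v] / total):
            if p > 0:
                entropy -= total * p * log2(p)
    return entropy
```

A feature that sends all positives to one bin and all negatives to another scores 0, while a feature whose bins mix the two classes equally scores 1, matching the extreme cases discussed above.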
ENTROPY-DIRECTED ADABOOST ALGORITHM BASED ON NBBP
In this section, we describe the weak classifier based on the NBBP feature and summarize the improved version of the AdaBoost algorithm, which takes advantage of NBBP feature entropy as guidance for feature selection.
Weak classifier construction: Because the value of an NBBP is non-metric, the weak classifier h_{j} (x) cannot be of the threshold type used by Viola and Jones (2004). A new type of weak classifier for NBBP features must be designed. Now, consider NBBP_{j} associated with the histograms h^{+}_{j} and h^{−}_{j}. The new type of weak classifier h_{j} (x), based on a multi-branch tree, is defined by:

h_{j} (x) = a_{v} if NBBP_{j} (x^{j}) = v
(8)

and:

a_{v} = +1 if h^{+}_{j,v} ≥ h^{−}_{j,v}, a_{v} = −1 otherwise, v = 0, . . . , 255
(9)
Therefore, given a pattern x, the output of h_{j} can be obtained by looking up the table associated with NBBP_{j}. Besides, this type of weak classifier is compact, i.e., memory-inexpensive, because each classifier needs only 32 bytes (256 bits) to store the table.
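As a sketch, the multi-branch lookup table can be built directly from the two weighted histograms. Function names are ours; ties in empty bins default to +1 here, an arbitrary illustrative choice:

```python
def build_weak_classifier(h_pos, h_neg):
    """256-entry lookup table a_v: +1 where the weighted face mass in bin v
    is at least the non-face mass, -1 otherwise.  Stored as one bit per
    entry in a real detector, this is the 32-byte table mentioned above."""
    return [1 if h_pos[v] >= h_neg[v] else -1 for v in range(256)]

def classify(table, code):
    """Evaluate the weak classifier on a pattern's NBBP code."""
    return table[code]
```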
Entropy-directed AdaBoost learning: Basically, the training procedure of AdaBoost trains an optimal weak classifier, say h_{t'}, at each iteration on weighted training samples, giving higher weights to samples that are currently misclassified. This is done T times, i.e., t = 1, …, T, over a sequence of adaptively reweighted samples. The final classifier, say H, is a linear combination of the weak classifiers from each training stage.

Fig. 5: 
Original discrete AdaBoost algorithm 

Fig. 6: 
Entropy-directed discrete AdaBoost algorithm 
For comparison, Fig. 5 and 6 list the original and the proposed Entropy-Directed AdaBoost algorithms, respectively. The shaded areas highlight the main difference between them.
The most time-consuming and critical part of the AdaBoost algorithm lies in the selection of the optimal weak classifier from a huge number of candidates. The original AdaBoost algorithm finds the optimal weak classifier by determining the minimal errors of all possible weak classifiers in a brute-force manner. A more efficient way to find the optimal weak classifier is to compute a reasonable index, such as the expected entropy defined by Eq. 6, to measure the optimality of each weak classifier. By comparing such indices, the most beneficial one comes to the fore naturally. As shown in the shaded area of Fig. 6, the present scheme finds the optimal weak classifier by simply evaluating the entropies of all possible classifiers. The computation of the single weighted error ε_{t} in Fig. 6 is needed only for obtaining the weight α_{t} of the currently selected classifier h_{t'} and for sample reweighting.
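The loop just described can be sketched as follows. This is our own reconstruction under the standard Discrete AdaBoost update rules, not the paper's exact listing in Fig. 6: the per-feature entropy evaluation replaces the brute-force error search, and the single weighted error is computed only for the chosen feature.

```python
from math import exp, log, log2

def expected_entropy(values, labels, weights):
    """Expected entropy of one feature over the weighted sample."""
    h_pos, h_neg = [0.0] * 256, [0.0] * 256
    for v, y, w in zip(values, labels, weights):
        (h_pos if y == 1 else h_neg)[v] += w
    e = 0.0
    for v in range(256):
        t = h_pos[v] + h_neg[v]
        if t > 0:
            for p in (h_pos[v] / t, h_neg[v] / t):
                if p > 0:
                    e -= t * p * log2(p)
    return e

def train(feature_values, labels, rounds):
    """feature_values[j][i]: NBBP code of example i under feature j."""
    n = len(labels)
    d = [1.0 / n] * n
    ensemble = []                      # (alpha, feature index, lookup table)
    for _ in range(rounds):
        # entropy-directed selection replaces the brute-force error search
        j = min(range(len(feature_values)),
                key=lambda jj: expected_entropy(feature_values[jj], labels, d))
        h_pos, h_neg = [0.0] * 256, [0.0] * 256
        for v, y, w in zip(feature_values[j], labels, d):
            (h_pos if y == 1 else h_neg)[v] += w
        table = [1 if h_pos[v] >= h_neg[v] else -1 for v in range(256)]
        # the single weighted error is needed only for alpha and reweighting
        err = sum(w for v, y, w in zip(feature_values[j], labels, d)
                  if table[v] != y)
        err = min(max(err, 1e-10), 1.0 - 1e-10)   # guard the logarithm
        alpha = 0.5 * log((1.0 - err) / err)
        ensemble.append((alpha, j, table))
        d = [w * exp(-alpha * y * table[v])
             for w, y, v in zip(d, labels, feature_values[j])]
        s = sum(d)
        d = [w / s for w in d]
    return ensemble

def predict(ensemble, codes):
    """codes[j]: the example's NBBP code under feature j."""
    score = sum(a * table[codes[j]] for a, j, table in ensemble)
    return 1 if score >= 0 else -1
```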
We conducted a simple experiment to demonstrate that the classifier constructed by the proposed approach yields better performance than Viola's approach. We used the popular CBCL Face Database to construct two 200-feature monolithic classifiers. All face and non-face images in the CBCL training set were used for training and all images in the CBCL testing set were used to evaluate performance. The first classifier was constructed by Viola's approach with Haar-like features. The second was constructed by the proposed Entropy-Directed Discrete AdaBoost algorithm with NBBP features. The comparative ROC curves are shown in Fig. 7 and the training times are shown in Fig. 8.

Fig. 7: 
Comparative ROC curves for the 200-feature classifiers 

Fig. 8: 
Comparative training times for the 200-feature classifiers 
According to the experimental results, the classifier based on the proposed approach with NBBP features is more effective and efficient than Viola's approach. This provides initial evidence that the proposed scheme is a more effective technique for face detection.
Face detection system by NBBP: Although a single-stage face detection system can be built using the aforementioned methodology, it can hardly achieve the real-time requirement. Faces in an image may appear at arbitrary locations and with different scales, so a face detection system needs to repeatedly scan the image thoroughly using windows of different sizes. Therefore, such a windowing detector must evaluate tens of thousands of location/scale combinations just to classify the few (e.g., 0 to 10) faces in an image of moderate size. To achieve real-time performance, we thus need a mechanism for early rejection of obvious non-face patterns.

Fig. 9: 
Structure of cascaded face detector 
The cascaded structure and the training algorithm proposed by Viola and Jones (2004) have been successfully applied to build real-time face detectors. A cascaded classifier performs well in detection while radically reducing computation time. The overall detection process is a degenerate decision tree, what we call a "cascade", shown schematically in Fig. 9. Our face detection system adopts such a cascaded structure. Basically, we used the cascaded training algorithm described by Viola and Jones (2004) to train our system, except that the weak classifiers of each stage classifier in Fig. 9 are trained using the Entropy-Directed AdaBoost algorithm described in Fig. 6.
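The early-rejection behaviour of the cascade can be sketched as below. The data layout is hypothetical: each stage pairs an ensemble of (alpha, feature index, lookup table) triples with an acceptance threshold.

```python
def cascade_classify(window_codes, stages):
    """Run a degenerate decision tree of stage classifiers.

    window_codes[j] -- the window's NBBP code under feature j
    stages -- list of (ensemble, threshold); ensemble is a list of
              (alpha, feature_index, lookup_table) triples
    A window is declared a face only if it passes every stage, so most
    non-face windows are discarded by the cheap early stages.
    """
    for ensemble, threshold in stages:
        score = sum(alpha * table[window_codes[j]]
                    for alpha, j, table in ensemble)
        if score < threshold:
            return False        # early rejection: later stages never run
    return True
```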
Our face dataset is composed of 24,000 face images obtained by mirroring 12,000 face images from LFW-a, which originated from the LFW face database (Huang et al., 2007). The faces in LFW-a were aligned using commercial face alignment software, so there are no in-plane rotated faces in our dataset. The face images were rescaled to 18x18 and partitioned into two subsets of equal size, one for training and the other for validation. Our non-face dataset is composed of 2,000 large images from bigfoto.com (http://www.bigfoto.com). Likewise, this dataset is divided into two subsets of equal size for training and validation, respectively. The training set is used to train the weak classifiers while the validation set is used to evaluate the detection rate and false positive rate of the trained cascaded classifier.
The 12,000 non-face examples used to train the first-stage classifier were collected by randomly sampling subimages (of size 18x18) from the 1,000 non-face images in the training set. The non-face examples used to train the classifiers of subsequent stages are the 12,000 false positives obtained by scanning the 1,000 non-face images in the training set with the current cascaded classifier.
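This bootstrapping step can be sketched as follows. `cascade_accepts` stands in for the current partially built cascade (a hypothetical callable); in practice the candidate windows would come from exhaustively scanning the non-face training images at all positions and scales.

```python
def collect_hard_negatives(candidate_windows, cascade_accepts, quota):
    """Keep only the windows the current cascade wrongly accepts
    (false positives); these become the next stage's non-face examples."""
    hard = []
    for window in candidate_windows:
        if cascade_accepts(window):
            hard.append(window)
            if len(hard) == quota:
                break
    return hard
```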
From the viewpoint of training, one difficulty is collecting meaningful examples to represent non-faces, since there are, in fact, infinitely many non-face images. If the majority of the non-face examples used for validation are obviously different from the face samples, the actual false positive rate of the resulting classifier may be unacceptable. Therefore, we partition the learning process into two phases, the shallow learning phase and the deep learning phase, depending on the difficulty of the non-face examples used for validation. Before learning, we randomly sample 1,000,000 non-face examples (of size 18x18) from the 1,000 non-face images for validation in the shallow learning phase. After the shallow learning phase is completed, we use the partially completed cascaded classifier to exhaustively scan the 1,000 non-face images in the validation set to collect all false positive examples. In the deep learning phase, we use only these latter examples for validation.
EXPERIMENTAL RESULTS
The resulting cascaded face detector constructed using the proposed scheme has 16 stages with 481 weak classifiers in total. Figure 10 shows the development of the detection rate and false positive rate on our validation data as the number of weak classifiers grows. In the figure, the vertical solid lines mark the start of the stage classifiers, the shaded area represents the shallow learning phase (the first 12 stages) and the non-shaded area (the last 4 stages) represents the deep learning phase.
To understand how the entropy and error rate change as learning proceeds, Fig. 11 shows the development of these two quantities as the number of weak classifiers grows. The figure reflects that the earlier stage classifiers achieve their subgoals (i.e., the assigned detection rate and false positive rate) rather easily; for example, the first 12 stages are made up of only 194 classifiers. Furthermore, both curves tend to flatten at their right ends. This reveals that the weak classifiers in the later stages in fact make decisions based on many uncommon detail features. Figure 11 also implies that a lower expected entropy leads to a lower weighted error. Hence, it is effective and reasonable to use the expected entropy of NBBP features to direct the feature selection process of the AdaBoost algorithm.
An experiment was conducted using the MIT/CMU test set, which consists of 130 gray-scale images with a total of 507 frontal faces. Figure 12 shows the output of our face detector on some test images from the MIT/CMU test set. On an Intel Core 2 Quad 2.33 GHz processor, our face detector can process 320x240 video at about 22 fps. For comparison, we draw the ROC curves of our face detector and of one trained using the Gentle AdaBoost algorithm with MB-LBP features on the same test set in Fig. 13.

Fig. 10: 
The development of detection rate and false positive rate
in training 

Fig. 11: 
The development of entropy and weighted error rate in training 

Fig. 12: 
Output of our face detector on a number of test images from
the MIT/CMU test set 

Fig. 13: 
ROC curves of two face detectors based on the MIT/CMU test set 
DISCUSSION
As can be seen from the comparative results for the monolithic classifiers in Fig. 7, the proposed NBBP features have higher discriminant power than the Haar-like ones. This is because NBBP features can detect diverse image structures. Furthermore, the NBBP feature is illumination invariant and more robust in reflecting the structure of the underlying patterns, which helps the face detection system learn the true facial structure from a training set of face and non-face images. From Fig. 8, the proposed approach was about 9 times more efficient than the original one. The most time-consuming and critical part of the AdaBoost algorithm lies in the selection of the optimal weak classifier from a huge number of candidates. The higher efficiency of the proposed approach can be attributed to two reasons. First, the total number of candidate NBBP features is much smaller than that of the Haar-like ones. Second, we used the expected entropy defined by Eq. 6 to measure the optimality of each weak classifier without a brute-force search. Compared with the original Discrete AdaBoost algorithm, the present scheme finds the optimal weak classifier by simply evaluating the entropies of all possible classifiers. We also compared the performance of the proposed scheme with Gentle AdaBoost based on MB-LBP features (Zhang et al., 2007). Gentle AdaBoost uses adaptive Newton steps to minimize a weighted squared error at each step of the feature selection process. According to Fig. 13, Gentle AdaBoost had a smaller false positive rate than the proposed scheme when the detection rate was below 88%. However, such a detection rate is not acceptable for real-world applications. For higher detection rates, i.e., above 90%, the present approach outperformed the Gentle AdaBoost algorithm with MB-LBP features.
To build a real-time face detector with our approach, we used the two-phase learning scheme, i.e., the shallow learning phase and the deep learning phase, depending on the difficulty of the non-face examples used for validation. Figure 10 shows the development of the detection rate and false positive rate on our validation data as the number of weak classifiers grows; the vertical solid lines mark the start of the stage classifiers and the shaded area represents the shallow learning phase. The false positive rate approached zero rapidly after the 5th stage classifier, indicating that the resulting cascaded face detector constructed using the proposed scheme can achieve a high detection rate and a very low false positive rate. However, the AdaBoost algorithm has the potential to overfit the training set because its objective is to minimize the error on the training set (Bylander and Tate, 2006). We therefore used another validation set of non-face images to train our face detector after the shallow learning phase. At the beginning of the deep learning phase, the non-shaded area in Fig. 10, the false positive rate increased noticeably, implying that our face detector was suffering from the overfitting problem. After the deep learning phase, our face detector does reduce the actual false positive rate, making the present face detector more robust for real-world applications.
CONCLUSIONS
In this study, we propose NBBP features for face detection and use entropy to guide the feature selection process of the Discrete AdaBoost algorithm. This avoids a brute-force approach to feature selection and, hence, improves the learning efficiency to a certain degree. The present experiments justify that our scheme works well in both learning efficiency and classification performance. Furthermore, the proposed concept can easily be applied to other AdaBoost-related algorithms, e.g., Gentle AdaBoost, RealBoost and FloatBoost.
ACKNOWLEDGMENTS
The authors would like to thank CMU for providing the face database.