Vladimir Vapnik. The SVM algorithm is based on statistical learning theory and the Vapnik-Chervonenkis (VC) dimension introduced by Vladimir Vapnik and Alexey Chervonenkis. Since its introduction, SVM has been applied to biological data mining^{[28]} and drug discovery^{[6,8]}.
In SVM, the Optimum Separation Hyperplane (OSH) is the linear classifier with the maximum margin for a given finite set of learning patterns. Consider the classification of two classes of patterns that are linearly separable, i.e., a linear classifier can perfectly separate them. The linear classifier is the hyperplane H (w•x+b = 0) with the maximum width (distance between hyperplanes H_{1} and H_{2}). Consider a linear classifier characterized by the set of pairs (w, b) that satisfies the following inequalities for any pattern x_{i} in the training set:

w•x_{i}+b ≥ +1 for y_{i} = +1
w•x_{i}+b ≤ -1 for y_{i} = -1

These inequalities can be expressed in compact form as:

y_{i}(w•x_{i}+b) ≥ +1

or

y_{i}(w•x_{i}+b)-1 ≥ 0

Because we have considered the case of linearly separable classes, each such hyperplane (w, b) is a classifier that correctly separates all patterns from the training set, i.e., y_{i}(w•x_{i}+b) > 0 for all i.
For all points from the hyperplane H (w•x+b = 0), the distance between the origin and the hyperplane H is |b|/||w||. We consider the patterns from the class -1 that satisfy the equality w•x+b = -1 and determine the hyperplane H_{1}; the distance between the origin and the hyperplane H_{1} is equal to |-1-b|/||w||. Similarly, the patterns from the class +1 satisfy the equality w•x+b = +1 and determine the hyperplane H_{2}; the distance between the origin and the hyperplane H_{2} is equal to |+1-b|/||w||. Of course, hyperplanes H, H_{1} and H_{2} are parallel and no training patterns are located between hyperplanes H_{1} and H_{2}. Based on the above considerations, the margin (the distance between hyperplanes H_{1} and H_{2}) is 2/||w||.
From these considerations it follows that the identification of the optimum separation hyperplane is performed by maximizing 2/||w||, which is equivalent to minimizing ||w||^{2}/2. The problem of finding the optimum separation hyperplane is therefore the identification of the pair (w, b) for which ||w||^{2}/2 is minimum, subject to the constraints y_{i}(w•x_{i}+b) ≥ 1, i = 1,…,m.
Denoting the training sample as {(x_{i}, y_{i})}, i = 1,…,m, with y_{i} ∈ {-1, +1}, the SVM discriminant hyperplane can be written as:

f(x) = sgn(w•x+b)

Where:
w = A weight vector
b = A bias
According to the generalization bound in statistical learning theory^{[29]}, we need to minimize the following objective function for a 2-norm soft margin version of SVM:

min_{w,b,ξ} (1/2)||w||^{2} + C Σ^{m}_{i=1} ξ_{i}^{2}    (1)

subject to y_{i}(w•x_{i}+b) ≥ 1-ξ_{i}, i = 1,…,m
in which the slack variable ξ_{i} is introduced when the problem is infeasible. The constant C>0 is a penalty parameter and a larger C corresponds to assigning a larger penalty to errors. By building a Lagrangian and using the Karush-Kuhn-Tucker (KKT) complementarity conditions^{[30,31]}, we can solve the optimization problem (1). Because of the KKT conditions, only those Lagrangian multipliers α_{i} that make the constraint active are non-zero; we denote the points corresponding to the non-zero α_{i} as support vectors (sv). Therefore we can describe the classification hyperplane in terms of α and b:

f(x) = sgn(Σ_{i∈sv} α_{i}y_{i}(x_{i}•x)+b)
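For completeness, a minimal sketch of the Lagrangian for problem (1) and its stationarity conditions, following the standard 2-norm soft margin derivation (the exact constant factors depend on how the penalty term is scaled), is:

$$
L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i^2 - \sum_i \alpha_i\big[y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 + \xi_i\big], \quad \alpha_i \ge 0
$$

$$
\frac{\partial L}{\partial \mathbf{w}} = 0 \Rightarrow \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, \qquad
\frac{\partial L}{\partial b} = 0 \Rightarrow \sum_i \alpha_i y_i = 0, \qquad
\frac{\partial L}{\partial \xi_i} = 0 \Rightarrow \xi_i = \frac{\alpha_i}{2C}
$$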
To address the unbalanced problem, C in Eq. 1 is separated into C+ and C- to adjust the penalties on the false positives vs. false negatives; the objective then becomes:

min_{w,b,ξ} (1/2)||w||^{2} + C+ Σ_{i:y_{i}=+1} ξ_{i}^{2} + C- Σ_{i:y_{i}=-1} ξ_{i}^{2}

The SVM obtained from the above formulation is referred to as balanced SVM.
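As an illustration only (not the authors' code), the asymmetric penalty can be approximated with scikit-learn's class_weight parameter, which rescales C per class; the data and the weights 10 and 1 below are arbitrary example values:

```python
# Minimal sketch of a "balanced" SVM using per-class penalties.
# Assumes scikit-learn and NumPy; data and weights are illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = -np.ones(200, dtype=int)
y[:20] = 1                                   # unbalanced labels: few positives

# class_weight multiplies C for each class, mimicking separate C+ and C-.
clf = SVC(C=100.0, kernel="rbf", gamma=0.1, class_weight={1: 10.0, -1: 1.0})
clf.fit(X, y)
print(clf.predict(X[:5]))
```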
Bagging: One of the most widely used techniques for creating an ensemble is bagging (short for Bootstrap AGGregatING), where each base classifier is provided with a set of patterns obtained by randomly resampling the original set of examples and is then trained independently of the other classifiers. The final hypothesis is obtained as the sign of the averaged predictions. The algorithm is summarized below:
•  Let S = {(x_{i}, y_{i}); i = 1,…,m} be the training set
•  Generate T bootstrap samples S^{t}, t = 1,…,T from S
•  Train the classifier f_{t} with the set of examples S^{t} to minimize the classification error Σ_{j} I(y_{j} ≠ f_{t}(x_{j})), where I(•) is the indicator function
•  Set the ensemble predictor at time t to be F^{t}(x) = sgn((1/t) Σ^{t}_{i=1} f_{i}(x))
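A minimal sketch of these steps in Python follows (illustrative only; it assumes scikit-learn SVMs as the base classifiers and the sign-of-average combination of the last step above):

```python
# Minimal bagging sketch: T bootstrap samples, one SVM per sample,
# prediction by the sign of the averaged individual outputs (labels in {-1, +1}).
import numpy as np
from sklearn.svm import SVC

def bagging_fit(X, y, T=5, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    m = len(y)
    for _ in range(T):
        idx = rng.integers(0, m, size=m)          # bootstrap sample S^t
        clf = SVC(C=100.0, kernel="rbf", gamma=0.1)
        clf.fit(X[idx], y[idx])                   # train f_t on S^t
        models.append(clf)
    return models

def bagging_predict(models, X):
    # F(x) = sgn( (1/T) * sum_t f_t(x) )
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return np.where(votes >= 0, 1, -1)
```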
Bagging is a procedure capable of reducing the variance of predictors by mimicking averaging over several training sets. For well-behaved loss functions, bagging can provide generalization bounds with a rate of convergence of the same order as Tikhonov regularization. The key observation is that, with appropriate sampling schemes, bagging can turn an α-stable algorithm into a strongly α-stable one. Strongly α-stable algorithms provide fast rates of convergence from the empirical error to the true expected prediction error. The key fact in this analysis is that certain sampling plans allow some points to affect only a subset of the learners in the ensemble. The importance of this effect is also remarked in^{[9,10]}. In these studies, empirical evidence is presented to show that bagging equalizes the influence of training points in the estimation procedure, in such a way that highly influential points (the so-called leverage points) are down-weighted. Since in most situations leverage points are badly influential, bagging can improve generalization by making an unstable base learner robust. From this point of view, resampling has an effect similar to robust M-estimators, where the influence of sample points is (globally) bounded using appropriate loss functions, for example Huber's loss or Tukey's bisquare loss.

Since in uniform resampling all the points in the sample have the same probability of being selected, it seems counterintuitive that bagging has the ability to selectively reduce the influence of leverage points. The explanation is that leverage points are usually isolated in the feature space. To remove the influence of a leverage point it is enough to eliminate this point from the sample, but to remove the influence of a non-leverage point we must in general remove a group of observations. Now, the probability that a group of size K is completely ignored by bagging is (1-K/m)^{m}, which decays exponentially with K. For K = 2, for example, (1-K/m)^{m} ≈ 0.14, while (1-1/m)^{m} ≈ 0.368. This means that bootstrapping allows the ensemble predictions to depend mainly on "common" examples, which in turn allows better generalization.

Thus bagging helps to improve the stability of single learning machines, but class imbalance also reduces its generalization performance. Therefore, we propose to employ asymmetric bagging to handle the unbalanced problem, which executes the bootstrapping only on the negative examples, since there are far more negative examples than positive ones. Tao et al.^{[18]} applied asymmetric bagging to another unbalanced problem, relevance feedback in image retrieval, and obtained satisfactory results. In this way, each individual classifier of bagging is trained on a balanced number of positive and negative examples, so asymmetric bagging is used to solve the unbalanced problem.

Asymmetric bagging: In AB-SVM, the aggregation is implemented by the Majority Voting Rule (MVR). The asymmetric bagging strategy solves the unstable problem of SVM classifiers and the unbalance problem in the training set. However, it cannot solve the problem of irrelevant and weakly redundant features in the datasets. We can solve it by feature selection embedded in the bagging method.
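A minimal sketch of asymmetric bagging with majority voting follows (illustrative only, not the authors' implementation; it bootstraps only the negative examples and pairs each bootstrap with the full positive sample):

```python
# Asymmetric bagging sketch: bootstrap only the (majority) negative class,
# keep all positives in every subset, aggregate by the Majority Voting Rule.
import numpy as np
from sklearn.svm import SVC

def ab_svm_fit(X, y, T=5, seed=0):
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == -1)[0]
    models = []
    for _ in range(T):
        neg_idx = rng.choice(neg, size=len(pos), replace=True)  # balanced subset
        idx = np.concatenate([pos, neg_idx])
        clf = SVC(C=100.0, kernel="rbf", gamma=0.1)
        clf.fit(X[idx], y[idx])
        models.append(clf)
    return models

def ab_svm_predict(models, X):
    votes = np.sum([m.predict(X) for m in models], axis=0)  # majority vote
    return np.where(votes >= 0, 1, -1)
```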

Asymmetric bagging SVM approach:
PRIFEB: Feature selection for the individual classifiers can help to improve the accuracy of bagging. This is based on the conclusion of^{[19]}, where it was concluded that reducing the error of Support Vector Machines (SVMs) will reduce the error of bagging of SVMs. At the same time, embedded feature selection is used to reduce the error of SVMs effectively. Prediction Risk based Feature selection for Bagging (PRIFEB) uses the embedded feature selection method with the prediction risk criterion for bagging of SVMs, to test whether feature selection can effectively improve the accuracy of bagging methods and, furthermore, improve the prediction quality for drug discovery. In PRIFEB, the prediction risk criterion is used to rank the features; it evaluates one feature by estimating the prediction error of the data set when the values of this feature for all examples are replaced by their mean value:

S_{i} = ERR_{i}-ERR
Where:
ERR = The training error
ERR_{i} = The test error on the training data set with the mean value of the ith feature, defined as:

ERR_{i} = (1/l) Σ^{l}_{j=1} I(y^{~}(x_{j}^{1},…,x̄^{i},…,x_{j}^{D}) ≠ y_{j})

Where:
l = The number of examples
D = The number of features
x̄^{i} = The mean value of the ith feature
y^{~}(•) = The prediction value of the jth example after the value of the ith feature is replaced by its mean value
Finally, the feature corresponding to the smallest S_{i} will be deleted, because this feature causes the smallest error and is the least important one.
The basic steps of PRIFEB are described as follows. Suppose T_{r}(x^{1}, x^{2},…, x^{D}, C) is the training set and p is the number of individuals of the ensemble. T_{r} and p are input into the procedure and the ensemble model L is the output.

Step 1: Generate a training subset T_{rk} from T_{r} by using the bootstrap sampling algorithm; the size of T_{rk} is three quarters of the size of T_{r}.
Step 2: Train an individual model L_{k} on the training subset T_{rk} by using the support vector machines algorithm and calculate the training error ERR.
Step 3: Compute the prediction risk value S_{i} using the above equation. If S_{i} is greater than 0, the i^{th} feature is selected as one of the optimal features.
Step 4: Repeat Step 3 until all the features in T_{rk} are computed.
Step 5: Generate the optimal training subset T_{rk-optimal} from T_{rk} according to the optimal features obtained in Step 3.
Step 6: Retrain the individual model L_{k} on the optimal training subset T_{rk-optimal} by using support vector machines.
Step 7: Repeat Steps 2-6 until p models are set up.
Step 8: Ensemble the obtained models L by the majority voting method for classification problems.
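A minimal sketch of the prediction risk criterion follows (illustrative only; it assumes scikit-learn, uses the training error as in Steps 2-3 above and replaces one feature at a time with its mean value):

```python
# Prediction-risk feature ranking sketch: S_i = ERR_i - ERR, where ERR_i is the
# training-set error obtained after replacing feature i by its mean value.
import numpy as np

def prediction_risk(clf, X, y):
    base_err = np.mean(clf.predict(X) != y)          # ERR on the training data
    risks = []
    for i in range(X.shape[1]):
        X_mean = X.copy()
        X_mean[:, i] = X[:, i].mean()                # neutralize feature i
        err_i = np.mean(clf.predict(X_mean) != y)    # ERR_i
        risks.append(err_i - base_err)               # S_i
    return np.array(risks)

# Features with S_i > 0 are kept as "optimal" features (Step 3 of PRIFEB), e.g.:
# selected = np.where(prediction_risk(clf, X_train, y_train) > 0)[0]
```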
SPRAG algorithm: Feature selection has been used in ensemble learning and has obtained some interesting results. Li and Liu^{[32]} proposed to use the embedded feature selection method with the prediction risk criterion for bagging of SVMs, where feature selection can effectively improve the accuracy of bagging methods. As a feature selection method, the prediction risk criterion was proposed by Moody and Utans^{[33]}; it evaluates one feature through estimating the prediction performance on the data set when the values of this feature for all examples are replaced by their mean value:

R_{i} = AUC-AUC_{i}
Where:
AUC = The prediction AUC on the training data set
AUC_{i} = The prediction AUC on the training data set with the mean value of the ith feature
Finally, the feature corresponding to the smallest R_{i} will be deleted, because this feature causes the smallest change in performance and is the least important one. The embedded feature selection model with the prediction risk criterion is employed to select relevant features for the individuals of bagging of SVMs, which is named Prediction Risk based Feature selection for Bagging (PRIFEB). PRIFEB has been compared with MIFEB (Mutual Information based Feature selection for Bagging) and ordinary bagging, which showed that PRIFEB improves bagging on different data sets^{[33]}. Since the asymmetric bagging method can overcome both the instability and the unbalance problems, and PRIFEB can overcome the problem of irrelevant features, we propose a hybrid algorithm that combines the two.
The basic idea of the SPRAG algorithm is as follows: we first use bootstrap sampling to generate a negative subset and combine it with the whole positive sample to obtain an individual training subset. Then, prediction risk based feature selection is used to select the optimal features and an individual model is obtained by training an SVM on the optimal training subset. Finally, the individual SVM classifiers are combined by the Majority Voting Rule to obtain the final model (a brief sketch is given after the step list below).
Learning and performance measurement:
•  Generate a training subset S^{-}_{rk} from the negative training set S^{-}_{r} by using the bootstrap sampling technique; the size of S^{-}_{rk} is the same as that of S_{r}^{+}
•  Train the individual model L_{k} on the training subset S^{-}_{rk} ∪ S_{r}^{+} by using the support vector machine algorithm and calculate the AUC value on the training subset
•  Compute the prediction risk value R_{i} using the above equation
•  If R_{i} is greater than 0, the i^{th} feature is selected as one of the optimal features
•  Generate the optimal training subset S_{rk-optimal} from S_{rk} according to the above optimal features
•  Train the individual model N_{k} on the optimal training subset S_{rk-optimal} by using support vector machines
•  Ensemble the obtained models N by the majority voting method for classification problems
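A minimal end-to-end sketch of these steps follows (illustrative only, not the authors' implementation; it assumes scikit-learn and, for brevity, uses training accuracy instead of AUC for the risk criterion):

```python
# SPRAG sketch: asymmetric bootstrap (negatives only) + prediction-risk
# feature selection per individual + majority voting over the ensemble.
import numpy as np
from sklearn.svm import SVC

def sprag_fit(X, y, T=5, seed=0):
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == -1)[0]
    ensemble = []
    for _ in range(T):
        idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=True)])
        Xk, yk = X[idx], y[idx]
        clf = SVC(C=100.0, kernel="rbf", gamma=0.1).fit(Xk, yk)
        # Prediction risk R_i: drop in training accuracy when feature i is
        # replaced by its mean; keep features with R_i > 0.
        base = np.mean(clf.predict(Xk) == yk)
        feats = []
        for i in range(X.shape[1]):
            Xm = Xk.copy()
            Xm[:, i] = Xk[:, i].mean()
            if base - np.mean(clf.predict(Xm) == yk) > 0:
                feats.append(i)
        feats = feats or list(range(X.shape[1]))      # fall back to all features
        clf_opt = SVC(C=100.0, kernel="rbf", gamma=0.1).fit(Xk[:, feats], yk)
        ensemble.append((clf_opt, feats))
    return ensemble

def sprag_predict(ensemble, X):
    votes = np.sum([clf.predict(X[:, f]) for clf, f in ensemble], axis=0)
    return np.where(votes >= 0, 1, -1)                # Majority Voting Rule
```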
Since the class distribution of the used data set is unbalanced, classification accuracy may be misleading. Therefore, AUC (Area Under the Curve of the Receiver Operating Characteristic (ROC))^{[19]} is used to measure the performance. At the same time, we present detailed results of prediction accuracy (ACC), which consists of two parts, the True Positives Ratio (TPR) and the True Negatives Ratio (TNR). ACC, TPR and TNR are defined as:

ACC = (#TP+#TN)/(#TP+#TN+#FP+#FN)
TPR = #TP/(#TP+#FN)
TNR = #TN/(#TN+#FP)

where #A means the number of A. TPR is also known as sensitivity and TNR as specificity. In the following, we present the analysis of the results from our experiments. The AUC and ACC values are averaged over 10 random runs.
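For reference, a small sketch of how these measures can be computed (illustrative only; assumes scikit-learn for the AUC):

```python
# ACC, TPR (sensitivity), TNR (specificity) and AUC for labels in {-1, +1}.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_pred, y_score):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    acc = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)                      # sensitivity
    tnr = tn / (tn + fp)                      # specificity
    auc = roc_auc_score(y_true, y_score)      # y_score: e.g. decision_function output
    return acc, tpr, tnr, auc
```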
Numerical experiments:
NCI Anti-HIV drug screen data set: The NCI Anti-HIV Drug Screen data set (NCI) is used. It has a categorical response measuring how well a compound protects human CEM cells from HIV-1 infection. It has 29374 examples, of which 542 (1.85%) are positive and 28832 (98.15%) are negative. The structure parameters^{[34]} consist of 64 BCUT descriptors.
RESULTS
Experiments are performed to investigate whether asymmetric bagging and feature selection help to improve the performance of bagging. Support vector machines with C = 100 and σ = 0.1 are used as individual classifiers and the number of individuals is 5. For balanced SVM, balanced_ridge is used to denote the ratio of C+ to C-. For ordinary bagging, each individual has one third of the training data set, while for AB-SVM, the size of the individual data subset is twice the size of the positive sample in the whole data set. A 3-fold cross-validation scheme is used to validate the results and the experiments on each algorithm are repeated 10 times.
Table 1: Result for using SVM on the NCI data set

Table 2: Result for using balanced SVM (balanced ridge = 0.01) on the NCI data set

Table 3: Result for using bagging of balanced SVM (balanced ridge = 0.01) on the NCI data set
DISCUSSION
Table 1-6 list the results of ordinary SVM, balanced SVM, bagging of balanced SVM, ordinary bagging, AB-SVM and SPRAG (special prediction risk based feature selection for asymmetric bagging), from which we can see that:

•  Balanced SVM obtains a slight improvement over ordinary SVM
•  Bagging methods improve the stability of single classifiers and obtain better results than single classifiers do; on balanced SVM in particular, bagging improves on the single classifier by 0.1322
Table 4: Result for using ordinary bagging on the NCI data set

Table 5: Result for using AB-SVM on the NCI data set

Table 6: Result for using our modified algorithm on the NCI data set
•  Ordinary bagging gets a high ACC value with a reasonable AUC value, but TPR is very low, which means that few of the positive examples are predicted correctly
•  AB-SVM reduces the ACC value but improves the AUC value of ordinary bagging; TPR increases from 9.23 to 67.53%
•  PRIFEB improves both the ACC and AUC values of AB-SVM; TPR is improved dramatically, to 90.96% on average
As for the above results, we think that:

•  A single SVM is not stable and cannot obtain valuable results, and bagging can improve it
•  Although ordinary bagging gets a high ACC value, this is trivial, because few positive examples are predicted correctly. If we simply predict all the labels as negative, we can get a value as high as 98.15%, which is the ratio of the negative sample to the whole sample
•  Since this is a drug discovery problem, we pay more attention to positives; AUC is more valuable than ACC to measure a classifier. Asymmetric bagging improves the AUC value of ordinary bagging and our modified algorithm further improves it significantly, to 79.58% on average; simultaneously, TPR is improved from 9.23 to 90.95%, which shows our modified algorithm is suited to solving the unbalanced drug discovery problem
•  Asymmetric bagging wins in two aspects: first, it makes the individual data subsets balanced; second, it pays more attention to the positives and always keeps the positives in the data set, which makes TPR higher than with ordinary bagging and improves AUC
•  Feature selection using the prediction risk criterion also wins in two aspects: first, embedded feature selection depends on the learning machine used, so it selects features that benefit the generalization performance of the individual classifiers; second, different features are selected for different individual data subsets, which increases the diversity of the bagging ensemble and improves its overall performance
CONCLUSION

To address the unbalanced problem of drug discovery, we propose to apply asymmetric bagging and feature selection to the modeling of QSAR of drug molecules. AB-SVM and our modified novel algorithm are compared with ordinary bagging of support vector machines on a large drug molecular activities data set; experiments show that asymmetric bagging and feature selection can improve the prediction ability in terms of AUC and TPR. Since this is a drug discovery problem, the positive sample is small but important, so AUC and TPR are more appropriate than ACC to measure the generalization performance of classifiers. This work introduces asymmetric bagging into the prediction of drug activities and furthermore extends feature selection to asymmetric bagging. Extensions of this work include testing the proposed algorithms with a higher number of individuals. This work only concerns an embedded feature selection model with the prediction risk criterion; future work will try to employ more efficient feature selection methods for this task.