Journal of Computer Science
Year: 2009  |  Volume: 5  |  Issue: 10  |  Page No.: 764 - 772

Unbalance Quantitative Structure Activity Relationship Problem Reduction in Drug Design

D. Pugazhenthi and S.P. Rajagopalan    

Abstract: Problem statement: Activities of drug molecules can be predicted by Quantitative Structure Activity Relationship (QSAR) models, which overcome the disadvantages of high cost and long cycle associated with traditional experimental methods. Because the number of drug molecules with positive activity is far smaller than the number with negative activity, it is important to predict molecular activities under such an unbalanced situation. Approach: Asymmetric bagging and feature selection were introduced into the problem, and Asymmetric Bagging of Support Vector Machines (AB-SVM) was proposed for predicting drug activities under the unbalanced condition. At the same time, the features extracted from the structures of drug molecules affect the prediction accuracy of QSAR models. A hybrid algorithm named SPRAG was therefore proposed, which applies an embedded feature selection method to remove redundant and irrelevant features for AB-SVM. Results: Numerical experiments on a data set of molecular activities showed that AB-SVM improved the AUC and sensitivity values of molecular activity prediction, and that SPRAG, with feature selection, further improved prediction ability. Conclusion: Asymmetric bagging can help to improve the prediction accuracy of drug molecule activities, and this can be improved further by performing feature selection to select relevant features of the drug molecules.

The SVM algorithm is based on statistical learning theory and the Vapnik-Chervonenkis (VC) dimension introduced by Vladimir Vapnik and Alexey Chervonenkis. Since its introduction, SVM has been applied to biological data mining[28] and drug discovery[6,8].

In SVM, the Optimum Separation Hyperplane (OSH) is the linear classifier with the maximum margin for a given finite set of learning patterns. Consider the classification of two classes of patterns that are linearly separable, i.e., a linear classifier can perfectly separate them. The linear classifier is the hyperplane H (w•x+b = 0) with the maximum width (the distance between hyperplanes H1 and H2). Consider a linear classifier characterized by the set of pairs (w, b) that satisfies the following inequalities for any pattern xi in the training set:

w•xi+b ≥ +1 for yi = +1
w•xi+b ≤ -1 for yi = -1

These inequalities can be expressed in compact form as:

yi(w•xi+b) ≥ +1

or

yi(w•xi+b)-1 ≥ 0

Because we have considered the case of linearly separable classes, each such hyperplane (w, b) is a classifier that correctly separates all patterns from the training set:

sgn(w•xi+b) = yi

For all points on the hyperplane H (w•x+b = 0), the distance between the origin and the hyperplane H is |b|/||w||. The patterns from the class -1 that satisfy the equality w•x+b = -1 determine the hyperplane H1; the distance between the origin and the hyperplane H1 is equal to |-1-b|/||w||. Similarly, the patterns from the class +1 that satisfy the equality w•x+b = +1 determine the hyperplane H2; the distance between the origin and the hyperplane H2 is equal to |+1-b|/||w||. Of course, hyperplanes H, H1 and H2 are parallel and no training patterns are located between hyperplanes H1 and H2. Based on the above considerations, the margin, i.e., the distance between hyperplanes H1 and H2, is 2/||w||.

From these considerations it follows that the optimum separation hyperplane is identified by maximizing 2/||w||, which is equivalent to minimizing ||w||²/2. The problem of finding the optimum separation hyperplane is therefore that of identifying the pair (w, b) which satisfies:

yi(w•xi+b) ≥ 1

for which ||w|| is minimum.

Denoting the training sample as:

S = {(xi, yi); i = 1, …, l}, with yi ∈ {-1, +1}

the SVM discriminant function can be written as:

f(x) = sgn(w•x+b)

Where:

w = A weight vector
b = A bias

According to the generalization bound in statistical learning theory[29], we need to minimize the following objective function for a 2-norm soft margin version of SVM:

min(w,b,ξ) ||w||²/2 + C Σ(i=1..l) ξi²    (1)

subject to yi((w•xi)+b) ≥ 1-ξi, i = 1, …, l

in which the slack variables ξi are introduced when the hard-margin problem is infeasible. The constant C>0 is a penalty parameter and a larger C corresponds to assigning a larger penalty to errors. By building a Lagrangian and using the Karush-Kuhn-Tucker (KKT) complementarity conditions[30,31], we can solve the optimization problem (1). Because of the KKT conditions, only those Lagrange multipliers αi which make a constraint active are non-zero; the points corresponding to the non-zero αi are called support vectors (sv). Therefore we can describe the classification hyperplane in terms of α and b:

f(x) = sgn(Σ(i∈sv) αi yi (xi•x) + b)

To address the unbalanced problem, C in Eq. 1 is separated into C+ and C- to adjust the penalties on false positives vs. false negatives; the objective function then becomes:

min(w,b,ξ) ||w||²/2 + C+ Σ(i: yi=+1) ξi² + C- Σ(i: yi=-1) ξi²

The SVM obtained from this formulation is called the balanced SVM.
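As an illustration, the following is a minimal sketch of class-dependent penalties using scikit-learn's SVC and its class_weight argument, which scales C per class. Note that SVC implements a 1-norm soft margin rather than the 2-norm variant above, and the synthetic data and the C+/C- ratio are illustrative assumptions, not values from this study.

```python
# Sketch: a "balanced SVM" via class-dependent penalties (C+ vs. C-).
# scikit-learn's SVC scales C by class_weight[label]; it uses the 1-norm
# soft margin, so this only approximates the 2-norm formulation above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(300, 10)
y = np.where(rng.rand(300) < 0.1, 1, -1)        # ~10% positives (illustrative)

ratio = (y == -1).sum() / (y == 1).sum()        # assumed choice of C+ / C-
clf = SVC(kernel="rbf", C=100.0, gamma=0.1,
          class_weight={1: ratio, -1: 1.0})     # C+ = C*ratio, C- = C
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```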

Bagging: One of the most widely used techniques for creating an ensemble is bagging (short for Bootstrap Aggregating), where each base classifier is provided with a set of patterns obtained by randomly resampling the original set of examples and is then trained independently of the other classifiers. The final hypothesis is obtained as the sign of the averaged predictions. The algorithm is summarized below:

Let S = {(xi, yi); i = 1, …, m} be the training set

Generate T bootstrap samples st, t = 1, …, T from S

for t = 1 to T

Train the classifier ft on the set of examples st to minimize the classification error Σj I(yj ≠ ft(xj)), where I(•) is the indicator function

Set the ensemble predictor at time t to be Ft(x) = sgn((1/t) Σ(i=1..t) fi(x))

End for
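As a rough illustration of the loop above, here is a minimal Python sketch assuming scikit-learn SVC base classifiers and labels in {-1, +1}; the hyperparameters and function names are illustrative only.

```python
# Sketch of the bagging procedure above: T bootstrap samples s_t, one SVM
# per sample, final prediction = sign of the averaged individual outputs.
# Assumes every bootstrap sample contains both classes.
import numpy as np
from sklearn.svm import SVC

def bagging_svm(X, y, T=5, seed=0):
    rng = np.random.RandomState(seed)
    m = len(y)
    models = []
    for _ in range(T):
        idx = rng.randint(0, m, size=m)              # bootstrap sample s_t
        models.append(SVC(kernel="rbf", C=100.0, gamma=0.1).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    avg = np.mean([f.predict(X) for f in models], axis=0)
    return np.where(avg >= 0, 1, -1)                 # F(x) = sgn(mean of f_t(x))
```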

Bagging is a procedure capable of reducing the variance of predictors by mimicking averaging over several training sets. For well-behaved loss functions, bagging can provide generalization bounds with a rate of convergence of the same order as Tikhonov regularization. The key observation is that, using bagging, an α-stable algorithm can become strongly α-stable with appropriate sampling schemes. Strongly α-stable algorithms provide fast rates of convergence from the empirical error to the true expected prediction error. The key fact in the previous analysis is that certain sampling plans allow some points to affect only a subset of the learners in the ensemble. The importance of this effect is also noted in[9,10]. In these studies, empirical evidence is presented to show that bagging equalizes the influence of training points in the estimation procedure, in such a way that highly influential points (the so-called leverage points) are down-weighted. Since in most situations leverage points are badly influential, bagging can improve generalization by making an unstable base learner robust. From this point of view, resampling has an effect similar to robust M-estimators, where the influence of sample points is (globally) bounded using appropriate loss functions, for example Huber's loss or Tukey's bisquare loss.

Since in uniform resampling all the points in the sample have the same probability of being selected, it seems counterintuitive that bagging has the ability to selectively reduce the influence of leverage points. The explanation is that leverage points are usually isolated in the feature space. To remove the influence of a leverage point it is enough to eliminate this point from the sample, but to remove the influence of a non-leverage point we must in general remove a group of observations. Now, the probability that a group of size K is completely ignored by bagging is (1 - K/m)^m, which decays exponentially with K (for large m it approaches e^-K). For K = 2, for example, (1 - 2/m)^m ≈ 0.14, while (1 - 1/m)^m ≈ 0.368. This means that bootstrapping allows the ensemble predictions to depend mainly on "common" examples, which in turn yields better generalization.

Thus bagging helps to improve the stability of single learning machines, but unbalance also reduces their generalization performance. Therefore, we propose to employ asymmetric bagging to handle the unbalanced problem; it executes the bootstrapping only on the negative examples, since there are far more negative examples than positive ones. Tao et al.[18] applied asymmetric bagging to another unbalanced problem, relevance feedback in image retrieval, and obtained satisfactory results. In this way each individual classifier of bagging is trained on a balanced number of positive and negative examples, so asymmetric bagging is used to solve the unbalanced problem.

Asymmetric bagging: In AB-SVM, the aggregation is implemented by the Majority Voting Rule (MVR). The asymmetric bagging strategy solves the instability problem of SVM classifiers and the unbalance problem in the training set. However, it cannot solve the problem of irrelevant and redundant features in the data sets; we solve this by feature selection embedded in the bagging method.
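A minimal sketch of the asymmetric sampling and majority voting just described is given below, again assuming scikit-learn SVC base classifiers and labels in {-1, +1}; the function names and hyperparameters are illustrative assumptions.

```python
# Sketch of asymmetric bagging: each individual SVM is trained on all
# positives plus an equally sized bootstrap sample of negatives; the
# ensemble is combined with the Majority Voting Rule (MVR).
import numpy as np
from sklearn.svm import SVC

def ab_svm(X, y, T=5, seed=0):
    rng = np.random.RandomState(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == -1)[0]
    models = []
    for _ in range(T):
        neg_boot = rng.choice(neg, size=len(pos), replace=True)
        idx = np.concatenate([pos, neg_boot])        # balanced training subset
        models.append(SVC(kernel="rbf", C=100.0, gamma=0.1).fit(X[idx], y[idx]))
    return models

def mvr_predict(models, X):
    votes = np.stack([f.predict(X) for f in models])
    return np.where(votes.sum(axis=0) >= 0, 1, -1)   # majority vote, ties -> +1
```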

Asymmetric bagging SVM approach:
PRIFEB:
Feature selection for the individuals can help to improve the accuracy of bagging; this is based on the conclusion of[19] that reducing the error of Support Vector Machines (SVMs) will reduce the error of bagging of SVMs. At the same time, embedded feature selection is used to reduce the error of the SVMs effectively. Prediction Risk based Feature selection for Bagging (PRIFEB) uses an embedded feature selection method with the prediction risk criterion for bagging of SVMs, to test whether feature selection can effectively improve the accuracy of bagging methods and furthermore improve the prediction of drug discovery. In PRIFEB, the prediction risk criterion is used to rank the features; it evaluates one feature by estimating the prediction error of the data set when the values of this feature are replaced, for all examples, by their mean value:

Si = ERRi - ERR

Where:

ERR = The training error
ERRi = The error on the training data set with the mean value of the ith feature substituted, defined as:

ERRi = (1/l) Σ(j=1..l) I(ỹ(xj1, …, x̄i, …, xjD) ≠ yj)

Where:

l = The number of examples
D = The number of features
x̄i = The mean value of the ith feature
ỹ(xj1, …, x̄i, …, xjD) = The prediction value of the jth example after the value of the ith feature is replaced by its mean value

Finally, the feature corresponding to the smallest Si will be deleted, because this feature causes the smallest error increase and is the least important one.

The basic steps of PRIFEB are described as follows.

Suppose Tr(x1, x2, …, xD, C) is the training set and p is the number of individuals in the ensemble. Tr and p are the inputs to the procedure and the ensemble model L is the output.

Step 1: Generate a training subset Trk from Tr by using the bootstrap sampling algorithm; the size of Trk is three quarters of the size of Tr.

Step 2: Train an individual model Lk on the training subset Trk by using support vector machines algorithm and calculate the training error ERR.

Step 3: Compute the prediction risk value Si using the equation above. If Si is greater than 0, the ith feature is selected as one of the optimal features.

Step 4: Repeat step 3 until all the features in Trk are computed.

Step 5: Generate the optimal training subset Trk-optimal from Trk according to the optimal features obtained in Step 3.

Step 6: Re-train the individual model Lk on the optimal training subset Trk-optimal by using support vector machines.

Step 7: Repeat Steps 2-6 until p models are set up.

Step 8: Combine the obtained individual models into the ensemble L by majority voting for classification problems.
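The core of Steps 3-4 is the mean-substitution test. Below is a minimal sketch of that criterion, assuming a trained classifier exposing a predict method and labels in {-1, +1}; the function name and the fall-back behaviour when no feature has Si > 0 are illustrative assumptions.

```python
# Sketch of the prediction risk criterion: S_i is the increase in training
# error when feature i is replaced by its mean; features with S_i > 0 are kept.
import numpy as np

def prediction_risk_selection(model, X, y):
    base_err = np.mean(model.predict(X) != y)        # ERR on the training subset
    selected = []
    for i in range(X.shape[1]):
        X_mean = X.copy()
        X_mean[:, i] = X[:, i].mean()                # mean-substitute feature i
        err_i = np.mean(model.predict(X_mean) != y)  # ERR_i
        if err_i - base_err > 0:                     # S_i = ERR_i - ERR
            selected.append(i)
    return selected or list(range(X.shape[1]))       # fallback: keep all features
```

The selected columns are then used to re-train the individual model, as in Steps 5-6.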

SPRAG algorithm: Feature selection has been used in ensemble learning and has obtained some interesting results. Li and Liu[32] proposed to use an embedded feature selection method with the prediction risk criterion for bagging of SVMs, where feature selection can effectively improve the accuracy of bagging methods. As a feature selection method, the prediction risk criterion was proposed by Moody and Utans[33]; it evaluates one feature by estimating the prediction performance on the data set when the values of this feature are replaced, for all examples, by their mean value. Here the criterion is computed on the AUC:

Ri = AUC - AUCi

Where:

AUC = The prediction AUC on the training data set
AUCi = The prediction AUC on the training data set with the mean value of the ith feature substituted

Finally, the feature corresponding to the smallest Ri will be deleted, because this feature causes the smallest change in performance and is the least important one. The embedded feature selection model with the prediction risk criterion is employed to select relevant features for the individuals of bagging of SVMs; this is named Prediction Risk based Feature selection for Bagging (PRIFEB). PRIFEB has been compared with MIFEB (Mutual Information based Feature selection for Bagging) and with ordinary bagging, and the comparison showed that PRIFEB improves bagging on different data sets[33]. Since the asymmetric bagging method can overcome both the instability and the unbalance problems, and PRIFEB can overcome the problem of irrelevant features, we propose a hybrid algorithm that combines the two.

The basic idea of the SPRAG algorithm is as follows: first, bootstrap sampling is used to generate a negative sample, which is combined with the whole positive sample to obtain an individual training subset. Then, prediction risk based feature selection is used to select the optimal features and an individual model is obtained by training an SVM on the optimal training subset. Finally, the individual SVM classifiers are combined using the majority voting rule to obtain the final model (a code sketch follows the pseudocode below).

Learning and performance measurement:

Begin

For k = 1 to T do

Generate a training subset S-k from the negative training set S- by using the bootstrap sampling technique; the size of S-k is the same as that of S+

Train the individual model Lk on the training subset Sk = S-k ∪ S+ by using the support vector machine algorithm and calculate the AUC value on the training subset

for i = 1 to D do

Compute the prediction risk value Ri using the equation above

If Ri is greater than 0, the ith feature is selected as one of the optimal features

End for

Generate the optimal training subset Sk-optimal from Sk according to the above optimal features

Train the individual model Nk on the optimal training subset Sk-optimal by using support vector machines

End for

Combine the obtained individual models Nk into the ensemble N by majority voting for classification problems

End
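Putting the pieces together, the following minimal sketch mirrors the pseudocode above (asymmetric bootstrap, AUC-based prediction risk selection, re-training, majority voting), using scikit-learn's SVC and roc_auc_score; the hyperparameters, fall-back behaviour and function names are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of SPRAG: asymmetric bootstrap + AUC-based prediction risk feature
# selection for each individual SVM, combined by majority voting.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def sprag_fit(X, y, T=5, seed=0):
    rng = np.random.RandomState(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == -1)[0]
    ensemble = []
    for _ in range(T):
        idx = np.concatenate([pos, rng.choice(neg, size=len(pos))])
        Xk, yk = X[idx], y[idx]
        clf = SVC(kernel="rbf", C=100.0, gamma=0.1).fit(Xk, yk)
        base_auc = roc_auc_score(yk, clf.decision_function(Xk))
        feats = []
        for i in range(Xk.shape[1]):
            Xm = Xk.copy()
            Xm[:, i] = Xk[:, i].mean()               # mean-substitute feature i
            auc_i = roc_auc_score(yk, clf.decision_function(Xm))
            if base_auc - auc_i > 0:                 # R_i = AUC - AUC_i
                feats.append(i)
        feats = feats or list(range(Xk.shape[1]))    # fallback: keep all features
        model = SVC(kernel="rbf", C=100.0, gamma=0.1).fit(Xk[:, feats], yk)
        ensemble.append((model, feats))
    return ensemble

def sprag_predict(ensemble, X):
    votes = np.stack([m.predict(X[:, f]) for m, f in ensemble])
    return np.where(votes.sum(axis=0) >= 0, 1, -1)   # majority voting rule
```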

Since the class distribution of the used data set is unbalanced, classification accuracy may be misleading. Therefore, the AUC (Area Under the Receiver Operating Characteristic (ROC) Curve)[19] is used to measure the performance. At the same time, we present detailed results of prediction accuracy (ACC), together with the True Positive Ratio (TPR) and True Negative Ratio (TNR). ACC, TPR and TNR are defined as:

ACC = (#TP + #TN)/(#TP + #TN + #FP + #FN)
TPR = #TP/(#TP + #FN)
TNR = #TN/(#TN + #FP)

where #A means the number of A, and TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively. TPR is also known as sensitivity and TNR as specificity. In the following, we present the analysis of the results from our experiments. The AUC and ACC values are averaged over 10 random runs.
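A minimal sketch of these measures is given below, assuming labels in {-1, +1} and a continuous decision score for the AUC; the function name is illustrative.

```python
# Sketch of the performance measures defined above: ACC, TPR (sensitivity),
# TNR (specificity) from hard predictions, and AUC from decision scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_pred, scores):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    acc = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn)                   # sensitivity
    tnr = tn / (tn + fp)                   # specificity
    auc = roc_auc_score(y_true, scores)    # area under the ROC curve
    return acc, tpr, tnr, auc
```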

Numerical experiments:
NCI AntiHIV drug screen data set:
The NCI AntiHIV Drug Screen data set (NCI) is used. It has a categorical response measuring how a compound protects human CEM cells from HIV-1 infection. It contains 29374 examples, of which 542 (1.85%) are positive and 28832 (98.15%) are negative. The structure parameters[34] consist of 64 BCUT descriptors.

RESULT

Experiments are performed to investigate whether asymmetric bagging and feature selection help to improve the performance of bagging. Support vector machines with C = 100 and σ = 0.1 are used as individual classifiers and the number of individuals is 5. For balanced SVM, balanced_ridge is used to denote the ratio of C+ to C-. For ordinary bagging, each individual has one third of the training data set, while for AB-SVM, the size of the individual data subset is twice the size of the positive sample in the whole data set. The 3-fold cross validation scheme is used to validate the results and the experiments on each algorithm are repeated 10 times.
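A sketch of this evaluation protocol is shown below: 3-fold cross validation repeated 10 times, averaging the AUC over all folds and repetitions. The stratified splitting, the hypothetical make_classifier factory and the assumption that the estimator exposes fit and decision_function are illustrative choices, not details from the paper.

```python
# Sketch of the validation scheme described above: 3-fold cross validation,
# repeated 10 times, with AUC averaged over all folds and repetitions.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def repeated_cv_auc(make_classifier, X, y, folds=3, repeats=10):
    aucs = []
    for r in range(repeats):
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=r)
        for train, test in skf.split(X, y):
            clf = make_classifier().fit(X[train], y[train])
            scores = clf.decision_function(X[test])
            aucs.append(roc_auc_score(y[test], scores))
    return float(np.mean(aucs))
```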

Table 1: Result for using SVM on the NCI data set

Table 2: Result for using balanced SVM (balanced ridge = 0.01) on the NCI data set



Table 3: Result for using bagging of balanced SVM (balanced ridge = 0.01) on the NCI data set

DISCUSSION

Tables 1-6 list the results of ordinary SVM, balanced SVM, bagging of balanced SVM, ordinary bagging, AB-SVM and SPRAG (special prediction risk based feature selection for asymmetric bagging), from which we can see that:

Balanced SVM obtains a slight improvement over ordinary SVM

Bagging improves the stability of single classifiers and obtains better results than single classifiers do; on balanced SVM in particular, bagging improves the result by 0.1322 over the single classifier

Table 4: Result for using ordinary bagging on the NCI data set



Table 5: Result for using AB-SVM on the NCI data set



Table 6: Result for using our modified algorithm on the NCI data set


Ordinary bagging gets a high ACC value with a reasonable AUC value, but its TPR is very low, which means that few of the positive examples are predicted correctly

AB-SVM reduces the ACC value but improves the AUC value of ordinary bagging; TPR increases from 9.23 to 67.53%

PRIFEB feature selection further improves both the ACC and AUC values of AB-SVM; TPR is improved dramatically, to 90.96% on average

As for the above results, we think that:

A single SVM is not stable and cannot obtain valuable results; bagging improves it

Although ordinary bagging gets a high ACC value, this result is trivial, because few positive examples are predicted correctly. If we simply predicted all the labels as negative, we would get an accuracy as high as 98.15%, which is the ratio of the negative sample to the whole sample
Since this is a drug discovery problem, we pay more attention to the positives, and AUC is more valuable than ACC for measuring a classifier. Asymmetric bagging improves the AUC value of ordinary bagging, and our modified algorithm further improves it significantly, to 79.58% on average; simultaneously, TPR is improved from 9.23 to 90.95%, which shows that our modified algorithm is well suited to the unbalanced drug discovery problem.
Asymmetric bagging wins in two respects: first, it makes the individual data subsets balanced; second, it pays more attention to the positives and always keeps the positives in the data subset, which makes its TPR higher than that of ordinary bagging and improves the AUC
Feature selection using prediction risk as the criterion also wins in two respects: first, embedded feature selection depends on the learning machine used, so it selects features which benefit the generalization performance of the individual classifiers; second, different features are selected for different individual data subsets, which increases the diversity of bagging and improves its overall performance.

CONCLUSION

To address the unbalanced problem of drug discovery, we propose to apply asymmetric bagging and feature selection to the modeling of the QSAR of drug molecules. AB-SVM and our modified algorithm are compared with ordinary bagging of support vector machines on a large drug molecular activity data set; the experiments show that asymmetric bagging and feature selection can improve the prediction ability in terms of AUC and TPR. Since this is a drug discovery problem, the positive sample is small but important, so AUC and TPR are more appropriate than ACC for measuring the generalization performance of classifiers. This work introduces asymmetric bagging into the prediction of drug activities and furthermore extends feature selection to asymmetric bagging. Extensions of this work include testing the proposed algorithms with a larger number of individuals. This work only considers an embedded feature selection model with the prediction risk criterion; future work will employ more efficient feature selection methods for this task.

" target="_blank">View Fulltext    |   Related Articles   |   Back
 
 
   
 
 
 
  Related Articles

No Article Found
 
 
 
Copyright   |   Desclaimer   |    Privacy Policy   |   Browsers   |   Accessibility