Journal of Applied Sciences

Year: 2008 | Volume: 8 | Issue: 19 | Page No.: 3407-3414
DOI: 10.3923/jas.2008.3407.3414
DSKR: A Direct Sparse Kernel Regression
Chen Xiao-Feng, Wang Shi-Tong and Cao Su-Qun

Abstract: The number of support vectors plays a crucial role in designing fast kernel machines. The prediction time grows linearly with the size of the support vector set, which in turn tends to grow with the size of the training example set, so a smaller set of support vectors leads to faster classification or regression. When applications require high prediction speed, SVM may not perform well. To overcome this problem, a direct sparse kernel learning framework was recently proposed for the binary classification problem, which can efficiently reduce the size of the support vector set. In this research, we extend the direct sparse kernel learning framework to the regression problem and present a method named DSKR to build sparse kernel regressions. DSKR has two advantages: (1) it can obtain a sparse kernel regression with an arbitrary user-defined number of support vectors and (2) it achieves lower regression error than traditional SVR with fewer support vectors. Experimental comparisons with other kernel regression methods on both synthetic and real-world data sets show that the proposed DSKR gives promising results.

How to cite this article
Chen Xiao-Feng, Wang Shi-Tong and Cao Su-Qun, 2008. DSKR: A Direct Sparse Kernel Regression. Journal of Applied Sciences, 8: 3407-3414.

Keywords: kernel learning, expansion vector, sparse learning, sparse kernel regression and support vector machine

INTRODUCTION

Kernel methods have been widely used in many areas of the pattern recognition and machine learning community, such as classification, regression and clustering (Vapnik, 1998; Ben-Hur et al., 2001). In kernel methods, examples in the input space are mapped to a feature space by utilizing kernel functions and the resulting models are trained by solving optimization problems. Traditionally, the training examples with non-zero Lagrangian coefficients are the support vectors, which are used to construct the kernel machine. The number of support vectors plays a crucial role in designing fast kernel machines. Steinwart (2003) showed that the number of support vectors grows linearly with the size of the training set, so cutting down the number of support vectors is an important way to achieve faster classification or regression.

Many methods have been proposed for reducing the size of the support vector set over the past decades. Here, we only survey the methods that are closely related to this research and were presented in recent years. Schölkopf et al. (2000) proposed the v-support vector machine (v-SVM), which controls the number of support vectors through an adjustable parameter v. The methods of Downs et al. (2001) and Thies and Weber (2006) can be viewed as post-processing steps that simplify trained kernel machines. The studies of Tipping (2001), Roth (2001) and Lawrence et al. (2002) yielded fast kernel machines from the viewpoint of probability theory. The methods of Williams and Seeger (2001), Nair et al. (2002) and Drineas and Mahoney (2005) reduced the training complexity by replacing the original kernel matrix with a sparse approximation. The methods of Vincent and Bengio (2002) and Keerthi et al. (2006) generated sparse kernel machines by basis function selection. The reduced support vector machine (RSVM) is another fast kernel training method (Lin and Lin, 2003; Lee and Huang, 2007). RSVM selects a user-defined number of random examples from the training set, i.e., a subset of the training set, to train the kernel machine. The advantage of RSVM is its fast training. However, the support vector set of RSVM is not always small even with a relatively small number of training examples; in some experimental results, RSVM even needs many more support vectors than the traditional SVM (Lin and Lin, 2003). Significant vector learning for regression was proposed for nonlinear system identification (Gao and Shi, 2005; Gao et al., 2007). Wang et al. (2006) proposed a sparse support vector regression method based on orthogonal forward selection. The direct sparse kernel learning framework, presented in Wu et al. (2006), is a recent framework for reducing run-time complexity.

In this research, we follow this line of work and consider the problem of building a sparse kernel regression by optimizing the size of the support vector set directly.

MATERIALS AND METHODS

Direct sparse kernel learning framework: First, let us briefly review the direct sparse kernel learning framework, which was presented in Wu et al. (2006). Its general form can be written as follows:

$\min_{w,\,b,\,\xi,\,\beta,\,Z}\; f(w, b, \xi)$    (1)

$\text{s.t.}\;\; g_k(w, b, \xi) \le 0,\quad k = 1,\ldots,N_g$    (2)

$h_k(w, b, \xi) = 0,\quad k = 1,\ldots,N_h$    (3)

$w = \sum_{i=1}^{N_z} \beta_i \Phi(z_i)$    (4)

where, f(w,b,ξ) is the objective function, w and b are the weight vector and the bias, respectively, ξ = [ξ1,...,ξNξ]T is a vector of slack variables and Nξ is the number of slack variables. Ng and Nh are the numbers of inequality and equality constraints, respectively. zi is an expansion vector and Z = [z1,...,zNz] ∈ Rd×Nz is the matrix of expansion vectors, where d is the dimension of the input space. Nz is the number of expansion vectors and is a user-defined parameter. βi is the coefficient of the expansion vector zi and β = [β1,...,βNz]T is the coefficient vector of the expansion vectors. Φ(·) is a mapping function which maps examples from the input space into the feature space. Note that, instead of the support vectors of traditional kernel methods, expansion vectors are used in the sparse kernel learning framework, since the obtained expansion vectors may not belong to the training set. For simplicity, we uniformly call the vectors used to construct the kernel machine expansion vectors, regardless of whether they belong to the training set or not.

Different from traditional kernel learning methods, given a user-defined Nz, the direct sparse kernel learning framework adds a non-convex constraint (Eq. 4) to its general form. This constraint guarantees that the number of expansion vectors in the obtained kernel machine equals Nz.

Since the constraint in Eq. 4 is non-convex, Eq. 1-4 cannot be solved for a globally optimal solution, so a two-part iterative strategy is used to deal with the optimization problem. In the first part, the expansion vectors Z = [z1,...,zNz] in Eq. 1-4 are fixed and, by solving Eq. 1-4, we obtain the corresponding coefficients β = [β1,...,βNz]T for the fixed Z. In the second part, the obtained β is fixed and Z is optimized to minimize the objective function in Eq. 1. The two parts execute iteratively until a convergent solution of Z and β is reached.
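
To make the two-part strategy concrete, the following minimal Python sketch shows the alternating structure under our reading of the framework; the two sub-problem solvers are passed in as hypothetical callables and are not part of the original formulation.

```python
import numpy as np

def direct_sparse_kernel_learning(X, y, Nz, solve_coefficients_for_fixed_Z,
                                  optimize_Z_for_fixed_coefficients,
                                  max_iter=50, tol=1e-4, seed=0):
    """Generic two-part iterative strategy: alternate between solving for the
    expansion coefficients with Z fixed and optimizing Z with the coefficients fixed."""
    rng = np.random.default_rng(seed)
    # Initialization: take Nz randomly chosen training examples as expansion vectors.
    Z = X[rng.choice(len(X), size=Nz, replace=False)].copy()
    beta = None
    for _ in range(max_iter):
        # Part 1: with Z fixed, solve the convex sub-problem for the coefficients.
        beta_new = solve_coefficients_for_fixed_Z(X, y, Z)
        # Part 2: with the coefficients fixed, move Z to decrease the objective.
        Z_new = optimize_Z_for_fixed_coefficients(X, y, Z, beta_new)
        # Stop when both Z and beta have converged.
        if beta is not None and np.linalg.norm(beta_new - beta) < tol \
                and np.linalg.norm(Z_new - Z) < tol:
            Z, beta = Z_new, beta_new
            break
        Z, beta = Z_new, beta_new
    return Z, beta
```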

Based on this framework, Wu et al. (2006) proposed a sparse large-margin method for binary classification named the Sparse Large Margin Classifier (SLMC). In this research, we extend the direct sparse kernel learning framework to the regression problem and propose the Direct Sparse Kernel Regression (DSKR) method. Its detailed description is given below.

DSKR: A method to build sparse kernel regression directly: Consider the training set {(xi,yi)}i=1N with the user-defined number Nz, where xi ∈ Rd is an input example and yi is the output target of xi. N is the size of the training set. To obtain a sparse kernel regression, we solve the following optimization problem:

$\min_{\beta,\,Z,\,b,\,\xi,\,\xi^*}\; \frac{1}{2} w^T w + C\sum_{i=1}^{N}(\xi_i + \xi_i^*)$    (5)

$\text{s.t.}\;\; y_i - w^T\Phi(x_i) - b \le \varepsilon + \xi_i,\quad i = 1,\ldots,N$    (6)

$w^T\Phi(x_i) + b - y_i \le \varepsilon + \xi_i^*,\quad i = 1,\ldots,N$    (7)

$\xi_i \ge 0,\;\; \xi_i^* \ge 0,\quad i = 1,\ldots,N$    (8)

$w = \sum_{j=1}^{N_z} \beta_j \Phi(z_j)$    (9)

where, C is the trade-off parameter and ε ≥ 0 is the width of the ε-insensitive zone, as in standard support vector regression.

We solve Eq. 5-9 under the direct sparse kernel learning framework. First, by fixing Z, Eq. 5-9 can be transformed into the problem below:

$\min_{\beta,\,b,\,\xi,\,\xi^*}\; \frac{1}{2} w^T w + C\sum_{i=1}^{N}(\xi_i + \xi_i^*)$    (10)

$\text{s.t.}\;\; y_i - w^T\Phi(x_i) - b \le \varepsilon + \xi_i,\quad i = 1,\ldots,N$    (11)

$w^T\Phi(x_i) + b - y_i \le \varepsilon + \xi_i^*,\quad i = 1,\ldots,N$    (12)

$\xi_i \ge 0,\;\; \xi_i^* \ge 0,\quad i = 1,\ldots,N$    (13)

$w = \sum_{j=1}^{N_z} \beta_j \Phi(z_j)$    (14)

Then we derive the dual problem of Eq. 10-14. Substituting Eq. 14 into Eq. 11 and 12, we have:

$y_i - \beta^T\psi_z(x_i) - b \le \varepsilon + \xi_i,\quad i = 1,\ldots,N$    (15)

$\beta^T\psi_z(x_i) + b - y_i \le \varepsilon + \xi_i^*,\quad i = 1,\ldots,N$    (16)

Where:

$\psi_z(x_i) = \left[K(z_1, x_i), \ldots, K(z_{N_z}, x_i)\right]^T$    (17)

K(zj, xi) denotes Φ(zj)TΦ(xi) (j = 1,...,Nz; i = 1,...,N).

Substituting Eq. 13-16 into Eq. 10 and introducing Lagrangian multipliers, we have:

$L = \frac{1}{2}\beta^T K_z \beta + C\sum_{i=1}^{N}(\xi_i + \xi_i^*) - \sum_{i=1}^{N}\alpha_i\left(\varepsilon + \xi_i - y_i + \beta^T\psi_z(x_i) + b\right) - \sum_{i=1}^{N}\alpha_i^*\left(\varepsilon + \xi_i^* + y_i - \beta^T\psi_z(x_i) - b\right) - \sum_{i=1}^{N}\left(\mu_i\xi_i + \mu_i^*\xi_i^*\right)$    (18)

where, αi ≥ 0, αi* ≥ 0, μi ≥ 0, μi* ≥ 0 (i = 1,...,N) are the Lagrangian multipliers and Kz is the kernel matrix of Z with (Kz)ij = K(zi, zj).
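
The quantities ψz(xi) and Kz can be computed directly once a kernel is chosen. A small NumPy sketch, assuming the Gaussian kernel that is used later in the experiments (the helper names are ours):

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2) for every row of A against every row of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def expansion_features(X, Z, gamma):
    """Psi with rows psi_z(x_i) = [K(z_1, x_i), ..., K(z_Nz, x_i)] (Eq. 17)."""
    return gaussian_kernel(X, Z, gamma)          # shape (N, Nz)

def expansion_kernel_matrix(Z, gamma):
    """K_z with (K_z)_{ij} = K(z_i, z_j)."""
    return gaussian_kernel(Z, Z, gamma)          # shape (Nz, Nz)
```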

Setting the derivatives of L with respect to β, b, ξi and ξi* to zero, we immediately have:

$\frac{\partial L}{\partial \beta} = K_z\beta - \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)\,\psi_z(x_i) = 0$    (19)

$\frac{\partial L}{\partial b} = \sum_{i=1}^{N}(\alpha_i - \alpha_i^*) = 0$    (20)

$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \mu_i = 0$    (21)

$\frac{\partial L}{\partial \xi_i^*} = C - \alpha_i^* - \mu_i^* = 0$    (22)

Eq. 19 leads to

$\beta = K_z^{-1}\sum_{i=1}^{N}(\alpha_i - \alpha_i^*)\,\psi_z(x_i)$    (23)

Substituting Eq. 20-23 into Eq. 18, we arrive at the dual problem of Eq. 10-14:

$\max_{\alpha,\,\alpha^*}\; -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\,\tilde{K}(x_i, x_j) - \varepsilon\sum_{i=1}^{N}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{N} y_i(\alpha_i - \alpha_i^*)$    (24)

$\text{s.t.}\;\; \sum_{i=1}^{N}(\alpha_i - \alpha_i^*) = 0$    (25)

$0 \le \alpha_i,\; \alpha_i^* \le C,\quad i = 1,\ldots,N$    (26)

Where:

$\tilde{K}(x_i, x_j) = \psi_z(x_i)^T K_z^{-1}\psi_z(x_j)$    (27)
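
Eq. 24-26 is the standard ε-SVR dual with the modified Gram matrix of Eq. 27. A sketch of how this matrix can be formed from the quantities above (the small ridge term is our own numerical safeguard, not part of the derivation):

```python
import numpy as np

def modified_gram_matrix(X, Z, gamma, ridge=1e-8):
    """K_tilde(x_i, x_j) = psi_z(x_i)^T K_z^{-1} psi_z(x_j) (Eq. 27)."""
    sq = lambda A, B: np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    k = lambda A, B: np.exp(-gamma * np.maximum(sq(A, B), 0.0))
    Psi = k(X, Z)                                  # (N, Nz), rows are psi_z(x_i)
    Kz = k(Z, Z) + ridge * np.eye(len(Z))          # ridge keeps K_z safely invertible
    return Psi @ np.linalg.solve(Kz, Psi.T)        # (N, N) Gram matrix for the dual
```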

In summary, we describe the proposed method DSKR as follows:

Step 1: Initialize the expansion vectors Z = [z1,...,zNz] from the training set.

Step 2: Obtain the kernel regression with the two-part strategy:

Step 2a: Terminate if a convergent solution of both Z and β has been reached.
Step 2b: Fix the expansion vectors Z, solve the optimization problem Eq. 24-26 and obtain the expansion coefficients β.
Step 2c: Fix the β obtained in step 2b, search for the optimal Z that minimizes Eq. 24 and then go back to step 2a.

Step 3: Output the obtained kernel regression.

In step 1, we initialize the expansion vectors zi (i = 1,...,Nz). Two methods may be used for this purpose (Wu et al., 2006): random selection and clustering-based selection. In the first method, Nz examples are randomly chosen from the training set to serve as the initial expansion vectors. In the second method, the training set is first partitioned into Nz groups by a clustering method such as K-means (Wu et al., 2006) and the cluster centers are used to initialize the expansion vectors.
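
A minimal sketch of the two initialization options, assuming scikit-learn's KMeans for the clustering variant (the paper's reference implementation is in MATLAB):

```python
import numpy as np
from sklearn.cluster import KMeans

def init_expansion_vectors(X, Nz, method="kmeans", seed=0):
    """Initialize Z either by random selection from the training set
    or by the centers of a K-means clustering of the training set."""
    if method == "random":
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=Nz, replace=False)
        return X[idx].copy()
    km = KMeans(n_clusters=Nz, n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_
```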

In step 2b, Eq. 24-26 with Z fixed is a quadratic programming problem of exactly the same form as that of the support vector machine, so it can be solved by standard SVM training techniques (Vapnik, 1998).
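
Because Eq. 24-26 has the form of the ε-SVR dual with the Gram matrix of Eq. 27, step 2b can reuse any standard SVR solver. A sketch using scikit-learn's SVR with a precomputed kernel; recovering β through Eq. 23 and the ridge term are our additions:

```python
import numpy as np
from sklearn.svm import SVR

def solve_step_2b(K_tilde, Psi, Kz, y, C=300.0, epsilon=0.001, ridge=1e-8):
    """Solve the dual Eq. 24-26 with Z fixed and recover beta and b.

    K_tilde : (N, N) Gram matrix of Eq. 27, Psi : (N, Nz) matrix with rows
    psi_z(x_i), Kz : (Nz, Nz) kernel matrix of the expansion vectors."""
    svr = SVR(kernel="precomputed", C=C, epsilon=epsilon).fit(K_tilde, y)
    # alpha_i - alpha_i^* for every training example (zero off the support set).
    coef = np.zeros(len(y))
    coef[svr.support_] = svr.dual_coef_.ravel()
    # Eq. 23: beta = K_z^{-1} sum_i (alpha_i - alpha_i^*) psi_z(x_i).
    beta = np.linalg.solve(Kz + ridge * np.eye(len(Kz)), Psi.T @ coef)
    b = float(svr.intercept_[0])
    return beta, b, coef
```

The resulting regression function is f(x) = βTψz(x) + b, which only needs the Nz expansion vectors at prediction time.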

In step 2c, with β fixed, Eq. 24 becomes the following optimization problem:

$\min_{Z}\; W(Z) = -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i^z - \alpha_i^{z*})(\alpha_j^z - \alpha_j^{z*})\,\tilde{K}(x_i, x_j) - \varepsilon\sum_{i=1}^{N}(\alpha_i^z + \alpha_i^{z*}) + \sum_{i=1}^{N} y_i(\alpha_i^z - \alpha_i^{z*})$    (28)

Since Eq. 28 is an unconstrained minimization problem, Newton's method and its variants are appropriate choices for solving it. Here, we use the limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) method (Liu and Nocedal, 1989) to handle it.
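
A sketch of step 2c using SciPy's L-BFGS-B routine; the objective below evaluates the dual objective of Eq. 24 at the fixed multipliers, which is our reading of Eq. 28, and the gradient is left to finite differences rather than the analytic expressions of Eq. 29-32:

```python
import numpy as np
from scipy.optimize import minimize

def solve_step_2c(Z0, X, y, coef, alpha_sum, gamma, epsilon=0.001, ridge=1e-8):
    """Optimize the expansion vectors Z with the multipliers fixed (step 2c).

    coef is the vector of (alpha_i - alpha_i^*); alpha_sum is sum_i (alpha_i + alpha_i^*)."""
    d = X.shape[1]

    def kern(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * np.maximum(sq, 0.0))

    def W(z_flat):
        Z = z_flat.reshape(-1, d)
        Psi = kern(X, Z)
        Kz = kern(Z, Z) + ridge * np.eye(len(Z))
        K_tilde = Psi @ np.linalg.solve(Kz, Psi.T)
        # Dual objective of Eq. 24 evaluated at the fixed multipliers.
        return -0.5 * coef @ K_tilde @ coef - epsilon * alpha_sum + y @ coef

    res = minimize(W, Z0.ravel(), method="L-BFGS-B")
    return res.x.reshape(-1, d)
```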

As mentioned earlier, Z = [z1,...,zNz] ∈ Rd×Nz. Let Zuv denote an arbitrary element of Z, located in the u-th (1 ≤ u ≤ d) row and the v-th (1 ≤ v ≤ Nz) column of Z. In other words, Zuv is the u-th attribute of the expansion vector zv.

Fig. 1: Experimental result of DSKR on the sinc function with Nz = 3. The small dots and circles are training examples and expansion vectors, respectively. The solid line shows the obtained regression result

For LBFGS, the required gradient ∇W(Z) can be computed as follows:

(29)

Where:

(30)

(31)

where, αiz and αiz* (i = 1,...,N) are the Lagrangian multipliers corresponding to xi that are obtained with the fixed Z.

Since Zuv is the u-th attribute of the expansion vector zv and ψz(xi) = [K(z1,xi),...,K(zNz,xi)]T, the kernel terms of the other expansion vectors zv' (v' = 1,...,v-1, v+1,...,Nz) do not contain the element Zuv and their derivatives with respect to Zuv vanish, so the remaining term in Eq. 30 can be computed as follows:

(32)
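
For the Gaussian kernel K(z, x) = exp(-r||z-x||2) used in the experiments, the derivative of K(zv, xi) with respect to Zuv has a simple closed form; a small sketch (the function name is ours):

```python
import numpy as np

def dK_dZuv(z_v, x_i, u, gamma):
    """Derivative of the Gaussian kernel K(z_v, x_i) = exp(-gamma*||z_v - x_i||^2)
    with respect to Z_uv, the u-th attribute of the expansion vector z_v."""
    k = np.exp(-gamma * np.sum((z_v - x_i) ** 2))
    return 2.0 * gamma * (x_i[u] - z_v[u]) * k
```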

In the following, we illustrate the application of DSKR to regression using a toy experiment.

Fig. 2: Experimental result of DSKR on sinc function with Nz = 7

Fig. 3: Experimental result of DSKR on sinc function with Nz = 11

The data set is generated by the sinc function, i.e., f(x) = sin(x)/x + t, where x is drawn from a uniform distribution over the interval [-10, 10] and t denotes random noise generated by the MATLAB code 0.001*randn(1,1). The data set contains 67 examples. In the experiments, C = 300 and ε = 0.001. The Gaussian kernel function K(xi,xj) = exp(-r||xi-xj||2) with r = 3 is used and the expansion vectors are initialized by the K-means clustering method. Figure 1-3 show the experimental results with Nz = 3, 7 and 11, respectively. From these results, we can see that DSKR generates a sparse regression with Nz expansion vectors and that Nz is related to the regression error. With a small Nz, such as Nz = 3, the regression error is large. However, as Nz increases, the regression error decreases gradually; when Nz = 11, the performance of DSKR is quite good. A simple MATLAB implementation of DSKR for this experiment is available at http://combustion.ustc.edu.cn/.
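
For readers who prefer NumPy to MATLAB, the toy data set can be generated as follows; this mirrors the description above and is not the authors' script:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 67
x = rng.uniform(-10.0, 10.0, size=N)                       # inputs drawn uniformly from [-10, 10]
y = np.sinc(x / np.pi) + 0.001 * rng.standard_normal(N)    # sin(x)/x plus small Gaussian noise

# Parameters used in the toy experiment: C = 300, epsilon = 0.001,
# Gaussian kernel exp(-r*||xi - xj||^2) with r = 3, K-means initialization.
C, epsilon, gamma, Nz = 300.0, 0.001, 3.0, 11
```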

RESULTS AND DISCUSSION

Experimental settings: Here, we study the performance of DSKR. We compare DSKR with ε-support vector regression (ε-svr), v-support vector regression (v-svr) (Schölkopf et al., 2000) and the Relevance Vector Machine (RVM) (Tipping, 2001) on both synthetic and real-world data sets. v-svr generates sparse kernel regression by adjusting the parameter v. RVM is a sparse kernel regression method which is often used as a baseline to measure the performance of sparse kernel regression methods (Roth, 2001; Wu et al., 2006; Gao et al., 2007). The code of RVM is available at http://www.miketipping.com/. ε-svr and v-svr are implemented with LIBSVM, which can be downloaded from http://www.csie.ntu.edu.tw. The Gaussian kernel K(xi,xj) = exp(-r||xi-xj||2) is used in all experiments. The parameters in the experiments are chosen as follows: ε is computed by

The optimal parameters C and r are selected by 5-fold cross validation using ε-svr on the training set (Tipping, 2001); the obtained C and r lead to the best generalization performance. For v-svr, we adjust v to get the desired number of expansion vectors. For DSKR, we set Nz directly. We give the experimental results of DSKR with both random selection and K-means clustering based initialization. All the experiments are run on a personal computer with an Intel 1.4 GHz processor, 384 MB memory and Windows 2000.
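
A sketch of this parameter-selection step using scikit-learn's 5-fold grid search over ε-SVR; the grid values are illustrative assumptions, not the ones used in the paper:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def select_C_and_gamma(X, y, epsilon=0.001):
    """5-fold cross validation with epsilon-SVR to pick the trade-off C and
    the Gaussian kernel width (gamma plays the role of r)."""
    grid = {"C": [1, 10, 100, 300, 1000],
            "gamma": [0.01, 0.1, 1.0, 3.0, 10.0]}
    search = GridSearchCV(SVR(kernel="rbf", epsilon=epsilon), grid,
                          cv=5, scoring="neg_mean_absolute_error")
    search.fit(X, y)
    return search.best_params_["C"], search.best_params_["gamma"]
```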

Two error criteria are used for performance comparison (a short computation sketch is given after the definitions):

Mean Absolute Error (MAE):

$\mathrm{MAE} = \frac{1}{N_t}\sum_{i=1}^{N_t}\left|y_i - f(x_i)\right|$    (33)

Mean Relative Error (MRE):

$\mathrm{MRE} = \frac{1}{N_t}\sum_{i=1}^{N_t}\frac{\left|y_i - f(x_i)\right|}{\left|y_i\right|}$    (34)

where Nt is the number of test examples and f(xi) is the predicted output for xi.
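
The two criteria translate directly into NumPy:

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Eq. 33: average absolute deviation between targets and predictions."""
    return np.mean(np.abs(y_true - y_pred))

def mean_relative_error(y_true, y_pred):
    """Eq. 34: average absolute deviation relative to the target magnitude."""
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
```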

Data sets: Our experiments are performed on seven data sets. Two of them are synthetic and five are real-world data sets.

The two synthetic data sets, sinc1 and 3-D Mexican hat1, are shown in Table 1. Different levels of noise are added to both data sets to compare the robustness of DSKR with that of the other three methods. Let y denote the output targets and snr denote the signal-to-noise ratio; noise is added by the MATLAB code ynoise = awgn(y, snr, 'measured'). Three ratios, snr = 10, 15 and 25, are used.
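
For reference, an equivalent NumPy routine that measures the signal power and adds white Gaussian noise at a given SNR in dB (our approximation of what MATLAB's awgn(y, snr, 'measured') does):

```python
import numpy as np

def add_awgn(y, snr_db, seed=0):
    """Add white Gaussian noise to y at the given signal-to-noise ratio in dB,
    measuring the signal power from y itself."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return y + rng.standard_normal(y.shape) * np.sqrt(noise_power)

# y_noise = add_awgn(y, snr_db=10)   # snr = 10, 15 and 25 are used in the experiments
```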

The five real-world regression data sets are bodyfat, housing, mpg, pyrim and triazines, which are shown in Table 2. Bodyfat is from StatLib, which is available at http://lib.stat.cmu.edu/. It is a data set for estimating the percentage of body fat determined by underwater weighing and various body circumference measurements for 252 men; it has 252 examples with 14 attributes. The other four data sets are from the UCI repository of machine learning databases, which can be downloaded from http://www.ics.uci.edu/~mlearn/MLRepository.htm. Housing concerns housing values in the suburbs of Boston and has 506 examples with 13 attributes. Mpg is used to estimate city-cycle fuel consumption in miles per gallon and contains 392 examples with 7 attributes. Pyrim and triazines are both quantitative structure-activity relationship data sets; pyrim has 74 examples with 27 attributes and triazines has 186 examples with 60 attributes. All attributes of these data sets are scaled to the interval [-1, 1].
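
The scaling step can be done, for example, with scikit-learn's MinMaxScaler:

```python
from sklearn.preprocessing import MinMaxScaler

def scale_attributes(X):
    """Scale every attribute of a data set to the interval [-1, 1]."""
    return MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
```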

Results: The experimental results are given in Table 3-6, where Nz denotes the number of expansion vectors. We adjust Nz according to the size of the data sets to obtain reasonable results for comparison. For sinc1 and pyrim, Nz = 20. For 3-D Mexican hat1 and triazines, Nz = 30. For bodyfat, Nz = 4. For housing and mpg, Nz = 100. The number of expansion vectors of RVM cannot be chosen a priori, so we report its experimental results directly. In all tables, DSKR(R) and DSKR(K) denote DSKR with random selection initialization and DSKR with K-means clustering based initialization, respectively.

From these experimental results, we can see that the initialization method affects the performance of DSKR. The regression errors of DSKR with random selection initialization are consistently larger than those obtained with K-means clustering based initialization. The reason for this phenomenon is that random initialization may not lead to a uniform distribution of the initial expansion vectors. However, since the difference between the two kinds of regression errors is small, the random selection initialization method is still acceptable for DSKR.

Compared with ε-svr, DSKR can reduce the size of the expansion vector set efficiently. For example, in the experiments on sinc1 and pyrim, DSKR needs less than 30% of the expansion vectors used by ε-svr while achieving much better results. Although v-svr can control Nz by adjusting the parameter v, its regression errors are always larger than those of DSKR and even larger than those of ε-svr on the above data sets.

Table 1: Synthetic regression data sets

Table 2: Real-world regression data sets

Table 3: Experimental result on sinc1

Table 4: Experimental result on 3-D Mexican hat1

Table 5: Experimental result on bodyfat, housing and mpg

Table 6: Experimental result on pyrim and triazines

In most cases, RVM results in a sparser regression than DSKR. However, the regression error of RVM is usually larger than that of DSKR: given the optimal parameters C and r, the MAE and MRE of DSKR with K-means clustering based initialization are always lower than those of RVM. On the other hand, a drawback of RVM is that Nz cannot be selected a priori; in other words, Nz is only an outcome of RVM training. In DSKR, by contrast, we can choose Nz according to the size of the training set, i.e., Nz can be selected from 1 to N. Since more expansion vectors can decrease the regression error, DSKR has an advantage over RVM in this respect. In Table 7-10, we also give the experimental results of RVM and DSKR with the same Nz. These results indicate that DSKR with K-means clustering based initialization usually outperforms RVM, except on the data set triazines. However, too small a number of expansion vectors cannot guarantee the regression performance.

Table 7: Experimental result on sinc1 of RVM and DSKR

Table 8: Experimental result on 3-D Mexican hat1 of RVM and DSKR

Table 9: Experimental result on housing and mpg of RVM and DSKR

Table 10: Experimental result on pyrim and triazines of RVM and DSKR

As we can see from Table 7-10, the regression errors are much larger than those in Table 3-6. Since DSKR can optimize a sparse kernel regression with an arbitrary Nz, it can improve the regression performance by increasing Nz.

CONCLUSIONS

The number of support vectors plays an important role in designing a fast kernel machine: reducing the number of support vectors shortens the prediction time. In this research, we consider the regression problem and present a direct sparse kernel regression method named DSKR to optimize the number of support vectors in kernel regression. By adding a constraint directly to traditional support vector regression, DSKR can obtain a sparse kernel regression with a user-defined number of support vectors. Experiments on two synthetic data sets and five real-world data sets show the effectiveness of DSKR, whose performance is comparable with that of state-of-the-art methods. DSKR may be further improved in the future; for example, how to reduce its training time is an open issue worth addressing.

ACKNOWLEDGMENTS

This research is supported by a Hong Kong PolyU Grant, National 973 Key Project (Grant No. 2006CB705700), 2007 National 863 Project (Grant No. 2007AA1Z158), two 2007 grants from the National Science Foundation of China (Grant No. 60773206, No. 60704047), 2005 National Defense Research Grant from the Ministry of Education of China (Grant No. A1420461266), 2007 Cultivation Fund of the Key Scientific and Technical Innovation Project of the Ministry of Education of China, the New Century Outstanding Young Scholar Grant of the Ministry of Education of China (Grant No. NCET-04-0496), the National KeySoft Laboratory at Nanjing University, the Key Laboratory of Computer Science at the Institute of Software, CAS, China, the Key Laboratory of Computer Information Technologies of Jiangsu Province, the 2004 key project of the Ministry of Education of China and the National Key Laboratory of Pattern Recognition at the Institute of Automation, CAS, China.

REFERENCES

  • Ben-Hur, A., D. Horn, H.T. Siegelmann and V. Vapnik, 2001. Support vector clustering. J. Mach. Learn. Res., 2: 125-137.
    Direct Link    


  • Downs, T., K.E. Gates and A. Masters, 2001. Exact simplification of support vector solutions. J. Mach. Learn. Res., 2: 293-297.
    Direct Link    


  • Gao, J., D. Shi and X. Liu, 2007. Significant vector learning to construct sparse kernel regression models. Neural Networks, 20: 791-798.
    CrossRef    


  • Keerthi, S.S., O. Chapelle and D. DeCoste, 2006. Building support vector machines with reduced classifier complexity. J. Mach. Learn. Res., 7: 1493-1515.
    Direct Link    


  • Lee, Y.J. and S.Y. Huang, 2007. Reduced support vector machines: A statistical theory. IEEE Trans. Neural Networks, 18: 1-13.
    CrossRef    


  • Lin, K.M. and C.J. Lin, 2003. A study on reduced support vector machines. IEEE Trans. Neural Networks, 14: 1449-1459.
    CrossRef    


  • Liu, D.C. and J. Nocedal, 1989. On the limited memory BFGS method for large scale optimization. Math Program., 45: 503-528.
    CrossRef    


  • Nair, P.B., A. Choudhury and A.J. Keane, 2002. Some greedy learning algorithms for sparse regression and classification with mercer kernels. J. Mach. Learn. Res., 3: 781-801.
    Direct Link    


  • Schölkopf, B., A. Smola, R.C. Williamson and P.L. Bartlett, 2000. New support vector algorithms. Neural Comput., 12: 1207-1245.
    CrossRef    Direct Link    


  • Steinwart, I., 2003. Sparseness of support vector machine. J. Machine Learn. Res., 4: 1071-1105.
    Direct Link    


  • Thies, T. and F. Weber, 2006. Optimal reduced-set vectors for support vector machines with a quadratic kernel. Neural Comput., 16: 1769-1777.
    CrossRef    


  • Tipping, M.E., 2001. Sparse Bayesian learning and the relevance vector machine. J. Machine Learn. Res., 1: 211-244.
    Direct Link    


  • Vapnik, V.N., 1998. Statistical Learning Theory. 1st Edn., John Wiley and Sons, New York


  • Vincent, P. and Y. Bengio, 2002. Kernel matching pursuit. Mach. Learn., 48: 165-187.


  • Wang, X.X., S. Chen, D. Lowe and C.J. Harris, 2006. Sparse support vector regression based on orthogonal forward selection for the generalized kernel model. Neur. Comput., 70: 462-474.
    CrossRef    


  • Wu, M.R., B. Schölkopf and G. Bakir, 2006. A direct method for building sparse kernel learning algorithms. J. Mach. Learn. Res., 7: 603-624.
    Direct Link    


  • Williams, C.K.I. and M. Seeger, 2001. Using the Nystrom method to speed up kernel machines. Proceedings of 13th Neural Information Processing Systems, December 10-14, 2001 MIT Press, Denver, USA., pp: 682-688.


  • Gao, J. and D. Shi, 2005. Sparse kernel regression modeling based on L1 significant vector learning. Proceedings of the International Conference on Neural Networks and Brain, October 13-15, 2005, IEEE Express, Beijing, China, pp: 1925-1930.


  • Lawrence, N., M. Seeger and R. Herbrich, 2002. Fast sparse gaussian process methods: The informative vector machine. Proceedings of the 15th Neural Information Processing System, December 9-14, 2002, MIT Press, Vancouver, Canada, pp: 606-616.
