
Information Technology Journal

Year: 2014 | Volume: 13 | Issue: 6 | Page No.: 1257-1261
DOI: 10.3923/itj.2014.1257.1261
Convergence Rate of Least Square Regressions with Data Dependent Hypothesis
Ye Peixin, Han Yongjie and Duan Liqin

Abstract: We estimate the error of least square regression with data dependent hypothesis and coefficient regularization algorithms based on a general kernel. When the kernel belongs to a certain class of Mercer kernels, under a very mild regularity condition on the regression function, we derive a dimension-free learning rate m^(-1/6).


How to cite this article
Ye Peixin, Han Yongjie and Duan Liqin, 2014. Convergence Rate of Least Square Regressions with Data Dependent Hypothesis. Information Technology Journal, 13: 1257-1261.

Keywords: learning rate, least square regressions, data dependent hypothesis, coefficient regularization, Mercer kernel

INTRODUCTION

Kernel-based least square regression learning has been a very active field in recent years (Vapnik, 1995; Cucker and Smale, 2001, 2002; De Vito et al., 2005; Wu et al., 2006; Caponnetto and De Vito, 2007; De Mol et al., 2009; Gnecco and Sanguineti, 2009; Mairal et al., 2010; Schmidt, 1907; Koltchinskii, 2006; Aronszajn, 1950). Some mathematical foundations of it have been established (Carmeli et al., 2006; Christmann and Steinwart, 2007, 2008; Cucker and Zhou, 2007; Evgeniou et al., 2000; Hiriart-Urruty and Lemaréchal, 2001; Sheng et al., 2012a-c; Sheng and Ye, 2011; Huaisheng and Ye, 2011; Sheng and Xiang, 2011, 2012). In this study, we investigate some mathematical aspects of least square learning algorithms. We consider error estimates of least square regression learning with a general kernel and coefficient regularization. In particular, when the kernel belongs to a certain class of Mercer kernels, under a very mild regularity condition on the regression function, we derive a dimension-free learning rate m^(-1/6).

Let us formulate the problem of learning in a standard way. Let X ⊂ ℝ^q and Y ⊂ ℝ be Borel sets and let ρ be a Borel probability measure on Z = X×Y. For a function f: X→Y define the error:

E(f) = ∫_Z (f(x) − y)² dρ

For each input x∈X and output y∈Y, (f(x) − y)² is the error suffered from the use of f as a model for the process producing y from x. By integrating over X×Y (w.r.t. ρ, of course) we average out the error over all pairs (x, y). Hence the word “error” for E(f).

For every x∈X, let ρ(y|x) be the conditional (w.r.t. x) probability measure on Y. Let also ρ_X be the marginal probability measure of ρ on X, i.e., the measure on X defined by ρ_X(S) = ρ(π⁻¹(S)), where π: X×Y→X is the projection. For every integrable function φ: X×Y→ℝ, a version of Fubini’s theorem relates ρ, ρ(y|x) and ρ_X as follows:

∫_Z φ(x, y) dρ = ∫_X ( ∫_Y φ(x, y) dρ(y|x) ) dρ_X(x)

This “breaking” of ρ into the measures ρ(y|x) and ρ_X corresponds to looking at Z = X×Y as the product of an input domain X and an output set Y. In what follows, unless otherwise specified, integrals are to be understood with respect to ρ, ρ(y|x) or ρ_X.
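
To make the error functional and the Fubini decomposition concrete, here is a small Python sketch (illustrative only, not from the paper) that estimates E(f) by Monte Carlo in two ways: directly over Z and by first averaging over ρ(y|x) and then over ρ_X. The data model y = sin(2πx) + noise and the candidate f are hypothetical choices.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measure rho: x ~ Uniform[0, 1], y | x ~ N(sin(2*pi*x), 0.1^2)
sample_x = lambda n: rng.uniform(0.0, 1.0, size=n)
sample_y_given_x = lambda x: np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=np.shape(x))

# A candidate model f, deliberately different from the regression function
f = lambda x: 0.8 * np.sin(2 * np.pi * x)

# (i) Direct Monte Carlo estimate of E(f) = int_Z (f(x) - y)^2 d rho
x = sample_x(200_000)
y = sample_y_given_x(x)
err_direct = np.mean((f(x) - y) ** 2)

# (ii) Fubini-style estimate: outer average over rho_X of the inner
#      conditional average int_Y (f(x) - y)^2 d rho(y|x)
x_outer = sample_x(2_000)
inner = [np.mean((f(xi) - sample_y_given_x(np.full(500, xi))) ** 2) for xi in x_outer]
err_fubini = np.mean(inner)

# Both approximate E(f) = 0.04 * E[sin^2(2 pi x)] + 0.01 = 0.03
print(err_direct, err_fubini)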

Define f_ρ: X→Y by:

f_ρ(x) = ∫_Y y dρ(y|x)

The function f_ρ is called the regression function of ρ. For each x∈X, f_ρ(x) is the average of the y coordinate of {x}×Y (in topological terms, the average of y on the fiber of x).

It is clear that if f_ρ is square integrable with respect to ρ_X, i.e., f_ρ ∈ L²(ρ_X), then it minimizes the error E(f) over all f ∈ L²(ρ_X). Thus, in the sense of the error E(·), the regression function f_ρ is the best description of the relation between inputs x∈X and outputs y∈Y.
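
Why f_ρ minimizes the error can be recorded as a one-line computation; this is the standard argument (cf. Cucker and Smale, 2001), not a result of this paper: for any f ∈ L²(ρ_X),

E(f) = ∫_Z (f(x) − f_ρ(x) + f_ρ(x) − y)² dρ = ||f − f_ρ||²_{L²(ρ_X)} + E(f_ρ),

because the cross term vanishes: integrating first over ρ(y|x) gives ∫_Y (f_ρ(x) − y) dρ(y|x) = 0 by the definition of f_ρ. Hence E(f) ≥ E(f_ρ), with equality precisely when f = f_ρ in L²(ρ_X).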

In most cases, the distribution ρ(x, y) is unknown and what one can know is a set of samples z = {z_i}_{i=1}^m = {(x_i, y_i)}_{i=1}^m ∈ Z^m which are drawn independently and identically distributed according to ρ(x, y). Our goal is to find an estimator f_z, on the basis of the given data z, that approximates f_ρ well with high probability. This is an ill-posed problem and a regularization technique is needed. In many areas of machine learning, the following Tikhonov regularization scheme is commonly used to overcome the ill-posedness:

f_z = arg min_{f∈H} { (1/m) Σ_{i=1}^m (f(x_i) − y_i)² + λ ||f||²_H }        (1)
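
For orientation, the following Python sketch solves scheme (1) via the standard representer-theorem closed form c = (K + mλI)⁻¹y, so that f_z = Σ_i c_i K_{x_i}. The Gaussian kernel and the synthetic data are illustrative assumptions, not the paper's setting.

import numpy as np

rng = np.random.default_rng(1)

def gaussian_kernel(a, b, sigma=0.2):
    # K(s, t) = exp(-|s - t|^2 / (2 sigma^2)), a Mercer kernel on [0, 1]
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * sigma ** 2))

# Sample z = {(x_i, y_i)}_{i=1}^m drawn i.i.d. from a hypothetical rho
m = 100
x = rng.uniform(0, 1, m)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, m)

# Representer theorem: the minimizer of (1/m) sum_i (f(x_i) - y_i)^2 + lam ||f||_H^2
# is f_z = sum_i c_i K_{x_i} with c solving (K + m * lam * I) c = y
lam = 1e-3
K = gaussian_kernel(x, x)
c = np.linalg.solve(K + m * lam * np.eye(m), y)

def f_z(t):
    return gaussian_kernel(np.atleast_1d(t), x) @ c

print(f_z(0.25)[0])  # close to sin(pi / 2) = 1 when the fit is reasonable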

Usually H is taken to be a Reproducing Kernel Hilbert Space (RKHS) induced by a Mercer kernel, i.e., a kernel which is continuous, symmetric and positive semi-definite on X×X (Vapnik, 1995; Cucker and Smale, 2001, 2002; De Vito et al., 2005; Wu et al., 2006; Caponnetto and De Vito, 2007; De Mol et al., 2009; Gnecco and Sanguineti, 2009). The RKHS H_K associated with the kernel K is defined to be the closure of the linear span of the set of functions:

{K_x = K(x, ·): x∈X}

with the inner product ⟨K_x, K_y⟩_{H_K} = K(x, y). The reproducing property takes the form:

f(x) = ⟨f, K_x⟩_{H_K} for all f ∈ H_K and x ∈ X        (2)

This kind of kernel scheme has been studied in a large body of literature, cf. (Vapnik, 1995; Cucker and Smale, 2001, 2002; De Vito et al., 2005; Wu et al., 2006; Caponnetto and De Vito, 2007; De Mol et al., 2009; Gnecco and Sanguineti, 2009; Mairal et al., 2010; Schmidt, 1907; Koltchinskii, 2006).

In this study, we consider a different kernel scheme. Let K: X×X→ℝ be a continuous and bounded function, which is called a general kernel. For a given set of data points {y_1, y_2, ..., y_m} ⊂ X, the data dependent hypothesis space is given by:

{ f_α = Σ_{j=1}^m α_j K(·, y_j) : α = (α_1, α_2, ..., α_m) ∈ ℝ^m }

Every hypothesis function f_α is determined by its coefficient vector α and the penalty is imposed on these coefficients. This leads to the general coefficient regularized scheme:

min_{α∈ℝ^m} { (1/m) Σ_{i=1}^m (f_α(x_i) − y_i)² + λ Ω(α) }        (3)

where Ω(α) is a positive function on ℝ^m.

Formulation (3) is a data dependent scheme which has found many applications in the design of support vector machines, micro-array analysis and variable selection (Mairal et al., 2010; Schmidt, 1907; Koltchinskii, 2006; Aronszajn, 1950). We now study a particular coefficient regularization.

We endow ℝ^m with the usual inner product, i.e., for any α = (α_1, α_2, ..., α_m)^T and b = (b_1, b_2, ..., b_m)^T in ℝ^m, we take:

⟨α, b⟩ = Σ_{i=1}^m α_i b_i

In particular, ||α||²_2 = ⟨α, α⟩ = Σ_{i=1}^m α_i².

Set:

We have the following coefficient regularization with l2-penalization:

α_z = arg min_{α∈ℝ^m} { (1/m) Σ_{i=1}^m (f_α(x_i) − y_i)² + λ ||α||²_2 }        (4)

Equation 4 is a strictly convex optimization problem whose solution may be analyzed with tools from convex analysis (Aronszajn, 1950). Based on this consideration, we shall give the explicit expression for the solution of Eq. 4, with which, together with an inequality for convex functions, we show the robustness of the solutions (Lemma 3). Thus, we will use a new approach to estimate the learning rate ||f_{α_z} − f_ρ||²_{ρ_X}.
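
Here is a Python sketch of the l2 scheme (4): by strict convexity the minimizer solves the normal equations, giving α_z = (AᵀA + mλI)⁻¹Aᵀy with A_ij = K(x_i, y_j). The Gaussian kernel and the data model are illustrative assumptions, and the exact normalization of λ in the paper may differ from the one assumed here.

import numpy as np

rng = np.random.default_rng(3)

def kernel(s, t, sigma=0.2):
    # A hypothetical continuous, bounded (here Gaussian) kernel on X x X
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2 * sigma ** 2))

# Sample z = {(x_i, y_i)} and points {y_1, ..., y_m} spanning the hypothesis space
m = 80
x = rng.uniform(0, 1, m)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, m)
nodes = rng.uniform(0, 1, m)                 # the data defining the hypothesis space

# Normal equations of the strictly convex quadratic objective
lam = 1e-3
A = kernel(x, nodes)                         # A[i, j] = K(x_i, y_j)
alpha_z = np.linalg.solve(A.T @ A + m * lam * np.eye(m), A.T @ y)

def f_alpha(t):
    return kernel(np.atleast_1d(t), nodes) @ alpha_z

print(f_alpha(0.25)[0])                      # close to 1 when the fit is reasonable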

For this purpose, we define the integral regularized risk scheme corresponding to Eq. 4 as:

α(ρ) = arg min_{α∈ℝ^m} { ∫_Z (f_α(x) − y)² dρ + λ ||α||²_2 }        (5)

Then, we have the following error decomposition:

(6)

where the first term on the right-hand side is called the sample error and the second term is called the approximation error. So the estimate of the learning error is reduced to those of the sample error and the approximation error.

In this study, we assume |y| < M almost surely. So the regression function f_ρ is bounded and square integrable with respect to ρ_X. For the kernel function K, we only assume that it is continuous and bounded. We denote:

k = sup_{x, t∈X} |K(x, t)| < ∞

In the study of Gnecco and Sanguineti (2009), the following results have been obtained:

Theorem 1: Let K(x, y) be a general kernel on X×X and let α_z and α(ρ) be defined as in Eq. 4 and 5, respectively. Then, for any 0<δ<1, with confidence 1−δ, there holds:

(7)

Theorem 2: Under the assumptions of Theorem 1, for any 0<δ<1, with confidence 1−δ, there holds:

(8)

where:

Define the integral operator corresponding to K(x, y) as:

(L_K f)(x) = ∫_X K(x, t) f(t) dρ_X(t),  f ∈ L²(ρ_X)

This operator is a compact operator that maps the Hilbert space L²(ρ_X) to itself.

We assume that f_ρ(x) = (L_K φ)(x) for some φ ∈ L²(ρ_X), which means that f_ρ lies in the range of L_K. Under this assumption, we obtain the following upper estimate for K(f_ρ, λ).
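
To make the source condition f_ρ = L_K φ concrete, the Python sketch below approximates (L_K φ)(x) = ∫_X K(x, t) φ(t) dρ_X(t) by an empirical average over a sample from ρ_X; the uniform marginal, the Gaussian kernel and the choice of φ are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(4)

def kernel(s, t, sigma=0.2):
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2 * sigma ** 2))

phi = lambda t: np.cos(2 * np.pi * t)        # an illustrative phi in L^2(rho_X)

# Empirical version of the integral operator:
# (L_K phi)(x) ~ (1/n) sum_j K(x, t_j) phi(t_j), with t_j drawn i.i.d. from rho_X
n = 50_000
t = rng.uniform(0, 1, n)                     # rho_X taken to be Uniform[0, 1]
x_eval = np.linspace(0, 1, 5)
LK_phi = kernel(x_eval, t) @ phi(t) / n

# A Monte Carlo picture of a regression function lying in the range of L_K
print(LK_phi)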

Theorem 3 (Sheng et al., 2012a): Let K(x, y) be a general kernel on X×X and let f_ρ(x) = (L_K φ)(x) with φ ∈ L²(ρ_X). Then:

There is a discrete set of points in X such that:


If the set {x_1, ..., x_m} ⊂ X is taken from the sample z, then, for any δ∈(0, 1), with confidence 1−δ, there holds:

where:

The above three theorems are the motivation of our present work. In the next section we provide our main results.

RESULTS

Here, we will show that if K(x, y) belongs to a certain class of Mercer kernels, then the convergence rate for the functional K(f_ρ, λ) can be derived. Based on this result, we will derive a dimension-free learning rate.

The integral operator:

(L_K f)(x) = ∫_X K(x, t) f(t) dρ_X(t)

is a compact operator that maps the Hilbert space L²(ρ_X) to itself. Then, by the Schmidt expansion (Schmidt, 1907; Carmeli et al., 2006):

K(x, y) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(y)        (9)

where {λ_i} is a non-increasing sequence of eigenvalues of L_K and {φ_i} forms the corresponding system of orthonormal eigenfunctions. The convergence is absolute and uniform on X×X. We want to find an approximating sequence of the form:

where α(f_ρ) = (α_1(f_ρ), ..., α_m(f_ρ)) and provide the estimate for:

which can serve as an upper estimate for the functional K(f_ρ, λ).

By Mercer’s theorem (Carmeli et al., 2006) we know that the eigenvalues satisfy λ_i ≥ 0. We assume further that 0 < λ_i < 1 and that the eigenfunctions {φ_i} form a complete orthonormal basis of L²(ρ_X). Moreover, we assume that |φ_i(x)| ≤ 1 for all i and all x∈X.
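
These assumptions on the λ_i and φ_i can be probed numerically with the usual discrete analogue of the Schmidt expansion (9): eigendecompose the normalized Gram matrix K(t_i, t_j)/n built from a sample of ρ_X and check that a truncated expansion already reproduces the kernel. The Gaussian kernel and the uniform marginal below are, again, purely illustrative choices.

import numpy as np

rng = np.random.default_rng(5)

def kernel(s, t, sigma=0.3):
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2 * sigma ** 2))

# Discretize L_K on a sample t_1, ..., t_n drawn from rho_X (Uniform[0, 1] here)
n = 400
t = np.sort(rng.uniform(0, 1, n))
G = kernel(t, t) / n                          # empirical counterpart of L_K

# Eigenvalues approximate the lambda_i; sqrt(n) * eigenvectors approximate
# the orthonormal eigenfunctions phi_i evaluated at the points t_j
evals, evecs = np.linalg.eigh(G)
evals, evecs = evals[::-1], evecs[:, ::-1]    # reorder so the lambda_i are non-increasing
phis = np.sqrt(n) * evecs

# Truncated expansion sum_{i < r} lambda_i phi_i(s) phi_i(u) versus K(s, u)
r = 30
K_trunc = (phis[:, :r] * evals[:r]) @ phis[:, :r].T
print(np.max(np.abs(K_trunc - kernel(t, t))))  # small: the expansion converges uniformly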

Define α_i(f) = ∫_X f(t) φ_i(t) dρ_X(t) for f ∈ L²(ρ_X). Then our main results can be stated as the following theorems:

Theorem 4: If f_ρ satisfies:

(10)

then, there is an absolute constant C>0 such that:

(11)

where h is the VC dimension of the family of real valued functions {K_x(t): t∈X}.

Theorem 5: If f_ρ satisfies (10) then, with high probability:

To prove Theorem 4, we will exploit the following known result on error bounds for sparse approximation. To this end, we first recall the concept of VC dimension. The VC dimension of a family F = {F_x} of real valued functions on a set X is the maximum number h of points t_i in X that can be separated into two distinct classes in all 2^h possible ways by functions of the form F_x(t) − α, where the parameters x and α vary in X and ℝ, respectively (Vapnik, 1995).

Lemma 1 (Christmann and Steinwart, 2007): Let X ⊂ ℝ^q, H ∈ L¹(X) and let f be a function on X having the integral representation f(x) = ∫_X K_x(t) H(t) dt, with h the VC dimension of the family {K_x(t): t∈X}. If there exists k>0 such that |K_x(t)| ≤ k for all x and t, then, for any m, there exist y_1, y_2, ..., y_m ∈ X, c_1, c_2, ..., c_m ∈ {−1, 1} and an absolute constant C such that:

(12)
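
Lemma 1 is an existence statement. One standard way to see where such points y_i and signs c_i come from is the probabilistic (Maurey-type) argument sketched below in Python: drawing the y_j from the density |H|/||H||_{L¹} and setting c_j = sign(H(y_j)) makes the m-term average an unbiased, concentrating estimate of f(x) = ∫_X K_x(t) H(t) dt. The kernel, the function H and the domain [0, 1] are illustrative assumptions, and the empirical maximum error printed here is only a stand-in for, not a proof of, the bound (12).

import numpy as np

rng = np.random.default_rng(6)

def K(x, t, sigma=0.3):
    return np.exp(-np.subtract.outer(x, t) ** 2 / (2 * sigma ** 2))

H = lambda t: np.sin(3 * np.pi * t)            # an illustrative H in L^1([0, 1]), |H| <= 1

# Reference values of f(x) = int_0^1 K_x(t) H(t) dt by a fine Riemann sum
grid = np.linspace(0, 1, 20_001)
dt = grid[1] - grid[0]
H_l1 = np.sum(np.abs(H(grid))) * dt            # ||H||_{L^1}
x_eval = np.linspace(0, 1, 200)
f_true = (K(x_eval, grid) * H(grid)).sum(axis=1) * dt

# Maurey sampling: draw y_j from the density |H|/||H||_1 by rejection, set c_j = sign(H(y_j))
m = 500
pts, signs = [], []
while len(pts) < m:
    cand = rng.uniform(0, 1)
    if rng.uniform(0, 1) < abs(H(cand)):       # valid rejection step because |H| <= 1
        pts.append(cand)
        signs.append(np.sign(H(cand)))
y_pts, c = np.array(pts), np.array(signs)

# m-term approximant (||H||_1 / m) * sum_j c_j K_x(y_j); its error decays roughly like m^(-1/2)
f_m = (H_l1 / m) * (K(x_eval, y_pts) @ c)
print(np.max(np.abs(f_m - f_true)))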

Now, we are ready to prove Theorem 4.

Proof of Theorem 4: The proof is based on the Schmidt expansion. In fact, since 0 < λ_1 < 1, (9) implies:

and hence:

Take:

then, by Eq. 9 we know H∈C(X) and:

By Lemma 1, there are points {y_1, y_2, ..., y_m} ⊂ X, signs c_1, c_2, ..., c_m ∈ {−1, 1} and an absolute constant C>0 such that (12) holds.

Define:

Then:

and:

Therefore:

Thus we complete the proof of Theorem 4.

Theorem 5 can be derived easily from Theorems 2 and 4.

Proof of Theorem 5: Combining Theorems 2 and 4, we have:

Taking λ = m^(-1/3), we get the desired result.

Finally, we point out that a similar rate can be derived for the classification problem. We will study this problem in future work.

ACKNOWLEDGMENTS

This work was supported in part by the Natural Science Foundation of China (Grant Nos. 11101220, 10971251, 11271199 and 11201104).

REFERENCES

  • Vapnik, V.N., 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York, USA, pp: 176-208.

  • Cucker, F. and S. Smale, 2001. On the mathematical foundations of learning. Bull. Am. Math. Soc., 39: 1-49.

  • Cucker, F. and S. Smale, 2002. Best choices for regularization parameters in learning theory: On the bias-variance problem. Found. Comput. Math., 2: 413-428.

  • De Vito, E., A. Caponnetto and L. Rosasco, 2005. Model selection for regularized least-squares algorithm in learning theory. Found. Comput. Math., 5: 59-85.

  • Wu, Q., Y.M. Ying and D.X. Zhou, 2006. Learning rates of least-square regularized regression. Found. Comput. Math., 6: 171-192.

  • Caponnetto, A. and E. De Vito, 2007. Optimal rates for the regularized least-squares algorithm. Found. Comput. Math., 7: 331-368.

  • De Mol, C., E. De Vito and L. Rosasco, 2009. Elastic-net regularization in learning theory. J. Complexity, 25: 201-230.

  • Gnecco, G. and M. Sanguineti, 2009. The weight-decay technique in learning from data: An optimization point of view. Comput. Manage. Sci., 6: 53-79.

  • Mairal, J., F. Bach, J. Ponce and G. Sapiro, 2010. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11: 19-60.

  • Schmidt, E., 1907. Zur Theorie der linearen und nichtlinearen Integralgleichungen. Math. Ann., 63: 433-476.

  • Aronszajn, N., 1950. Theory of reproducing kernels. Trans. Am. Math. Soc., 68: 337-404.

  • Carmeli, C., E. De Vito and A. Toigo, 2006. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Anal. Appl., 4: 377-408.

  • Christmann, A. and I. Steinwart, 2007. Consistency and robustness of kernel-based regression methods. Bernoulli, 13: 799-819.

  • Christmann, A. and I. Steinwart, 2008. Consistency of kernel-based quantile regression. Appl. Stoch. Model. Bus. Ind., 24: 171-183.

  • Cucker, F. and D.X. Zhou, 2007. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, New York, ISBN: 9780521865593, Pages: 236.

  • Evgeniou, T., M. Pontil and T. Poggio, 2000. Regularization networks and support vector machines. Adv. Comput. Math., 13: 1-50.

  • Hiriart-Urruty, J.B. and C. Lemaréchal, 2001. Fundamentals of Convex Analysis. Springer, New York, ISBN: 9783540422051, Pages: 259.

  • Sheng, B., Z.X. Chen, J.L. Wang and P.X. Ye, 2012. Learning rates of Tikhonov regularized regressions based on sample dependent RKHS. J. Comput. Anal. Appl., 14: 341-359.

  • Huaisheng, B.A.O. and P.X. Ye, 2011. Error analysis for support vector machine classifiers on unit sphere of Euclidean space. J. Comput. Inform. Syst., 7: 3023-3030.

  • Sheng, B. and P.X. Ye, 2011. Least square regression learning with data dependent hypothesis and coefficient regularization. J. Comput., 6: 671-675.

  • Sheng, B. and D. Xiang, 2011. The consistency analysis of coefficient regularized classification with convex loss. WSEAS Trans. Math., 10: 291-300.

  • Sheng, B. and D. Xiang, 2012. Bound the learning rates with generalized gradients. WSEAS Trans. Signal Process., 8: 1-10.

  • Sheng, B., P.X. Ye and J.L. Wang, 2012. Learning rates for least square regressions with coefficient regularization. Acta Math. Sin. English Ser., 28: 2205-2212.

  • Sheng, B., P.X. Ye and Z.X. Chen, 2012. The Non-Smooth Analysis Method in Kernel Learning. Science Press, Beijing, China.

  • Koltchinskii, V., 2006. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat., 34: 2593-2656.
