Abstract: We estimate the error of least squares regression with data-dependent hypothesis spaces and coefficient regularization algorithms based on a general kernel. When the kernel belongs to a certain class of Mercer kernels, under a very mild regularity condition on the regression function, we derive a dimension-free learning rate m^{-1/6}.
INTRODUCTION
Kernel-based least squares regression learning has been a very popular field in recent years (Vapnik, 1995; Cucker and Smale, 2001, 2002; De Vito et al., 2005; Wu et al., 2006; Caponnetto and De Vito, 2007; De Mol et al., 2009; Gnecco and Sanguineti, 2009; Mairal et al., 2010; Schmit, 1907; Koltchinskii, 2006; Aronszajn, 1950). Some of its mathematical foundations have been established (Carmeli et al., 2006; Christmann and Steinwart, 2007, 2008; Cucker and Zhou, 2007; Evgeniou et al., 2000; Hiriart-Urruty and Lemarechas, 2001; Sheng et al., 2012a-c; Sheng and Ye, 2011; Huaisheng and Ye, 2011; Sheng and Xiang, 2011, 2012). In this study, we investigate some mathematical aspects of least squares learning algorithms. We consider error estimates for least squares regression learning with a general kernel and coefficient regularization. In particular, when the kernel belongs to a certain class of Mercer kernels, under a very mild regularity condition on the regression function, we derive a dimension-free learning rate m^{-1/6}.
Let us formulate the problem of learning in a standard way. Let X⊂ℝ^n be a compact input space, Y = ℝ the output space and ρ a Borel probability measure on Z = X×Y. For a function f: X→Y, define its error as

E(f) = ∫_{X×Y} (f(x) − y)^2 dρ.

For each input x∈X and output y∈Y, (f(x) − y)^2 is the error suffered from the use of f as a model for the process producing y from x. By integrating over X×Y (with respect to ρ, of course), we average out the error over all pairs (x, y). Hence the word error for E(f).
For every x∈X, let ρ(y|x) be the conditional (with respect to x) probability measure on Y. Let also ρ_X be the marginal probability measure of ρ on X, that is, the measure on X defined by ρ_X(S) = ρ(π^{-1}(S)), where π: X×Y→X is the projection. For every integrable function φ: X×Y→ℝ, a version of Fubini's theorem relates ρ, ρ(y|x) and ρ_X as follows:

∫_{X×Y} φ(x, y) dρ = ∫_X ( ∫_Y φ(x, y) dρ(y|x) ) dρ_X(x).
This breaking of ρ into the measures ρ(y|x) and ρ_X corresponds to looking at X×Y as the product of an input domain X and an output set Y. In what follows, unless otherwise specified, integrals are taken with respect to ρ, ρ(y|x) or ρ_X.
Define f_ρ: X→Y by:

f_ρ(x) = ∫_Y y dρ(y|x),   x∈X.

The function f_ρ is called the regression function of ρ. For each x∈X, f_ρ(x) is the average of the y coordinate on {x}×Y (in topological terms, the average of y on the fiber over x).
It is clear that if f_ρ∈L^2(ρ_X), then it minimizes the error E(f) over all f∈L^2(ρ_X). Thus, in the sense of the error E(·), the regression function f_ρ is the best description of the relation between inputs x∈X and outputs y∈Y.
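As a quick numerical illustration of this minimizing property (not from the paper; the toy distribution and all names below are hypothetical), the conditional mean attains a smaller Monte Carlo error than any competing predictor:

```python
import numpy as np

# Toy distribution: x ~ Uniform[0, 1], y = sin(2*pi*x) + Gaussian noise,
# so the regression function is f_rho(x) = E[y | x] = sin(2*pi*x).
rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

def error(f):
    """Monte Carlo estimate of E(f) = ∫ (f(x) - y)^2 dρ."""
    return np.mean((f(x) - y) ** 2)

f_rho = lambda t: np.sin(2 * np.pi * t)            # the regression function
f_other = lambda t: np.sin(2 * np.pi * t) + 0.2    # a perturbed competitor

print(error(f_rho))    # ~0.09, the noise variance (smallest achievable error)
print(error(f_other))  # ~0.13, strictly larger
```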
In most cases, the distribution ρ(x, y) is unknown and what one can observe is a set of samples z = {z_i}_{i=1}^m = {(x_i, y_i)}_{i=1}^m∈Z^m, drawn independently and identically distributed according to ρ(x, y). Our goal is to find an estimator f_z, on the basis of the given data z, that approximates f_ρ well with high probability. This is an ill-posed problem and a regularization technique is needed. In many areas of machine learning, the following Tikhonov regularization scheme is commonly used to overcome the ill-posedness:
f_z = arg min_{f∈H} [ (1/m) Σ_{i=1}^m (f(x_i) − y_i)^2 + λ ‖f‖_H^2 ],   λ>0   (1)
Usually, H is taken to be a Reproducing Kernel Hilbert Space (RKHS) induced by a Mercer kernel, that is, a kernel which is continuous, symmetric and positive semi-definite on X×X (Vapnik, 1995; Cucker and Smale, 2001, 2002; De Vito et al., 2005; Wu et al., 2006; Caponnetto and De Vito, 2007; De Mol et al., 2009; Gnecco and Sanguineti, 2009). The RKHS H_K associated with the kernel K is defined to be the closure of the linear span of the set of functions:
{K_x = K(x, ·): x∈X}
with the inner product ⟨K_x, K_y⟩_{H_K} = K(x, y). The reproducing property takes the form:
f(x) = ⟨f, K_x⟩_{H_K},   for all f∈H_K and x∈X   (2)
This kind of kernel scheme has been studied in a large body of literature, cf. (Vapnik, 1995; Cucker and Smale, 2001, 2002; De Vito et al., 2005; Wu et al., 2006; Caponnetto and De Vito, 2007; De Mol et al., 2009; Gnecco and Sanguineti, 2009; Mairal et al., 2010; Schmit, 1907; Koltchinskii, 2006).
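For the RKHS scheme (1), the representer theorem reduces the minimization to an m×m linear system. The following is a minimal sketch of that classical solution; the Gaussian kernel, the synthetic data and all names are illustrative choices, not taken from the paper:

```python
import numpy as np

def gaussian_kernel(s, t, sigma=0.5):
    # One concrete Mercer kernel: continuous, symmetric, positive semi-definite.
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2.0 * sigma ** 2))

def fit_rkhs(x, y, lam, kernel=gaussian_kernel):
    """Solve scheme (1) with H = H_K: by the representer theorem the minimizer is
    f_z = sum_j alpha_j K_{x_j}, with coefficients solving (K + lam*m*I) alpha = y."""
    m = len(x)
    K = kernel(x, x)
    alpha = np.linalg.solve(K + lam * m * np.eye(m), y)
    return lambda t: kernel(t, x) @ alpha

# Usage on synthetic data.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=50)
f_z = fit_rkhs(x, y, lam=1e-3)
print(f_z(np.array([0.25, 0.5, 0.75])))
```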
In this study, we consider a different kernel scheme. Let K: X×X→ℝ be a general kernel, that is, a continuous and bounded function which need not be symmetric or positive semi-definite. For a sample z, every function of the data-dependent hypothesis space

H_{K,z} = { f_α(x) = Σ_{i=1}^m α_i K(x, x_i): α = (α_1, ..., α_m)∈ℝ^m }

is determined by its coefficients, and the penalty is imposed on these coefficients. This leads to the general coefficient regularized scheme:
α_z = arg min_{α∈ℝ^m} [ (1/m) Σ_{i=1}^m (f_α(x_i) − y_i)^2 + λ Ω(α) ],   f_z = f_{α_z}   (3)
where Ω(α) is a positive penalty function on ℝ^m and λ>0 is a regularization parameter.
Formulation (3) is a data-dependent scheme which has found many applications in the design of support vector machines, micro-array analysis and variable selection (Mairal et al., 2010; Schmit, 1907; Koltchinskii, 2006; Aronszajn, 1950). We now study a particular coefficient regularization.
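To make scheme (3) concrete, here is a small sketch of the data-dependent hypothesis function f_α and of the regularized empirical objective for an arbitrary penalty Ω; the function names and the example l^1 penalty are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def f_alpha(t, x_sample, alpha, kernel):
    """Hypothesis function f_alpha(t) = sum_i alpha_i K(t, x_i);
    the general kernel K need not be symmetric or positive semi-definite."""
    return kernel(t, x_sample) @ alpha

def objective(alpha, x_sample, y_sample, lam, kernel, penalty):
    """Regularized empirical risk of scheme (3):
    (1/m) * sum_i (f_alpha(x_i) - y_i)^2 + lam * Omega(alpha)."""
    m = len(x_sample)
    r = f_alpha(x_sample, x_sample, alpha, kernel) - y_sample
    return r @ r / m + lam * penalty(alpha)

# One possible choice of penalty: Omega(alpha) = sum_i |alpha_i|, an l1 penalty
# of the kind used in lasso-type coefficient regularization.
l1_penalty = lambda a: np.sum(np.abs(a))
```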
We endow ℝ^m with the standard l^2 norm, ‖α‖_2^2 = Σ_{i=1}^m α_i^2, and take the penalty Ω(α) = ‖α‖_2^2. This gives the following coefficient regularization scheme with l^2-penalization:

α_z = arg min_{α∈ℝ^m} [ (1/m) Σ_{i=1}^m (f_α(x_i) − y_i)^2 + λ Σ_{i=1}^m α_i^2 ]   (4)
Equation 4 is a strictly convex optimization problem whose solution can be analyzed with tools from convex analysis (Aronszajn, 1950). Based on this consideration, we shall give an explicit expression for the solution of Eq. 4, which, together with an inequality for convex functions, shows the robustness of the solutions (Lemma 3). Thus, we will use a new approach to estimate the learning rate ‖f_{α_z} − f_ρ‖_{ρ_X}^2.
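The paper obtains its explicit expression for the solution of Eq. 4 from convex analysis; that expression is not reproduced here, but under the l^2-penalty normalization assumed above (the scaling is our assumption), setting the gradient of the objective to zero already yields a closed form, sketched below:

```python
import numpy as np

def solve_l2_coefficient_regularization(x, y, lam, kernel):
    """Minimize (1/m) * sum_i (f_alpha(x_i) - y_i)^2 + lam * ||alpha||_2^2.
    With G[i, j] = K(x_i, x_j) (not necessarily symmetric), the first-order
    condition reads (G^T G + lam*m*I) alpha = G^T y; the matrix on the left is
    positive definite for lam > 0, so the minimizer alpha_z is unique."""
    m = len(x)
    G = kernel(x, x)
    alpha_z = np.linalg.solve(G.T @ G + lam * m * np.eye(m), G.T @ y)
    f_z = lambda t: kernel(t, x) @ alpha_z
    return alpha_z, f_z
```

The strict convexity of the objective in α is what guarantees that this linear system has a unique solution, and it underlies the robustness property mentioned above.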
For this purpose, we define the integral regularized risk scheme corresponding to Eq. 4 as:
α(ρ) = arg min_{α∈ℝ^m} [ ∫_Z (f_α(x) − y)^2 dρ + λ Σ_{i=1}^m α_i^2 ]   (5)
Then, we have the following error decomposition:
‖f_{α_z} − f_ρ‖_{ρ_X}^2 = E(f_{α_z}) − E(f_ρ) = [E(f_{α_z}) − E(f_{α(ρ)})] + [E(f_{α(ρ)}) − E(f_ρ)]   (6)
where the first term on the right-hand side is called the sample error and the second term is called the approximation error. Thus, the estimate of the learning error is reduced to estimates of the sample error and the approximation error.
In this study, we assume |y|<M almost surely. Then the regression function f_ρ is bounded and square integrable with respect to ρ_X. For the kernel function K, we only assume that it is continuous and bounded. We denote:

κ = sup_{x,t∈X} |K(x, t)| < ∞.
In the study of Gnecco and Sanguineti (2009), the following results have been obtained:
Theorem 1: Let K(x, y) be a general kernel on X×X and let α_z and α(ρ) be defined as in Eq. 4 and 5, respectively. Then, for any 0<δ<1, with confidence 1−δ, there holds:
(7) |
Theorem 2: Under the assumptions of Theorem 1, for any 0<δ<1, with confidence 1−δ, there holds:
(8) |
where:
Define the integral operator corresponding to K(x, y) as:

L_K(f, x) = ∫_X K(x, t) f(t) dρ_X(t),   f∈L^2(ρ_X), x∈X.
This is a compact operator that maps the Hilbert space L^2(ρ_X) into itself.
We assume that f_ρ(x) = L_K(φ, x) for some φ∈L^2(ρ_X), that is, f_ρ lies in the range of L_K. Under this assumption, we obtain the following upper estimate for K(f_ρ, λ).
Theorem 3 (Sheng et al., 2012a): Let K(x, y) be a general kernel on X×X and let f_ρ(x) = L_K(φ, x) with φ∈L^2(ρ_X). Then:
• There is a discrete set
• If
where:
The above three theorems are the motivation for our present work. In the next section, we present our main results.
RESULTS
Here, we show that if K(x, y) belongs to a certain class of Mercer kernels, then a convergence rate for the functional K(f_ρ, λ) can be derived. Based on this result, we derive a dimension-free learning rate.
The integral operator:

L_K(f, x) = ∫_X K(x, t) f(t) dρ_X(t)

is a compact operator that maps the Hilbert space L^2(ρ_X) to itself. Then, by the Schmidt expansion (Schmit, 1907; Carmeli et al., 2006):
K(x, t) = Σ_{i≥1} λ_i φ_i(x) φ_i(t)   (9)
where λ_i are the eigenvalues of L_K and {φ_i} are the corresponding orthonormal eigenfunctions in L^2(ρ_X). Using this expansion, we construct a coefficient vector α(f_ρ) = (α_1(f_ρ), ..., α_m(f_ρ)) and provide an estimate for the associated regularized quantity, which can serve as an upper estimate for the functional K(f_ρ, λ).
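The ingredients of this construction can be approximated numerically from a sample drawn according to ρ_X via a standard Nyström-type discretization of L_K. This is only an illustration, not part of the paper's argument, and it assumes K is a symmetric positive semi-definite Mercer kernel as in this section:

```python
import numpy as np

def empirical_schmidt_expansion(t_sample, kernel):
    """Approximate the Schmidt expansion (9) from a sample t_1,...,t_n ~ rho_X:
    eigenvalues of (1/n)*G approximate lambda_i, and the rescaled eigenvectors
    approximate the eigenfunctions phi_i at the sample points."""
    n = len(t_sample)
    G = kernel(t_sample, t_sample)        # symmetric PSD Gram matrix
    evals, evecs = np.linalg.eigh(G / n)
    order = np.argsort(evals)[::-1]       # sort eigenvalues in decreasing order
    lambdas = evals[order]
    phis = np.sqrt(n) * evecs[:, order]   # column i ~ phi_i evaluated on the sample
    return lambdas, phis

def expansion_coefficients(f_values, phis):
    """Approximate alpha_i(f) = ∫ f(t) phi_i(t) d rho_X(t) by the sample mean
    (1/n) * sum_j f(t_j) phi_i(t_j)."""
    n = phis.shape[0]
    return phis.T @ f_values / n
```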
By Mercer's theorem (Carmeli et al., 2006), we know that the eigenvalues λ_i≥0. We assume further that 0<λ_i<1 and that the eigenfunctions
Define α_i(f) = ∫_X f(t)φ_i(t) dρ_X(t) for f∈L^2(ρ_X). Then our main results can be stated as the following theorems:
Theorem 4: If f_ρ satisfies:
(10) |
then, there is an absolute constant C>0 such that:
(11) |
where h is the VC dimension of the family of real-valued functions {K_x: x∈X}.
Theorem 5: If f_ρ satisfies (10), then, with high probability:
To prove Theorem 4, we will exploit the following known result on error bounds for sparse approximation. To this end, we first recall the concept of VC dimension. The VC dimension of a family F = {F_x} of real-valued functions on a set X is the maximum number h of points t_i in X that can be separated into two distinct classes in all 2^h possible ways by using functions of the form F_x(t)−α, where the parameters x and α vary in X and ℝ, respectively.
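This definition can be checked by brute force on finite grids; the sketch below is purely illustrative (the exhaustive search is exponential, so only tiny instances are feasible) and all names are ours:

```python
import itertools
import numpy as np

def shatters(values, idx):
    """values[j, i] holds F_{x_j}(t_i) on finite grids of parameters x_j and
    points t_i. Check whether the points selected by idx are separated in all
    2^h possible ways by sets of the form {t: F_x(t) - alpha > 0}."""
    h = len(idx)
    patterns = set()
    for row in values:
        v = row[list(idx)]
        # As alpha sweeps over the reals, the pattern (v > alpha) only changes
        # at the values of v, so these candidates realize every achievable pattern.
        for alpha in np.concatenate(([v.min() - 1.0], v)):
            patterns.add(tuple(v > alpha))
    return len(patterns) == 2 ** h

def vc_dimension(values):
    """Largest h such that some h-subset of the candidate points is shattered."""
    n_points = values.shape[1]
    best = 0
    for h in range(1, n_points + 1):
        if any(shatters(values, idx)
               for idx in itertools.combinations(range(n_points), h)):
            best = h
    return best

# e.g. values[j, i] = K(x_grid[j], t_grid[i]) estimates the VC dimension of the
# family {K_x} restricted to finite grids.
```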
Lemma 1 (Christmann and Steinwart, 2007): Let X⊂
(12) |
Now, we are ready to prove Theorem 4.
Proof of Theorem 4: The proof is based on the Schmidt expansion. In fact, since 0<λ_1<1, Eq. 9 implies:
and hence:
Take:
then, by Eq. 9 we know H∈C(X) and:
By Lemma 1, there are
Define:
Then:
and:
Therefore:
Thus we complete the proof of Theorem 4.
Theorem 5 can be derived easily from Theorems 2 and 4.
Proof of Theorem 5: Combining Theorems 2-4, we have:
Taking λ = m^{-1/3}, we get the desired result.
Finally, we point out that a similar rate can be derived for the classification problem. We will study this problem in future work.
ACKNOWLEDGMENTS
This work was supported in part by the Natural Science Foundation of China (Grant Nos. 11101220, 10971251, 11271199 and 11201104).