Abstract: We estimate the error of least squares regression with data-dependent hypothesis spaces and coefficient regularization algorithms based on a general kernel. When the kernel belongs to a certain class of Mercer kernels, under a very mild regularity condition on the regression function, we derive a dimension-free learning rate m^{-1/6}.
INTRODUCTION
Kernel-based least squares regression learning has been a very popular field in recent years (Vapnik, 1995; Cucker and Smale, 2001, 2002; De Vito et al., 2005; Wu et al., 2006; Caponnetto and De Vito, 2007; De Mol et al., 2009; Gnecco and Sanguineti, 2009; Mairal et al., 2010; Schmit, 1907; Koltchinskii, 2006; Aronszajn, 1950). Some of its mathematical foundations have been established (Carmeli et al., 2006; Christmann and Steinwart, 2007, 2008; Cucker and Zhou, 2007; Evgeniou et al., 2000; Hiriart-Urruty and Lemarechas, 2001; Sheng et al., 2012a-c; Sheng and Ye, 2011; Huaisheng and Ye, 2011; Sheng and Xiang, 2011, 2012). In this study, we investigate some mathematical aspects of least squares learning algorithms. We consider error estimates for least squares regression learning with a general kernel and coefficient regularization. In particular, when the kernel belongs to a certain class of Mercer kernels, under a very mild regularity condition on the regression function, we derive a dimension-free learning rate m^{-1/6}.
Let us formulate the problem of learning in a standard way. Let X⊂ℝ^n be a compact input space, Y = ℝ the output space and ρ a Borel probability measure on Z = X×Y. For a function f: X→Y, define its error as

E(f) = ∫_{X×Y} (f(x) − y)^2 dρ.

For each input x∈X and output y∈Y, (f(x) − y)^2 is the error suffered from the use of f as a model for the process producing y from x. By integrating over X×Y (with respect to ρ, of course), we average out the error over all pairs (x, y). Hence the word error for E(f).
For every x∈X, let ρ(y|x) be the conditional (with respect to x) probability measure on Y. Let also ρ_X be the marginal probability measure of ρ on X, that is, the measure on X defined by ρ_X(S) = ρ(π^{-1}(S)), where π: X×Y→X is the projection. For every integrable function φ: X×Y→ℝ, a version of Fubini's theorem relates ρ, ρ(y|x) and ρ_X as follows:

∫_{X×Y} φ(x, y) dρ = ∫_X ( ∫_Y φ(x, y) dρ(y|x) ) dρ_X(x).
This breaking of ρ into the measures ρ(y|x) and ρ_X corresponds to looking at X×Y as the product of an input domain X and an output set Y. In what follows, unless otherwise specified, integrals are taken with respect to ρ, ρ(y|x) or ρ_X.
Define f_ρ: X→Y by:

f_ρ(x) = ∫_Y y dρ(y|x),   x∈X.

The function f_ρ is called the regression function of ρ. For each x∈X, f_ρ(x) is the average of the y coordinate on {x}×Y (in topological terms, the average of y on the fiber over x).
It is clear that if f_ρ∈L^2(ρ_X), then it minimizes the error E(f) over all f∈L^2(ρ_X). Thus, in the sense of the error E(·), the regression function f_ρ is the best description of the relation between inputs x∈X and outputs y∈Y.
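As a quick numerical illustration of this minimizing property (not from the paper; the toy distribution and all names below are hypothetical), the conditional mean attains a smaller Monte Carlo error than any competing predictor:

```python
import numpy as np

# Toy distribution: x ~ Uniform[0, 1], y = sin(2*pi*x) + Gaussian noise,
# so the regression function is f_rho(x) = E[y | x] = sin(2*pi*x).
rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

def error(f):
    """Monte Carlo estimate of E(f) = ∫ (f(x) - y)^2 dρ."""
    return np.mean((f(x) - y) ** 2)

f_rho = lambda t: np.sin(2 * np.pi * t)            # the regression function
f_other = lambda t: np.sin(2 * np.pi * t) + 0.2    # a perturbed competitor

print(error(f_rho))    # ~0.09, the noise variance (smallest achievable error)
print(error(f_other))  # ~0.13, strictly larger
```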
In most cases, the distribution ρ(x, y) is unknown and what one can observe is a set of samples z = {z_i}_{i=1}^m = {(x_i, y_i)}_{i=1}^m∈Z^m, drawn independently and identically distributed according to ρ(x, y). Our goal is to find an estimator f_z, on the basis of the given data z, that approximates f_ρ well with high probability. This is an ill-posed problem and a regularization technique is needed. In many areas of machine learning, the following Tikhonov regularization scheme is commonly used to overcome the ill-posedness:
f_z = arg min_{f∈H} [ (1/m) Σ_{i=1}^m (f(x_i) − y_i)^2 + λ ‖f‖_H^2 ],   λ>0   (1)
Usually, H is taken to be a Reproducing Kernel Hilbert Space (RKHS) induced by a Mercer kernel, that is, a kernel which is continuous, symmetric and positive semi-definite on X×X (Vapnik, 1995; Cucker and Smale, 2001, 2002; De Vito et al., 2005; Wu et al., 2006; Caponnetto and De Vito, 2007; De Mol et al., 2009; Gnecco and Sanguineti, 2009). The RKHS H_K associated with the kernel K is defined to be the closure of the linear span of the set of functions:
{K_x = K(x, ·): x∈X}
with the inner product ⟨K_x, K_y⟩_{H_K} = K(x, y). The reproducing property takes the form:
f(x) = ⟨f, K_x⟩_{H_K},   for all f∈H_K and x∈X   (2)
This kind of kernel scheme has been studied in a large body of literature, cf. (Vapnik, 1995; Cucker and Smale, 2001, 2002; De Vito et al., 2005; Wu et al., 2006; Caponnetto and De Vito, 2007; De Mol et al., 2009; Gnecco and Sanguineti, 2009; Mairal et al., 2010; Schmit, 1907; Koltchinskii, 2006).
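For the RKHS scheme (1), the representer theorem reduces the minimization to an m×m linear system. The following is a minimal sketch of that classical solution; the Gaussian kernel, the synthetic data and all names are illustrative choices, not taken from the paper:

```python
import numpy as np

def gaussian_kernel(s, t, sigma=0.5):
    # One concrete Mercer kernel: continuous, symmetric, positive semi-definite.
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2.0 * sigma ** 2))

def fit_rkhs(x, y, lam, kernel=gaussian_kernel):
    """Solve scheme (1) with H = H_K: by the representer theorem the minimizer is
    f_z = sum_j alpha_j K_{x_j}, with coefficients solving (K + lam*m*I) alpha = y."""
    m = len(x)
    K = kernel(x, x)
    alpha = np.linalg.solve(K + lam * m * np.eye(m), y)
    return lambda t: kernel(t, x) @ alpha

# Usage on synthetic data.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=50)
f_z = fit_rkhs(x, y, lam=1e-3)
print(f_z(np.array([0.25, 0.5, 0.75])))
```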
In this study, we consider a different kernel scheme. Let K: X×X→ℝ be a general kernel, that is, a continuous and bounded function which need not be symmetric or positive semi-definite. For a sample z, every function of the data-dependent hypothesis space

H_{K,z} = { f_α(x) = Σ_{i=1}^m α_i K(x, x_i): α = (α_1, ..., α_m)∈ℝ^m }

is determined by its coefficients, and the penalty is imposed on these coefficients. This leads to the general coefficient regularized scheme:
α_z = arg min_{α∈ℝ^m} [ (1/m) Σ_{i=1}^m (f_α(x_i) − y_i)^2 + λ Ω(α) ],   f_z = f_{α_z}   (3)
where Ω(α) is a positive penalty function on ℝ^m and λ>0 is a regularization parameter.
Formulation (3) is a data-dependent scheme which has found many applications in the design of support vector machines, micro-array analysis and variable selection (Mairal et al., 2010; Schmit, 1907; Koltchinskii, 2006; Aronszajn, 1950). We now study a particular coefficient regularization.
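To make scheme (3) concrete, here is a small sketch of the data-dependent hypothesis function f_α and of the regularized empirical objective for an arbitrary penalty Ω; the function names and the example l^1 penalty are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def f_alpha(t, x_sample, alpha, kernel):
    """Hypothesis function f_alpha(t) = sum_i alpha_i K(t, x_i);
    the general kernel K need not be symmetric or positive semi-definite."""
    return kernel(t, x_sample) @ alpha

def objective(alpha, x_sample, y_sample, lam, kernel, penalty):
    """Regularized empirical risk of scheme (3):
    (1/m) * sum_i (f_alpha(x_i) - y_i)^2 + lam * Omega(alpha)."""
    m = len(x_sample)
    r = f_alpha(x_sample, x_sample, alpha, kernel) - y_sample
    return r @ r / m + lam * penalty(alpha)

# One possible choice of penalty: Omega(alpha) = sum_i |alpha_i|, an l1 penalty
# of the kind used in lasso-type coefficient regularization.
l1_penalty = lambda a: np.sum(np.abs(a))
```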
We endow ℝ^m with the standard l^2 norm, ‖α‖_2^2 = Σ_{i=1}^m α_i^2, and take the penalty Ω(α) = ‖α‖_2^2. This gives the following coefficient regularization scheme with l^2-penalization:

α_z = arg min_{α∈ℝ^m} [ (1/m) Σ_{i=1}^m (f_α(x_i) − y_i)^2 + λ Σ_{i=1}^m α_i^2 ]   (4)
Equation 4 is a strictly convex optimization problem whose solution can be analyzed with tools from convex analysis (Aronszajn, 1950). Based on this consideration, we shall give an explicit expression for the solution of Eq. 4, which, together with an inequality for convex functions, shows the robustness of the solutions (Lemma 3). Thus, we will use a new approach to estimate the learning rate ‖f_{α_z} − f_ρ‖_{ρ_X}^2.
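The paper obtains its explicit expression for the solution of Eq. 4 from convex analysis; that expression is not reproduced here, but under the l^2-penalty normalization assumed above (the scaling is our assumption), setting the gradient of the objective to zero already yields a closed form, sketched below:

```python
import numpy as np

def solve_l2_coefficient_regularization(x, y, lam, kernel):
    """Minimize (1/m) * sum_i (f_alpha(x_i) - y_i)^2 + lam * ||alpha||_2^2.
    With G[i, j] = K(x_i, x_j) (not necessarily symmetric), the first-order
    condition reads (G^T G + lam*m*I) alpha = G^T y; the matrix on the left is
    positive definite for lam > 0, so the minimizer alpha_z is unique."""
    m = len(x)
    G = kernel(x, x)
    alpha_z = np.linalg.solve(G.T @ G + lam * m * np.eye(m), G.T @ y)
    f_z = lambda t: kernel(t, x) @ alpha_z
    return alpha_z, f_z
```

The strict convexity of the objective in α is what guarantees that this linear system has a unique solution, and it underlies the robustness property mentioned above.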
For this purpose, we define the integral regularized risk scheme corresponding to Eq. 4 as:
α(ρ) = arg min_{α∈ℝ^m} [ ∫_Z (f_α(x) − y)^2 dρ + λ Σ_{i=1}^m α_i^2 ]   (5)
Then, we have the following error decomposition:
‖f_{α_z} − f_ρ‖_{ρ_X}^2 = E(f_{α_z}) − E(f_ρ) = [E(f_{α_z}) − E(f_{α(ρ)})] + [E(f_{α(ρ)}) − E(f_ρ)]   (6)
where the first term on the right-hand side is called the sample error and the second term is called the approximation error. Thus, the estimate of the learning error is reduced to estimates of the sample error and the approximation error.
In this study, we assume |y|<M almost surely. Then the regression function f_ρ is bounded and square integrable with respect to ρ_X. For the kernel function K, we only assume that it is continuous and bounded. We denote:

κ = sup_{x,t∈X} |K(x, t)| < ∞.
In the study of Gnecco and Sanguineti (2009), the following results have been obtained:
Theorem 1: Let K(x, y) be a general kernel on X×X and let α_z and α(ρ) be defined as in Eq. 4 and 5, respectively. Then, for any 0<δ<1, with confidence 1−δ, there holds:
(7) |
Theorem 2: Under the assumptions of Theorem 1, for any 0<δ<1, with confidence 1−δ, there holds:
(8) |
where:
Define the integral operator corresponding to K(x, y) as:

L_K(f, x) = ∫_X K(x, t) f(t) dρ_X(t),   f∈L^2(ρ_X), x∈X.
This is a compact operator that maps the Hilbert space L^2(ρ_X) into itself.
We assume that f_ρ(x) = L_K(φ, x) for some φ∈L^2(ρ_X), that is, f_ρ lies in the range of L_K. Under this assumption, we obtain the following upper estimate for K(f_ρ, λ).
Theorem 3 (Sheng et al., 2012a): Let K(x, y) be a general kernel on X×X and let f_ρ(x) = L_K(φ, x) with φ∈L^2(ρ_X). Then:
• There is a discrete set
• If
where:
The above three theorems are the motivation for our present work. In the next section, we present our main results.
RESULTS
Here, we show that if K(x, y) belongs to a certain class of Mercer kernels, then a convergence rate for the functional K(f_ρ, λ) can be derived. Based on this result, we derive a dimension-free learning rate.
The integral operator:

L_K(f, x) = ∫_X K(x, t) f(t) dρ_X(t)

is a compact operator that maps the Hilbert space L^2(ρ_X) to itself. Then, by the Schmidt expansion (Schmit, 1907; Carmeli et al., 2006):
K(x, t) = Σ_{i≥1} λ_i φ_i(x) φ_i(t)   (9)
where λ_i are the eigenvalues of L_K and {φ_i} are the corresponding orthonormal eigenfunctions in L^2(ρ_X). Using this expansion, we construct a coefficient vector α(f_ρ) = (α_1(f_ρ), ..., α_m(f_ρ)) and provide an estimate for the associated regularized quantity, which can serve as an upper estimate for the functional K(f_ρ, λ).
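The ingredients of this construction can be approximated numerically from a sample drawn according to ρ_X via a standard Nyström-type discretization of L_K. This is only an illustration, not part of the paper's argument, and it assumes K is a symmetric positive semi-definite Mercer kernel as in this section:

```python
import numpy as np

def empirical_schmidt_expansion(t_sample, kernel):
    """Approximate the Schmidt expansion (9) from a sample t_1,...,t_n ~ rho_X:
    eigenvalues of (1/n)*G approximate lambda_i, and the rescaled eigenvectors
    approximate the eigenfunctions phi_i at the sample points."""
    n = len(t_sample)
    G = kernel(t_sample, t_sample)        # symmetric PSD Gram matrix
    evals, evecs = np.linalg.eigh(G / n)
    order = np.argsort(evals)[::-1]       # sort eigenvalues in decreasing order
    lambdas = evals[order]
    phis = np.sqrt(n) * evecs[:, order]   # column i ~ phi_i evaluated on the sample
    return lambdas, phis

def expansion_coefficients(f_values, phis):
    """Approximate alpha_i(f) = ∫ f(t) phi_i(t) d rho_X(t) by the sample mean
    (1/n) * sum_j f(t_j) phi_i(t_j)."""
    n = phis.shape[0]
    return phis.T @ f_values / n
```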
By Mercer's theorem (Carmeli et al., 2006), we know that the eigenvalues λ_i≥0. We assume further that 0<λ_i<1 and that the eigenfunctions
Define α_i(f) = ∫_X f(t)φ_i(t) dρ_X(t) for f∈L^2(ρ_X). Then our main results can be stated as the following theorems:
Theorem 4: If f_ρ satisfies:
(10) |
then, there is an absolute constant C>0 such that:
(11) |
where h is the VC dimension of the family of real-valued functions {K_x: x∈X}.
Theorem 5: If f_ρ satisfies (10), then, with high probability:
To prove Theorem 4, we will exploit the following known result on error bounds for sparse approximation. To this end, we first recall the concept of VC dimension. The VC dimension of a family F = {F_x} of real-valued functions on a set X is the maximum number h of points t_i in X that can be separated into two distinct classes in all 2^h possible ways by using functions of the form F_x(t)−α, where the parameters x and α vary in X and ℝ, respectively.
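This definition can be checked by brute force on finite grids; the sketch below is purely illustrative (the exhaustive search is exponential, so only tiny instances are feasible) and all names are ours:

```python
import itertools
import numpy as np

def shatters(values, idx):
    """values[j, i] holds F_{x_j}(t_i) on finite grids of parameters x_j and
    points t_i. Check whether the points selected by idx are separated in all
    2^h possible ways by sets of the form {t: F_x(t) - alpha > 0}."""
    h = len(idx)
    patterns = set()
    for row in values:
        v = row[list(idx)]
        # As alpha sweeps over the reals, the pattern (v > alpha) only changes
        # at the values of v, so these candidates realize every achievable pattern.
        for alpha in np.concatenate(([v.min() - 1.0], v)):
            patterns.add(tuple(v > alpha))
    return len(patterns) == 2 ** h

def vc_dimension(values):
    """Largest h such that some h-subset of the candidate points is shattered."""
    n_points = values.shape[1]
    best = 0
    for h in range(1, n_points + 1):
        if any(shatters(values, idx)
               for idx in itertools.combinations(range(n_points), h)):
            best = h
    return best

# e.g. values[j, i] = K(x_grid[j], t_grid[i]) estimates the VC dimension of the
# family {K_x} restricted to finite grids.
```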
Lemma 1 (Christmann and Steinwart, 2007): Let X⊂
(12) |
Now, we are ready to prove Theorem 4.
Proof of Theorem 4: The proof is based on the Schmidt expansion. In fact, since 0<λ_1<1, Eq. 9 implies:
and hence:
Take:
then, by Eq. 9 we know H∈C(X) and:
By Lemma 1, there are
Define:
Then:
and:
Therefore:
Thus we complete the proof of Theorem 4.
Theorem 5 can be derived easily from Theorems 2 and 4.
Proof of Theorem 5: Combining Theorems 2-4, we have:
Taking λ = m^{-1/3}, we get the desired result.
Finally, we point out that a similar rate can be derived for the classification problem. We will study this problem in future work.
ACKNOWLEDGMENTS
This work was supported in part by the Natural Science Foundation of China (Grant Nos. 11101220, 10971251, 11271199 and 11201104).