INTRODUCTION
When two or more independent variables of a linear regression are highly correlated
with each other, multicollinearity is said to exist. Multicollinearity produces
unexpectedly large standard errors of the Ordinary Least Squares (OLS) estimates.
The presence of multicollinearity in a data set is suggested by nonsignificant
results in individual tests on the regression coefficients of important explanatory
variables that are in fact significant. Since multicollinearity causes
major interpretive problems in regression analysis, it is crucial to investigate
and detect its presence in order to reduce its destructive effects on the regression
estimates. There are several primary sources of multicollinearity, such as
the data collection method employed, constraints on the model or on the population
being sampled, model specification and an overdetermined model (Montgomery
et al., 2001). It is now evident that high leverage points which
fall far from the majority of the explanatory variables are another source of
multicollinearity (Kamruzzaman and Imon, 2002). Such
points are classified as good or bad leverage points according to whether or not they
follow the same regression line as the rest of the data. Furthermore,
collinearity-influential observations are observations that can change
the collinearity pattern of the data; they may either enhance or reduce collinearity
in the data set. Not all high leverage points are collinearity-influential
observations, and vice versa (Hadi, 1988). High leverage points of large
magnitude that occur in more than one explanatory variable
may be collinearity-enhancing observations: they can induce collinearity among the
explanatory variables of an otherwise noncollinear data set. Among the various multicollinearity
diagnostics in the literature, the Variance Inflation Factor
(VIF) is the most commonly used method (Montgomery et al.,
2001; Belsley et al., 1980), and it is sensitive
to the presence of high leverage points. It is worth mentioning that the
identification of high leverage points as a new source of multicollinearity
in noncollinear data sets by Kamruzzaman and Imon
(2002) is based on classical diagnostic methods. It is important to note that the
estimation of regression parameters is unbiased in the presence of multicollinearity
(Montgomery et al., 2001; Belsley
et al., 1980). However, when high leverage points cause collinearity
in the data set, the coefficient estimates become biased, which is
undesirable (Bagheri and Midi, 2009). It is therefore
imperative to investigate the source of collinearity in the data set, since
the remedial measures for the multicollinearity problem depend heavily
on understanding the differences among these sources
(Montgomery et al., 2001). The use of classical
diagnostic methods for the detection of multicollinearity caused by high leverage
points may produce misleading conclusions. In collinear and noncollinear data
sets, these points have different effects on the classical diagnostic methods:
in their presence, the classical methods sometimes indicate
noncollinearity for a collinear data set or collinearity for a noncollinear
data set. Therefore, if high leverage points are the only source of multicollinearity
in the data set, robust methods can be employed to remedy the problem. In this
situation, other remedial measures for the multicollinearity problem, such as
Principal Component Regression (PCR) (Jolliffe, 1982),
are not necessary. Unfortunately, little work has explored the effect
of high leverage points on the classical multicollinearity diagnostic methods.
In the presence of high leverage collinearity-influential observations,
classical multicollinearity diagnostics such as the VIF detect the
existence of collinearity in the data set but do not reveal the source
of the multicollinearity. Making the VIF resistant to these high leverage points
guides us to the source of multicollinearity in the data
set: high leverage points are the only source of multicollinearity if the Robust
VIF (RVIF) does not diagnose multicollinearity in their presence while, on the
contrary, the classical VIF suggests that multicollinearity exists in the data set.
For a thorough overview of robust methods, one can refer to Rousseeuw and Leroy
(2003), Wilcox (2005), Maronna et
al. (2006) and Andersen (2008). It is important
to point out that the formulation of the VIF is based on the coefficient of determination
of the fitted regression line when each explanatory variable is regressed
on the remaining predictor variables using the Ordinary Least Squares (OLS)
method. Nevertheless, it is now evident that outliers in the X or Y direction
have an undue effect on the OLS estimates (Midi et al.,
2009) and subsequently on the classical VIF. This situation has motivated us
to develop new robust VIFs based on a robust coefficient of determination
(RR^{2}). Hence, the robust coefficient of determination (RR^{2})
has to be formulated first, prior to proposing any robust VIF. Several robust
coefficients of determination exist in the robust-methods literature, such as
those of Rousseeuw and Hubert (1997) and the S-PLUS 6 Robust Library
User's Guide (2001). The Generalized M-estimator (GM-estimator) is one
of the robust methods with desirable properties that attempts to downweight
high leverage points as well as large residuals. In this study, prior to developing
a robust VIF, a new GM-estimator is proposed. The proposed method incorporates
the S-estimator defined by Rousseeuw and Yohai (1984) and
the DRGP (MVE) introduced by Midi et al. (2009).
A robust coefficient of determination is then proposed, based on applying the new
GM-estimator to fit the regression model. Following this, the robust VIF is
developed. Moreover, another robust VIF is proposed by employing a robust coefficient
of determination based on the MM-estimator introduced by Yohai
(1987). Our newly proposed robust VIFs are applied to a well-known noncollinear
data set. To compare the performance of these robust multicollinearity diagnostic
methods, a Monte Carlo simulation study is also carried out.
MATERIALS AND METHODS
New proposed robust estimator: The multiple regression model can be expressed as:

Y = Xβ + ε     (1)

where, Y is an n×1 vector of responses (dependent variables), X is an n×p (n > p) matrix of predictors (explanatory variables), β is a p×1 vector of unknown finite parameters to be estimated and ε is an n×1 vector of random errors. When the Ordinary Least Squares (OLS) method is employed to estimate the regression parameters we obtain:

β̂_{OLS} = (X^{T}X)^{-1}X^{T}Y     (2)
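As a quick illustration of Eq. 1 and 2, the OLS estimate can be computed directly; the following sketch uses synthetic data and names of our own choosing (it is not part of the study's code):

```python
import numpy as np

# Synthetic data for the model Y = X*beta + error (names are illustrative only).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# OLS estimate beta_hat = (X'X)^{-1} X'y; lstsq is the numerically stable route.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to beta_true
```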
In the presence of outliers in the X or Y direction, the OLS estimates
are not reliable and robust methods become necessary. Robust
estimators aim to fit the model to the majority of the data by downweighting
the outliers. One class of robust methods resistant to high
leverage points is the GM-estimators (Hill, 1977),
introduced by Schweppe. The major aim of these methods is to downweight
high leverage points that have large residuals, i.e., bad leverage points. These
estimators have high efficiency and bounded-influence properties but achieve
only a moderate breakdown point equal to 1/p (Simpson, 1995).
The GM-estimator is the solution of the normal equation:

Σ_{i=1}^{n} π_{i} ψ(r_{i}/(π_{i}s)) x_{i} = 0     (3)

where, π_{i} is defined to downweight high leverage points with large residuals, r_{i} is the i-th residual, s is a robust scale estimate and ψ may be a monotone function such as Huber's ψ-function, which is defined as:

ψ(u) = u if |u| ≤ k;  ψ(u) = k·sign(u) if |u| > k     (4)

It is noticeable that k = 1.345 has been chosen to achieve 95% efficiency under a normal error distribution. Iteratively Reweighted Least Squares (IRLS) may be used to solve Eq. 3. At convergence, the GM-estimator can be written as:

β̂_{GM} = (X^{T}WX)^{-1}X^{T}WY     (5)

where, in this case, the diagonal elements of W are the weights w_{i} defined as:

w_{i} = ψ(r_{i}/(π_{i}s)) / (r_{i}/(π_{i}s))     (6)
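Huber's ψ-function and the resulting IRLS weights w_{i} = ψ(u)/u are straightforward to code; a minimal sketch (function names are our own):

```python
import numpy as np

def huber_psi(u, k=1.345):
    """Huber's monotone psi-function: identity inside [-k, k], clipped outside."""
    return np.clip(u, -k, k)

def irls_weight(u, k=1.345):
    """IRLS weight w = psi(u)/u, with w = 1 at u = 0; w < 1 downweights large residuals."""
    u = np.asarray(u, dtype=float)
    return np.where(u == 0, 1.0, huber_psi(u, k) / np.where(u == 0, 1.0, u))

print(irls_weight([0.0, 1.0, 3.0]))  # third entry downweighted to 1.345/3
```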
Multistage GM-estimators have been developed to overcome the weakness
of GM-estimators, namely their low breakdown point. These estimators can have a high
breakdown point if appropriate initial estimators are used. One of the first
GM-estimators with high efficiency, high breakdown point (50%) and bounded influence
was proposed by Coakley and Hettmansperger (1993)
(see also Wilcox, 2005). This method incorporates two of the most practical
high-breakdown robust estimators, the Least Median of Squares (LMS)
and the Least Trimmed Squares (LTS): the LTS estimator is used as the initial estimator
and the LMS estimator is integrated in the development of the scale estimate (Rousseeuw,
1984). The Robust Mahalanobis Distance (RMD) based on the Minimum Volume Ellipsoid
(MVE) estimator (Rousseeuw and van Zomeren, 1990), denoted RMD (MVE), serves
as the leverage estimate. Rousseeuw (1985) defined
RMD (MVE) as:

RMD_{i} = [(x_{i} - T_{R}(X))^{T} C_{R}(X)^{-1} (x_{i} - T_{R}(X))]^{1/2}     (7)

where, T_{R}(X) and C_{R}(X) are the robust location and shape
estimates of the MVE. In this estimator, the π-weight is the ratio of the χ^{2}
cutoff value to the squared Robust Mahalanobis Distance. A one-step Newton-Raphson
scheme is used as the convergence approach (for more details, one can refer to Wilcox
(2005)).
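To illustrate Eq. 7, squared robust distances can be computed from a robust location/shape estimate; the sketch below substitutes scikit-learn's Minimum Covariance Determinant (MCD) for the MVE, since the two play the same role here (this substitution and all names are our own):

```python
import numpy as np
from sklearn.covariance import MinCovDet  # MCD stands in for the MVE here

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:5] += 10.0  # plant five high leverage points

mcd = MinCovDet(random_state=0).fit(X)
rmd2 = mcd.mahalanobis(X)  # squared robust Mahalanobis distances
cutoff = 7.815             # chi-square 0.95 quantile with 3 degrees of freedom
print(np.sort(np.nonzero(rmd2 > cutoff)[0])[:5])  # includes the planted points 0..4
```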
One of the drawbacks of this estimator lies in the definition of the π-weight,
which depends on RMD (MVE). The RMD (MVE) tends to swamp some low leverage points
even though it identifies high leverage points correctly; thus, it also assigns
low weights to some of the good leverage points (Midi
et al., 2009; Imon et al., 2009).
Improving the precision of these estimators requires an effective diagnostic
method that identifies an appropriate π-weight. Simpson
(1995) discussed several types of multistage GM-estimators and a comparative
evaluation of the existing robust estimators. In this study, a new GM-estimator
whose major aim is to downweight high leverage points with large
residuals is utilized in developing our proposed method. The new estimator,
which we call the GM (DRGP)-estimator, is proposed by employing the S-estimator
as the initial estimator, instead of the LTS estimator, in the GM-estimator algorithm
of Coakley and Hettmansperger (1993). It has been verified
that the asymptotic efficiency of this estimator is high; see Andersen
(2008) for more about the S-estimator and its properties. To overcome
the shortcoming of the π-weight in their algorithm, we propose to employ the Diagnostic
Robust Generalized Potential based on MVE (DRGP (MVE)) proposed by
Midi et al. (2009). This recent diagnostic method
for high leverage points has the attractive feature of being able to identify
the exact number of high leverage points in the data set. The DRGP (MVE) computes generalized potentials p_{i} (Eq. 8) from the retained set R = {i : RMD_{i}^{2} < MAD cutoff(RMD_{i}^{2})}, where the nonparametric cutoff point, called the MAD cutoff, is defined as MAD cutoff(θ) = Median(θ) + K·MAD(θ) with K set to the constant value 2 or 3 and RMD_{i}^{2} as introduced in Eq. 7. Furthermore, D = R^{c}, where R^{c} denotes the complement of the set R. The merit of this method is that it swamps fewer good leverage points as high leverage points than the RMD (MVE) does.
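The MAD cutoff rule above is simple to compute; a minimal sketch (we use the raw, unscaled MAD, since the text does not fix a scaling constant; names are our own):

```python
import numpy as np

def mad_cutoff(theta, K=3):
    """Nonparametric cutoff: Median(theta) + K * MAD(theta), with K = 2 or 3."""
    theta = np.asarray(theta, dtype=float)
    med = np.median(theta)
    mad = np.median(np.abs(theta - med))  # raw (unscaled) median absolute deviation
    return med + K * mad

rmd2 = np.array([1.2, 0.8, 1.0, 1.1, 0.9, 25.0])  # one suspiciously large distance
c = mad_cutoff(rmd2, K=3)
R = np.nonzero(rmd2 < c)[0]   # retained set R
D = np.nonzero(rmd2 >= c)[0]  # suspected high leverage set D (complement of R)
print(D)  # only the last observation is flagged
```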
Hence, the proposed algorithm for finding the GM (DRGP)-estimator is as follows. To compute the GM-estimate, begin by setting k = 0 and computing the S-estimates of the intercept and slope parameters as initial values. Then proceed as follows:

Step 1: Compute the residuals of the initial S-estimator and scale them with a robust scale estimate
Step 2: Form the π-function from the DRGP (MVE) generalized potentials p_{i} defined in Eq. 8
Step 3: Define the initial weights, for i = 1,…,n, by applying the Huber ψ-function introduced in Eq. 4
Step 4: Use these weights to obtain weighted least squares estimates and increase k by 1
Step 5: Repeat Steps 1-4 until convergence, that is, iterate until the change in the estimated parameters is small
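The five steps above can be sketched as a single IRLS loop. Two stated simplifications are made in this illustrative sketch: an OLS fit stands in for the S-estimator initial fit, and the π-weights are taken as a given input (e.g., derived from the DRGP (MVE) potentials) rather than recomputed:

```python
import numpy as np

def gm_irls(X, y, pi, k=1.345, tol=1e-8, max_iter=100):
    """Sketch of the GM (DRGP) iteration (Steps 1-5).
    pi: per-observation leverage downweights in (0, 1], assumed given;
    an OLS fit stands in for the S-estimator initial fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # stand-in initial fit
    for _ in range(max_iter):
        r = y - X @ beta                           # Step 1: residuals
        s = np.median(np.abs(r)) / 0.6745 + 1e-12  # robust residual scale
        u = r / (pi * s)                           # Step 2: pi-scaled residuals
        psi = np.clip(u, -k, k)                    # Step 3: Huber psi
        w = np.where(u == 0, 1.0, psi / np.where(u == 0, 1.0, u))
        Xw = X * w[:, None]                        # Step 4: weighted least squares
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:  # Step 5: convergence check
            break
        beta = beta_new
    return beta
```

With pi set to 1 for every observation, this reduces to an ordinary Huber M-estimate computed by IRLS.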
Another high breakdown point estimator, defined by Rousseeuw
and Yohai (1984), is the S-estimator. These estimators are the solution
that finds the smallest possible dispersion of the residuals:

min_{β̂} ŝ(r_{1}(β̂),…,r_{n}(β̂))     (9)

Instead of minimizing the variance of the residuals, S-estimation minimizes a robust M-estimate of the residual scale ŝ, defined as the solution of:

(1/n) Σ_{i=1}^{n} ρ(r_{i}/ŝ) = b     (10)

where, b is a constant defined as b = E_{Φ}[ρ(e)] and Φ represents the standard normal distribution. Differentiating Eq. 10 leads to the following equation to be solved:

Σ_{i=1}^{n} ψ(r_{i}/ŝ) x_{i} = 0     (11)

where, ψ = ρ' is an appropriate weight function. The S-estimates have
efficiency higher than that of the LTS estimator relative to OLS (Croux
et al., 1994) and a high breakdown point of 50%.
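The M-scale equation above (mean of ρ(r_{i}/s) equal to b) can be solved numerically; a minimal sketch using the Tukey bisquare ρ normalized to [0, 1], with c ≈ 1.547 and b = 0.5 for a 50% breakdown point (the bisection routine and all names are our own):

```python
import numpy as np

def rho_bisquare(u, c=1.547):
    """Tukey bisquare rho, normalized so rho -> 1 as |u| -> infinity."""
    v = np.minimum(np.abs(u) / c, 1.0)
    return 1.0 - (1.0 - v**2) ** 3

def m_scale(r, b=0.5, c=1.547, tol=1e-10):
    """Solve (1/n) * sum rho(r_i / s) = b for s by bisection
    (the mean of rho is decreasing in s)."""
    r = np.asarray(r, dtype=float)
    lo, hi = 1e-12, np.max(np.abs(r)) + 1e-12
    while np.mean(rho_bisquare(r / hi, c)) > b:  # grow hi until the root is bracketed
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if np.mean(rho_bisquare(r / mid, c)) > b:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

At the normal model this choice of (ρ, b, c) makes the M-scale approximately consistent for the standard deviation, so for a large N(0, 1) sample `m_scale` returns a value close to 1.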
It is important to note that the MM-estimator, first proposed by Yohai (1987),
is one of the most important robust estimators. These estimators combine a high
breakdown value (50%) with the high efficiency of M-estimators (approximately 95% relative
to OLS under the Gauss-Markov assumptions). The name MM-estimator refers to the fact that more than one M-estimation procedure is used to calculate the final estimate.
Robust Variance Inflation Factor: Marquardt (1970)
proposed the most popular diagnostic tool for multicollinearity, namely the
Variance Inflation Factor (VIF), which measures how much the variance of an
estimated regression coefficient is inflated compared to when the predictor
variables are not linearly related. It is defined as:

VIF_{j} = 1/(1 - R_{j}^{2}),  j = 1,…,p     (12)

where, R_{j}^{2} is the coefficient of determination when the j-th explanatory variable is regressed on the other explanatory variables in an ordinary regression model using the OLS method. Moderate collinearity exists in the data set when the VIF is between 5 and 10. Any VIF value that exceeds the cutoff point of 10 indicates that the associated regression coefficient is poorly estimated because of severe multicollinearity.
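Eq. 12 can be computed directly by regressing each column on the others; a minimal sketch (the function name and toy data are our own):

```python
import numpy as np

def vif(X):
    """Classical VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from the OLS
    regression of column j on the remaining columns (intercept included)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
        r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=200), rng.normal(size=200)])
print(vif(X))  # first two columns are collinear: their VIFs are far above 10
```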
This multicollinearity diagnostic method is highly sensitive to the presence of high leverage points because of the effect of these points on R^{2}. Consequently, misleading conclusions may be obtained from the classical VIF. In this respect, it is imperative to formulate a robust diagnostic method to avoid drawing a wrong conclusion. Since the computation of the VIF depends entirely on the calculation of R^{2}, a robust version of the VIF can be defined from a robust coefficient of determination RR^{2}. In this study, we develop RVIFs based on two robust coefficients of determination, namely RR^{2} (MM) and RR^{2} (GM (DRGP)).
The RR^{2} (MM) is one of the handiest robust coefficients of determination and can be obtained from the robust library of the S-PLUS software. It can be calculated as follows. If the corresponding coefficient estimates are the initial S-estimates and an intercept term is included in the model, then the RR^{2} (MM) is defined as:

RR^{2} (MM) = (s_{y}^{2} - s_{e}^{2})/s_{y}^{2}     (13)

where, s_{e} is the minimized robust scale estimate ŝ of the full regression model and s_{y} is the minimized ŝ(μ) for a regression model with only an intercept term with parameter μ. If the corresponding coefficient estimates are the final M-estimates and an intercept term μ is included in the model, then the RR^{2} (MM) is defined as:

RR^{2} (MM) = [Σ_{i} ρ((y_{i} - μ̂)/s_{e}) - Σ_{i} ρ(r_{i}/s_{e})] / Σ_{i} ρ((y_{i} - μ̂)/s_{e})     (14)

where, μ̂ is the location M-estimate corresponding to the local minimum of:

Σ_{i=1}^{n} ρ((y_{i} - μ)/s_{e})     (15)

such that:

Σ_{i} ρ((y_{i} - μ̂)/s_{e}) ≤ Σ_{i} ρ((y_{i} - μ^{*})/s_{e})     (16)

where, μ^{*} is the sample median. The RR^{2} (MM) can thus be obtained from the coefficient of determination of the MM-estimation algorithm. The RVIF defined by replacing R^{2} in Eq. 12 with RR^{2} (MM) is called the RVIF (MM). Our other proposed robust coefficient of determination, RR^{2} (GM (DRGP)), may be defined as:

RR^{2} (GM (DRGP)) = 1 - [Σ_{i} w_{i}r_{i}^{2}] / [Σ_{i} w_{i}(y_{i} - ȳ_{w})^{2}]     (17)

where, r_{i} and w_{i} are the residuals and the GM (DRGP) weights after the algorithm has converged and ȳ_{w} = Σ_{i}w_{i}y_{i}/Σ_{i}w_{i} is the weighted mean of the response. Subsequently, the RVIF (GM (DRGP)) is formulated by substituting the R^{2} in Eq. 12 with RR^{2} (GM (DRGP)).
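A sketch of the RVIF idea: replace the OLS R^{2} in Eq. 12 by a weighted R^{2} of the form 1 - Σw r^{2}/Σw(y - ȳ_{w})^{2} built from the final robust weights. In this sketch the weights are an input (in the study they come from the converged GM (DRGP) fit); function names and the toy data are our own:

```python
import numpy as np

def weighted_r2(y, Z, w):
    """R^2 from a weighted least squares fit: 1 - SSE_w / SST_w."""
    ws = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Z * ws[:, None], y * ws, rcond=None)
    resid = y - Z @ coef
    ybar_w = np.sum(w * y) / np.sum(w)
    return 1.0 - np.sum(w * resid**2) / np.sum(w * (y - ybar_w) ** 2)

def robust_vif(X, w):
    """RVIF_j = 1 / (1 - RR_j^2): the classical VIF with the OLS R^2 replaced
    by a weighted (robust) R^2. w: final weights from a robust fit such as
    GM (DRGP); here they are an input, not recomputed."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    w = np.asarray(w, dtype=float)
    out = np.empty(p)
    for j in range(p):
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        out[j] = 1.0 / (1.0 - weighted_r2(X[:, j], Z, w))
    return out

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
X[:5, 0] = X[:5, 1] = 100.0         # collinearity-enhancing high leverage points
w = np.ones(100); w[:5] = 0.0       # a robust fit downweights them to zero
print(robust_vif(X, np.ones(100)))  # unit weights (classical-style): severe collinearity
print(robust_vif(X, w))             # robust weights: no collinearity detected
```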
RESULTS
Commercial properties data: In order to investigate the effect of high
leverage points on the multicollinearity pattern of the data, a noncollinear data
set introduced by Kutner et al. (2004)
is considered. The Commercial Properties data set contains 81 observations
on suburban commercial properties. The response variable, rental rates,
is regressed on age (X_{1}), operating expenses and taxes
(X_{2}) and vacancy rates (X_{3}). This data set contains high
leverage points, although these high leverage points do not cause collinearity
(Bagheri and Midi, 2009). According to
Bagheri and Midi (2009), adding high leverage points of large magnitude
at the same observations of more than two explanatory variables
causes a multicollinearity problem in a noncollinear data set. Hence, this data
set has been modified to contain high leverage collinearity-enhancing observations:
the first observation of the first two explanatory
variables is replaced with a large high leverage value (equal to 300).
Figure 1a and b present the scatter plot
of the original and the modified commercial properties data set.
The coefficient estimates, standard deviations and t-values of the original and the modified commercial properties data set are presented in Table 1 and 2, respectively. The Bootstrap standard deviations of the MM and our proposed GM (DRGP) estimates are also computed (Andersen, 2008).
The classical and the Robust VIFs for the original and the modified Commercial Properties data set are exhibited in Table 3.
To verify the merit of our new robust VIF methods, a Monte Carlo simulation is carried out.

Fig. 1: The scatter plot of (a) original and (b) modified commercial properties data set

Table 1: Parameter estimations, standard deviations and t-values of the original commercial properties data set

Table 2: Parameter estimations and standard deviations of the modified commercial properties data set

Table 3: Classical and robust VIF for the original and modified commercial properties data set
Monte Carlo simulation study: A simulation study is conducted to further assess the performance of our newly proposed robust VIFs. Three explanatory variables were considered, each generated from N(0, 1); we refer to these as the clean independent variables. In order to create collinearity-enhancing observations, a certain percentage of the clean data is replaced by high leverage points, with the level of high leverage points varied from 0 to 25%. We considered a moderate sample size equal to 100 and 10000 replications in each simulation run. The Magnitude of Contamination (MC) in the X-direction was varied over 20, 50, 100 and 300. In order to obtain collinearity-enhancing observations, the high leverage points were placed in all three explanatory variables. The average values of the classical and robust VIF were computed over the 10000 simulation runs. To be certain of the results of the simulation study, the percentages of error of the classical and the robust diagnostic methods in indicating the degree of multicollinearity, whether severe or moderate, are also calculated.
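A scaled-down sketch of this simulation design (200 replications instead of 10000, a single contamination setting; the correlation-matrix shortcut for the VIF and all names are our own):

```python
import numpy as np

def vifs(X):
    """VIFs as the diagonal of the inverse correlation matrix of the predictors."""
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

rng = np.random.default_rng(3)
n, reps, mc, frac = 100, 200, 100, 0.25     # scaled down from 10000 replications
avg = np.zeros(3)
for _ in range(reps):
    X = rng.normal(size=(n, 3))             # clean N(0,1) predictors
    h = int(frac * n)
    X[:h, :] = mc + rng.normal(size=(h, 3)) # large values in all three columns
    avg += vifs(X)
avg /= reps
print(avg)  # far above the severe-multicollinearity cutoff of 10
```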
Table 4 shows the performance of our newly proposed robust methods, RVIF (MM) and RVIF (GM (DRGP)), when there are no high leverage points in the data set and MC is equal to 100.
Table 5 and 6 exhibit the performance of the classical and robust VIF for small to moderate magnitudes of contamination (20 and 50) and moderate to large magnitudes of contamination (100 and 300), respectively.
Table 4: The performance of classical and robust VIF methods in the noncollinear data set

Table 5: The effect of different percentages and magnitudes of high leverage points equal to 20 and 50 on classical and robust VIF when n = 100
#%HL: Percentage of high leverage points, MC: Magnitude of contamination
Table 7 displays the percentage of error of each multicollinearity diagnostic method in identifying the degree of collinearity when the percentage of high leverage points is 25, for different magnitudes of high leverage points.
Table 6: The effect of different percentages and magnitudes of high leverage points equal to 100 and 300 on classical and robust VIF when n = 100
#%HL: Percentage of high leverage points, MC: Magnitude of contamination

Table 7: The error percentage of classical and robust VIF for diagnosing the collinearity pattern of the data set in 10000 replications with 25 percent high leverage points
DISCUSSION
At first, we discuss the results of the numerical example. According to Fig. 1a,
none of the explanatory variables in the original data set are collinear
(Kutner et al., 2004). It is interesting to see
that as soon as the data are modified, the added high leverage points in X_{1}
and X_{2} pull the regression line toward themselves and change
the collinearity pattern of the data, as can clearly be seen in Fig. 1b.
It is worth mentioning that for the original data set the F-test is significant
(F-value = 17.53 and p-value = 0.0), which indicates that a linear relationship exists
between the variables in the model. From Table 1, it can be seen that the
OLS parameter estimates are significant at the 5% significance level when the
t-values are compared with t(0.975, 77) = 1.99. The coefficient estimates
of the two robust methods are also significant. Thus, the classical and the robust
methods confirm that none of the coefficients is zero. After modifying
the data set, the F-test remains significant (F-value = 14.75 and p-value = 0.0)
while β_{3} for the OLS is not significant (t-value = 0.1777).
This behavior is an indicator of the existence of multicollinearity in the
data set (Montgomery et al., 2001; Kutner
et al., 2004; Chatterjee and Hadi, 2006).
However, the results of Table 2 suggest that the coefficient estimates of both robust
methods are significant. These results indicate that the
robust methods fit the model to the majority of the data and are resistant
to the high leverage points (Maronna et al., 2006;
Andersen, 2008). Consequently, these points cannot
cause any change in the parameter estimates.
The results of Table 3 indicate that when this data set does not
contain any collinearity-enhancing observations, the classical VIF, RVIF (MM)
and RVIF (GM (DRGP)) do not exceed their cutoff points. These results thus
confirm that the data set is noncollinear. However, after the data
are modified by creating high leverage points that cause collinearity in the
data set, the classical VIF indicates severe multicollinearity (Midi et al., 2009; Imon et al., 2009). In contrast, our proposed
RVIF (GM (DRGP)) and RVIF (MM) are resistant to the added high leverage points
and do not indicate collinearity, unlike the classical VIF (CVIF), which indicates
severe multicollinearity. It is evident from these results that the high leverage
points are the source of the multicollinearity in the data set. The results also reveal
that the claim of Kamruzzaman and Imon (2002) that high leverage points cause
multicollinearity in a noncollinear data set is based on the classical VIF,
which is not resistant to these points.
Finally, we discuss the results obtained from the simulations. According to Table 4, it is important to note that in normal situations the values of the RVIF (MM) and RVIF (GM (DRGP)) are close to the classical VIF, which indicates that they are as good as the CVIF in correctly diagnosing the collinearity pattern of the data.
As expected, when there are no high leverage points in the data set,
the CVIF confirms that the data set is not collinear (Table
4) (Hair et al., 2006). It can be observed
from Table 5 and 6 that, as the percentage and magnitude of high leverage points
increase, the value of the CVIF becomes larger. Hence, the CVIF is very sensitive
to the presence of high leverage points added to the data set. The RVIF (MM),
however, increases for small magnitudes of high leverage points such as 20,
while for large magnitudes such as 300 it does not change drastically (Table 6).
Hence, the RVIF (MM) is not as resistant for small magnitudes of high leverage
points as it is for large ones.
It is evident from the results that the RVIF (GM (DRGP)) is not affected by the increase in the percentage and magnitude of high leverage points.
It is important to note that a very small percentage of error reveals that a method can correctly detect the degree of multicollinearity in the data, while a high percentage of error reveals that it cannot. Inevitably, the robust methods have a high percentage of error here, owing to their resistance to the high leverage points. It can be observed from Table 7 that the percentage of error of the CVIF is zero for all magnitudes of high leverage points considered in this study: the CVIF detects severe multicollinearity because of its sensitivity to the added high leverage points. On the other hand, the RVIF (MM) detects a moderate degree of multicollinearity for small magnitudes of high leverage points, which shows that this method is not robust against small magnitudes of high leverage points; however, as the magnitude of the high leverage points increases up to 300, it becomes resistant to these points. The results also point out that the RVIF (GM (DRGP)) is always robust against high leverage collinearity-enhancing observations.
CONCLUSIONS
In this study, we proposed two robust VIFs, the RVIF (MM) and the RVIF (GM (DRGP)), for detecting multicollinearity whose source is high leverage points in the data set. Recently, high leverage points have become known as another source of multicollinearity. The results of the study signify that high leverage points have an undue effect on the classical multicollinearity diagnostics, specifically the VIF; in this situation, the classical VIF sometimes conveys a misleading interpretation of the collinearity pattern of the data. The results of the real data and the simulation study reveal that high leverage points are the source of multicollinearity, as evidenced by the failure of the RVIF to diagnose multicollinearity in contradiction to the CVIF, which indicates the existence of multicollinearity in the data set. Another important conclusion is that the RVIF (GM (DRGP)) is the most resistant diagnostic measure to high leverage points, followed by the RVIF (MM).