Journal of Mathematics and Statistics
Year: 2009  |  Volume: 5  |  Issue: 2  |  Page No.: 97 - 106

Two-Step Robust Diagnostic Method for Identification of Multiple High Leverage Points

Arezoo Bagheri, Habshah Midi and A.H.M. Rahmatullah Imon    

Abstract: Problem statement: High leverage points are extreme outliers in the X-direction. In regression analysis, the detection of these points is important because of their arbitrarily large effects on the estimates, as well as the multicollinearity problems they can cause. The Mahalanobis Distance (MD) has been used as a diagnostic tool for identifying outliers in multivariate analysis, where it measures the distance between the normal and abnormal groups of the data. Since the computation of the MD relies on non-robust classical estimates, the classical MD can hardly detect outliers accurately. As an alternative, Robust MD (RMD) methods based on the Minimum Covariance Determinant (MCD) and Minimum Volume Ellipsoid (MVE) estimators have been used to identify high leverage points in a data set. However, these methods tend to swamp some low leverage points even though they identify the high leverage points correctly. Since the detection of leverage points is one of the most important issues in regression analysis, it is imperative to introduce a novel detection method for high leverage points. Approach: In this study, we proposed a relatively new two-step method for detecting high leverage points: the RMD (MVE) or RMD (MCD) is used in the first step to identify the suspected outlying points, and in the second step the MD is computed based on the mean and covariance of the clean data set. We call this method the two-step Robust Diagnostic Mahalanobis Distance (RDMDTS); it identifies high leverage points correctly while swamping fewer low leverage points. Results: The merit of the newly proposed method was investigated extensively on real data sets and in a Monte Carlo simulation study. The results indicate that, for small sample sizes, the best detection method is (RDMDTS) (MVE)-mad, while there is not much difference between (RDMDTS) (MVE)-mad and (RDMDTS) (MCD)-mad for large sample sizes. Conclusion/Recommendations: In order to swamp fewer low leverage points as high leverage points, the proposed robust diagnostic methods, (RDMDTS) (MVE)-mad and (RDMDTS) (MCD)-mad, are recommended.

INTRODUCTION

For a multivariate data set X, the Mahalanobis Distance of the i-th observation x_i is defined as:

MD_i = \sqrt{(x_i - T(X))^T \, C(X)^{-1} \, (x_i - T(X))}    (1)

Where:
T(X) = The estimated multivariate location which is usually the multivariate arithmetic mean
C(X) = The estimated covariance matrix which is usually the sample covariance matrix

The distribution of the MD with both the true location and shape parameters and with the conventional location and shape parameters is well known[5]. If there are only a few outliers, large values of MD indicate that the point x_i is an outlier[2]. Any point whose MD exceeds the cutoff \sqrt{\chi^2_{p,0.975}} is considered an outlier, where p is the number of explanatory variables[16]. Data sets with multiple outliers are subject to problems of masking and swamping[20]. Masking occurs when a group of outlying points skews the mean and covariance estimates toward these points, so that the resulting distance of an outlying point from the mean is small. Swamping occurs when a group of outlying points skews the mean and covariance estimates toward these points and away from other inlying points, so that the resulting distance from an inlying point to the mean is large. The Mahalanobis Distance is known to suffer from masking problems[24].
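As a concrete illustration, the following minimal Python sketch (ours, not from the paper; numpy and scipy assumed) computes Eq. 1 with the classical mean and covariance and flags points beyond the \sqrt{\chi^2_{p,0.975}} cutoff:

```python
import numpy as np
from scipy.stats import chi2

def classical_md(X):
    """Mahalanobis distance (Eq. 1) of each row of X from the sample mean."""
    center = X.mean(axis=0)               # T(X): multivariate arithmetic mean
    cov = np.cov(X, rowvar=False)         # C(X): sample covariance matrix
    diff = X - center
    # squared distances, solving a linear system instead of inverting cov
    d2 = np.einsum('ij,ij->i', diff, np.linalg.solve(cov, diff.T).T)
    return np.sqrt(d2)

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 3))             # p = 3 explanatory variables
md = classical_md(X)
cutoff = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))
print("points beyond the cutoff:", np.where(md > cutoff)[0])
```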

Mahalanobis Distances give a one-dimensional measure of how far a point is from a location with respect to a shape. Utilizing the MD, we can find the points that are unusually far away from that location and call those points outliers. A large body of diagnostic tools is available in the literature for the detection of high leverage points in linear regression[4,11,12,27]. The Mahalanobis Distance (MD) is one of these well-known multivariate methods for detecting high leverage points as well. Although it is a reliable diagnostic tool, it suffers from the masking problem: most of the classical diagnostic methods fail to identify multiple high leverage points due to their masking effects[14]. Problems of masking can be resolved by using robust estimates of shape and location, which by definition are less affected by outliers. Outlying points are less likely to enter into the calculation of the robust estimates, so they will not be able to influence the parameters used in the MD. The inlying points, which all come from the underlying distribution, will essentially determine the estimates of the location and shape of the data. Several robust estimators of multivariate location and scatter have been proposed, beginning with Maronna's pioneering paper on multivariate M-estimation[17] and including the Minimum Volume Ellipsoid (MVE) and Minimum Covariance Determinant (MCD) estimators of Rousseeuw[22]. For a thorough overview of robust multivariate estimation, one can refer to the article by Maronna and Yohai[18].

The Minimum Covariance Determinant (MCD) method of Rousseeuw[22] aims to find the h observations out of n whose covariance matrix C has the lowest determinant. The Minimum Volume Ellipsoid (MVE) estimator, also proposed by Rousseeuw[22], constructs the ellipsoid of smallest volume that covers a subset of h objects (the non-contaminated data). In one of the proposed iterative algorithms, p+1 objects are selected at random in each iteration and their mean and covariance are determined. Then, the ellipsoid containing exactly h data objects is found by deflating or expanding the covariance ellipsoid. These steps are repeated until the subset of h objects yielding the smallest volume of the covariance ellipsoid is found.

Finally, the robust MD can be written as:

RMD_i = \sqrt{(x_i - T_R(X))^T \, C_R(X)^{-1} \, (x_i - T_R(X))}    (2)

where T_R(X) and C_R(X) are robust location and shape estimates such as MCD or MVE. By using a robust location and shape estimate in the RMD, outlying points will not skew the estimates and can be identified as outliers by large values of the RMD. Unfortunately, using robust estimates gives RMDs with unknown distributional properties[25]. The use of the chi-square quantile as a cutoff point for the RMD is prone to declaring some good, low leverage points as high leverage points and often leads to identifying too many points as outliers[25]. To develop robust multivariate estimators, Rousseeuw and Leroy[23] first proposed to detect outliers by the RMD and then obtain the estimates by reweighted least squares regression, where the weight function is a hard rejection function. Specifically, the latter proposal consists of discarding those observations whose RMD exceeds a certain fixed threshold value. Previously, the MVE was commonly used as the initial estimator for these procedures. In the context of linear regression, many estimators have been proposed that aim to reconcile high efficiency and robustness. Typically, these methods are also two-stage procedures[6,10,15,22,28,29].
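As an illustration of Eq. 2, the sketch below (our own, not the authors' code) uses scikit-learn's MinCovDet for the MCD estimates; since a comparable off-the-shelf MVE routine is less common in Python, only the MCD variant is shown:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
X[:6] += 8.0                                   # plant a group of outliers

mcd = MinCovDet(random_state=1).fit(X)         # robust T_R(X) and C_R(X)
rmd = np.sqrt(mcd.mahalanobis(X))              # Eq. 2 with MCD estimates
cutoff = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))
print("suspected outliers:", np.where(rmd > cutoff)[0])
```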

Let us consider a regression model with k variables:

y = X\beta + \varepsilon    (3)

The weight matrix W = X(X^T X)^{-1} X^T is the orthogonal projector onto the model space, or hat matrix, which is traditionally used as a measure of leverage in regression analysis. If a diagonal entry w_ii of W is large, changing y_i will move the fitted surface appreciably towards the altered value. Therefore, w_ii is said to measure the leverage of the observation y_i. Different cutoff points exist in the literature for identifying high leverage points from the hat matrix, such as the twice-the-mean rule (2k/n)[11] and the thrice-the-mean rule (3k/n)[27], where k and n are the numbers of variables and observations respectively, and Huber's[12] three-interval range (observations with 0.2 < w_ii < 0.5 are risky to include in the analysis and those with w_ii ≥ 0.5 should be avoided, where w_ii is a diagonal element of the hat matrix).
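The hat-matrix diagonals and the cutoff rules above can be computed as in the following sketch (illustrative only; here we take k to count all columns including the intercept):

```python
import numpy as np

def hat_diagonals(X):
    """Diagonal of the hat matrix W = X (X^T X)^{-1} X^T via a QR decomposition."""
    Q, _ = np.linalg.qr(X)                # reduced QR, so W = Q Q^T
    return np.einsum('ij,ij->i', Q, Q)    # w_ii = sum of squared row entries of Q

rng = np.random.default_rng(2)
n, k = 40, 3                              # k columns, intercept included
X = np.column_stack([np.ones(n), rng.uniform(size=(n, k - 1))])
X[0, 1:] = 10.0                           # one planted high leverage point
w = hat_diagonals(X)
print("twice-the-mean rule :", np.where(w > 2 * k / n)[0])
print("thrice-the-mean rule:", np.where(w > 3 * k / n)[0])
print("Huber risky range   :", np.where((w > 0.2) & (w < 0.5))[0])
```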

The hat matrix may fail to identify high leverage points because of the effect of the high leverage points themselves on the leverage structure[7]. Hadi[7] introduced another diagnostic tool as follows:

p_{ii} = \frac{w_{ii}}{1 - w_{ii}}    (4)

where w_ii = x_i^T (X^T X)^{-1} x_i is the i-th diagonal element of W. Equivalently, the i-th diagonal potential p_ii can be defined as:

p_{ii} = x_i^T (X_{(i)}^T X_{(i)})^{-1} x_i

where X_{(i)} is the data matrix X without the i-th row. He proposed a cutoff point for the potential values p_ii of Median(p_ii) + c Mad(p_ii), where Mad(p_ii) = median{|p_ii - median(p_ii)|}/0.6745 and c can be taken as a constant value of 2 or 3. Observations exceeding Hadi's cutoff point are considered high leverage points. However, this method also cannot detect all of the high leverage points.
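A minimal sketch of Hadi's potentials and the Median + c·Mad cutoff (our illustration, with c = 3):

```python
import numpy as np

def hadi_potentials(X, c=3):
    """Hadi's potentials p_ii = w_ii / (1 - w_ii) (Eq. 4)
    with his Median + c*Mad cutoff."""
    Q, _ = np.linalg.qr(X)
    w = np.einsum('ij,ij->i', Q, Q)            # hat-matrix diagonals
    p = w / (1.0 - w)
    med = np.median(p)
    mad = np.median(np.abs(p - med)) / 0.6745
    return p, med + c * mad

rng = np.random.default_rng(3)
X = rng.uniform(size=(30, 3))
X[0] = 8.0                                      # one planted high leverage point
p, cutoff = hadi_potentials(X)
print("high leverage points:", np.where(p > cutoff)[0])
```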

Imon[13] introduced another diagnostic tool, generalized potentials, for the whole data set, as follows. Let D be the group of cases deleted from the data set, those suspected to be outliers (the choice of this deletion group is very important, since the omission of this group determines the weights for the whole data set), and let R be the remaining set after deleting d < (n - k) cases, so that it contains (n - d) cases. If we assume that the suspected cases are the last d rows of X and Y, the weight matrix W = X(X^T X)^{-1} X^T can be partitioned as:

W = \begin{pmatrix} U_R & V \\ V^T & U_D \end{pmatrix}

where U_R = X_R (X^T X)^{-1} X_R^T and U_D = X_D (X^T X)^{-1} X_D^T are symmetric matrices of order (n - d) and d respectively and V = X_R (X^T X)^{-1} X_D^T is an (n - d) × d matrix. Now we can define:

w_{ii}^{(D)} = x_i^T (X_R^T X_R)^{-1} x_i,   for i = 1, 2, …, n

where w_{ii}^{(D)} is the i-th diagonal element of the X (X_R^T X_R)^{-1} X^T matrix.

Then Imon[14] introduced generalized potentials for all members of the data set, defined as:

p_{ii}^{*} = \begin{cases} w_{ii}^{(D)} / (1 + w_{ii}^{(D)}), & i \in D \\ w_{ii}^{(D)}, & i \in R \end{cases}    (5)

We should note that there is no finite upper bound for the p_{ii}^{*}'s and the derivation of their theoretical distribution is not easy. He introduced the same form of cutoff point as for the potential values, Median(p_{ii}^{*}) + c Mad(p_{ii}^{*}), for the generalized potentials as well.
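To make Eq. 5 concrete, the following sketch (our illustration; the deletion set D is supplied by the caller) computes the generalized potentials and applies the Median + c·Mad cutoff:

```python
import numpy as np

def generalized_potentials(X, D):
    """Imon's generalized potentials (Eq. 5) for a given deletion set D."""
    D = np.asarray(D)
    R = np.setdiff1d(np.arange(len(X)), D)      # remaining cases
    G = np.linalg.inv(X[R].T @ X[R])            # (X_R^T X_R)^{-1}
    w = np.einsum('ij,jk,ik->i', X, G, X)       # w_ii^(D) for all i
    p_star = w.copy()                           # p*_ii = w_ii^(D) for i in R
    p_star[D] = w[D] / (1.0 + w[D])             # p*_ii = w/(1+w) for i in D
    return p_star

rng = np.random.default_rng(4)
X = rng.uniform(size=(25, 3))
X[:2] += 10.0                                   # two planted high leverage points
p_star = generalized_potentials(X, D=[0, 1])
med = np.median(p_star)
mad = np.median(np.abs(p_star - med)) / 0.6745
print("flagged:", np.where(p_star > med + 3 * mad)[0])
```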

Habshah et al.[6] developed a new method for identifying outlying points in a multivariate data set by combining the RMD (MVE) method, for detecting the suspected group (the D group), with the generalized potential method proposed by[14]. This method, called DRGP (MVE), is also a two-step method for high leverage point detection. In their method, the mad cutoff point is used in both the first and second steps. However, this method can still swamp low leverage points. According to Werner[28], "A successful method of identifying outliers in all multivariate situations would be ideal, but is unrealistic". By "successful", he means both highly sensitive, the ability to detect genuine outliers, and highly specific, the ability not to swamp regular points as outliers. A practical and efficient robust detection method for high leverage points (outliers in the X-direction) is therefore one that is sensitive enough to detect the genuine high leverage points and specific enough to swamp fewer low leverage points as high leverage points.

MATERIALS AND METHODS

In this study, we propose a two-step diagnostic tool for detecting multiple high leverage points which swamps fewer low leverage points. In order to improve on the performance of the DRGP (MVE) proposed by[6], we follow the idea of Rousseeuw and Leroy[23] for developing robust multivariate estimators and propose a relatively new method for the identification of high leverage points, called the two-step Robust Diagnostic Mahalanobis Distance (RDMDTS). In the first step, the RMD (MCD) or RMD (MVE) method is used to detect the suspected outlier group, which is deleted from the data set, yielding the clean data for the next step. In the second step, we apply the MD to the entire data set based on the mean and covariance matrix of the clean data set obtained in the first step. The two-step Robust Diagnostic Mahalanobis Distance (RDMDTS) is therefore written as follows:

RDMD_i^{TS} = \sqrt{(x_i - T_0(X))^T \, C_0(X)^{-1} \, (x_i - T_0(X))}    (6)

where T_0(X) and C_0(X) are the mean and covariance matrix of the clean data set. Two different cutoff points are considered, namely \sqrt{\chi^2_{k,0.975}}, where k is the number of explanatory variables, and a newly proposed one, Median(RDMDTS) + c Mad(RDMDTS). The procedure can be summarized in the following algorithm (a code sketch follows the steps).

First step:

Compute RMDi(MCD) or RMDi(MVE) for i = 1, …, n, as defined in Eq. 2, in the multivariate case (both x and y variables)

Compare these values with \sqrt{\chi^2_{p,0.975}} to detect outliers (if any), where p is the number of x and y variables together

Second step:

Find the mean and the covariance matrix of the clean subset of the explanatory variables, obtained after removing the outliers suspected in the first step
Find the classical MD of the entire data set (for the x variables only) using the mean and covariance matrix of this clean data set
Compare these values with \sqrt{\chi^2_{k,0.975}} to detect high leverage points (if any), where k is the number of x variables; we refer to this method as (RDMDTS)-chi-sq. Or:
Compare these values with Median(RDMDTS) + c Mad(RDMDTS) to detect high leverage points (if any), where c is an appropriately chosen constant such as 2 or 3; we refer to this method as (RDMDTS)-mad
Those points with RDMDTS < \sqrt{\chi^2_{k,0.975}} or RDMDTS < Median(RDMDTS) + c Mad(RDMDTS) are not considered high leverage points and are put back in the set of inliers
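The sketch below pulls the two steps together (our own illustration, using the MCD first step via scikit-learn, since a standard MVE routine is not assumed to be available):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def rdmd_ts(Z, X, c=3):
    """Two-step RDMD^TS sketch. Z holds all variables (x and y) for the
    first step; X holds the explanatory variables only for the second."""
    p, k = Z.shape[1], X.shape[1]

    # First step: RMD(MCD) on all variables, chi-square cutoff.
    mcd = MinCovDet(random_state=0).fit(Z)
    rmd = np.sqrt(mcd.mahalanobis(Z))
    suspects = rmd > np.sqrt(chi2.ppf(0.975, df=p))

    # Second step: classical MD for the entire X, using the mean and
    # covariance of the clean subset found in the first step.
    Xc = X[~suspects]
    center, cov = Xc.mean(axis=0), np.cov(Xc, rowvar=False)
    diff = X - center
    d = np.sqrt(np.einsum('ij,ij->i', diff, np.linalg.solve(cov, diff.T).T))

    chi_flags = d > np.sqrt(chi2.ppf(0.975, df=k))                # (RDMDTS)-chi-sq
    med = np.median(d)
    mad_flags = d > med + c * np.median(np.abs(d - med)) / 0.6745  # (RDMDTS)-mad
    return d, chi_flags, mad_flags
```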

RESULTS

Numerical examples: Two well-known data sets that are frequently referred to in the study of the identification of influential observations, high leverage points and outliers are considered in this study. It is important to note that we changed the mad cutoff point used by[6] to the chi-square cutoff in the first step, both in the examples and in the simulation study.

Hawkins-Bradu-Kass data: Hawkins et al.[9] constructed an artificial three-predictor data set containing 75 observations with 10 outliers (cases 1-10) and 14 high leverage points (cases 1-14). Most of the previous single-case deletion identification methods fail to identify all of these influential observations; some of them wrongly identify four high leverage points as outliers[23]. Table 1 shows the DRGP (MVE), DRGP (MCD), (RDMDTS) (MVE), (RDMDTS) (MCD) and MD values and their corresponding cutoff points.

Stack loss data: Here we consider the stack loss data[3], which have been extensively analyzed in the statistical literature. This three-predictor data set (air flow, cooling water inlet temperature and acid concentration) contains 21 observations with five influential observations: three of them (cases 1, 3 and 21) are high leverage outliers, one (case 4) is an outlier and another (case 2) is a high leverage point. Table 2 shows the DRGP (MVE), DRGP (MCD), (RDMDTS) (MVE), (RDMDTS) (MCD) and MD values and their corresponding cutoff points. Another useful detection tool, proposed by Rousseeuw and Van Driessen[24], is the DD plot.


Table 1: Diagnostic robust generalized potential based on MVE and MCD and two-step robust diagnostic Mahalanobis distance based on MVE and MCD for the Hawkins-Bradu-Kass data

Table 2: Diagnostic robust generalized potential based on MVE and MCD and two-step robust diagnostic Mahalanobis distance based on MVE and MCD for the stack loss data

In this plot, the classical MD_i is plotted against the robust MD_i. The low leverage points should cluster below the cutoff point lines, while the high leverage points will be separated from the bulk of the data and thus be located above the cutoff points.

The DD plots for the stack loss data are shown in Fig. 1a (MD vs. RDMDTS (MCD)) and 1b (MD vs. RDMDTS (MVE)), and in Fig. 2a (MD vs. DRGP (MCD)) and 2b (MD vs. DRGP (MVE)). In both plots of Fig. 1 there are two cutoff point lines, namely the mad and the chi-square (\sqrt{\chi^2_{k,0.975}}) cutoffs, while only one cutoff point line (mad) is employed by the DRGP in plots (a) and (b) of Fig. 2.
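A DD plot along these lines can be produced with a few lines of matplotlib (our sketch; the cutoff placement follows the description above, with c = 3 assumed):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

def dd_plot(md, rdmd, k, c=3):
    """Classical MD against RDMD^TS, with the two cutoff lines of Fig. 1."""
    med = np.median(rdmd)
    mad_cut = med + c * np.median(np.abs(rdmd - med)) / 0.6745
    chi_cut = np.sqrt(chi2.ppf(0.975, df=k))
    plt.scatter(md, rdmd, s=15)
    plt.axhline(chi_cut, linestyle='--', label='chi-square cutoff')
    plt.axhline(mad_cut, linestyle=':', label='mad cutoff')
    plt.xlabel('Mahalanobis distance')
    plt.ylabel('RDMD$^{TS}$')
    plt.legend()
    plt.show()
```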

Simulation study: In order to investigate the merit of our newly proposed method, we designed a Monte Carlo simulation experiment comparing the Robust Diagnostic Mahalanobis Distance (RDMDTS) with the existing methods for sample sizes of 20, 40, 60, 100 and 200. The first 100(1-α)% of the observations of the three regressors are drawn from Uniform(0, 1) and the remaining 100α% of the observations are constructed as high leverage points. The high leverage points are generated with unequal weights: the last observation in each sample is kept fixed at the value 10 and the other high leverage points increase in increments of five.
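A sketch of one simulated sample is given below; note that the magnitude scheme for the planted leverage points (last point fixed at 10, the others stepping up by 5) is our reading of the description above, not the authors' code:

```python
import numpy as np

def simulate_sample(n, alpha, rng):
    """One simulated sample: 100(1-alpha)% clean Uniform(0,1) regressors
    plus 100*alpha% planted high leverage points of unequal magnitude."""
    n_bad = int(round(alpha * n))
    X = rng.uniform(size=(n, 3))                 # three regressors
    if n_bad > 0:
        levels = 10 + 5 * np.arange(n_bad)[::-1] # ..., 20, 15, 10
        X[n - n_bad:, :] = levels[:, None]       # overwrite the tail cases
    return X

rng = np.random.default_rng(5)
X20 = simulate_sample(20, 0.05, rng)             # smallest design cell
X200 = simulate_sample(200, 0.25, rng)           # largest design cell
```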


Fig. 1: (a) Mahalanobis distance against two-step robust diagnostic Mahalanobis distance based on MCD; (b) Mahalanobis distance against two-step robust diagnostic Mahalanobis distance based on MVE

Fig. 2: (a) Mahalanobis distance against diagnostic robust generalized potential based on MCD; (b) Mahalanobis distance against diagnostic robust generalized potential based on MVE

We ran 10,000 simulations for each of these five sample sizes. The results are presented in Table 3.

DISCUSSION

Let us focus our attention on the results for the Hawkins-Bradu-Kass data presented in Table 1. The RMD (MCD) and the RMD (MVE) both detect cases 1-10 as outliers. In addition, RMD (MCD) identifies observations 11-14, 47 and 53 as outliers, while RMD (MVE) swamps observations 11-14 and 53 (not shown due to space limitations). Although these robust methods are more powerful than the MD, which detects only 2 outliers (cases 12 and 14), their performance as high leverage detection tools can still be improved. As proposed in the second step of the (RDMDTS), we find the mean and covariance matrix of the clean data set, for both RMD (MCD) and RMD (MVE), after deleting the suspected outlier group, and then compute the distance of the whole data set from this clean mean with the clean covariance matrix, for the x variables only. It is evident from Table 1 that both our proposed method and the method of Habshah et al.[6] detect the 14 high leverage points with both the mad and the chi-square cutoff points. However, the values of (RDMDTS) lie further from their corresponding cutoff points than those of DRGP. Thus, the new method enhances the chance of detecting these 14 observations as high leverage points.


Table 3: 10000 simulations for comparing RDMDTS and DRGP based on (MCD) and (MVE)
1: #LLP = number of Low Leverage Points; 2: #HLP = number of High Leverage Points (# denotes count)

Let us now turn to the stack loss data, where the RMD (MVE) detects the 4 outliers and, in addition, case 2, while the RMD (MCD) detects the 4 outliers and also cases 2, 13, 14 and 20 as outliers. (The RMD (MVE) and RMD (MCD) values are not presented due to space constraints.) After deleting the outliers from the data set and utilizing the mean and covariance matrix of the cleaned data set from the first step, the (RDMDTS) identifies exactly 4 high leverage points. The DRGP (MCD) and DRGP (MVE) in Table 2 also identify these 4 high leverage points. As for the Hawkins-Bradu-Kass data, a similar conclusion can be drawn from this example regarding the higher chance of (RDMDTS) detecting high leverage points: the results of Table 2 show that (RDMDTS) detects these 4 high leverage points easily. Due to its masking problem, the MD cannot detect any high leverage points.

Looking at Fig. 1 and 2, it is obvious that the MD cannot identify any high leverage points, while the other four robust methods identify the 4 high leverage points easily.

Next, we discuss whether the simulation results confirm the conclusion from the numerical examples that our proposed method performs better than the DRGP and MD methods. It can be observed from Table 3 that, for small sample sizes, the (RDMDTS) based on MCD or MVE with the chi-square cutoff point swamps more low leverage points than the (RDMDTS) based on MCD or MVE with the mad cutoff point. Nevertheless, as the sample size increases, the chi-square cutoff point performs better and swamps fewer low leverage points, though still more than (RDMDTS)-mad. It is also evident from Table 3 that (RDMDTS) (MVE)-mad outperforms (RDMDTS) (MCD)-mad in swamping fewer low leverage points for small sample sizes.

For large sample sizes such as 200 (with 20 or 25% high leverage points), the two methods (RDMDTS) (MVE)-mad and (RDMDTS) (MCD)-mad are equally good and do a credible job of detecting the high leverage points. Comparing (RDMDTS)-mad with the DRGP based on MCD or MVE, fewer low leverage points are identified when our newly proposed methods are used. When the sample size is 100 or 200 and 20 or 25% high leverage points are added, (RDMDTS)-mad detects exactly the high leverage points with no low leverage points, while DRGP swamps some low leverage points. Only when the sample size and the number of high leverage points are both very small (sample size 20 with 5% high leverage points) does DRGP swamp fewer low leverage points than (RDMDTS)-mad. As the number of high leverage points and the sample size increase, (RDMDTS)-mad overtakes DRGP in swamping fewer low leverage points.

CONCLUSION

The presence of high leverage points affects all least squares models, which are extensively used in data exploration and modeling. In multivariate settings the identification of high leverage points is much more difficult; in particular, it is hard to detect outliers in p-variate data when p > 2, as one can no longer rely on visual inspection. Among outlier detection tools, the Mahalanobis Distance is powerful for detecting a single outlier, but this approach breaks down for multiple outliers because of the masking effect, by which multiple outliers do not necessarily have large Mahalanobis distance values. It is better to use distances based on robust estimators of multivariate location and scatter[23]. In regression analysis, the robust distances are computed from the explanatory variables, which allows us to detect high leverage points. The main contribution of this study is a two-step robust diagnostic method based on the Robust Mahalanobis Distance. This relatively new method not only detects exactly the high leverage points but also identifies fewer low leverage points than existing methods such as the Diagnostic Robust Generalized Potential. To investigate the superiority of the new method, a Monte Carlo simulation was carried out. The results indicate that, for small sample sizes, the best detection method is (RDMDTS) (MVE)-mad, whereas there is not much difference between (RDMDTS) (MVE)-mad and (RDMDTS) (MCD)-mad for large sample sizes. However, when the sample size is very small, such as 20, and the number of high leverage points is 5% of the data set, it is better to use DRGP (MVE), which swamps fewer low leverage points.

" target="_blank">View Fulltext    |   Related Articles   |   Back
   
 
 
 
  Related Articles

No Article Found
 
 
 
Copyright   |   Desclaimer   |    Privacy Policy   |   Browsers   |   Accessibility