The Mahalanobis Distance (MD) of an observation x_{i} is defined as:

MD_{i} = [(x_{i} - T(X))^{T} C(X)^{-1} (x_{i} - T(X))]^{1/2}

where T(X) is the estimated multivariate location, which is usually the multivariate arithmetic mean, and C(X) is the estimated covariance matrix, which is usually the sample covariance matrix.
The distribution of the MD, with both the true location and shape parameters and the conventional location and shape parameters, is well known^{[5]}. If there are only a few outliers, a large value of MD indicates that the point x_{i} is an outlier^{[2]}. Any point whose MD exceeds the cutoff (χ^{2}_{p, 0.975})^{1/2} is considered an outlier, where p is the number of explanatory variables^{[16]}.
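The classical MD and the chi-square cutoff just described can be sketched with plain numpy/scipy on hypothetical seeded data (a minimal illustration, not the paper's own code):

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_distances(X):
    """Classical MD_i of each row of X from the sample mean/covariance."""
    T = X.mean(axis=0)                    # location estimate T(X)
    C = np.cov(X, rowvar=False)           # covariance estimate C(X)
    diff = X - T
    Cinv = np.linalg.inv(C)
    # row-wise quadratic form (x_i - T)^T C^{-1} (x_i - T)
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, Cinv, diff))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[0] = [8.0, 8.0, 8.0]                    # one clear outlier

md = mahalanobis_distances(X)
p = X.shape[1]
cutoff = np.sqrt(chi2.ppf(0.975, df=p))   # sqrt(chi^2_{p,0.975}) cutoff
outliers = np.where(md > cutoff)[0]
```

Here a single extreme point is flagged easily; as the surrounding text explains, this breaks down once several outliers mask one another.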
Data sets with multiple outliers are subject to the problems of masking and swamping^{[20]}. Masking occurs when a group of outlying points skews the mean and covariance estimates toward itself, so that the resulting distances of the outlying points from the mean are small. Swamping occurs when a group of outlying points skews the mean and covariance estimates toward itself and away from other inlying points, so that the resulting distances of the inlying points from the mean are large. The Mahalanobis Distance is known to suffer from masking^{[24]}.
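The masking effect can be demonstrated directly on hypothetical seeded data: the same outlying cluster has a huge MD under clean estimates but a small MD once it is allowed to contaminate the mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(42)
inliers = rng.normal(size=(40, 2))
cluster = rng.normal(loc=10.0, scale=0.1, size=(10, 2))  # tight outlying cluster
X = np.vstack([inliers, cluster])

def md(points, reference):
    """MD of `points` w.r.t. mean/covariance estimated from `reference`."""
    T = reference.mean(axis=0)
    Cinv = np.linalg.inv(np.cov(reference, rowvar=False))
    d = points - T
    return np.sqrt(np.einsum("ij,jk,ik->i", d, Cinv, d))

# Distances of the outlying cluster under clean vs contaminated estimates:
md_clean = md(cluster, inliers)   # estimates from inliers only -> very large
md_masked = md(cluster, X)        # cluster drags mean/cov toward itself
```

The cluster skews the mean toward itself and inflates the covariance along its direction, so `md_masked` collapses to small values: the outliers mask each other.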
Mahalanobis Distances give a one-dimensional measure of how far a point is from a location with respect to a shape. Using MD, we can find the points that are unusually far from a location and call those points outliers. A large body of diagnostic tools is available in the literature for the detection of high leverage points in linear regression^{[4,11,12,27]}. The Mahalanobis Distance (MD) is one of these well-known multivariate methods for detecting high leverage points as well. Although it is a reliable diagnostic tool for detecting high leverage points, it suffers from the masking problem. Most of the classical diagnostic methods fail to identify multiple high leverage points due to their masking effects^{[14]}.

Problems of masking can be resolved by using robust estimates of shape and location, which by definition are less affected by outliers. Outlying points are less likely to enter into the calculation of the robust procedures, so they will not be able to influence the parameters used in the MD. The inlying points, which all come from the underlying distribution, will essentially determine the estimates of the location and shape of the data. Several robust estimators of multivariate location and scatter have been proposed, such as Maronna's pioneering work on multivariate M-estimation^{[17]} and the Minimum Volume Ellipsoid (MVE) and Minimum Covariance Determinant (MCD) estimators of Rousseeuw^{[22]}. For a thorough overview of robust multivariate estimation, one can refer to the article by Maronna and Yohai^{[18]}. The Minimum Covariance Determinant (MCD) method of Rousseeuw^{[22]} aims to find the h observations out of n whose covariance matrix C has the lowest determinant. In the Minimum Volume Ellipsoid (MVE), also proposed by Rousseeuw^{[22]}, the ellipsoid of smallest volume covering a subset of h observations (the non-contaminated data) is constructed.
In one of the proposed iterative algorithms, p+1 observations are selected at random in each iteration and their mean and covariance are determined. Then, the ellipsoid containing exactly h data points is found by deflating or expanding the covariance ellipsoid. These steps are repeated until the subset yielding the smallest volume of the covariance ellipsoid is found. Finally, the Robust Mahalanobis Distance (RMD) can be written as:

RMD_{i} = [(x_{i} - T_{R}(X))^{T} C_{R}(X)^{-1} (x_{i} - T_{R}(X))]^{1/2}
where T_{R}(X) and C_{R}(X) are robust location and shape estimates such as MCD or MVE. By using robust location and shape estimates in the RMD, outlying points will not skew the estimates and can be identified as outliers by large values of the RMD. Unfortunately, using robust estimates gives RMDs with unknown distributional properties^{[25]}. The use of the χ^{2} quantile as a cutoff point for the RMD is prone to declaring some good low leverage points as high leverage points and often leads to identifying too many points as outliers^{[25]}.
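The resampling idea behind the MVE described above can be sketched as follows. This is a simplified illustration, not Rousseeuw's exact algorithm: subsample size p+1 and coverage h = ⌊(n+p+1)/2⌋ are assumed, and the data are hypothetical and seeded.

```python
import numpy as np

def mve_resampling(X, n_trials=500, seed=0):
    """Approximate Minimum Volume Ellipsoid via subset resampling (sketch).

    Each trial draws p+1 points, takes their mean and covariance, and
    inflates/deflates the ellipsoid until it contains exactly h points;
    the ellipsoid of smallest volume over all trials is kept."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = (n + p + 1) // 2              # assumed coverage, as in the MCD
    best = None
    for _ in range(n_trials):
        idx = rng.choice(n, size=p + 1, replace=False)
        T = X[idx].mean(axis=0)
        C = np.cov(X[idx], rowvar=False)
        if np.linalg.det(C) <= 1e-12:          # degenerate subsample, skip
            continue
        d2 = np.einsum("ij,jk,ik->i", X - T, np.linalg.inv(C), X - T)
        m2 = np.sort(d2)[h - 1]                # rescale to cover h points
        volume = np.sqrt(np.linalg.det(m2 * C))
        if best is None or volume < best[0]:
            best = (volume, T, m2 * C)
    return best[1], best[2]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(60, 2)),
               rng.normal(loc=10.0, scale=0.2, size=(10, 2))])
T_mve, C_mve = mve_resampling(X)
```

The returned location stays inside the inlier cloud even with a contaminated cluster present, so robust distances computed from it expose the outliers.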
To develop robust multivariate estimators, Rousseeuw and Leroy^{[23]} first proposed to detect outliers by the RMD and then to find the estimates by using reweighted least squares regression, where the weight function is a hard rejection function. Specifically, the latter proposal consists of discarding those observations whose RMD exceeds a certain fixed threshold value. Previously, the MVE was commonly used as the initial estimator for these procedures. In the context of linear regression, many estimators have been proposed that aim to reconcile high efficiency and robustness. Typically, these methods are also two-stage procedures^{[6,10,15,22,28,29]}.
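The hard-rejection reweighting just described can be sketched as follows. As assumptions not taken from the paper: the robust estimates come from a crude concentration-step (FastMCD-style) routine rather than a full MCD/MVE implementation, the threshold is the 0.975 chi-square quantile, and the data are hypothetical and seeded.

```python
import numpy as np
from scipy.stats import chi2

def mcd_estimates(Z, n_starts=20, n_csteps=10, seed=0):
    """Crude MCD via concentration steps (sketch; no consistency factor)."""
    rng = np.random.default_rng(seed)
    n, p = Z.shape
    h = (n + p + 1) // 2
    best_det, best_idx = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=h, replace=False)
        for _ in range(n_csteps):
            T, C = Z[idx].mean(axis=0), np.cov(Z[idx], rowvar=False)
            d2 = np.einsum("ij,jk,ik->i", Z - T, np.linalg.inv(C), Z - T)
            idx = np.argsort(d2)[:h]          # keep the h closest points
        det = np.linalg.det(np.cov(Z[idx], rowvar=False))
        if det < best_det:
            best_det, best_idx = det, idx
    return Z[best_idx].mean(axis=0), np.cov(Z[best_idx], rowvar=False)

def reweighted_ls(X, y):
    """Hard-rejection reweighting: drop rows whose RMD exceeds the
    chi-square cutoff, then refit by ordinary least squares."""
    T, C = mcd_estimates(X)
    rmd = np.sqrt(np.einsum("ij,jk,ik->i", X - T, np.linalg.inv(C), X - T))
    keep = rmd <= np.sqrt(chi2.ppf(0.975, df=X.shape[1]))
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return beta, keep

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(size=(55, 2)),
               rng.normal(loc=8.0, scale=0.2, size=(5, 2))])
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=60)
y[-5:] = 0.0                      # corrupted responses at the leverage rows
beta, keep = reweighted_ls(X, y)
```

Because the leverage rows are rejected before the refit, the least squares coefficients stay close to the true values despite the corrupted responses.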
Let us consider a regression model with k explanatory variables, y = Xβ + ε. The weight matrix W = X(X^{T}X)^{-1}X^{T} is the orthogonal projector onto the model space, or hat matrix, which is traditionally used as a measure of leverage in regression analysis. If a diagonal entry W_{ii} of W is large, changing y_{i} will move the fitted surface appreciably toward the altered value. Therefore, W_{ii} is said to measure the leverage of the observation y_{i}. Different cutoff points exist in the literature for the hat matrix to find high leverage points, such as the twice-the-mean rule (2k/n)^{[11]} and the thrice-the-mean rule (3k/n)^{[27]}, where k and n are the number of variables and observations respectively, and the three-interval rule of Huber^{[12]} (observations with 0.2 < W_{ii} < 0.5 are risky to include in the analysis and those with W_{ii} ≥ 0.5 should be avoided, where W_{ii} is a diagonal element of the hat matrix). The hat matrix may fail to identify high leverage points because of the effect of the high leverage points themselves on the leverage structure^{[7]}. Hadi^{[7]} introduced another diagnostic tool as follows: with w_{ii} = x_{i}^{T}(X^{T}X)^{-1}x_{i} the ith diagonal element of W, the ith diagonal potential p_{ii} can be defined as:

p_{ii} = x_{i}^{T}(X_{(i)}^{T}X_{(i)})^{-1}x_{i}
where X_{(i)} is the data matrix X without the ith row. He proposed a cutoff point for the potential values p_{ii} as Median(p_{ii}) + c Mad(p_{ii}), where Mad(p_{ii}) = median{|p_{ii} - median(p_{ii})|}/0.6745 and c can be taken as a constant value of 2 or 3. Observations exceeding Hadi's cutoff point are considered high leverage points, but this method also cannot detect all of the high leverage points. Imon^{[13]} introduced another diagnostic tool, the generalized potentials, for the whole data set as follows. Let D be the group deleted from the data set, that is, the cases suspected to be outliers (the choice of this deletion group is very important, since the omission of this group determines the weights for the whole data set). Let R be the remaining set after deleting d < (n - k) cases, so that it contains (n - d) cases. If we assume that the suspected cases are the last d rows of X and y, the weight matrix W = X(X^{T}X)^{-1}X^{T} can be written in partitioned form as:
W = | U_{R}   V     |
    | V^{T}   U_{D} |

where U_{R} = X_{R}(X^{T}X)^{-1}X_{R}^{T} and U_{D} = X_{D}(X^{T}X)^{-1}X_{D}^{T} are symmetric matrices of order (n - d) and d respectively and V = X_{R}(X^{T}X)^{-1}X_{D}^{T} is an (n - d)×d matrix. Now we can define:

w_{ii}^{(-D)} = x_{i}^{T}(X_{R}^{T}X_{R})^{-1}x_{i}, for i = 1, 2, …, n

where w_{ii}^{(-D)} is the ith diagonal element of the matrix X(X_{R}^{T}X_{R})^{-1}X^{T}.
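The hat-matrix leverages and Hadi's deletion potential above can be computed directly with numpy on hypothetical seeded data. The closed form p_{ii} = w_{ii}/(1 - w_{ii}) used below is a standard algebraic consequence of the Sherman-Morrison identity, and the code checks it against the explicit row-deletion form:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 25, 3
X = rng.normal(size=(n, k))

# Hat matrix W = X (X^T X)^{-1} X^T and its diagonal leverages w_ii.
W = X @ np.linalg.solve(X.T @ X, X.T)
w = np.diag(W)

# Hadi's potential via explicit deletion of row i ...
p_del = np.empty(n)
for i in range(n):
    Xi = np.delete(X, i, axis=0)                      # X_(i)
    p_del[i] = X[i] @ np.linalg.solve(Xi.T @ Xi, X[i])

# ... agrees with the closed form w_ii / (1 - w_ii) (Sherman-Morrison).
p_closed = w / (1.0 - w)

# Hadi's cutoff: Median(p_ii) + c * Mad(p_ii), here with c = 3.
mad = np.median(np.abs(p_del - np.median(p_del))) / 0.6745
cutoff = np.median(p_del) + 3.0 * mad
high_leverage = np.where(p_del > cutoff)[0]
```

Since trace(W) = k, the mean leverage is k/n, which is what the twice- and thrice-the-mean rules multiply.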
Then Imon^{[14]} introduced generalized potentials for all members of a data set, defined as:

p_{ii}^{*} = w_{ii}^{(-D)}/(1 - w_{ii}^{(-D)}) for i ∈ R and p_{ii}^{*} = w_{ii}^{(-D)} for i ∈ D

We should note that there is no finite upper bound for the p_{ii}^{*}'s and the derivation of their theoretical distribution is not easy. He introduced the same cutoff point as for the potential values, Median(p_{ii}^{*}) + c Mad(p_{ii}^{*}), for the generalized potentials as well. Habshah et al.^{[6]} developed a new method for determining outlying points in a multivariate data set by combining the RMD (MVE) method for detecting the suspected group (the D group) with the generalized potential method proposed by^{[14]}. This method, called DRGP (MVE), is also a two-step method for high leverage point detection. In their method, the Mad cutoff point is used in both the first and the second steps. However, this method can identify more swamped low leverage points. According to Werner^{[28]}, "A successful method of identifying outliers in all multivariate situations would be ideal, but is unrealistic". By "successful", he means both highly sensitive, the ability to detect genuine outliers, and highly specific, the ability not to swamp regular points as outliers. Therefore, a practical and efficient robust detection method for high leverage points (outliers in the X-direction) is one which is sensitive enough to detect genuine high leverage points and specific enough to swamp fewer low leverage points as high leverage points.

MATERIALS AND METHODS

In this study, we propose a two-step diagnostic tool for detecting multiple high leverage points which swamps fewer low leverage points. In order to improve on the DRGP (MVE) performance proposed by^{[6]}, we follow the idea of Rousseeuw and Leroy^{[23]} in developing robust multivariate estimators and propose a relatively new method for high leverage point identification, called the Two-Step Robust Diagnostic Mahalanobis Distance (RDMD^{TS}).
In the first step, the RMD (MCD) or RMD (MVE) method is used to detect the suspected outlier group, which is then deleted from the data set, resulting in the clean data for the next step. In the second step, we apply the MD to the entire data set based on the mean and covariance matrix of the clean data set obtained in the first step. Therefore, the Two-Step Robust Diagnostic Mahalanobis Distance (RDMD^{TS}) is written as follows:

RDMD_{i}^{TS} = [(x_{i} - T_{0}(X))^{T} C_{0}(X)^{-1} (x_{i} - T_{0}(X))]^{1/2}
where T_{0}(X) and C_{0}(X) are the mean and covariance matrix of the clean data set. Two different cutoff points are considered, namely (χ^{2}_{k, 0.975})^{1/2}, where k is the number of explanatory variables, and a newly proposed one, Median(RDMD^{TS}) + c Mad(RDMD^{TS}). The procedure of this method can be summarized in the following algorithm.
First step:
• Compute RMD_{i}(MCD) or RMD_{i}(MVE) for i = 1, …, n as defined in equation (2) in the multivariate case (both the x and y variables)
• Compare these values with (χ^{2}_{p, 0.975})^{1/2} to detect outliers (if any), where p is the number of x and y variables together
Second step:
• Find the mean and the covariance matrix of the clean subset of the explanatory variables, after removing the outliers suspected in the first step
• Find the classical MD, with the mean and covariance matrix of the clean data set from the first step, for the entire data (the x variables only)
• Compare these values with (χ^{2}_{k, 0.975})^{1/2} to detect high leverage points (if any), where k is the number of x variables. We refer to this method as (RDMD^{TS})-chisq; or:
• Compare these values with Median(RDMD^{TS}) + c Mad(RDMD^{TS}) to detect high leverage points (if any), where c is an appropriately chosen constant such as 2 or 3. We refer to this method as (RDMD^{TS})-mad
• Those points with RDMD^{TS} < (χ^{2}_{k, 0.975})^{1/2} or RDMD^{TS} < Median(RDMD^{TS}) + c Mad(RDMD^{TS}) are not considered high leverage points and are put back into the set of inliers
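The algorithm above can be sketched end to end as follows. This is a hedged illustration, not the authors' implementation: step 1 uses a crude concentration-step MCD in place of a full MCD/MVE routine (no consistency correction), the 0.975 chi-square quantile is assumed for both cutoffs, and the data are hypothetical and seeded.

```python
import numpy as np
from scipy.stats import chi2

def mcd_estimates(Z, n_starts=20, n_csteps=10, seed=0):
    """Crude MCD location/shape via concentration steps (sketch only)."""
    rng = np.random.default_rng(seed)
    n, p = Z.shape
    h = (n + p + 1) // 2
    best = (np.inf, None)
    for _ in range(n_starts):
        idx = rng.choice(n, size=h, replace=False)
        for _ in range(n_csteps):
            T, C = Z[idx].mean(axis=0), np.cov(Z[idx], rowvar=False)
            d2 = np.einsum("ij,jk,ik->i", Z - T, np.linalg.inv(C), Z - T)
            idx = np.argsort(d2)[:h]          # concentration step
        det = np.linalg.det(np.cov(Z[idx], rowvar=False))
        if det < best[0]:
            best = (det, idx)
    keep = best[1]
    return Z[keep].mean(axis=0), np.cov(Z[keep], rowvar=False)

def rdmd_ts(X, y, c=3.0):
    """Two-step RDMD^TS sketch: step 1 flags suspects on (X, y) jointly
    with RMD(MCD); step 2 recomputes the classical MD of X from the
    clean mean/covariance and applies both cutoffs."""
    Z = np.column_stack([X, y])
    n, p = Z.shape
    T, C = mcd_estimates(Z)
    rmd = np.sqrt(np.einsum("ij,jk,ik->i", Z - T, np.linalg.inv(C), Z - T))
    suspects = rmd > np.sqrt(chi2.ppf(0.975, df=p))     # first step
    clean = X[~suspects]
    T0, C0 = clean.mean(axis=0), np.cov(clean, rowvar=False)
    d = X - T0
    rd = np.sqrt(np.einsum("ij,jk,ik->i", d, np.linalg.inv(C0), d))
    k = X.shape[1]
    chisq_flag = rd > np.sqrt(chi2.ppf(0.975, df=k))    # (RDMD^TS)-chisq
    mad = np.median(np.abs(rd - np.median(rd))) / 0.6745
    mad_flag = rd > np.median(rd) + c * mad             # (RDMD^TS)-mad
    return rd, chisq_flag, mad_flag

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(45, 2)),
               rng.normal(loc=8.0, scale=0.3, size=(5, 2))])
y = X @ np.array([1.0, 1.0]) + rng.normal(scale=0.5, size=50)
rd, chisq_flag, mad_flag = rdmd_ts(X, y)
```

Both cutoffs flag the planted leverage rows here; as the simulation results discuss, their behavior differs mainly in how many low leverage points they swamp.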
RESULTS
Numerical examples: Two well-known data sets, frequently referred to in the study of the identification of influential observations, high leverage points and outliers, are considered in this study. It is important to note here that we changed the Mad cutoff point used by^{[6]} to the chi-square cutoff in the first step of the examples and also in the simulation study.

Hawkins-Bradu-Kass data: Hawkins et al.^{[9]} constructed an artificial three-predictor data set containing 75 observations with 10 outliers (cases 1-10) and 14 high leverage points (cases 1-14). Most of the previous single-case deletion identification methods fail to identify all of these influential observations; some of them wrongly identify four high leverage points as outliers^{[23]}. Table 1 shows the DRGP (MVE), DRGP (MCD), (RDMD^{TS})(MVE), (RDMD^{TS})(MCD) and MD values and their corresponding cutoff points.
Stack loss data: Here we consider the stack loss data^{[3]}, which have been extensively analyzed in the statistical literature. This three-predictor data set (air flow, cooling water inlet temperature and acid concentration) contains 21 observations with five influential observations, three of which (cases 1, 3 and 21) are high leverage outliers. One of the influential observations (case 4) is an outlier and another (case 2) is a high leverage point. Table 2 illustrates the DRGP (MVE), DRGP (MCD), (RDMD^{TS})(MVE), (RDMD^{TS})(MCD) and MD values and their corresponding cutoff points. Another useful detection tool, the DD plot, was proposed by Rousseeuw and Van Driessen^{[24]}.
Table 1: Diagnostic robust generalized potential based on MVE and MCD and two-step robust diagnostic Mahalanobis distance based on MVE and MCD for the Hawkins-Bradu-Kass data

Table 2: Diagnostic robust generalized potential based on MVE and MCD and two-step robust diagnostic Mahalanobis distance based on MVE and MCD for the stack loss data
In this plot, the classical MD_{i} is plotted against the robust MD_{i}. The low leverage points should cluster below the cutoff point lines, while the high leverage points will be separated from the bulk of the data and thus located above the cutoff points. The DD plots of the stack loss data are shown in Fig. 1a (MD vs. RDMD^{TS} (MCD)) and 1b (MD vs. RDMD^{TS} (MVE)) and in Fig. 2a (MD vs. DRGP (MCD)) and 2b (MD vs. DRGP (MVE)). In both plots of Fig. 1 there are two cutoff point lines, namely the Mad and the chi-square cutoffs, while only one cutoff point line (Mad) is employed by DRGP in plots (a) and (b) of Fig. 2.
Simulation study: In order to investigate the merit of our newly proposed method, we designed a Monte Carlo simulation experiment. In this study, we compared the Robust Diagnostic Mahalanobis Distance (RDMD^{TS}) with the other existing methods, for sample sizes equal to 20, 40, 60, 100 and 200. The first 100(1 - α)% of observations of the three regressors in each sample are generated from the Uniform(0, 1) distribution and the remaining 100α% of observations are constructed as high leverage points. The high leverage points are generated with unequal weights: the last observation in each sample is kept fixed at the value 10 and the other high leverage points increase in increments of five.
Fig. 1: (a) Mahalanobis distance against two-step robust diagnostic Mahalanobis distance based on MCD; (b) Mahalanobis distance against two-step robust diagnostic Mahalanobis distance based on MVE

Fig. 2: (a) Mahalanobis distance against diagnostic robust generalized potential based on MCD; (b) Mahalanobis distance against diagnostic robust generalized potential based on MVE
We ran 10,000 simulations for each of these five sample sizes. The results are presented in Table 3.
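The simulated design can be sketched as follows. Since the description of the leverage pattern is ambiguous, this is only one plausible reading (assumed here, not taken from the paper): the last contaminated row is fixed at 10 and the earlier contaminated rows grow in increments of five.

```python
import numpy as np

def simulate_design(n, alpha, seed=0):
    """Generate one simulated data set per the described design (one
    plausible reading): 100(1-alpha)% clean Uniform(0,1) rows for the
    three regressors, the rest replaced by high leverage rows."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 3))
    n_hlp = int(round(alpha * n))
    # assumed pattern: magnitudes ..., 20, 15, 10 with the last row at 10
    levels = 10.0 + 5.0 * np.arange(n_hlp)[::-1]
    X[n - n_hlp:] = levels[:, None]
    return X, n_hlp

X, n_hlp = simulate_design(100, 0.20, seed=7)
```

Any detection method can then be run on such replicates, counting how many genuine high leverage rows are found and how many clean rows are swamped.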
DISCUSSION

Let us focus our attention on the results for the Hawkins-Bradu-Kass data presented in Table 1. The RMD (MCD) and the RMD (MVE) both detect cases 1-10 as outliers. In addition, RMD (MCD) identifies observations 11-14, 47 and 53 as outliers, while RMD (MVE) swamps observations 11-14 and 53 (not shown due to space limitations). Although these robust methods are more powerful than the MD, which detects only two outliers (cases 12 and 14), their performance as high leverage detection tools can still be improved. As proposed in the second step of the (RDMD^{TS}), we find the mean and covariance matrix of the clean data set, for both RMD (MCD) and RMD (MVE), after deleting the suspected outlier group. Finally, we compute the distance of the whole data set from this clean mean and clean covariance matrix for the x variables only. It is obvious from Table 1 that both our proposed method and the method of Habshah et al.^{[6]} detect the 14 high leverage points with both the Mad and chi-square cutoff points. However, the values of (RDMD^{TS}) lie further from their corresponding cutoff points than those of DRGP. Thus, the new method enhances the chance of detecting these 14 observations as high leverage points.
Table 3: Results of 10,000 simulations comparing RDMD^{TS} and DRGP based on MCD and MVE

#LLP = number of Low Leverage Points; #HLP = number of High Leverage Points
Let us now turn to the stack loss data, where the RMD (MVE) detects the four outliers plus one more, case 2. Furthermore, the RMD (MCD) detects the four outliers as well as cases 2, 13, 14 and 20. (The RMD (MVE) and RMD (MCD) values are not presented due to space constraints.) After deleting the outliers from the data set and using the mean and covariance matrix of the cleaned data set from the first step, the (RDMD^{TS}) identifies exactly four high leverage points. The DRGP (MCD) and DRGP (MVE) of Table 2 also identify these four high leverage points. As with the Hawkins-Bradu-Kass data, a similar conclusion can be drawn from this example regarding the higher chances of (RDMD^{TS}) detecting high leverage points. The results of Table 2 show that (RDMD^{TS}) detects these four high leverage points easily, whereas, due to its masking problem, the MD cannot detect any high leverage points. Looking at Fig. 1 and 2, it is obvious that the MD could not identify any high leverage points, while the other four robust methods identify the four high leverage points easily.

Next, we discuss whether the simulation results confirm the conclusion of the numerical examples that our proposed method performs better than the DRGP and MD methods. It can be observed from Table 3 that, for small sample sizes, the (RDMD^{TS}) based on MCD or MVE with the chi-square cutoff point swamps more low leverage points than the (RDMD^{TS}) based on MCD or MVE with the Mad cutoff point. Nevertheless, as the sample size increases, the chi-square cutoff performs better and identifies fewer low leverage points, though still more than (RDMD^{TS})-mad. It is obvious from the results of Table 3 that (RDMD^{TS}) (MVE)-mad outperforms (RDMD^{TS}) (MCD)-mad in identifying fewer low leverage points in small sample sizes.
For large sample sizes such as 200 (with 20 or 25% high leverage points), the two methods (RDMD^{TS}) (MVE)-mad and (RDMD^{TS}) (MCD)-mad are equally good and do a credible job of detecting high leverage points. Comparing (RDMD^{TS})-mad with DRGP based on MCD or MVE, the number of low leverage points identified is smaller when our newly proposed methods are used. When the sample size is 100 or 200 and 20 or 25% high leverage points are added, (RDMD^{TS})-mad detects exactly the high leverage points with no low leverage points, while DRGP swamps some low leverage points. When the number of high leverage points and the sample size are very small (sample size 20 with 5% high leverage points), DRGP swamps fewer low leverage points than (RDMD^{TS})-mad. As the number of high leverage points and the sample size increase, (RDMD^{TS})-mad overtakes DRGP in detecting fewer low leverage points.

CONCLUSION

The presence of high leverage points affects all least squares models, which are extensively used in data exploration and modeling. In multivariate settings the identification of high leverage points is much more difficult. Furthermore, it is difficult to detect outliers in p-variate data when p > 2, as one can no longer rely on visual inspection. Among the outlier detection tools, the Mahalanobis Distance is powerful for detecting a single outlier. This approach is not applicable to multiple outliers because of the masking effect, by which multiple outliers do not necessarily have large Mahalanobis distance values. It is better to use distances based on robust estimators of multivariate location and scatter^{[23]}. In regression analysis, the robust distances are computed from the explanatory variables, which allows us to detect high leverage points. The main contribution of this study is to introduce a two-step robust diagnostic method based on the Robust Mahalanobis Distance.
This relatively new method not only detects exactly the high leverage points but also identifies fewer low leverage points than existing methods such as the Diagnostic Robust Generalized Potential. To investigate the merit of the new method, a Monte Carlo simulation was carried out. The results of this study indicate that for small sample sizes the best detection method is (RDMD^{TS}) (MVE)-mad, whereas there is not much difference between (RDMD^{TS}) (MVE)-mad and (RDMD^{TS}) (MCD)-mad for large sample sizes. However, when the sample size is very small, such as 20, and the number of high leverage points is 5% of the data set, it is better to use DRGP (MVE), which detects fewer low leverage points.