Where y is the n×1 vector of observed values of the response variable and X is the n×p matrix of observed values of the p explanatory variables. The vector β is an unknown p×1 vector of regression coefficients and ε is the n×1 vector of error terms, assumed to be independent and identically normally distributed with mean 0 and constant variance σ^{2}. In the regression setting there are two different ways of conducting bootstrapping, namely random-x resampling and fixed-x resampling; the latter is also referred to as bootstrapping the residuals. Riadh et al.^{[1]} used random-x resampling together with the OLS method in their bootstrap algorithm. In this study, the fixed-x resampling technique with the OLS method is adopted. We call this estimator the Classical Bootstrap fixed-x Resampling Method (CBRM). The CBRM procedure, as enumerated by Efron and Tibshirani^{[3]}, is summarized as follows:
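The fixed-x (residual) resampling scheme with OLS refits can be sketched in a short Python example. This is a minimal sketch, not the paper's S-Plus implementation; the function names and the assumption that X carries an intercept column are ours:

```python
import numpy as np

def ols(X, y):
    # Ordinary least squares fit; X is assumed to include an intercept column
    return np.linalg.lstsq(X, y, rcond=None)[0]

def cbrm(X, y, B=500, seed=None):
    """Classical Bootstrap fixed-x Resampling Method (sketch).

    Fit OLS once, resample the residuals with replacement, rebuild the
    response as y* = X @ beta_hat + e*, and refit OLS on each resample.
    Returns the B coefficient vectors and each resample's mean squared
    residual (MSR).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    beta_hat = ols(X, y)
    resid = y - X @ beta_hat                 # residuals of the original fit
    betas = np.empty((B, X.shape[1]))
    msr = np.empty(B)
    for b in range(B):
        e_star = rng.choice(resid, size=n, replace=True)  # fixed-x resampling
        y_star = X @ beta_hat + e_star
        betas[b] = ols(X, y_star)
        msr[b] = np.mean((y_star - X @ betas[b]) ** 2)
    return betas, msr
```

Because the design matrix X is held fixed and only the residuals are resampled, this scheme preserves the leverage structure of the original data, unlike random-x resampling of whole cases.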
According to Imon and Ali^{[5]}, there is no general agreement among statisticians on the number of replications needed in the bootstrap. B can be as small as 25, but for estimating standard errors B is usually in the range 25-250. They point out that for bootstrap confidence intervals a much larger value of B is required, normally taken to be in the range 500-10,000. Riadh et al.^{[1]} pointed out that the bootstrap standard deviation can be estimated as follows:
where MSR is the mean squared residual of a bootstrap resample, MSR_{b} = (1/n) Σ_{i=1}^{n} (e_{i}^{*b})^{2} for b = 1, 2, …, B, and the corresponding bootstrap location estimate is the mean of the MSR_{b} values over the B resamples, (1/B) Σ_{b=1}^{B} MSR_{b}.
The drawback of using the classical standard deviation and the classical mean to estimate the bootstrap scale and location in Eq. 2 and 4 is that they are very sensitive to outliers. As an alternative, robust location and scale estimates that are less affected by outliers are proposed. The robust bootstrap location and scale estimates are given by (5) and (6) as follows:
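To illustrate this sensitivity, the sketch below compares the classical mean/standard deviation of a set of bootstrap MSR values with a robust median/MAD pair. The displayed forms of (5) and (6) are not reproduced in this extract, so the median and the normal-consistent MAD are used here as one standard robust choice, not necessarily the paper's exact formulas:

```python
import numpy as np

def classical_loc_scale(msr):
    # Classical bootstrap location (mean) and scale (standard deviation)
    return float(np.mean(msr)), float(np.std(msr, ddof=1))

def robust_loc_scale(msr):
    # Robust location (median) and scale (MAD, scaled for normal consistency)
    med = float(np.median(msr))
    mad = 1.4826 * float(np.median(np.abs(msr - med)))
    return med, mad

# Five well-behaved resamples and one grossly outlying MSR value
msr = np.array([1.00, 1.10, 0.90, 1.05, 0.95, 25.0])
loc_c, scale_c = classical_loc_scale(msr)   # loc_c = 5.0, dragged toward the outlier
loc_r, scale_r = robust_loc_scale(msr)      # loc_r = 1.025, stays near the bulk
```

A single outlying resample moves the classical location from about 1 to 5 and inflates the classical scale by two orders of magnitude, while the median/MAD pair is essentially unchanged.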
Robust Bootstrap Based on Fixed-x Resampling (RBRM): Unfortunately, many researchers are not aware that the performance of OLS can be very poor when a data set for which one makes a normality assumption in fact has a heavy-tailed distribution, which may arise as a result of outliers. Even a single outlier can have an arbitrarily large effect on the OLS estimates^{[8]}. It is now evident that bootstrap estimates can be adversely affected by outliers, because the proportion of outliers in the bootstrap samples can be higher than that in the original data^{[4]}. These situations are not desirable because they might produce misleading results. An attempt has been made to make the bootstrap estimates more efficient. We propose to modify the CBRM procedure by combining a logical outlier-detection procedure with the robust Least Trimmed Squares (LTS) estimator, so that outliers have less influence on the parameter estimates. We call this estimator the Robust Bootstrap fixed-x Resampling Method (RBRM). We summarize the RBRM as follows:


The bootstrap scale and location estimates in Eq. 2 and 4 are based on the Mean Squared Residual, which is sensitive to outliers. We propose to replace the Mean Squared Residual (MSR) with a more robust measure, the Median Squared Residual (RMSR). The proposed robust bootstrap location and scale estimates are as follows:
where RMSR is the Median Squared Residual; for each observation i, i = 1, 2, …, n and for each b = 1, 2, …, B, compute:
We also compare the performance of (7) and (8) with the classical formulation of the bootstrap standard deviation and location computed from the Median Squared Residual instead of the Mean Squared Residual. These measures are given by:
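The effect of replacing the mean by the median within a single resample can be seen in a small sketch: when a bootstrap resample carries a few gross residuals, the MSR is inflated while the RMSR barely moves. The numbers below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, 95)             # well-behaved residuals
gross = rng.normal(10.0, 3.0, 5)             # 5% grossly outlying residuals
resid_star = np.concatenate([clean, gross])  # one contaminated bootstrap resample

sq = resid_star ** 2
msr = float(np.mean(sq))     # mean squared residual: inflated by the outliers
rmsr = float(np.median(sq))  # median squared residual: resistant to them
```

With 5% contamination the MSR is pulled up by a factor of several, while the RMSR stays close to the median squared residual of the clean component.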
The RBRM procedure commences with estimating the robust regression parameters using the LTS method, which trims some of the values from both sides; that is, some values from the data which are labeled as outliers are deleted. In this situation the OLS estimate β will be either larger or smaller than β_{LTS}. In Step 2, outliers might be present and can be candidates for selection in Step 3. Since we consider sampling with replacement, each outlier might be chosen more than once. Consequently, there is a possibility that a bootstrap sample contains more outliers than the original sample. We try to overcome this problem by determining the alpha value based on the percentage of outliers in the bootstrap resamples detected in Step 3. In this respect, we develop a dynamic detection subroutine program that detects the proportion of outliers in each bootstrap resample. Step 4 of the RBRM includes the computation of the bootstrap y values by using the LTS based on the first three steps. The LTS is expected to be more reliable than the OLS when outliers are present in the data, because it is a robust method that is not sensitive to outliers. As mentioned earlier, the number of observations trimmed in the LTS procedure depends on the alpha value corresponding to the percentage of outliers detected. In this way, the effect of outliers is reduced. According to Riadh et al.^{[1]}, the best model to be selected among several models is the one with the smallest location and scale estimates, or the minimum scale estimate.
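The paper's dynamic detection subroutine is not listed in this extract. A minimal sketch of the LTS building block (random elemental starts followed by concentration steps, in the spirit of the FAST-LTS algorithm), together with a trimming fraction alpha set from a detected outlier proportion, might look as follows; all function names and the simplified algorithm are our own:

```python
import numpy as np

def lts_fit(X, y, alpha=0.75, n_starts=20, seed=None):
    """Least Trimmed Squares sketch: keep the h = ceil(alpha * n)
    observations with the smallest squared residuals.

    Random p-point elemental starts are refined by concentration steps
    (refit on the h best-fitting observations); the fit with the smallest
    trimmed sum of squared residuals wins.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = int(np.ceil(alpha * n))
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=p, replace=False)       # elemental start
        beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        for _ in range(10):                               # concentration steps
            keep = np.argsort((y - X @ beta) ** 2)[:h]
            beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta

def adaptive_alpha(outlier_fraction, floor=0.5):
    # Trim at least the detected outlier proportion, but keep >= half the data
    return max(floor, 1.0 - outlier_fraction)
```

For example, a bootstrap resample in which 20% of cases are flagged would be refitted with `alpha = adaptive_alpha(0.20) = 0.8`, so the 20% largest squared residuals are excluded from the trimmed fit.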
RESULTS Several well-known data sets in robust regression are presented to compare the performance of the CBRM and RBRM procedures. Comparisons between the estimators are based on their bootstrap location and scale estimates. We have examined many examples; due to space constraints, we include only three real examples and one simulated data set. The conclusions of the other results are consistent and are not presented. All computations were done using S-Plus® 6.2 for Windows, Professional Edition. Hawkins, Bradu and Kass data: Hawkins et al.^{[8]} constructed an artificial three-predictor data set containing 75 observations with 10 outliers in both spaces (cases 1-10), 4 outliers in the X-space (cases 11-14) and 61 low-leverage inliers (cases 15-75). Most single-case deletion identification methods fail to identify the outliers in the Y-space, though some of them point out cases 11-14 as outliers in the Y-space. We consider four models:
Table 1: CBRM results of Hawkins data
Table 2: RBRM results of Hawkins data
Table 3: CBRM results of Stackloss data
Tables 1 and 2 show the estimated bootstrap location and scale estimates based on the CBRM and RBRM procedures. Stackloss data^{[8]}: The Stackloss data is a well-known data set presented by Brownlee^{[9]}. The data describe the operation of a plant for the oxidation of ammonia to nitric acid and consist of 21 four-dimensional observations. The stack loss (y) is related to the rate of operation (x1), the cooling water inlet temperature (x2) and the acid concentration (x3). Most robust statistics researchers have concluded that observations 1, 2, 3 and 21 are outliers. We consider four models:
Tables 3 and 4 show the bootstrap location and scale estimates of the Stackloss data based on the CBRM and RBRM procedures.
Table 4: RBRM results of Stackloss data
Coleman data^{[8]}: This data set, studied by Coleman et al.^{[10]}, contains information on 20 schools from the Mid-Atlantic and New England states. Mosteller and Tukey^{[11]} analyzed these data with measurements of five independent variables. Previous studies refer to observations 3, 17 and 18 as outliers^{[8]}. We consider fifteen models as follows:
Tables 5 and 6 show the results of the CBRM and RBRM for the Coleman data.
Table 5: CBRM results of Coleman data
Table 6: RBRM results of Coleman data
Simulation study: A simulation study similar to that of Riadh et al.^{[1]} is presented to assess the performance of the RBRM procedure. Consider the problem of fitting a linear model:
In this study, we simulate a data set by putting:
where ε_{i} is a random variable with distribution N(0, 0.04).
Table 7: CBRM results of simulated data
Table 8: RBRM results of simulated data
We then contaminated the residuals. At each step, one 'good' residual was deleted and replaced with a contaminated residual generated from N(10, 9). We consider 5, 10, 15 and 20% contaminated residuals and three models:
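The contamination scheme can be sketched as follows. N(10, 9) is read here as mean 10 and variance 9, i.e., standard deviation 3, and the function names are ours:

```python
import numpy as np

def contaminate(resid, frac, seed=None):
    """Replace a fraction of 'good' residuals with draws from N(10, 9),
    i.e. mean 10 and standard deviation 3."""
    rng = np.random.default_rng(seed)
    out = resid.copy()
    k = int(round(frac * len(resid)))
    idx = rng.choice(len(resid), size=k, replace=False)  # which residuals to replace
    out[idx] = rng.normal(10.0, 3.0, size=k)
    return out

# Clean residuals from N(0, 0.04), i.e. standard deviation 0.2
eps = np.random.default_rng(1).normal(0.0, 0.2, 100)
eps_5 = contaminate(eps, 0.05, seed=2)   # 5% contamination level
```

Because the clean residuals have standard deviation 0.2 while the contaminated ones are centered at 10, the replaced points stand far outside the bulk of the distribution, which is exactly what the residual plots in the figures display.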
Tables 7 and 8 show the results of the CBRM and RBRM procedures. Graphical displays are used to explain why a particular model is selected. We present only the results for Models 1-3 of the simulated data at 5% outliers due to space limitations. The residual plot before the bootstrap procedure is shown in Fig. 1. Fig. 2-4 show the boxplots of the bootstrap MSR for Models 1-3, while Fig. 5-7 show the boxplots of the RMSR for Models 1-3.
Fig. 1: Residuals before bootstrap
Fig. 2: Boxplot of the bootstrap MSR for Model 1
Fig. 3: Boxplot of the bootstrap MSR for Model 2
DISCUSSION Let us first focus our attention on the results of the Hawkins data for the CBRM procedure, presented in Table 1. Among the four models considered, the bootstrap location and scale estimates of Model 4 are the smallest.
Fig. 4: Boxplot of the bootstrap MSR for Model 3
Fig. 5: Boxplot of the RMSR for Model 1
Fig. 6: Boxplot of the RMSR for Model 2
It is important to note that the median-based scale estimate is smaller than the mean-based scale estimate. This indicates that the formulation of the scale estimate based on the median is more efficient than that based on the mean. In this respect, the CBRM suggests that Model 4 is the best model. However, the results of the RBRM procedure in Table 2 signify that Model 1 is the best model: its median-based scale estimate is the smallest among the four models. It is interesting to note that the overall results indicate that the median-based scale estimate of the RBRM procedure is the smallest. Thus, the median-based RBRM has increased the efficiency of the estimates.

It can be observed from Tables 3 and 4 of the Stackloss data that the scale estimates based on the median are more efficient than those based on the mean for both the CBRM and RBRM procedures. Similarly, the median-based RBRM has the smallest scale estimates. The CBRM indicates that Model 2 is the best model, while the RBRM suggests that Model 1 is the best model. Nonetheless, model selection based on the median-based RBRM is more efficient and more reliable, as indicated by its location and scale estimates, which are the smallest among the models considered.

Table 5 of the Coleman data reveals that Model 14 of the CBRM is the best model, evidenced by the smallest location and scale estimates. In fact, the median-based location and scale estimates of Model 14 are smaller than the mean-based ones. The results of the RBRM in Table 6 signify that Model 1's location and scale estimates are the smallest among the 15 models considered. For this model, the median-based RBRM is more efficient than the mean-based RBRM, as indicated by its location and scale estimates, which are smaller than those of the mean-based RBRM.
The results of the simulated data in Table 7 show that Model 2 is the best model for all outlier percentage levels, because the scale estimate of Model 2 is the smallest compared to the other models. Nonetheless, the RBRM results of Table 8 suggest that Model 1 is the best model. As for the Hawkins, Stackloss and Coleman data, the median-based RBRM is more efficient than the mean-based RBRM; in fact, the scale estimates of the median-based RBRM are remarkably smaller than those of the mean-based RBRM for all outlier percentage levels. The results of the simulation study thus indicate that the median-based RBRM is the more efficient and reliable procedure. Here, we explain further why Model 2 is selected by the CBRM, considering only the 5% outlier case due to space constraints. Fig. 1 shows that there are 5% outliers in the residuals before the bootstrap is employed. It can be seen from Fig. 2-4 that the number of outliers of the MSR in Fig. 2 equals 18, while there are only 14 in Fig. 3.
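The outlier counts read off the boxplots correspond to the standard boxplot convention of flagging points beyond 1.5 times the interquartile range from the quartiles. A sketch of that count, using our own function name:

```python
import numpy as np

def boxplot_outlier_count(x):
    # Count points outside the standard boxplot whiskers (1.5 * IQR rule)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return int(np.sum((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)))
```

Applying this rule to the B bootstrap MSR values of each model gives counts like the 18 and 14 quoted above for Fig. 2 and 3.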
Fig. 7: Boxplot of the RMSR for Model 3
The median of the MSR in Fig. 4 is very large compared to Fig. 2 and 3. Among the three models in Fig. 2-4, the CBRM chooses Model 2 as the best model because the proportion of outliers of the MSR is smaller than in the other two models. On the other hand, the RBRM selects Model 1 as the best model. Comparing Fig. 5-7 with Fig. 2-4, it can be seen that there are no outliers in the distribution of the Median Squared Residuals when the RBRM method is employed, while apparent outliers are seen in the distribution of the Mean Squared Residuals when the CBRM is employed. In this situation, the RBRM has an attractive feature. Among the three models considered, the RMSR bootstrap resamples of Model 1 are the most efficient, being the most compact in the central region compared to the other two models. Accordingly, Model 1 is recommended, as its RMSR is more consistent and more efficient.
CONCLUSION In this study, we propose a new robust bootstrap method for model selection. The proposed bootstrap method attempts to overcome the problem of having more outliers in the bootstrap samples than in the original data set. The RBRM procedure incorporates a dynamic subroutine program that is capable of detecting the percentage of outliers in the data. The results indicate that the RBRM consistently outperformed the CBRM procedure. The best model selected always corresponds to the median-based RBRM with the smallest bootstrap scale estimate. Hence, utilizing the median-based RBRM in model selection can substantially improve the accuracy and the efficiency of the estimates. Thus, the median-based RBRM is more reliable for linear regression model selection.