American Journal of Applied Sciences
Year: 2009  |  Volume: 6  |  Issue: 6  |  Page No.: 1191 - 1198

Linear Regression Model Selection Based on Robust Bootstrapping Technique

Hassan S. Uraibi, Habshah Midi, Bashar A. Talib and Jabar H. Yousif    

Abstract: Problem statement: The bootstrap approach introduced new advances in modeling and model evaluation. It is a computer-intensive method that can replace theoretical formulation with extensive use of the computer. The Ordinary Least Squares (OLS) method is often used to estimate the parameters of regression models in the bootstrap procedure. Unfortunately, many statistics practitioners are not aware that the OLS method can be adversely affected by the existence of outliers. As an alternative, a robust method is put forward to overcome this problem. The existence of outliers in the original sample may create problems for the classical bootstrap estimates. Since bootstrap re-sampling is done with replacement, a bootstrap sample may contain more outliers than the original data set. Consequently, the outliers will have an undue effect on the classical bootstrap mean and standard deviation. Approach: In this study, we proposed a robust bootstrapping method that is less sensitive to outliers, replacing the classical bootstrap mean and standard deviation with robust location and robust scale estimates. A number of numerical examples were carried out to assess the performance of the proposed method. Results: The results suggested that the robust bootstrap method is more efficient than the classical bootstrap. Conclusion/Recommendations: In the presence of outliers in the data set, we recommend using the robust bootstrap procedure, as its estimates are more reliable.

The multiple linear regression model can be written as:

y = X\beta + \varepsilon  (1)

Where:

y = The n×1 vector of observed values for the response variable

X = The n×p matrix of observed values for the explanatory variables

The vector β is an unknown p×1 vector of regression coefficients and ε is the n×1 vector of error terms, which are assumed to be independently, identically and normally distributed with mean 0 and constant variance σ². In the regression setting, there are two different ways of conducting bootstrapping: random-x re-sampling and fixed-x re-sampling, the latter also referred to as bootstrapping the residuals. Riadh et al.[1] used random-x re-sampling together with the OLS method in their bootstrap algorithm. In this study, the fixed-x re-sampling technique with the OLS method is adopted. We call this estimator the Classical Bootstrap fixed-x Resampling Method (CBRM).

The CBRM procedure, as enumerated by Efron and Tibshirani[3], is summarized as follows:

1. Fit the model to the original data by OLS to obtain the estimate β̂, the fitted values ŷ = Xβ̂ and the residuals e = y − Xβ̂
2. Draw a bootstrap sample e* of size n, with replacement, from the residuals
3. Form the bootstrap responses y* = Xβ̂ + e*, keeping the X-values fixed
4. Refit the model by OLS to (X, y*) to obtain the bootstrap estimate β̂*
5. Repeat steps 2-4 B times

According to Imon and Ali[5], there is no general agreement among statisticians on the number of replications needed in the bootstrap. B can be as small as 25, but for estimating standard errors B is usually in the range of 25-250. They point out that for bootstrap confidence intervals a much larger value of B is required, normally taken to be in the range of 500-10,000.
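The fixed-x residual bootstrap described above can be sketched in a few lines. This is a minimal illustration, not the paper's S-Plus code: the data set, function names and B = 250 (a value inside the range quoted above) are all assumptions of the sketch.

```python
# Sketch of the CBRM fixed-x (residual) bootstrap: resample OLS residuals,
# refit, and record the mean squared residual (MSR) of each bootstrap sample.
import numpy as np

def ols_fit(X, y):
    """OLS coefficient estimates via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def cbrm_msr(X, y, B=250, seed=0):
    """Return the MSR of each of B fixed-x bootstrap replications."""
    rng = np.random.default_rng(seed)
    n = len(y)
    beta = ols_fit(X, y)
    fitted = X @ beta
    resid = y - fitted
    msr = np.empty(B)
    for b in range(B):
        e_star = rng.choice(resid, size=n, replace=True)  # resample residuals
        y_star = fitted + e_star                          # X stays fixed
        beta_star = ols_fit(X, y_star)                    # refit by OLS
        msr[b] = np.mean((y_star - X @ beta_star) ** 2)   # mean squared residual
    return msr

# Illustrative synthetic data: y = 2 + 1.5 x + noise
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([2.0, 1.5]) + rng.normal(scale=0.5, size=30)
msr_boot = cbrm_msr(X, y)
# Classical bootstrap location (mean) and scale (standard deviation) of the MSR:
loc, scale = msr_boot.mean(), msr_boot.std(ddof=1)
```

The B bootstrap MSR values are the raw material that both the classical and the robust location/scale estimates discussed below summarize.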

Riadh et al.[1] pointed out that the bootstrap standard deviation can be estimated as follows:

\hat{\sigma}_{boot} = \left[ \frac{1}{B-1} \sum_{b=1}^{B} \left( MSR^{(b)} - \overline{MSR} \right)^{2} \right]^{1/2}  (2)

where MSR^{(b)} is the mean squared residual of the b-th bootstrap sample, denoted as:

MSR^{(b)} = \frac{1}{n} \sum_{i=1}^{n} \left( e_i^{(b)} \right)^{2}  (3)

and the corresponding bootstrap location estimate is the mean:

\overline{MSR} = \frac{1}{B} \sum_{b=1}^{B} MSR^{(b)}  (4)

The drawback of using the classical mean and the classical standard deviation to estimate the bootstrap location and scale in Eq. 2 and 4 is that they are very sensitive to outliers. As an alternative, robust location and scale estimates, which are less affected by outliers, are proposed. The robust bootstrap location and scale estimates are given by (5) and (6) as follows:

\hat{\mu}_{rob} = \underset{1 \le b \le B}{\mathrm{median}} \; MSR^{(b)}  (5)

\hat{\sigma}_{rob} = 1.4826 \cdot \underset{1 \le b \le B}{\mathrm{median}} \left| MSR^{(b)} - \hat{\mu}_{rob} \right|  (6)
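The effect of swapping the classical mean and standard deviation for robust counterparts can be seen on a tiny example. The median and the normalized MAD are used here as one standard choice of robust location and scale; treating them as the paper's exact estimators is an assumption of this sketch.

```python
# Robust location (median) and scale (normalized MAD) versus the classical
# mean and standard deviation, on a vector with one gross outlier.
import numpy as np

def robust_location_scale(values):
    """Median and normalized MAD (the 1.4826 factor makes the MAD
    consistent for the normal distribution)."""
    med = np.median(values)
    mad = 1.4826 * np.median(np.abs(values - med))
    return med, mad

clean = np.array([1.00, 1.10, 0.90, 1.05, 0.95])
dirty = np.append(clean, 100.0)                 # one gross outlier
mean_d, sd_d = dirty.mean(), dirty.std(ddof=1)  # dragged toward the outlier
med_d, mad_d = robust_location_scale(dirty)     # stay near 1
```

A single contaminated value pulls the mean above 17 and inflates the standard deviation, while the median and MAD remain close to the clean values, which is exactly the property the robust bootstrap exploits.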

Robust Bootstrap Based on the Fixed-x Resampling (RBRM): Unfortunately, many researchers are not aware that the performance of OLS can be very poor when the data, for which one often makes a normality assumption, have a heavy-tailed distribution that may arise as a result of outliers. Even a single outlier can have an arbitrarily large effect on the OLS estimates[8]. It is now evident that the bootstrap estimates can be adversely affected by outliers, because the proportion of outliers in the bootstrap samples can be higher than that in the original data[4]. These situations are not desirable because they might produce misleading results. An attempt has been made to make the bootstrap estimates more efficient. We propose to modify the CBRM procedure by combining it with the robust Least Trimmed Squares (LTS) estimator, so that outliers have less influence on the parameter estimates. We call this estimator the Robust Bootstrap fixed-x Re-Sampling Method (RBRM). We summarize the RBRM as follows:

1. Fit the model to the original data by LTS to obtain the robust estimate, the fitted values and the residuals
2. Draw a bootstrap sample of the residuals of size n, with replacement
3. Detect the proportion of outliers in the bootstrap resample and set the trimming proportion alpha accordingly
4. Form the bootstrap responses and refit the model by LTS with trimming proportion alpha; repeat steps 2-4 B times
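The LTS estimator at the heart of the RBRM minimizes the sum of the h smallest squared residuals, so the most aberrant cases are trimmed from the fit. A minimal concentration-step implementation is sketched below; the random elemental starts, tuning values and data are illustrative assumptions, not the paper's algorithm.

```python
# A minimal Least Trimmed Squares (LTS) sketch: random elemental starts
# followed by concentration steps that repeatedly refit on the h cases
# with the smallest squared residuals.
import numpy as np

def lts_fit(X, y, alpha=0.25, n_starts=50, n_csteps=10, seed=0):
    """LTS with trimming proportion alpha: keep h = ceil(n*(1-alpha)) cases."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = int(np.ceil(n * (1 - alpha)))
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        subset = rng.choice(n, size=p, replace=False)           # elemental start
        beta = np.linalg.lstsq(X[subset], y[subset], rcond=None)[0]
        for _ in range(n_csteps):                               # concentration steps
            keep = np.argsort((y - X @ beta) ** 2)[:h]          # h best-fitting cases
            beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta

# 20% of responses shifted far upward: OLS is pulled away, LTS is not.
rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.2, size=n)
y[:10] += 20.0                                                  # gross outliers
beta_lts = lts_fit(X, y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

The LTS coefficients stay near the true values (1, 2) while the OLS intercept is dragged upward by the contaminated cases, which illustrates why the RBRM refits each bootstrap sample by LTS rather than OLS.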

The bootstrap scale and location estimates in Eq. 2 and 4 are based on the mean squared residual, which is sensitive to outliers. We propose to replace the Mean Squared Residual (MSR) with a more robust measure, the Median Squared Residual (RMSR). The proposed robust bootstrap location and scale estimates are as follows:

\tilde{\mu}_{rob} = \underset{1 \le b \le B}{\mathrm{median}} \; RMSR^{(b)}  (7)

\tilde{\sigma}_{rob} = 1.4826 \cdot \underset{1 \le b \le B}{\mathrm{median}} \left| RMSR^{(b)} - \tilde{\mu}_{rob} \right|  (8)

where RMSR^{(b)} is the median squared residual of the b-th bootstrap sample; for each observation i, i = 1, 2, …, n and each b = 1, 2, …, B, compute:

e_i^{(b)} = y_i^{*(b)} - x_i^{T} \hat{\beta}^{(b)}  (9)

RMSR^{(b)} = \underset{1 \le i \le n}{\mathrm{median}} \left( e_i^{(b)} \right)^{2}  (10)

We would also like to compare the performance of (7) and (8) with the classical formulations of the bootstrap location and standard deviation, but based on median squared residuals instead of mean squared residuals. These measures are given by:

\overline{RMSR} = \frac{1}{B} \sum_{b=1}^{B} RMSR^{(b)}  (11)

\hat{\sigma}_{RMSR} = \left[ \frac{1}{B-1} \sum_{b=1}^{B} \left( RMSR^{(b)} - \overline{RMSR} \right)^{2} \right]^{1/2}  (12)
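The difference between summarizing residuals by their mean square or their median square is easy to demonstrate directly. The sketch below uses synthetic residuals (illustrative only) with 5% gross contamination.

```python
# Mean Squared Residual (MSR) versus Median Squared Residual (RMSR)
# on a contaminated residual vector.
import numpy as np

def msr(e):
    """Mean squared residual (classical summary)."""
    return np.mean(e ** 2)

def rmsr(e):
    """Median squared residual (robust summary)."""
    return np.median(e ** 2)

rng = np.random.default_rng(0)
e_clean = rng.normal(scale=0.5, size=100)
e_dirty = e_clean.copy()
e_dirty[:5] = 10.0                 # five gross outliers (5% contamination)

# The MSR explodes under contamination; the RMSR hardly moves:
msr_ratio = msr(e_dirty) / msr(e_clean)
rmsr_ratio = rmsr(e_dirty) / rmsr(e_clean)
```

Because each squared outlier enters the mean with full weight but can move the median by at most a few order statistics, the RMSR-based summaries remain stable exactly where the MSR-based ones break down.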

The RBRM procedure commences with estimating the robust regression parameters using the LTS method, which trims some of the extreme values from both sides. This means that some observations labeled as outliers are deleted, so the OLS estimate β̂ will be either larger or smaller than the LTS estimate β̂_LTS. In Step 2, outliers might be present and can become candidates for selection in Step 3. Since we consider sampling with replacement, each outlier might be chosen more than once. Consequently, there is a possibility that a bootstrap sample may contain more outliers than the original sample. We try to overcome this problem by determining the alpha value based on the percentage of outliers in the bootstrap resamples detected in Step 3. In this respect, we developed a dynamic detection subroutine that can detect the proportion of outliers in each bootstrap resample. Step 4 of the RBRM includes the computation of the bootstrap responses using the LTS, based on the first three steps. The LTS is expected to be more reliable than the OLS when outliers are present in the data, because it is a robust method that is not sensitive to outliers. As mentioned earlier, the number of observations that should be trimmed in the LTS procedure depends on the alpha value that corresponds to the percentage of outliers detected. In this way, the effect of outliers is reduced. According to Riadh et al.[1], the best model among several candidates is the one with the smallest location and scale estimates, or the minimum scale estimate.
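The selection rule just stated reduces to a one-line comparison once each candidate model has its bootstrap scale estimate. The model names and values below are hypothetical, purely to show the rule.

```python
# Model selection rule: among candidate models, choose the one with the
# minimum bootstrap scale estimate. Names and values are hypothetical.
def select_model(scale_by_model):
    """Return the model whose bootstrap scale estimate is smallest."""
    return min(scale_by_model, key=scale_by_model.get)

scales = {"Model 1": 0.12, "Model 2": 0.45, "Model 3": 0.30}
best = select_model(scales)  # -> "Model 1"
```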

RESULTS

Several well-known data sets in robust regression are presented to compare the performance of the CBRM and RBRM procedures. Comparisons between the estimators are based on their bootstrap location and scale estimates. We have run many examples; due to space constraints, we include only three real examples and one simulated data set. The conclusions of the other results are consistent and are not presented. All computations were done using S-Plus® 6.2 Professional Edition for Windows.

Hawkins, Bradu and Kass Data: Hawkins et al.[8] constructed an artificial three-predictor data set containing 75 observations, with 10 outliers in both the X- and Y-spaces (cases 1-10), 4 outliers in the X-space (cases 11-14) and 61 low-leverage inliers (cases 15-75). Most single-case deletion identification methods fail to identify the outliers in the Y-space, though some of them point out cases 11-14 as outliers in the Y-space.

We consider four models:


Table 1: CBRM results of Hawkins data

Table 2: RBRM results of Hawkins data

Table 3: CBRM results of Stackloss data

Tables 1 and 2 show the bootstrap location and scale estimates based on the CBRM and RBRM procedures.

Stackloss data[8]: The Stackloss data is a well-known data set presented by Brownlee[9]. The data describe the operation of a plant for the oxidation of ammonia to nitric acid and consist of 21 four-dimensional observations. The stack loss (y) is related to the rate of operation (x1), the cooling water inlet temperature (x2) and the acid concentration (x3). Most robust statistics researchers have concluded that observations 1, 2, 3 and 21 are outliers.

We consider four models:

Tables 3 and 4 show the bootstrap location and scale estimates of the Stackloss data based on the CBRM and RBRM procedures.


Table 4: RBRM results of Stackloss data

Coleman data[8]: This data set, studied by Coleman et al.[10], contains information on 20 schools from the Mid-Atlantic and New England states. Mosteller and Tukey[11] analyzed these data with measurements on five independent variables. Previous studies refer to observations 3, 17 and 18 as outliers[8].

We consider fifteen models as follows:

Tables 5 and 6 show the results of the CBRM and RBRM for the Coleman data.


Table 5: CBRM results of Coleman data

Table 6: RBRM results of Coleman data

Simulation study: A simulation study similar to that of Riadh et al.[1] is presented to assess the performance of the RBRM procedure. Consider the problem of fitting a linear model:

In this study, we simulate a data set by putting:

where ε_i is a random variable with the N(0, 0.04) distribution.


Table 7: CBRM results of simulated data

Table 8: RBRM results of simulated data

Then we contaminated the residuals. At each step, one 'good' residual was deleted and replaced with a contaminated residual. The contaminated residuals were generated from N(10, 9). We consider 5, 10, 15 and 20% contaminated residuals and three models:
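The residual-contamination step described above can be sketched as follows. The function name and the choice of which residuals to replace are illustrative assumptions; the distributions follow the text: good residuals from N(0, 0.04) (standard deviation 0.2) and contaminated ones from N(10, 9) (mean 10, standard deviation 3).

```python
# Sketch of the simulation's contamination scheme.
import numpy as np

def contaminate(residuals, frac, rng):
    """Replace round(frac*n) randomly chosen residuals with N(10, 9) draws."""
    e = residuals.copy()
    k = int(round(frac * len(e)))
    idx = rng.choice(len(e), size=k, replace=False)
    e[idx] = rng.normal(loc=10.0, scale=3.0, size=k)  # sd = sqrt(9) = 3
    return e

rng = np.random.default_rng(0)
good = rng.normal(loc=0.0, scale=0.2, size=100)  # N(0, 0.04)
e_05 = contaminate(good, 0.05, rng)              # 5% contamination
e_20 = contaminate(good, 0.20, rng)              # 20% contamination
```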

Tables 7 and 8 show the results of the CBRM and RBRM procedures. Graphical displays are used to explain why a particular model is selected. We only present the results for Models 1-3 of the simulated data at 5% outliers due to space limitations. The residual plot before the bootstrap procedure is shown in Fig. 1. Figures 2-4 show the box-plots of the bootstrap MSR for Models 1-3, while Fig. 5-7 show the box-plots of the RMSR for Models 1-3.


Fig. 1: Residuals before bootstrap

Fig. 2: The Box-plot for the MSR boot M1

Fig. 3: The Box-plot for the MSR boot M2

DISCUSSION

Let us first focus our attention on the results of the Hawkins data for the CBRM procedure, presented in Table 1. Among the four models considered, the bootstrap location and scale estimates of Model 4 are the smallest.


Fig. 4: The Box-plot for the MSR boot M3

Fig. 5: The Box-plot for the RMSR for M1

Fig. 6: The Box-plot for the RMSR for M2

It is important to note that the median-based scale estimate is smaller than the mean-based scale estimate. This indicates that the scale estimate based on the median is more efficient than the one based on the mean. In this respect, the CBRM suggests that Model 4 is the best model. However, the results of the RBRM procedure in Table 2 signify that Model 1 is the best model. It can be seen that the median-based scale estimate for Model 1 is the smallest among the four models. It is interesting to note that the overall results indicate that the median-based scale estimate of the RBRM procedure is the smallest. Thus, the median-based RBRM has increased the efficiency of the estimates.

It can be observed from Tables 3 and 4 of the Stackloss data that the scale estimates based on the median are more efficient than those based on the mean for both the CBRM and RBRM procedures. Similarly, the median-based RBRM has the smallest scale estimates. The CBRM indicates that Model 2 is the best model, while the RBRM suggests that Model 1 is the best model. Nonetheless, model selection based on the median-based RBRM is more efficient and more reliable, as indicated by its location and scale estimates, which are the smallest among the models considered.

Table 5 of the Coleman data reveals that Model 14 of the CBRM is the best model, as evidenced by the smallest values of the location and scale estimates. In fact, the median-based location and scale estimates of Model 14 are smaller than the mean-based ones. The results of the RBRM in Table 6 signify that Model 1's location and scale estimates are the smallest among the 15 models considered. For this model, the median-based RBRM is more efficient than the mean-based RBRM, as indicated by its location and scale estimates, which are smaller than those of the mean-based RBRM.

The results of the simulated data in Table 7 show that Model 2 is the best model at all outlier percentage levels, because the scale estimate of Model 2 is the smallest compared to the other models. Nonetheless, the RBRM results of Table 8 suggest that Model 1 is the best model.

Similar to the Hawkins, Stackloss and Coleman data, the median-based RBRM is more efficient than the mean-based RBRM. In fact, the scale estimates of the median-based RBRM are remarkably smaller than those of the mean-based RBRM at all outlier percentage levels. The results of the simulation study indicate that the median-based RBRM is the more efficient and reliable procedure.

Here, we would like to explain further why Model 2 is selected by the CBRM, considering only the 5% outlier case due to space constraints. Figure 1 shows clearly that there are 5% outliers in the residuals before the bootstrap is employed. It can be seen from Fig. 2-4 that the number of outliers of the MSR in Fig. 2 is 18, while there are only 14 in Fig. 3.


Fig. 7: The Box-plot for the RMSR for M3

The median of the MSR in Fig. 4 is very large compared to those of Fig. 2 and 3. Among the three models in Fig. 2-4, the CBRM chooses Model 2 as the best model because the proportion of outliers of the MSR is smaller than in the other two models.

On the other hand, the RBRM selects Model 1 as the best model. By comparing Fig. 5-7 with Fig. 2-4, it can be seen that there are no outliers in the distribution of the median squared residuals when the RBRM method is employed, while apparent outliers are seen in the distribution of the mean squared residuals when the CBRM is employed. In this situation, the RBRM has an attractive feature. Among the three models considered, the RMSR bootstrap resample of Model 1 is the most efficient, as it is more compact in the central region compared to the other two models. Thus, Model 1 is recommended, as its RMSR is more consistent and more efficient.

CONCLUSION

In this study, we propose a new robust bootstrap method for model selection. The proposed bootstrap method attempts to overcome the problem of having more outliers in the bootstrap samples than in the original data set. The RBRM procedure employs a dynamic subroutine program that is capable of detecting the percentage of outliers in the data. The results indicate that the RBRM consistently outperformed the CBRM procedure. The best model selected always corresponds to the median-based RBRM with the smallest bootstrap scale estimate. Hence, utilizing the median-based RBRM in model selection can substantially improve the accuracy and efficiency of the estimates. Thus, the median-based RBRM is more reliable for linear regression model selection.
