Abstract: In this study, we calculate confidence intervals for the mean of normal data and of contaminated normal data. Several estimators that are robust against outliers are also used to construct confidence intervals that are more resistant to outliers than the Student t confidence interval. The resulting confidence intervals are computed and compared for normal and contaminated normal data. Their performance is evaluated by simulation in terms of the estimated coverage probability, the average width and the standard error. The Sps t interval, followed by the MAD t interval, is recommended at any rate of contamination. The Student t interval, built from the sample mean and the sample standard deviation, is not recommended for contaminated data but, as expected, is highly recommended for normal data without outliers.
INTRODUCTION
The usual assumptions behind the Student t confidence interval are that the data are normally (or approximately normally) distributed and that there is no major contamination due to outliers. Under these assumptions, the sample mean and the sample standard deviation are often used to construct this confidence interval.
However, these assumptions may not hold in many real-world problems. In particular, there are many situations where we have evidence that the underlying distribution is normal with some outliers that might affect the confidence interval or its coverage probability. These outliers may have a strong influence on the Student t confidence interval in the sense that they inflate its width and alter its coverage probability.
The literature shows that the sample median paired with the inter-quartile range, with the median absolute deviation or with Gini's mean difference is more resistant to departures from normality and to the presence of outliers. In this study, we incorporate this observation into constructing some interval estimators for the mean of the normal distribution with contaminated data. The sample median (MD) is used to estimate the parameter μ, whereas the population standard deviation σ is estimated by three robust measures of scale, namely the Inter-Quartile Range (IQR), Gini's mean difference (G) and the median absolute deviation from the sample median (MAD).
Park and Cho (2003) proposed a robust design approach for improving industrial production. They showed that the sample mean and variance are useful estimates under normality without contamination, whereas the sample median and MAD or the sample median and the IQR are more useful under a contaminated normal distribution.
Adrover et al. (2004) defined globally robust confidence intervals for the location parameter, among other things, which take into consideration a wide class of contaminated distributions. They constructed intervals that are stable, in the sense of achieving coverage near the nominal level, and informative, in the sense of having short widths, by taking into account the potential bias of the estimates. Our results show that the proposed Sps t and MAD t confidence intervals satisfy these two conditions of globally robust confidence intervals under both normal and contaminated normal data, whereas the Student t interval fails to satisfy them, a result that is supported by the above-mentioned reference.
Kibria (2006) considered some interval estimators such as the Student t, Johnson t, Median t and Mean Absolute Deviation (MAD) t intervals for estimating the mean of an asymmetric distribution, in an effort to find a robust confidence interval, but did not try to find a robust confidence interval for contaminated normal data. We therefore think it is important to try to obtain confidence intervals that are resistant to outliers.
Baklizi (2007, 2008) considered various modified procedures based on t confidence intervals as well as the approach based on empirical likelihood for the mean or difference of means of some skewed distributions. The performance was based on coverage probabilities and widths of those intervals. He found that intervals based on Bartlett corrected empirical likelihood and empirical likelihood procedures are superior for skewed heavy-tailed distributions.
Several other alternative approaches to confidence intervals have been proposed in the literature, among them Bloch and Gastwirth (1968), Guenther (1969), Gross (1976), Johnson (1978), Kafadar (1982), Horn (1983), Kleijnen et al. (1986), Hettmansperger and McKean (1998), Meeden (1999) and Willink (2005).
The objective of the study is to observe the performance of confidence intervals based on the mean and standard deviation when the underlying data are contaminated. We investigate whether the confidence intervals based on the sample median and inter-quartile range, the sample median and MAD, and the sample median and Gini's mean difference are resistant to outliers, and we compare the performance of these confidence intervals. The investigation is carried out by simulation to determine the coverage probability, the average width and the standard error of each confidence interval method under the normal assumption with and without contaminated data. We then select the confidence intervals that are more resistant to the presence of outliers, that is, those that maintain a coverage probability close to the desired nominal confidence coefficient (1-α) with a good average width and a small standard error.
SOME ROBUST ESTIMATORS
Here, we introduce several robust estimators against outliers that are used in this study for constructing the confidence interval for μ when σ is unknown.
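The sample median (MD): The sample median for a random sample of n observations X1, X2, ..., Xn with order statistics X(1) ≤ X(2) ≤ ... ≤ X(n) is defined as follows:
$$
MD = \begin{cases} X_{((n+1)/2)}, & n \text{ odd} \\ \dfrac{X_{(n/2)} + X_{(n/2+1)}}{2}, & n \text{ even} \end{cases} \tag{1}
$$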
The sample median is best known for being insensitive to outliers. Under the normal distribution, the efficiency of the sample median drops off rapidly towards its asymptotic value of 0.64 as the sample size increases. The sample median has a maximal 50% breakdown point (Rousseeuw and Croux, 1993). On the other hand, the sample median is difficult to handle in mathematical equations, does not use all available values and can be misleading in distributions with a long tail because it discards so much information (Betteley et al., 1994; Francis, 1995). Even so, the sample median has emerged as a good estimator and is generally considered an alternative to the sample mean, especially when outliers are present in the data. For a normal distribution with mean μ and standard deviation σ, the standard error of the sample median is given, for large n, approximately by $\sigma\sqrt{\pi/(2n)} \approx 1.2533\,\sigma/\sqrt{n}$.
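The pseudo-standard deviation (Sps): The pseudo-standard deviation Sps based on the IQR can be written as:
$$
S_{ps} = \frac{IQR}{1.35} \tag{2}
$$
where IQR = Q3 - Q1 is the difference between the third and first sample quartiles.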
Under the normal distribution with mean μ and standard deviation σ, the scale estimator Sps is an unbiased estimator of σ. It has a breakdown point of 25%, but an efficiency of only 0.37 (Staudte and Sheather, 1990).
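As an illustration only, the pseudo-standard deviation can be sketched in Python as follows. The function name `pseudo_sd`, the default constant 1.35 (the IQR of a standard normal is about 1.349) and the quartile convention (`method="inclusive"`) are assumptions of this sketch; the study does not state which quartile rule was used.

```python
import statistics

def pseudo_sd(sample, normal_iqr=1.35):
    """Pseudo-standard deviation: sample IQR divided by the IQR of N(0, 1).

    The quartile rule below is an assumption; other conventions give
    slightly different values for small samples.
    """
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    return (q3 - q1) / normal_iqr

# Hypothetical data: nine well-behaved values and one gross outlier.
data = [2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2, 3.3, 3.6, 9.8]
print("pseudo-sd:", pseudo_sd(data))
print("sample sd:", statistics.stdev(data))
```

On data like these, the single outlier inflates the sample standard deviation much more than it inflates the IQR-based estimate.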
The Downton estimator (σ*): Downton (1966) introduced a family of estimators based on ordered sample values. Among this family of estimators, Downton proposed σ* as an estimator for the standard deviation σ of a normal population. Let X1, X2, ..., Xn be a random sample from a normal distribution with mean μ and variance σ², and let X(1) ≤ X(2) ≤ ... ≤ X(n) denote the corresponding order statistics. Downton's estimator (σ*) is given by:
$$
\sigma^{*} = \frac{2\sqrt{\pi}}{n(n-1)} \sum_{i=1}^{n} \left(i - \frac{n+1}{2}\right) X_{(i)} \tag{3}
$$
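The Downton estimator has also been studied by David (1968), who showed that this estimator is equivalent to Gini's mean difference, which is a robust estimator of the standard deviation σ (Kendall and Stuart, 1958). Therefore, the Downton estimator can be written using the Gini mean difference, G, as:
$$
\sigma^{*} = \frac{\sqrt{\pi}}{2}\, G \tag{4}
$$
Where:
$$
G = \frac{2}{n(n-1)} \sum_{i<j} \left| X_i - X_j \right| = \frac{2}{n(n-1)} \sum_{i=1}^{n} (2i - n - 1)\, X_{(i)} \tag{5}
$$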
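The equivalence between the order-statistics form of the Downton estimator and its expression through Gini's mean difference can be checked numerically. The following is a minimal sketch, assuming the standard definitions (G as the average absolute difference over all pairs and σ* = (√π/2)G); the function names are hypothetical.

```python
import math
from itertools import combinations

def gini_mean_difference(sample):
    """Average absolute difference over all distinct pairs of observations."""
    n = len(sample)
    total = sum(abs(a - b) for a, b in combinations(sample, 2))
    return 2.0 * total / (n * (n - 1))

def downton_sigma(sample):
    """Downton estimator written as a linear function of the order statistics."""
    n = len(sample)
    ordered = sorted(sample)
    weighted = sum((i - (n + 1) / 2.0) * x
                   for i, x in enumerate(ordered, start=1))
    return 2.0 * math.sqrt(math.pi) * weighted / (n * (n - 1))

# Hypothetical data used only to compare the two forms.
data = [2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2, 3.3, 3.6, 9.8]
via_gini = math.sqrt(math.pi) / 2.0 * gini_mean_difference(data)
print(downton_sigma(data), via_gini)  # the two values agree up to rounding
```

The pairwise form costs O(n²) operations, while the order-statistics form needs only a sort, which matters when the estimator is evaluated many thousands of times in a simulation.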
Nair (1936) found that for a normal distribution
The median absolute deviation from the sample median (MAD): For a random sample X1, X2, ..., Xn with sample median MD, the median absolute deviation from the sample median is defined as follows:
$$
MAD = 1.4826 \; \mathrm{Median}\{\, |X_i - MD| \,\}, \quad i = 1, 2, \ldots, n \tag{6}
$$
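The definition in Eq. 6 translates directly into code. The sketch below is illustrative only; the function name `mad` and the example data are hypothetical, and the finite-sample correction factor bn is deliberately not applied.

```python
import statistics

def mad(sample, scale=1.4826):
    """Scaled median absolute deviation from the sample median (Eq. 6)."""
    md = statistics.median(sample)
    return scale * statistics.median(abs(x - md) for x in sample)

# Hypothetical data: the value 100 is a gross outlier.
data = [1, 2, 3, 4, 100]
print("MAD:", mad(data))
print("sample sd:", statistics.stdev(data))
```

The constant 1.4826 is the reciprocal of the 0.75 quantile of the standard normal distribution, which makes the MAD consistent for σ under normality.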
The median absolute deviation from the sample median is a more robust scale estimator than the sample standard deviation; it measures the deviation of the data from the median. It was first proposed by Hampel (1974), who attributed it to Gauss. It is often used as an initial value for the computation of more efficient robust estimators. The statistic bnMAD is an approximately unbiased estimator of σ, where bn is a correction factor needed to make bnMAD unbiased when X1, X2, ..., Xn are normally distributed (Rousseeuw and Croux, 1993). This correction factor is given for n≤9 by:
|
and when n>9 then:
THE PROPOSED CONFIDENCE INTERVALS
Here, we introduce some modified confidence intervals for μ when σ is unknown. The classical Student t confidence interval is also considered for comparison.
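The Student t confidence interval: Let X1, X2, ..., Xn be a random sample from a normal population with mean μ and standard deviation σ. The sample mean $\bar{X}$ is normally distributed with mean μ and standard deviation $\sigma/\sqrt{n}$. When σ is unknown and is estimated by the sample standard deviation S, the statistic
$$
t = \frac{\bar{X} - \mu}{S/\sqrt{n}}
$$
given by Student (1908) follows the t-distribution with (n-1) degrees of freedom, which converges to the standard normal distribution as n increases, and the (1-α)100% confidence interval for μ is
$$
\bar{X} \pm t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}} \tag{7}
$$
where $t_{\alpha/2,\,n-1}$ is the upper α/2 percentage point of the Student t-distribution with (n-1) degrees of freedom, i.e., $P(t_{n-1} > t_{\alpha/2,\,n-1}) = \alpha/2$.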
The Sps t confidence interval: This interval is a modification of the Student t confidence interval based on the sample median, MD, as an estimate for μ and the pseudo-standard deviation, Sps, as an estimate for σ. Therefore we define the Sps t confidence interval for μ as:
$$
MD \pm t_{\alpha/2,\,n-1} \frac{S_{ps}}{\sqrt{n}} \tag{8}
$$
The Downton t confidence interval: The Downton t confidence interval for μ is given as:
$$
MD \pm t_{\alpha/2,\,n-1} \frac{\sigma^{*}}{\sqrt{n}} \tag{9}
$$
This confidence interval is based on the sample median, MD, as an estimate for μ and the Downton estimator (σ*), based on Gini's mean difference (G), as an estimate for the standard deviation σ.
The MAD t confidence interval: The MAD t confidence interval for μ is given as:
$$
MD \pm t_{\alpha/2,\,n-1} \frac{MAD}{\sqrt{n}} \tag{10}
$$
This confidence interval is based on the sample median, MD, as an estimate for μ and the median absolute deviation from the sample median, MAD, as an estimate for σ.
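To make the construction concrete, the sketch below computes a Student t interval and a MAD t interval on the same sample. It assumes that each modified interval keeps the Student t form, center ± t·scale/√n, with the sample median as center and a robust scale estimate in place of S, as the descriptions above suggest. The function names, the example data and the tabulated critical value t(0.025, 9) ≈ 2.262 are assumptions of this illustration.

```python
import math
import statistics

def t_type_interval(center, scale, n, t_crit):
    """Generic interval: center +/- t_crit * scale / sqrt(n)."""
    half_width = t_crit * scale / math.sqrt(n)
    return center - half_width, center + half_width

def student_t_interval(sample, t_crit):
    """Classical interval: sample mean and sample standard deviation."""
    return t_type_interval(statistics.mean(sample),
                           statistics.stdev(sample), len(sample), t_crit)

def mad_t_interval(sample, t_crit):
    """Assumed MAD t form: sample median and scaled MAD."""
    md = statistics.median(sample)
    mad = 1.4826 * statistics.median(abs(x - md) for x in sample)
    return t_type_interval(md, mad, len(sample), t_crit)

# Hypothetical sample of size 10 with one gross outlier (9.8).
data = [2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2, 3.3, 3.6, 9.8]
T_CRIT = 2.262  # tabulated upper 2.5% point of t with 9 degrees of freedom

print("Student t:", student_t_interval(data, T_CRIT))
print("MAD t:    ", mad_t_interval(data, T_CRIT))
```

The Sps t and Downton t intervals would follow the same pattern by passing the corresponding scale estimate to `t_type_interval`. The critical value is supplied by the caller so that the sketch needs only the standard library.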
RESULTS
Here, we use a simulation study to compare the behavior of the proposed confidence intervals under the normal distribution with and without outliers and to examine how the presence of outliers affects them. FORTRAN programs were used to run the simulation and to produce the necessary tables. We generated 10000 random samples of sizes n = 10, 15, 20, 30, 40, 50 and 100 from Uniform (0, 1) and then used them to generate random samples from the normal distribution with and without contaminated data by considering the following two situations:
• Uncontaminated distribution, where all samples are generated from the standard normal distribution, i.e., N(0, 1)
• Contaminated distributions, where outliers are introduced in the data in two different combinations as follows:
  • C10N3: A situation where 90% of the observations come from N(0, 1) and 10% from N(0, 9)
  • C20N3: A situation where 80% of the observations come from N(0, 1) and 20% from N(0, 9)
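The simulation design above can be sketched as follows. This is not the FORTRAN program used in the study; it is a small Python illustration under stated assumptions: each observation is drawn from N(0, 9) independently with probability eps (a random mixture rather than a fixed count of outliers), N(0, 9) is read as variance 9, only the Student t interval is evaluated, and the critical value is passed in by the caller. All function names are hypothetical.

```python
import math
import random
import statistics

def contaminated_sample(n, eps, rng, outlier_sd=3.0):
    """Each value is N(0, outlier_sd^2) with probability eps, else N(0, 1)."""
    return [rng.gauss(0.0, outlier_sd if rng.random() < eps else 1.0)
            for _ in range(n)]

def student_t_interval(sample, t_crit):
    """Sample mean +/- t_crit * S / sqrt(n)."""
    half = t_crit * statistics.stdev(sample) / math.sqrt(len(sample))
    mean = statistics.mean(sample)
    return mean - half, mean + half

def simulate(n, eps, t_crit, reps=2000, seed=1):
    """Estimate coverage of the true mean (0) and the average interval width."""
    rng = random.Random(seed)
    hits, total_width = 0, 0.0
    for _ in range(reps):
        lo, hi = student_t_interval(contaminated_sample(n, eps, rng), t_crit)
        hits += lo <= 0.0 <= hi
        total_width += hi - lo
    return hits / reps, total_width / reps

T_CRIT = 2.262  # tabulated t(0.025, 9), matching n = 10
for label, eps in [("N(0,1)", 0.0), ("C10N3", 0.1), ("C20N3", 0.2)]:
    coverage, width = simulate(10, eps, T_CRIT)
    print(label, round(coverage, 3), round(width, 3))
```

Replacing `student_t_interval` with a median-based interval allows the same loop to compare methods on identical samples when the same seed is used.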
The Simulated results for coverage probability
DISCUSSION
The performance (relative efficiency) of the proposed methods for the normal distribution without outliers is examined first. The estimated coverage probabilities, the average widths and the standard errors for all confidence interval methods are displayed in Table 1. The results in Table 1 suggest that the proposed methods have coverage probabilities close to the nominal confidence coefficient when sampling from a normal distribution, as expected. Also as expected, the Student t confidence interval turned out to be the best under a normal distribution without contaminated data. Table 1 also shows that the average widths of the Sps t, Downton t and MAD t confidence intervals are larger than the average width of the Student t confidence interval under the normal distribution without outliers. This makes the Student t interval the better method for normal data with no contamination.
Tables 2 and 3 give the estimated coverage probabilities, the average widths and the standard errors for all confidence interval methods under a normal distribution with 10 and 20% contamination, respectively.
Table 1: Coverage probability, average width and standard error for the standard normal distribution
Table 2: Coverage probability, average width and standard error for the 10% contaminated normal distribution
Table 3: Coverage probability, average width and standard error for the 20% contaminated normal distribution
The results in Tables 2 and 3 suggest that the Sps t and MAD t confidence intervals are more resistant to contaminated data than the other confidence intervals, with Sps t performing best. Notice also that the outliers greatly changed the coverage probabilities and increased the widths of the Student t confidence interval. For all sample sizes and both contaminated normal distributions, the Sps t and MAD t intervals had good coverage probabilities with moderate average interval widths, and Sps t was the better of the two. The Student t confidence interval for the mean given in many textbooks does not behave properly for contaminated data. At any rate of contamination, we therefore suggest using the Sps t confidence interval, followed by the MAD t confidence interval, when the population distribution is normal with outliers. The Student t interval is not preferred at all for contaminated data, because the sample mean and the sample standard deviation are not good choices for constructing the confidence interval. On the other hand, when the population distribution is normal without outliers, we suggest using the Student t confidence interval, as the theory says.
Adrover et al. (2004) illustrated the performance of the globally robust confidence intervals by a small Monte Carlo simulation. Note that, as expected, the coverage probabilities and widths achieved by our proposed Sps t and MAD t confidence intervals are much better than those of their globally robust intervals. This can be explained by observing that the globally robust intervals take into account a wide class of contaminated distributions.
Shi and Kibria (2007) proposed alternative confidence intervals, Median t and MAD t, which are adjustments to the Student t interval. Their performance was compared according to coverage probabilities, widths and the ratio of coverage probabilities to widths. They concluded that Median t performs best in the sense of higher coverage probabilities, whereas MAD t performs best in the sense of smaller width for simulated data from a Gamma distribution. Their coverage probabilities for MAD t are low, but it is the best with respect to interval width. Note also that their formulas for the calculation of MAD t differ from ours, as does the distribution.
Baklizi and Kibria (2009) considered some confidence intervals for the mean or difference of means of the Gamma distribution by extending the Median t interval to the two-sample problem, which differs from our problem. They used bootstrap techniques to compare the performance of these procedures based on the coverage probabilities and widths of those intervals. They concluded that Median t and bootstrapped one-sample Median t have the closest coverage probabilities to the nominal level. Note also that their formulas for the calculation of MAD t differ from ours, as does the distribution.
ACKNOWLEDGMENTS
The authors would like to thank the Hashemite University for its cooperation during the preparation of this paper. The authors also wish to thank the managing and associate editors and the referees for their helpful comments and suggestions, which have improved the presentation of the study.