
Journal of Applied Sciences

Year: 2009 | Volume: 9 | Issue: 15 | Page No.: 2835-2840
DOI: 10.3923/jas.2009.2835.2840
Confidence Interval for the Mean of a Contaminated Normal Distribution
M.O. Abu-Shawiesh, F.M. Al-Athari and H.F. Kittani

Abstract: In this study, we calculate confidence intervals for the mean of normal data and of contaminated normal data. Several estimators that are robust against outliers are used to construct confidence intervals that are more resistant to outliers than the Student t confidence interval. The resulting intervals are compared with each other for normal and contaminated normal data to determine which is better. Their performance is evaluated by simulation in terms of the estimated coverage probability, the average width and the standard error. The Sps t interval, followed by the MAD t interval, is recommended at any rate of contamination. The Student t interval, built on the sample mean and the sample standard deviation, is not recommended for contaminated data but, as expected, is highly recommended for normal data without outliers.


How to cite this article
M.O. Abu-Shawiesh, F.M. Al-Athari and H.F. Kittani, 2009. Confidence Interval for the Mean of a Contaminated Normal Distribution. Journal of Applied Sciences, 9: 2835-2840.

Keywords: Median absolute deviation, confidence interval, outlier, coverage probability, contaminated normal and robust estimator

INTRODUCTION

The usual assumptions behind the Student t confidence interval are that the data are normally (or approximately normally) distributed and that there is no major contamination due to outliers. Under these assumptions, the sample mean and the sample standard deviation are used to construct the confidence interval.

However, these assumptions may not hold in many real-world problems. In particular, there are many situations where we have evidence that the underlying distribution is normal with some outliers that might affect the confidence interval or its coverage probability. These outliers may have a strong influence on the Student t confidence interval: they shift the interval in their direction, inflate its width and alter the coverage probability.

The literature shows that the sample median paired with the inter-quartile range, the median absolute deviation or Gini’s mean difference is more resistant to departures from normality and to the presence of outliers than the sample mean paired with the sample standard deviation. In this study, we incorporate this observation into constructing some interval estimators for the mean of the normal distribution with contaminated data. The sample median (MD) is used to estimate the parameter μ, whereas the population standard deviation σ is estimated by three robust measures of scale: the Inter-Quartile Range (IQR), Gini’s mean difference (G) and the median absolute deviation from the sample median (MAD).

Park and Cho (2003) proposed a robust design approach for improving industrial production. They showed that the sample mean and variance are useful estimates under normality without contamination, whereas the sample median and MAD or the sample median and the IQR are more useful under a contaminated normal distribution.

Adrover et al. (2004) defined, among other things, globally robust confidence intervals for the location parameter, which take into consideration a wide range of contaminated distributions. They constructed intervals that are stable, in the sense of achieving coverage near the nominal level, and informative, in the sense of having short widths, by taking into account the potential bias of the estimates. Our results show that the proposed Sps t and MAD t confidence intervals satisfy these two conditions of globally robust confidence intervals under the normal and contaminated normal distributions, whereas the Student t interval fails to satisfy them. This result is supported by the above-mentioned reference.

Kibria (2006) considered some interval estimators, such as the Student t, Johnson t, Median t and Mean Absolute Deviation (MAD) t intervals, for estimating the mean of an asymmetric distribution in an effort to find a robust confidence interval. Kibria did not, however, try to find a robust confidence interval for contaminated normal data. We therefore think it is important to try to obtain confidence intervals that are resistant to outliers.

Baklizi (2007, 2008) considered various modified procedures based on t confidence intervals as well as the approach based on empirical likelihood for the mean or difference of means of some skewed distributions. The performance was based on coverage probabilities and widths of those intervals. He found that intervals based on Bartlett corrected empirical likelihood and empirical likelihood procedures are superior for skewed heavy-tailed distributions.

Several other approaches to constructing confidence intervals have been proposed in the literature, among them Bloch and Gastwirth (1968), Guenther (1969), Gross (1976), Johnson (1978), Kafadar (1982), Horn (1983), Kleijnen et al. (1986), Hettmansperger and McKean (1998), Meeden (1999) and Willink (2005).

The objective of this study is to examine the performance of confidence intervals based on the sample mean and standard deviation when the underlying data are contaminated. We investigate whether the confidence intervals based on the sample median and inter-quartile range, the sample median and MAD, and the sample median and Gini’s mean difference are resistant to outliers, and we compare their performance. The investigation is carried out by simulation to determine the coverage probability, the average width and the standard error of each confidence interval method under the normal assumption with and without contaminated data. We then select the confidence intervals that are more resistant to the presence of outliers or that maintain a coverage probability close to the desired nominal confidence coefficient (1-α) with good average width and small standard error.

SOME ROBUST ESTIMATORS

Here, we introduce several robust estimators against outliers that are used in this study for constructing the confidence interval for μ when σ is unknown.

The sample median (MD): The sample median for a random sample of n observations X1, X2, … , Xn is defined as follows:

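$$\mathrm{MD} = \begin{cases} X_{((n+1)/2)}, & n \text{ odd} \\ \tfrac{1}{2}\left(X_{(n/2)} + X_{(n/2+1)}\right), & n \text{ even} \end{cases} \tag{1}$$

where $X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$ are the order statistics of the sample.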

The sample median is best known for being insensitive to outliers. Under the normal distribution, the efficiency of the sample median drops off rapidly towards its asymptotic value of 0.64 as the sample size increases. The sample median has the maximal breakdown point of 50% (Rousseeuw and Croux, 1993). On the other hand, it is difficult to handle in mathematical equations, does not use all available values and can be misleading in distributions with a long tail because it discards so much information (Betteley et al., 1994; Francis, 1995). Even so, the sample median has emerged as a good estimator and is generally considered an alternative to the sample mean, especially when outliers are present in the data. For a normal distribution with mean μ and standard deviation σ, the standard error of the sample median is approximately $\sqrt{\pi/2}\,\sigma/\sqrt{n} \approx 1.2533\,\sigma/\sqrt{n}$.
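The definition of the sample median and its large-sample standard error under normality, sqrt(π/2)·σ/sqrt(n), can be sketched in a few lines of Python. The function names and the toy data are illustrative assumptions, not part of the original study.

```python
import math

def sample_median(xs):
    """Middle order statistic (n odd) or mean of the two middle ones (n even)."""
    s = sorted(xs)
    n = len(s)
    if n == 0:
        raise ValueError("sample must be non-empty")
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return 0.5 * (s[mid - 1] + s[mid])

def median_standard_error(sigma, n):
    """Asymptotic standard error of the median for normal data."""
    return math.sqrt(math.pi / 2.0) * sigma / math.sqrt(n)

data = [2.1, 1.9, 2.4, 2.0, 15.0]        # hypothetical sample with one gross outlier
print("median:", sample_median(data))    # stays near the bulk of the data
print("mean:  ", sum(data) / len(data))  # pulled towards the outlier
```

The ratio of the squared standard errors of the mean and the median, 2/π ≈ 0.64, is the asymptotic efficiency quoted above.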

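The pseudo-standard deviation (Sps): The pseudo-standard deviation Sps based on the IQR can be written as:

$$S_{ps} = \frac{\mathrm{IQR}}{1.349} \tag{2}$$

where IQR = Q3 - Q1 is the sample inter-quartile range and 1.349 is, to three decimals, the inter-quartile range of the standard normal distribution.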

Under the normal distribution with mean μ and standard deviation σ, the scale estimator Sps is an unbiased estimator of σ. It has a breakdown point of 25%, but an efficiency of only 0.37 (Staudte and Sheather, 1990).
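As a minimal sketch, the pseudo-standard deviation can be computed by dividing the sample IQR by the IQR of the standard normal distribution (about 1.349). Sample quartiles have several competing definitions; the "inclusive" method of Python's `statistics.quantiles` used here is an assumption, not necessarily the convention used in the study.

```python
import statistics

# IQR of N(0, 1): 2 * Phi^{-1}(0.75), approximately 1.349
NORMAL_IQR = 2.0 * statistics.NormalDist().inv_cdf(0.75)

def pseudo_std(xs):
    """IQR-based scale estimate, rescaled to estimate sigma under normality."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return (q3 - q1) / NORMAL_IQR

clean = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]
dirty = clean[:-1] + [12.0]   # hypothetical data: last value replaced by an outlier
print(pseudo_std(clean), statistics.stdev(clean))
print(pseudo_std(dirty), statistics.stdev(dirty))  # stdev reacts far more strongly
```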

The Downton estimator (σ*): Downton (1966) introduced a family of estimators based on ordered sample values. Among this family of estimators, Downton proposed σ* as an estimator for the standard deviation σ of a normal population. Let X1, X2, … , Xn be a random sample from a normal distribution with mean μ and variance σ². Let X(1) ≤ X(2) ≤ … ≤ X(n) denote the corresponding order statistics. Downton’s estimator (σ*) is given by:

$$\sigma^{*} = \frac{2\sqrt{\pi}}{n(n-1)} \sum_{i=1}^{n} \left(i - \frac{n+1}{2}\right) X_{(i)} \tag{3}$$

The Downton estimator was also studied by David (1968), who showed that it is equivalent to Gini’s mean difference, a robust estimator of the standard deviation σ (Kendall and Stuart, 1958). Therefore, the Downton estimator can be written using the Gini mean difference, G, as:

$$\sigma^{*} = \frac{\sqrt{\pi}}{2}\, G \tag{4}$$

Where:

$$G = \frac{2}{n(n-1)} \sum_{i<j} \left|X_i - X_j\right| \tag{5}$$

Nair (1936) found that, for a normal distribution, $(\sqrt{\pi}/2)\,G$ may be used as an unbiased estimator of σ. The Downton estimator has been recommended as a robust scale estimator by Iglewicz (1983). Barnett et al. (1967) studied Downton’s estimator and obtained its first four moments in closed form. Inspection of the tables of coefficients of the best linear unbiased estimator of σ for n ≤ 20 makes it clear that, for n > 3, the σ* estimator also places less weight on the extremes than does the best linear unbiased estimator. This gives a little extra protection against outliers (Sarhan and Greenberg, 1962).
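The equivalence between the order-statistic form of Downton’s estimator and the Gini mean difference form, σ* = (sqrt(π)/2)·G, can be checked numerically. The sketch below assumes the standard definition of Gini’s mean difference as the average absolute difference over all pairs; the function names are illustrative.

```python
import math
from itertools import combinations

def gini_mean_difference(xs):
    """Average of |Xi - Xj| over all n(n-1)/2 distinct pairs."""
    n = len(xs)
    total = sum(abs(a - b) for a, b in combinations(xs, 2))
    return 2.0 * total / (n * (n - 1))

def downton_from_order_stats(xs):
    """Linear combination of order statistics with weights i - (n + 1)/2."""
    s = sorted(xs)
    n = len(s)
    weighted = sum((i - (n + 1) / 2.0) * x for i, x in enumerate(s, start=1))
    return 2.0 * math.sqrt(math.pi) * weighted / (n * (n - 1))

def downton_from_gini(xs):
    """Same estimator written through Gini's mean difference."""
    return math.sqrt(math.pi) / 2.0 * gini_mean_difference(xs)

xs = [3.0, 1.0, 4.0, 1.5, 9.0]   # hypothetical sample
print(downton_from_order_stats(xs), downton_from_gini(xs))  # the two forms agree
```

The pairwise form costs O(n²) operations, while the order-statistic form needs only a sort, which matters when the estimator is evaluated many thousands of times in a simulation.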

The median absolute deviation from the sample median (MAD): For a random sample X1, X2, …, Xn with a sample median (MD), the median absolute deviation from the sample median is defined as follows:

$$\mathrm{MAD} = 1.4826\,\operatorname{Median}\{|X_i - \mathrm{MD}|\}, \quad i = 1, 2, \ldots, n \tag{6}$$

The median absolute deviation from the sample median measures the deviation of the data from the median and is a more robust scale estimator than the sample standard deviation. It was first proposed by Hampel (1974), who attributed it to Gauss. It is often used as an initial value for the computation of more efficient robust estimators. The statistic bnMAD is an approximately unbiased estimator of σ, where bn is a correction factor needed to make bnMAD unbiased when X1, X2, …, Xn are normally distributed (Rousseeuw and Croux, 1993). This correction factor is given for n ≤ 9 by:

n:   2      3      4      5      6      7      8      9
bn:  1.196  1.495  1.363  1.206  1.200  1.140  1.129  1.107

and when n > 9 then:

$$b_n = \frac{n}{n - 0.8}$$
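The MAD of Eq. 6 is straightforward to compute. In the sketch below, the finite-sample correction factor is left as a caller-supplied argument `b_n` (a hypothetical parameter name) with a default of 1, so that no particular table of factors is hard-coded.

```python
import statistics

def mad(xs, b_n=1.0):
    """1.4826 * median(|Xi - MD|), optionally multiplied by a correction b_n."""
    md = statistics.median(xs)
    abs_dev = [abs(x - md) for x in xs]
    return b_n * 1.4826 * statistics.median(abs_dev)

data = [1, 2, 3, 4, 100]          # hypothetical sample with one outlier
print(mad(data))                  # driven by the bulk of the data
print(statistics.stdev(data))     # dominated by the outlier
```

The constant 1.4826 is approximately 1/Φ⁻¹(0.75), which makes the MAD consistent for σ under normality.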

THE PROPOSED CONFIDENCE INTERVALS

Here, we will introduce some modified confidence intervals for μ when σ is unknown. Furthermore, the classical Student’s t confidence interval will be considered and compared.

The Student t confidence interval: Let X1, X2, … , Xn be a random sample from a normal population with mean μ and standard deviation σ. The sample mean $\bar{X}$ is normally distributed with mean μ and standard deviation $\sigma/\sqrt{n}$. The Student t-statistic

$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}},$$

where S is the sample standard deviation, was given by Student (1908). It converges in distribution to the standard normal as n increases, so for large n the confidence interval for μ is $\bar{X} \pm z_{\alpha/2}\, S/\sqrt{n}$. When n is small, the confidence interval for μ should be:

$$\bar{X} \pm t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}} \tag{7}$$

where $t_{\alpha/2,\,n-1}$ is the upper α/2 percentage point of the Student t-distribution with (n-1) degrees of freedom, i.e., $P(t_{n-1} > t_{\alpha/2,\,n-1}) = \alpha/2$.
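A minimal sketch of the small-sample Student t interval follows. The Python standard library has no t-distribution quantile function, so the critical value is passed in by the caller; the value 2.262 is the tabulated upper 2.5% point for 9 degrees of freedom (n = 10, 95% level). The function name and data are illustrative.

```python
import math
import statistics

def student_t_interval(xs, t_crit):
    """Mean +/- t_crit * S / sqrt(n), with S the sample standard deviation."""
    n = len(xs)
    center = statistics.mean(xs)
    half_width = t_crit * statistics.stdev(xs) / math.sqrt(n)
    return center - half_width, center + half_width

T_975_DF9 = 2.262   # tabulated t quantile; must match n - 1 and the chosen level

sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]   # hypothetical data
print(student_t_interval(sample, T_975_DF9))
```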

The Sps t confidence interval: This interval is a modification of the Student t confidence interval based on the sample median, MD, as an estimate for μ and the pseudo-standard deviation, Sps as an estimate for σ. Therefore we define the Sps t confidence interval for μ as:

$$\mathrm{MD} \pm t_{\alpha/2,\,n-1}\,\frac{S_{ps}}{\sqrt{n}} \tag{8}$$

The Downton t confidence interval: The Downton t confidence interval for μ is given as:

$$\mathrm{MD} \pm t_{\alpha/2,\,n-1}\,\frac{\sigma^{*}}{\sqrt{n}} \tag{9}$$

This confidence interval is based on the sample median, MD, as an estimate for μ and the Downton estimator (σ*) based on the Gini’s mean difference (G), as an estimate for standard deviation σ.

The MAD t confidence interval: The MAD t confidence interval for μ, is given as:

$$\mathrm{MD} \pm t_{\alpha/2,\,n-1}\,\frac{\mathrm{MAD}}{\sqrt{n}} \tag{10}$$

This confidence interval is based on the sample median, MD, as an estimate for μ and the median absolute deviation from the sample median, MAD, as an estimate for σ.
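The three robust intervals share one pattern: the sample median as the center, and a robust scale estimate in place of S in the Student t half-width. The sketch below assumes that form, the "inclusive" quartile convention, the divisor 1.349 for Sps and a caller-supplied t critical value; all function names are illustrative rather than taken from the authors' FORTRAN programs.

```python
import math
import statistics
from itertools import combinations

def _interval(center, scale, n, t_crit):
    """Generic center +/- t_crit * scale / sqrt(n)."""
    half_width = t_crit * scale / math.sqrt(n)
    return center - half_width, center + half_width

def sps_t_interval(xs, t_crit):
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return _interval(statistics.median(xs), (q3 - q1) / 1.349, len(xs), t_crit)

def downton_t_interval(xs, t_crit):
    n = len(xs)
    gini = 2.0 * sum(abs(a - b) for a, b in combinations(xs, 2)) / (n * (n - 1))
    sigma_star = math.sqrt(math.pi) / 2.0 * gini
    return _interval(statistics.median(xs), sigma_star, n, t_crit)

def mad_t_interval(xs, t_crit):
    md = statistics.median(xs)
    mad = 1.4826 * statistics.median([abs(x - md) for x in xs])
    return _interval(md, mad, len(xs), t_crit)

# Hypothetical sample of size 10 with one outlier; 2.262 is t(0.975, 9 df).
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 12.0]
for fn in (sps_t_interval, downton_t_interval, mad_t_interval):
    print(fn.__name__, fn(sample, t_crit=2.262))
```

On this sample the Downton t interval is visibly wider than the other two, because Gini's mean difference still uses every pairwise distance to the outlier.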

RESULTS

Here, we use a simulation study to compare the behavior of the proposed confidence intervals under the normal distribution with and without outliers and to see how the presence of outliers affects them. FORTRAN programs were used to run the simulation and to produce the necessary tables. We generated 10000 random samples of sizes n = 10, 15, 20, 30, 40, 50 and 100 from Uniform (0, 1) and then used them to generate random samples from the normal distribution with and without contaminated data by considering the following two situations:

Uncontaminated distribution where all samples are generated from the standard normal distribution i.e., N(0, 1)
Contaminated distributions where outliers are introduced in the data in two different combinations as follows:

  C10N3: A situation where 90% of the observations come from N(0, 1) and 10% from N(0, 9)
  C20N3: A situation where 80% of the observations come from N(0, 1) and 20% from N(0, 9)
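The two contamination schemes can be sketched as a two-component normal mixture. Three assumptions are made here: N(0, 9) is read as variance 9 (standard deviation 3, as the label "N3" suggests); each observation is contaminated independently with probability ε rather than in a fixed proportion; and Python's `random.gauss` replaces the Uniform(0, 1) transformation used in the original FORTRAN programs.

```python
import random

def contaminated_normal_sample(n, eps, rng, sd_clean=1.0, sd_outlier=3.0):
    """Draw n values: N(0, sd_clean^2) w.p. 1 - eps, N(0, sd_outlier^2) w.p. eps."""
    sample = []
    for _ in range(n):
        sd = sd_outlier if rng.random() < eps else sd_clean
        sample.append(rng.gauss(0.0, sd))
    return sample

def mixture_variance(eps, sd_clean=1.0, sd_outlier=3.0):
    """Variance of the zero-mean mixture: (1 - eps) * 1 + eps * 9 by default."""
    return (1.0 - eps) * sd_clean ** 2 + eps * sd_outlier ** 2

rng = random.Random(2009)                              # arbitrary seed
c10n3 = contaminated_normal_sample(10, 0.10, rng)
c20n3 = contaminated_normal_sample(10, 0.20, rng)
print(mixture_variance(0.10), mixture_variance(0.20))  # about 1.8 and 2.6
```

Both mixtures keep the mean at 0, so the target of every confidence interval is unchanged; only the variance and the tail weight increase.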

The simulated results for the coverage probability, the average width (AW) and the standard error (SE) of the confidence intervals, without contamination and at the two levels of contamination (10 and 20%), are given in Tables 1-3.
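A scaled-down version of such a simulation can be sketched as follows. It compares only the Student t and MAD t intervals for n = 10, uses far fewer replications than the 10000 of the study, and interprets the "standard error" as the standard deviation of the interval widths across replications; that interpretation, the mixture sampler and all names are assumptions.

```python
import math
import random
import statistics

T_CRIT = 2.262  # upper 2.5% point of Student t with 9 df (n = 10, 95% level)

def draw_sample(n, eps, rng):
    """N(0, 1) with probability 1 - eps, N(0, 9) (sd 3) with probability eps."""
    return [rng.gauss(0.0, 3.0 if rng.random() < eps else 1.0) for _ in range(n)]

def student_t(xs, t_crit):
    half = t_crit * statistics.stdev(xs) / math.sqrt(len(xs))
    center = statistics.mean(xs)
    return center - half, center + half

def mad_t(xs, t_crit):
    md = statistics.median(xs)
    mad = 1.4826 * statistics.median([abs(x - md) for x in xs])
    half = t_crit * mad / math.sqrt(len(xs))
    return md - half, md + half

def simulate(interval_fn, n, eps, reps, t_crit, seed=0):
    """Return (coverage, average width, standard deviation of widths)."""
    rng = random.Random(seed)
    hits = 0
    widths = []
    for _ in range(reps):
        lo, hi = interval_fn(draw_sample(n, eps, rng), t_crit)
        hits += int(lo <= 0.0 <= hi)   # the true mean is 0 in every scenario
        widths.append(hi - lo)
    return hits / reps, statistics.mean(widths), statistics.stdev(widths)

if __name__ == "__main__":
    for eps in (0.0, 0.1, 0.2):
        for name, fn in (("Student t", student_t), ("MAD t", mad_t)):
            cov, aw, se = simulate(fn, n=10, eps=eps, reps=2000, t_crit=T_CRIT)
            print(f"eps={eps:.1f} {name:9s} coverage={cov:.3f} AW={aw:.3f} SE={se:.3f}")
```

Other sample sizes require the matching t critical value, and the numbers produced by this sketch are not expected to reproduce the published tables exactly.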

DISCUSSION

The performance (relative efficiency) of the proposed methods for the normal distribution without outliers is examined first. The estimated coverage probabilities, the average widths and the standard errors for all confidence interval methods are displayed in Table 1. The results in Table 1 suggest that the proposed methods have coverage probabilities close to the nominal confidence coefficient when sampling from a normal distribution, as expected. Table 1 also shows that the average widths of the Sps t, Downton t and MAD t confidence intervals are larger than the average width of the Student t confidence interval. As expected, this makes the Student t interval the best method for normal data with no contamination.

Tables 2 and 3 give the estimated coverage probabilities, the average widths and the standard errors for all confidence interval methods under a normal distribution with 10 and 20% contamination, respectively.

Table 1: Coverage probability, average width and standard error for the standard normal distribution

Table 2: Coverage probability, average width and standard error for the 10% contaminated normal distribution

Table 3: Coverage probability, average width and standard error for the 20% contaminated normal distribution

The results in Tables 2 and 3 suggest that the Sps t and MAD t confidence intervals are more resistant to contaminated data than the other confidence intervals, with Sps t performing best. The outliers greatly changed the coverage probabilities and increased the widths of the Student t confidence interval. For all sample sizes and both contaminated normal distributions, the Sps t and MAD t intervals had good coverage probabilities with reasonable average widths. The Student t confidence interval for the mean given in many textbooks does not behave properly for contaminated data, because the sample mean and the sample standard deviation are not good choices for constructing a confidence interval in that setting. At any rate of contamination, we therefore suggest using the Sps t interval, followed by the MAD t interval, when the population distribution is normal with outliers. On the other hand, when the population distribution is normal without outliers, we suggest using the Student t confidence interval, as the theory says.

Adrover et al. (2004) illustrated the performance of the globally robust confidence intervals by a small Monte Carlo simulation. Note that (as expected) the coverage probabilities and widths achieved by our proposed Sps t and MAD t confidence intervals are much better than those of their globally robust intervals. This can be explained by observing that the globally robust intervals take account of a wide range of contaminated distributions.

Shi and Kibria (2007) proposed alternative Median t and MAD t confidence intervals, which are adjustments to the Student t interval. They compared performance according to coverage probabilities, widths and the ratio of coverage probabilities to widths. They concluded that Median t performs best in the sense of higher coverage probabilities, while MAD t performs best in the sense of smaller width for simulated data from a Gamma distribution. Their coverage probabilities for MAD t are low, but it is the best with respect to interval width. Note also that their formulas for the calculation of MAD t differ from ours, as does the underlying distribution.

Baklizi and Kibria (2009) considered some confidence intervals for the mean or difference of means of the Gamma distribution by extending the Median t interval to the two-sample problem, which differs from our problem. They used bootstrap techniques to compare the performance of these procedures based on the coverage probabilities and widths of the intervals. They concluded that Median t and the bootstrapped one-sample Median t have the closest coverage probabilities to the nominal level. Note also that their formulas for the calculation of MAD t differ from ours, as does the underlying distribution.

ACKNOWLEDGMENTS

The authors would like to thank the Hashemite University for the cooperation during the preparation of this paper. The authors also wish to thank the managing and associated editors and the referees for their helpful comments and suggestions which have improved the presentation of the study.

REFERENCES

  • Adrover, J., M. Salibian-Barrera and R. Zamar, 2004. Globally robust inference for the location and simple linear regression models. J. Statist. Plann. Inference, 119: 353-375.


  • Baklizi, A., 2007. Inference about the mean difference of two non-normal populations based on independent samples: A comparative study. J. Statist. Comput. Simul., 77: 613-624.


  • Baklizi, A., 2008. Inference about the mean of skewed population: A comparative study. J. Statist. Comput. Simul., 78: 421-435.


  • Baklizi, A. and B. Kibria, 2009. One and two sample confidence intervals for estimating the mean of skewed populations: An empirical comparative study. J. Applied Statist., 1: 1-9.


  • Barnett, F., K. Mullen and J.G. Saw, 1967. Linear estimates of a population scale parameter. Biometrika, 54: 551-554.


  • Betteley, G., N. Mettrick, E. Sweeney and D. Wilson, 1994. Using Statistics in Industry: Quality Improvement Through Total Process Control. 1st Edn., Prentice Hall International Ltd., London


  • Bloch, D.A. and J.L. Gastwirth, 1968. On a simple estimate of the reciprocal of the density function. Ann. Math. Statist., 39: 1083-1085.


  • David, H.A., 1968. Gini's mean difference rediscovered. Biometrika, 55: 573-575.


  • Downton, F., 1966. Linear estimates with polynomial coefficients. Biometrika, 53: 129-141.


  • Francis, A., 1995. Business Mathematics and Statistics. 4th Edn., DP Publications Ltd., London, ISBN: 1-85805-157-6


  • Gross, A.M., 1976. Confidence interval robustness with long-tailed symmetric distributions. J. Am. Statist. Assoc., 71: 409-416.


  • Guenther, W.C., 1969. Shortest confidence intervals. Am. Statist., 23: 22-25.


  • Hampel, F.R., 1974. The influence curve and its role in robust estimation. J. Am. Stat. Assoc., 69: 383-393.


  • Hettmansperger, T.P. and J.W. McKean, 1998. Robust Nonparametric Statistical Methods. 1st Edn., Hodder Arnold, London, ISBN: 978-0340549377


  • Horn, P.S., 1983. Some easy t-statistics. J. Am. Statist. Assoc., 78: 930-936.


  • Iglewicz, B., 1983. Robust Scale Estimators and Confidence Intervals for Location. In: Understanding Robust and Exploratory Data Analysis, Hoaglin, D.C., F. Mosteller and J.W. Tukey (Eds.). John Wiley and Sons, New York, ISBN: 0-471-38491-7, pp: 405-431


  • Johnson, N.J., 1978. Modified t tests and confidence intervals for asymmetrical populations. J. Am. Statist. Assoc., 73: 536-544.


  • Kafadar, K., 1982. A biweight approach to the one-sample problem. J. Am. Statist. Assoc., 77: 416-424.


  • Kendall, M. and A. Stuart, 1958. The Advanced Theory of Statistics, Distribution Theory. 3rd Edn., Charles Griffin and Co. Ltd., London


  • Kibria, B.M.G., 2006. Modified confidence intervals for the mean of the asymmetric distribution. Pak. J. Statist., 22: 109-120.


  • Kleijnen, J.P.C., G.L.J. Kloppenburg and F.L. Meeuwsen, 1986. Testing the mean of asymmetric population: Johnson's modified t test revisited. Commun. Statist. Simul. Comput., 15: 715-732.


  • Meeden, G., 1999. Interval estimators for the population mean for skewed distributions with a small sample size. J. Applied Statist., 26: 81-96.


  • Nair, U.S., 1936. The standard error of Gini's mean difference. Biometrika, 28: 428-436.


  • Park, C. and B.R. Cho, 2003. Development of robust design under contaminated and non-normal data. Qual. Eng., 15: 463-469.


  • Rousseeuw, P.J. and C. Croux, 1993. Alternatives to the median absolute deviation. J. Am. Statist. Assoc., 88: 1273-1283.


  • Sarhan, A.E. and B.G. Greenberg, 1962. Contributions to Order Statistics. 1st Edn., John Wiley and Sons, New York, ISBN: 978-0471754206


  • Shi, W. and B. Kibria, 2007. On some confidence intervals for estimating the mean of a skewed population. Int. J. Math. Educ. Technol., 38: 412-421.


  • Staudte, R.G. and S.J. Sheather, 1990. Robust Estimation and Testing. 2nd Edn., John Wiley and Sons, New York, ISBN: 978-0-471-85547-7


  • Student, 1908. The probable error of a mean. Biometrika, 6: 1-25.


  • Willink, R., 2005. A confidence interval and test for the mean of an asymmetric distribution. Commun. Statist. Theory Meth., 34: 753-766.

© Science Alert. All Rights Reserved