Research Article
Confidence Interval for Locations of Non-kurtosis and Large Kurtosis Leptokurtic Symmetric Distributions
Department of Mathematics, Faculty of Science and Information Technology, Zarqa University, Zarqa 13132, Jordan
The Student's t confidence interval for one population mean is derived under the assumption that the sample was randomly selected from a normal population with unknown standard deviation. In practice, however, many researchers use Student's t statistic to test hypotheses or construct confidence intervals regardless of sample size and distribution shape.
They rely on the idea that the t-test or the Student's t confidence interval is robust against moderate deviations from the normality assumption, except for extremely skewed populations (Bartlett, 1935; Bradley, 1980; Pearson and Please, 1975; Pocock, 1982). Pearson and Please (1975) established that skewness affects Type I error rates and hence the confidence level actually attained. Chaffin and Rhiel (1993) found that if both skewness and kurtosis are present in the distribution, they have a great impact on the distribution of Student's t statistic. Chami et al. (2007) derived a modified version of Cox's method to provide a more efficient confidence interval for the mean of the log-normal distribution than the five available estimators. They used the same comparative criteria as this study: coverage probability and interval width. Johnson (1978) proposed a modification of the Student's t confidence interval for skewed distributions. Since 1978, several other methods have been proposed in the literature, among them those of Baklizi (2007, 2008), Gross (1976), Hettmansperger and McKean (1998), Kibria (2006), Kleijnen et al. (1986), Meeden (1999), Shi and Kibria (2007) and Willink (2005).
Rhiel and Chaffin (1996) suggested using Student's t statistic regardless of sample size when the population distribution is symmetric, in the sense that the coverage probability is then close to the nominal confidence level. Wardrop (1994) reported a similar study of confidence intervals for the mean. He compared the proportion of 5000 simulated intervals that contained the population mean with the nominal confidence level and suggested using the Student's t confidence interval when the population is symmetric.
Nevertheless, the Student's t confidence interval may not be an appropriate model even if the population distribution is symmetric. A case of considerable practical interest is one in which the observations follow a symmetric distribution that is long-tailed, contaminated normal, of large kurtosis, or heavy-tailed with no mean or variance. Such observations might have a strong influence on the width and the coverage probability of the Student's t confidence interval. Abu-Shawiesh et al. (2009) proposed four confidence intervals for the mean of a contaminated normal distribution, namely the Student's t, Sps t, Downton's t and MAD t confidence intervals. Using a simulation study, they compared the performance of these intervals to check how the presence of extreme values affects them. Performance was judged by the coverage probabilities and widths of the intervals. They found that the Sps t and MAD t confidence intervals are better than the others, which supports the findings of this study. Gross (1976) examined confidence interval robustness for long-tailed (Gaussian and longer) symmetric distributions and showed that long tails have a great effect on the distribution of Student's t statistic.
This study restricts itself, for the most part, to examining the impact of non-kurtosis distributions (those whose kurtosis is undefined) and large-kurtosis distributions on the distribution of Student's t statistic and to proposing three alternative interval estimators, namely the Sps t, MAD t and Downton's t confidence intervals, for estimating the location parameters of non-kurtosis and large-kurtosis leptokurtic symmetric distributions that have not been considered by the aforementioned authors.
The objective of the study was to observe the performance of the proposed confidence intervals and compare them with the Student's t confidence interval in order to determine the best method: the one with the smallest average width among the confidence intervals whose coverage probabilities are not less than the lower robustness bound. These investigations are carried out by a simulation study.
The coverage probability and the average width of each confidence interval are simulated using 40000 samples from each of the following distributions: standard normal; standard double exponential; standard double translated Pareto with infinite excess kurtosis; double translated Pareto with no mean, variance or kurtosis defined; standard Cauchy; and double translated Pareto with mean zero and infinite variance. The best confidence interval is then selected.
THE PROPOSED CONFIDENCE INTERVALS
Several confidence intervals for the location parameter exist in the literature. However, the author will consider the following four intervals.
Student's t confidence interval: Let $X_1, X_2, \dots, X_n$ be a random sample from a normal population with mean μ and standard deviation σ, and let $\bar{X}$ and $S$ denote the sample mean and the sample standard deviation. Then the statistic $T = (\bar{X} - \mu)/(S/\sqrt{n})$ given by Student (1908) has a t-distribution with (n-1) degrees of freedom. The 100(1-α)% confidence interval for μ has the form $\bar{X} \pm t_{\alpha/2,\,n-1}\, S/\sqrt{n}$. When the population distribution is non-normal, the sample mean is approximately normally distributed with mean μ and standard deviation $\sigma/\sqrt{n}$, provided that n is large. Hence the distribution of Student's t statistic converges to the standard normal distribution and $\bar{X} \pm z_{\alpha/2}\, S/\sqrt{n}$ is the approximate confidence interval for μ, where $z_{\alpha}$ is the upper α percentage point of the standard normal distribution. Rhiel and Chaffin (1996) suggested using t critical values instead of z critical values when the distribution is symmetric or slightly skewed, because they give coverage probabilities closer to the nominal confidence levels. Thus, the 100(1-α)% confidence interval for μ is $\bar{X} \pm t_{\alpha/2,\,n-1}\, S/\sqrt{n}$, where $t_{\alpha,\,n-1}$ is the upper α percentage point of the Student's t distribution with (n-1) degrees of freedom; that is, $P(T > t_{\alpha,\,n-1}) = \alpha$.
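The Student's t interval described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the author's implementation: the function name `student_t_interval` and the data are hypothetical, and the critical value is passed in from a t table instead of being computed.

```python
import math
import statistics


def student_t_interval(sample, t_crit):
    """Return (lower, upper) for mean +/- t_crit * S / sqrt(n).

    t_crit is the upper alpha/2 point of the t distribution with n - 1
    degrees of freedom, taken from a table (an assumption of this sketch).
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation, divisor n - 1
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical data with n = 10; 2.262 is the tabulated upper 0.025 point
# of the t distribution with 9 degrees of freedom.
data = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 4.9, 5.2, 4.8]
low, high = student_t_interval(data, t_crit=2.262)
print(f"95% Student's t interval: ({low:.3f}, {high:.3f})")
```

Passing the critical value as an argument keeps the sketch free of third-party dependencies; a statistics library could supply the quantile instead.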
The Sps t confidence interval: This confidence interval is constructed by using the fact that the sample median (MD) of a normally distributed random variable with mean μ and standard deviation σ is asymptotically normally distributed with mean μ and standard error $\sqrt{\pi/2}\,\sigma/\sqrt{n}$, and the fact that the pseudo-standard deviation Sps, based on the interquartile range (IQR), is an unbiased estimator of σ (Mood et al., 1974; Staudte and Sheather, 1990). Therefore, the Sps t confidence interval for the location parameter μ is defined by:
(1) |
where,
(2) |
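A minimal sketch of how an Sps t interval could be computed, offered as an illustration and not as the author's code. It assumes the interval has the form MD ± t·Sps/√n with Sps = IQR/1.349, where 1.349 is the interquartile range of the standard normal distribution. The interval form, the constant, the quartile rule (`method="inclusive"`) and the function name are all assumptions of this sketch.

```python
import math
import statistics


def sps_t_interval(sample, t_crit):
    """Assumed form: median +/- t_crit * S_ps / sqrt(n), S_ps = IQR / 1.349."""
    n = len(sample)
    md = statistics.median(sample)
    # Quartile convention is an assumption; other rules give slightly
    # different IQR values for small samples.
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    s_ps = (q3 - q1) / 1.349  # pseudo-standard deviation (assumed constant)
    half_width = t_crit * s_ps / math.sqrt(n)
    return md - half_width, md + half_width


# Hypothetical data with n = 9; 2.306 is the tabulated upper 0.025 point
# of the t distribution with 8 degrees of freedom.
sps_low, sps_high = sps_t_interval(list(range(1, 10)), t_crit=2.306)
print(f"Sps t interval: ({sps_low:.3f}, {sps_high:.3f})")
```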
Downton's t confidence interval: Downton (1966) proposed σ* as an estimator for the standard deviation σ of a normal population. The Downton estimator σ* is defined in terms of Gini's mean difference:
(3) |
where,
(4) |
The Downton estimator σ* has been recommended as a robust scale estimator for σ by many authors, for example Iglewicz (1983), Barnett et al. (1967) and Sarhan and Greenberg (1962). Therefore, the Downton's t confidence interval can be defined by:
(5) |
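As a hedged illustration of a Downton-type interval, the sketch below computes Gini's mean difference G as the average absolute difference over all pairs and scales it by √π/2, which makes it unbiased for σ under normality. The interval form (sample mean ± t·σ*/√n) and the function names are assumptions of this sketch, not taken from the equations above.

```python
import math
import statistics
from itertools import combinations


def gini_mean_difference(sample):
    """Average of |x_i - x_j| over all pairs i < j."""
    n = len(sample)
    total = sum(abs(a - b) for a, b in combinations(sample, 2))
    return 2.0 * total / (n * (n - 1))


def downton_sigma(sample):
    """sqrt(pi)/2 * G, since E|X1 - X2| = 2 * sigma / sqrt(pi) for normal data."""
    return math.sqrt(math.pi) / 2.0 * gini_mean_difference(sample)


def downton_t_interval(sample, t_crit):
    """Assumed form: mean +/- t_crit * sigma_star / sqrt(n)."""
    n = len(sample)
    center = statistics.fmean(sample)
    half_width = t_crit * downton_sigma(sample) / math.sqrt(n)
    return center - half_width, center + half_width


# Hypothetical data with n = 3; 4.303 is the tabulated upper 0.025 point
# of the t distribution with 2 degrees of freedom.
d_low, d_high = downton_t_interval([1.0, 2.0, 3.0], t_crit=4.303)
print(f"Downton's t interval: ({d_low:.3f}, {d_high:.3f})")
```

The pairwise sum costs O(n²) operations, which is negligible for the sample sizes considered here; an order-statistics formula would be faster for large n.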
The MAD t confidence interval: The MAD t confidence interval for μ is given as:
(6) |
where,
(7) | $\mathrm{MAD} = \operatorname{median}\{\,|X_i - \mathrm{MD}|,\ i = 1, \dots, n\,\}$
is the median absolute deviation from the sample median MD (Hampel, 1974) and bn is a correction factor needed to make bn MAD an unbiased estimator for σ (Rousseeuw and Croux, 1993).
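The MAD-based interval can be sketched as follows, under stated assumptions: the interval is taken to be MD ± t·bn·MAD/√n, and bn defaults to 1.4826, the large-sample value that makes the MAD consistent for σ under normality. Finite-sample values of bn differ and would need to be supplied by the user. The function name and data are hypothetical.

```python
import math
import statistics


def mad_t_interval(sample, t_crit, b_n=1.4826):
    """Assumed form: median +/- t_crit * b_n * MAD / sqrt(n).

    b_n = 1.4826 is the asymptotic correction factor (an assumption here);
    small-sample corrections are not applied.
    """
    n = len(sample)
    md = statistics.median(sample)
    mad = statistics.median([abs(x - md) for x in sample])
    half_width = t_crit * b_n * mad / math.sqrt(n)
    return md - half_width, md + half_width


# Hypothetical data with one extreme value; 2.776 is the tabulated upper
# 0.025 point of the t distribution with 4 degrees of freedom.
m_low, m_high = mad_t_interval([1, 2, 3, 4, 100], t_crit=2.776)
print(f"MAD t interval: ({m_low:.3f}, {m_high:.3f})")
```

In this example the value 100 moves neither the median (3) nor the MAD (1), which illustrates why an interval built on these statistics is resistant to extreme observations.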
ROBUSTNESS
In this study, the author adopted the definition of robustness used by Bradley (1980). Bradley (1980) considered a test to be robust at the nominal level α = 0.05 if its simulated Type I error rate (simulated α) was within 20% of the nominal level (between 0.04 and 0.06), and at α = 0.01 if the simulated rate was within 50% of the nominal level (between 0.005 and 0.015). According to this definition and the relationship between the two-sided test and the confidence interval, a confidence interval whose coverage probability (simulated confidence level) lies between 0.94 and 0.96 is considered robust for the nominal confidence level 0.95, and one whose coverage probability lies between 0.985 and 0.995 is considered robust for the nominal confidence level 0.99.
To acknowledge the sampling error and the precision with which the simulated coverage probability estimates the nominal confidence level, 1-α, the margin of error is calculated for each of the nominal levels 0.95 and 0.99 by multiplying the standard error of the coverage probability by 1.96 and 2.58, respectively. The standard error is calculated as follows:
(8) | $SE = \sqrt{\dfrac{(1-\alpha)\,\alpha}{N}}$, where $N$ is the number of simulated samples
Table 1 gives the upper and lower robustness bounds as well as the margins of error for the nominal confidence levels 0.95 and 0.99.
Table 1: | Robustness bounds and margins of error for the various nominal confidence levels |
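The robustness bounds and margins of error described in this section can be reproduced with a short calculation. The sketch below assumes the binomial standard error √((1-α)α/N) with N = 40000 simulated samples; the function names are hypothetical.

```python
import math


def robustness_bounds(alpha, tolerance):
    """Bradley-style bounds on coverage: 1 - alpha * (1 +/- tolerance)."""
    lower = 1.0 - alpha * (1.0 + tolerance)
    upper = 1.0 - alpha * (1.0 - tolerance)
    return lower, upper


def margin_of_error(alpha, z, n_sim):
    """z times the binomial standard error of a simulated coverage (assumed form)."""
    standard_error = math.sqrt((1.0 - alpha) * alpha / n_sim)
    return z * standard_error


N_SIM = 40000  # number of simulated samples stated in the text
for alpha, tolerance, z in [(0.05, 0.20, 1.96), (0.01, 0.50, 2.58)]:
    lower, upper = robustness_bounds(alpha, tolerance)
    moe = margin_of_error(alpha, z, N_SIM)
    print(f"nominal {1 - alpha:.2f}: bounds ({lower:.3f}, {upper:.3f}), margin {moe:.4f}")
```

Under these assumptions the bounds are 0.94 to 0.96 and 0.985 to 0.995, and the margins of error are about 0.0021 and 0.0013 for the nominal levels 0.95 and 0.99, respectively.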
THE METHOD
This study was funded by the Deanship of Research and Graduate Studies at Zarqa University, Jordan, and conducted in the Department of Mathematics and the laboratories of the Department of Information Technology over a period of ten months starting on March 3, 2010.
Computer simulation is used to carry out the objectives of the paper. Coverage probabilities and average widths are simulated for each confidence interval for samples from the distributions defined below:
• | Two large-kurtosis leptokurtic symmetric distributions with p.d.f. |
• | Standard double exponential distribution with excess kurtosis of 3 |
• | Standard double translated Pareto distribution |
(9) |
with c = 4 and infinite excess kurtosis
• | Three non-kurtosis symmetric distributions with p.d.f. |
• | Double translated Pareto distribution |
(10) |
with c = 1 and no mean, variance or higher moments defined
• | Standard Cauchy distribution which has no mean, variance or higher moments defined |
• | Double translated Pareto distribution with c=2, mean zero and infinite variance |
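Samples from the distributions listed above can be generated from uniform (0, 1) numbers through their inverse distribution functions. The sketch below is illustrative only. It assumes the double exponential density exp(-|x|)/2 and, for the double translated Pareto, the density c/(2(1+|x|)^(c+1)); the latter is an assumption of this sketch and may differ from the article's equations (9) and (10) by a scale factor. All function names are hypothetical.

```python
import math
import random


def laplace_quantile(u):
    """Inverse CDF of the double exponential density 0.5 * exp(-|x|)."""
    if u < 0.5:
        return math.log(2.0 * u)
    return -math.log(2.0 * (1.0 - u))


def cauchy_quantile(u):
    """Inverse CDF of the standard Cauchy distribution."""
    return math.tan(math.pi * (u - 0.5))


def double_pareto_quantile(u, c):
    """Inverse CDF for the assumed density c / (2 * (1 + |x|) ** (c + 1))."""
    if u < 0.5:
        return 1.0 - (2.0 * u) ** (-1.0 / c)
    return (2.0 * (1.0 - u)) ** (-1.0 / c) - 1.0


def open_uniform(rng):
    """Uniform draw on (0, 1); rejects 0.0 so that the quantiles stay finite."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def draw_sample(quantile, n, rng):
    """Probability transform: apply the quantile function to n uniform draws."""
    return [quantile(open_uniform(rng)) for _ in range(n)]


rng = random.Random(2010)  # arbitrary seed for reproducibility
pareto_sample = draw_sample(lambda u: double_pareto_quantile(u, 2.0), 20, rng)
cauchy_sample = draw_sample(cauchy_quantile, 20, rng)
```

Under the assumed Pareto density, moments of order c and higher do not exist, which is consistent with the list above: no mean for c = 1, infinite variance for c = 2 and an infinite fourth moment for c = 4.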
Samples of sizes n = 20, 30, 40, 50 and 100 were obtained by simulating uniform (0, 1) random numbers, which were then transformed into random samples from the above distributions by the probability transform. Forty thousand simulations were used for each sample size and each distribution. From the simulation, the observed proportions of calculated confidence intervals that contain the location parameter μ were determined and compared with the nominal confidence level and the robustness bounds, using MATLAB version 6.5 (Enander et al., 1996). In addition to the coverage probability, the average width was simulated for those confidence intervals whose coverage probabilities are not less than the lower robustness bound. The best confidence interval method is then the one that has the smallest average width for all sample sizes and all distributions.
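The simulation procedure can be outlined in Python as a small Monte Carlo loop. This is a sketch under stated assumptions, not the MATLAB program used in the study: it uses 4000 replications instead of 40000 to keep the run short, covers only the Student's t interval with standard normal data, and all names are hypothetical.

```python
import math
import random
import statistics


def student_t_interval(sample, t_crit):
    """mean +/- t_crit * S / sqrt(n)."""
    n = len(sample)
    center = statistics.fmean(sample)
    half_width = t_crit * statistics.stdev(sample) / math.sqrt(n)
    return center - half_width, center + half_width


def simulate(interval_fn, sampler, location, n, reps, rng):
    """Return (coverage probability, average width) over reps simulated samples."""
    hits = 0
    total_width = 0.0
    for _ in range(reps):
        sample = [sampler(rng) for _ in range(n)]
        low, high = interval_fn(sample)
        hits += low <= location <= high  # True counts as 1
        total_width += high - low
    return hits / reps, total_width / reps


rng = random.Random(12345)  # arbitrary seed for reproducibility
coverage, avg_width = simulate(
    interval_fn=lambda s: student_t_interval(s, 2.093),  # t(0.025, 19 d.f.)
    sampler=lambda r: r.gauss(0.0, 1.0),  # standard normal observations
    location=0.0,
    n=20,
    reps=4000,  # reduced from the 40000 used in the study
    rng=rng,
)
print(f"coverage = {coverage:.4f}, average width = {avg_width:.4f}")
```

Any other interval function or sampler with the same call signature can be passed to `simulate`, so the same loop would serve for comparing several methods across distributions and sample sizes.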
RESULTS
Tables 2-7 show the simulation results of the Student's t and the proposed confidence intervals for the location parameters when the nominal confidence levels are 0.95 and 0.99.
Table 2 shows the coverage probabilities and the average widths (AW) of the Student's t, Sps t, MAD t and Downton's t confidence intervals for samples from the standard normal distribution at nominal confidence levels 0.95 and 0.99. As expected, the coverage probabilities for the Student's t and the proposed confidence intervals are equal or close to the nominal confidence levels and within the robustness bounds for all sample sizes and both nominal confidence levels (the slight deviations from the nominal levels are a result of simulation error). The results also show that the average widths of the proposed confidence intervals are slightly larger than the average width of the Student's t confidence interval. As expected, this makes the Student's t interval slightly more precise than the proposed confidence intervals under the normality assumption.
Tables 3 and 4 show the coverage probabilities and the average widths of all confidence interval methods at the 0.95 and 0.99 nominal confidence levels for samples from heavy-tailed distributions with large excess kurtosis: the standard double exponential distribution with excess kurtosis of 3 and the standard double translated Pareto distribution with infinite excess kurtosis.
The results show that the coverage probability for the Student's t confidence interval is equal or close to the nominal confidence level and within the robustness bounds for all sample sizes. The coverage probabilities for the three proposed methods are larger than the upper robustness bound, except that those of the Sps t and MAD t intervals in Table 3 are within the robustness bounds at the 0.99 nominal confidence level for all sample sizes.
Table 2: | Coverage probability and average width for standard normal distribution |
Table 3: | Coverage probability and average width for standard double exponential distribution |
Table 4: | Coverage probability and average width for standard double translated Pareto distribution (c = 4) |
Table 5: | Coverage probability and average width for double translated Pareto distribution (c = 1) |
Table 6: | Coverage probability and average width for Cauchy (0, 1) distribution |
Table 7: | Coverage probability and average width for double translated Pareto distribution (c = 2) |
Tables 3 and 4 also show that the average widths of the Sps t and MAD t confidence intervals are close to each other and that both are much smaller than the average widths of the Student's t and Downton's t confidence intervals. Because of this and the associated increase in coverage probability, the Sps t and MAD t confidence intervals are stronger than the Student's t and Downton's t intervals at both nominal confidence levels. This makes MAD t and Sps t much better methods than the others.
Tables 5 and 6 show the coverage probabilities and the average widths of all confidence interval methods for samples from the double translated Pareto (c = 1) and standard Cauchy distributions, which have no mean, variance or kurtosis defined, at the 0.95 and 0.99 nominal confidence levels. The results show that the coverage probabilities for all confidence interval methods at both nominal confidence levels are larger than the upper robustness bound and that the average width of the MAD t interval is the smallest for all sample sizes and both nominal confidence levels. Notice that Cauchy data greatly changed the coverage probabilities and increased the average widths of the Student's t and Downton's t confidence intervals. Because of this and the associated increase in coverage probabilities, the MAD t confidence interval is the strongest method.
The results of Table 7 show that the MAD t and Sps t methods are equally strong, and stronger than the other methods, under the non-kurtosis, infinite-variance distribution.
The results in Tables 3-7 suggest that the MAD t confidence interval, followed by the Sps t confidence interval, is stronger and more resistant to non-kurtosis and large-kurtosis leptokurtic symmetric distributions than the Student's t and the other confidence interval methods and is the best, a result which is supported by Abu-Shawiesh et al. (2009). The Student's t confidence interval does not behave properly for the highly heavy-tailed Cauchy distribution and the double translated Pareto distribution with c = 1, neither of which has a mean, variance or higher moments defined. In such cases, results in probability theory about expected values and variances, such as the strong law of large numbers and the central limit theorem, do not apply. Moreover, these distributions tend to generate extreme values that inflate the sample standard deviation S, so that a few of the intervals are very wide and the average width is enlarged, a result which is supported by Gross (1976).
CONCLUSION
The results provide some insight into whether the Student's t confidence interval should be used for all symmetric non-normal distributions. The following points suggest that the MAD t confidence interval, followed by the Sps t confidence interval, is more appropriate than the Student's t confidence interval.
• | As expected, when the population distribution is normal, the Student's t confidence interval is slightly more precise than the other methods. Even so, the MAD t and Sps t confidence intervals have good accuracy and precision |
• | When the population distribution is symmetric with large or infinite kurtosis, as in Tables 3 and 4, the MAD t and Sps t confidence intervals are stronger than the Student's t confidence interval |
• | When the population distribution has no mean, variance or kurtosis defined, as in Tables 5 and 6, the Student's t confidence interval is not recommended at all, because the sample mean and the sample standard deviation are meaningless and are poor choices for constructing a confidence interval. The MAD t confidence interval, on the other hand, performed better |
• | When the population distribution has a finite mean, infinite variance and no kurtosis defined, as in Table 7, the MAD t and Sps t confidence intervals are stronger than the Student's t confidence interval |
ACKNOWLEDGMENT
The author acknowledges the support received from the Deanship of Research and Graduate Studies at Zarqa University, Jordan. The author also wishes to thank the Editorial committee and the referees for their valuable comments, which have improved the presentation of the paper.