ABSTRACT
Considerable attention has been devoted to the identification and detection of outliers in discrete univariate samples in the time and frequency domains, with less attention paid to what to do with detected outliers. Available techniques for the treatment of detected outliers were found to be subjective and deficient. An algorithm is proposed for the accommodation of aberrant observations in the frequency domain. A new filtering method of accommodating outliers is also suggested, and the performance of various accommodation techniques was determined with respect to the fixed and dynamic models. Five real, previously analyzed data sets of sizes T = 48, 70, 100, 146 and 150 were used in the study. Reductions of between 3.3 and 4.5% in the standard error were observed for both fixed and dynamic models after suspected outliers were accommodated by the filtering method. There was improvement in the precision of the parameter estimates at the p<0.05 level of significance for both real and simulated data. This work has established that the filtering method of accommodation of outliers is a better and more efficient technique than the existing methods, especially when the data set is large.
How to cite this article
DOI: 10.3923/ajms.2008.24.33
URL: https://scialert.net/abstract/?doi=ajms.2008.24.33
INTRODUCTION
An outlier is an observation in the data that differs noticeably from the other observations: a wild observation that does not appear to be consistent with the rest of the data. Grubbs (1969) remarks that an outlying observation, or outlier, is one that appears to deviate markedly from the other members of the sample in which it occurs. More recently, Battaglia (2006) defined an outlier not only as an anomalous observation arising from anomalous events, but also as an observation that is incoherent with the surrounding observations. It is, however, difficult to give exact criteria for deciding when a value is too big, too small or, in general, too extreme. Many authors have studied the occurrence of outliers in univariate and multivariate time series. Fox (1972) proposed two parametric models for studying outliers; Abraham and Box (1979) used the Bayesian method; Chang (1982) adopted Fox's model and proposed an iterative procedure to detect multiple outliers; Chang and Tiao (1983), Hoaglin and Iglewicz (1987) and Martin and Yohai (1988) treated outliers as contamination generated from a given probability distribution; Tsay (1988) investigated outliers, level shifts and variance changes in a unified manner; and Tsay et al. (2000) studied outliers in multivariate time series, to mention a few. Galiano et al. (2004) extended the study of outlier detection and investigated the effect of outliers in univariate and multivariate time series. Despite these efforts, researchers have not come up with an efficient method to deal with detected outliers.
The study of outliers in a data set is often, inevitably, an informal screening process preceding a fuller and more formal analysis of the data. Since the presence of one or more outliers in a data set can bias the estimates of the model parameters and greatly inflate the estimate of the variance (σ²), there is a serious need to consider methods for removing outliers from time series data. It may also be possible to apply outlier-robust methods to make valid inferences and reliable forecasts. Collett and Lewis (1976), referring to an earlier study on the accommodation of outliers, carried out in the middle of the eighteenth century on the combination of astronomical observations, asked:
... Is it right to hold that the several observations are of the same weight or moment, or equally prone to any or every error? Is there everywhere the same probability? Such an assertion would be quite absurd. I see no way of drawing a dividing line between those that are to be utterly rejected and those that are to be wholly retained; it may even happen that the rejected observation is the one that would have supplied the best correction to the others. Nevertheless, I do not condemn in every case the principle of rejecting one or other of the observations; indeed I approve it, whenever in the course of observation an accident occurs which in itself raises an immediate scruple in the mind of the observer. If there is no such reason for dissatisfaction, I think each and every observation should be admitted whatever its quality, as long as the observer is conscious that he has taken every care.
About 200 years after the views quoted above, many researchers were still investigating the timing and occurrence of outliers in statistical data. Many outlier generating techniques have been developed: Rosner (1975) suggested the Extreme Studentized Deviate (ESD); Chang and Tiao (1983) proposed the innovative outlier model (IO) and the additive outlier model (AO), which Shangodoyin (1994) improved on; and Shittu (2000) proposed the multiplicative outlier generating model (MO) and the condition of the Innovative and Additive outlier generating model (CO), to mention a few.
This study therefore aims at developing an alternative technique that uses robust trigonometric regression. The method is expected to improve the precision of the estimates and increase the accuracy of forecasts of time series data in the frequency domain.
TREATMENT OF OUTLIERS
Having enumerated some of the efforts that have been made to identify or label aberrant observations, the pertinent questions are: what action are we to take when one or more observations in a set of data are adjudged to be outliers? How should we react to outliers, and what principles and methods can be used to support rejecting them, adjusting their values or leaving them unaltered prior to processing the principal mass of data?
Treatment of outliers depends on the form of the population and the technique will be conditioned by and specific to the postulated basic model for the population. However, the method of processing outliers takes a relative form.
Rejection of Outliers
According to Hawkins (1980), early approaches to the processing of outliers involved testing an outlier to determine whether it should be retained or rejected, since an outlying observation could represent one of the most important pieces of data, perhaps pointing to some special, as yet undiscovered, feature of the relationship between related variables.
However, care must be taken in deciding whether to delete an observation. On this, Kruskal (1960) said:
As to practice, I suggest that it is of great importance to preach the doctrine that apparent outliers should always be reported, even when one feels that their causes are known or when one rejects them for whatever good rule or reason. The immediate pressures of practical statistical analysis are almost uniformly in the direction of suppressing announcements of observations that do not fit the pattern; we must maintain a strong sea-wall against these pressures.
Thus, outright rejection of suspected outliers has statistical consequences: it reduces the number of observations in the sample, so that further analysis is carried out on the reduced (or censored) sample, which may affect inferences on the related population. Moreover, rejection was often not carried out according to any formal procedure, but was purely a matter of the observer's judgment.
Weighting Method
Weighting is an alternative to outright rejection of extreme values. Glaisher (1872) was perhaps the first to publish a paper on a weighting procedure. Rider (1933) wrote in his study:
Since the object of combining observations is to obtain the best possible estimate of the true value of a magnitude, the principle underlying (weighting) methods is that an observation which differs widely from the rest should be retained, but assigned a smaller weight than the others in computing a weighted average. Of course, retention with an exceedingly small weight amounts to virtual rejection.
Glaisher's method was concerned with n observations from normal distributions with a common mean, which is to be estimated, and with unknown and unequal variances. He proposed estimating the mean μ iteratively by a weighted combination of the Xi, with weights determined from the squared deviations of the observations. Stone (1968) criticized Glaisher's method and proposed an alternative weighting procedure based on maximum likelihood. This leads to a weighted mean μ that is the solution of an equation of degree (n-1).
Trimming
Another alternative to outright rejection is trimming, a procedure in which a fixed fraction α of the lower and upper extreme sample values is totally discarded before processing the sample. To illustrate this procedure, suppose we are estimating a location parameter μ from n ordered observations $X_1, X_2, \ldots, X_n$. Since outliers manifest themselves as extreme values, it is possible to control the variability due to the r lowest sample values $X_1, X_2, \ldots, X_r$ and the s highest ones $X_{n-s+1}, \ldots, X_n$, where the (r + s) observations are adjudged outliers. If the (r + s) observations are omitted, so that we confine ourselves to a censored sample of size (n - r - s), we get the (r, s)-fold trimmed mean:
$$\bar{X}^{T}_{r,s} = \frac{1}{n-r-s}\sum_{i=r+1}^{n-s} X_i$$
This procedure is not very different from the rejection technique, even though Barnett and Lewis (1985) believed that the α-trimmed mean does not 'throw out' outliers in the sense of ignoring them completely. They claim that it 'brings them in' toward the bulk of the sample.
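The trimming procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `trimmed_mean` and the sample values are hypothetical and do not come from the study.

```python
def trimmed_mean(values, r, s):
    """(r, s)-fold trimmed mean: drop the r smallest and s largest values."""
    n = len(values)
    if r < 0 or s < 0 or r + s >= n:
        raise ValueError("need r >= 0, s >= 0 and r + s < n")
    ordered = sorted(values)      # order statistics X_1 <= ... <= X_n
    kept = ordered[r:n - s]       # censored sample of size n - r - s
    return sum(kept) / len(kept)


# Hypothetical sample with one suspiciously large value.
sample = [2.0, 4.0, 5.0, 6.0, 7.0, 30.0]
print(sum(sample) / len(sample))   # ordinary mean, pulled up by 30.0
print(trimmed_mean(sample, 1, 1))  # mean of 4, 5, 6, 7 -> 5.5
```

Note that the trimmed observations no longer contribute at all, which is why the procedure behaves much like outright rejection.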
Winsorizing
If, on the other hand, the r lowest and/or s largest sample values are each replaced by the value of the nearest observation that is retained unchanged (their nearest neighbor), then we have the (r, s)-fold Winsorized mean.
Thus, given n ordered observations $X_1, X_2, \ldots, X_n$, where it is known a priori or detected through some statistical procedure that $X_1, \ldots, X_r$ and $X_{n-s+1}, \ldots, X_n$ are lower and upper outliers respectively, the lower extreme values are each replaced by $X_{r+1}$ and the upper (largest) extreme values by $X_{n-s}$, so that we work with a transformed sample of size n. The (r, s)-fold Winsorized mean is given by:
$$\bar{X}^{W}_{r,s} = \frac{1}{n}\left[ r X_{r+1} + \sum_{i=r+1}^{n-s} X_i + s X_{n-s} \right]$$
This makes each of the latter values ($X_{r+1}$ and $X_{n-s}$) appear more than once in the data (twice when r = s = 1).
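As a minimal sketch of Winsorizing, assuming the r smallest and s largest ordered values are the suspected outliers, the replacement step and the resulting mean could be written as follows. The function name and the sample are hypothetical.

```python
def winsorized_mean(values, r, s):
    """(r, s)-fold Winsorized mean over the full sample size n."""
    n = len(values)
    if r < 0 or s < 0 or r + s >= n:
        raise ValueError("need r >= 0, s >= 0 and r + s < n")
    ordered = sorted(values)
    low = ordered[r]             # X_(r+1), nearest retained lower value
    high = ordered[n - s - 1]    # X_(n-s), nearest retained upper value
    adjusted = [low] * r + ordered[r:n - s] + [high] * s
    return sum(adjusted) / n     # sample size stays n, unlike trimming


# Hypothetical sample: 1.0 and 40.0 are treated as lower and upper outliers.
sample = [1.0, 3.0, 5.0, 6.0, 40.0]
print(winsorized_mean(sample, 1, 1))  # mean of 3, 3, 5, 6, 6
```

Unlike trimming, the sample size is preserved, so later analysis is not carried out on a censored sample.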
Interpolation
This is a method considered by Xie (1993), in which the underlying series is assumed to be linear and parametric. He assumed an ARMA (p, q) model for the contaminated series, treated the outlying observation as a missing value and then obtained a replacement for it using well known interpolation formulae.
ALTERNATIVE APPROACH-FILTERING METHOD
Considering the performance and the limitations of the various procedures for treatment of outliers in the earlier section, coupled with the fact that a censored data set costs the investigator a great loss of confidence in the specified model, we propose a procedure for filtering suspected outliers.
This approach uses robust trigonometric regression to obtain the robustified discrete Fourier transform such that, at each frequency, we fit a sine and a cosine coefficient by using either the repeated median technique of Siegel (1982) or the biweight of Tukey (Andrews et al., 1972).
The filtered value of the suspected outlier is substituted back into the data set before further analysis is carried out.
To apply the filtering method, the observed series is first subjected to an outlier test, with a view to detecting aberrant observations, using the following algorithms.
Algorithm 1 (Detection of Outliers)
Given a time series X1, X2, X3, ...., XN:
• Compute the median of the series.
• Compute the Fourier frequencies for i = 1, 2, ..., k.
• Obtain the estimates of the cosine and sine coefficients at each Fourier frequency.
• Compute the periodogram, i.e., the squared amplitude of the estimated coefficients, for every Fourier frequency in the range. (1)
If a trial frequency is very close to the true frequency of a periodic component, then the estimated cosine and sine coefficients will also be close to the true coefficients; hence the squared amplitude will be non-zero and there will be a large peak. This corresponds to the frequency with the greatest contribution to the variance. However, if the trial frequency is substantially far from the true frequency, the periodogram will be close to zero.
• Determine the Fourier frequencies, i = 1, 2, ...., k, whose squared amplitude is significantly greater than zero.
• Obtain the residual variance of the series.
• Compute the test statistic for i = 1, 2, ..., k. It is based on the median, which is more resistant to outliers and hence a more robust measure of central tendency (Atkinson, 1981).
• Determine the test statistics that satisfy the rejection condition with respect to the critical value C, where C is simulated as 1.00 or 1.10.
• The observation corresponding to each such statistic is declared an outlier.
This procedure was applied in Shittu and Shangodoyin (2007).
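The first steps of the detection algorithm (median, Fourier frequencies, cosine and sine coefficients, squared amplitude) can be sketched as below. This is a hedged illustration, not the authors' implementation: the Fourier frequencies are assumed to be 2πi/N, the coefficients are the ordinary least squares estimates about the median, the number of frequencies k is assumed to be (N - 1) // 2, and any normalising constant in Eq. (1) is omitted. The test statistic and critical value are not reproduced because their exact form is not recoverable from the text.

```python
import math
from statistics import median


def periodogram_about_median(x):
    """Return (i, omega_i, a_i, b_i, squared amplitude) per Fourier frequency."""
    n = len(x)
    centre = median(x)               # robust location, as in the algorithm
    k = (n - 1) // 2                 # assumption: frequencies below Nyquist
    rows = []
    for i in range(1, k + 1):
        omega = 2 * math.pi * i / n  # assumed form of the Fourier frequency
        a = (2 / n) * sum((xt - centre) * math.cos(omega * t)
                          for t, xt in enumerate(x, start=1))
        b = (2 / n) * sum((xt - centre) * math.sin(omega * t)
                          for t, xt in enumerate(x, start=1))
        rows.append((i, omega, a, b, a * a + b * b))
    return rows


# Synthetic series: level 10 plus one cosine with 4 cycles in 48 points.
n = 48
series = [10 + 3 * math.cos(2 * math.pi * 4 * t / n) for t in range(1, n + 1)]
peak = max(periodogram_about_median(series), key=lambda row: row[4])
print(peak[0])  # index of the frequency with the largest squared amplitude
```

On this synthetic series the largest squared amplitude falls at i = 4, the frequency that generated the data, which is the peak behaviour the algorithm relies on.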
Algorithm 2 (Accommodation of Outliers)
When outliers are detected using Algorithm 1 above, the median rather than the mean is employed as the measure of location (Atkinson, 1981).
• Obtain the median of the observed series X1, X2, X3, ...., XN.
• Determine the Fourier frequencies, i = 1, 2, ...., k, whose squared amplitude is non-zero.
• Using the biweight filter of Tatum and Hurvich (1993) or the repeated median filter of Siegel (1982), compute the robustified discrete Fourier transform for all of these frequencies. (2)
• The result gives the filtered data set, from which the contamination (outliers) has been cleaned.
• Each outlier detected using Algorithm 1 is then replaced by its filtered value before further analysis is carried out.
This approach uses the robust trigonometric regression to obtain the robustified discrete Fourier transform such that at each frequency, we fit a sine and cosine coefficient, by using either the biweight filter of Tatum and Hurvich (1993) or the repeated median technique of Siegel (1982).
In the final analysis, the filtered value of the suspected outlier will be substituted back into the data set with a view to comparing its performance with that of the other existing methods of treating outliers.
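The accommodation step can be illustrated with a simplified robust trigonometric regression. The sketch below is not the biweight filter of Tatum and Hurvich (1993) nor the repeated median filter of Siegel (1982); it swaps in a plain iteratively reweighted least squares fit with Tukey biweight weights, started from ordinary least squares, with the commonly used tuning constant 4.685 and a MAD scale. All function names, the choice of frequencies and the flagged index are assumptions made for the example.

```python
import math
from statistics import median

TUNING = 4.685  # common biweight tuning constant (assumption, not from the paper)


def mad_scale(residuals):
    """Median absolute deviation, rescaled to be comparable to a std. dev."""
    centre = median(residuals)
    return median(abs(e - centre) for e in residuals) / 0.6745


def biweight_trig_fit(y, omega, max_iter=20):
    """Fit y_t ~ a*cos(omega*t) + b*sin(omega*t) with biweight reweighting."""
    n = len(y)
    cos_t = [math.cos(omega * t) for t in range(1, n + 1)]
    sin_t = [math.sin(omega * t) for t in range(1, n + 1)]
    weights = [1.0] * n          # first pass is ordinary least squares
    a = b = 0.0
    for _ in range(max_iter):
        scc = sum(w * c * c for w, c in zip(weights, cos_t))
        sss = sum(w * s * s for w, s in zip(weights, sin_t))
        scs = sum(w * c * s for w, c, s in zip(weights, cos_t, sin_t))
        scy = sum(w * c * v for w, c, v in zip(weights, cos_t, y))
        ssy = sum(w * s * v for w, s, v in zip(weights, sin_t, y))
        det = scc * sss - scs * scs
        if abs(det) < 1e-12:     # degenerate weights: keep previous estimate
            break
        a = (scy * sss - ssy * scs) / det
        b = (ssy * scc - scy * scs) / det
        resid = [v - a * c - b * s for v, c, s in zip(y, cos_t, sin_t)]
        scale = mad_scale(resid)
        if scale < 1e-12:        # essentially exact fit on the bulk of the data
            break
        cut = TUNING * scale
        weights = [(1 - (e / cut) ** 2) ** 2 if abs(e) < cut else 0.0
                   for e in resid]
    return a, b


def filter_outliers(x, omegas, outlier_idx):
    """Replace flagged observations by median + robustly fitted harmonics."""
    n = len(x)
    centre = median(x)
    resid = [v - centre for v in x]
    fitted = [centre] * n
    for omega in omegas:         # one sine/cosine pair per retained frequency
        a, b = biweight_trig_fit(resid, omega)
        comp = [a * math.cos(omega * t) + b * math.sin(omega * t)
                for t in range(1, n + 1)]
        fitted = [f + c for f, c in zip(fitted, comp)]
        resid = [r - c for r, c in zip(resid, comp)]
    cleaned = list(x)
    for i in outlier_idx:        # only the flagged points are altered
        cleaned[i] = fitted[i]
    return cleaned


# Synthetic example: one harmonic, with the 10th observation contaminated.
n = 48
omega = 2 * math.pi * 4 / n
series = [10 + 3 * math.cos(omega * t) for t in range(1, n + 1)]
series[9] += 25.0
cleaned = filter_outliers(series, [omega], [9])
print(round(cleaned[9], 3))  # close to the uncontaminated value 11.5
```

Only the flagged observations are replaced; all other values are left as observed, so the sample size is unchanged.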
APPLICATIONS
In this study, five different real and well analyzed data sets are used to illustrate the use of the above algorithms. They are: series A, Zadakat data on daily offerings in a local mosque in Ibadan, Oyo State, Nigeria, between 18th February, 2001 and 13th July, 2001; series B, Wolfer's sunspot data, the yearly record of the activities in the solar system from 1749 to 1924, a well analyzed data set from Anderson (1970); series C, batch chemical data, a well analysed data set obtainable from Box and Jenkins (1976); series D, monthly Consumer Price Index data obtained from the annual abstract of statistics of the Federal Office of Statistics (FOS), Lagos, Nigeria (FOS, 1998); and series E, monthly diabetic disease data collected from the University College Hospital (UCH), Ibadan, Nigeria, between January 1974 and February 1986 and used in the detection of outbreaks of epidemics (Osanaiye and Talabi, 1989).
Table 1: The collected data for fixed model (series A) using different methods
Table 2: The collected data for fixed model (series B) using different methods
Table 3: The collected data for fixed model (series C) using different methods
The proposed algorithm is used to diagnose the collected data for outliers using the Spectral method (Shittu and Shangodoyin, 2007). We found that 2 and 8 observations were identified as outliers in series A and B respectively, while 4 observations were identified in series E (Table 1-3). No observations were identified in series C and D. This is not to say that the algorithm cannot work for data with a small sample size, as studies have shown that the procedure performs efficiently in any series where contamination is suspected.
The suspected outliers were either rejected as if they were not part of the series, or excluded (i.e., trimmed). The labeled data were also winsorized by replacing each suspected observation with its largest neighbor.
The proposed accommodation technique (Filtering method) was also applied. This is a method whereby the value of the suspected outlier is replaced with the corresponding value obtained after Fourier transformation, an analogue of the bi-weight filter of Tatum and Hurvich (1993).
Relative Performance of the Accommodation Techniques
Here, the various treatment techniques are applied using the algorithm, with a view to comparing the performance of the existing techniques with one another and with the proposed technique. To do this, the fixed and dynamic models were fitted to the resulting series in order to measure the relative performance of the various outlier accommodation methods.
In this attempt, our interest is not to determine the appropriate model for the data, but to examine the variance and standard error of the parameter estimates before and after treatment. The issue of appropriate modeling technique will be addressed in the next section.
Fixed Model
The simple least squares regression model for univariate sample was used to fit regression lines to all the series.
Table 4: The collected data for fixed model (series D) using different methods
Table 5: The collected data for fixed model (series E) using different methods
Least squares fitting is not resistant to outliers, and neither are the fitted slope estimates (http://www.basc.nwu.edu/statguidefiles/ancovaassviol.html). The results of our findings are given in Table 1-5 in the Appendix.
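To make the comparison criterion concrete, the sketch below fits a least squares line and reports the residual standard error and the standard error of the slope, before and after a suspected outlier is treated. It assumes that each series is regressed on the time index t = 1, ..., n, which the text does not state explicitly; the data and the neighbor-average treatment are purely hypothetical.

```python
import math


def ols_line(y):
    """Least squares line y_t = intercept + slope * t for t = 1..n.

    Returns (slope, intercept, residual standard error, slope standard error).
    """
    n = len(y)
    times = range(1, n + 1)
    t_bar = (n + 1) / 2
    y_bar = sum(y) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (v - y_bar) for t, v in zip(times, y))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    rss = sum((v - intercept - slope * t) ** 2 for t, v in zip(times, y))
    resid_se = math.sqrt(rss / (n - 2))
    slope_se = resid_se / math.sqrt(sxx)
    return slope, intercept, resid_se, slope_se


# Hypothetical series: a noisy trend with one contaminated observation.
raw = [1.0 + 0.5 * t + (0.1 if t % 2 else -0.1) for t in range(1, 21)]
raw[9] += 8.0
treated = list(raw)
treated[9] = (raw[8] + raw[10]) / 2   # stand-in for any accommodation method

print(ols_line(raw)[2:])      # residual SE and slope SE before treatment
print(ols_line(treated)[2:])  # the same quantities after treatment
```

A treatment method is judged better, in the sense used in this section, when it yields smaller standard errors for the same fitted model.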
Series: A
It can be seen from Table 1 that the simple regression model was appropriate for the original data, as indicated by the p-value (p = 0.0145). After treatment for outliers, the underlying structure was no longer described by the simple linear regression model, as indicated by the p-values.
It should be noted that the largest reduction was observed with the filtering method.
The above shows that the occurrence of outliers inflated the error structure of the model as well as that of the parameter estimates, which led to model mis-specification.
Series: B
It is well known that the simple regression model is not appropriate for the sunspot data (Series B) (Hathaway and Wilson, 2004). However, there was a significant reduction in the standard error of the estimates and in that of the fitted model, as shown in Table 2. It is expected that the same or a greater percentage reduction in the residual error could be achieved if the most appropriate model were fitted to the original data set.
Series: C and D
No treatment was carried out on series C and D, as none of their observations were labeled as outliers. Incidentally, the models fitted to the two series were good, as indicated by their p-values (Table 3, 4).
Series: E
In Table 5 in the Appendix, reductions in the standard error of the parameter estimates and the standard deviation of the model were noticed with the winsorizing and filtering methods in a tie, i.e., equal performance.
Dynamic Model
Again the performance of the various outlier treatment techniques was examined under dynamic systems. The model identification tools (ACF and PACF) were used to fit appropriate ARMA (p, q) model to all the series in this study. The contaminated series A, B, E and F are then modeled after being treated with different treatment techniques.
A summary of the results of the analysis is given in Table 6-10 in the Appendix.
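As a rough illustration of the dynamic case, the sketch below fits only an AR(1) model by conditional least squares and reports the residual standard error and the standard error of the autoregressive coefficient. This is a deliberate simplification: the study identifies ARMA (p, q) models from the ACF and PACF, which this example does not attempt. The function name, the simulated-looking data and the neighbor-average treatment are hypothetical.

```python
import math


def fit_ar1(x):
    """Conditional least squares fit of x_t = const + phi * x_(t-1) + e_t.

    Returns (phi, const, residual standard error, standard error of phi).
    """
    prev, curr = x[:-1], x[1:]
    m = len(curr)
    prev_bar = sum(prev) / m
    curr_bar = sum(curr) / m
    sxx = sum((p - prev_bar) ** 2 for p in prev)
    sxy = sum((p - prev_bar) * (q - curr_bar) for p, q in zip(prev, curr))
    phi = sxy / sxx
    const = curr_bar - phi * prev_bar
    rss = sum((q - const - phi * p) ** 2 for p, q in zip(prev, curr))
    resid_se = math.sqrt(rss / (m - 2))
    phi_se = resid_se / math.sqrt(sxx)
    return phi, const, resid_se, phi_se


# Hypothetical AR(1)-like series with a fixed disturbance pattern.
shocks = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1] * 3
raw = [5.0]
for e in shocks:
    raw.append(2.0 + 0.6 * raw[-1] + e)
raw[15] += 6.0                            # contaminate one observation
treated = list(raw)
treated[15] = (raw[14] + raw[16]) / 2     # stand-in for any treatment method

print(fit_ar1(raw)[2:])      # residual SE and SE of phi before treatment
print(fit_ar1(treated)[2:])  # the same quantities after treatment
```

The same before-and-after comparison of standard errors can be repeated for each treatment method applied to a contaminated series.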
Table 6: The collected data for dynamic model (series A) using different methods
Table 7: The collected data for dynamic model (series B) using different methods
Table 8: The collected data for dynamic model (series C) using different methods
Table 9: The collected data for dynamic model (series D) using different methods
Table 10: The collected data for dynamic model (series E) using different methods
Series: A
In series A, the appropriate model is ARMA (1, 1). As in the fixed model, substantial reductions in the standard error of the estimates were achieved after treatment, with the filtering method performing best (Table 6).
Series: B
The autoregressive model of order one [AR(1)] was found most suitable for series B. As in series A, reductions in the standard errors were observed, with the winsorizing method performing better than all other techniques.
Series: C and D
ARMA(1,1) and AR(1) were found to be most appropriate for series C and D respectively. Since neither series was contaminated with outliers, no treatment was required.
Series: E
The model that best describes the underlying structure of series E was found to be ARMA(1, 1). As in the fixed model, reductions were noticed in the standard error of the model as well as in the standard error of the estimates, with the filtering method performing better than all other traditional methods.
Having noticed that the filtering method performs better in 3 out of 4 series in the dynamic modeling and about the same performance in the fixed model, the filtering method is hereby declared as the best method of treatment for outlier contaminated series.
DISCUSSION
The procedures are based on simple techniques and can be used as a data cleaning device in spectral estimation and robust time series analysis.
The performance of the various accommodation techniques was determined with respect to the fixed and dynamic models. It was discovered that the new method of accommodating outliers (the filtering method) is best in terms of the residual error of the filtered data as well as the standard error of the estimates (Table 1-5 for the fixed model and Table 6-10 for the dynamic model in the Appendix). The filtering method performs best in all the series except Series E, where the performance of the Winsorizing and Filtering methods is almost the same.
CONCLUSIONS
Based on our findings, we conclude that the proposed filtering method of accommodating outliers is considered best among the existing methods in terms of the residual error of the filtered data as well as the precision of the estimates.
This implies that forecast values obtained from the filtered data will be more accurate and reliable, and thus can be used for meaningful planning and control of events and programmes.
There is no need for lengthy computation if there is sufficient information on the existence of outliers. However, more often than not, little or no information is provided on the occurrence of outliers, and the number of outliers present in a set of data cannot be determined a priori. It is therefore recommended that every data set, especially time series data, be diagnosed for outliers using the proposed algorithms, which have been found to be more efficient than other traditional techniques, before further analysis is carried out. Detected outlier(s) should be accommodated by the filtering method, which has been established to be the most efficient technique and is capable of guaranteeing the integrity of the data.
REFERENCES
- Abraham, B. and G.E.P. Box, 1979. Bayesian analysis of some outlier problems in time series. Biometrika, 66: 229-236.
- Atkinson, A., 1981. Two graphical displays for outlying and influential observations in regression. Biometrika, 68: 13-20.
- Collett, W. and T. Lewis, 1976. The subjective nature of outlier rejection procedures. Applied Stat., 25: 228-237.
- Hathaway, D. and R. Wilson, 2004. What the sunspot record tells us about space climate. J. Solar Phys., 224: 5-19.
- Hawkins, D.M., 1980. Identification of Outliers (Monographs on Statistics and Applied Probability). 1st Edn., Springer, London, ISBN-10: 041221900X.
- Hoaglin, D.C. and B. Iglewicz, 1987. Fine-tuning some resistant rules for outlier labeling. J. Am. Stat. Assoc., 82: 1147-1149.
- Martin, R.D. and V.J. Yohai, 1988. Influence functionals for time series. Ann. Stat., 14: 781-818.
- Osanaiye, P.A. and C.O. Talabi, 1989. On some non-manufacturing applications of counted data cumulative sum (CUSUM) control chart schemes. Statistician, 38: 251-257.
- Stone, E.J., 1968. On the rejection of discordant observations. Monthly Notices R. Astronomical Soc., 28: 165-168.
- Tatum, L.G. and C.M. Hurvich, 1993. High breakdown methods of time series analysis. J. R. Stat. Soc., 55: 881-896.
- Tsay, R., 1988. Outliers, level shifts and variance changes in time series. J. Forecasting, 7: 1-20.
- Tsay, R., D. Pena and Pankratz, 2000. Outliers in multivariate time series. Biometrika, 87: 789-804.