Detection of Outliers in Time Series Data: A Frequency Domain Approach

Shittu, O.I.; Shangodoyin, D.K.

ABSTRACT

We consider the identification and detection of outliers in the frequency domain using the spectral method. Assuming both additive and multiplicative effects of outliers on a series, the parameters of the model were estimated using the maximum likelihood method, with a view to measuring the effect of the suspected outlier on the parameters of the series. The occurrence of outliers led to a shift in the phase and amplitude of the Fourier series and thus affected the periodogram estimates. Furthermore, detection of aberrant observations is more exact in the frequency domain than in the time domain.


INTRODUCTION

Considerable attention has been devoted to the detection of outliers in discrete univariate time series, with most techniques developed for univariate samples in the time domain. Fox (1972) and Rosner (1975) began the study of outlier detection, and Hoaglin and Iglewicz (1987) worked on resistant rules for labeling outliers. Chang and Tiao (1983) introduced the additive (AO) and innovative (IO) outlier models. These were further developed by Shangodoyin and Shittu (2000), and the multiplicative (MO) and convolution (CO) models were proposed using the model identification tools (ACF and PACF). However, in almost all the time domain techniques (Tsay, 1986; Shangodoyin and Shittu, 2003), outliers were found to have some degree of smearing or swamping effect on other regular observations in the series. Also, most economic and social data are no longer linear but continuous in nature, just as data in physics, engineering and medicine are of the continuous type, and such data can be analyzed in the frequency domain.

In this research, we determine the occurrence of outliers in time series data that follow a Gaussian process with a continuous spectrum, using the spectral method of analysis. An algorithm that uses the robust trigonometric regression of Tatum and Hurvich (1993) is proposed. The parameters of the model for the contaminated series are estimated by the maximum likelihood method, with a view to comparing them with those obtained by the least squares method of Priestley (1981) and Brillinger (1981). We also assume additive and multiplicative effects of outliers on the observed process. The impact of outliers on the observed values is estimated, as is the location of the suspected outlier, using a proposed algorithm based on the repeated median transform of Siegel (1982).

ESTIMATION OF PARAMETER USING THE MAXIMUM LIKELIHOOD TECHNIQUE

Here, we estimate the parameters of the model using the maximum likelihood technique, with a view to comparing them with those obtained by the least squares method in the literature (Priestley, 1981; Brillinger, 1981).

Let X_t be any periodic stochastic process with period 2π with Fourier representation as:

(1)

Where:

ω_i:	The Fourier frequency
φ_i:	The phase, uniformly distributed on (0, 2π)
R_i:	The amplitude
ε_t:	The random error term, NID(0, σ²)

Equation 1 can be re-written as:

(2)

A_i = R_i cos φ_i and B_i = R_i sin φ_i are parameters to be estimated and ε_t is a purely random process, normal and independently distributed with:

Where, σ_ε² is a further unknown parameter and

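Since ε_t ~ NID(0, σ²), the density function of ε_t is:

$$f(\varepsilon_t) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\varepsilon_t^2}{2\sigma^2}\right) \tag{3}$$

With a corresponding likelihood function for N observations:

$$L = \left(2\pi\sigma^2\right)^{-N/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{t=1}^{N}\varepsilon_t^2\right)$$

and log-likelihood:

$$\ln L = -\frac{N}{2}\ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{t=1}^{N}\varepsilon_t^2 \tag{4}$$

where ε_t is the deviation of X_t from the harmonic terms in (2).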

The maximum likelihood estimates of A_0, A_i and B_i are:

(5)

(6)

and

(7)

It can be shown that the estimates are unbiased with variances:

and

Where:

is the unbiased estimate of the residual variance.
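The estimation step above can be illustrated with a minimal Python sketch. It assumes the sine-cosine form X_t = A_0 + Σ_i (A_i cos ω_i t + B_i sin ω_i t) + ε_t, Fourier frequencies ω_i = 2πi/N with i < N/2, and time index t = 1, ..., N. Under those assumptions the regressors are orthogonal, and the Gaussian maximum likelihood estimates of the coefficients coincide with the least squares ones and have the closed forms used below. The function name `harmonic_mle`, the argument `k` (number of harmonics) and the residual-variance divisor N - 2k - 1 are choices made for this illustration, not taken from the paper.

```python
import numpy as np

def harmonic_mle(x, k):
    """Estimate A_0, A_i, B_i (i = 1..k) and the residual variance.

    Assumed model (not reproduced from the paper's equations):
        x_t = A_0 + sum_i (A_i cos(w_i t) + B_i sin(w_i t)) + e_t,
    with w_i = 2*pi*i/N, t = 1..N and e_t ~ NID(0, sigma^2).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if not 1 <= k < n / 2:
        raise ValueError("k must satisfy 1 <= k < N/2")
    t = np.arange(1, n + 1)
    w = 2.0 * np.pi * np.arange(1, k + 1) / n
    cos_terms = np.cos(np.outer(w, t))  # shape (k, N)
    sin_terms = np.sin(np.outer(w, t))  # shape (k, N)

    a0 = x.mean()                       # grand mean
    a = (2.0 / n) * (cos_terms @ x)     # cosine coefficients
    b = (2.0 / n) * (sin_terms @ x)     # sine coefficients

    fitted = a0 + a @ cos_terms + b @ sin_terms
    resid = x - fitted
    # Divisor N - 2k - 1 (observations minus fitted parameters) is assumed here.
    sigma2 = (resid @ resid) / (n - 2 * k - 1)
    return a0, a, b, sigma2
```

Calling `harmonic_mle(x, k=4)` on a series of length 64 returns the grand mean, two arrays of four coefficients each, and the residual variance estimate.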

ESTIMATION OF PARAMETERS OF A CONTAMINATED SERIES

Our focus here is to derive estimates of the parameters of outlier contaminated series.

The Additive Model
Suppose outliers have an additive effect on a series. We assume the additive outlier generating model of Tsay (1986).

The additive model is given by:

(8)

Where X_t is the observed series; Z_t is the outlier-free series; D is the magnitude of the outlier; and the time indicator of the outlier equals 1 at the outlier time t = T and 0 otherwise.

Using (8) in (4) gives the maximum likelihood function:

(9)

and the log-likelihood function:

The maximum likelihood estimate of A_0 is:

(10)

The estimate of the magnitude of the outlier is:

(11)

at therefore:

(12)

and for reasons of orthogonality when there is no outlier:

(13)

The maximum likelihood estimates of A_i and B_i are:

(14)

and

(15)

It can be observed from (12), (14) and (15) that, under the AO model, the occurrence of an outlier influences only the estimate of A_0 (the grand mean) and has no influence on the estimates of A_i and B_i. However, the influence on the estimate of A_0 could be monotone increasing or decreasing depending on whether D is positive or negative.

Thus, under the AO model, the occurrence of an outlier in a series does not affect the periodogram.
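As a small illustration of the grand-mean effect described above, the sketch below contaminates a series with a single additive outlier and measures the change in the sample mean. It assumes the additive form X_t = Z_t + D at t = T and X_t = Z_t elsewhere, which is one reading of the indicator-based model in (8). The function name `add_additive_outlier` and the simulated series are hypothetical. The sketch checks only the grand mean and makes no claim about the other coefficients.

```python
import numpy as np

def add_additive_outlier(z, T, D):
    """Return a copy of z with D added at time T (1-based index).

    Assumed contamination model: x_t = z_t + D if t == T, else x_t = z_t.
    """
    x = np.array(z, dtype=float)
    if not 1 <= T <= x.size:
        raise ValueError("T must lie between 1 and N")
    x[T - 1] += D
    return x

# Hypothetical outlier-free series, used for illustration only.
rng = np.random.default_rng(0)
z = rng.normal(loc=10.0, scale=1.0, size=50)

x = add_additive_outlier(z, T=20, D=8.0)

# The sample mean moves by D / N: upward for D > 0, downward for D < 0.
shift = x.mean() - z.mean()
```

With N = 50 and D = 8, the sample mean moves by D/N, so the direction of the shift follows the sign of D.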

The Multiplicative Model (MO)
Suppose outliers have a multiplicative effect on a data set. Without loss of generality, we consider the multiplicative outlier generating model (MO) of Shangodoyin and Shittu (2003):

(16)

Where X_t is the observed series; Z_t is the outlier-free series; D is the magnitude of the outlier; and the time indicator of the outlier equals 1 at the outlier time t = T and 0 otherwise.

Using (16) in (4) we have the log-likelihood function:

(17)

and the estimate of:

(18)

(19)

While the MLE of the magnitude of outlier for the multiplicative model is:

(20)

Where:

the Normalized Periodogram.

The maximum likelihood estimates of A_i and B_i are:

(21)

and

(22)

However, when there is no outlier, that is, when t ≠ T and the time indicator is 0, the estimates of A_i and B_i are

and

respectively, which are also unbiased.

DETECTION OF OUTLIERS USING THE DERIVED ESTIMATES

The derived estimates are now used to diagnose suspected outliers using the following proposed algorithm.

Algorithm I (Detection of Outlier)
If observations X₁, X₂, ..., X_N can be expressed as a sum of sine and cosine waves as in (2), which can be written as:

use any spreadsheet package, such as Microsoft Excel, to:

•	Obtain the estimates of the Fourier frequencies ω_i for k = 1, 2, ..., N/2 and i = 1, 2, ..., k, and hence the periodogram I_N(ω_i) for all ω in the range by

Where, α(ω) and β(ω) are as defined in (9) and (10).

If is very close to its true value then and will also be close to respectively, hence the squared amplitude will be non-zero. However, if is substantially far from its expected value the periodogram will be close to zero.

•	Determine the values of ω_i, i = 1, 2, ..., k, whose squared amplitude is non-zero.
•	Obtain the residual variance and compute the test statistic:

•	Determine λ_F = Max_(1<F<N) λ_i
•	For all λ_Fi > C, where C is the critical value simulated as 1.00 or 1.10, the observation X_t^s corresponding to λ_Fi is declared an outlier for the Fourier frequencies ω_i, i = 1, 2, ..., k, whose squared amplitude is non-zero.
•	Use the repeated median filter of Siegel (1982) and, for all ω_i ≠ 0, compute the estimated discrete Fourier transform:

•	X_t^F gives the cleaned data set from which the contamination (outlier) has been removed.
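The periodogram step of Algorithm I can also be carried out outside a spreadsheet. The following Python sketch computes the periodogram at the Fourier frequencies ω_k = 2πk/N, k = 1, ..., N/2, and picks out the frequencies whose ordinate is not negligible. The normalization I(ω) = (2/N)|Σ_t X_t e^{-iωt}|² is an assumption, since the paper's own definition is not reproduced here. The function names and the tolerance `tol` are hypothetical. The test statistic λ and the repeated median step are not implemented.

```python
import numpy as np

def periodogram(x):
    """Return Fourier frequencies w_k = 2*pi*k/N (k = 1..N//2) and I(w_k).

    Assumed normalization: I(w) = (2/N) * |sum_t x_t * exp(-1j*w*t)|**2.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if n < 2:
        raise ValueError("need at least two observations")
    k = np.arange(1, n // 2 + 1)
    w = 2.0 * np.pi * k / n
    dft = np.fft.rfft(x)[1:]            # drop the zero-frequency bin
    ordinates = (2.0 / n) * np.abs(dft) ** 2
    return w, ordinates

def nonzero_frequencies(w, ordinates, tol=1e-8):
    """Frequencies whose ordinate exceeds tol times the largest ordinate.

    tol is an illustrative threshold, not a value from the paper.
    """
    ordinates = np.asarray(ordinates, dtype=float)
    if ordinates.max() <= 0.0:
        return np.asarray(w)[:0]
    return np.asarray(w)[ordinates > tol * ordinates.max()]
```

For a noise-free cosine at the third Fourier frequency, `nonzero_frequencies` returns that single frequency; for real data the tolerance would need to be chosen with the noise level in mind.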

DATA ANALYSIS

To show the use of the above algorithm, five different natural and well-analysed data sets were used. They are series A: Zadakat data from a local mosque in Nigeria; series B: Wolfer sunspot data, a record of activities in the solar system; series C: batch chemical data; series D: well-analysed data from Box and Jenkins (1976); Nigerian Consumer Price Index data obtained from the Federal Office of Statistics; and series E: diabetic patient data from the University Teaching Hospital, Ibadan, Nigeria.

Algorithm I was used to diagnose the collected data for outliers using the Microsoft Excel package, and the results are summarized in Tables 1-3.

Table 1:	The timing and magnitude of outliers

Table 2:	The timing and magnitude of outliers

Table 3:	The timing and magnitude of outliers

It should be noted that no outliers were detected in series C (N = 48) and series D (N = 70).

CONCLUSIONS

The estimates derived using the maximum likelihood method (MLE) compare favourably with those of the least squares method (LSM). This confirms the remark made by Priestley (1981) that the maximum likelihood method is an asymptotically fully efficient method of estimation. It was found that the contamination has no influence on the estimates of A_i and B_i for the AO model. However, the influence on the estimate of A_0 could be monotone increasing or decreasing depending on whether D is positive or negative. For the multiplicative model, the influence on the parameter estimates was noticeable under the null hypothesis that there is contamination in the series.

We found that 2, 6 and 4 observations were identified as outliers in series A, B and E, respectively, as shown in Tables 1-3, while no observations were identified in series C and D. This is not to say that the algorithm cannot work for small samples (i.e., n < 100), as studies have shown that the procedure performs efficiently in any series where contamination is apparent.

It was also observed that, using the spectral method of analysis in the frequency domain, the detection of aberrant observations was more exact than with techniques in the time domain.

With the robust repeated median transform, it can also be observed that the issue of swamping or masking does not arise, as outlying observations can be detected more exactly. However, the robust repeated median transform technique is more complex and involves many iterations; it is also more computationally intensive than other techniques.

RECOMMENDATIONS

Because the number of outliers present in a set of data cannot be determined a priori, it is recommended that every set of data, especially time series data, be diagnosed for outliers; any detected outlier should be treated or accommodated by a known method before further analysis is carried out.

Future research should emphasize the identification and detection of outliers in multivariate and categorical data, as well as the extension of multiple outlier detection techniques to the frequency domain.

REFERENCES

Box, G.E.P. and G.M. Jenkins, 1976. Time Series Analysis: Forecasting and Control. 1st Edn., Holden-Day, San Francisco, California, ISBN: 0-8162-1104-3.
Brillinger, D.R., 1981. Time Series: Data Analysis and Theory. Expanded Edn., McGraw-Hill, Inc., New York.
Chang, I. and G.C. Tiao, 1983. Estimation of time series parameters in the presence of outliers. Technical Report 8, University of Chicago, Statistics Research Center.
Fox, A.J., 1972. Outliers in time series. J. R. Stat. Soc. Ser. B, 34: 350-363.
Hoaglin, D.C. and B. Iglewicz, 1987. Fine-tuning some resistant rules for outlier labeling. J. Am. Stat. Assoc., 82: 1147-1149.
Priestley, M.B., 1981. Spectral Analysis and Time Series. 1st Edn., Academic Press, London, ISBN-10: 0125649010, pp: 653.
Rosner, B., 1975. On the detection of many outliers. Technometrics, 17: 221-227.
Shangodoyin, D.K. and O.I. Shittu, 2000. Some recent advances in multiple outlier detection technique. J. Sci. Res., 6: 12-19.
Shangodoyin, D.K. and O.I. Shittu, 2003. Single outlier generating models-new strategies. J. Sci. Eng. Technol., 10: 5166-5177.
Siegel, A.F., 1982. Robust regression using repeated medians. Biometrika, 69: 240-244.
Tatum, L.G. and C.M. Hurvich, 1993. High breakdown methods of time series analysis. J. R. Stat. Soc. Ser. B, 55: 881-896.
Tsay, R.S., 1986. Time series model specification in the presence of outliers. J. Am. Stat. Assoc., 81: 132-141.

Asian Journal of Scientific Research

Research Article
