INTRODUCTION
ARIMA models, also called Box-Jenkins models, are the most general class of models for forecasting a time series (Box and Jenkins, 1976). A restrictive aspect of these models is their linear structure. In practice, the relationship between variables is rarely linear (Granger and Terasvirta, 1993), and linear models are not efficient for such problems.
Artificial Neural Networks (ANN) are a class of flexible nonlinear models that can discover patterns adaptively from the data. Theoretically, it has been shown that, given an appropriate number of nonlinear processing units, ANN can learn from experience and estimate any complex functional relationship with high accuracy. Recently, numerous successful applications have established their role in time series forecasting problems such as prediction of electric demand, prediction of breast cancer survivability, future evolution of electricity markets and modeling of the rainfall-runoff process (Abraham and Nath, 2001; Delen et al., 2005; Gareta et al., 2006; Srinivasulu and Jain, 2006). This flexibility is one of the main reasons that ANN models can produce more effective results than linear models in classification, pattern recognition and forecasting problems. ANN also require no prior knowledge or information about the system of interest. Therefore, using ANN models to capture the nonlinear structure in the data may be helpful (Zhang et al., 2001).
Recently, nonparametric regression methods have become a very useful tool for nonlinear data such as time series (Ferreira et al., 2000). However, these approaches do not perform well when trend and seasonality are present. To overcome this problem, two alternative methods are considered in this study. In both approaches the trend is specified nonparametrically, but the specification of the seasonal component differs. First, we consider a partial linear model in which the parametric part is a dummy-variable specification for the seasonality. Second, we take the seasonal component to be a smooth function of time, so that the model falls within the class of additive models. Nonparametric regression models are discussed in detail by Hastie and Tibshirani (1999), Wahba (1990), Hardle (1991), Green and Silverman (1994) and Hardle et al. (2004).
Several forecasting techniques can be used for time series data that include trend and seasonality. We considered those that are more flexible than linear models. One such technique is the class of nonparametric regression models analyzed by Ferreira et al. (2000); another is the class of ANN models used for time series forecasting. The purpose of this study is to compare several forecasting models and to examine whether a more complex model works more effectively in forecasting a trend and seasonal time series.
MATERIALS AND METHODS
Monthly data on tourist arrivals to Turkey have been used for model estimation and evaluation. The artificial neural network models and the nonparametric regression models, namely the semiparametric and additive regression models, are discussed below in turn.
ARTIFICIAL NEURAL NETWORK (ANN) APPROACH
Multi-Layer Perceptrons (MLPs)
The MLP models used in this study are standard three-layer MLPs, called feedforward neural network models, with only one output node and with nodes in adjacent layers fully connected. The activation function for the hidden nodes is the logistic function f_{h}(x) = 1/(1+exp(–x)) and, for the output node, the linear function. Bias (or intercept) terms are used in both the hidden and output layers. The MLP model can be expressed as:

y_{k} = f_{o}( Σ_{j=1}^{p} w_{jk} f_{h}( Σ_{i=1}^{n} w_{ij} x_{i} + b_{j} ) + b_{k} )    (1)

where, y_{k} represents the kth output value, w_{ij} denotes the weight for the connection between the ith input and jth hidden unit, w_{jk} denotes the weight for the connection between the jth hidden and kth output unit, b_{j} denotes the bias for the jth hidden unit, b_{k} denotes the bias for the kth output unit, f_{h}(.) is the activation function applied to the hidden units and f_{o}(.) is the activation function applied to the output units.
Back-Propagation (BP) is the most widespread method for training multilayer feedforward neural networks, based on the Widrow-Hoff training rule (Bishop, 1995; Haykin, 1999). The main idea is to adjust the weights and the biases so as to minimize the sum of square errors by propagating the error back at each step, namely:

SSE = Σ_{t} (y_{t} – ŷ_{t})²

over the first part of the time series, called the training set in neural networks. To minimize the sum of square errors, different BP algorithms are constructed by applying different numerical optimization algorithms from the gradient and Newton method classes. The Conjugate Gradient (CG) algorithms provided by Statistica Neural Networks (SNN) are also employed in training the MLP networks. One member of the CG family is the Scaled Conjugate Gradient (SCG) algorithm (Moller, 1993). The basic idea of SCG is to combine the model trust region approach with the CG approach (Bishop, 1995; Nocedal and Wright, 1999).
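To make the MLP model and the BP idea concrete, the following NumPy sketch implements the forward pass of a three-layer MLP (logistic hidden units, linear output) and a single plain gradient-descent step on the squared error. All function and variable names here are our own illustrative choices; the CG and SCG algorithms used in the study employ more sophisticated update rules than this basic step.

```python
import numpy as np

def logistic(x):
    # Hidden-layer activation f_h(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, W1, b1, W2, b2):
    # Three-layer MLP: logistic hidden units, linear output unit
    h = logistic(W1 @ x + b1)   # hidden activations
    y = W2 @ h + b2             # linear output
    return y, h

def bp_step(x, t, W1, b1, W2, b2, lr=0.1):
    # One back-propagation step on the squared error 0.5 * (y - t)^2
    y, h = mlp_forward(x, W1, b1, W2, b2)
    err = y - t                           # output error
    grad_W2 = np.outer(err, h)            # gradient w.r.t. output weights
    grad_b2 = err
    delta_h = (W2.T @ err) * h * (1 - h)  # error propagated back to hidden layer
    grad_W1 = np.outer(delta_h, x)
    grad_b1 = delta_h
    return (W1 - lr * grad_W1, b1 - lr * grad_b1,
            W2 - lr * grad_W2, b2 - lr * grad_b2)
```

Iterating `bp_step` over all training patterns (and cycling through the data) drives the SSE toward a local minimum.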
Radial Basis Function Networks (RBF)
The RBF network is composed of three layers: an input layer, a hidden layer and an output layer. The hidden layer of an RBF is nonlinear, whereas the output layer is linear. In the RBF, one hidden layer with a sufficient number of units is enough to model a function. The activations of the hidden (radial) units depend on the distance between the input vector and the center vector. Typically, the radial layer has exponential activation functions and the output layer a linear activation function. For an RBF with n inputs, m outputs and p radial units, the output y_{k} corresponding to the input vector x is calculated as:

y_{k}(x) = Σ_{j=1}^{p} w_{kj} φ_{j}(x) + w_{k0} φ_{0}    (2)

where, w_{kj}, j = 1, 2,...,p, are the weights for the kth output unit, φ_{j}(x), j = 1, 2,...,p, is the basis function of the jth radial unit, w_{k0}, k = 1, 2,...,m, are the bias terms for the kth output unit and φ_{0} is an extra basis function with activation value fixed at φ_{0} = 1. Usually, the following Gaussian basis function receives the most attention:

φ_{j}(x) = exp( –||x – μ_{j}||² / (2σ_{j}²) )    (3)
where, μ_{j} = (μ_{j1}, μ_{j2},...,μ_{jn}) is the center vector of φ_{j}(x) and σ_{j} is the deviation (or width) parameter of that function; the basis function of the unit is defined by these two parameters. Equation 2 can be written in matrix notation as:

y = Wφ

where, W = (w_{kj}) and φ = (φ_{j}). As can be seen from Eq. 2, the linear activation function is used in the output layer of the RBF. Training of the RBF proceeds in three stages. In the first stage, the radial basis function centers (in other words, the μ_{j}) are optimized by unsupervised training using all of the training data {x^{(i)}}, i = 1, 2,...,N. Centers can be assigned by a number of algorithms: subsampling, K-means, Kohonen training or Learned Vector Quantization. In the second stage, the parameters σ_{j}, j = 1, 2,...,p, can be assigned by the explicit, isotropic or K-nearest neighbor algorithms. In the third stage, the basis functions obtained are held fixed while the weights of the output units are adjusted and the bias terms are added to the linear sum. The optimum weights are obtained by minimizing the sum of square errors:

E = (1/2) Σ_{i=1}^{N} Σ_{k=1}^{m} [ y_{k}(x^{(i)}) – t_{k}^{(i)} ]²    (4)
In Eq. 4, t_{k}^{(i)} is the target value for output unit k when the network is presented with input vector x^{(i)}, i = 1, 2,...,N. Since the expression in Eq. 4 is a quadratic function of the weights, the optimum weights can be found as the solution of a system of linear equations. The output layer is usually optimized using the pseudo-inverse technique.
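The third training stage can be sketched in NumPy as follows: with the centers and widths held fixed, the Gaussian basis matrix is built and the linear output weights are solved by the pseudo-inverse. This is a minimal illustration with hypothetical function names, not the SNN implementation; the first two stages (center and width selection) are assumed done.

```python
import numpy as np

def gaussian_basis(X, centers, sigmas):
    # phi_j(x) = exp(-||x - mu_j||^2 / (2 * sigma_j^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigmas[None, :] ** 2))

def fit_rbf_output(X, y, centers, sigmas):
    # Solve the linear output weights (including the phi_0 = 1 bias term)
    # by the pseudo-inverse of the basis design matrix.
    Phi = gaussian_basis(X, centers, sigmas)
    Phi = np.hstack([np.ones((len(X), 1)), Phi])  # phi_0 = 1 column
    return np.linalg.pinv(Phi) @ y                # least-squares weights

def rbf_predict(X, w, centers, sigmas):
    Phi = gaussian_basis(X, centers, sigmas)
    Phi = np.hstack([np.ones((len(X), 1)), Phi])
    return Phi @ w
```

Because the error in Eq. 4 is quadratic in the output weights, this single linear solve replaces the iterative training needed for the MLP's output layer.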
An MLP with a given architecture is specified by the weights and biases of its units; an RBF, by the centers and deviations of the radial units together with the weights and biases of the output units. Just as a point in n-dimensional space is given by n coordinates, the number of coordinates of a center equals the number of inputs n. Hence, in the SNN software, the coordinates of a radial unit's center are stored as its weights and the deviation of the radial unit is stored as its bias: the radial weights denote the center point and the radial bias denotes the deviation.
THE ANN APPROACH TO TIME SERIES MODELING
The feedforward ANN method is commonly used for time series modeling and forecasting (Zhang et al., 1998). For a one-hidden-layer network architecture n:p:1 (n, number of inputs; p, number of hidden units; 1, number of outputs), the inputs are the observed values at the n previous time points and the output (target) is the (n+1)th observed value. When the network error function is examined, it can be seen that the ANN is a nonlinear function mapping previous observations (y_{t–1}, y_{t–2},...,y_{t–n}) to the future observation y_{t} (Zhang, 2003):

y_{t} = f(y_{t–1}, y_{t–2},...,y_{t–n}, w) + ε_{t}    (5)

where, (y_{t–1}, y_{t–2},...,y_{t–n}) denote the input values, y_{t} denotes the target (or output) value, w denotes the vector of network weights and ε_{t} denotes the network error at time point t. The predicted value ŷ_{t} is calculated as:

ŷ_{t} = f(y_{t–1}, y_{t–2},...,y_{t–n}, ŵ)
If N observations y_{1}, y_{2},...,y_{N} are available for a time series and a 1-step-ahead forecast is made, the number of training samples is N–n. The first input training sample is (y_{1}, y_{2},...,y_{n}), with y_{n+1} as the target. The second training pattern contains y_{2}, y_{3},...,y_{n+1} as inputs and y_{n+2} as the target. Finally, (y_{N–n}, y_{N–n+1},...,y_{N–1}) and y_{N} are the last input pattern and target, respectively. In the training procedure, with the help of the different BP algorithms, the parameters (weights and biases) of the network are obtained by approaching the minimum of the Sum of Square Error (SSE):

SSE = Σ_{t} (y_{t} – ŷ_{t})²
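The construction of training patterns described above can be sketched as a short NumPy helper (the function name is our own); it turns a series of N observations into N–n input windows of length n with their 1-step-ahead targets.

```python
import numpy as np

def make_patterns(series, n):
    # Build N - n training samples: each input is a window of n consecutive
    # values and the target is the value immediately following the window.
    y = np.asarray(series, dtype=float)
    N = len(y)
    X = np.stack([y[i:i + n] for i in range(N - n)])  # shape (N - n, n)
    t = y[n:]                                         # shape (N - n,)
    return X, t
```

For the tourist-arrival data with n = 12 monthly inputs, 216 observations therefore yield 204 input/target patterns, matching the counts reported later in the study.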
THE NONPARAMETRIC APPROACH TO TIME SERIES PREDICTION
The following basic model form has been considered:

y(t_{i}) = s(t_{i}) + z(t_{i}) + e(t_{i})    (6)

where, the t_{i}'s are spaced in [a, b], s(t_{i}) denotes the seasonal component, z(t_{i}) represents the trend and e(t_{i}) denotes the error terms, with zero mean and common variance σ_{e}². Model 6 can also be written as:

y_{i} = s_{i} + z_{i} + e_{i}    (7)
The following model structure is assumed for the trend:

z_{i} = f(t_{i}) + ε_{i}    (8)

where, f is a smooth function on [a, b] and the ε_{i}'s are assumed to have zero mean and common variance σ_{ε}², different from that of the e_{i}'s.
The main idea is to estimate the functions f and s. The function f is estimated as a smooth function, but the estimation of s differs because of the seasonality (Ferreira et al., 2000). Therefore, two alternative models are considered for the estimation of s. First, a semiparametric model is treated in which the parametric component is a dummy-variable specification for the seasonality. Second, the seasonal component is taken to be a smooth function of time and a nonparametric method is used.
SemiParametric Regression Model
The seasonality is assumed to be built as follows:

s_{i} = Σ_{k=1}^{r} β_{k} D_{ki} + v_{i}    (9)

where, r is the number of annual observations (r = 12) and the v_{i}'s are assumed to have zero mean and common variance σ_{v}², different from the errors in Eq. 7 and 8. The D_{ki}'s are dummy variables that denote the seasonal effects and the β_{k}'s are parametric coefficients. The dummy variables are defined by D_{ki} = 1 if observation i corresponds to the kth month of the year and D_{ki} = 0 otherwise, so that the seasonal effects cancel out when a year is completed (Ferreira et al., 2000). Substituting Eq. 9 and 8 into Eq. 7 yields:

y_{i} = Σ_{k=1}^{r} β_{k} D_{ki} + f(t_{i}) + u_{i}    (10)

where, the u_{i}'s are sums of the random errors, with zero mean and constant variance σ_{u}².
Model 10 is called a semiparametric model because it consists of a parametric linear component and a single nonparametric component. The main purpose is to estimate the parameter vector β and the function f at the sample points t_{1},...,t_{n}. For this aim, two estimation methods, called smoothing spline and regression spline, have been considered.
Estimation with Smoothing Spline Method (SSM)
Estimation of the parameters of interest in Eq. 10 can be performed using a smoothing spline. Here, the parameter vector β and the values of the function f at the sample points t_{1},...,t_{n} are estimated by minimizing the penalized residual sum of squares:

PRSS(β, f) = Σ_{i=1}^{n} [ y_{i} – d_{i}^{T}β – f(t_{i}) ]² + λ ∫ [f''(t)]² dt    (11)

where, f ∈ C²[0, 1] and d_{i} is the ith row of the matrix D. When β = 0, the resulting estimator has the form f̂ = S_{λ}y, where S_{λ} is a known positive-definite (symmetric) smoother matrix that depends on λ and the knots t_{1},...,t_{n} (Wahba, 1990; Eubank, 1999).
For a prespecified value of λ, the corresponding estimators based on Eq. 10 can be obtained as follows (Wahba, 1990). Given a smoother matrix S_{λ}, depending on a smoothing parameter λ, the penalized least squares estimators are:

β̂ = ( D^{T}(I – S_{λ})D )^{–1} D^{T}(I – S_{λ})y
f̂ = S_{λ}( y – Dβ̂ )

Then some criterion function (such as cross validation or generalized cross validation) is evaluated and λ is changed iteratively until the criterion is minimized.
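The two-step estimation of the partial linear model can be sketched in NumPy. As a loud caveat: the simple moving-average smoother below merely stands in for the smoothing-spline smoother matrix S_λ, and all names are our own; the point is only the algebra of the β̂ and f̂ steps.

```python
import numpy as np

def running_mean_smoother(n, k=5):
    # A simple moving-average smoother matrix, standing in for S_lambda.
    S = np.zeros((n, n))
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        S[i, lo:hi] = 1.0 / (hi - lo)
    return S

def partial_spline_fit(y, D, S):
    # beta_hat = (D^T (I - S) D)^{-1} D^T (I - S) y
    # f_hat    = S (y - D beta_hat)
    n = len(y)
    R = np.eye(n) - S
    beta = np.linalg.solve(D.T @ R @ D, D.T @ R @ y)
    f = S @ (y - D @ beta)
    return beta, f
```

Note that a full set of seasonal dummies plus an intercept-preserving smoother makes D^T(I – S)D singular, so one dummy column must be dropped (or the coefficients constrained) for the solve to be well posed.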
Estimation with Regression Spline Method (RSM)
Smoothing splines become less practical when the sample size n is large, because they use n knots and hence require many parameters to be estimated, typically at least as many parameters as observations. Regression splines offer a more general approach to spline fitting. A regression spline is a piecewise polynomial function whose highest-order nonzero derivative takes jumps at fixed knots. Usually, regression splines are smoothed by deleting nonessential knots. Once the knots are selected, a regression spline can be fitted by ordinary least squares. For further discussion on the selection of knots, see Ruppert and Carroll (2000).
f(t_{i}) in Eq. 10 is approximated by:

f(t_{i}) ≈ γ_{0} + γ_{1}t_{i} + ... + γ_{p}t_{i}^{p} + Σ_{k=1}^{K} b_{k}(t_{i} – κ_{k})_{+}^{p}    (12)

where, p≥1 is an integer (the order of the regression spline, usually chosen a priori), b_{1},...,b_{K} are independently and identically distributed (i.i.d.) with zero mean and common variance σ_{b}², (t)_{+} = t if t>0 and 0 otherwise and κ_{1} < ... < κ_{K} are fixed knots (min(t_{i}) < κ_{1} < ... < κ_{K} < max(t_{i})).
Using vector and matrix notation, model 10 can be expressed as:

y = Dβ + Xγ + Zb + η    (13)

where, the ith row of X is (1, t_{i},...,t_{i}^{p}), the ith row of Z is ((t_{i} – κ_{1})_{+}^{p},...,(t_{i} – κ_{K})_{+}^{p}), b = (b_{1},...,b_{K})^{T} is the vector of spline coefficients and η = (η_{1},...,η_{n})^{T} is the vector of random errors. The predicted value ŷ in Eq. 10 is given by:

ŷ = Dβ̂ + Xγ̂ + Zb̂    (14)

The regression spline estimators (β̂, f̂) of (β, f) are defined as the minimizers of:

||y – Dβ – Xγ – Zb||² + λ||b||²    (15)

where, λ>0 is a smoothing parameter as in Eq. 11. As λ→∞, the regression spline converges to a pth degree polynomial fit; as λ→0, it converges to the Ordinary Least Squares (OLS) fitted spline. For a prespecified value of λ, the corresponding estimators based on Eq. 15 can be obtained as follows (Ruppert et al., 2003):

θ̂ = ( C^{T}C + λΛ )^{–1} C^{T}y    (16)

where, C = [D X Z], θ = (β^{T}, γ^{T}, b^{T})^{T} and Λ is a diagonal matrix with zeros everywhere except for ones in the positions corresponding to b.
The smoothing parameter λ and the number of knots K must be selected in implementing the regression spline; however, λ plays the more essential role. See Ruppert (2002) for a detailed discussion of knot selection. The solution can be obtained in S-Plus.
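The penalized regression spline fit can be sketched in a few lines of NumPy: build the truncated power basis and solve the ridge-type system in which only the truncated-power coefficients b_k are penalized. Function names are our own illustrative choices.

```python
import numpy as np

def spline_design(t, knots, p=1):
    # Columns: 1, t, ..., t^p, then (t - kappa_k)_+^p for each knot.
    poly = np.vander(t, p + 1, increasing=True)
    trunc = np.clip(t[:, None] - knots[None, :], 0.0, None) ** p
    return np.hstack([poly, trunc])

def fit_pspline(t, y, knots, lam, p=1):
    # Penalised least squares: minimise ||y - C theta||^2 + lam * ||b||^2,
    # shrinking only the truncated-power coefficients (not the polynomial part).
    C = spline_design(t, knots, p)
    pen = np.zeros(C.shape[1])
    pen[p + 1:] = lam                 # penalty applies to the b_k block only
    theta = np.linalg.solve(C.T @ C + np.diag(pen), C.T @ y)
    return C @ theta, theta
```

Driving `lam` to a very large value shrinks all b_k toward zero, leaving a pth degree polynomial fit, while `lam` near zero reproduces the OLS spline, matching the limiting behavior described above.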
Additive Regression Model (ARM)
The semiparametric model has been used for estimating the parameters in Eq. 10. However, there are situations in which a dummy-variable specification does not capture all fluctuations of the seasonal effect. Therefore, a more general form for the seasonal component is considered here:

s_{i} = g(t_{i}) + v_{i}    (20)

where, g ∈ C²[a, b] and the v_{i}'s denote random error terms with zero mean and common variance σ_{v}². Substituting Eq. 8 and 20 into Eq. 7, y_{i} is obtained as:

y_{i} = f(t_{i}) + g(t_{i}) + u_{i}    (21)

where, the u_{i}'s are random error terms with zero mean and constant variance σ_{u}².
The model presented in Eq. 21 is a fully nonparametric model because the parametric component is absent. Such models are called additive nonparametric regression models. In order to estimate the model in Eq. 21, the criterion in Eq. 11 can be generalized in an obvious way. The estimator of model 21 is based on minimizing the penalized residual sum of squares (Hastie and Tibshirani, 1999):

Σ_{i=1}^{n} [ y_{i} – f(t_{i}) – g(t_{i}) ]² + λ_{1} ∫ [f''(t)]² dt + λ_{2} ∫ [g''(t)]² dt    (22)

where, the first term in Eq. 22 denotes the Residual Sum of Squares (RSS) penalizing the lack of fit, the second term, multiplied by λ_{1}, denotes the roughness penalty for f and the third term, multiplied by λ_{2}, denotes the roughness penalty for g. First, Eq. 22 can be written in matrix form as:

(y – f – g)^{T}(y – f – g) + λ_{1} f^{T}K_{f}f + λ_{2} g^{T}K_{g}g

where, K_{f} is a penalty matrix for f and K_{g} is a penalty matrix for g. Then, by differentiating with respect to f and g and setting the derivatives to zero, the estimators of f and g are defined as:

f̂ = ( I + λ_{1}K_{f} )^{–1}( y – ĝ )
ĝ = ( I + λ_{2}K_{g} )^{–1}( y – f̂ )
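The coupled estimating equations for f and g are usually solved by backfitting: apply each smoother to the partial residuals in turn until convergence. The sketch below uses a discrete second-difference roughness penalty as a stand-in for the smoothing-spline penalty matrices K_f and K_g; the names and this particular penalty are our own illustrative assumptions.

```python
import numpy as np

def second_diff_penalty(n):
    # K = D2^T D2, where D2 is the second-difference operator; f^T K f
    # approximates the integral of the squared second derivative.
    D2 = np.diff(np.eye(n), n=2, axis=0)
    return D2.T @ D2

def backfit_additive(y, lam_f, lam_g, iters=100):
    # Iterate the estimating equations
    #   f <- (I + lam_f K)^{-1} (y - g),  g <- (I + lam_g K)^{-1} (y - f)
    # until the two components stabilise.
    n = len(y)
    K = second_diff_penalty(n)
    Sf = np.linalg.inv(np.eye(n) + lam_f * K)
    Sg = np.linalg.inv(np.eye(n) + lam_g * K)
    f = np.zeros(n)
    g = np.zeros(n)
    for _ in range(iters):
        f = Sf @ (y - g)
        g = Sg @ (y - f)
    return f, g
```

With identical penalties the split between f and g is not identifiable (only their sum is); in practice the two components use different smoothers, e.g. one tuned to the seasonal period and one to the long-run trend.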
MEASURING MODEL PERFORMANCE METHODS
In order to evaluate the predictions of a model against the observations, the following statistical performance measures have been used: the Mean Square Error (MSE), the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE) and the Mean Absolute Percentage Error (MAPE) (Carey and Rob, 2002). The forecast evaluation measures are defined as follows:

MSE = (1/n) Σ_{t} (y_{t} – ŷ_{t})²
RMSE = √MSE
MAE = (1/n) Σ_{t} |y_{t} – ŷ_{t}|
MAPE = (100/n) Σ_{t} |(y_{t} – ŷ_{t}) / y_{t}|

where, y_{t} represents the observed values, ŷ_{t} the forecasted values and n the number of forecasts.
A perfect model would have MSE (or RMSE), MAE and MAPE ≅ 0.00. Of course, because of the influence of random errors, there is no such thing as a perfect model in time series modeling.
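The four performance measures are straightforward to compute; the following helper (the function name is our own) returns all of them at once, using the conventional definitions.

```python
import numpy as np

def forecast_errors(y, y_hat):
    # MSE, RMSE, MAE and MAPE of forecasts y_hat against observations y.
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    e = y - y_hat
    mse = np.mean(e ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(e)),
        "MAPE": 100.0 * np.mean(np.abs(e / y)),  # assumes no zero observations
    }
```

Note that MAPE is undefined when an observed value is zero; this is not an issue for strictly positive series such as the monthly tourist arrivals.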
EXPERIMENTAL EVALUATIONS
Here, a real data set from Turkey is considered for the experiments. Appropriate nonparametric regression and ANN models were chosen through forecasting experiments, and these models were also compared with each other. To conduct these experiments, SNN, S-Plus and R programs were used.
Data Set
The real data set, from the Turkish Statistical Institute (TÜİK), was analyzed by Aslanargun et al. (2007). The data can be found at http://www.tuik.gov.tr and represent the monthly number of tourists coming to Turkey between January 1984 and December 2003. The data set is divided into two parts, for training and forecasting. In the first part, the 216 monthly observations for the January 1984-December 2001 period are used in training to construct the models. In the second part, with the help of the models constructed in the first part, the performances of those models are evaluated using the 24 monthly observations for the January 2002-December 2003 period.
Selection of the Appropriate ANN Models
The 216 monthly observations were used in training the network, and the models were evaluated on the forecasts of the 24 monthly test observations. The best model was selected by the scores obtained from the MSE, RMSE, MAE and MAPE. As the initial weight and bias values of the network were random, 150 replications were run for each network structure and the models giving the best forecasts were determined. Because the monthly tourist arrival data include seasonality, the number of input units was set to 12. During these experiments, various neural network algorithms with a single layer or with one or two hidden layers, and both MLP and RBF models, were applied to the data set. As the initial 12 observations were consumed as inputs because of the seasonality, 204 of the 216 observations were used to adjust the weights. In the training stage of the network, the data were divided into two parts: 132 of the 204 observations were used for training and 72 for validation. This division was used to restrict memorization by the network and provided better forecasts (Bishop, 1995; Haykin, 1999).
Among the ANN models, the MLP (12:3:1) model showed the best performance. The CG algorithm reached its best performance at the 51st epoch. A hyperbolic tangent function is applied in the hidden units and the linear activation function in the output unit. The weights and biases of the MLP (12:3:1) model are shown in Table 1.
Constructing the Appropriate NonParametric Models
The 216 monthly observations covering the January 1984-December 2001 period were used in training the models, namely the semiparametric and additive regression models. For the estimation of these models, the smoothing parameter λ must be selected. In general, λ can be selected using automatic selection methods such as Cross Validation (CV) and Generalized Cross Validation (GCV). In practice, it is reasonable to select λ by specifying the degrees of freedom (df = trace(S_{λ})) for the nonparametric components (Hastie and Tibshirani, 1999). Therefore, the df is used to select the smoothing parameter λ for the smoothing spline. On the other hand, both the smoothing parameter λ and the number of knots K must be selected in implementing the regression spline. The solution is obtained with the S-Plus and R programs.
Table 1: The weights and biases of the MLP (12:3:1). The row and column header numeric terminology first lists the layer, then the unit number within the layer; for example, 2.1 stands for unit 1 in layer 2.

Fig. 1: Observed number of tourists and their estimation values obtained by the appropriate SSM, RSM and ARM
Secondly, we consider an additive regression model in which both the seasonal and the trend components are unknown smooth functions of time, so the λ_{1} and λ_{2} parameters were chosen by specifying the df. The observed number of tourists for the January 1984-December 2001 period and the estimates obtained by the appropriate nonparametric models are shown in Fig. 1.
Figure 1 shows that the data have an upward trend together with seasonal variation. For such series, the main advantage of the nonparametric regression models is that they perform estimation without data loss. In other words, as all 216 initial observations are used to estimate each model, 216 residuals are obtained for each nonparametric model. Consequently, 216 observations were used to estimate the nonparametric regression models, namely the smoothing spline, regression spline and additive regression models.
COMPARISONS OF THE MODELS
The nonparametric regression models obtained with the S-Plus and R programs were evaluated on their forecasts for the January 2002-December 2003 period; that is, the performance values of the nonparametric regression and ANN models were calculated for that period. Table 2 shows, for each model, the values of the MSE, RMSE, MAE and MAPE performance indicators. The MSE, RMSE, MAE and MAPE scores of the MLP model are lower than those of the concurrent specifications; in that sense, the MLP model outperforms the other formulations. Furthermore, as shown in Table 2, the SSM indicated better performance than the RSM and ARM, since the SSM had the lowest scores on the model evaluation criteria among the nonparametric models.
For the test data composed of 24 values, the observed and forecasted values obtained by the different models were calculated, but they are presented only graphically to save space. The observed values and the forecasts produced by the models are shown in Fig. 2.
As shown in Fig. 2, the output from the MLP model shows that the predicted and observed values are very close to each other. However, the output from the RBF model shows a large difference between the actual and predicted observations.
Table 2: Performance values for the selected models. *The model having the best performance

Fig. 2: Observed values and their forecasts by the models
The outputs shown in Fig. 2 support the conclusions drawn from Table 2.
RESULTS AND DISCUSSION
It has been shown that the ANN model gives performance comparable to the ARIMA model over longer time horizons (Zhang, 2003). There have been a great number of studies comparing various neural network and nonparametric methods for time series; for example, SallurRuiz et al. (2008) indicated that the MLP model outperformed other methods. Different ANN and nonparametric methods have been applied to time series data sets, but most of them are designed only to predict the time series. In this study, the methods mentioned here are compared in terms of forecasting performance.
It is known that neural networks perform very well in time series forecasting problems. As can be seen from Table 2, the MLP (12:3:1) performed very well, while the RBF did not perform well enough. This supports the idea that the RBF is usually unsuccessful in extrapolation problems (Bishop, 1995). On the other hand, as shown in Fig. 2, the values forecasted by the ARIMA, SSM, RSM, ARM and MLP (12:3:1) models closely follow the real observed values, whereas the values forecasted by the RBF do not track the observed series. It is also seen that the RBF is not a good predictor for such time series.
Nonparametric regression models can be considered an alternative to ANN models. The SSM showed good empirical performance among the nonparametric models, while the ARM showed the worst performance. As a result, in our opinion the MLP can be useful in time series forecasting problems that include seasonality and trend, and we propose using the MLP especially on these types of series. Furthermore, the SSM can be used as an alternative method to the MLP.