
Research Article


Performance of Multiple Linear Regression Model for Long-term PM_{10} Concentration Prediction Based on Gaseous and Meteorological Parameters

Ahmad Zia Ul-Saufie,
Ahmad Shukri Yahaya,
Nor Azam Ramli
and
Hazrul Abdul Hamid



ABSTRACT

The aim of this study was to investigate the performance of the Multiple Linear Regression (MLR) method in predicting future (next day, next 2 days and next 3 days) PM_{10} concentration levels in Seberang Perai, Malaysia. Models with and without gaseous parameters as inputs were compared. The full model used gaseous parameters (NO_{2}, SO_{2}, CO), PM_{10} and meteorological parameters (temperature, relative humidity and wind speed) as predictors. Performance indicators, namely Prediction Accuracy (PA), Coefficient of Determination (R^{2}), Index of Agreement (IA), Normalized Absolute Error (NAE) and Root Mean Square Error (RMSE), were used to measure the accuracy of the models. The performance indicators were as follows: next day (RMSE = 11.211, NAE = 0.124, PA = 0.927, IA = 0.960, R^{2} = 0.858), next 2 days (RMSE = 14.652, NAE = 0.155, PA = 0.881, IA = 0.925, R^{2} = 0.775) and next 3 days (RMSE = 15.611, NAE = 0.167, PA = 0.849, IA = 0.912, R^{2} = 0.720). Assessment of model performance indicated that the multiple linear regression method can be used for long-term PM_{10} concentration prediction, with the highest accuracy for the next day.





Received: April 07, 2012;
Accepted: June 29, 2012;
Published: August 01, 2012


INTRODUCTION
Particulate Matter (PM) is one of the most important air pollutants in terms
of adverse effects on human health. In Malaysia, there are three major
sources of air pollution, namely mobile sources, stationary sources and open
burning sources (Afroz et al., 2003). Several
studies on the impacts of PM on human health have been published (Alvim-Ferraz
et al., 2005; Brunekreef and Holgate, 2002;
Hoek et al., 2002; Kappos
et al., 2004). PM_{10} concentration is preferable to
SPM for determining air pollution in Malaysia because the Air Pollution Index (API)
is obtained from the measurement of fine particles with an aerodynamic diameter
below 10 μm. The Department of Environment, Malaysia established
the Malaysia Ambient Air Quality Guidelines in 2002, stating that the daily PM_{10}
limit value is 150 μg m^{-3}, while the annual PM_{10} value
should not exceed 50 μg m^{-3} (Department of
Environment Malaysia, 2002). When the PM_{10} concentration level
exceeds the limit values stated in the air quality guidelines, short-term and chronic
human health problems may occur. Statistical modeling could offer good insights
in predicting future air pollution levels (next day, next 2 days and next 3
days).
Multiple linear regression is easy to implement and compute. Many
researchers have used this method as a forecasting tool in multiple disciplines. Chaloulakou
et al. (2003) used this method to investigate the complex relationships
between meteorological and time period parameters and to forecast future PM_{10}
concentrations. In Athens, Grivas and Chaloulakou (2006)
used this method to predict hourly PM_{10} concentrations 24 h in advance
and the results showed that multiple regression models can be used for this
purpose. In Malaysia, Ghazali (2006)
used MLR for PM_{10} concentration level prediction and Ul-Saufie
et al. (2011) compared MLR with a feed forward back propagation neural network for
PM_{10} concentration prediction. However, neither model can be used
for future prediction.
The aim of this study was to investigate the performance of the multiple linear
regression method in predicting future (next day, next 2 days and next 3 days)
PM_{10} concentration levels in Seberang Perai, Malaysia. In addition, this
study compared model performance using meteorological parameters with gases
and meteorological parameters without gases as inputs. Such a model is useful because
it helps the respective authorities to carry out suitable actions to reduce
the impact of air pollution.
MATERIALS AND METHODS
Site description: The Seberang Perai, Pulau Pinang monitoring site is located at Taman Inderawasih (05°23.4704'N, 100°23.1977'E), in the northern part of Peninsular Malaysia. This site is just a few kilometers from an industrial area and is surrounded by busy roads. Hourly observations of PM_{10} in Seberang Perai, Pulau Pinang, Malaysia from January 2004 to December 2007 were selected for PM_{10} concentration level prediction. The hourly observations were transformed into daily data by taking the average PM_{10} concentration level for each day. Relative Humidity (RH), Wind Speed (WS), nitrogen dioxide (NO_{2}), Temperature (T), carbon monoxide (CO), sulphur dioxide (SO_{2}) and previous day PM_{10} were selected as variables to study their influence on PM_{10} concentration.
The wind over the country is generally variable and light. Wind flow patterns can be described by four seasons, namely the northeast monsoon, known as the wet season (November to March), a transitional period (April to May), the southwest monsoon, known as the dry season (June to September), and another transitional period (October to November). Average values for the chosen variables were 6.5 m sec^{-1} (WS), 28°C (T), 75.35% (RH), 0.0061 ppm (SO_{2}), 0.01334 ppm (NO_{2}), 0.4963 ppm (CO) and 67.24 μg m^{-3} (PM_{10}).
Multiple linear regression: Multiple linear regression is one of the modeling techniques used to investigate the relationship between a dependent variable and several independent variables. In the multiple linear regression model, the error term, denoted by ε, is assumed to be normally distributed with mean 0 and constant variance σ^{2}. The error terms are also assumed to be uncorrelated.
We assume that the multiple linear regression model has k independent variables
and that there are n observations. Thus, the regression model can be written as (Kovac-Andric
et al., 2009):
$$y = b_{0} + b_{1}x_{1} + b_{2}x_{2} + \cdots + b_{k}x_{k} + \varepsilon$$
where, b_{i} are the regression coefficients, x_{i} are the independent variables and ε is the error associated with the regression. To estimate the values of the parameters, the least squares method was used.
The Variance Inflation Factor (VIF) was used to study the effect of multicollinearity on the variance of the estimated regression coefficients. The VIF is given by:
$$\mathrm{VIF}_{i} = \frac{1}{1 - R^{2}_{i}}$$
where, VIF_{i} is the variance inflation factor associated with the ith predictor and R^{2}_{i} is the multiple coefficient of determination in a regression of the ith predictor on all other predictors.
The Durbin-Watson (DW) statistic tests for autocorrelation of the residuals. This test is important for checking that the model assumptions are satisfied. The DW statistic is given by:
$$d = \frac{\sum_{i=2}^{n} (\hat{e}_{i} - \hat{e}_{i-1})^{2}}{\sum_{i=1}^{n} \hat{e}_{i}^{2}}$$
where, n is the number of observations and ê_{i} = y_{i} - ŷ_{i} (y_{i} is the observed value and ŷ_{i} is the predicted value). d is the Durbin-Watson statistic and always lies between 0 and 4. A value of d = 2 indicates no autocorrelation in the data, values toward 0 indicate positive autocorrelation and values approaching 4 indicate negative autocorrelation.
Performance indicators: Performance indicators were used to evaluate the goodness of fit of the MLR models for future PM_{10} concentration prediction in Seberang Perai, Pulau Pinang. The performance indicators used to determine the best method for predicting PM_{10} concentration were NAE, RMSE, IA, PA and the coefficient of determination (R^{2}) (Table 1).
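The estimation and diagnostic steps described in this section (least squares fitting, VIF and the Durbin-Watson statistic) can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the SPSS procedure used in the study: the function names (`solve`, `ols_fit`, `predict`, `r_squared`, `vif`, `durbin_watson`) and the toy numbers are assumptions made up for the example.

```python
from typing import List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def ols_fit(x_rows: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Least squares coefficients [b0, b1, ..., bk] from the normal equations."""
    design = [[1.0] + list(row) for row in x_rows]  # leading 1 carries the intercept b0
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coefs: Sequence[float], x_rows: Sequence[Sequence[float]]) -> List[float]:
    """Fitted values b0 + b1*x1 + ... + bk*xk for each row."""
    return [coefs[0] + sum(b * x for b, x in zip(coefs[1:], row)) for row in x_rows]


def r_squared(y: Sequence[float], fitted: Sequence[float]) -> float:
    """Coefficient of determination 1 - SSE / SST."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst


def vif(x_rows: Sequence[Sequence[float]], i: int) -> float:
    """VIF_i = 1 / (1 - R_i^2), regressing predictor i on all other predictors."""
    target = [row[i] for row in x_rows]
    others = [[v for j, v in enumerate(row) if j != i] for row in x_rows]
    fitted = predict(ols_fit(others, target), others)
    return 1.0 / (1.0 - r_squared(target, fitted))


def durbin_watson(residuals: Sequence[float]) -> float:
    """d = sum of squared successive residual differences / sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    return num / sum(e ** 2 for e in residuals)


# Toy data, made up for illustration (not the Seberang Perai measurements).
x_rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]]
y = [6.1, 5.4, 9.2, 10.3, 14.1, 15.4]
coefs = ols_fit(x_rows, y)
residuals = [yi - fi for yi, fi in zip(y, predict(coefs, x_rows))]
print("coefficients:", coefs)
print("VIF:", [vif(x_rows, i) for i in range(2)])
print("Durbin-Watson d:", durbin_watson(residuals))
```

Solving the normal equations directly keeps the sketch dependency-free; for real data with seven predictors, a numerically more robust routine from a statistics package would normally be preferred.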
RESULTS AND DISCUSSION
Multiple linear regression models were developed with 1428 (next day), 1427 (next 2 days) and 1426 (next 3 days) sets of data (average daily data from January 2004 to December 2007) using SPSS version 19.0. These years were selected because of limited access to the data. Table 2 showed the model summary for PM_{10} concentration predictions based on gases and meteorological parameters. The results showed that none of the three future PM_{10} concentration prediction models had problems with multicollinearity, as the values of the Variance Inflation Factor (VIF) were lower than 10. The Durbin-Watson statistic showed that the models did not have any autocorrelation problem for next day (DW = 2.117), next 2 days (DW = 1.160) and next 3 days (DW = 1.043). Table 3 showed the model summary for predicting PM_{10} concentration based on meteorological parameters without gases. The results showed that the models did not exhibit multicollinearity (VIF = 1.257-1.870) or autocorrelation problems (DW = 0.900-2.152), with R^{2} greater than 0.6.
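The way the data sets shrink by one record per additional lead day (1428, 1427, 1426) is what one would expect when predictors on day D are paired with PM_{10} on day D + 1, D + 2 or D + 3. The sketch below illustrates that pairing together with the hourly-to-daily averaging; it assumes a gap-free daily series, and the helper names (`daily_means`, `lead_pairs`) and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def daily_means(hourly: Sequence[Tuple[str, float]]) -> List[float]:
    """Average hourly (ISO date, value) readings into one mean per day, in date order."""
    by_day: Dict[str, List[float]] = defaultdict(list)
    for day, value in hourly:
        by_day[day].append(value)
    return [sum(by_day[d]) / len(by_day[d]) for d in sorted(by_day)]


def lead_pairs(predictor_rows: Sequence, target: Sequence[float], lead: int) -> List[tuple]:
    """Pair the predictors of day D with the target of day D + lead.

    zip stops at the shorter input, so the last `lead` days have no pair.
    """
    return list(zip(predictor_rows, target[lead:]))


# Made-up daily PM10 means (ug m^-3) used both as predictor and as target.
pm10_daily = [65.0, 50.0, 80.0, 75.0, 90.0]
for lead in (1, 2, 3):
    pairs = lead_pairs(pm10_daily, pm10_daily, lead)
    print(f"lead {lead}: {len(pairs)} usable pairs")
```

With five daily values the three lead times leave four, three and two usable pairs, mirroring the one-record drop between the next day, next 2 days and next 3 days data sets.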
PM_{10} levels decreased during strong wind events because the strong
wind dispersed the PM_{10}. The negative correlation between temperature
and PM_{10} was due to the lack of significant temperature fluctuation in Malaysia
(24-32°C). Similar results were found by Yusof et
al. (2008). SO_{2} had a positive correlation with PM_{10}
because most SO_{2} in the area came from petrol-fueled motor vehicle
emissions. In addition, SO_{2} also came from industrial activities
processing materials containing sulfur. For NO_{2} and CO, the main
source of these two gases is diesel-fueled vehicle emissions. Our findings
showed a negative correlation between these two gases and PM_{10} because
there were fewer diesel-fueled vehicle emissions in this area.
Analysis of Variance (ANOVA) was conducted to test whether the models were significantly better at predicting the outcomes than using the mean. Table 4 showed the ANOVA results (gases and meteorological parameters as input). The results indicated that the observed values of F were 1243.152 (next day), 701.940 (next 2 days) and 503.147 (next 3 days), whereas the critical values F_{0.05, 7, 1420}, F_{0.05, 7, 1419} and F_{0.05, 7, 1418} were all less than 2.103. From this result, all regression models were useful as predictors because the observed F ratios were well over four or five times the critical values of F. It also indicated that the models significantly improved our capability to predict PM_{10} concentration. A similar conclusion was reached when only meteorological parameters were used as inputs, as shown in Table 5.
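For a least squares model with an intercept, the overall ANOVA F ratio can also be written in terms of R^{2}, the number of observations n and the number of predictors k, which gives a quick plausibility check on the reported values. The sketch below uses the rounded next-day R^{2} from the abstract; the function name `f_statistic` is an assumption for the example, and the source does not state that its F values were computed this way.

```python
def f_statistic(r2: float, n: int, k: int) -> float:
    """Overall regression F ratio: (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# Next-day model: R^2 = 0.858 (rounded), n = 1428 observations, k = 7 predictors,
# so the residual degrees of freedom are 1428 - 7 - 1 = 1420.
f_next_day = f_statistic(0.858, 1428, 7)
print(round(f_next_day, 1))
```

With the rounded R^{2} this gives roughly 1226, the same order as the reported 1243.152; the gap is plausibly explained by rounding of R^{2}, and either value far exceeds the critical value of about 2.1.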
One of the assumptions of MLR is that the residuals (or errors) are normally distributed
with zero mean and constant variance.
Table 2: Model summary of PM_{10} based on meteorological parameters with gaseous parameters

Table 3: Model summary of PM_{10} based on meteorological parameters without gaseous parameters

Table 4: Results of ANOVA with gaseous and meteorological parameters as input

Table 5: Results of ANOVA with meteorological parameters as input

Residual analysis was very important in determining the adequacy of the statistical
model. If the errors showed any pattern, the model was considered not to have
captured all the systematic information. Figures 1 and 2
showed that the residuals were normally distributed with zero mean for the models.

Fig. 1(a-c): Meteorological parameters based standardized residual analysis of PM_{10} for (a) Next day, (b) Next 2 days and (c) Next 3 days
Figures 3 and 4 depicted that the residuals
were uncorrelated with constant variance, as the residuals were contained in
a horizontal band, and hence there were no obvious defects in the models.
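The residual checks described above are graphical, but the quantity being plotted can be sketched numerically. The example below uses one common definition of a standardized residual (the raw residual divided by the residual standard error, sqrt(SSE / (n - k - 1))); whether the figures use exactly this definition is an assumption, and the function name and toy numbers are made up for illustration.

```python
from math import sqrt
from typing import List, Sequence


def standardized_residuals(
    observed: Sequence[float], fitted: Sequence[float], k: int
) -> List[float]:
    """Residuals divided by the residual standard error sqrt(SSE / (n - k - 1))."""
    residuals = [o - f for o, f in zip(observed, fitted)]
    dof = len(residuals) - k - 1  # n minus the k slopes and the intercept
    s = sqrt(sum(e * e for e in residuals) / dof)
    return [e / s for e in residuals]


# Made-up observed and fitted values for a model with k = 1 predictor.
observed = [10.0, 12.0, 9.0, 11.0, 13.0, 8.0]
fitted = [11.0, 11.0, 10.0, 10.0, 12.0, 9.0]
z = standardized_residuals(observed, fitted, k=1)
mean_z = sum(z) / len(z)
share_within_2 = sum(abs(v) <= 2.0 for v in z) / len(z)
print("mean:", mean_z, "share within +/-2:", share_within_2)
```

A mean close to zero and most values inside roughly ±2 are in line with the zero-mean, constant-variance behaviour that the histograms and fitted-versus-residual plots are meant to reveal; a formal normality test would be a separate step.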
Comparison of performance: Performance indicators were used to compare
performance for future prediction of PM_{10} concentration in Seberang
Perai, Pulau Pinang. Table 6 showed the values of the performance
indicators. The accuracy measures were prediction accuracy, coefficient of determination
and index of agreement, while the error measures were normalized absolute error
and root mean square error. The performance indicators reflected greater accuracy
in next-day PM_{10} concentration prediction compared to the next 2-day
and next 3-day predictions.
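Since the formulas of Table 1 are not reproduced in this text, the sketch below uses commonly used forms of the five indicators: RMSE with a 1/n divisor, NAE as total absolute error over total observed concentration, PA taken as the Pearson correlation between predicted and observed values, R^{2} as its square and IA in Willmott's form. These definitions are assumptions and may differ in detail (for example in the divisor) from those used in the study. The reported PA values are close to the square root of the reported R^{2} values (0.927^{2} ≈ 0.859 against R^{2} = 0.858), which is consistent with this reading but does not confirm it.

```python
from math import sqrt
from typing import Sequence


def rmse(pred: Sequence[float], obs: Sequence[float]) -> float:
    """Root mean square error (1/n divisor assumed)."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def nae(pred: Sequence[float], obs: Sequence[float]) -> float:
    """Normalized absolute error: total absolute error over total observed."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / sum(obs)


def pearson_r(pred: Sequence[float], obs: Sequence[float]) -> float:
    """Pearson correlation, used here as a stand-in for prediction accuracy (PA)."""
    mp, mo = sum(pred) / len(pred), sum(obs) / len(obs)
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov / sqrt(var_p * var_o)


def index_of_agreement(pred: Sequence[float], obs: Sequence[float]) -> float:
    """Willmott-style index of agreement, between 0 and 1 (1 is a perfect match)."""
    mo = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for p, o in zip(pred, obs))
    den = sum((abs(p - mo) + abs(o - mo)) ** 2 for p, o in zip(pred, obs))
    return 1.0 - num / den


# Made-up observed and predicted PM10 values (ug m^-3) for illustration only.
obs = [50.0, 60.0, 70.0, 80.0]
pred = [52.0, 58.0, 71.0, 79.0]
pa = pearson_r(pred, obs)
print("RMSE:", rmse(pred, obs))
print("NAE:", nae(pred, obs))
print("PA:", pa, "R^2:", pa ** 2)
print("IA:", index_of_agreement(pred, obs))
```

Under these definitions, values of NAE and RMSE closer to 0 and values of PA, R^{2} and IA closer to 1 indicate a better model, which is how the indicators are read in the comparison above.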
Table 6: Performance indicators for future PM_{10} concentration prediction

1: Based on meteorological parameters (WS, T, RH) and PM_{10}, 2: Based on meteorological parameters (WS, T, RH), gaseous parameters (CO, NO_{2}, SO_{2}) and PM_{10}
However, the results showed that MLR could predict future PM_{10} concentration
up to the next 3 days. Index of agreement values greater than 0.9 indicated
that the predicted values were highly accurate up to the next 3 days. Table 6
also showed the comparison between different parameters as inputs.

Fig. 2(a-c): Gaseous and meteorological parameters based standardized residual analysis of PM_{10} for (a) Next day, (b) Next 2 days and (c) Next 3 days

Fig. 3(a-c): Correlation of fitted values with residuals of PM_{10} for (a) Next day, (b) Next 2 days and (c) Next 3 days

Fig. 4(a-c): Correlation of fitted values with residuals of PM_{10} for (a) Next day, (b) Next 2 days and (c) Next 3 days

Table 7: Comparison of results with other researchers using multiple linear regression

The results showed that meteorological parameters with gases as inputs performed
better than meteorological parameters without gases. However, all the models
could be utilized for PM_{10} concentration prediction as the values
of prediction accuracy were greater than 0.8.
Various researchers have used multiple linear regression for predicting PM_{10} concentration. Their results show Coefficients of Determination (R^{2}) between 0.53 and 0.91 and Index of Agreement (IA) values from 0.64 to 0.86. Our results are in close agreement with those obtained by previous researchers. Table 7 shows the comparison with the results of other researchers.
CONCLUSION
This study fitted multiple linear regression models for PM_{10}
concentration prediction using air pollutants (NO_{2},
SO_{2}, CO and PM_{10}) and meteorological parameters (T, RH
and wind speed) as predictors. The results showed that using meteorological parameters with
gases as inputs worked better than meteorological parameters without gases.
The values of R^{2}, PA and IA increased as more variables were
added to the model. Similar conclusions were found by Mendenhall
and Sincich (1995). Three models predicting PM_{10} concentration
were successfully developed for next day, next 2 days and next 3 days.
The quality and reliability of the developed models were evaluated via performance
indicators (NAE, RMSE, PA, IA and R^{2}). Assessment of model performance
indicated that the multiple linear regression method could be used for long-term
PM_{10} concentration predictions. The models could be easily implemented
for public health protection by providing early warnings to the affected population.
In addition, the models are useful in helping authorities to reduce air pollution
impact through preventative measures in Seberang Perai, Malaysia.
ACKNOWLEDGMENT
This study was funded by Universiti Sains Malaysia under Grant 304/PAWAM/60311017.
The authors thank Universiti Sains Malaysia and Universiti Teknologi MARA for providing
financial support to carry out this study and the Department
of Environment Malaysia for their support.

REFERENCES 
1: Alvim-Ferraz, M.C., M.C. Pereira, J.M. Ferraz, A.M.C. Almeida e Mello and F.G. Martins, 2005. European directives for air quality: Analysis of the new limits in comparison with asthmatic symptoms in children living in the Oporto metropolitan area, Portugal. Hum. Ecol. Risk Assess. Int. J., 11: 607-616.
2: Brunekreef, B. and S.T. Holgate, 2002. Air pollution and health. Lancet, 360: 1233-1242.
3: Chaloulakou, A., G. Grivas and N. Spyrellis, 2003. Neural network and multiple regression models for PM_{10} prediction in Athens: A comparative assessment. J. Air Waste Manage. Assoc., 53: 1183-1190.
4: Department of Environment, Malaysia, 2002. Malaysia environmental quality report 2004. Department of Environment, Ministry of Sciences, Technology and the Environment, Malaysia, Kuala Lumpur, Malaysia.
5: Ghazali, N.A., 2006. A study to assess the effect of weather parameters in influencing the air quality in Malaysia. M.Sc. Thesis, Universiti Sains Malaysia, Malaysia.
6: Grivas, G. and A. Chaloulakou, 2006. Artificial neural network models for prediction of PM_{10} hourly concentrations, in the greater area of Athens, Greece. Atmos. Environ., 40: 1216-1229.
7: Hoek, G., B. Brunekreef, B. Goldbohm, P. Fischer and P.A. van der Brand, 2002. Association between mortality and indicators of traffic-related air pollution in the Netherlands: A cohort study. Lancet, 360: 1203-1209.
8: Kappos, A.D., P. Bruckmann, P. Eikmann, N. Englert and U. Heinrich et al., 2004. Health effects of particles in ambient air. Int. J. Hygiene Environ. Health, 207: 399-407.
9: Kovac-Andric, E., J. Brana and V. Gvozdic, 2009. Impact of meteorological factors on ozone concentrations modelled by time series analysis and multivariate statistical methods. Ecol. Inform., 4: 117-122.
10: Yusof, N.F.F.M., N.A. Ghazali, N.A. Ramli, A.S. Yahaya, N. Sansuddin and W. Al-Madhoun, 2008. Correlation of PM_{10} concentration and weather parameters in conjunction with haze event in Seberang Perai, Penang. Proceedings of the International Conference on Construction and Building Technology, June 16-20, 2008, Kuala Lumpur, Malaysia, pp: 211-220.
11: Mendenhall, W. and T.L. Sincich, 1995. Statistics for Engineering and the Sciences. 4th Edn., Prentice-Hall Inc., New Jersey, USA, ISBN-13: 9780023805813, Pages: 1008.
12: Ul-Saufie, A.Z., A.S. Yahaya, N.A. Ramli and H. Abdul Hamid, 2011. Comparison between multiple linear regression and feed forward back propagation neural network models for predicting PM_{10} concentration level based on gaseous and meteorological parameters. Int. J. Sci. Technol., 1: 42-49.
13: Papanastasiou, D.K., D. Melas and I. Kioutsioukis, 2007. Development and assessment of neural network and multiple regression models in order to predict PM_{10} levels in a medium-sized Mediterranean city. Water Air Soil Pollut., 182: 325-334.
14: Sfetsos, A. and D. Vlachogiannis, 2010. A new methodology development for the regulatory forecasting of PM_{10}. Application in the Greater Athens Area, Greece. Atmospheric Environ., 44: 3159-3172.
15: Vlachogianni, A., P. Kassomenos, A. Karppinen, S. Karakitsios and J. Kukkonen, 2011. Evaluation of a multiple regression model for the forecasting of the concentrations of NO_{x} and PM_{10} in Athens and Helsinki. Sci. Total Environ., 409: 1559-1571.
16: Afroz, R., M.N. Hassan and N.A. Ibrahim, 2003. Review of air pollution and health impacts in Malaysia. Environ. Res., 92: 71-77.



