Research Article
 

Performance of Multiple Linear Regression Model for Long-term PM10 Concentration Prediction Based on Gaseous and Meteorological Parameters



Ahmad Zia Ul-Saufie, Ahmad Shukri Yahaya, NorAzam Ramli and Hazrul Abdul Hamid
 
ABSTRACT

The aim of this study was to investigate the performance of the Multiple Linear Regression (MLR) method in predicting future (next day, next 2 days and next 3 days) PM10 concentration levels in Seberang Perai, Malaysia. Models using meteorological parameters with and without gaseous parameters as inputs were compared. The models used gaseous parameters (NO2, SO2, CO), PM10 and meteorological parameters (temperature, relative humidity and wind speed) as predictors. Performance indicators such as Prediction Accuracy (PA), Coefficient of Determination (R2), Index of Agreement (IA), Normalized Absolute Error (NAE) and Root Mean Square Error (RMSE) were used to measure the accuracy of the models. The performance indicators were: next day (RMSE = 11.211, NAE = 0.124, PA = 0.927, IA = 0.960, R2 = 0.858), next 2-day (RMSE = 14.652, NAE = 0.155, PA = 0.881, IA = 0.925, R2 = 0.775) and next 3-day (RMSE = 15.611, NAE = 0.167, PA = 0.849, IA = 0.912, R2 = 0.720). Assessment of model performance indicated that the multiple linear regression method can be used for long-term PM10 concentration prediction, with the best performance for the next day.


 
  How to cite this article:

Ahmad Zia Ul-Saufie, Ahmad Shukri Yahaya, NorAzam Ramli and Hazrul Abdul Hamid, 2012. Performance of Multiple Linear Regression Model for Long-term PM10 Concentration Prediction Based on Gaseous and Meteorological Parameters. Journal of Applied Sciences, 12: 1488-1494.

DOI: 10.3923/jas.2012.1488.1494

URL: https://scialert.net/abstract/?doi=jas.2012.1488.1494
 
Received: April 07, 2012; Accepted: June 29, 2012; Published: August 01, 2012



INTRODUCTION

Particulate Matter (PM) is one of the most important air pollutants in terms of adverse effects on human health. In Malaysia, there are three major sources of air pollution, namely mobile sources, stationary sources and open burning sources (Afroz et al., 2003). Several studies on the impacts of PM on human health have been published (Alvim-Ferraz et al., 2005; Brunekreef and Holgate, 2002; Hoek et al., 2002; Kappos et al., 2004). PM10 concentration is preferred over suspended particulate matter (SPM) for determining air pollution in Malaysia because the Air Pollution Index (API) is obtained from measurements of particles with an aerodynamic diameter below 10 μm. The Department of Environment, Malaysia established the Malaysia Ambient Air Quality Guidelines in 2002, stating that the daily PM10 limit value is 150 μg m-3, while the annual PM10 value should not exceed 50 μg m-3 (Department of Environment Malaysia, 2002). When the PM10 concentration level exceeds the limit values stated in the air quality guidelines, short-term and chronic human health problems may occur. Statistical modeling could offer good insights in predicting future air pollution levels (next day, next 2 days and next 3 days).

Multiple linear regression is easy to implement and compute. Many researchers have used this method as a forecasting tool in multiple disciplines. Chaloulakou et al. (2003) used this method to investigate the complex relationships between meteorological and time period parameters and to forecast future PM10 concentrations. In Athens, Grivas and Chaloulakou (2006) used this method to predict hourly PM10 concentrations 24 h in advance and the result showed that multiple regression models can be used to predict PM10 24 h in advance. In Malaysia, Ghazali (2006) used MLR for PM10 concentration level prediction and Ul-Saufie et al. (2011) compared MLR with a feed forward back propagation neural network for PM10 concentration prediction. However, neither model can be used for future prediction.

The aim of this study was to investigate the performance of the multiple linear regression method in predicting future (next day, next 2 days and next 3 days) PM10 concentration levels in Seberang Perai, Malaysia. This study also compared the performance of models using meteorological parameters with gases and meteorological parameters without gases as inputs. Such a model is useful because it helps the relevant authorities to take suitable actions to reduce the impact of air pollution.

MATERIALS AND METHODS

Site description: The Seberang Perai, Pulau Pinang monitoring site is located at Taman Inderawasih (05°23.4704'N, 100°23.1977'E), in the northern part of Peninsular Malaysia. The site is just a few kilometers from an industrial area and is surrounded by busy roads. Hourly observations of PM10 in Seberang Perai, Pulau Pinang, Malaysia from January 2004 to December 2007 were selected for PM10 concentration level prediction. The hourly observations were transformed into daily data by taking the average PM10 concentration level for each day. Relative Humidity (RH), Wind Speed (WS), nitrogen dioxide (NO2), Temperature (T), carbon monoxide (CO), sulphur dioxide (SO2) and previous day PM10 were selected to study their influence on PM10 concentration. The wind over the country is generally light and variable. Wind flow patterns can be described by four seasons, namely the north-east monsoon, known as the wet season (November to March), a transitional period (April to May), the south-west monsoon, known as the dry season (June to September) and another transitional period (October to November). Average values for the chosen variables were 6.5 m sec-1 (WS), 28°C (T), 75.35% (RH), 0.0061 ppm (SO2), 0.01334 ppm (NO2), 0.4963 ppm (CO) and 67.24 μg m-3 (PM10).
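The data preparation described above (hourly readings averaged to daily values, then used to predict a later day) can be sketched as follows. This is a minimal illustration only: the function names, the sample readings and the assumption of consecutive days without gaps are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_means(hourly):
    """Average hourly (timestamp, value) readings into one value per calendar day."""
    by_day = defaultdict(list)
    for stamp, value in hourly:
        by_day[stamp.date()].append(value)
    return {day: mean(values) for day, values in sorted(by_day.items())}


def lagged_pairs(daily_values, horizon):
    """Pair the value on day t (predictor) with the value on day t + horizon (target).

    Assumes daily_values holds consecutive days with no gaps.
    """
    return [(daily_values[t], daily_values[t + horizon])
            for t in range(len(daily_values) - horizon)]


# Hypothetical hourly PM10 readings (ug/m3): three days, two readings per day.
hourly_pm10 = [
    (datetime(2004, 1, 1, 0), 60.0), (datetime(2004, 1, 1, 12), 70.0),
    (datetime(2004, 1, 2, 0), 50.0), (datetime(2004, 1, 2, 12), 54.0),
    (datetime(2004, 1, 3, 0), 80.0), (datetime(2004, 1, 3, 12), 90.0),
]
daily = list(daily_means(hourly_pm10).values())   # [65.0, 52.0, 85.0]
next_day_pairs = lagged_pairs(daily, horizon=1)   # [(65.0, 52.0), (52.0, 85.0)]
```

With n consecutive daily values, a horizon of h days leaves n - h usable pairs, so each additional day of lead time costs one observation.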

Multiple linear regression: Multiple linear regression is one of the modeling techniques used to investigate the relationship between a dependent variable and several independent variables. In the multiple linear regression model, the error term, denoted by ε, is assumed to be normally distributed with mean 0 and constant variance σ². The errors are also assumed to be uncorrelated.

We assume that the multiple linear regression model has k independent variables and that there are n observations. Thus the regression model can be written as (Kovac-Andric et al., 2009):

$$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k + \varepsilon \tag{1}$$

where b_i are the regression coefficients, x_i are the independent variables and ε is the error associated with the regression. To estimate the values of the parameters, the least squares method was used.
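As an illustration of least squares estimation for a model with an intercept and k predictors, the sketch below solves the normal equations in plain Python. It is not the authors' implementation (the study used SPSS); the function names and the noise-free sample data are assumptions made for the example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_mlr(xs, ys):
    """Least-squares estimates [b0, b1, ..., bk] from the normal equations."""
    design = [[1.0] + list(row) for row in xs]  # leading 1 for the intercept b0
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data generated from y = 2 + 3*x1 - 1*x2.
xs = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
ys = [2.0 + 3.0 * x1 - 1.0 * x2 for x1, x2 in xs]
coefficients = fit_mlr(xs, ys)  # approximately [2.0, 3.0, -1.0]
```

Solving the normal equations directly keeps the example short; for real data with many correlated predictors, a QR-based or SVD-based solver from a numerical library is usually more stable.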

The Variance Inflation Factor (VIF) was used to study the effect of multicollinearity on the variance of the estimated regression coefficients. The VIF is given by:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2} \tag{2}$$

where VIF_i is the variance inflation factor associated with the ith predictor and R_i^2 is the multiple coefficient of determination in a regression of the ith predictor on all other predictors.
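The relationship between the VIF and the coefficient of determination of the auxiliary regression can be checked with a few lines of arithmetic. The helper names below are hypothetical; the VIF values passed in are the threshold of 10 and the range 1.257-1.870 quoted in the Results and Discussion.

```python
def vif_from_r2(r2_i):
    """VIF of predictor i, given R_i^2 from regressing it on all other predictors."""
    return 1.0 / (1.0 - r2_i)


def r2_from_vif(vif_i):
    """Invert the VIF formula to recover R_i^2."""
    return 1.0 - 1.0 / vif_i


threshold_r2 = r2_from_vif(10.0)       # a VIF of 10 corresponds to R_i^2 = 0.9
low_r2 = round(r2_from_vif(1.257), 3)  # 0.204
high_r2 = round(r2_from_vif(1.870), 3) # 0.465
```

Read this way, a VIF below 10 means that less than 90% of the variance of a predictor is explained by the remaining predictors.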

The Durbin-Watson (DW) statistic tests for autocorrelation of the residuals. This test is important for checking that the model assumptions are satisfied. The DW statistic is given by:

$$d = \frac{\sum_{i=2}^{n} (\hat{e}_i - \hat{e}_{i-1})^2}{\sum_{i=1}^{n} \hat{e}_i^2} \tag{3}$$

where n is the number of observations and ê_i = y_i - ŷ_i (y_i is the observed value and ŷ_i is the predicted value). d is the Durbin-Watson statistic and always lies between 0 and 4. A value of d = 2 indicates no autocorrelation in the data; values toward 0 indicate positive autocorrelation and values approaching 4 indicate negative autocorrelation.
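A short sketch of the Durbin-Watson computation, using two hypothetical residual series chosen to show the two extremes described above (the residuals are made up for illustration and are not from the study):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: squared successive differences over squared residuals."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Sign flips at every step: negative autocorrelation, d moves toward 4.
alternating = [1.0, -1.0] * 4
# Long runs of the same sign: positive autocorrelation, d moves toward 0.
persistent = [1.0] * 4 + [-1.0] * 4

d_alternating = durbin_watson(alternating)  # 3.5
d_persistent = durbin_watson(persistent)    # 0.5
```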

Performance indicators: Performance indicators were used to evaluate the goodness of fit of the MLR models for future PM10 concentration prediction in Seberang Perai, Pulau Pinang. The performance indicators used to determine the best method in predicting PM10 concentration are NAE, RMSE, IA, PA and the coefficient of determination (R2) (Table 1).

Table 1: Performance indicators (Ul-Saufie et al., 2011)
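The sketch below computes the five indicators named in Table 1 using commonly used definitions. The exact formulas in the original table are not reproduced in this text, so every definition here is an assumption: RMSE with a divisor of n, NAE as total absolute error over total observed value, PA as the Pearson correlation between predicted and observed values, R2 as its square and IA as Willmott's index of agreement. The reported values are at least consistent with the PA and R2 assumption, since 0.927² ≈ 0.859, 0.881² ≈ 0.776 and 0.849² ≈ 0.721.

```python
from math import sqrt


def performance_indicators(observed, predicted):
    """Return RMSE, NAE, PA, R2 and IA under the assumed definitions above."""
    n = len(observed)
    o_bar = sum(observed) / n
    p_bar = sum(predicted) / n
    pairs = list(zip(predicted, observed))
    sq_err = sum((p - o) ** 2 for p, o in pairs)
    abs_err = sum(abs(p - o) for p, o in pairs)
    cov = sum((p - p_bar) * (o - o_bar) for p, o in pairs)
    ss_o = sum((o - o_bar) ** 2 for o in observed)
    ss_p = sum((p - p_bar) ** 2 for p in predicted)
    pa = cov / sqrt(ss_o * ss_p)  # Pearson correlation (assumed meaning of PA)
    ia_den = sum((abs(p - o_bar) + abs(o - o_bar)) ** 2 for p, o in pairs)
    return {
        "RMSE": sqrt(sq_err / n),       # error measure: closer to 0 is better
        "NAE": abs_err / sum(observed), # error measure: closer to 0 is better
        "PA": pa,                       # accuracy measure: closer to 1 is better
        "R2": pa ** 2,                  # accuracy measure: closer to 1 is better
        "IA": 1.0 - sq_err / ia_den,    # accuracy measure: closer to 1 is better
    }


# Hypothetical observed and predicted daily PM10 values (ug/m3).
observed = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 58.0, 71.0, 79.0]
scores = performance_indicators(observed, predicted)
```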

RESULTS AND DISCUSSION

Multiple linear regression models were developed with 1428 (next day), 1427 (next 2 days) and 1426 (next 3 days) sets of data (average daily data from January 2004 to December 2007) using SPSS version 19.0. These years were selected because of limited access to the data. Table 2 showed the model summary for PM10 concentration predictions based on gases and meteorological parameters. The result showed that all three future PM10 concentration prediction models had no problems with multicollinearity, as the values of the Variance Inflation Factor (VIF) were lower than 10. The Durbin-Watson statistic showed that the models did not have any autocorrelation problem for next day (DW = 2.117), next 2 days (DW = 1.160) and next 3 days (DW = 1.043). Table 3 showed the model summary for predicting PM10 concentration based on meteorological parameters without gases. The result showed that the models did not exhibit multicollinearity (VIF = 1.257-1.870) or autocorrelation problems (DW = 0.900-2.152), with R2 greater than 0.6.

PM10 levels decreased during strong wind events because the strong wind dispersed the PM10. The negative correlation between temperature and PM10 was attributed to the lack of significant temperature fluctuation in Malaysia (24-32°C). Similar results were found by Yusof et al. (2008). SO2 had a positive correlation with PM10 because most SO2 in the area came from petrol-fueled motor vehicle emissions. SO2 also came from industrial activities processing materials containing sulfur. For NO2 and CO, the main source is diesel-fueled vehicle emissions. Our findings reflected a negative correlation between these two gases and PM10 because there were fewer diesel-fueled vehicle emissions in this area.

Analysis of Variance (ANOVA) was conducted to test whether the models were significantly better at predicting the outcomes than using the mean. Table 4 showed the ANOVA result with gases and meteorological parameters as inputs. The results indicated that the observed values of F were 1243.152 (next day), 701.940 (next 2 days) and 503.147 (next 3 days), whereas the critical values F0.05, 7, 1420, F0.05, 7, 1419 and F0.05, 7, 1418 were all less than 2.103. All regression models were therefore useful as predictors because the observed F ratios were far more than four or five times greater than the critical values of F. The result also indicated that the models significantly improved our capability to predict PM10 concentration. A similar conclusion was reached when only meteorological parameters were applied as inputs, as shown in Table 5.
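For a regression with an intercept, k predictors and n observations, the overall F statistic can be written in terms of R2, with k and n - k - 1 degrees of freedom. The sketch below shows that identity; it is a standard textbook relationship rather than the SPSS output used in the study, the function names are hypothetical, and it is not claimed to reproduce the F values in Table 4.

```python
def residual_df(n, k):
    """Denominator degrees of freedom for the overall F test."""
    return n - k - 1


def f_statistic(r2, n, k):
    """Overall F statistic of a regression with intercept, from R^2, n and k."""
    return (r2 / k) / ((1.0 - r2) / residual_df(n, k))


# Seven predictors (RH, WS, NO2, T, CO, SO2, previous day PM10) and 1428 observations.
df_next_day = residual_df(1428, 7)        # 1420
# Small hypothetical case that is easy to check by hand: (0.5 / 2) / (0.5 / 10).
f_example = f_statistic(0.5, n=13, k=2)   # 5.0
```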

One of the assumptions of MLR is that the residuals (or errors) are normally distributed with zero mean and constant variance.

Table 2: Model summary of PM10 based on meteorological parameters with gaseous parameters

Table 3: Model summary of PM10 based on meteorological parameters without gaseous parameters

Table 4: Result for ANOVA, gaseous and meteorological parameters as input

Table 5: Result for ANOVA, meteorological parameters as input

Residual analysis was very important in determining the adequacy of the statistical model. If the errors showed any pattern, the model was considered not to have captured all the systematic information. Figures 1 and 2 showed that the residuals were normally distributed with zero mean for the models.

Fig. 1(a-c): Meteorological parameters based standardized residual analysis of PM10 for, (a) Next day, (b) Next 2 days and (c) Next 3 days

Figures 3 and 4 depicted that the residuals were uncorrelated with constant variance, as the residuals were contained in a horizontal band, and hence there were no obvious defects in the models.

Comparison of performance: Performance indicators were used to compare performance for future prediction of PM10 concentration in Seberang Perai, Pulau Pinang. Table 6 showed the values for performance indicators. Accuracies measured were prediction accuracy, coefficient of determination and index of agreement, while the errors measured were normalized absolute error and root mean square error. The performance indicators reflected greater accuracy in next day PM10 concentration prediction compared to the next 2-day and next 3-day predictions.

Table 6: Performance indicators for future PM10 concentration prediction
1: Based on meteorological parameters (WS, T, RH) and PM10, 2: Based on meteorological parameters (WS, T, RH), gaseous parameters (CO, NO2, SO2) and PM10

However, the result showed that MLR could predict future PM10 concentration up to the next 3 days. Index of agreement values greater than 0.9 indicated that the predicted values were highly accurate up to the next 3 days. Table 6 also showed the comparison between different parameters as inputs.

Fig. 2(a-c): Gaseous and meteorological parameters based standardized residual analysis of PM10 for, (a) Next day, (b) Next 2 days and (c) Next 3 days

Fig. 3(a-c): Correlation of fitted values with residuals of PM10 for, (a) Next day, (b) Next 2 days and (c) Next 3 days

Fig. 4(a-c): Correlation of fitted values with residuals of PM10 for, (a) Next day, (b) Next 2 days and (c) Next 3 days

Table 7: Comparison of results with other researchers using multiple linear regression

The result showed that meteorological parameters with gases as inputs performed better than meteorological parameters without gases. However, all the models could be utilized for PM10 concentration prediction as the values for prediction accuracy were greater than 0.8.

Various researchers have used multiple linear regression for predicting PM10 concentration. Their results show Coefficients of Determination (R2) between 0.53 and 0.91 and Index of Agreement (IA) values from 0.64 to 0.86. Our results are in close agreement with those obtained by previous researchers. Table 7 shows the comparison with other researchers.

CONCLUSION

This study fitted multiple linear regression models for PM10 concentration prediction using air pollutants (NO2, SO2, CO and PM10) and meteorological parameters (T, RH and wind speed) as predictors. The result showed that using meteorological parameters with gases as inputs worked better than meteorological parameters without gases. The values of R2, PA and IA would increase as more variables were added to the model. Similar conclusions were found by Mendenhall and Sincich (1995). Three models predicting PM10 concentration were successfully developed for next day, next 2 days and next 3 days.

The quality and reliability of the developed models were evaluated via performance indicators (NAE, RMSE, PA, IA and R2). Assessment of model performance indicated that the multiple linear regression method could be used for long-term PM10 concentration predictions. The models could be easily implemented for public health protection by providing early warnings to the affected population. The models were also useful in helping authorities to reduce the impact of air pollution through preventative measures in Seberang Perai, Malaysia.

ACKNOWLEDGMENT

This study was funded by Universiti Sains Malaysia under Grant 304/PAWAM/60311017. Thank you to Universiti Sains Malaysia and Universiti Teknologi MARA for providing financial support to carry out this study and also thanks to the Department of Environment Malaysia for their support.

REFERENCES

1:  Alvim-Ferraz, M.C., M.C. Pereira, J.M. Ferraz, A.M.C. Almeida e Mello and F.G. Martins, 2005. European directives for air quality: Analysis of the new limits in comparison with asthmatic symptoms in children living in the Oporto metropolitan area, Portugal. Hum. Ecol. Risk Assess. Int. J., 11: 607-616.

2:  Brunekreef, B. and S.T. Holgate, 2002. Air pollution and health. Lancet, 360: 1233-1242.

3:  Chaloulakou, A., G. Grivas and N. Spyrellis, 2003. Neural network and multiple regression models for PM10 prediction in Athens: A comparative assessment. J. Air Waste Manage. Assoc., 53: 1183-1190.

4:  Department of Environment, Malaysia, 2002. Malaysia environmental quality report 2004. Department of Environment, Ministry of Sciences, Technology and the Environment, Malaysia, Kuala Lumpur, Malaysia.

5:  Ghazali, N.A., 2006. A study to assess the effect of weather parameters in influencing the air quality in Malaysia. M.Sc. Thesis, Universiti Sains Malaysia, Malaysia.

6:  Grivas, G. and A. Chaloulakou, 2006. Artificial neural network models for prediction of PM10 hourly concentrations, in the greater area of Athens, Greece. Atmos. Environ., 40: 1216-1229.

7:  Hoek, G., B. Brunekreef, S. Goldbohm, P. Fischer and P.A. van den Brandt, 2002. Association between mortality and indicators of traffic-related air pollution in the Netherlands: A cohort study. Lancet, 360: 1203-1209.

8:  Kappos, A.D., P. Bruckmann, P. Eikmann, N. Englert and U. Heinrich et al., 2004. Health effects of particles in ambient air. Int. J. Hygiene Environ. Health, 207: 399-407.

9:  Kovac-Andric, E., J. Brana and V. Gvozdic, 2009. Impact of meteorological factors on ozone concentrations modelled by time series analysis and multivariate statistical methods. Ecol. Inform., 4: 117-122.

10:  Yusof, N.F.F.M., N.A. Ghazali, N.A. Ramli, A.S. Yahaya, N. Sansuddin and W. Al-Madhoun, 2008. Correlation of PM10 concentration and weather parameters in conjunction with haze event in Seberang Perai, Penang. Proceedings of the International Conference on Construction and Building Technology, June 16-20, 2008, Kuala Lumpur, Malaysia, pp: 211-220.

11:  Mendenhall, W. and T.L. Sincich, 1995. Statistics for Engineering and the Sciences. 4th Edn., Prentice-Hall Inc., New Jersey, USA., ISBN-13: 978-0023805813, Pages: 1008

12:  Ul-Saufie, A.Z., A.S. Yahaya, N.A. Ramli and H. Abdul Hamid, 2011. Comparison between multiple linear regression and feed forward back propagation neural network models for predicting PM10 concentration level based on gaseous and meteorological parameters. Int. J. Sci. Technol., 1: 42-49.

13:  Papanastasiou, D.K., D. Melas and I. Kioutsioukis, 2007. Development and assessment of neural network and multiple regression models in order to predict PM10 levels in a medium-sized Mediterranean city. Water Air Soil Pollut., 182: 325-334.

14:  Sfetsos, A. and D. Vlachogiannis, 2010. A new methodology development for the regulatory forecasting of PM10. Application in the Greater Athens Area, Greece. Atmospheric Environ., 44: 3159-3172.

15:  Vlachogianni, A., P. Kassomenos, A. Karppinen, S. Karakitsios and J. Kukkonen, 2011. Evaluation of a multiple regression model for the forecasting of the concentrations of NOx and PM10 in Athens and Helsinki. Sci. Total Environ., 409: 1559-1571.

16:  Afroz, R., M.N. Hassan and N.A. Ibrahim, 2003. Review of air pollution and health impacts in Malaysia. Environ. Res., 92: 71-77.
