Abstract: Ground-level ozone (O3) is a secondary pollutant that has adverse effects on human health, agriculture and ecosystems. The aim of this study is to develop models to predict future O3 concentration levels in Shah Alam for the next day (D+1), the next two days (D+2) and the next three days (D+3) using the traditional method of Multiple Linear Regression (MLR) based on the Ordinary Least Squares (OLS) estimate. This study uses daily average data of air pollutants (O3, NOx, NO, SO2, NO2, CO) and meteorological variables (WS, T, RH) from 2002 until 2013 as independent variables. Model performance is measured by accuracy measures (Prediction Accuracy (PA), Index of Agreement (IA) and Coefficient of Determination (R2)) and error measures (Root Mean Square Error (RMSE) and Normalized Absolute Error (NAE)). The average of the accuracy measures (IA, PA and R2) for D+1, D+2 and D+3 is 0.4492, 0.3797 and 0.304, respectively. Meanwhile, the average of the error measures (RMSE, NAE) for D+1, D+2 and D+3 is 0.1453, 0.1374 and 0.1302, respectively.
INTRODUCTION
Ground-level ozone (O3) in urban areas has become a serious air pollution problem. Based on the air quality status for 2013, 31.51% of unclean air was recorded in Shah Alam. According to the Department of Environment Malaysia (DoE), the Air Pollution Index (API) was dominated by O3 concentrations from the afternoon until the evening. O3 is a secondary pollutant: it is not emitted directly from the Earth's surface but is formed by chemical reactions between nitrogen oxides (NOx) and Volatile Organic Compounds (VOCs) under the influence of sunlight (DoE., 2006). VOCs are often known as nonmethane hydrocarbons (NMHC) (Ghazali et al., 2010). O3 also contributes to photochemical smog and can trigger human respiratory problems (Sanna, 2009).
Various methods have been used in previous studies of O3. One of the most popular methods for predicting O3 concentration levels is Multiple Linear Regression (MLR). Regression analysis is among the most commonly used statistical tools for analyzing data. MLR is a traditional method based on the Ordinary Least Squares (OLS) estimate (Ul-Saufie et al., 2012) and is used to describe the relationship between a dependent variable and several independent variables (Chatterjee and Hadi, 2006).
Previous studies have shown that MLR is a standard method that is easy to apply (Ul-Saufie et al., 2013). According to Barrero et al. (2006), the processes involved in O3 formation can be understood more easily by using MLR. MLR can be used to predict O3 concentrations several hours ahead (Ramli et al., 2010). MLR has also been used to predict O3 concentrations and, at the same time, to understand the increasing and decreasing patterns of O3 and NO2, respectively, under the influence of weather parameters (Ghazali et al., 2010).
The aim of this study is to present the results of multiple linear regression in predicting O3 concentration levels for the next day (D+1), the next two days (D+2) and the next three days (D+3) as a function of meteorological variables (WS, T, RH) and pollutant concentrations (NOx, NO, SO2, NO2, O3, CO).
MATERIALS AND METHODS
Study area: The monitoring station in Shah Alam is located at Taman Tun Dr. Ismail (TTDI) Jaya Primary School (N03°06.287, E101°33.368), near a residential area. The station is also close to main transportation routes such as major roads, highways and an airport, and is surrounded by a light industrial area (Azmi et al., 2010). Shah Alam city lies between Petaling Jaya city (to the east) and Klang town (to the west) (Leh et al., 2014).
Monitoring record: The variables used in this study are ozone (O3, ppm), wind speed (WS, km h-1), ambient temperature (T, °C), relative humidity (RH, %), nitrogen oxides (NOx, ppm), nitric oxide (NO, ppm), sulphur dioxide (SO2, ppm), nitrogen dioxide (NO2, ppm) and carbon monoxide (CO, ppm). The primary data were managed by Alam Sekitar Malaysia Sendirian Berhad (ASMA), a private company under the supervision of the Department of Environment Malaysia (DoE).
According to Ahamad et al. (2014), Ghazali et al. (2010) and Banan et al. (2013), the air pollutants and meteorological variables were measured using a Teledyne Ozone Analyzer Model 400A UV Absorption (O3), Teledyne Model 200A (NOx, NO, NO2), Teledyne Model 100A (SO2), Teledyne Model 300 (CO), Met One 010C Sensor (WS), Met One 062 Sensor (T) and Met One 083D Sensor (RH). These instruments automatically record the air pollutant concentrations and meteorological variables hourly. The instruments and monitoring procedures follow the methods specified by the United States Environmental Protection Agency (EPA) (Ghazali et al., 2010). The secondary data from 1 January 2002 until 31 December 2013 were obtained from the Department of Environment Malaysia (DoE).
In this study, the hourly concentrations of each variable were transformed into daily average concentrations. Eighty percent of the monitoring records were randomly selected for model development and the remaining twenty percent were used for validation of the models. The statistical software used in the data analysis was SPSS Version 20, MATLAB R2012a and Microsoft Excel 2013.
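The data preparation described above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB R2012a workflow used in the study; the function names, the sample readings, the fixed random seed and the decision to skip missing hourly readings are all assumptions made for the example.

```python
import random
from collections import defaultdict
from statistics import mean


def daily_averages(hourly_records):
    """Average hourly (date, value) readings into one value per day.

    Missing readings (None) are skipped. This is an assumption: the
    paper does not state how gaps in the hourly record were handled.
    """
    by_day = defaultdict(list)
    for day, value in hourly_records:
        if value is not None:
            by_day[day].append(value)
    return {day: mean(values) for day, values in sorted(by_day.items())}


def split_80_20(rows, seed=0):
    """Randomly assign 80% of rows to model building, 20% to validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical hourly O3 readings (ppm) for two days.
hourly = [
    ("2013-01-01", 0.020), ("2013-01-01", 0.040),
    ("2013-01-02", 0.030), ("2013-01-02", None), ("2013-01-02", 0.050),
]
daily = daily_averages(hourly)

# Ten hypothetical daily records, identified here only by their index.
train, valid = split_80_20(list(range(10)))
```

Fixing the seed makes the random selection reproducible, which is useful when the same split has to be reused for the D+1, D+2 and D+3 models.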
Variable selection: The variables selected in this study are based on previous studies of O3 concentration levels (Table 1). O3 is formed from the emission and combination of other air pollutants through chemical processes. The main precursors of O3 formation are VOCs and NOx.
Table 1: | Summary of variables selected by previous researchers |
SO2: Sulphur dioxide, PM10: Particulate matter, WD: Wind direction, WV: Wind violation, SR: Solar radiation, NMHC: Nonmethane hydrocarbon |
According to the Department of Environment Malaysia (2006), VOCs are emitted from factory chimneys, motor vehicles, industrial activities, and consumer and commercial products. Meanwhile, NOx is released by motor vehicles, power plants and combustion. Most previous studies state that meteorological conditions also contribute to the formation of O3.
Meteorological variable: Khiem et al. (2010) found that low wind speed, in combination with other meteorological conditions, contributes strongly to O3 concentrations. During high wind speeds, O3 concentration levels in urban areas differ very little from those in rural areas (Husar and Renard, 1997). Temperature is also one of the main factors in O3 formation: O3 concentration levels tend to increase at high temperatures (Banja et al., 2012). Relative humidity can also be considered a contributor to O3. High relative humidity reduces the efficiency of the photochemical process and has therefore been associated with low O3 concentrations (Lelieveld and Crutzen, 1990).
Air pollutant variable: According to DoE. (2014), the sources of SO2 are power plants (50%), industrial activities (9%), motor vehicles (7%) and others (34%). The contributors of NO2 are power plants (61%), motor vehicles (26%), industrial activities (6%) and others (7%). Meanwhile, CO emissions come from motor vehicles (95.3%), power plants (3.8%), industrial activities (0.4%) and others (0.5%). These emissions increase the formation of O3 in Malaysia.
Regression analysis: The regression analysis used in this study is Multiple Linear Regression (MLR) based on the traditional Ordinary Least Squares (OLS) estimate. MLR is an extension of simple linear regression. In MLR, there is one dependent variable (response variable) and several independent variables (explanatory variables or predictors). Chatterjee and Hadi (2006) defined the general equation of MLR as follows:
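$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i \qquad (1)$$

Where:
y_i | = | Dependent variable (response variable) for the ith unit |
x_i1, ..., x_ip | = | Values of the p independent variables (predictors) for the ith unit |
β0 | = | Regression constant |
β1, ..., βp | = | Regression coefficients |
ε_i | = | Residual (error) |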
Fig. 1: | Procedure for development of multiple linear regression model |
Table 2: | Ordinary least square assumption |
Source: Chatterjee and Hadi (2006), Residual = error = ε |
Montgomery et al. (2012) noted that the method of least squares is used to estimate the parameters and that MLR is mostly used as an empirical model. Several stages are required to obtain the MLR model (Fig. 1).
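For reference, the least squares estimate mentioned here is commonly written in matrix form as below. This is the standard textbook notation and is not reproduced from the source: X is assumed to be the n × (p+1) matrix of predictor values with a leading column of ones, y the vector of observed responses and the hatted β the vector of estimated coefficients.

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 = (X^{\top} X)^{-1} X^{\top} \mathbf{y}$$

The closed form on the right exists only when the columns of X are linearly independent, which is one reason the multicollinearity check is carried out before the model is fitted.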
In Stage 1, 80% of the data for each variable was randomly selected using MATLAB R2012a. In Stage 2, the Variance Inflation Factor (VIF) was used to check for multicollinearity. The model is considered free of multicollinearity if the VIF values are less than 10 (Field, 2005). In Stage 3, the OLS assumptions were checked (Table 2). In Stage 4, the models were validated with the performance indicators (RMSE, NAE, PA, IA, R2) using the remaining 20% of the complete monitoring record. Finally, in Stage 5, the MLR model was obtained.
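The multicollinearity check in Stage 2 can be illustrated with a short sketch. It assumes the usual definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors; the helper names and the small data set are hypothetical and are not the study's records. NOx is built as NO + NO2 plus a small disturbance so that the example stays numerically well defined.

```python
def standardize(column):
    """Center a column and scale it to unit sum of squares."""
    mean = sum(column) / len(column)
    centered = [v - mean for v in column]
    norm = sum(v * v for v in centered) ** 0.5
    return [v / norm for v in centered]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(target, others):
    """R^2 of an OLS fit of `target` on the columns in `others`."""
    y = standardize(target)
    cols = [standardize(c) for c in others]
    k = len(cols)
    gram = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
            for i in range(k)]
    rhs = [sum(u * v for u, v in zip(cols[i], y)) for i in range(k)]
    beta = solve(gram, rhs)
    fitted = [sum(beta[j] * cols[j][t] for j in range(k))
              for t in range(len(y))]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1.0 - ss_res  # total sum of squares is 1 after standardizing


def vif(predictors):
    """Return {name: VIF} for a dict of equally long predictor columns."""
    result = {}
    for name, column in predictors.items():
        others = [c for other, c in predictors.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(column, others))
    return result


# Hypothetical daily values; NOx is NO + NO2 plus a small disturbance.
no = [0.010, 0.020, 0.015, 0.030, 0.025, 0.012, 0.018, 0.022]
no2 = [0.020, 0.018, 0.025, 0.015, 0.030, 0.022, 0.027, 0.016]
noise = [0.0005, -0.0004, 0.0003, -0.0002, 0.0004, -0.0005, 0.0002, -0.0003]
nox = [a + b + e for a, b, e in zip(no, no2, noise)]
temp = [27.1, 28.4, 26.9, 29.0, 27.8, 28.1, 26.5, 28.9]

vif_with_nox = vif({"NO": no, "NO2": no2, "NOx": nox, "T": temp})
vif_without_nox = vif({"NO": no, "NO2": no2, "T": temp})
```

In this toy data set the VIF values of NO, NO2 and NOx exceed 10 when NOx is included and all values fall below 10 once it is dropped. The sketch does not handle constant columns or an exactly collinear predictor, for which the VIF is undefined.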
Performance indicator: Performance indicators are used to evaluate the performance of the models for next day (D+1), next two days (D+2) and next three days (D+3) predictions. The performance indicators (Table 3) consist of accuracy measures (PA, IA and R2) and error measures (RMSE and NAE).
Table 3: | Performance indicators |
Source: Gervasi (2008), N: No. of sample daily measurement of a particular station, pi: Predicted value, Oi: Observed values, |
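Since the formulas of Table 3 are not reproduced in the text, the following sketch uses commonly cited forms of the five indicators and should be read as an assumption rather than as the exact definitions used in the study. In particular, RMSE is computed with the 1/N convention (some papers divide by N-1), PA is taken as the Pearson correlation between predicted and observed values, and R2 as its square.

```python
from math import sqrt


def performance_indicators(predicted, observed):
    """Return RMSE, NAE, PA, IA and R2 for paired predictions/observations.

    The definitions are assumed forms (see the accompanying text); the
    function does not handle constant series, which make PA undefined.
    """
    n = len(observed)
    p_bar = sum(predicted) / n
    o_bar = sum(observed) / n
    pairs = list(zip(predicted, observed))

    sq_err = sum((p - o) ** 2 for p, o in pairs)
    abs_err = sum(abs(p - o) for p, o in pairs)

    # Error measures: closer to zero is better.
    rmse = sqrt(sq_err / n)
    nae = abs_err / sum(observed)

    # Accuracy measures: closer to one is better.
    ia = 1.0 - sq_err / sum(
        (abs(p - o_bar) + abs(o - o_bar)) ** 2 for p, o in pairs
    )
    cov = sum((p - p_bar) * (o - o_bar) for p, o in pairs)
    ss_p = sum((p - p_bar) ** 2 for p in predicted)
    ss_o = sum((o - o_bar) ** 2 for o in observed)
    pa = cov / (sqrt(ss_p) * sqrt(ss_o))
    r2 = pa ** 2

    return {"RMSE": rmse, "NAE": nae, "PA": pa, "IA": ia, "R2": r2}


# Hypothetical values: every prediction is one unit above the observation.
scores = performance_indicators([2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0])
```

The toy example shows why error and accuracy measures are reported together: a constant bias leaves the correlation-based PA and R2 at one while RMSE, NAE and IA still register the offset.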
RESULTS AND DISCUSSION
Descriptive statistics are used to describe a situation (Bluman, 2009). The Shah Alam monitoring station is located at TTDI Jaya Primary School. The mean O3 concentration for Shah Alam is 0.032 ppm and the monitoring record is moderately skewed, with a skewness of 0.717. The maximum O3 concentration recorded was 0.097 ppm (Table 4 and Fig. 2). This is attributed to open burning and smoke from vehicles (DoE., 2004). According to the Department of Environment Malaysia (2011), the unhealthy days in the Klang Valley from 2001 to 2012 were mainly due to high O3 concentration levels. Shah Alam recorded the highest number of unhealthy days in every year except 2005, 2010, 2011 and 2012 (Fig. 3).
To investigate the relationship between O3, D+1 (the next day's O3) and each independent variable, the correlation coefficient (R) and a scatter plot were examined for each variable. From this analysis (Table 5 and Fig. 4), the correlations of each variable with O3, D+1 are WS (R = -0.036), T (R = 0.155), RH (R = -0.212), NOx (R = 0.056), NO (R = -0.018), SO2 (R = 0.121), NO2 (R = 0.136), O3 (R = 0.445) and CO (R = 0.177). WS, RH and NO are negatively correlated with O3, D+1, while the remaining variables are positively correlated.
Table 6 summarizes the correlation coefficients (R) and coefficients of determination (R2) between O3 and its predictors reported in previous studies. In Malaysia, Ghazali et al. (2010) reported R2 = 89.90% for Shah Alam and Gombak. Banan et al. (2014) reported, for Putrajaya, NOx (R = 0.681), NO (R = -0.537) and NO2 (R = -0.499); for Petaling Jaya, NOx (R = 0.515), NO (R = -0.678) and NO2 (R = -0.102); and for Jerantut, NOx (R = 0.416), NO (R = -0.557) and NO2 (R = -0.079). Ramli et al. (2010) reported R2 values of 89.7 and 89.0% for Shah Alam and Nilai, respectively. Internationally, Wang et al. (2003) reported R2 = 76.16% for Hong Kong, Agirre-Basurko et al. (2006) reported R = 0.88 for NO2 and Banja et al. (2012) reported T (R = 0.72), RH (R = -0.40) and R2 = 76%.
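The lagged correlations between a predictor on day D and O3 on day D+1 (Table 5) can be computed along the following lines. The sketch assumes a gap-free series of consecutive daily values; the function names and the numbers are illustrative only.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    ss_x = sum((a - x_bar) ** 2 for a in x)
    ss_y = sum((b - y_bar) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def correlation_with_future_o3(predictor, o3, lag):
    """Correlate predictor on day D with O3 on day D+lag (lag >= 1).

    Assumes both lists hold consecutive daily values with no gaps.
    """
    today = predictor[: len(o3) - lag]
    future = o3[lag:]
    return pearson_r(today, future)


# Hypothetical daily series (arbitrary units).
o3_series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rising = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
falling = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0]

r_next_day = correlation_with_future_o3(rising, o3_series, lag=1)
r_two_days = correlation_with_future_o3(falling, o3_series, lag=2)
```

Passing the O3 series itself as the predictor gives the correlation between the current day's O3 and the O3 of a later day, which corresponds to the O3 entry in Table 5.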
According to Field (2005) and Montgomery et al. (2012), a model is considered to have a multicollinearity problem if the VIF is larger than 10. Since the VIF values for NOx, NO and NO2 are larger than 10, the O3 model for next day prediction (D+1) has a multicollinearity problem (Table 7).
Fig. 2: | Box and whisker plot for O3 concentrations |
Fig. 3: | Number of unhealthy days in Klang Valley from year 2001 until 2012 (Source: DoE., 2014) |
Table 4: | Descriptive statistics of O3 in Shah Alam |
SD: Standard deviation |
Table 5: | Correlation coefficient between O3, D+1 and each variable |
Fig. 4(a-i): | Scatter plot of O3, D+1 versus independent variables |
Table 6: | Correlation (R) and coefficient of determination (R2) between O3 concentrations level and its predictor variables |
Table 7: | Multicollinearity test of O3, D+1 |
Table 8: | Multicollinearity test of O3, D+1 (Without NOx) |
The multicollinearity is due to the presence of NOx, which is the sum of NO and NO2 (NOx = NO + NO2) (Ghazali et al., 2010). NOx was therefore excluded from the model. After the NOx variable was removed, the VIF values showed that the model was free from multicollinearity (Table 8).
Fig. 5: | Histogram and table of O3 residual: D+1 |
Fig. 6: | Scatter plot of residual versus fitted values |
Table 9: | Multiple linear regression and performance indicator |
The histogram of the O3 residuals in Shah Alam for the next day (D+1) is bell-shaped, which means that the residuals are approximately normally distributed with zero mean (Fig. 5). The assumption that the residuals have constant variance (homoscedasticity) is satisfied because the scatter plot of residuals versus fitted values (Fig. 6) shows an even spread. Besides, the assumption that the residuals are uncorrelated with one another is satisfied because the Durbin-Watson statistic (1.945) is close to 2.
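The Durbin-Watson statistic quoted above follows the standard formula, the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch is given below; it assumes the residuals are supplied in time order, and the residual values are made up for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals given in time order.

    Values near 2 suggest no first-order autocorrelation, values toward
    0 suggest positive and values toward 4 negative autocorrelation.
    """
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals that alternate in sign (negative autocorrelation).
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])

# Hypothetical residuals that never change (positive autocorrelation).
dw_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])
```

Because the statistic depends on the ordering of the residuals, it is only meaningful when the records are kept in chronological order after the random selection of the model-building data.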
The procedures shown in Table 8, Fig. 5 and 6 were repeated to obtain the MLR models of Shah Alam for the next two days (O3,D+2) and the next three days (O3,D+3).
CONCLUSION
Table 9 shows the performance indicators for the next day (D+1), next two days (D+2) and next three days (D+3) predictions in Shah Alam obtained from the multiple linear regression models. The average of the accuracy measures (IA, PA and R2) is 0.4492 for D+1, followed by 0.3797 for D+2 and 0.304 for D+3. The average of the error measures (RMSE, NAE) for D+1, D+2 and D+3 is 0.1453, 0.1374 and 0.1302, respectively. Because data on VOCs and UVB were not available, the values of PA, IA and R2 are not close to one, but the models are still appropriate for predicting O3 concentration levels since the values of RMSE and NAE are close to zero. This is supported by the previous study of Yousef et al. (2008), in which the best linear regression models for particulate matter (PM10) in the dry season and wet season gave values of 0.262 and 0.240, respectively. Therefore, these three models could be implemented for public health protection to provide early warnings to the affected populations.
ACKNOWLEDGMENT
This study was funded by Universiti Teknologi MARA, Malaysia under Grant 600-RMI/FRGS 5/3 (40/2014). Special appreciation goes to the Department of Environment Malaysia (DoE) and Alam Sekitar Malaysia (ASMA) for providing the air quality data for this research, and special thanks to Universiti Teknologi MARA, Malaysia.