
Journal of Applied Sciences

Year: 2013 | Volume: 13 | Issue: 2 | Page No.: 294-300
DOI: 10.3923/jas.2013.294.300
Finding the Best Statistical Distribution Model in PM10 Concentration Modeling by using Lognormal Distribution
Hazrul Abdul Hamid, Ahmad Shukri Yahaya, Nor Azam Ramli and Ahmad Zia Ul-Saufie

Abstract: Air pollution is an important and widely discussed issue, so studies on air pollution modeling are needed. Air pollution models are useful because they can help local authorities take suitable action to reduce the impact of air pollution, and finding the best model allows predictions to be made accurately. Statistical distribution modeling plays an important role in predicting air pollutant concentrations. The lognormal distribution is one of the distributions widely used in environmental engineering. One of the important steps in statistical distribution modeling is parameter estimation, and several methods can be used to estimate the parameters when fitting a distribution to air pollutant concentration data. This research compared the performance of parameter estimators for the two-parameter and three-parameter lognormal distributions using PM10 concentrations in Nilai, Negeri Sembilan, Malaysia. Two methods were used to estimate the parameters: the method of moments and the method of probability weighted moments. Five performance indicators were used to determine the best estimator and the best distribution to represent the PM10 concentration in Nilai, Negeri Sembilan from 2003 to 2009. Results show that the three-parameter lognormal distribution performs better than the two-parameter lognormal distribution.


How to cite this article
Hazrul Abdul Hamid, Ahmad Shukri Yahaya, Nor Azam Ramli and Ahmad Zia Ul-Saufie, 2013. Finding the Best Statistical Distribution Model in PM10 Concentration Modeling by using Lognormal Distribution. Journal of Applied Sciences, 13: 294-300.

Keywords: performance indicator, PM10, Lognormal distribution, prediction model and exceedences

INTRODUCTION

The impact of air pollution is noticeable, especially on human health. Several health effects are associated with air pollution, such as asthma, chronic bronchitis and increased respiratory symptoms including sinusitis, sore throat, dry and wet cough and hay fever (WHO, 1998). The detrimental effect of poor air quality on human health, whether chronic or acute, is an issue of great concern. Since air pollution can be a major problem, especially for human health, air quality should be monitored continuously.

Sources of air pollution can be grouped into mobile sources, stationary sources and open burning sources (Afroz et al., 2003). Mobile sources include personal vehicles, commercial vehicles and motorcycles. Stationary sources refer to factories and industry, power stations, industrial fuel burning processes and domestic fuel burning, while open burning sources refer to the burning of solid wastes and forest fires. These are also the three major sources of air pollution in Malaysia. Over the past few years, however, mobile sources have been the major contributor, accounting for around 70 to 75% of total air pollution in Malaysia (Afroz et al., 2003). In Malaysia, there are 52 monitoring locations throughout the country that belong to the Department of Environment (Department of Environment Malaysia, 2010). The parameters monitored include Particulate Matter (PM10), Sulphur Dioxide (SO2) and several airborne heavy metals. The Malaysia Ambient Air Quality Guidelines state that the 24 h mean PM10 concentration should not exceed 150 μg m⁻³.

The probability density function of concentration in an atmospheric plume is an important quantity used to describe and discuss environmental diffusion (Yee and Chan, 1997). The concentrations of air pollutants are usually correlated with the emission levels and meteorological conditions. When the parent probability distribution of air pollutants is correctly chosen, the specific distribution can be used to predict the mean concentration and probability of exceeding a critical concentration (Lu and Fang, 2003). Selecting appropriate probability models for the data is an important step in environmental data analysis. These probability models may become the basis for estimating the parameters to meet the evolving information needs of environmental quality management. The developed models also can be easily implemented for public health protection by providing early warning to the respective population (Ul-Saufie et al., 2012).

The lognormal distribution has been used extensively in atmospheric sciences to describe phenomena that take on non-negative values, such as storm, daily and longer-period rain, snow and hail amounts, particle size distributions, pollutant concentrations, cloud dimensions, air velocity fluctuations, flood frequencies and radio wave amplitude fluctuations. In most cases these are empirically observed and tested fits, and often other models, such as the gamma distribution, are also fitted or at least cannot be excluded as being consistent with the data (Lopez, 1977).

In Malaysia, the lognormal distribution has been found to be the best distribution to represent the PM10 concentration in a residential area (Sansuddin et al., 2011). The lognormal distribution was also used to fit the PM10 concentration in Seberang Perai, one of the industrial areas in Malaysia, and the results show that the lognormal distribution agrees with the data in several years (Yusof et al., 2010). In environmental engineering, one quantity of great concern is the return period. In this study, the return period is an estimate of the interval of time between high particulate events.

Particulate matter enters the body when we breathe. Large particles can be trapped in the nose and throat and are removed when we cough or sneeze. In some areas, particulate matter levels can be very high because of intense industrial activity. This study is concerned with particles that are 10 micrometers or smaller in aerodynamic diameter (PM10), because these are the particles that generally pass through the throat and nose and enter the lungs. Once inhaled, these particles can affect the heart and lungs and cause serious health effects. This research compared the performance of parameter estimators for the two-parameter and three-parameter lognormal distributions using PM10 concentrations in Nilai, Negeri Sembilan, Malaysia, and hence identified the best distribution to represent the PM10 concentration in Nilai, Negeri Sembilan.

MATERIALS AND METHODS

Study area: Nilai is a town located in Negeri Sembilan and can be classified as an industrial area. Located at latitude 2°45'N and longitude 102°15'E, Nilai is a rapidly growing town due to its proximity and easy connection to Kuala Lumpur. The data used in this study are hourly PM10 concentrations in Nilai, Negeri Sembilan, recorded from 2003 to 2009. Missing values were replaced using the mean top bottom method, in which each missing value is filled with the average of the data available above and below it (Yahaya et al., 2005).
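The mean top bottom rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of Yahaya et al. (2005) itself: the function name `fill_mean_top_bottom`, the use of `None`/NaN to mark gaps, and the handling of runs of consecutive gaps and of gaps at the ends of the series are all assumptions made for the example.

```python
import math

def fill_mean_top_bottom(values):
    """Fill each missing value with the mean of the nearest observed
    values above and below it (assumed reading of the rule)."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    filled = list(values)
    n = len(values)
    for i, v in enumerate(values):
        if not is_missing(v):
            continue
        # Nearest observed value above (earlier in the record).
        above = next((values[j] for j in range(i - 1, -1, -1)
                      if not is_missing(values[j])), None)
        # Nearest observed value below (later in the record).
        below = next((values[j] for j in range(i + 1, n)
                      if not is_missing(values[j])), None)
        neighbours = [x for x in (above, below) if x is not None]
        # Assumption: at the ends of the record only one neighbour exists,
        # so that single value is used; an all-missing series is left as is.
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled

# Hypothetical hourly readings with gaps (not data from this study).
hourly = [42.0, None, 50.0, 61.0, None, None, 55.0]
print(fill_mean_top_bottom(hourly))
```

Because the neighbours are always taken from the original observations, a run of consecutive gaps receives one common value instead of compounding earlier fills.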

Lognormal distribution: Equation 1 shows the probability density function for the two parameter lognormal distribution (Evans et al., 2000):

(1)

where x ≥ 0, λ represents a scale parameter and σ represents a location parameter for the annual measurements at a particular site.

For the three-parameter lognormal distribution, the probability density function given by Johnson et al. (1994) is as follows:

(2)

where x > δ, -∞ < σ < ∞ and λ > 0. Here λ represents the scale parameter, σ represents the location parameter and δ represents the threshold parameter.
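Both densities can be evaluated with one short function under the usual textbook parameterisation, in which the logarithm of the concentration minus a threshold is normally distributed. The sketch below is written under that assumption. The names `lognormal_pdf`, `mu`, `sigma` and `threshold` are illustrative and are not this study's notation, and no mapping onto λ, σ and δ above is asserted.

```python
import math

def lognormal_pdf(x, mu, sigma, threshold=0.0):
    """Density of X when ln(X - threshold) is normal with mean mu and
    standard deviation sigma. threshold=0 gives the two-parameter case."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    y = x - threshold
    if y <= 0:
        return 0.0  # no probability mass at or below the threshold
    z = (math.log(y) - mu) / sigma
    return math.exp(-0.5 * z * z) / (y * sigma * math.sqrt(2.0 * math.pi))

# Hypothetical parameter values, for illustration only.
print(lognormal_pdf(60.0, mu=4.0, sigma=0.5))                   # two-parameter
print(lognormal_pdf(60.0, mu=3.8, sigma=0.6, threshold=10.0))   # three-parameter
```

The threshold shifts the support of the distribution, which is what gives the three-parameter form extra flexibility for data that never fall below some positive background level.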

Parameter estimation: The lognormal distribution model was used to fit the hourly PM10 concentration observed in Nilai. Several methods can be used to estimate the parameters of the lognormal distribution. This study focuses only on the method of moments and the method of probability weighted moments. To estimate the parameters λ and σ of the lognormal distribution using the method of moments, the formulae given by Evans et al. (2000) are as follows:

(3)

(4)
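A method of moments fit for the two-parameter case can be sketched as follows, assuming the standard moment relations for a lognormal variable whose logarithm has mean mu and standard deviation sigma: mean = exp(mu + sigma²/2) and variance = mean² (exp(sigma²) - 1). The function name `lognormal2_moments` and the choice of the population variance (divisor N) are assumptions of this sketch. They are not taken from Evans et al. (2000).

```python
import math
from statistics import fmean, pvariance

def lognormal2_moments(data):
    """Match the sample mean and variance to the lognormal mean and
    variance and solve for (mu, sigma) of ln(X)."""
    m = fmean(data)
    if m <= 0:
        raise ValueError("sample mean must be positive")
    v = pvariance(data, m)                # assumption: divisor N, not N - 1
    sigma2 = math.log(1.0 + v / (m * m))  # from variance = mean^2 (exp(sigma^2) - 1)
    mu = math.log(m) - 0.5 * sigma2       # from mean = exp(mu + sigma^2 / 2)
    return mu, math.sqrt(sigma2)

# Hypothetical concentrations in μg m⁻³ (not data from this study).
mu_hat, sigma_hat = lognormal2_moments([38.0, 52.0, 47.0, 95.0, 61.0, 44.0])
print(mu_hat, sigma_hat)
```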

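For the method of probability weighted moments, the first two L-moments λ1 and λ2 of the lognormal distribution are given by Hosking (1990) as follows, where erf(x) is the error function: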

(5)

(6)

Where:

(7)

and F(.) is the normal distribution function which can be evaluated using the following equations:

(8)

Equations 5 and 6 can be solved for μ and σ by replacing λ1 and λ2 with the sample L-moments l1 and l2 to give the following equations:

(9)

(10)
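The L-moment route for the two-parameter case can be sketched as below. It assumes the standard lognormal L-moment relations λ1 = exp(μ + σ²/2) and λ2 = exp(μ + σ²/2) erf(σ/2), and the identity erf(σ/2) = 2Φ(σ/√2) - 1, where Φ is the standard normal distribution function. The helper names and the use of unbiased sample probability weighted moments are assumptions of this sketch.

```python
import math
from statistics import NormalDist

def sample_l_moments(data):
    """First two sample L-moments from unbiased probability weighted moments."""
    x = sorted(data)
    n = len(x)
    b0 = sum(x) / n
    # enumerate() gives rank - 1, which is the weight numerator for b1.
    b1 = sum(i * xi for i, xi in enumerate(x)) / (n * (n - 1))
    return b0, 2.0 * b1 - b0

def lognormal2_lmoments(data):
    """Solve l2 / l1 = erf(sigma / 2) for sigma, then l1 for mu."""
    l1, l2 = sample_l_moments(data)
    # erf(sigma / 2) = 2 * Phi(sigma / sqrt(2)) - 1, inverted with inv_cdf.
    sigma = math.sqrt(2.0) * NormalDist().inv_cdf(0.5 * (1.0 + l2 / l1))
    mu = math.log(l1) - 0.5 * sigma * sigma
    return mu, sigma

# Hypothetical concentrations in μg m⁻³ (not data from this study).
print(lognormal2_lmoments([38.0, 52.0, 47.0, 95.0, 61.0, 44.0]))
```

Because L-moments are linear in the ordered observations, a single extreme hourly reading influences these estimates less than it influences the variance used by the method of moments.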

For the three-parameter lognormal distribution, the first two moments of the lognormal distribution are given by Kite (1977) as follows:

(11)

(12)

The coefficient of variation of (x-α), z2, is also given by Kite (1977) as follows:

(13)

where, w is defined by:

(14)

where, γ1 in Eq. 12 is the coefficient of skewness of the original variable x. The values of the parameters after replacing μ'1, μ2 and γ1 by their sample estimates m'1, m2, and g1 are given by:

(15)

(16)

(17)
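The three-parameter moment fit can be sketched from the closed-form solution commonly attributed to Kite. With z the coefficient of variation of the shifted variable, the skewness satisfies γ = z³ + 3z, which is solved by z = (1 - w^(2/3)) / w^(1/3) with w = (-γ + √(γ² + 4)) / 2. The function name, argument names and the ordering of the returned values are assumptions of this sketch, and it takes the sample mean, variance and skewness as already computed.

```python
import math

def lognormal3_moments(mean, variance, skewness):
    """Return (mu, sigma, threshold) such that ln(X - threshold) is normal
    with mean mu and standard deviation sigma, matching three moments."""
    if skewness <= 0:
        raise ValueError("this closed form needs positive skewness")
    w = 0.5 * (-skewness + math.sqrt(skewness * skewness + 4.0))
    # Coefficient of variation of (X - threshold); solves z^3 + 3z = skewness.
    z = (1.0 - w ** (2.0 / 3.0)) / w ** (1.0 / 3.0)
    s = math.sqrt(variance)
    sigma2 = math.log(z * z + 1.0)
    mu = math.log(s / z) - 0.5 * sigma2
    threshold = mean - s / z
    return mu, math.sqrt(sigma2), threshold

# Hypothetical sample moments (not values from Table 1).
print(lognormal3_moments(mean=60.0, variance=900.0, skewness=2.0))
```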

An approximate solution for the parameter estimates using the method of probability weighted moments, given by Hosking (1990), is as follows:

(18)

(19)

(20)

Where:

(21)

(22)

(23)

(24)

(25)

Here λr is the r-th L-moment of the random variable X, and l1 and l2 are the first and second sample L-moments.
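A three-parameter L-moment fit along these lines might look like the sketch below. It assumes the approximation commonly quoted from Hosking (1990), σ ≈ 0.999281z - 0.006118z³ + 0.000127z⁵ with z = √(8/3) Φ⁻¹((1 + t3)/2), where t3 is the sample L-skewness. The coefficients are reproduced from memory of that source and should be checked against it before use. Function names and the order of the returned values are assumptions.

```python
import math
from statistics import NormalDist

def sample_l_moments3(data):
    """Sample l1, l2 and L-skewness t3 from unbiased probability weighted moments."""
    x = sorted(data)
    n = len(x)
    b0 = sum(x) / n
    b1 = sum(i * xi for i, xi in enumerate(x)) / (n * (n - 1))
    b2 = sum(i * (i - 1) * xi for i, xi in enumerate(x)) / (n * (n - 1) * (n - 2))
    l1 = b0
    l2 = 2.0 * b1 - b0
    l3 = 6.0 * b2 - 6.0 * b1 + b0
    return l1, l2, l3 / l2

def lognormal3_lmoments(data):
    """Return (mu, sigma, threshold) from the first three sample L-moments."""
    l1, l2, t3 = sample_l_moments3(data)
    if not 0.0 < t3 < 1.0:
        raise ValueError("sketch assumes positive L-skewness")
    z = math.sqrt(8.0 / 3.0) * NormalDist().inv_cdf(0.5 * (1.0 + t3))
    # Assumed polynomial approximation; verify coefficients against the source.
    sigma = 0.999281 * z - 0.006118 * z ** 3 + 0.000127 * z ** 5
    mu = math.log(l2 / math.erf(0.5 * sigma)) - 0.5 * sigma * sigma
    threshold = l1 - math.exp(mu + 0.5 * sigma * sigma)
    return mu, sigma, threshold

# Hypothetical concentrations in μg m⁻³ (not data from this study).
print(lognormal3_lmoments([38.0, 52.0, 47.0, 95.0, 61.0, 44.0, 130.0, 57.0]))
```

The threshold estimate is sensitive to small changes in the L-skewness, so short records can give unstable values.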

Performance indicator: Five performance indicators are used to determine the best estimator. The root mean square error (RMSE) summarizes the difference between the observed and imputed concentrations and is used to provide the average error (Junninen et al., 2004). It is defined as:

(26)

where, N is the number of imputations, Oi is the observed data point and Pi is the imputed data point. The root mean square error method is the most common indicator. For a good estimator, the RMSE value must approach zero. Therefore, a smaller RMSE value means that the model is more appropriate (Junninen et al., 2004).

The Normalized Absolute Error (NAE) is a more sensitive measure of residual error than RMSE (Junninen et al., 2004). It is defined as:

(27)

where, N is the number of imputations, Oi is the observed data point, Pi is the imputed data point. A small value for the normalized absolute error means that the model is appropriate (Junninen et al., 2004).

The index of agreement (IA) is defined as follows (Junninen et al., 2004):

(28)

where, N is the number of imputations, Oi is the observed data point, Pi is the imputed data point. For a good estimator, the IA value must approach one (Junninen et al., 2004).

The coefficient of determination (R2) is defined as follows (Junninen et al., 2004):

(29)

where, N is the number of imputations, Oi is the observed data point, Pi is the imputed data point. For a good estimator, the R2 value must approach one (Junninen et al., 2004).

The prediction accuracy (PA) is defined as:

(30)

where, N is the number of imputations, Oi is the observed data point, Pi is the imputed data point. For a good estimator, the PA value must approach one (Junninen et al., 2004).
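The five indicators can be computed together, as in the sketch below. The exact forms are assumptions, since several variants appear in the literature: RMSE uses divisor N, NAE is the summed absolute error divided by the summed observations, IA is Willmott's index, R2 is the squared correlation with population standard deviations, and PA is the correlation with divisor N - 1 and sample standard deviations. The function name and dictionary keys are hypothetical.

```python
import math
from statistics import fmean, pstdev, stdev

def performance_indicators(observed, predicted):
    """Return RMSE, NAE, IA, R2 and PA for paired observed/predicted values."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    pairs = list(zip(observed, predicted))
    o_bar, p_bar = fmean(observed), fmean(predicted)
    sq_err = sum((p - o) ** 2 for o, p in pairs)
    abs_err = sum(abs(p - o) for o, p in pairs)
    cross = sum((p - p_bar) * (o - o_bar) for o, p in pairs)
    spread = sum((abs(p - o_bar) + abs(o - o_bar)) ** 2 for o, p in pairs)
    return {
        "RMSE": math.sqrt(sq_err / n),       # assumed divisor N
        "NAE": abs_err / sum(observed),
        "IA": 1.0 - sq_err / spread,
        "R2": (cross / (n * pstdev(predicted) * pstdev(observed))) ** 2,
        "PA": cross / ((n - 1) * stdev(predicted) * stdev(observed)),
    }

# Hypothetical observed and fitted values (not results from Table 2).
print(performance_indicators([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 37.0]))
```

RMSE and NAE measure error and should be close to zero, while IA, R2 and PA measure agreement and should be close to one.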

RESULTS AND DISCUSSION

Box plots of the hourly PM10 concentration are shown in Fig. 1. As a simple graphical display, the box plot is well suited to comparisons. The maximum PM10 concentration exceeds the limit (150 μg m⁻³) in every year, with the highest reading of 542 μg m⁻³ in 2005. Table 1 shows the descriptive statistics for hourly PM10 concentration in Nilai, Negeri Sembilan from 2003 to 2009. The mean for each year is higher than the median, which indicates that the data are skewed to the right. The highest skewness, 3.59, shows that 2005 experienced high particulate events. This is likely due to the haze event that occurred that year as an effect of the transboundary movement of air pollutants emitted from forest fires and open burning activities in Indonesia (Department of Environment Malaysia, 2005). Smoke from biomass burning at regional sources also contributes to the PM10 concentration, especially during the dry season (Juneng et al., 2009). In addition, since Nilai is an industrial area, neighbouring precursor emissions resulting from local societal and industrial development also contribute to the variations in PM10 concentration (Juneng et al., 2011).

The two-parameter and three-parameter lognormal distributions were used to fit the PM10 concentration in this study. The parameters of the distributions were estimated using the method of moments and the method of probability weighted moments. The best estimator and distribution were selected according to five goodness-of-fit criteria: Normalized Absolute Error (NAE), Prediction Accuracy (PA), coefficient of determination (R2), Root Mean Square Error (RMSE) and index of agreement (IA). These criteria describe how well the distribution fits a set of observations.

Fig. 1: Characteristics of the PM10 data in μg m⁻³ for the monitoring site in Nilai, Negeri Sembilan

Table 1: Descriptive statistics for PM10 concentration for Nilai

Table 2: Performance Indicators Value for Nilai, Negeri Sembilan

Table 2 shows the value of each performance indicator for each estimator. The best distribution for each year can be identified from these goodness-of-fit criteria. According to the criteria, the three-parameter lognormal distribution fits the PM10 concentration better than the two-parameter lognormal distribution for every year from 2003 to 2009. This finding is supported by Taylor et al. (1986), who found the lognormal distribution appropriate for particulate data. Norazian et al. (2011) also found that the lognormal distribution is very suitable for representing air pollutant data and performs better for PM10 concentration in an industrial area in Malaysia. The lognormal distribution has also been widely used to model many kinds of environmental contaminant data, including air quality data (Gilbert, 1987). Previous research by Sansuddin et al. (2011) found that the gamma distribution was the best distribution to represent PM10 concentration in Nilai for 2002. However, previous studies considered only two-parameter distributions.

The probability density function graphs were plotted using the values of parameter according to the best distribution for each year, as given in Table 3. Figure 2 shows the probability density function plot of PM10 concentration using the best estimator and distribution selected through the performance indicator criteria. The probability density function plot in Fig. 2 shows that the distribution for each year is positively skewed with the highest skewness in 2005.

Table 3: Parameter for the lognormal distribution using the best method

Fig. 2: The probability density function plot of PM10 concentration

Fig. 3: The cumulative distribution function plot of PM10 concentration

This is probably caused by the high particulate event in 2005 (Department of Environment Malaysia, 2005). The probability density function plots also show that PM10 concentrations exceeded the limit set by the Malaysia Ambient Air Quality Guidelines (150 μg m⁻³) in each year, perhaps because the monitoring station in Nilai is located in an industrial area. In 2008, the pdf plot shows that the PM10 concentrations were lower than in other years.

Table 4: The predicted and actual exceedences

Figure 3 shows the cumulative distribution function plot for the best distribution representing the observed monitoring record from 2003 to 2009. The plot was used to determine the probability of the PM10 concentration exceeding the Malaysia Ambient Air Quality Guideline (MAAQG). The probability of exceeding 150 μg m⁻³ was used to obtain the return period, as shown in Table 4.
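The step from a fitted distribution to an exceedance probability and return period can be sketched as follows. Two assumptions are made here that the text does not confirm: the return period is taken as the reciprocal of the exceedance probability, expressed in units of the sampling interval, and the predicted number of exceedances is taken as the number of observations multiplied by that probability. All function names and parameter values are hypothetical.

```python
import math
from statistics import NormalDist

def exceedance_probability(limit, mu, sigma, threshold=0.0):
    """P(X > limit) when ln(X - threshold) is normal with mean mu, sd sigma."""
    y = limit - threshold
    if y <= 0:
        return 1.0  # the limit lies at or below the lower bound of X
    return 1.0 - NormalDist(mu, sigma).cdf(math.log(y))

def return_period(probability):
    """Assumed definition: mean number of sampling intervals between exceedances."""
    return 1.0 / probability

def expected_exceedances(n_observations, probability):
    """Assumed definition: expected count of readings above the limit."""
    return n_observations * probability

# Hypothetical fitted parameters (not values from Table 3).
p = exceedance_probability(150.0, mu=3.9, sigma=0.55, threshold=8.0)
print(p, return_period(p), expected_exceedances(8760, p))
```

With hourly data the return period from this sketch is in hours, so dividing by 24 expresses it in days.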

CONCLUSION

Based on the analysis of PM10 concentration in Nilai, Negeri Sembilan from 2003 to 2009, every year experienced high particulate events. The two-parameter and three-parameter lognormal distributions were fitted to the PM10 concentration, with the parameters estimated by the method of moments and the method of probability weighted moments. Results show that the three-parameter lognormal distribution is the best distribution to represent the PM10 concentration in the industrial area of Nilai, Negeri Sembilan. Predictions of the PM10 exceedences were made using the best estimator and the best distribution for each year. There are differences between the predicted and actual exceedences, but the errors are small. The results of this study provide useful information on the air quality status in Nilai, Negeri Sembilan and can be used for air quality management.

REFERENCES

  • Department of Environment Malaysia, 2005. Malaysia environment quality report 2005. Department of Environment Ministry of Natural Resources and Environment, Malaysia.


  • Department of Environment Malaysia, 2010. Malaysia environment quality report 2010. Department of Environment Ministry of Natural Resources and Environment, Malaysia.


  • Evans, M., N. Hastings and B. Peacock, 2000. Statistical Distributions. 3rd Edn., Wiley, New York, ISBN-13: 978-0471371243.


  • Gilbert, R.O., 1987. Statistical Methods for Environmental Pollution Monitoring. John Wiley and Sons, New York, ISBN: 9780471288787, pp: 336


  • Hosking, J.R.M., 1990. L-moment analysis and estimation of distributions using linear combinations of order statistics. J. Royal Stat. Soc., Ser. B, 52: 105-124.


  • Johnson, N., S. Kotz and N. Balakrishnan, 1994. Continuous Univariate Distributions. John Wiley and Sons, New York


  • Juneng, L., M.T. Latif, F.T. Tangang and H. Mansor, 2009. Spatio-temporal characteristics of PM10 concentration across Malaysia. Atmos. Environ., 43: 4584-4594.


  • Juneng, L., M.T. Latif and F. Tangang, 2011. Factors influencing the variations of PM10 aerosol dust in Klang Valley, Malaysia during the summer. Atmos. Environ., 45: 4370-4378.


  • Junninen, H., H. Niska, K. Tuppurainen, J. Ruuskanen and M. Kolehmainen, 2004. Methods for imputation of missing values in air quality data sets. Atmos. Environ., 38: 2895-2907.


  • Kite, G.W., 1977. Frequency and Risk Analysis in Hydrology. Water Resources Publications, Fort Collins.


  • Lopez, R.E., 1977. The lognormal distribution and cumulus cloud populations. Mon. Weather Rev., 105: 865-872.


  • Lu, H.C. and G.C. Fang, 2003. Predicting the exceedences of a critical PM10 concentration: A case study in Taiwan. Atmos. Environ., 37: 3491-3499.


  • Yusof, N.F.F.M., N.A. Ramli, A.S. Yahaya, N. Sansuddin, N.A. Ghazali and W. Al Madhoun, 2010. Monsoonal differences and probability distribution of PM10 concentration. Environ. Monitor. Assess., 163: 655-667.


  • Norazian, M.N., M.M.A. Abdullah, C.Y. Tan, N.A. Ramli, A.S. Yahaya and N.F.M.Y. Fitri, 2011. Modelling of PM10 concentration for industrial area in Malaysia: A case study in Shah Alam. Phys. Procedia, 22: 318-324.


  • Sansuddin, N., N.A. Ramli, A.S. Yahaya, N.F.F.M. Yusof, N.A. Ghazali and W.A.A. Madhoun, 2011. Statistical analysis of PM10 concentrations at different locations in Malaysia. Environ. Monit. Assess., 180: 573-588.


  • Taylor, J.A., A.J. Jakeman and R.W. Simpson, 1986. Modelling distributions of air pollutant concentrations: Identification of statistical models. Atmos. Environ., 20: 1781-1789.


  • Ul-Saufie, A.Z., A.S. Yahaya, N.A. Ramli and H.A. Hamid, 2012. Performance of multiple linear regression model for long-term PM10 concentration prediction based on gaseous and meteorological parameters. J. Applied Sci., 12: 1488-1494.


  • WHO, 1998. Report of the bioregional workshop on health impacts of haze related air pollution. World Health Organization, Manila.


  • Yahaya, A.S., N.A. Ramli and N.F. Yusof, 2005. Effects of estimating missing values on fitting distribution. Proceedings of the International Conference on Quantitative Sciences and its Applications, December 6-8, 2005, Universiti Utara Malaysia, Malaysia.


  • Yee, E. and R. Chan, 1997. A simple model for the probability density functions of concentration fluctuations in atmospheric plumes. Atmos. Environ., 31: 991-1002.


  • Afroz, R., M.N. Hassan and N.A. Ibrahim, 2003. Review of air pollution and health impacts in Malaysia. Environ. Res., 92: 71-77.
