Abstract: This study illustrates the procedure for selecting the best model for estimating Electrical Conductivity (EC) levels from hydrochemical properties and natural influencing factors using multiple regression. Six independent variables and two dummy variables were considered in this data set. The Multiple Regression (MR) models involved interactions up to the first order, giving 57 possible models. This study extends prior research that generated 63 possible models using the same technique but without interaction between the independent variables. Here, the process of obtaining the best model from the total of 120 possible models is illustrated. Backward elimination of the variable with the highest p-value was employed to obtain the selected models. The best model combines single variables and first-order interactions (Li, Mg, Na-SO4, Na-Li, Na-Mg and SO4-Cl). The best model obtained was then verified using the Mean Absolute Percentage Error (MAPE), which measures the model's relative overall fit.
INTRODUCTION
Groundwater is one of the major water resources, especially in small tropical islands. Such islands have limited sources of freshwater, have no surface water and are fully reliant on rainfall and groundwater recharge. Over-exploitation of freshwater through pumping wells disturbs the recharge-discharge equilibrium, resulting in drawdown of the water table or upconing of saltwater (Kristie, 2007). This is often exacerbated by insufficient recharge to the freshwater aquifer, which can occur in times of rainfall deficiency. Freshwater floats on top of the saltwater at the interface because of the density difference between the two water bodies. The saltwater tends to form a wedge under the freshwater that extends inland. As saltwater intrusion occurs, this wedge extends further inland and is seen at shallower depths. As a result, wells that previously produced freshwater can see an increase in chloride concentration that makes them unusable for irrigation or potable uses (Baharuddin et al., 2009).
Saltwater intrusion is the penetration of saline water caused by the landward and upward displacement of the freshwater-saltwater interface in coastal aquifers (Knighton et al., 1991); more generally, it is the invasion of fresh or brackish surface water or groundwater by water of higher salinity. Salinity can be described as the total of all non-carbonate salts dissolved in water. It is a measure of the total salt concentration, which consists mostly of Na+ and Cl¯ ions. Although seawater contains smaller quantities of other ions (e.g., K+, Mg2+ or SO42¯), sodium and chloride ions represent about 91% of all seawater ions (Al-Naeem, 2008). Salinity is an important measurement where freshwater mixes with salty water (Abdullah et al., 2011). Chloride, an ion of the element chlorine, is naturally abundant in seawater. High chloride concentrations are often used as an indicator that seawater intrusion is occurring at a well, but they are not conclusive confirmation (Naeem et al., 2007).
In general, the Total Dissolved Solids (TDS) concentration is the sum of the cations (positively charged ions) and anions (negatively charged ions) in the water. Thus, the salinity of water can be determined from the TDS concentration (Mitra et al., 2007). Solids can be found in nature in dissolved form: salts that dissolve in water break into positively and negatively charged ions. Conductivity is the capability of water to conduct an electrical current, and the dissolved ions are the conductors (Alslaibi et al., 2011). Electrical Conductivity (EC) is a useful indicator of TDS because the conduction of current in an electrolyte solution depends primarily on the concentration of ionic species. EC is proportional to the sum of cations and anions and is roughly equivalent to TDS in water. Measuring EC is therefore the best method of monitoring the mixing of fresh water and saline water. EC monitoring is also used for separating stream hydrographs and for geophysical mapping of contaminated groundwater. For example, distilled water should typically have an EC of less than 0.3 μS cm-1, whereas EC values greater than 500 μS cm-1 in groundwater indicate that the water may be polluted. The EC of drinking water should be no more than 2500 μS cm-1. Water with a higher TDS may have water quality problems and be unpleasant to drink (Thirumalini and Joseph, 2009). Major chemical constituents such as Na+, Cl¯, K+, Mg2+ and SO42¯ play a significant role in classifying and assessing groundwater quality. These ionic constituents are able to carry an electric current, so the more dissolved ionic solutes there are in water, the greater its EC. Warm weather in small tropical islands can increase water salinity through evaporation. This will lead to more widespread and severe groundwater quality problems in small tropical islands.
MATERIALS AND METHODS
Study area: Currently, the small tropical island, which is known as a tourist attraction, depends entirely on a shallow aquifer for its groundwater supply, mainly for domestic use and washing. The increasing demands of residents and tourists impose great pressure on the available groundwater resources. Dug wells are used to extract groundwater from the sandy aquifer. Groundwater is pumped routinely using a water pump with an integrated water level meter. Thus, the aquifer of this small tropical island is gradually becoming vulnerable to seawater intrusion.
Data collection: A total of 59 groundwater samples were collected monthly from October 2008 to March 2009 on Manukan Island, a small tropical island off Sabah, East Malaysia, in the South China Sea. The groundwater samples were collected from 10 boreholes, which were drilled using a hand auger along a line perpendicular to the coastline. The depth of the boreholes ranged from 1.5 to 3 m.
Fig. 1: Cross section (A-A) of boreholes installed
The boreholes along cross section (A-A) were installed perpendicular to the coastline, extending inland over a distance of 130 m, and are ordered from B1 to B10 by proximity to the sea, as depicted in Fig. 1.
The groundwater was extracted from the boreholes using a portable vacuum pump connected to 0.3 inch polyethylene tubing. Groundwater was allowed to run for approximately 10 min in order to purge several borehole volumes, which removes stagnant water and allows representative groundwater to be sampled. Prior to each sample collection, the bottles were rinsed thoroughly with groundwater from the borehole. In-situ parameters such as EC were measured on site. Water samples for laboratory analysis of anions and cations were collected in one-liter Polyethylene (PE) bottles. After filling, the bottles were capped tightly, labeled and stored in a cooler box.
STATISTICAL ANALYSES
Preliminary study: The dataset used in this paper had undergone a preliminary study using Factor Analysis (FA). Factor analysis was performed on a subset of 19 selected variables (pH, EC, Ca, Mg, Na, K, HCO3, Cl, SO4, H4SiO4, Al, Ba, Be, Fe, Li, Mn, Pb, Se and Sr) that represented the overall groundwater chemistry framework. Five factors were extracted from the rotated component matrix. High positive loadings of Na, EC, SO4, Li, K, Mg and Cl on Factor 1 indicated that the groundwater chemical composition was largely influenced by a marine signature, as these ions are predominant in seawater (Voudouris et al., 2000). For the Multiple Regression Analysis (MRA), only parameters from Factor 1 were used. Two dummy variables (Tides and Borehole position) were created for inclusion in the model-building process, as shown in Table 1.
This set of variables was previously analyzed using MRA without any interaction. Only single variables were used, and the final model (M63.0.6) showed that only Sodium and Borehole position have a significant effect on EC estimation. The details of that study are given in Lin et al. (2012).
Multiple regression (MR) models with interaction: Multiple regression analysis, a form of general linear modelling (Hair et al., 2010), is a statistical technique that can be used to analyze the relationship between a single dependent (criterion) variable and several independent (predictor) variables.
Table 1: Description of variables involved in the models
The objective of regression analysis is to predict a single Dependent Variable (DV) from the knowledge of one or more Independent Variables (IVs). Interaction effects represent the combined effects of variables on the criterion or dependent measure. When an interaction effect is present, the impact of one variable depends on the level of the other variable. Part of the power of MR is the ability to estimate and test interaction effects when the predictor variables are either categorical or continuous. As Pedhazur and Schmelkin (1991) noted, the idea that multiple effects should be studied in research, rather than the isolated effects of single variables, is one of the important contributions of Sir Ronald Fisher. When interaction effects are present, interpretation of the individual variables alone may be incomplete or misleading. The specific MR model explained by Lind et al. (2005) can be stated as follows:
(1) |
where, Yi is the random variable representing the ith value of the DV Y, and X1i, X2i, ..., Xki are the ith values of the IVs, for i = 1, 2, ..., n.
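As a minimal sketch of how a first-order interaction enters such a model, the snippet below builds a design matrix from two predictors and their product and estimates the coefficients by ordinary least squares. The data are synthetic, and the names `x1`, `x2` and `fit_ols` are illustrative assumptions rather than variables or routines from this study.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares estimates for y = X @ beta; X must already contain the intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 10.0, size=50)
x2 = rng.uniform(0.0, 5.0, size=50)

# Synthetic, noise-free response in which the effect of x1 depends on the level of x2
y = 2.0 + 0.5 * x1 + 1.5 * x2 + 0.3 * x1 * x2

# Columns: intercept, x1, x2 and the first-order interaction x1*x2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
beta = fit_ols(X, y)
print(np.round(beta, 3))
```

Because the interaction is just another column of the design matrix, its coefficient is estimated and tested in the same way as those of the single variables.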
MODEL RESULTS
All possible models: In the development of the MR models for this dataset, Electrical Conductivity (EC) is the Dependent Variable (DV), denoted by Y, whereas Na (X1), SO4 (X2), Li (X3), K (X4), Cl (X5) and Mg (X6) are the Independent Variables (IVs). Tides (T) and borehole position (P) were included in the models as independent dummy variables. The dummy variables were excluded when calculating the number of possible models but were included in the models before the next model-building procedure was carried out. The number of all possible models, N, can be calculated using the formula:
(2) |
where, N is the number of possible models generated, q is the number of variables and j = 1, 2, ..., q.
For this study, q = 6 (excluding the two dummy variables), so the number of possible models is:
(3) |
Table 2: Summary of all possible models
A summary of all possible models is shown in Table 2. In this study, 120 models were taken into further analysis because interaction with the dummy variables can only be carried out up to the first order (shaded area). The total number of models considered in this analysis is 120 = single variables (63 models) + first-order interaction variables (57 models).
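The model counts quoted in this study can be checked with a short script. It assumes that the counting formula has the form N = Σ j·C(q, j) for j = 1, ..., q, i.e., a set of j variables contributes one model for each interaction order from 0 to j-1. That form is an assumption, since the formula itself is not reproduced in this text, but it agrees with the 63 and 57 models quoted above, the 72 remaining models and the fifth-order limit mentioned in the conclusion.

```python
from math import comb

def models_per_order(q):
    """Map interaction order (0 = single variables only) to the number of candidate models."""
    return {order: sum(comb(q, j) for j in range(order + 1, q + 1))
            for order in range(q)}

q = 6  # independent variables, dummy variables excluded
per_order = models_per_order(q)
total = sum(j * comb(q, j) for j in range(1, q + 1))

print(per_order)                      # models available at each interaction order
print(per_order[0] + per_order[1])    # models analysed here (orders 0 and 1)
print(total)                          # all possible models
```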
Selected models: Multicollinearity is the intercorrelation of IVs. A higher correlation coefficient increases the standard error of the beta coefficients and makes assessing the unique role of each independent variable difficult or impossible. Multicollinearity is taken to exist if the correlation coefficient exceeds 0.95. The Zainodin-Noraini multicollinearity remedial procedures were applied; details are explained in Abdullah et al. (2011) and Zainodin et al. (2011). Pearson correlation analysis verified the existence of multicollinearity between the IVs in M116, and seven variables (X1P, X2X3, X1T, X2X6, P, X3X5, X3X6) were eliminated from the model (giving M116.7.0).
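The screening step can be sketched as follows. This only flags pairs of columns whose Pearson correlation exceeds the 0.95 threshold; it does not reproduce the Zainodin-Noraini remedial procedure, which also decides which variable of a pair to remove. Using the absolute value of the correlation, as well as the function name and the synthetic data, are assumptions made for illustration.

```python
import numpy as np

def high_correlation_pairs(data, names, threshold=0.95):
    """Return (name_a, name_b, r) for every pair of columns with |r| above the threshold."""
    r = np.corrcoef(data, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs

rng = np.random.default_rng(1)
a = rng.normal(size=40)
b = rng.normal(size=40)
c = a + 0.01 * rng.normal(size=40)   # nearly a copy of a, so strongly collinear with it
data = np.column_stack([a, b, c])

flagged = high_correlation_pairs(data, ["A", "B", "C"])
print(flagged)
```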
Next, the coefficient test is carried out to eliminate insignificant variables by backward elimination, as shown by Abdullah et al. (2008). To justify the removal of the insignificant variables, the Wald test (Ramanathan, 2002) is applied to the possible models once all the elimination procedures are complete. In this step, a total of 14 insignificant variables were eliminated from model M116.7.0. At the end of this phase, only six variables remained in the model (i.e., model M116.7.14). Table 3 shows the variables entered before the elimination procedure and the variables remaining after the elimination of insignificant variables.
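A minimal sketch of backward elimination is given below. Within a single fitted model all coefficients share the same residual degrees of freedom, so the variable with the highest p-value is also the one with the smallest |t|; the sketch therefore ranks variables by |t| and stops when every remaining |t| reaches a user-supplied critical value. The function name, the default `t_crit = 2.0` and the synthetic data are assumptions, not the settings used in this study.

```python
import numpy as np

def backward_eliminate(X, y, names, t_crit=2.0):
    """Drop the least significant predictor until every remaining |t| >= t_crit.

    X holds predictor columns only; an intercept is added internally and never removed.
    """
    keep = list(range(X.shape[1]))
    while keep:
        A = np.column_stack([np.ones(len(y)), X[:, keep]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        dof = len(y) - A.shape[1]
        sigma2 = resid @ resid / dof
        se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
        t = np.abs(beta / se)[1:]          # skip the intercept
        worst = int(np.argmin(t))          # smallest |t| = highest p-value
        if t[worst] >= t_crit:
            break
        keep.pop(worst)
    return [names[i] for i in keep]

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
# The third column does not enter the response, so it is a candidate for removal
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=60)
print(backward_eliminate(X, y, ["x1", "x2", "x3"]))
```

In practice `t_crit` would be taken from the t distribution for the chosen significance level and the residual degrees of freedom of the model.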
Eight criteria of model selection (8SC): Identification of the best model is based on the Eight Selection Criteria (8SC), as shown in Abdullah et al. (2011). The objective is to find the model with the lowest value of each criterion statistic. The criterion statistics are calculated from the Sum of Square Error (SSE), the number of estimated parameters and the sample size. Table 4 shows the details of each model selection criterion.
Here, n is the number of observations, (k+1) is the number of model parameters and SSE is the sum of squared errors. The Akaike Information Criterion (AIC) (Akaike, 1974) and Finite Prediction Error (FPE) (Akaike, 1970) were developed by Akaike. The Generalised Cross Validation (GCV) was developed by Golub et al. (1979), while the HQ criterion was suggested by Hannan and Quinn (1979). The RICE criterion is discussed by Rice (1984) and the SCHWARZ criterion by Schwarz (1978). The SGMASQ was developed by Ramanathan (2002) and the SHIBATA criterion was suggested by Shibata (1981).
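Since the body of Table 4 is not reproduced in this text, the sketch below assumes the forms in which these eight criteria are commonly stated, following Ramanathan (2002): each is SSE/n multiplied by a penalty that grows with the number of parameters (k+1). The function names, the candidate labels and the SSE values are hypothetical and are not results from this study.

```python
import math

def eight_selection_criteria(sse, n, k_plus_1):
    """Eight criteria as functions of SSE, sample size n and parameter count (k+1)."""
    p = k_plus_1
    base = sse / n
    return {
        "AIC": base * math.exp(2 * p / n),
        "FPE": base * (n + p) / (n - p),
        "GCV": base * (1 - p / n) ** -2,
        "HQ": base * math.log(n) ** (2 * p / n),
        "RICE": base * (1 - 2 * p / n) ** -1,
        "SCHWARZ": base * n ** (p / n),
        "SGMASQ": base * (1 - p / n) ** -1,
        "SHIBATA": base * (n + 2 * p) / n,
    }

def best_by_majority(models):
    """models: {name: (sse, n, k_plus_1)}. Return the name holding the most criterion minima."""
    scores = {name: eight_selection_criteria(*args) for name, args in models.items()}
    wins = {name: 0 for name in models}
    for crit in next(iter(scores.values())):
        winner = min(scores, key=lambda name: scores[name][crit])
        wins[winner] += 1
    return max(wins, key=wins.get)

# Hypothetical candidates: B fits slightly better but uses more parameters
candidates = {"A": (120.0, 56, 7), "B": (118.0, 56, 12)}
print(best_by_majority(candidates))
```

Choosing the model that attains the minimum on most criteria mirrors the "majority of least values" rule described in the text.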
Table 3: Model M116.7.0 with the variables entered before the elimination of insignificant variables and model M116.7.14 with the variables remaining after the elimination of insignificant variables
Table 4: Eight selection criteria (8SC) for best model identification
Of the 120 possible models generated at this stage of the analysis, only 67 were selected. Selected models with the same SSE value and number of model parameters were grouped, and any model from such a group can serve as the selected model. The best model was then chosen from the selected models using the 8SC, based on the majority of least values, as shown in Table 5. The best model selected is M116.7.14.
Table 5: Value of 8SC for all selected models
Best model verification: By using the Wald test, the complete model (M116) was taken as initial possible model and M116.7.14 as the reduced model. The complete (C) model (M116):
(4) |
The reduced (R) model (M116.7.14):
(5) |
Hypothesis:
H0: β1 = β2 = β5 = β15 = β23 = β26 = β35 = β36 = β56 = βT = βP = β1T = β2T = β5T = β6T = β1P = β2P = β5P = β6P = 0
H1: At least one of these β's is non-zero
Decision:
The critical value from the F distribution is Ftable = F0.05, 28, 49 = 1.80 and the calculated value is Fcal = 0.8358. Since Fcal is less than Ftable, H0 is not rejected. The removal of the insignificant variables in the coefficient test is therefore justified.
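The comparison of a complete and a reduced model can be sketched with the usual F form of the test, F = [(SSE_R - SSE_C)/m] / [SSE_C/(n - K)], where m is the number of restrictions and K the number of parameters in the complete model. Whether this is exactly the form applied by the authors is an assumption, and the SSE values, sample size and parameter counts below are hypothetical, because the corresponding figures for M116 and M116.7.14 are not given in this text.

```python
def wald_f(sse_reduced, sse_complete, n_restrictions, n, k_plus_1_complete):
    """F statistic for H0: the coefficients dropped from the complete model are all zero."""
    numerator = (sse_reduced - sse_complete) / n_restrictions
    denominator = sse_complete / (n - k_plus_1_complete)
    return numerator / denominator

def removal_justified(f_cal, f_table):
    """H0 is not rejected (the removal is justified) when the calculated F is below the table value."""
    return f_cal < f_table

# Hypothetical figures: 10 restrictions, 60 observations, 20 parameters in the complete model
f_cal = wald_f(sse_reduced=130.0, sse_complete=100.0,
               n_restrictions=10, n=60, k_plus_1_complete=20)
f_table = 2.08  # F(0.05; 10, 40) read from a standard table
print(f_cal, removal_justified(f_cal, f_table))
```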
The final phase of model building is applying goodness-of-fit tests to the final best model. These comprise a randomness test and a normality test. The randomness test determines whether the residuals are randomly distributed, and the normality test, based on the Kolmogorov-Smirnov statistic, checks that the normality assumption is not violated. For the runs test, the test value is 3.7767 and |Z| = 0.192 with Asymp. Sig. (2-tailed) = 0.847 > 0.05; therefore, H0 is not rejected, which supports the conclusion that the residuals are randomly distributed. The Kolmogorov-Smirnov statistic (0.192) gives a p-value of 0.200 > 0.05; therefore, H0 is not rejected. At the 0.05 significance level, there is no evidence that the standardized residuals depart from normality. This is supported by the scatter plot and histogram in Fig. 2.
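The randomness check can be illustrated with a minimal Wald-Wolfowitz runs test using the large-sample normal approximation. The residual values and the cut point below are hypothetical, and statistical packages may apply small-sample corrections that this sketch omits. Under this approximation, |Z| = 0.192 corresponds to a two-tailed p-value of about 0.85, in line with the reported 0.847.

```python
import math

def runs_test(residuals, cut):
    """Wald-Wolfowitz runs test about a cut point; returns (z, two-tailed p-value)."""
    signs = [r >= cut for r in residuals]
    n1 = sum(signs)
    n2 = len(signs) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("all residuals fall on one side of the cut point")
    # A new run starts whenever consecutive residuals fall on different sides of the cut
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    mean = 2 * n1 * n2 / (n1 + n2) + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    z = (runs - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))   # two-tailed p-value from the standard normal
    return z, p

residuals = [0.4, -0.2, 0.1, 0.3, -0.5, -0.1, 0.2, -0.3, 0.6, -0.4]  # hypothetical
z, p = runs_test(residuals, cut=0.0)
print(round(z, 3), round(p, 3))
```

Randomness is not rejected at the 0.05 level when the returned p-value exceeds 0.05.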
From here, the best regression model would therefore be represented by:
(6) |
where, X3 is Lithium, X6 is Magnesium, X1X2 is the interaction between Sodium and Sulphate, X1X3 is the interaction between Sodium and Lithium, X1X6 is the interaction between Sodium and Magnesium and X2X5 is the interaction between Sulphate and Chloride. These interaction terms may reflect ion-exchange reactions between the groundwater and the aquifer matrix, corresponding to the positive loadings, especially between Sodium and Sulphate.
Fig. 2(a-b): (a) Standardized residual scatter plot and (b) Histogram with normal curve
Sodium played a substantial role in controlling the behaviour of Sulphate and Magnesium in the shallow aquifer which will increase the salinity (EC level).
Model accuracy measurement: The Mean Absolute Percentage Error (MAPE) is commonly used in quantitative forecasting methods because it produces a measure of relative overall fit. The absolute values of all the percentage errors are summed and the average is computed (Levy and Lemeshow, 1991). In this study, MAPE is used to verify the best model obtained. It expresses accuracy as a percentage and is defined by the formula:
$$\mathrm{MAPE} = \frac{100}{a} \sum_{t=1}^{a} \left| \frac{A_t - F_t}{A_t} \right| \qquad (7)$$
where, At is the actual value and Ft is the forecast (estimated) value. The difference between At and Ft is divided by the actual value At. The absolute value of this ratio is summed over every fitted or forecast point and divided by the total number of fitted points, a. In this case a = 3, the number of observations reserved for this purpose. In general, a MAPE of 10% is considered very good, while a MAPE in the range 11-25% or even higher is quite common. The lower the MAPE, the better the model is for forecasting or estimating missing values. Substituting the remaining observations that were not included in the model-building analysis gives a MAPE of 2.022%. This value indicates that the model is well suited to estimating missing values or forecasting.
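The calculation described above can be written in a few lines. The three actual and estimated EC values below are hypothetical stand-ins for the reserved observations, which are not listed in this text, so the printed figure is not the 2.022% reported for the study.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and of equal length")
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

held_out_actual = [1200.0, 950.0, 1430.0]     # hypothetical EC readings
held_out_estimate = [1185.0, 975.0, 1400.0]   # hypothetical model estimates
print(round(mape(held_out_actual, held_out_estimate), 3))
```

Because each error is divided by the actual value, the measure is undefined when an actual value is zero, which does not arise for EC readings of groundwater.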
CONCLUSION
EC is widely used for monitoring the mixing of fresh and saline water. Groundwater with a high EC level is not appropriate for drinking because of its high salinity and elevated concentrations of several minor elements. In this study, the model obtained clearly shows the contribution of each parameter in determining the EC level. Na is a dominant ion of seawater, and high levels of Na ions in coastal groundwater may indicate a significant effect of seawater mixing. However, Na on its own is not significant in estimating the EC level; rather, the interactions Na-SO4, Na-Li and Na-Mg have a significant impact in the model. Two dummy variables (Tides and Borehole position) were created for the model-building process but were eliminated during modelling, as they do not show any significant effect on the estimation of the EC level. Applying the model to such an island proved useful in demonstrating the mechanism of seawater intrusion when monitoring water quality. The use of variable interaction effects in the statistical model, especially for environmental datasets, has shown a significant impact consistent with environmental theory. Interpretation based on environmental theory, supported by statistical modelling, plays an important role in problem solving and decision making. For further analysis, the remaining 72 models (Table 2) will be analysed using MRA with higher-order interactions. The highest interaction order that can be considered for this dataset is the fifth. With higher-order interaction effects, the model is expected to give more significant results.
ACKNOWLEDGMENT
The data for this study were financially supported by the Ministry of Science, Technology and Innovation, Malaysia (under Science Fund Grant No. 04-01-10-SF0065). The authors thank Mr. Lin Chin Yik and his project team members for providing the data for this research. The authors would also like to thank the anonymous reviewers for their useful comments and enlightening ideas.