
Asian Journal of Mathematics & Statistics

Year: 2012 | Volume: 5 | Issue: 4 | Page No.: 121-131
DOI: 10.3923/ajms.2012.121.131
Multiple Regression Models up to First-order Interaction on Hydrochemistry Properties
Aminatul Hawa Yahaya, Noraini Abdullah and H.J. Zainodin

Abstract: This study illustrates the procedure for selecting the best model for estimating Electrical Conductivity (EC) levels from hydrochemistry properties and natural affecting factors using multiple regression. Six independent variables and two dummy variables were considered in this data set. The Multiple Regression (MR) models involved interactions up to the first order, which gave 57 possible models. This study extends prior research, which generated 63 possible models using the same technique but without interactions between the independent variables. The process of obtaining the best model from the total of 120 possible models is illustrated. Backward elimination of the variable with the highest p-value was employed to obtain the selected model. The best model combines single variables and first-order interactions (Li, Mg, Na-SO4, Na-Li, Na-Mg and SO4-Cl). The best model obtained was then verified using the Mean Absolute Percentage Error (MAPE) to measure the model's relative overall fit.


How to cite this article
Aminatul Hawa Yahaya, Noraini Abdullah and H.J. Zainodin, 2012. Multiple Regression Models up to First-order Interaction on Hydrochemistry Properties. Asian Journal of Mathematics & Statistics, 5: 121-131.

Keywords: Multiple regression, first-order interaction variables, backward elimination and dummy variables

INTRODUCTION

Groundwater is one of the major water resources, especially in small tropical islands. Such islands have limited sources of freshwater and no surface water, and are fully reliant on rainfall and groundwater recharge. Over-exploitation of freshwater through pumping wells creates an imbalance in the recharge-discharge equilibrium, resulting in drawdown of the water table or upconing of saltwater (Kristie, 2007). This is often exacerbated by insufficient recharge to the freshwater aquifer, which can occur in times of deficiency. The freshwater aquifer floats on top of the saltwater at the interface because of the density difference between the two water sources. The saltwater tends to form a wedge under the freshwater that extends inland. As saltwater intrusion occurs, this wedge extends further inland and is seen at shallower depths. As a result, wells that previously produced freshwater can see an increase in chloride concentration that makes them unusable for irrigation or potable uses (Baharuddin et al., 2009).

The intrusion of saltwater is a cause of saline water penetration and has been described as the “landward and upward displacement of the freshwater-saltwater interface in coastal aquifers” (Knighton et al., 1991) and as the invasion of fresh or brackish surface water or groundwater by water with higher salinity. Salinity can be explained as the total of all non-carbonate salts dissolved in water. It is a measure of the total salt concentration, which consists mostly of Na+ and Cl¯ ions. Although there are smaller quantities of other ions in seawater (e.g., K+, Mg2+ or SO42¯), sodium and chloride ions represent about 91% of all seawater ions (Al-Naeem, 2008). Salinity is an important measurement where freshwater mixes with salty water (Abdullah et al., 2011). Chloride, an ion of the element chlorine, is naturally abundant in seawater. High chloride concentrations are often used as an indicator that seawater intrusion is occurring at a well, but they are not a conclusive confirmation (Naeem et al., 2007).

In general, the Total Dissolved Solids (TDS) concentration is the amount of cations (positively charged ions) and anions (negatively charged ions) in the water. Thus, the salinity of the water can be determined from the TDS concentration (Mitra et al., 2007). Electrical Conductivity (EC) is a useful indicator of TDS because the conduction of current in an electrolyte solution depends primarily on the concentration of ionic species. EC is proportional to the sum of cations and anions and roughly equivalent to TDS in water. Solids can be found in nature in dissolved form: salts that dissolve in water break into positively and negatively charged ions. Conductivity is the capability of water to conduct an electrical current, and the dissolved ions are the conductors (Alslaibi et al., 2011). The mixing of fresh water and saline water is best monitored by measuring the electrical conductivity. Such monitoring is also used for separating stream hydrographs and for geophysical mapping of contaminated groundwater. For example, distilled water typically has an EC of less than 0.3 μS cm-1, whereas in groundwater EC values greater than 500 μS cm-1 indicate that the water may be polluted. The EC value of drinking water should be no more than 2500 μS cm-1. Water with a higher TDS may have water quality problems and be unpleasant to drink (Thirumalini and Joseph, 2009). Major chemical elements such as Na+, Cl¯, K+, Mg2+ and SO42¯ play a significant role in classifying and assessing groundwater quality. These ionic species have the ability to carry an electric current: the more dissolved ionic solutes in the water, the greater its EC. Warm weather in small tropical islands can increase water salinity through evaporation. This will lead to more widespread and severe groundwater quality problems in small tropical islands.

MATERIALS AND METHODS

Study area: Currently, small tropical islands known as tourist attractions depend entirely on shallow-aquifer groundwater supply, mainly for domestic usage and washing purposes. Increased demand from residents and tourists imposes great pressure on the available groundwater resources. Dug wells are used to extract groundwater from the sandy aquifer. Groundwater is pumped routinely using a water pump with an integrated water level meter. Thus, the aquifer in this small tropical island will gradually become vulnerable to seawater intrusion.

Data collection: A total of 59 groundwater samples were collected monthly from October 2008 to March 2009 on Manukan Island, a small tropical island in Sabah, East Malaysia, located in the South China Sea. The groundwater samples were collected from 10 boreholes, which were drilled using a hand auger and aligned perpendicular to the sea. The depth of the boreholes ranged from 1.5 to 3 m.

Fig. 1: Cross section (A-A’) of boreholes installed

The boreholes along cross section A-A’ were installed perpendicular to the coastline, extending inland over a distance of 130 m, with proximity to the sea in the order B1 to B10, as depicted in Fig. 1.

The groundwater was extracted from the boreholes using a portable vacuum pump connected to 0.3 inch polyethylene tubing. Groundwater was allowed to run for approximately 10 min in order to purge several borehole volumes. The main reason was to remove stagnant water and allow representative groundwater to be sampled. Prior to each sample collection, the bottles were rinsed thoroughly with groundwater from the boreholes. In-situ parameters such as EC were measured on site. Water samples to be sent for laboratory analysis of anions and cations were collected in one-liter Polyethylene (PE) bottles. After being filled with samples, the bottles were capped tightly, labeled and stored in a cooler box.

STATISTICAL ANALYSES

Preliminary study: The dataset used in this paper underwent a preliminary study using Factor Analysis (FA). Factor analysis was performed on a subset of 19 selected variables (pH, EC, Ca, Mg, Na, K, HCO3, Cl, SO4, H4SiO4, Al, Ba, Be, Fe, Li, Mn, Pb, Se and Sr) that represented the overall groundwater chemistry framework. Five factors were extracted from the rotated component matrix. High positive loadings of Na, EC, SO4, Li, K, Mg and Cl on Factor 1 indicated that the groundwater chemical composition was largely influenced by a marine signature, as these ions are predominant in seawater (Voudouris et al., 2000). For the Multiple Regression Analysis (MRA), only parameters from Factor 1 were used. Two dummy variables (Tides and Borehole position) were created for inclusion in the model-building process, as shown in Table 1.

This variable set has previously been analysed using MRA without any interactions. Only the single variables were used, and the final model (M63.0.6) showed that only Sodium and Borehole Position have a significant effect in EC estimation. The details of that study are explained in Lin et al. (2012).

Multiple regression (MR) models with interaction: Multiple regression analysis, a form of general linear modelling (Hair et al., 2010) is a statistical technique that can be used to analyze the relationship between a single dependent (criterion) variable and several independent (predictor) variables.

Table 1: Description of variable involved in the models

The objective of regression analysis is to predict a single Dependent Variable (DV) from the knowledge of one or more Independent Variables (IVs). Interaction effects represent the combined effects of variables on the criterion or dependent measure. When an interaction effect is present, the impact of one variable depends on the level of the other variable. Part of the power of MR is the ability to estimate and test interaction effects when the predictor variables are either categorical or continuous. As Pedhazur and Schmelkin (1991) noted, the idea that multiple effects should be studied in research, rather than the isolated effects of single variables, is one of the important contributions of Sir Ronald Fisher. When interaction effects are present, interpretation of the individual variables alone may be incomplete or misleading. The specific MR model explained by Lind et al. (2005) can be stated as follows:

(1)

where Yi is the random variable representing the ith value of the DV Y, and X1i, X2i, …, Xki are the ith values of the IVs, for i = 1, 2, …, n.
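To make the role of an interaction term concrete, the following minimal sketch fits a two-predictor model with one first-order interaction by ordinary least squares. The data are synthetic and noise-free, and the helper names (`solve_linear`, `fit_ols`) are illustrative assumptions rather than anything used in the study; the normal equations are solved directly only to keep the example self-contained.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


# Synthetic, noise-free data (hypothetical): y = 2 + 0.5*x1 - 1.0*x2 + 0.25*x1*x2
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
y = [2 + 0.5 * a - 1.0 * b + 0.25 * a * b for a, b in zip(x1, x2)]

# Design rows: intercept, x1, x2 and the first-order interaction x1*x2
design = [[1.0, a, b, a * b] for a, b in zip(x1, x2)]
coefficients = fit_ols(design, y)
print([round(c, 6) for c in coefficients])
```

Because the interaction coefficient is nonzero, the effect of `x1` on `y` changes with the level of `x2`, which is the situation the paragraph above describes.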

Models results
All possible models: In developing the MR models for this dataset, Electrical Conductivity (EC) is the Dependent Variable (DV), denoted by Y, whereas Na (X1), SO4 (X2), Li (X3), K (X4), Cl (X5) and Mg (X6) are the Independent Variables (IV). Tides (T) and borehole position (P) were included in the models as independent dummy variables. The dummy variables were excluded when calculating the number of possible models but were included in the models before the next model-building procedure was carried out. The number of all possible models, N, can be calculated using the formula:

(2)

where N is the number of possible models generated, q is the number of variables and j = 1, 2, …, q.

For this study, q = 6 (excluding the two dummy variables), so the number of possible models is:

(3)

Table 2: Summary of all possible models

A summary of all possible models is shown in Table 2. In this study, 120 models were considered for further analysis because interactions with the dummy variables can only be taken up to the first order (shaded area). The total number of models considered in this analysis is 120 = single variables (63 models) + first-order interaction variables (57 models).
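The model counts quoted here can be reproduced with a short sketch. Both the count formula N = Σ j·C(q, j) and the per-order breakdown are assumptions: the formula itself is not reproduced in this text, but for q = 6 they give totals that match the reported 63 single-variable models, 57 first-order models and the 72 models left for higher-order interactions.

```python
from math import comb


def total_models(q):
    """Assumed count formula: N = sum over j = 1..q of j * C(q, j)."""
    return sum(j * comb(q, j) for j in range(1, q + 1))


def models_at_order(q, order):
    """Assumed breakdown: a model built from j variables can carry
    interactions up to order j - 1, so interaction level `order`
    (0 = single variables only) contains sum_{j > order} C(q, j) models."""
    return sum(comb(q, j) for j in range(order + 1, q + 1))


q = 6  # six independent variables, dummy variables excluded
by_order = [models_at_order(q, m) for m in range(q)]

print(total_models(q))                   # 192
print(by_order)                          # [63, 57, 42, 22, 7, 1]
print(by_order[0] + by_order[1])         # 120 models up to first order
print(sum(by_order[2:]))                 # 72 models with higher-order interactions
```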

Selected models: Multicollinearity is the intercorrelation of the IVs. A higher correlation coefficient increases the standard errors of the beta coefficients and makes assessment of the unique role of each independent variable difficult or impossible. Multicollinearity is taken to exist if the correlation coefficient exceeds 0.95. The Zainodin-Noraini multicollinearity remedial procedures were applied; details are explained in Abdullah et al. (2011) and Zainodin et al. (2011). Pearson correlation analysis verified the existence of multicollinearity between the IVs in M116, and seven variables (X1P, X2X3, X1T, X2X6, P, X3X5, X3X6) were eliminated from the model, giving M116.7.0.
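A simple screen for the 0.95 threshold can be sketched as follows. This is not the Zainodin-Noraini remedial procedure, which also decides which variable of a correlated pair to drop; it only flags candidate pairs. The use of the absolute value of the correlation, the function names and the data are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def flag_collinear_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical predictor columns; X2 is nearly a multiple of X1
columns = {
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "X2": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "X3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}
pairs = flag_collinear_pairs(columns)
for name_a, name_b, r in pairs:
    print(name_a, name_b, round(r, 4))
```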

Next, the coefficient test is carried out to eliminate insignificant variables by backward elimination, as shown by Abdullah et al. (2008). To justify the removal of the insignificant variables, the Wald test (Ramanathan, 2002) is applied once all the eliminations have been completed. In this step, a total of 14 insignificant variables were eliminated from model M116.7.0. At the end of this phase, only six variables were left in the model (i.e., model M116.7.14). Table 3 shows the variables entered before the elimination procedure and those remaining after the elimination of insignificant variables.
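The elimination loop can be sketched as below, under two stated simplifications. Within one fitted model, the variable with the highest p-value is also the one with the smallest |t|, so the sketch ranks by |t| instead of computing p-values; and the stopping rule uses a caller-supplied critical value (`t_critical`, a placeholder for the appropriate t quantile). The data are a synthetic two-level design chosen so that the outcome is easy to verify by hand; none of the names or numbers come from the study.

```python
from itertools import product
from math import sqrt


def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n:] for row in m]


def t_statistics(columns, names, y):
    """Fit OLS with an intercept; return {name: t-statistic} for each predictor."""
    n = len(y)
    rows = [[1.0] + [columns[name][i] for name in names] for i in range(n)]
    k = len(names) + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    inv = invert(xtx)
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    sigma2 = sse / (n - k)  # residual variance estimate
    return {name: beta[i + 1] / sqrt(sigma2 * inv[i + 1][i + 1])
            for i, name in enumerate(names)}


def backward_eliminate(columns, y, t_critical):
    """Repeatedly drop the weakest predictor until all have |t| >= t_critical."""
    names = list(columns)
    while names:
        t = t_statistics(columns, names, y)
        weakest = min(names, key=lambda name: abs(t[name]))
        if abs(t[weakest]) >= t_critical:
            break
        names.remove(weakest)
    return names


# Synthetic two-level design: y depends on X1 and X2 only; the small
# disturbance is orthogonal to every predictor, so X3 has no effect.
levels = list(product([-1.0, 1.0], repeat=3))
columns = {"X1": [r[0] for r in levels],
           "X2": [r[1] for r in levels],
           "X3": [r[2] for r in levels]}
noise = [0.1 * a * b * c for a, b, c in levels]
y = [1.0 + 2.0 * a + 3.0 * b + e for (a, b, c), e in zip(levels, noise)]

selected = backward_eliminate(columns, y, t_critical=2.0)
print(selected)
```

Refitting after every removal matters because the t-statistics of the remaining variables change once a variable leaves the model.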

Eight criteria of model selection (8SC): Identification of the best model is based on the Eight Selection Criteria (8SC), as shown in Abdullah et al. (2011). The objective is to find the model with the lowest value of each criterion statistic. The criterion statistics are calculated from the Sum of Squared Errors (SSE), the number of estimated parameters and the sample size. Table 4 shows the details of each model selection criterion.

where n is the number of observations, (k+1) is the number of model parameters and SSE is the sum of squared errors. The Akaike Information Criterion (AIC) (Akaike, 1974) and Finite Prediction Error (FPE) (Akaike, 1970) were developed by Akaike. The Generalised Cross Validation (GCV) was developed by Golub et al. (1979), while the HQ criterion was suggested by Hannan and Quinn (1979). The RICE criterion is discussed by Rice (1984) and the SCHWARZ criterion by Schwarz (1978). The SGMASQ was developed by Ramanathan (2002) and the Shibata criterion was suggested by Shibata (1981).
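Since Table 4 is not reproduced in this text, the sketch below uses the forms of the eight criteria as commonly tabulated in econometrics textbooks such as Ramanathan (2002); these forms should be treated as an assumption and checked against Table 4. Each is SSE/n multiplied by a penalty that grows with the number of parameters. The candidate models, their SSE values and the sample size are hypothetical.

```python
from math import exp, log


def eight_selection_criteria(sse, n, n_params):
    """Eight criteria, each (SSE / n) times a penalty; n_params plays the role of (k + 1)."""
    p = n_params
    base = sse / n
    return {
        "AIC": base * exp(2 * p / n),
        "FPE": base * (n + p) / (n - p),
        "GCV": base * (1 - p / n) ** -2,
        "HQ": base * log(n) ** (2 * p / n),
        "RICE": base * (1 - 2 * p / n) ** -1,
        "SCHWARZ": base * n ** (p / n),
        "SGMASQ": base * (1 - p / n) ** -1,
        "SHIBATA": base * (n + 2 * p) / n,
    }


def best_by_majority(models, n):
    """Pick the model that attains the lowest value on the most criteria."""
    scores = {name: eight_selection_criteria(sse, n, p)
              for name, (sse, p) in models.items()}
    wins = {name: 0 for name in models}
    for criterion in next(iter(scores.values())):
        winner = min(scores, key=lambda name: scores[name][criterion])
        wins[winner] += 1
    return max(wins, key=wins.get), wins


# Hypothetical candidates: name -> (SSE, number of parameters)
candidates = {"A": (120.0, 7), "B": (118.0, 12)}
best, wins = best_by_majority(candidates, n=56)
print(best, wins)
```

Model B has the slightly smaller SSE, but its five extra parameters are penalised by every criterion, so the more parsimonious model A is preferred.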

Table 3: Model M116.7.0 with entered variable before elimination procedure of insignificant variables and model M116.7.14 with remaining variable after elimination procedure of insignificant variables

Table 4: Eight selection criteria (8SC) for best model identification

From the 192 possible models generated at this stage of the analysis, only 67 models were selected. Models with the same SSE value and number of model parameters were grouped, and any model from such a group can serve as the selected model. The best model was then chosen from the selected models using the 8SC, based on the majority of least values, as shown in Table 5. The best model selected is M116.7.14.


Table 5: Value of 8SC for all selected models

Best model verification: By using the Wald test, the complete model (M116) was taken as initial possible model and M116.7.14 as the reduced model. The complete (C) model (M116):

(4)

The reduced (R) model (M116.7.14):

(5)

Hypothesis:

H0: β1 = β2 = β5 = β15 = β23 = β26 = β35 = β36 = β56 = βT = βP = β1T = β2T = β5T = β6T = β1P = β2P = β5P = β6P = 0

H1: At least one of the βs is nonzero

Decision:

The critical value from the F distribution is Ftable = F0.05, 28, 49 = 1.80 and the calculated value is Fcal = 0.8358. Since Fcal is less than Ftable, the decision is not to reject H0. The removal of the insignificant variables in the coefficient test is therefore justified.
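For reference, the calculated F value in a test of this kind compares the increase in SSE caused by the restrictions with the residual variance of the complete model. The sketch below uses the standard form for nested linear models; the SSE values and counts are hypothetical, since the study's own sums of squares are not reproduced here, and the function name is an assumption.

```python
def wald_f_statistic(sse_reduced, sse_complete, n_restrictions,
                     n_obs, n_params_complete):
    """F = [(SSE_R - SSE_C) / m] / [SSE_C / (n - K)] for nested linear models.

    m is the number of coefficients set to zero in the reduced model and
    K is the number of parameters (including the constant) in the complete model.
    """
    numerator = (sse_reduced - sse_complete) / n_restrictions
    denominator = sse_complete / (n_obs - n_params_complete)
    return numerator / denominator


# Hypothetical values, not taken from the study
f_cal = wald_f_statistic(sse_reduced=130.0, sse_complete=100.0,
                         n_restrictions=5, n_obs=56, n_params_complete=12)
print(round(f_cal, 4))  # 2.64
```

A calculated value below the critical value means the dropped variables do not add significant explanatory power, which is the conclusion reached above.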

The final phase of model building is to apply goodness-of-fit tests to the final best model. These comprise a randomness test and a normality test. The randomness test determines whether the residuals are randomly distributed, and the normality test, based on the Kolmogorov-Smirnov statistic, checks that the normality assumption is not violated. For the runs test, the test value is 3.7767 and |Z| = 0.192, with asymp. sig. (2-tailed) = 0.847 > 0.05; therefore, H0 is not rejected and there is no evidence against the residuals being randomly distributed. Since the Kolmogorov-Smirnov statistic (0.192) gives a p-value of 0.200 > 0.05, H0 is again not rejected: at the 0.05 significance level, there is no evidence that the standardized residuals depart from normality. This conclusion is supported by the scatter plot and histogram in Fig. 2.
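The randomness check can be illustrated with a minimal runs test about a cut point, using the usual normal approximation. The choice of the median as the default cut point and the residual values are assumptions; the study reports only its test value, |Z| and significance. The last lines check that the reported |Z| = 0.192 is consistent with the reported two-tailed significance.

```python
from math import sqrt
from statistics import NormalDist, median


def runs_test(values, cut=None):
    """Runs test about a cut point; returns (runs, z, two-tailed p-value)."""
    if cut is None:
        cut = median(values)  # assumed default cut point
    signs = [v >= cut for v in values]
    n1 = sum(signs)
    n2 = len(signs) - n1
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (runs - expected) / sqrt(variance)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return runs, z, p_value


# Hypothetical residuals
residuals = [0.5, -0.3, 0.8, -0.6, -0.2, 0.4, 0.1, -0.7, 0.3, -0.1, -0.4, 0.6]
runs, z, p_value = runs_test(residuals)
print(runs, round(z, 4), round(p_value, 4))

# Two-tailed significance implied by the reported |Z| = 0.192
p_reported = 2 * (1 - NormalDist().cdf(0.192))
print(round(p_reported, 3))  # close to the reported 0.847
```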

From here, the best regression model would therefore be represented by:

(6)

where X3 is Lithium, X6 is Magnesium, X1X2 is the interaction between Sodium and Sulphate, X1X3 is the interaction between Sodium and Lithium, X1X6 is the interaction between Sodium and Magnesium and X2X5 is the interaction between Sulphate and Chloride. These interaction factors could be considered to reflect ion-exchange reactions between the groundwater and the aquifer matrix, corresponding to the positive loadings, especially between Sodium and Sulphate.

Fig. 2(a-b): (a) Standardized residual scatter plot and (b) Histogram with normal curve

Sodium played a substantial role in controlling the behaviour of Sulphate and Magnesium in the shallow aquifer which will increase the salinity (EC level).

Model accuracy measurement: The Mean Absolute Percentage Error (MAPE) is commonly used in quantitative forecasting methods because it produces a measure of relative overall fit. The absolute values of all the percentage errors are summed and the average is computed (Levy and Lemeshow, 1991). In this study, MAPE is used to verify the best model obtained. It expresses accuracy as a percentage and is defined by the formula:

\mathrm{MAPE} = \frac{100}{a} \sum_{t=1}^{a} \left| \frac{A_t - F_t}{A_t} \right| \qquad (7)

where At is the actual value and Ft is the forecast (estimated) value. The difference between At and Ft is divided by the actual value At. The absolute value of this ratio is summed over every fitted or forecast point and divided by the total number of fitted points, a. In this case a = 3, the number of observations reserved for this purpose. In general, a MAPE of 10% is considered very good, while a MAPE in the range 11-25% or even higher is quite common. The lower the MAPE value, the better the model can be used for forecasting or for estimating missing values. Substituting the remaining observations that had not been included in the model-building analysis gives a MAPE of 2.022%. This value indicates that the model is well suited to estimating missing values or forecasting.
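The calculation described above is short enough to state directly in code. The following sketch follows the verbal definition, with the result expressed as a percentage; the three actual and estimated EC values are hypothetical stand-ins for the reserved observations, which are not listed in this text.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal in length")
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical reserved observations (a = 3) and their model estimates
actual = [100.0, 200.0, 400.0]
forecast = [98.0, 204.0, 392.0]
print(round(mape(actual, forecast), 3))  # 2.0
```

Because each error is scaled by its own actual value, MAPE is undefined when any actual value is zero, which is not a concern for EC measurements.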

CONCLUSION

EC is widely used for monitoring the mixing of fresh and saline water. Groundwater with a high EC level is not appropriate for drinking because of its high salinity and elevated concentrations of several minor elements. In this study, the model obtained clearly shows the contribution of each parameter in determining the EC level. Na is a dominant ion of seawater, so high levels of Na ions in coastal groundwater may indicate a significant effect of seawater mixing. However, Na is not independently significant in the estimation of the EC level; rather, the interactions Na-SO4, Na-Li and Na-Mg have a significant impact in the model. Two dummy variables (Tides and Borehole position) were created for the model-building process but were eliminated during modelling, as they did not show any significant effect in the estimation of the EC level. The application of the model to such an island proved useful in demonstrating the mechanism of seawater intrusion when monitoring water quality. The use of variable interaction effects in the statistical model, especially for environmental datasets, has shown a significant impact consistent with environmental theory. Interpretation of environmental theory supported by statistical modelling plays an important role in problem solving and decision making. For further analysis, the remaining 72 models (Table 2) will be analysed using MRA with higher-order interactions. The highest interaction order that can be considered for this dataset is the fifth. With higher-order interaction effects, the model is expected to give more significant results.

ACKNOWLEDGMENT

The data collection for this study was financially supported by the Ministry of Science, Technology and Innovation, Malaysia (under Science Fund Grant No 04-01-10-SF0065). The authors thank Mr. Lin Chin Yik and his project team members for providing the data for this research. The authors would also like to thank the anonymous reviewers for their useful comments and enlightening ideas.

REFERENCES

  • Al-Naeem, A.A., 2008. Hydrochemical processes and metal composition of Ain Umm-Sabah natural spring in Al-Hassa Oasis Eastern province, Saudi Arabia. Pak. J. Biol. Sci., 11: 244-249.
  • Akaike, H., 1970. Statistical predictor identification. Ann. Inst. Stat. Math., 22: 203-217.
  • Akaike, H., 1974. A new look at the statistical model identification. IEEE Trans. Autom. Control, 19: 716-723.
  • Alslaibi, T.M., Y.K. Mogheir and S. Afifi, 2011. Assessment of groundwater quality due to municipal solid waste landfills leachate. J. Environ. Sci. Technol., 4: 419-436.
  • Baharuddin, M.F.T., R. Hashim and S. Taib, 2009. Electrical imaging resistivity study at the coastal area of Sungai Besar, Selangor, Malaysia. J. Applied Sci., 9: 2897-2906.
  • Golub, G.H., M. Heath and G. Wahba, 1979. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21: 215-223.
  • Hair, J.F., W.C. Black, B.J. Babin and R.E. Anderson, 2010. Multivariate Data Analysis: A Global Perspective. 7th Edn., Pearson Education Inc., Upper Saddle River, NJ., USA., ISBN-13: 9780135153093, Pages: 800
  • Hannan, E.J. and B.G. Quinn, 1979. The determination of the order of an autoregression. J. R. Stat. Soc. Ser. B: (Methodol.), 41: 190-195.
  • Knighton, A.D., K. Mills and C.D. Woodroffe, 1991. Tidal-creek extension and saltwater intrusion in Northern Australia. Geology, 19: 831-834.
  • Kristie, W., 2007. Salinity Management Handbook. West Region Publ., South Queensland, Australia
  • Levy, P. and S. Lemeshow, 1991. Sampling of Populations: Methods and Applications. John Wiley and Sons Inc., New York, USA
  • Lin, C.Y., M.H. Abdullah, S.M. Praveena, H.Y. Aminatul and B. Musta, 2012. Delineation of temporal variability and governing factors influencing the spatial variability of shallow groundwater chemistry in a tropical sedimentary Island. J. Hydrol., 432-433: 26-42.
  • Lind, D.A., W.G. Marchal and R.D. Mason, 2005. Statistical Techniques in Business and Economics. 11th Edn., McGraw-Hill Inc., New York, USA
  • Mitra, B.K., C. Sasaki, K. Enari, N. Matsuyama and S. Pongpattanasiri, 2007. Suitability assessment of shallow groundwater for irrigation in sand dune area of northwest Honshu Island, Japan. Int. J. Agric. Res., 2: 518-527.
  • Naeem, M., K. Khan, S. Rehman and J. Iqbal, 2007. Environmental assessment of ground water quality of Lahore Area, Punjab, Pakistan. J. Applied Sci., 7: 41-46.
  • Abdullah, N., H.J. Zainodin and A. Ahmed, 2011. Improved stem volume estimation using P-value approach in polynomial regression models. Res. J. Forest., 5: 50-65.
  • Abdullah, N., Z.H.J. Jubok and J.B.N. Jonney, 2008. Multiple regression models of the volumetric stem biomass. WSEAS Trans. Math., 7: 492-502.
  • Pedhazur, E.J. and L.P. Schmelkin, 1991. Measurement, Design and Analysis: An Integrated Approach. Routledge, Hillsdale, NJ., USA., ISBN-13: 9780805810639, Pages: 819
  • Ramanathan, R., 2002. Introductory Econometrics with Applications. 5th Edn., Harcourt College Publishers, Ohio, USA., ISBN-13: 9780030341861, Pages: 688
  • Rice, J., 1984. Bandwidth choice for nonparametric regression. Ann. Stat., 12: 1215-1230.
  • Schwarz, G., 1978. Estimating the dimension of a model. Ann. Stat., 6: 461-464.
  • Shibata, R., 1981. An optimal selection of regression variables. Biometrika, 68: 45-54.
  • Thirumalini, S. and K. Joseph, 2009. Correlation between electrical conductivity and total dissolved solids in natural waters. Malays. J. Sci., 28: 55-61.
  • Voudouris, K., A. Panagopoulos and J. Koumantakis, 2000. Multivariate statistical analysis in the assessment of hydrochemistry of the Northern Korinthia prefecture alluvial aquifer system (Peloponnese, Greece). Nat. Resour. Res., 9: 135-143.
  • Zainodin, H.J., A. Noraini and S.J. Yap, 2011. An alternative multicollinearity approach in solving multiple regression problem. Trends Applied Sci. Res., 6: 1241-1255.
