Abstract: This study discusses the selection of parametric models of the moisture retention characteristic MRC as nonlinear regression models from a mathematical and statistical viewpoint. Simulation studies and some measures of nonlinearity are given. A comparison is introduced between the used famous models of the moisture retention characteristic (van Genuchten and King) which are presented early and their variants. Following the considered simulation study and the used measures of nonlinearity in nonlinear regression the author find that although a variant of van Genuchten model with four parameters is an optimal model but it has a strong nonlinear parameter. Moreover, a possible appropriate reparametrization with respect to this nonlinear parameter is proposed.
INTRODUCTION
The moisture retention characteristic MRC, which is the relation between water content (the volumetric water content Θ) and pressure head (the soil capillary pressure h), can be measured for natural soils. It is referred to as the soil water retention curve. This relationship is primarily based on the soil pore structure and the pore size distribution. The MRC Θ(h) is typical for a given soil having its particular status of consolidation, geometrical arrangement of particles and aggregation and other chemical and biological feature. The retention curve graphically displays a continuously differentiable (smooth) S-shaped curve between the saturated and residual water contents (Θs and Θr, respectively). Knowledge of the MRC is indispensable in describing soil water processes (flow). The MRC can be determined either directly in the field or in the laboratory on undisturbed core samples. MRC data from Vereecken et al. (1989) are used. These data sampled the soil horizons of forty important Belgian soil series and measured their MRC.
Several of parametric models have been proposed to describe MRC and all these models are nonlinear regression models (NLR-models). Most of the proposed models are curve-fitting equations and they are able to describe the typical S-shaped behavior of MRC and represent NLR-models which can be fitted to give data of MRC. The model-parameters can be estimated by applying algorithms minimizing a least squares (LS) object function (Bates and Watts, 1988).
Mathematical statistics have developed several methods for model selection in nonlinear regression. In this study, the researcher studies problems of model selection for MRC from a statistical and mathematical viewpoint. This study is carried out through the comparison between the used MRC-model proposed by van Genuchten (1980), a MRC model developed by King (1965) and its variants. A main task of the analysis of experimental data is the estimation of the parameters in a NLR-model (Bates and Watts, 1988; Ratkowsky, 1983; Ratkowsky, 1990; Seber and Wild, 2005). Usually it is not known whether a proposed regression model describes the unknown true regression function sufficiently well. But the results of the statistical analysis may heavily depend on the chosen model.
Modelers probably have one of the three purposes in mind when they wish to fit a NLR-model to set of data. First, they are interested in obtaining a good fit to the data. That means, they are primarily interested in the representation of the relationship between the independent and dependent variables by the chosen model. Secondly, they focus on the prediction of values of the dependent (response) variable for given values of the independent (regressor) variable. Moreover in some situations the modelers wish to make inference based upon interpretation of the parameter estimates of the corresponding model. The current study concerns with the second and the third aspects.
A more general NLR-model may be written:
(1) |
where, yij is the values of the response variable Y at fixed values xi of exploratory variables (nonrandom design points). Here, the real valued function f(.,β) is expectation function and it is known up to a P-dimensional vector of parameters β = (β1, β2,...,βp) ∈RP and this function is twice continuously differentiable in β. Moreover, the random perturbations εij (the measurement errors) are uncorrelated random variables with zero mean and unknown variance, σ2 which do not depend on xi. Given the date
(2) |
the estimation of the regression function f reduces to
estimate its parameters. The most popular approach to estimate β
= (β1, β2, ..., βp) is
to employ the ordinary least squares estimator
(3) |
The linearity of the regression model (1) may be produced if the expected response f(x, β) is a linear function of the parameter vector β = (β1, β2,...,βp).
The solution
(4) |
with respect to a linear regression model (LR-model)
will be given by an explicit algebraic expression. This solution, which
is the least squares estimator
For the NLR-models the situation is different. There
is no explicit expression for
NLR-models differ in their estimation properties from linear regression models. Given the usual assumption of independent and identically distributed measurement errors, the parameter (LS) estimators for linear models are linear, unbiased, at least asymptotically normally distributed, minimum variance estimators. On the other hand, NLR-models tend to do so only when the sample size becomes very large.
In this study, the purpose is the exploration and the comparison of the properties of estimators for some different MRC models in situations having sample sizes typically obtained in practice. The MRC models with close-to-linear estimation behavior are looked. The extend of nonlinear estimation behavior (e.g., bias) depends upon the nonlinearity of a model/data set combination. Therefore, this work can be carried out in the first step by using simulation studies with respect to the considering models and in the second step by computing some measure of nonlinearity.
The aim of the presented study is to select an optimal MRC model, to determine the cause of the nonlinearity in this selected model and finally to diminish (or avoid) this defect through a suggested reparameterization.
THE CLASS OF FUNCTIONS USED FOR MODEL SELECTION
van Genuchten (1980) presented a widely used class of functions for parametrizing measured soil water characteristics
(5) |
where α, n and m are positive parameters defining the MRC`s shape and this model is denoted by (VG1). This function is a MRC model of five parameters α, n, m beside, Θs, Θr. Moreover, King (1965) suggested the following model
(6) |
with five real parameters Θs, h0, b, ε and γ where b having negative values and this model denoted by (K1). The Table 1 indicates to special cases (VG2), (VG3), (VG4) and (VG5) of van Genuchten model (VG1) and also the model (K2) as a special case of the model (K1) of King, which are models with fewer parameters:
Each of the models (VG2), (VG3) and (VG4) has four parameters
Θs, Θr, α, n but (VG5) is the model
with only three parameters Θs, α, n. Also, the model
(K2) has four parameters and Θs, h0, b and
γ.
Table 1: | Some variants of van Genuchten model (VG1) and King model (K1) |
USING METHODS TO SELECT A NLR-MODEL FOR MRC
Two methods to select a NLR-model for MRC are used in this study. These methods are described in the following two subsections.
Description of the Simulation Experiments
Simulation studies (experiments) are probably the most direct and
best way to enable the modeler to study the sampling properties of the
LS estimator. Data are generated using a set of predetermined values of
the parameters, allowing only the values of the measurement error to change
randomly or pseudo-randomly form set to set. By this means, many sets
of simulated data produced and each set provides a LS estimates of the
parameters of the model under consideration. These estimated parameters
may be examined for their bias, variance and other distributional properties.
Throughout this study, the reported results of simulation
are based on 100 sets of simulated data. Recall that the NLR-model is
given by (1), where the εij are uncorrelated identically
distributed random variables. The measurement errors are generated under
the considering assumptions (independent identically normally distributed
measurement errors with zero mean and constant variance σ2).
If one assumes such a type of distribution of the measurement error, then
there is a possibility to choose an appropriate value of the variance.
The considering selection of σ is based on MRC data from Vereecken
et al. (1989) (Table 2). The variance is model
independent estimated using the repeated measurements (pure error, intra
sample estimates). The calculated value
Table 3 contains on the true parameters of the model (VG2) for two different soils sand and loam (Carsel and Parrish, 1988).
Every simulated data set has the structure of Vereecken
MRC data. A simulated value is the sum of the expected value corresponding
to the value of pressure and a random error, which is normally distributed
with zero mean and variance
Each set of simulated data is then fitted by least squares.
That means, the vector of parameters β of the considering models
(VG1), (VG2), (VG3), (VG4), (VG5), (K1) and (K2) is estimated.
Table 2: | Measured MRC-data, theoretically expected (MODEL) of sand and the first simulated MRC-data for sand (SIMUL_1) |
Table 3: | True parameters of van Genuchten model (VG2) |
Table 4: | Estimated vector of parameters of the model (VG1) corresponding to the first simulated data set for sand |
Different algorithms are used, for instance the Levenberg/Marquardt
method (Bates and Watts, 1988). In the case of model (VG1) and (VG2),
the true (simulated) vector of parameters is considered as an initial
estimate. Table 4 contains the estimated vector of parameters
For every model under consideration, 100 estimates are produced. Taking each parameter separately, univariate statistics are calculated and the corresponding distribution is graphed. In addition, it is possible to examine the multivariate behavior of the LS estimator by calculating bivariate statistics and using two or three dimensional plots. This means that the researcher can see how close to linear in its behavior the LS estimator is. An acceptable model leads to LS estimations for its parameters with small biases, with distributions close to normal distribution and variances close to the minimal possible variances.
Description of the Measures of Nonlinearity
Another possible way to analyze the nonlinear behavior of a model
data set combination is the calculation of so called measures of nonlinearity
(e.g., curvature, bias and skewness). Using differential geometry-concepts,
the measures of nonlinearity based on the notion of curvature were developed
(Bates and Watts, 1988; El-Shehawy, 2001; El-Shehawy and Karawia, 2006).
These measures are independent of scale changes in both the data and parameters.
They can be used to compare different models with different parameterizations
combined with different data set.
The N-vector η(β) with the components
(7) |
defines a P-dimensional surface, the so called expectation
surface or solution locus in the n-dimensional response space. The LS
estimate
If the quadratic term in the second order Taylor approximation
of η(β) is neglected, one will have for β in the vicinity
of
(8) |
where
(9) |
From some statistical properties of the linear models
(Bates and Watts, 1988; Haines et al., 2004), the following ellipsoid
with center
(10) |
where
From (7), (8) and (10), the expectation surface η(β)
lies approximately within the intersection of tangent plane and a sphere
with center η(
A LR-model has a linear solution locus, which means a hyper-plane for p≥3 and (9) holds exactly. In addition, lines (parameter lines) on the solution locus representing constant values of βr, r = 1, 2, ..., P, are straight, parallel and equally spaced for equal increments of βr.
For a nonlinear solution locus the situation is different. The solution locus is a curved surface and the parameter lines on this surface (or the projections of these lines onto the tangent plane) are, in general, neither straight, parallel nor equi-spaced.
The extent of the curvature of the solution locus has been called intrinsic nonlinearity, since this nonlinearity cannot be altered by reparametrization. It is an inner geometrical property of the surface. The extent of the curvature of the parameter lines, their lack of parallelism and equi-spacedness has been called parameter-effects nonlinearity, since it is determined by the way in which the parameters appear in the model, that means, it depends on the parameterizations (Bates and Watts, 1988).
The validity of the tangent plane approximation (8) will
depend on the magnitude of the quadratic term in the Taylor expansion
of η(β) relative to the linear term. In making this comparison
it is helpful to split the quadratic term into two orthogonal components,
the respective projections onto the tangent plane and the component normal
to the tangent plane. These components were compared with the linear term
and got two measures of nonlinearity, the parameter-effects curvature
and the intrinsic curvature. Standardizing the model and the data leads
to scale independent quantities. Both measures depend on the direction
(β -
A convenient scale of reference can be established by
comparing the RMS curvature with that of the (scaled) confidence disk
(10) at a specified level (1-a), a>0. That means, we compare the radius
of curvature 1/c (c = cIN or c = cPE) with the radius
of the confidence disk
Following some earlier several research study s, an expectation
surface with radius of curvature 1/c is considered and the deviation of
the tangent plane from the surface or the deviation of the parameter line
from the straight line at a distance
• | A value of |
• | A value of |
• | A value of |
If cIN is replaced by cPE then a corresponding rule about the deviation of a parameter line from a straight line will be get.
In almost all cases known from several works the intrinsic curvature is very much smaller than the parameter-effects nonlinearity. Since cPE depends on the parameterization it is possible to reduce the parameter-effects nonlinearity using an appropriate reparametrization for the model under consideration.
Another practicable way to study the nonlinear behavior
of a model data set combination is the calculation of estimation for the
bias in the LS estimates. A corresponding formula for bias was presented
by Box (1971) and El-Shehawy (2001). The approximate bias for each component
of the estimate
Although the bias can be used as a measure of the extent
to which parameter estimates may exceed or fall short of the true parameter
values yet it cannot be used to compare parameters in two different parameterizations
(Ratkowsky, 1990). This comparison is possible with another measure of
nonlinearity, the measure of skewness (of the distribution of the estimated
parameter) due to Hougaard (1985) and El-Shehawy (2001). Following Ratkowsky
(1990), it is possible to use a rule-of-thumb for asserting whether the
estimator
• | If Skr < 0.1, the estimator |
• | If 0.1 ≤ Skr <0.25, the estimator will be reasonably close-to-linear in behavior; |
• | If Skr ≥ 0.25, the skewness is very apparent and |
• | If Skr > 1, indicates considerable nonlinear behavior. |
RESULTS
Simulation Experiments
Table 5 shows the mean value of the residual sum
of squares
If the retention models with respect to the ability to fit the simulated data are compared, the models with five parameters (VG1) and (K1) will be most flexible. The difference between these best fitting models and the models with four parameters is marginal, only the model (VG5) with three parameters is not flexible enough. These properties of the retention models neither depend on the type of the simulated soil nor on the used optimization algorithm.
Consider the distribution of the estimated parameters
of the models (VG1) and (VG4). Some descriptive statistics of the 100
estimated parameters of the model (VG1) for the simulated data sets of
sand are given in Table 6.
Table 5: | Residual sum of squares for the simulated data sets of sand |
Table 6: | Descriptive statistics for the distribution of the estimated parameters for the model (VG1) and the simulated data sets of sand |
Fig. 1: | The shape of the distributions of the estimated parameters Θr ( (a), (b), (c) figures) and m ((d), (e), (f) figures) of the model (VG1) and the simulated data sets of sand (Histogram with density, Pox-Plot and Q-Q Plot) |
Fig. 2: | The scatter-plot matrix for the estimated parameters of the model (VG1) and the simulated data sets of sand |
The symbol Sign. refer to the p-values for the goodness of fit test (Kolmogorov-Smirnov test) for the fit of a normal distribution. Only the distribution of the linear parameter Θr is similar to a normal distribution (p-value>0.05). In case of all other parameters the hypotheses of normal distribution has to be rejected (since p-value<0.05). These parameters are nonlinear parameters. The estimation of these parameters is biased and the distribution of the estimations is non-symmetric with a great variability. Especially, the parameters n and m are skewed (right tail) with some very extreme values. Figure 1 describes the shape of the distributions and the similarity to the normal distribution of the linear parameter Θr and of the nonlinear parameter m.
Consider dependencies between the estimated parameters
of the model with five parameters (VG1). The scatter-plot matrix in Fig.
2 shows strong relationships especially between the nonlinear parameters
a, n, m. In the Fig. 2, Beta 1, Beta 2, Beta 3, Beta
4 and Beta 5 denote to Θr, Θs, α,
n and m, respectively.
Table 7: | Descriptive statistics for the distribution of the estimated parameters for the model (VG4) and the simulated data sets of sand |
Fig. 3: | The shape of the distributions of the estimated parameters Θr ((a), (b), (c) figures) and n ((d), (e), (f) figures) of the model (VG4) and the simulated data sets of sand (Histogram with density, Pox-Plot and Q-Q Plot) |
Spearman`s rank correlation coefficient (a measure on monotone dependence (Johnson and Bhattacharyya, 1992) for a pair α, m is equal to -0.917. Within the three dimensional space the estimated values α, n, m describe a nonrandom point could. The points are concentrated along space curves. The corresponding constellations of parameters allow an equal fit of the same data. That means, with the model (VG1) over fitting is possible, estimated parameters may depend on each other and therefore it is impossible to identify a (simulated or real) soil using the estimated vector of parameters.
If the model (VG4) with four parameters and m = 1 is considered, the results for the simulated data sets of sand are given in Table 7.
It is obvious, that the estimation properties of the parameters of the model (VG4) are much more better than the corresponding properties of the model (VG1). The strong nonlinear parameter m of the model (VG1) is fixed. Although the model (VG4) is flexible yet it is also stiff enough (Fig. 3).
The scatter-plot matrix in Fig. 4 shows strong relationships especially between the nonlinear parameters α and n.
The parameter n of the model (VG4) has the worst estimation properties in comparison with the other parameters Θr, Θs and α. The distribution of the estimated values of this parameter has a great variability and is skewed (Fig. 3).
If the interest is considered in the model (VG4), an attempt to improve
these problematic qualities will be occurred by a reparametrization. As
an example, consider the replacement of the parameter n by
Table 8 and Fig. 5
show these the distributional properties in comparison with the corresponding
properties of the model (VG4). The results with respect to the other parameters
do not change.
Fig. 4: | The scatter-plot matrix for the estimated parameters of the model (VG4) and the simulated data sets of sand |
Fig. 5: | The shape of the distributions of the estimated parameter
|
The statistical properties of the model (VG4) are better than the estimation properties of the other models of the van Genuchten type with four parameter (VG2) and (VG3). These observed properties do not depend on the simulated soil.
Consider the model (K1) of King with five parameters and its variant (K2), a similar situation will be obtained. The model (K1) is very flexible but not stiff enough. Fix the strong nonlinear parameter ε, the model (K2) with four parameters will be obtained. The distributional properties of the estimated parameters of this model are much better than the properties of the corresponding model (K1).
Some Calculations for Measures of Nonlinearity
Throughout this subsection some results of nonlinearity measures are
discussed. These results relate to the current and previously studied
models in combination with some simulated data sets of sand and loam.
Firstly, consider the calculated RMS curvature measures
for the models (VG1), (VG2), (VG3), (VG4), (VG5), (K1) and (K2). Table
9 indicates to the scaled RMS curvatures (the intrinsic curvature
and the parameter-effects curvature) relative to a 95% confidence disk
radius (i.e., a = 0.05) for the studies models under consideration and
the first simulated data set SIMUL_1 of sand.
Table 8: | Descriptive statistics for the distribution of the estimated parameters for the model (RVG4) and the simulated data sets of sand |
Table 9: | Scaled RMS curvatures relative to 95% confidence disk radius for the considered models and the first simulated data set of sand |
Table 10: | Calculated parameter-effects curvature in the direction corresponding to the parameters of the van Genuchten models in combination with the first simulated data set of sand |
The intrinsic curvature only depends on the geometry (the shape) of the model data set combination. This measure is independent of the chosen parameterization of the model.
Referring to Table 9, it can be seen that the model (VG4) with four parameters possess the best value with respect to the parameter-effects curvature in comparison with the other models. This table shows also the extreme value of RMS parameter-effects curvature with respect to the model with five parameters (VG1).
Consider different data sets, different values of the
curvature are obtained and the order of models is in general stable. If
the number of repeated measurements increases, the curvature will be reduced.
A value of
Considering for the single parameters of the studied model, which cause the
corresponding value of the parameter-effects curvature, it is possible to calculate
a value of
It is obvious, that the corresponding values for the
parameters Θr and Θs are zero, since these
parameters having linear behavior in the models of van Genuchten. Similarly,
the linear parameter Θs in the models of King has a value
of zero. On the other hand, the parameter-effects curvature has the largest
values for the parameters n and m, which having strong nonlinear behavior.
Therefore, these are strong nonlinear parameters with estimation properties,
which greatly differ from the properties of linear or near linear parameters.
Since these values depend on the parameterization, it is possible to reduce
this nonlinear behavior by a reparametrization.
Table 11: | Bias and skewness for the parameter n and its reparameterization
|
Considering the reparametrization (RVG4) of the model (VG4), a value of
The values of the RMS curvatures for the models (K1) and (K2) are smaller than the corresponding values for the van Genuchten model (VG1) with five parameters, but as a rule the model (VG4) with four parameters has the best value of the parameter-effects curvature. On the other hand, the King model (K2) with four parameters has more favorable qualities in comparison with the model (K1) of King with five parameters.
The results of the calculations of the bias and the skewness
measures for the parameter n and its reparametrization
Referring to Fig. 3, 5
and Table 11, it can be seen that the distribution
of the estimated values for the parameter n is skewed with heavy right
tail and the distribution of
Using data sets of different simulations (or soils), similar results with respect to the used measures of nonlinearity will be obtained as a rule.
CONCLUSIONS
Model choice of NLR-models depends on the main aim of the analysis. A retention model, which considers as a NLR-model, has to be able to describe the typical S-shaped retention curve. If one is interested in the best fitting model, very flexible models with more parameters should be used. As a criterion ordinary or in case of heteroscedasticity, weighted least squares can be used.
In case of a retention curve, the models (VG1) and (K1) with five parameters of van Genuchten and King respectively are optimal with respect to the goodness of fit.
If the stable estimation of the parameters of a model is of prime importance, the distributional properties of the model parameters should be studies. These can be done by simulation studies or the calculation of measures of nonlinearity. In this case one looks for models with parameters which have near linear estimation behavior. These estimation properties are dependent on the inner nonlinearity of model data set combination, which is independent of the parameterization and the nonlinearity of model data set combination, which is dependent on the used parameterization. Near linear behavior means unbiased, near symmetrically distributed estimations with a near minimal variability in comparison with an approximating linear model.
Using the list of the studied models for the retention function in combination with the considered data sets (simulated or real), the van Genuchten model (VG4) with four parameters is as an optimal model. At the same time, this model has a strong nonlinear parameter n. A possible appropriate reparametrization with respect to this parameter is the model (RVG4).
ACKNOWLEDGMENTS
I would like to thank the committee of the journal and the referees for their constructive comments.