INTRODUCTION
According to Ramanathan1, the regression coefficients in a linear regression
model are the unknown parameters to be estimated. In a simple linear regression,
only two unknown parameters have to be estimated. However, problems arise in
a multiple linear regression, where three or more unknown parameters are to be
estimated and the model becomes large and complex. Challenges arise when a
computer programme has to be written and complex iterations have to be performed
subject to certain criteria. Thus, the exact number of parameters involved should
be known so that enough data can be prepared for such a large and complex model.
It is also important to note that, for the estimated parameters to have a unique
solution, the number of estimated parameters must be less than the total number
of observations, as stated in the assumptions of the multiple regression model
by Gujarati and Porter2.
According to Zainodin et al.3 and Yahaya et al.4, a multiple linear regression
analysis has four phases in getting the best model, namely: listing out all
possible models, getting selected models, getting the best model and conducting
the goodness-of-fit tests. In phase 1, the number of parameters for a possible
model is denoted by NP. Before continuing to phase 2, the number of parameters
in each possible model must be less than the sample size, n; any model that
fails this initial criterion is discarded. In phase 2, the multicollinearity
test and the coefficient test are conducted on the possible models to obtain
the selected models. In the multicollinearity test, multicollinearity source
variables are removed from each of the possible models. Then, the coefficient
test is conducted on the possible models that are free from the multicollinearity
problem, in order to eliminate insignificant variables from each of them. The
detailed procedure of this phase is explained in Zainodin et al.3.
For a general model Ma.b.c, a is the parent model number, b is the number of
variables removed due to the multicollinearity problem, c is the number of
variables eliminated due to insignificance and the resulting number of parameters
for the selected model is represented by (k+1). In most cases, if the number
of unknown parameters to be estimated for a possible model is large, then the
number of parameters for the selected model will most probably be large too5,6.
In these cases, manually counting the large number of parameters is time
consuming. Furthermore, some of the parameters might be missed due to human
error. Thus, the objective of this study is to propose a method to count the
number of parameters for a selected model, (k+1). The values of NP, b and c
are used to obtain (k+1).
According to Gujarati and Porter2, a simple linear regression model without
any interaction variable can be written as follows:
\[ Y = \beta_0 + \beta_1 X_1 + u \tag{1} \]
where Y is the dependent variable, β0 and β1 are regression coefficients and
are the unknown parameters to be estimated, X1 is a single quantitative
independent variable and u is the error term. So, there are two unknown
parameters to be estimated in Eq. 1. This equation can also be written in the
form as follows:
Next, a hierarchically multiple linear regression model7 with an interaction
variable can be written as follows:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{12} X_{12} + u \tag{2} \]
where Y is the dependent variable, β0, β1, β2 and β12 are regression
coefficients and are the unknown parameters to be estimated, X1 and X2 are
single quantitative independent variables, X12 is a first-order interaction
variable and u is the error term. Thus, there are four unknown parameters to
be estimated in Eq. 2. This equation can also be written in the following form:
Next, an example of a linear regression model with interaction variables and
a dummy variable is shown as follows:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{12} X_{12} + \beta_D D + \beta_{1D} X_{1D} + \beta_{2D} X_{2D} + u \tag{3} \]
where Y is the dependent variable, β0, β1, β2, β12, βD, β1D and β2D are
regression coefficients and are the unknown parameters to be estimated, X1 and
X2 are single quantitative independent variables, X12 is a first-order
interaction variable, D is a single independent dummy variable, X1D is the
first-order interaction variable of X1 and D, X2D is the first-order interaction
variable of X2 and D and u is the error term. Therefore, there are seven unknown
parameters to be estimated in Eq. 3. This equation can also be written as follows:
Equations 1-3 can be written as a general model in the form:
\[ Y = \Omega_0 + \Omega_1 W_1 + \Omega_2 W_2 + \cdots + \Omega_k W_k + u \tag{4} \]
where Ω0 is the intercept and Ωj is the jth partial regression coefficient of
the corresponding independent variable Wj for j = 1, 2, ..., k.
According to Zainodin et al.3, the independent variable Wj includes the single
independent variables, interaction variables, generated variables, dummy
variables and transformed variables. In this study, (k+1) denotes the number
of parameters for a selected model. The corresponding labels of Eq. 3 in the
general model of Eq. 4 are shown in Table 1.
From Table 1, β0 represents Ω0 in the general model, β1 represents Ω1 and the
same goes for the other estimated parameters in Table 1. Variable X1 in Eq. 3
represents W1 in the general model and the same goes for the other variables
in Table 1.
Instead of counting the number of parameters one by one as above, an equation
to count the number of parameters in a model without interaction variables and
in a model with interaction variables is proposed in this study. Eq. 5 is
presented as follows:
where NP is the number of parameters for a possible model, g is the number of
single independent quantitative variables, h is the number of single independent
dummy variables and v is the highest order of interaction (between single
independent quantitative variables) in the model.
Here, v = 0 denotes a model without interaction variables (or a model with
zero-order interaction variable) and v = 1, 2, ... denotes a model with first
or higher order interaction variable(s). Eq. 5 is tested in the following
instances to check its validity in counting the number of parameters in a
hierarchically multiple regression model. The aim of this study is to propose
a method to count the number of parameters for a selected model, (k+1), using
the information on NP, b and c.
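The display of Eq. 5 can be written out from the pieces the text supplies. The v = 0 branch, g+h+1, is stated explicitly in the Materials and Methods section. The branch for v = 1, 2, ... shown here is an inference, not a quotation of the published equation: it assumes that interactions among the g quantitative variables are included up to order v and that each dummy variable interacts only once with each single quantitative variable, which is the pattern of the variable listings in the worked examples. Under that assumption it reproduces the seven parameters of Eq. 3 (g = 2, h = 1, v = 1).

```latex
\[
NP =
\begin{cases}
g + h + 1, & v = 0, \\[4pt]
\displaystyle\sum_{j=1}^{v+1} \binom{g}{j} + gh + h + 1, & v = 1, 2, \ldots
\end{cases}
\]
% Check against Eq. 3 (g = 2, h = 1, v = 1):
% \binom{2}{1} + \binom{2}{2} + 2 \cdot 1 + 1 + 1 = 2 + 1 + 2 + 1 + 1 = 7
```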
MATERIALS AND METHODS
The illustrations in this section help to better understand the application
of the equation mentioned in the previous section. In order to achieve a model
free from multicollinearity effects and insignificant effects, the following
four-phase model building procedure is implemented (details can be found in3, 4).
• Phase 1: All possible models
• Phase 2: Selected model
• Multicollinearity test and coefficient test (including the Near Perfect
Multicollinearity (NPM) test and the Near Perfect Collinearity (NPC) test)
• Phase 3: Best model
• Phase 4: Goodness-of-fit
• Randomness test and normality test
This study also revealed that the parameters of the independent variables measure
the multicollinearity effect between independent variable parameters and
structural parameters. As discussed in the earlier section, the number of
parameters is very important before arriving at a selected and best model.
Thus, some illustrations follow:
Models without interaction variable: Here, models without an interaction
variable (that is, with a zero-order interaction variable) are considered.
As pointed out earlier for Eq. 5, the number of parameters in a model without
an interaction variable (v = 0) can be computed using g+h+1. For instance,
consider Eq. 6, a model with v = 0, one single quantitative independent variable
(g = 1) and five single dummy variables (h = 5), as follows:
where Y is the dependent variable, X1 is a single quantitative independent
variable (g = 1) and D, B, R, A and G are five single dummy variables (h = 5).
Then, the total number of parameters involved is:
\[ NP = g + h + 1 = 1 + 5 + 1 = 7 \]
Consider another example of calculating the number of parameters in a model
without an interaction variable. Eq. 7 is a model with v = 0, eight single
quantitative independent variables (g = 8) and ten single dummy variables
(h = 10), as follows:
where X1, X2, X3, X4, X5, X6, X7 and X8 are single quantitative independent
variables and B, C, L, E, W, K, A, G, H and S are 10 single dummy variables.
Then, the following is obtained:
\[ NP = g + h + 1 = 8 + 10 + 1 = 19 \]
Models with interaction variable: Now, consider models with interaction
variables (i.e., v = 1, 2, ...) in this subsection. According to Eq. 5, the
number of parameters in a model with interaction variables is calculated in
a different way from a model without interaction variables. A few examples are
presented to provide a better understanding of this equation. For instance,
a model with interaction variables up to first order (v = 1), two single
quantitative independent variables (g = 2) and five single dummy variables
(h = 5) is presented in Eq. 8:
where X2 and X4 are single quantitative independent variables, D, B, R, A and
G are single dummy variables and X24, X2D, X2B, X2R, X2A, X2G, X4D, X4B, X4R,
X4A and X4G are first-order interaction variables. This leads to:
Next, consider a larger model with higher order of interaction variable, for
instance, a model with fifth-order interaction (i.e., v equals to 5), number
of single quantitative independent variable, g equals to 6 and number of single
dummy variable, h equals to 5 as presented in Eq. 9:
In Eq. 9, X1, X2, X3, X4,
X5 and X6 are single quantitative independent variables,
X12, X13, X14, X15, X16,
X23, X24, X25, X26, X34,
X35, X36, X45, X46, X56,
X1D, X1B, X1R, X1A, X1G,
X2D, X2B, X2R, X2A, X2G,
X3D, X3B, X3R, X3A, X3G,
X4D, X4B, X4R, X4A, X4G,
X5D, X5B, X5R, X5A, X5G,
X6D, X6B, X6R, X6A and X6G
are first-order interaction variables, X123, X124, X125,
X126, X134, X135, X136, X145,
X146, X156, X234, X235, X236,
X245, X246, X256, X345, X346,
X356 and X456 are second-order interaction variables,
X1234, X1235, X1236, X1245, X1246,
X1256, X1345, X1346, X1356, X1456,
X2345, X2346, X2356, X2456 and X3456
are third-order interaction variables, X12345, X12346,
X12356, X12456, X13456 and X23456
are fourth-order interaction variables and X123456 is the fifth-order
interaction variable. Then, the total number of parameters is:
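The total for Eq. 9 can be tallied directly from the listing above. This is a worked count, assuming the listing is complete and the model has a single constant term; the grouping into braces is only for readability.

```latex
\[
\underbrace{1}_{\text{constant}}
+ \underbrace{6}_{\text{single quantitative}}
+ \underbrace{5}_{\text{single dummy}}
+ \underbrace{15 + 30}_{\text{first-order}}
+ \underbrace{20}_{\text{second-order}}
+ \underbrace{15}_{\text{third-order}}
+ \underbrace{6}_{\text{fourth-order}}
+ \underbrace{1}_{\text{fifth-order}}
= 99
\]
```

Here 15 is the number of pairs among X1 to X6 and 30 is the number of quantitative-by-dummy pairs (6 quantitative variables times 5 dummy variables).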
Lastly, consider another example of a larger model, which has interaction
variables up to sixth order and nine single dummy variables. Consider Eq. 10,
a model with highest order of interaction v = 6, seven single quantitative
independent variables (g = 7) and nine single dummy variables (h = 9).
Here, X123456, X123457, X123467, X123567, X124567, X134567 and X234567 are
fifth-order interaction variables and X1234567 is the sixth-order interaction
variable. Then, the total number of parameters is:
As can be seen from the illustrations, the number of parameters calculated
using the derived equation tallies with that obtained by manual counting.
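The agreement between the equation and manual counting can also be checked mechanically. The following Python sketch compares a closed-form count with a brute-force listing of terms. Both functions are illustrative and rest on an assumption about the form of Eq. 5: quantitative variables interact with each other up to order v, and each dummy variable enters once on its own and once with each single quantitative variable. The function names and the generic dummy labels D1, D2, ... are hypothetical.

```python
from itertools import combinations
from math import comb


def np_closed_form(g: int, h: int, v: int) -> int:
    """Assumed closed form: constant + dummies + quantitative terms
    up to interaction order v + quantitative-by-dummy interactions."""
    if v == 0:
        return g + h + 1
    quantitative_terms = sum(comb(g, j) for j in range(1, v + 2))
    return quantitative_terms + g * h + h + 1


def np_by_enumeration(g: int, h: int, v: int) -> int:
    """Count parameters by listing every term, as in manual counting."""
    quantitative = [f"X{i}" for i in range(1, g + 1)]
    dummies = [f"D{i}" for i in range(1, h + 1)]
    terms = ["constant"] + dummies
    # A product of j distinct quantitative variables is an interaction
    # of order j - 1, so j runs from 1 (single variables) to v + 1.
    for j in range(1, v + 2):
        terms += ["".join(group) for group in combinations(quantitative, j)]
    if v >= 1:
        terms += [x + d for x in quantitative for d in dummies]
    return len(terms)


# (g, h, v) settings of the models in Eq. 6 to Eq. 10
examples = {
    "Eq. 6": (1, 5, 0),
    "Eq. 7": (8, 10, 0),
    "Eq. 8": (2, 5, 1),
    "Eq. 9": (6, 5, 5),
    "Eq. 10": (7, 9, 6),
}
counts = {name: np_closed_form(*ghv) for name, ghv in examples.items()}
```

Under these assumptions `counts` holds 7, 19, 19, 99 and 200 for Eq. 6 to Eq. 10, and `np_by_enumeration` returns the same values, so a disagreement between the two would point to a term missed while listing.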
RESULTS AND DISCUSSION
The proposed equation defined in Eq. 5 is especially useful in counterchecking
the variables when listing all of the possible models in an analysis. This is
because some of the variables might be missed when a large number of parameters
is involved in a possible model.
Table 2 shows all the possible models for an analysis that
has two single quantitative independent variables (X1 and X2)
and one single dummy variable (D).
In Table 2, with the information on g, h and v, the NP for
possible models M1-M12 can be computed by using Eq. 5. Then,
the number of parameters for each of the possible models can be counterchecked
by using the computed NP values. For simplicity, each of the 12 models can be
written in general form as in Eq. 4.
Having introduced the equation for counting the number of parameters for a
possible model, the way of getting the number of parameters for a selected
model, (k+1), is presented next8,9.
Model M32 is used as an illustration; it was also mentioned earlier in Eq. 3.
The multicollinearity test is conducted on this possible model. The removal
of multicollinearity source variables from this model is shown in Tables 3-5.
This study uses the modified method of removing multicollinearity source
variables (Excel function: COUNTIF()).
In Table 3, all of the multicollinearity source variables (variables with
absolute correlation coefficient values greater than or equal to 0.9500) are
circled. It is found that variables X12, D, X1D and X2D each have a frequency
of 2. So, according to Zainodin et al.3, model M32 belongs to case B. To avoid
confusion between dummy variable B and case B, case B is represented by case 2
in this study.
Similarly, case C is represented by case 3 in this study. Based on the removal
steps for case 2, variable X12, which has the weakest absolute correlation
coefficient with the dependent variable Y compared with variables D, X1D and
X2D, is removed from model M32. The same removal steps are carried out on the
reduced model, M32.1, as presented in Table 4.
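The counting and removal steps described above can be sketched in Python. This is an illustration only: the correlation values are hypothetical and are not the entries of Tables 3-6, the function names are invented here, and the removal rule is an assumed simplification (among the variables tied for the highest frequency, drop the one with the weakest absolute correlation with Y), which is how cases 2 and 3 are described in the text.

```python
def source_frequencies(corr, variables, threshold=0.95):
    """COUNTIF-style tally: for each variable, the number of other
    variables whose absolute correlation with it reaches the threshold."""
    return {
        a: sum(
            1
            for b in variables
            if b != a and abs(corr[frozenset((a, b))]) >= threshold
        )
        for a in variables
    }


def remove_sources(corr, corr_with_y, variables, threshold=0.95):
    """Drop one multicollinearity source variable at a time until none is left.

    Assumed rule: among the variables tied for the highest frequency,
    drop the one with the weakest absolute correlation with Y.
    """
    remaining = list(variables)
    removed = []
    while True:
        freq = source_frequencies(corr, remaining, threshold)
        top = max(freq.values(), default=0)
        if top == 0:
            return remaining, removed
        tied = [name for name in remaining if freq[name] == top]
        weakest = min(tied, key=lambda name: abs(corr_with_y[name]))
        remaining.remove(weakest)
        removed.append(weakest)


# Hypothetical correlations, NOT the values from the paper's tables.
pairs = {
    ("X1", "X2"): 0.40, ("X1", "X12"): 0.97, ("X1", "D"): 0.955,
    ("X2", "X12"): 0.96, ("X2", "D"): 0.20, ("X12", "D"): 0.30,
}
corr = {frozenset(pair): r for pair, r in pairs.items()}
corr_with_y = {"X1": 0.80, "X2": 0.70, "X12": 0.60, "D": 0.50}

remaining, removed = remove_sources(corr, corr_with_y, ["X1", "X2", "X12", "D"])
b = len(removed)  # number of variables removed due to multicollinearity
```

With these made-up values, X12 is removed first (it ties with X1 at frequency 2 and is more weakly correlated with Y), then D, leaving X1 and X2, so b = 2.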
After removing variable X2D from model M32.1 in Table 4, the detailed
correlation coefficients of the reduced model M32.2 are shown in Table 5. It
is observed that variables D and X1D each have the highest frequency, which
is one. So, model M32.2 belongs to case 3. Therefore, variable X1D, which has
the weaker absolute correlation coefficient with the dependent variable Y, is
removed from this model. Thus, the resulting model free from multicollinearity
is M32.3.
Table 6 shows that model M32.3 is free from multicollinearity source variables
because all the absolute correlation coefficient values between the independent
variables are less than 0.9500 (except the diagonal values). For model M32.3,
the number of variables removed due to the multicollinearity problem is 3.
Therefore, b for model M32.3 is 3. More details on the definition of the model
name can be found in3 and4,6,10.
Since the resulting model, M32.3, is free from multicollinearity effects, the
coefficient test is conducted on it. The task then is to eliminate insignificant
variables from this model.
In Table 7, it is found that variable X2 has the highest p-value among the
independent variables and that this p-value is greater than 0.05 (since the
number of single quantitative independent variables is greater than 5 and the
coefficient test is two-tailed, the level of significance is set at 10%, based
on the recommendations in11). Thus, variable X2 is eliminated from model M32.3
and the reduced model is called model M32.3.1, as 1 variable is eliminated due
to insignificance. Details on the coefficient test can be found in8,12,13.
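One elimination step of the coefficient test can be sketched as follows. This is a simplified illustration, not the procedure of the cited references: the p-values are hypothetical (they are not the entries of Tables 7 and 8), the function name is invented here, and in practice the p-values of the second step would come from refitting the reduced model.

```python
def next_to_eliminate(p_values, alpha=0.05):
    """Return the variable with the highest p-value if it exceeds alpha,
    otherwise None (every remaining variable is significant)."""
    if not p_values:
        return None
    name, p = max(p_values.items(), key=lambda item: item[1])
    return name if p > alpha else None


# Hypothetical p-values for a model containing X1, X2 and D.
step_1 = {"X1": 0.001, "X2": 0.410, "D": 0.020}
# Hypothetical p-values after refitting the model without X2.
step_2 = {"X1": 0.002, "D": 0.015}

eliminated = []
for p_values in (step_1, step_2):
    candidate = next_to_eliminate(p_values)
    if candidate is None:
        break
    eliminated.append(candidate)

c = len(eliminated)  # number of variables eliminated due to insignificance
```

With these made-up values only X2 is eliminated, so c = 1, which mirrors the step from model M32.3 to model M32.3.1.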
Table 8 shows that model M32.3.1 is free from insignificant variables because
the p-values of both variables X1 and D are less than 0.05. From Table 8, it
is found that 3 parameters (the constant of the model and the coefficients of
variables X1 and D) are left in model M32.3.1; in other words, (k+1) equals 3.
From the model name, M32.3.1, it is noticed that 1 variable is eliminated in
the coefficient test, so c equals 1. As mentioned earlier for Eq. 3, there are
seven unknown parameters to be estimated for model M32, so NP equals 7. Thus,
by knowing NP, b and c for model M32.3.1 (i.e., Ma.b.c), the number of
parameters (k+1) for model M32.3.1 can be counterchecked using the equation
proposed in this study as:
\[ (k+1) = NP - b - c = 7 - 3 - 1 = 3 \tag{11} \]
Therefore, the number of parameters left in the selected model M32.3.1, as
shown in Table 8, is the same as the value obtained from the proposed Eq. 11.
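The countercheck can be expressed as a small Python helper that reads b and c from a model name of the form Ma.b.c and subtracts them from NP. This is a sketch under stated assumptions: the function names are hypothetical, and a name with a missing suffix (for example M32 or M32.3) is assumed to mean that the missing counts are zero.

```python
import re


def parse_model_name(name):
    """Split a model name such as 'M32.3.1' into (a, b, c).

    Assumption: a missing b or c suffix is treated as zero.
    """
    match = re.fullmatch(r"M(\d+)(?:\.(\d+))?(?:\.(\d+))?", name)
    if match is None:
        raise ValueError(f"not a model name of the form Ma.b.c: {name!r}")
    a, b, c = match.groups()
    return int(a), int(b or 0), int(c or 0)


def selected_parameter_count(np_possible, name):
    """(k+1) = NP - b - c for the model called `name`."""
    _, b, c = parse_model_name(name)
    return np_possible - b - c


# Model M32 has NP = 7 parameters (Eq. 3).
k_plus_1 = selected_parameter_count(7, "M32.3.1")
```

For model M32.3.1 this gives 7 - 3 - 1 = 3, matching the three parameters (constant, X1 and D) reported for Table 8; for the intermediate model M32.3 it gives 4.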
In line with the above discussion, other researchers have also highlighted
the importance of parameter counting. They ranked the parameters of a model
(by importance, significance or dependency, for example) based on the magnitude
of the coefficients11,14,15.
CONCLUSION
This study has proposed a new equation for counting the number of parameters
for each of all possible models (details can be found in phase 1 and in3,4).
This equation helps to countercheck the number of parameters left in the selected
model, which is free from multicollinearity and from insignificant variables.
The study presented the equation and demonstrated its application on models
with and without interaction variables.
As can be seen from the previous section, it takes a long time to count the
number of parameters in a model, especially for bigger models such as Eq. 9
and 10. Instead of counting the parameters one by one manually, the equation
established in this study allows researchers to obtain the number of parameters
in an easier, faster yet accurate way. Besides, human errors caused by manual
counting during model development can also be minimised. The proposed equation
also helps to save a considerable amount of time where an analysis involves
complex iterations or repeated tasks, especially in software development.