INTRODUCTION
Regression analysis is a statistical technique concerned with the study of the relationship between one dependent variable and one or more independent variables (Gujarati, 1999). Researchers have made heavy use of regression analysis in business, the social sciences, the biological sciences and many other fields. Simple linear regression is used to find the influence of one independent variable on the dependent variable, while multiple regression is used to find the influence of more than one independent variable on the dependent variable. An example of a study using multiple regression is that of Matiya et al. (2005), who determined the factors influencing the prices of fish and their implications for the development of aquaculture in Malawi. Reliable alternatives to existing methods have also been suggested in order to obtain better estimates; for example, Midi et al. (2009) proposed a leverage-based near-neighbor method for the estimation of parameters in heteroscedastic multiple regression models. Besides that, the effect of processing parameters on the microstructures and properties of automobile brake drums was also studied using multiple regression analysis by Oluwadare and Atanda (2007).
A common problem in multiple regression is multicollinearity. Zainodin and Khuneswari (2009a) stated that multiple regression is a regression model with more than one explanatory variable. The general form of multiple regression is as follows:

Y = Ω_{0} + Ω_{1}W_{1} + Ω_{2}W_{2} + … + Ω_{k}W_{k} + u    (1)

where, Y is the dependent variable, Ω_{0} is the constant term, Ω_{j} is the jth coefficient of independent variable W_{j}, W_{j} is the jth independent variable (including single independent variables, interaction variables, generated dummy variables and transformed variables) for j = 1, 2, ..., k and u is the residual term. When highly correlated independent variables exist in the model, multicollinearity effects are said to exist. Various methods have been suggested to overcome this problem. El-Salam (2011) proposed an estimation procedure for determining the ridge regression parameter in terms of the least Mean Square Error (MSE). In the presence of multicollinearity, the estimation of model parameters becomes inaccurate; hence, Camminatiello and Lucadamo (2010) developed an extension of principal component logistic regression to overcome this problem. Midi et al. (2010) also proposed Robust Variance Inflation Factors (RVIFs) for the detection of multicollinearity due to high leverage points, which are a source of multicollinearity.
MATERIALS AND METHODS
Multiple regression is used to analyse the data in this study. There are four phases in the model-building procedure of multiple regression, from listing all of the possible models to carrying out the goodness-of-fit tests on the residuals of the best model. The model-building procedure is shown in Fig. 1.
All possible models: According to Fig. 1, all of the possible models have to be listed before the analysis is carried out. Zainodin and Khuneswari (2009a) stated that the number of all possible models can be calculated as follows:

N = Σ_{j=1}^{q} j × ^{q}C_{j}    (2)

where, N is the number of possible models and q is the number of single independent variables, excluding the dummy variables (equivalently, N = q × 2^{q−1}).
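The counting formula above can be verified numerically; a minimal sketch (the function name is illustrative) reproduces the totals quoted later in the text, namely 80 models for q = 5 and 1024 for q = 8:

```python
from math import comb

def n_possible_models(q):
    """Number of all possible models for q single (non-dummy) independent
    variables: N = sum_{j=1}^{q} j * C(q, j), which equals q * 2**(q - 1)."""
    return sum(j * comb(q, j) for j in range(1, q + 1))
```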
Selected models
Multicollinearity test: In order to obtain the selected models, the multicollinearity test is carried out to remove multicollinearity source variables from each model; the procedures are shown in Fig. 2.
In this study, the alternative method, rather than the conventional method, is used to overcome multicollinearity. The multicollinearity source variables are variables with an absolute correlation coefficient greater than 0.95 and they are marked with circles in the correlation coefficient matrix. There are three types of cases in the multicollinearity test and the removal steps for the multicollinearity source variables are based on these three cases, as follows:
Case A: The most common variable is removed first. Then, rerun the reduced model
Case B: When more than one tie exists (i.e., with frequency two and above), the variables with the highest frequency are considered first. Then, the independent variable which has the smallest absolute correlation coefficient with Y is removed. Then, rerun the reduced model
Case C: When only one tie exists (i.e., with frequency one), the pair of variables which has the higher correlation coefficient is considered first. Then, the independent variable of that pair which has the smaller absolute correlation coefficient with Y is removed. Then, rerun the reduced model

Fig. 1: 
Model building procedures 

Fig. 2: 
Multicollinearity test procedures 
Then, to get the frequency for a specific identified multicollinearity variable in the correlation coefficient matrix, the algorithm for counting the frequency is as follows:
Step 1: For each variable, draw a horizontal line across the off-diagonal values
Step 2: Then, continue the horizontal line by drawing a vertical line through the values below the diagonal value and circle absolute values greater than 0.95
Step 3: Lastly, among all of the values cut by both the horizontal and vertical lines, count the number of times the circle(s) appear (the diagonal values are not considered)
Since the correlation coefficient matrix is symmetric, only the lower off-diagonal values are considered in counting the frequency. Thus, according to Fig. 2, after the frequency for each independent variable in a model is obtained, the type of case can be identified and the removal of the multicollinearity source variable can be carried out. This Zainodin-Noraini multicollinearity remedial procedure is carried out on each of the possible models.
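The frequency-counting steps above can be sketched in a few lines of Python. This is an illustrative implementation (function and argument names are assumptions): the frequency of a variable is the number of circled entries, i.e. lower-triangle correlations with |r| > 0.95, in which that variable participates, as described in Steps 1-3:

```python
import numpy as np

def multicollinearity_frequencies(corr, names, threshold=0.95):
    """Count, for each variable, how many off-diagonal entries of the lower
    triangle of the correlation matrix exceed the threshold in absolute
    value (the 'circled' values in the Zainodin-Noraini procedure).
    Diagonal values are not considered; the matrix is assumed symmetric."""
    freq = {name: 0 for name in names}
    for i in range(len(names)):
        for j in range(i):                 # lower triangle only
            if abs(corr[i, j]) > threshold:
                freq[names[i]] += 1        # both members of the circled
                freq[names[j]] += 1        # pair gain one count
    return freq
```

The variable with the highest frequency (Case B) or the member of the highest-correlation pair (Case C) is then the removal candidate.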
Coefficient test: After the removal of the multicollinearity source variables, according to Fig. 1, the next step is to perform the coefficient test on the reduced model. Zainodin and Khuneswari (2009a) stated that the coefficient test is used to test the coefficients of the corresponding variables. Variables which are insignificant are eliminated one at a time. For a specific j, the hypotheses for the coefficient test are as below:

H_{0}: Ω_{j} = Ω_{j}(H_{0})    versus    H_{1}: Ω_{j} ≠ Ω_{j}(H_{0})

The null hypothesis is rejected if |t_{cal}| is greater than t_{critical}, where

t_{cal} = (Ω̂_{j} − Ω_{j}(H_{0})) / SE(Ω̂_{j})

and t_{critical} is t_{α/2,(n−k−1)}. SE(Ω̂_{j}) is the standard error of the estimate Ω̂_{j} and Ω_{j}(H_{0}) is the value of Ω_{j} under H_{0} (here zero), for j = 1, 2, …, k. When |t_{cal}| is smaller than t_{critical}, the decision is to accept the null hypothesis; the variable whose t_{cal} is nearest to zero is then eliminated from the model. The elimination process is repeated until there are no more insignificant variables in the model.
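The refit-and-eliminate loop described above can be sketched as a small OLS backward-elimination routine. This is an illustrative implementation (function and argument names are assumptions), testing each coefficient against zero as in the coefficient test:

```python
import numpy as np
from scipy import stats

def backward_eliminate(variables, y, names, alpha=0.05):
    """One-at-a-time backward elimination by the coefficient test: refit
    OLS, find the coefficient whose |t_cal| is smallest (nearest zero);
    if it is below t_critical = t_{alpha/2,(n-k-1)}, drop that variable
    and repeat, otherwise stop."""
    names = list(names)
    y = np.asarray(y, dtype=float)
    n = len(y)
    while names:
        k = len(names)
        A = np.column_stack([np.ones(n)] + [variables[m] for m in names])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        s2 = resid @ resid / (n - k - 1)                # residual variance
        se = np.sqrt(s2 * np.diag(np.linalg.inv(A.T @ A)))
        t_cal = beta / se
        t_crit = stats.t.ppf(1 - alpha / 2, n - k - 1)
        # only the variables' coefficients are candidates (skip intercept)
        j = int(np.argmin(np.abs(t_cal[1:])))
        if abs(t_cal[1 + j]) >= t_crit:
            break                     # every remaining variable is significant
        names.pop(j)                  # eliminate the nearest-to-zero variable
    return names
```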
Best model: After all of the selected models are obtained, models with the same independent variables are filtered out. After that, to get the best model, the Eight Selection Criteria (8SC) are applied to the selected models which have undergone filtration. Zainodin and Khuneswari (2009b) have discussed the usage of the 8SC in detail. The Akaike Information Criterion (AIC) (Akaike, 1974) and the Finite Prediction Error (FPE) (Akaike, 1969) were developed by Akaike. The Generalised Cross Validation (GCV) was developed by Golub et al. (1979), while the HQ criterion was suggested by Hannan and Quinn (1979). The RICE criterion was discussed by Rice (1984) and the SCHWARZ criterion by Schwarz (1978). The SGMASQ was developed by Ramanathan (2002) and the SHIBATA criterion was suggested by Shibata (1981). The Eight Selection Criteria (8SC) are presented in Table 1.
Table 1: 
Eight selection criteria (8SC) 

SSE = Sum of squared errors, k+1 = No. of parameters and n = No. of observations 
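The criteria in Table 1 can be computed from SSE, n and k+1 alone; the sketch below assumes the forms given by Ramanathan (2002), so Table 1 of the source remains the authoritative statement of each formula. For every criterion, the smallest value indicates the preferred model:

```python
import numpy as np

def eight_selection_criteria(sse, n, k):
    """Eight selection criteria (8SC); smaller is better for each.
    p = k + 1 is the number of estimated parameters (k coefficients
    plus the constant term). Formulas assumed from Ramanathan (2002)."""
    p = k + 1
    base = sse / n
    return {
        "AIC":     base * np.exp(2.0 * p / n),          # Akaike (1974)
        "FPE":     base * (n + p) / (n - p),            # Akaike (1969)
        "GCV":     base / (1.0 - p / n) ** 2,           # Golub et al. (1979)
        "HQ":      base * np.log(n) ** (2.0 * p / n),   # Hannan-Quinn (1979)
        "RICE":    base / (1.0 - 2.0 * p / n),          # Rice (1984)
        "SCHWARZ": base * n ** (p / n),                 # Schwarz (1978)
        "SGMASQ":  base / (1.0 - p / n),                # Ramanathan (2002)
        "SHIBATA": base * (n + 2.0 * p) / n,            # Shibata (1981)
    }
```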
Goodness-of-fit
Randomness test: The randomness test is used to test the randomness of the residuals. The distribution of the residuals can be inspected from the histogram and scatter plots of the residuals. Bin Mohd et al. (2007) stated that the randomness of the residuals, u_{i} (i = 1, 2, 3, …, n), can be checked using the simple correlation coefficient. The procedures are as below:
Step 1: The null and alternative hypotheses are defined as follow:
• 
H_{0}: The residuals, u_{i} are randomly distributed 
• 
H_{1}: The residuals, u_{i} are not randomly distributed 
Step 2: The test statistic is calculated as follows:

T_{n} = R √[(n−p)/(1−R^{2})]

where, R is the simple correlation coefficient and n is the sample size. Since the u_{i} are independent of i, the random variable T_{n} follows a t-distribution with degrees of freedom = n−p, where p = k+1 is the number of estimated parameters.
Step 3: The null hypothesis is accepted if t_{critical} is greater than T_{n} which means that the residuals u_{i} are randomly distributed.
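The three steps above can be sketched as follows. Note the assumptions in this illustration: R is taken as the simple correlation between the residuals and their observation order i (the source does not spell out the pairing) and |T_{n}| is compared with t_{critical} so that negative trends are also caught:

```python
import numpy as np
from scipy import stats

def randomness_test(residuals, p, alpha=0.05):
    """Randomness test sketch: R is the correlation between the residuals
    u_i and the observation order i (an assumption), and
    T_n = R * sqrt((n - p) / (1 - R**2)) is compared with t_{alpha,(n-p)}."""
    u = np.asarray(residuals, dtype=float)
    n = len(u)
    order = np.arange(1, n + 1)
    R = np.corrcoef(u, order)[0, 1]
    T_n = R * np.sqrt((n - p) / (1.0 - R ** 2))
    t_crit = stats.t.ppf(1.0 - alpha, n - p)
    return T_n, t_crit, bool(abs(T_n) < t_crit)   # True: accept randomness
```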
Normality test: According to Gujarati (1999), the normality of a regression model can be assessed by using the histogram of residuals and the Normal Probability Plot (NPP). By plotting the histogram of residuals, the shape of the underlying probability distribution can be estimated. In the NPP, the variable of interest is normally distributed if a straight line fits the data well. Besides that, the Kolmogorov-Smirnov test and the Shapiro-Wilk test are also used to test the normality of the residuals. The Kolmogorov-Smirnov test is used when the number of observations is large, while the Shapiro-Wilk test is used when the number of observations is small. Both of these tests can be carried out using the SPSS software. The null and alternative hypotheses for the normality test are as below:
• 
H_{0}: The residuals, u_{i} are normally distributed 
• 
H_{1}: The residuals, u_{i} are not normally distributed 
The decision is to accept the null hypothesis if the p-value from the SPSS output is greater than 0.05; the residuals are then assumed to be normally distributed. Apart from this, graphical plots such as the scatter plot, histogram, Q-Q plot and box plot can also be used as supporting evidence for the normality test.
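The Kolmogorov-Smirnov check can also be reproduced outside SPSS, for example with scipy; note as a caveat that scipy's plain KS test differs slightly from the SPSS output, which typically applies the Lilliefors correction when the distribution's parameters are estimated from the data:

```python
import numpy as np
from scipy import stats

def normality_check(residuals, alpha=0.05):
    """Kolmogorov-Smirnov test of the standardized residuals against the
    standard normal distribution; normality is accepted when the p-value
    exceeds alpha (plain KS, without the Lilliefors correction SPSS uses)."""
    u = np.asarray(residuals, dtype=float)
    z = (u - u.mean()) / u.std(ddof=1)     # standardize the residuals
    stat, p_value = stats.kstest(z, "norm")
    return p_value, bool(p_value > alpha)
```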
Data analysis
Data description: The data is obtained from Dr. A. Garth Fisher from
the Human Performance Research Centre of Brigham Young University and contains
the observations of 252 men (Johnson, 1996). In this study,
nine variables are selected and analysed. They are the percentage of body fat
using Siri’s equation, abdomen circumference, adiposity index, chest circumference,
hip circumference weight, density, height and neck circumference. According
to BosyWestphal et al. (2005), the Siri’s
equation used in estimating the percentage of body fat is as follows:
where, the body density will be calculated as weight/volume. The descriptive statistics of these 9 variables are shown in Table 2.
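Siri's equation in its standard form, %BF = 495/D − 450 with D the body density in g/cc, can be wrapped in a one-line helper (the function name is illustrative):

```python
def siri_percent_body_fat(density):
    """Siri's equation: percentage of body fat from body density in g/cc
    (density = weight / volume)."""
    return 495.0 / density - 450.0
```

For example, a density of 0.99 g/cc corresponds to about 50% body fat, while 1.10 g/cc corresponds to about 0%.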
The correlation among the dependent variable, the percentage of body fat using Siri's equation, and the other 8 independent variables is presented in Table 3. However, due to limited space, the names of the variables in Table 3 are represented by their short forms; their full names can be found in Table 2.
Table 2: 
Descriptive statistics for all 9 variables 

Dummy transformation: Dummy variables are variables that take the values of 0 and 1 (Gujarati, 1999). Among the eight independent variables in Table 2, the last three are transformed into dummy variables because density (X_{6}) and height (X_{7}) have negative skewness and, among the other six independent variables which are highly correlated with the dependent variable Y, neck circumference (X_{8}) has the weakest correlation coefficient. In addition, neck circumference can also be used in identifying overweight and obese patients (Ben-Noun and Laor, 2006). Therefore, it is suitable to be selected as one of the variables for this study.
The transformation of independent variables into dummy variables helps to decrease the number of possible models in this study. This can be seen by using Eq. 1 and 2: if the three independent variables are not transformed into dummy variables, the number of independent variables is 8 and the number of possible models is 1024. However, if density, height and neck circumference are transformed into dummy variables, the number of possible models for the 5 remaining independent variables is only 80.
After transformation, density (X_{6}), height (X_{7}) and neck circumference (X_{8}) are represented by D, H and N, respectively. The modes for density (D), height (H) and neck circumference (N) are 1.061, 71.5 and 38.5, respectively. Observations which are less than their respective modes are denoted as 0, while observations which are more than their respective modes are denoted as 1. For better understanding, part of the data of the dummy variables after transformation is presented in Table 4.
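The mode-cutoff dummy coding described above can be sketched as follows; how ties at the mode itself are coded is not stated in the text, so mapping them to 0 is an assumption of this illustration:

```python
import numpy as np

def to_dummy(values, cutoff):
    """Code a continuous variable as a 0/1 dummy using its mode as the
    cutoff: below the cutoff -> 0, above -> 1. Ties at the cutoff are
    mapped to 0 here (an assumption; the source does not specify)."""
    return (np.asarray(values, dtype=float) > cutoff).astype(int)
```

For instance, with the neck-circumference mode of 38.5 from the text, `to_dummy([38.0, 38.5, 39.0], 38.5)` yields `[0, 0, 1]`.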
Procedures in getting the best model: After transformation, according to the model building procedures in Fig. 1, all of the possible models are listed by using Eq. 1 and 2. Since there are five single non-dummy independent variables in this study, the number of all possible models is 80. Then, the selected models can be obtained by carrying out the multicollinearity test. For illustration purposes, model M53.0.0 is considered as follows:
However, due to limited space, model M53.12.0, which has had 12 independent variables eliminated from the parent model, is considered instead. That model M53.12.0 has had 12 independent variables eliminated from the parent model can be known from its model name, where 12 represents the number of variables eliminated in Phase 2.1 and zero shows that no variable is eliminated in Phase 2.2.
Table 4: 
Partly of the data of dummy variables after transformation 

For better understanding, the definition of model name is presented in Fig.
3. Besides that, the removal of multicollinearity source variables from
model M53.12.0 is presented in Table 5.
The frequency tables for several cases in removing the corresponding variable from model M53.12.0 until model M53.17.0 are shown in Table 6.
In Table 5, variable X_{12} is numbered as 13 because it is the 13th variable removed from model M53.0.0. Model M53.12.0 belongs to Case B because there exists more than one tie: variables X_{12}, D, X_{1}D and X_{3}D each have a frequency of two. Since variable X_{12} has the smallest absolute correlation coefficient with Y, which is 0.7505, it is removed from model M53.12.0. Then, the analysis is rerun and a new model, M53.13.0, is produced.
Model M53.16.0, on the other hand, belongs to Case C because there exists only one tie: variables X_{5}, X_{35}, H and X_{2}H each have a frequency of one. The pair of variables X_{5} and X_{35} is considered first because its correlation coefficient (0.9859) is higher than that of the pair H and X_{2}H. After that, X_{5} is removed from model M53.16.0 because its absolute correlation coefficient with Y (0.6124) is smaller than that of X_{35}. Then, the analysis is rerun and a new model, M53.17.0, is produced. The same removal steps are carried out on the other multicollinearity source variables according to their types of cases. Thus, after the removal of 18 variables from model M53.0.0, the correlation coefficient table for the variables in model M53.18.0 is shown in Table 7.
From Table 7, it can be observed that all of the absolute correlation coefficient values (excluding the diagonal values) are less than 0.95 and thus model M53.18.0 is said to be free from multicollinearity.
Table 5: 
Removal of multicollinearity source variables from Model
M53.12.0 

Bold values indicate the cases involved in removing the corresponding variable from the model 

Fig. 3: 
Definition of model name 
Table 6: 
Frequency tables from Model M53.12.0 until Model M53.17.0 

Table 7: 
The correlation coefficient for variables in Model M53.18.0 

Then, according to Fig. 1, the coefficient test is carried out to remove insignificant variables from the models. Further analysis is therefore performed on model M53.18.0; Table 8 shows the t_{cal} values for each variable in model M53.18.0.
For the hypotheses of the coefficient test for model M53.18.0, t_{critical} is t_{0.025,(252−7−1)}, which is 1.97. The decision is to accept the null hypothesis when t_{cal} is smaller than t_{critical}, which shows that the corresponding variable of the specific coefficient has no contribution to the model. For M53.18.0, the variables H and X_{2}N, corresponding to β_{H} and β_{2N}, both have t_{cal} values smaller than t_{critical}; however, only one variable is eliminated in each elimination step. Thus, only variable H is eliminated, because its t_{cal} is nearer to zero than that of variable X_{2}N. The analysis is rerun with the remaining variables and the new model is M53.18.1. The resulting t_{cal} values after eliminating variable H are shown in Table 9.
The t_{critical} is t_{0.025,(252−6−1)}, which is 1.97. The decision is to accept the null hypothesis when t_{cal} is smaller than t_{critical}. Since only X_{2}N, the variable corresponding to β_{2N}, has a t_{cal} which is smaller than t_{critical} and nearest to zero, it is eliminated from model M53.18.1.
Table 8: 
The t_{cal} values for each variable in model M53.18.0 

Table 9: 
The t_{cal} values for each variable in model M53.18.1 

Table 10: 
The t_{cal} values for each variable in model M53.18.2 

The analysis is rerun with the remaining variables and the new model is M53.18.2. The resulting t_{cal} values after eliminating variable X_{2}N are shown in Table 10.
The t_{critical} is t_{0.025,(252−5−1)}, which is 1.97. Since all of the variables have t_{cal} values greater than t_{critical}, no variable is eliminated from model M53.18.2. Therefore, model M53.18.2 is said to be free from multicollinearity and insignificance. Besides that, p-values can also be used in eliminating insignificant variables: variables with the highest p-value greater than 0.05 are eliminated from the model one by one. Similar procedures are carried out for the other 79 possible models. Table 11 shows the summary of the selected models.
Among the selected models in Table 11, models with the same independent variables have been filtered out, with the first-appearing model name being retained. For example, model M53.18.2 has the same independent variables as models M57.25.3, M75.22.3 and M80.40.4; thus model M53.18.2 is the one taken forward for analysis. Table 12 shows the corresponding selection criteria values for each selected model.
From Table 12, model M53.18.2 is found to be the best model because it has most of the minimum values among the models under the 8SC. Model M53.18.2 can be written as follows:

Y = β_{0} + β_{1}X_{1} + β_{2}X_{2} + β_{3}X_{3} + β_{35}X_{35} + β_{D}D + u    (6)

where, X_{1} represents the abdomen circumference, X_{2} is the adiposity index, X_{3} is the chest circumference, X_{35} is the first-order interaction variable of chest circumference and weight, D represents density and u is the residual.
Table 11: 
Summary for selected models 

Table 12: 
The corresponding selection criteria values for each selected
models 


Fig. 4: 
Scatter plot of standardized residual 
According to Fig. 1, after the best model is obtained, the goodnessoffit is carried out on the residuals of the best model. In this case, the randomness test is carried out to verify the randomness of residuals. The hypothesis of randomness test is as follow:
• 
H_{0}: The observations u_{i} are random 
• 
H_{1}: The observations u_{i} are not random 
where, i = 1, 2, …, 252
The null hypothesis is accepted if T_{n} is less than t_{critical}, where t_{critical} = t_{α,n−k−1}. The calculated T_{n} equals 0.0013, with k equal to 5 since, as can be seen in Eq. 6, there are five independent variables in the best model. From the t-distribution table, at α = 0.05, t_{critical} = 1.65. Since T_{n} = 0.0013 is less than t_{critical}, the null hypothesis is accepted and the residuals u_{i} are randomly distributed. Besides that, the scatter plot of the standardized residuals in Fig. 4 also shows that the residuals are randomly distributed because no obvious pattern is observed.
Then, the normality test is carried out to test the normality of the residuals of the best model. In this study, the Kolmogorov-Smirnov test is used since the number of observations is large (252 men).
The hypothesis of normality test is shown as follow:
• 
H_{0}: The standardized residual is normally distributed 
• 
H_{1}: The standardized residual is not normally distributed 
The decision is that the null hypothesis is rejected if the p-value is less than 0.05. Table 13 shows the SPSS output of the Kolmogorov-Smirnov test on the standardized residuals.
Since the p-value in Table 13 is 0.2000, which is greater than 0.05, the null hypothesis is accepted and the residuals are said to be normally distributed. Besides that, the bell-shaped histogram of the standardized residuals in Fig. 5 also shows that the residuals are normally distributed.
Table 13: 
Kolmogorov-Smirnov test on standardized residual 

Table 14: 
The final coefficient values of model M53.18.2 


Fig. 5: 
Histogram of standardized residual 
Therefore, the residuals of the best model are said to be random and normally
distributed.
DISCUSSION
This study showed that model M53.18.2 is the best model; its equation, shown in Eq. 6, represents the factors that affect the percentage of body fat in men. Table 14 shows the final coefficient values of model M53.18.2.
As can be seen from Table 14, the positive coefficient values show that the percentage of body fat in men by Siri's equation (Y) will increase if the corresponding variables increase, while the negative coefficient values show that the percentage of body fat in men (Y) will decrease if the corresponding variables increase. Thus, increments in abdomen circumference (X_{1}), adiposity index (X_{2}) and chest circumference (X_{3}) will cause an increment in the percentage of body fat by Siri's equation (Y) in men. However, increments in density (D) and in the first-order interaction variable of chest circumference and weight (X_{35}) will cause a decrement in the percentage of body fat by Siri's equation (Y) in men. The increment in density (D) is found to bring the largest decrement, or influence (β = −7.4430), while the interaction variable brings only a very minor change (β = −1.1x10^{−3}) in the percentage of body fat in men.
CONCLUSION
In conclusion, the body density is found to be the main factor contributing negatively to the estimated percentage of body fat in men, followed by the positive relationships of the other main factors, namely, the abdomen circumference, adiposity index and chest circumference. The interaction variable between the chest circumference and the body weight causes only a very minor negative effect on the percentage of body fat. It is also suggested that further analysis be carried out by including Brozek's equation, which is also used in estimating the percentage of body fat in humans. Comparisons can then be made of the efficiency of both Siri's equation and Brozek's equation in estimating the percentage of body fat.