INTRODUCTION
Regression analysis studies the statistical dependence of one or more dependent variables, Y, on one or more explanatory variables, X. All procedures used and conclusions drawn in a regression analysis depend on the assumptions of the regression model. The most widely used model is the classical linear regression model, and the most widely used method for estimating its parameters is Ordinary Least Squares (OLS).
Under the classical assumptions this method has attractive statistical properties that have made it one of the most powerful and popular methods of regression analysis. However, OLS is not appropriate when there is strong correlation among the predictors (multicollinearity), so alternative methods have been developed.
The presence of multicollinearity may indicate that some explanatory variables are linear combinations of the others. Consequently, they do not improve the explanatory power of the model and could be dropped from it. In some situations, however, reducing the number of explanatory variables through variable selection is either not feasible or not desirable.
An alternative way of dealing with the unpleasant consequences of multicollinearity is biased estimation: we can accept a small bias in exchange for a significant reduction in the variance of an estimator. This observation motivates a whole class of biased estimators called shrinkage estimators.
There are various shrinkage methods that perform well under multicollinearity and that can possibly act as variable selection tools as well: Ridge Regression (RR) (Hoerl and Kennard, 1970) and its modifications, Partial Least Squares (PLS) regression (Wold, 1966, 1985) and Principal Components Regression (PCR) (Massy, 1965). Frank et al. (1993) compared these methods. Although the results are conditional on the simulation design used in the study, the researchers indicate that RR, PCR and PLS have similar properties, give similar performance and are highly preferable to variable selection.
When the response is measured on a nominal scale with more than two categories, the multinomial logit model (Luce, 1959; McFadden, 1974) is applied.
In this study, we show that, in the presence of multicollinearity, the estimation of the multinomial model parameters becomes inaccurate because of the need to invert near-singular and ill-conditioned information matrices. To provide an accurate estimation of the model parameters, we propose a new strategy based on PCR. The performance of the proposed model is then analyzed in a simulation study, and a bootstrap procedure is applied to validate the results obtained from each simulation.
A SOLUTION TO MULTICOLLINEARITY: PRINCIPAL COMPONENT MULTINOMIAL REGRESSION
The binary logit model is used to predict a binary response variable in terms of a set of explanatory variables. When the response is measured on a nominal scale with more than two categories, the multinomial logit model is applied. Both models become unstable when there is multicollinearity among the predictors (Ryan, 1997).
To improve the estimation of the binary logit model parameters, Marx (1992) introduced an iteratively reweighted partial least squares algorithm, Bastien et al. (2005) proposed partial least squares logit regression, Aguilera et al. (2006) presented Principal Component Logistic Regression (PCLR) and Vago and Kemeny (2006) developed ridge logistic regression.
Following the approach proposed by Aguilera et al. (2006), we propose to use as covariates of the multinomial logit model a set of orthogonal variables that are linear combinations of the original ones, in order to provide an accurate estimation of the parameters.
Here, we first describe Principal Component Analysis (PCA) and multinomial logit regression and then illustrate our idea.
Principal Component Analysis
Principal Component Analysis (PCA) is a multivariate technique that transforms
a number of correlated variables into a (smaller) number of uncorrelated variables
with maximum variance, called Principal Components (PC). The first principal
component accounts for as much of the variability in the data as possible and
each succeeding component accounts for as much of the remaining variability
as possible.
Let X be the n×p matrix of p quantitative independent variables and y a categorical response variable with more than two categories. The aim of PCA is to find a set of uncorrelated latent variables Z which are linear combinations of the original variables, Z = XV. The weight matrix V is built from the eigenvectors of the covariance or correlation matrix. These matrices can be calculated from the data matrix X. The covariance matrix contains scaled sums of squares and cross products. The correlation matrix is similar to the covariance matrix, but the variables, i.e., the columns, are standardized first. For reasons which are beyond the scope of this paper, it is often preferable to perform the analysis on the correlation matrix R, whose elements are the correlation coefficients among the independent variables. The basic properties of the analysis are:
• The PC’s are orthogonal
• The weights used to determine the PC’s maximize the variance among the x variables, so the first a<p PC’s lead to a good approximate reconstruction of the original matrix, X = ZV^{T}
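The construction described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the columns are standardized so that the analysis is carried out on the correlation matrix R, and the function name `principal_components` and the toy data are hypothetical, not taken from the paper.

```python
import numpy as np

def principal_components(X):
    """Hypothetical helper: eigenvalues, weight matrix V and scores Z of the correlation matrix."""
    # Standardize the columns so that the covariance of X_std equals the correlation matrix R.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    # eigh returns eigenvalues in ascending order; reorder by decreasing variance.
    eigvals, V = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, V = eigvals[order], V[:, order]
    Z = X_std @ V                      # principal component scores, Z = XV
    return eigvals, V, Z, X_std

# Toy data (assumption): four columns sharing one common factor, hence strongly correlated.
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
X = common + 0.3 * rng.normal(size=(200, 4))

eigvals, V, Z, X_std = principal_components(X)

# The PC's are uncorrelated: their covariance matrix is diagonal, with the eigenvalues on the diagonal.
cov_Z = np.cov(Z, rowvar=False)
# With all p components retained, the standardized data are recovered exactly as Z V^T.
X_back = Z @ V.T
```

Keeping only the first a columns of `Z` and `V` gives the approximate reconstruction mentioned in the second property.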
Multinomial Logit Regression
Multinomial logit regression is the simplest model in discrete choice analysis when there are more than two alternatives in the choice set. It is derived from utility-maximization theory, which states that a consumer chooses the alternative that maximizes his or her utility. Not all the attributes of the alternatives are observed, and for this reason the utility is divided into two parts:
• D_{ib} is the systematic part of the utility that the individual i receives from a generic alternative b
• ε_{ib} is the random part and summarizes the contribution of unobserved variables (Ben-Akiva and Lerman, 1985)
The probability that individual i selects a specific alternative c is then:
P_{ic} = P(D_{ic} + ε_{ic} ≥ D_{ib} + ε_{ib}, for all b = 1,...,s, b ≠ c)
where D_{ic} is the systematic part of the utility that the individual i receives from the alternative c, ε_{ic} is the disturbance and s is the number of alternatives.
If we assume that the disturbances are independent and identically extreme value distributed (Marschak, 1960), we obtain the multinomial logit model. The probability can then be expressed as follows:
P_{ic} = exp(μD_{ic}) / Σ_{b=1}^{s} exp(μD_{ib})
The term μ is a scale parameter and it can be normalized to 1; furthermore, if the systematic part of the utility is linear in the parameters, we have (Train, 2003):
P_{ic} = exp(Σ_{j=1}^{p} x_{ij}β_{jc}) / Σ_{b=1}^{s} exp(Σ_{j=1}^{p} x_{ij}β_{jb})
where x_{ij} (i = 1,...,n; j = 1,...,p) are the elements of the X matrix and β_{jb} are the parameters to be estimated.
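As an illustration of the linear-in-parameters form with μ normalized to 1, the choice probabilities can be computed as below. This is a minimal sketch: the function name `mnl_probabilities`, the toy values and the convention of fixing the last alternative's coefficients at zero as the reference are assumptions made for the example.

```python
import numpy as np

def mnl_probabilities(X, beta):
    """Multinomial logit probabilities.

    X    : (n, p) matrix of regressors x_ij
    beta : (p, s) matrix of parameters, one column per alternative
    Returns an (n, s) matrix whose row i holds the probabilities for individual i.
    """
    D = X @ beta                              # systematic utilities D_ib
    D = D - D.max(axis=1, keepdims=True)      # subtract the row maximum for numerical stability
    expD = np.exp(D)
    return expD / expD.sum(axis=1, keepdims=True)

# Toy example with n = 3 individuals, p = 2 regressors and s = 3 alternatives.
X = np.array([[1.0, 0.5],
              [0.2, -1.0],
              [0.0, 0.0]])
beta = np.array([[0.8, -0.4, 0.0],
                 [0.3, 0.6, 0.0]])            # last column fixed at zero (reference alternative)
P = mnl_probabilities(X, beta)
```

Each row of `P` sums to one, and an individual whose regressors are all zero receives equal probability for every alternative.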
Principal Component Multinomial Regression
At this point, we can present the new approach: Principal Component Multinomial Regression (PCMR). In the first step, PCMR creates the PC’s of the regressors as described above. In the second step, the multinomial model is fitted on the set of p PC’s. The probability that individual i chooses alternative c can be expressed in terms of all PC’s as:
P_{ic} = exp(Σ_{j=1}^{p} Σ_{k=1}^{p} z_{ik}v_{kj}β_{jc}) / Σ_{b=1}^{s} exp(Σ_{j=1}^{p} Σ_{k=1}^{p} z_{ik}v_{kj}β_{jb}) = exp(Σ_{k=1}^{p} z_{ik}γ_{kc}) / Σ_{b=1}^{s} exp(Σ_{k=1}^{p} z_{ik}γ_{kb})
Where:
z_{ik} (i = 1,...,n; k = 1,...,p) = The elements of the PC matrix
v_{kj} (j = 1,...,p) = The elements of the transposed matrix V^{T}
γ_{kb} (k = 1,...,p) = The coefficients to be estimated, with γ_{kb} = Σ_{j=1}^{p} v_{kj}β_{jb}
β_{jb} = The parameters expressed as functions of the original variables
s = The number of alternatives in the data set
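In the third step, the number of PC’s to be retained in the model, a<p, is chosen. The section on model calibration and validation discusses the different tools for selecting the number of PC’s. In the fourth step, the multinomial model is fitted on the subset of a<p PC’s. The probability that individual i chooses alternative c can be expressed in terms of a PC’s as:
P_{ic}^{(a)} = exp(Σ_{j=1}^{p} Σ_{k=1}^{a} z_{ik}v_{kj}β_{jc}^{(a)}) / Σ_{b=1}^{s} exp(Σ_{j=1}^{p} Σ_{k=1}^{a} z_{ik}v_{kj}β_{jb}^{(a)}) = exp(Σ_{k=1}^{a} z_{ik}γ_{kc}^{(a)}) / Σ_{b=1}^{s} exp(Σ_{k=1}^{a} z_{ik}γ_{kb}^{(a)})
Where:
γ_{kb}^{(a)} = The coefficients to be estimated on the subset of a PC’s
β_{jb}^{(a)} = The PCMR parameters obtained after the extraction of the a components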
Finally, the multinomial model parameters can be expressed as functions of the original variables (X matrix):
Z^{(a)}γ^{(a)} = XV^{(a)}γ^{(a)} = Xβ^{(a)}
where β^{(a)} = V^{(a)}γ^{(a)}; Z^{(a)} is the matrix of the a PC’s; γ^{(a)} is the matrix of parameters on the a PC’s for the s alternatives; V^{(a)} is the matrix of the a eigenvectors; and β^{(a)} is the matrix of parameters expressed as functions of the original variables. An interesting result is that β^{(p)} = β: if we retain all PC’s in the model, the matrix of parameters expressed as functions of the original variables, β^{(p)}, is equal to the matrix of classical multinomial parameters, β.
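The four steps can be put together in a short NumPy sketch. Several choices here are assumptions made only for illustration: the model has no intercept, the last alternative is the reference with coefficients fixed at zero, plain gradient ascent on the log-likelihood stands in for whatever optimizer was actually used, and the names `fit_mnl` and `pcmr` as well as the toy data are hypothetical.

```python
import numpy as np

def softmax(D):
    D = D - D.max(axis=1, keepdims=True)      # numerical stability
    E = np.exp(D)
    return E / E.sum(axis=1, keepdims=True)

def fit_mnl(Z, y, s, n_iter=5000):
    """Fit a multinomial logit on Z by gradient ascent (sketch, fixed iteration count)."""
    n, a = Z.shape
    Y = np.eye(s)[y]                          # one-hot responses, shape (n, s)
    gamma = np.zeros((a, s))
    # Step size scaled by the largest eigenvalue of Z^T Z / n to keep the updates stable.
    lr = 1.0 / np.linalg.eigvalsh(Z.T @ Z / n).max()
    for _ in range(n_iter):
        P = softmax(Z @ gamma)
        grad = Z.T @ (Y - P) / n              # gradient of the mean log-likelihood
        grad[:, -1] = 0.0                     # reference alternative stays at zero
        gamma += lr * grad
    return gamma

def pcmr(X, y, s, a):
    """PCMR(a): PCA of the regressors, logit fit on a PC's, back-transformation."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    eigvals, V = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    V = V[:, np.argsort(eigvals)[::-1]]       # eigenvectors by decreasing eigenvalue
    V_a = V[:, :a]                            # first a eigenvectors, V^(a)
    Z_a = X_std @ V_a                         # first a PC's, Z^(a)
    gamma_a = fit_mnl(Z_a, y, s)              # gamma^(a)
    beta_a = V_a @ gamma_a                    # beta^(a) = V^(a) gamma^(a)
    return X_std, Z_a, V_a, gamma_a, beta_a

# Toy data (assumption): five collinear regressors and a three-category response.
rng = np.random.default_rng(1)
n, p, s, a = 150, 5, 3, 2
common = rng.normal(size=(n, 1))
X = common + 0.3 * rng.normal(size=(n, p))
y = np.digitize(X[:, 0] + rng.normal(size=n), [-0.5, 0.5])

X_std, Z_a, V_a, gamma_a, beta_a = pcmr(X, y, s, a)
```

The linear predictors computed from the PC's, `Z_a @ gamma_a`, coincide with those computed from the standardized original variables, `X_std @ beta_a`, which is the identity used to express the parameters in terms of the original variables.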
However, the most important result is that PCMR leads to lower-variance estimates of the model parameters than the classical multinomial model. We calculate the variance of the estimated parameters of the multinomial model by bootstrap resampling. Let \hat{β}_{j}^{*(l)} be the bootstrap estimate of the parameter β_{j} for the lth sample and let \hat{β}_{j} be the estimated parameter. The bootstrap estimate of the variance of \hat{β}_{j} is the empirical estimate calculated from the m bootstrap values:
Var^{*}(\hat{β}_{j}) = (1/(m−1)) Σ_{l=1}^{m} (\hat{β}_{j}^{*(l)} − \bar{β}_{j}^{*})^{2}    (7)
where \bar{β}_{j}^{*} = (1/m) Σ_{l=1}^{m} \hat{β}_{j}^{*(l)} is the bootstrap mean of the estimates of the jth parameter.
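The bootstrap variance can be sketched as follows. The example assumes the usual m − 1 divisor for the empirical variance, and the helper names `bootstrap_estimates` and `bootstrap_variance` are hypothetical; `estimator` stands for any function that refits the model on a resample and returns the parameters.

```python
import numpy as np

def bootstrap_estimates(X, y, estimator, m, rng):
    """Refit `estimator` on m resamples drawn with replacement; one row per resample."""
    n = X.shape[0]
    rows = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)          # resample individuals with replacement
        rows.append(np.ravel(estimator(X[idx], y[idx])))
    return np.array(rows)

def bootstrap_variance(boot):
    """Empirical variance of each parameter over the m bootstrap values (divisor m - 1)."""
    boot = np.asarray(boot, dtype=float)
    m = boot.shape[0]
    boot_mean = boot.mean(axis=0)                 # bootstrap mean of each parameter
    return ((boot - boot_mean) ** 2).sum(axis=0) / (m - 1)

# Small hand-checkable case: three bootstrap values for two parameters.
boot = np.array([[1.0, 2.0],
                 [3.0, 2.0],
                 [2.0, 5.0]])
var_small = bootstrap_variance(boot)

# Resampling demo with a stand-in estimator (column means), used here only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 3, size=50)
boot_demo = bootstrap_estimates(X, y, lambda Xb, yb: Xb.mean(axis=0), m=100, rng=rng)
var_demo = bootstrap_variance(boot_demo)
```

In the setting of the paper, `estimator` would be the classical multinomial fit or a PCMR(a) fit returning the parameters in terms of the original variables.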
Model Calibration and Validation
The number of PC’s, a, is bounded from above by p, the number of x variables. Hence, the number of components should be chosen in the range 1≤a≤p. The number of PC’s to be retained in the model can be selected using different tools. The first possibility is to retain all the components, but the most used criteria are:
• To consider the PC’s in their natural order and stop when the explained variability is about 75%
• To consider the PC’s that correspond to eigenvalues greater than one
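Both criteria depend only on the eigenvalues of the correlation matrix, so they can be sketched in a few lines. The function names and the eigenvalues below are hypothetical and serve only to show that the two rules need not agree.

```python
import numpy as np

def n_components_by_variance(eigvals, threshold=0.75):
    """Smallest a such that the first a PC's explain at least `threshold` of the variability."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

def n_components_by_eigenvalue(eigvals, cutoff=1.0):
    """Number of PC's whose eigenvalue exceeds `cutoff`."""
    return int(np.sum(np.asarray(eigvals, dtype=float) > cutoff))

# Hypothetical eigenvalues of a correlation matrix with p = 8 variables (they sum to 8).
eigvals = [5.0, 1.3, 1.05, 0.3, 0.15, 0.1, 0.06, 0.04]

a_variance = n_components_by_variance(eigvals)      # cumulative shares: 0.625, 0.7875, ...
a_eigenvalue = n_components_by_eigenvalue(eigvals)  # eigenvalues 5.0, 1.3 and 1.05 exceed one
```

With these values the 75% rule retains two components while the eigenvalue rule retains three.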
However, these criteria do not take into account the dependence relationship between the response and the predictor variables. For this reason we propose the criterion of including in the model all the PC’s that have a statistically significant influence on the response variable. A forward stepwise procedure is applied to select the significant components.
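To assess the different criteria, we develop a bootstrap procedure and use the bootstrap samples to estimate the parameters, both for the original matrix and for the PC matrix. We propose two accuracy measures:
• The Root Mean Squared Error (RMSE) of the bootstrap estimates for β_{j}
• The BIAS for β_{j}, calculated as the difference, in absolute value, between the bootstrap mean of the parameter estimates and the true value of the parameter
They are defined as follows:
RMSE(\hat{β}_{j}) = sqrt((1/m) Σ_{l=1}^{m} (\hat{β}_{j}^{*(l)} − β_{j})^{2})
BIAS(\hat{β}_{j}) = |\bar{β}_{j}^{*} − β_{j}|
where \hat{β}_{j}^{*(l)} is the bootstrap estimate of β_{j} for the lth of the m bootstrap samples and \bar{β}_{j}^{*} is the bootstrap mean of the estimates.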
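A minimal sketch of the two accuracy measures is given below. It assumes that the RMSE is taken around the true parameter value and averaged over the m bootstrap estimates; the function names `rmse` and `abs_bias` and the numbers are hypothetical.

```python
import numpy as np

def rmse(boot, true_beta):
    """Root mean squared error of the bootstrap estimates around the true parameters."""
    diff = np.asarray(boot, dtype=float) - np.asarray(true_beta, dtype=float)
    return np.sqrt((diff ** 2).mean(axis=0))

def abs_bias(boot, true_beta):
    """Absolute difference between the bootstrap mean and the true parameters."""
    boot_mean = np.asarray(boot, dtype=float).mean(axis=0)
    return np.abs(boot_mean - np.asarray(true_beta, dtype=float))

# Three bootstrap estimates (rows) of two parameters (columns).
boot = np.array([[1.0, 2.0],
                 [3.0, 2.0],
                 [2.0, 5.0]])
true_beta = np.array([2.0, 2.0])

rmse_values = rmse(boot, true_beta)       # sqrt(2/3) and sqrt(3)
bias_values = abs_bias(boot, true_beta)   # 0 and 1
```

The first parameter is unbiased but variable, while the second shows both bias and variability, which is why the two measures are reported side by side.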
The simulation study and the accuracy measures show that the best results are obtained when the criterion of significant components (the third criterion described above) is used.
A SIMULATION STUDY
In order to illustrate the performance of the proposed approach, we develop a simulation study according to the scheme proposed by Hosmer et al. (1997).
The first step in the simulation process is to obtain a set of p explanatory variables with a known correlation structure. For this purpose, we apply the Cholesky decomposition. The second step is to fix a vector of real parameters β and compute the real probabilities. Finally, each value of the response is simulated from a multinomial distribution. After each data simulation, we fit the multinomial model. As we will see, the estimated parameters are always very different from the real ones because of multicollinearity.
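The data-generating scheme can be sketched as follows. The equicorrelated target matrix with correlation 0.9, the uniformly drawn "real" parameters, the seed and the choice of the last alternative as the fixed one are assumptions made for the example; they are not the values used in the study.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, s = 100, 10, 4

# Hypothetical target correlation matrix: every pair of regressors has correlation 0.9.
rho = 0.9
R = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
L = np.linalg.cholesky(R)                      # lower triangular factor, R = L L^T

# Step 1: independent standard normal draws are turned into correlated regressors.
X = rng.standard_normal((n, p)) @ L.T

# Step 2: fix the real parameters (last alternative as reference) and compute real probabilities.
beta = np.zeros((p, s))
beta[:, :-1] = rng.uniform(-1.0, 1.0, size=(p, s - 1))
D = X @ beta
D -= D.max(axis=1, keepdims=True)              # numerical stability
P = np.exp(D) / np.exp(D).sum(axis=1, keepdims=True)

# Step 3: simulate each response from a multinomial distribution with its own probabilities.
y = np.array([rng.choice(s, p=P[i]) for i in range(n)])

# The condition number of the sample correlation matrix indicates how ill-conditioned the design is.
cond = np.linalg.cond(np.corrcoef(X, rowvar=False))
```

A large value of `cond` signals the near-singularity that makes the classical multinomial estimates unstable.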
As stated in the previous sections, the PCA of the regressors helps to improve this inaccurate estimation of the parameters. Once the PC’s of the simulated covariates are computed, we fit the PCMR(a) models with different sets of a PC’s. Then, for all fitted PCMR(a) models, we compute the estimated parameters in terms of the original variables and their variance, as defined in Eq. 7, to test the improvement in the parameter estimation.
This simulation design is carried out for three different numbers of regressors (p = 10, 12 and 15), two different sample sizes (n = 100 and 200), three different numbers of alternatives (s = 3, 4 and 5) and two different distributions of the regressors (standard normal and uniform). The number of simulations performed is 360. Different criteria are considered to decide the number of components to retain in the model.
We present two specific simulation studies:
• Simulation 1 with n = 100, p = 10 and s = 4, correlation among the regressors from 0.4 to 0.9 and regressors with standard normal distribution
• Simulation 2 with n = 100, p = 10 and s = 4, correlation among the regressors from 0.4 to 0.9 and regressors with uniform distribution
Tables 1 to 5 show the results for simulation 1 and their validation. Table 6 displays the results for simulation 2.
Let us focus on simulation 1. The first column of Table 1 contains the labels of the parameters, followed by the real parameters (Real), the parameters estimated with all the components (All PC's), the parameters calculated on the set of PC’s with eigenvalues greater than one (PCMR(1)) and the parameters estimated on the PC’s that significantly influence the dependent variable (PCMR(sign)). In all cases the parameters are calculated for the three possible alternatives (the fourth is the fixed alternative). As β^{(p)} = β, we do not include a column of parameters obtained through the classical multinomial model.
It is easy to see (in bold) that the parameters estimated with all the components have many discordant signs (15 discordant signs) and are always very different from the real ones.
Table 1: Estimated parameters with the different methods for all the alternatives (simulation 1)
Bold values indicate that the parameters estimated with all components have many discordant signs
Table 2: Differences in absolute value between the real parameters and the estimated ones for all the alternatives (simulation 1)
Table 3: RMSE of estimated parameters for the different methods for all the alternatives (simulation 1)
Table 4: BIAS: differences between the mean of the parameter estimations and the true values for all the alternatives (simulation 1)
Table 5: Bootstrap estimates of variance of the parameters for the different methods for all the alternatives (simulation 1)
Table 6: Estimated parameters with the different methods for all alternatives (simulation 2)
Bold values indicate that the parameters estimated with all components have many discordant signs
The situation improves if we consider the parameters calculated with the other two methods, and the best results are obtained with PCMR(sign): 9 discordant signs and parameters closer to the real ones.
To make the results easier to read, Table 2 reports the differences, in absolute value, between the real parameters and the estimated ones for the different methods. These differences are large for the classical multinomial method. As previously stated, these results must be caused by multicollinearity.
In order to validate the results obtained from each simulation we apply bootstrap resampling. Tables 3 and 4 show the RMSE and the BIAS for the different methods, parameters and alternatives. The number of bootstrap samples is 100. In Table 3, we can note that the RMSE computed with all the principal components is very high for the first two alternatives, sometimes taking values higher than 1000000. For the third alternative, the situation is a little better, but the results are still poor compared with those obtained with the techniques based on PCMR. In fact, Table 3 shows that, for these techniques, the values of the RMSE are always much lower (not higher than 4) and, for PCMR(sign), they are very low.
Turning to Table 4, the calculated BIAS is very high for the classical method. This means that the averages of the parameter estimates are very far from the real values of the coefficients. The results for PCMR(1) and PCMR(sign) are better and, in particular, the values for the latter show that the bootstrap means of the estimates are very close to the real parameters.
Finally, in Table 5, we calculate the bootstrap estimates of the variance of the estimated parameters using Eq. 7. It is interesting to observe that the variance of the PCMR estimates is lower than that of the classical multinomial estimates. In particular, the variance of the classical multinomial estimates ranges from 4.228807 to 1049110, whereas that of the PCMR(sign) estimates ranges from 0.019927 to 1.944938. So, the aim of obtaining a considerable reduction in the variability of the estimates is achieved. The positive effect of this result on confidence intervals, hypothesis tests, etc., is remarkable.
Table 6 shows the results for simulation 2. They confirm that the parameters estimated with all the components have many discordant signs (14 discordant signs) and are always very different from the real ones.
The situation improves if we consider the parameters calculated with the other two methods, and the best results are obtained with PCMR(sign): 8 discordant signs and parameters closer to the real ones.
CONCLUSION
In this study, we showed that the estimates of the multinomial logit model parameters have a high variance when there is multicollinearity among the regressors. To solve the problem, we proposed to use as covariates of the multinomial model a reduced number of PC’s of the predictor variables. An extensive simulation study showed that the proposed approach is a valid alternative in the presence of multicollinearity.
In order to select the optimum PCMR(a) model, we considered and compared different methods for including PC’s in the model. The method that includes in the model all the PC’s that have a statistically significant influence on the response variable yielded more stable and lower-variance estimates of the model parameters. This is a considerable advantage that will allow us to focus on the inferential aspects of the proposed approach.
Finally, a generalization of the other models proposed in the literature for solving the problem of high-dimensional multicollinear data in the binary logit model could be interesting.