
Pakistan Journal of Biological Sciences

Year: 2004 | Volume: 7 | Issue: 6 | Page No.: 870-878
DOI: 10.3923/pjbs.2004.870.878
Empirical Table Values of Eigen Values for Different Variable Numbers and Sample Size Combinations
M. Muhip Ozkan and Mehmet Mendes

Abstract: In this study, both the proportions of total variation explained by the Principal Components and the critical values of the eigenvalues of each Principal Component at a probability of α=0.05 were tabulated empirically after 100,000 simulation experiments, considering different sample sizes, numbers of variables and correlations between variables. As an alternative to the rule used in many statistical package programs, which retains only eigenvalues equal to or greater than 1, an empirical decision-making mechanism is thus provided for deciding whether calculated eigenvalues are statistically significant.


How to cite this article
M. Muhip Ozkan and Mehmet Mendes, 2004. Empirical Table Values of Eigen Values for Different Variable Numbers and Sample Size Combinations. Pakistan Journal of Biological Sciences, 7: 870-878.

Keywords: principal components, simulation experiments and eigenvalues

INTRODUCTION

Multivariate analysis is a branch of statistical methods that determines the main characteristics of data obtained on several characteristics of subjects through research and experiments. It provides an overall explanation of the structure that emerges from the differing degrees of correlation between these characteristics. Within this family of methods, different techniques are used depending on the objective. The objective may be classifying observations into different groups according to their characteristics, examining whether an observation belongs to a particular group whose characteristics are known, examining whether there is a correlation between variables that were initially classified into two groups, or, because of the correlations between characteristics, omitting some of the variables and constituting a mathematical model with the remaining ones[1].

Principal Component Analysis seeks to remove the correlation structure between the subject characteristics and to explain that structure with fewer characteristics. To do so, artificial variables are constructed as linear combinations of the original variables that are mutually independent, and an important proportion of the total variation can then be explained by these fewer, new variables[2]. What counts as "an important proportion" is relative to the researcher and to the subject of the research: in some studies, explaining only 70% of the total variation may already be informative, whereas in others, depending on the importance of the subject, leaving even 1% of the total variance unexplained might cause significant errors when the entire structure is taken into consideration[3]. If the correlation between the original variables is perfect, a single Principal Component is expected to explain the total variation. Conversely, if there is no correlation between the original variables, each Principal Component is expected to explain an equal proportion of the total variation. These two conditions are, however, rarely encountered in practice: in all natural structures or systems there is high, medium or low correlation between the characteristics. Therefore, the proportion of the total variation explained by each Principal Component depends on the sample size, the number of variables and the level of correlation between the variables.

In this study, the proportions of the total variation explained by each Principal Component were calculated for different sample sizes, numbers of variables and correlation levels. For each Principal Component, the eigenvalue marking the upper 5% region of its empirical distribution is given in tables. The main objective of this study was to assist researchers, at the decision-making stage, in identifying which eigenvalues may be significant by means of an empirical critical value.

MATERIALS AND METHODS

In this study, random samples were drawn from a population having a multivariate standard normal distribution, with p=3, 4, ..., 20 variables and sample sizes n=5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 70, 100, 200 and 500 (n>p). Because the original variables were taken from a population with a standard normal distribution, the calculated variance of each Principal Component was equal to the eigenvalue (λ) of that Principal Component, and the total variance equals the sum of the eigenvalues of all the Principal Components for every variable-number combination. To obtain low, medium and high correlation levels between the original variables, a simple algorithm depending on the number of variables was used: the minimum correlation was set to 0.05 and the maximum to 0.90, and the correlations in between were spaced at equal intervals, the fixed increment being (0.90-0.05)=0.85 divided by ([p*(p-1)]/2)-1. Starting from the correlation of 0.05 assigned to the first variable pair, this increment was added successively for the remaining pairs up to the last one. For example, when 4 variables are taken, the increment is 0.85/(([4*(4-1)]/2)-1)=0.85/5=0.17 for the 6 correlation coefficients (r12, r13, r14, r23, r24, r34). Adding this increment repeatedly, starting from r12=0.05, gives a structure with correlation coefficients of 0.22 between the 1st and 3rd variables (r13), 0.39 between the 1st and 4th variables (r14), 0.56 between the 2nd and 3rd variables (r23), 0.73 between the 2nd and 4th variables (r24) and 0.90 between the 3rd and 4th variables (r34). A model was thus obtained that has, at a significance level of α=0.05, insignificant correlation between some of the variables, moderate correlation between others and significant, high correlation between the rest. Especially when the number of variables is increased, a similar pattern of correlations between variables is expected to be encountered in practice.
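As an illustration of this construction, the sketch below (Python/NumPy, not the authors' FORTRAN code) builds the correlation matrix for a given p. The pair ordering and the 0.05-0.90 range follow the description above; the positive-definiteness check at the end is an added assumption, since the paper does not describe one.

```python
import numpy as np

def correlation_matrix(p, r_min=0.05, r_max=0.90):
    """Correlation matrix with off-diagonal values spaced evenly from r_min to
    r_max over the p*(p-1)/2 variable pairs, ordered (1,2), (1,3), ..., (p-1,p)."""
    m = p * (p - 1) // 2                 # number of variable pairs
    step = (r_max - r_min) / (m - 1)     # fixed increment, e.g. 0.85/5 = 0.17 for p = 4
    R = np.eye(p)
    k = 0
    for i in range(p):
        for j in range(i + 1, p):
            R[i, j] = R[j, i] = r_min + k * step
            k += 1
    return R

R4 = correlation_matrix(4)
print(np.round(R4[np.triu_indices(4, 1)], 2))   # [0.05 0.22 0.39 0.56 0.73 0.9]
np.linalg.cholesky(R4)                          # raises if the matrix is not positive definite
```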

The material of the study consists of random numbers generated with the IMSL library of Microsoft Fortran Developer Studio. All required calculations were carried out by FORTRAN programs and every trial condition was repeated 100,000 times[4].
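The authors' simulations were run with IMSL FORTRAN routines; the following Python/NumPy sketch of a single trial condition is only a rough re-creation under assumed details (sampling via a Cholesky factor of the target correlation matrix, eigenvalues taken from the sample correlation matrix):

```python
import numpy as np

rng = np.random.default_rng(12345)

def simulate_eigenvalues(n, p, R, n_sim=100_000):
    """Draw n observations from N_p(0, R) n_sim times and return the eigenvalues
    of each sample correlation matrix, one row per run, in descending order."""
    L = np.linalg.cholesky(R)                        # factor of the target correlation matrix
    eig = np.empty((n_sim, p))
    for s in range(n_sim):
        X = rng.standard_normal((n, p)) @ L.T        # correlated standard-normal sample
        lam = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
        eig[s] = lam[::-1]                           # eigvalsh returns ascending order
    return eig
```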

In Principal Components Analysis, the p original variables are converted into new variables on new axes obtained by rotation through a specific angle θ. Among the axes produced by the different possible angles θ, only one has the highest variance, and even this axis cannot explain the whole variation. For this reason, a new axis, i.e. a new variable, is obtained that is perpendicular to the first axis (determined by the same rotation) and explains the highest proportion of the variation that cannot be explained by the first axis[5].

In theory, as many new artificial variables may be obtained as there are original variables. These new variables are called the Principal Components; the 1st Principal Component has the highest explanatory power, and the components explain the total variance in descending order while being independent of each other.

The total variance, which is the sum of the variances of the original variables, does not change after the new axes are obtained in Principal Components Analysis. However, because the axes are constructed so that the total variance is explained maximally beginning from the first axis, the proportions in which the total variance is explained do change. This property gives researchers an objective basis for eliminating the final axes and, consequently, for dimension reduction.
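A small numerical check of these two properties (the new axes are an orthogonal rotation, and the total variance is preserved) can be written as follows; the data are arbitrary and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4)) @ np.array([[1.0, 0.5, 0.3, 0.1],
                                             [0.0, 1.0, 0.4, 0.2],
                                             [0.0, 0.0, 1.0, 0.6],
                                             [0.0, 0.0, 0.0, 1.0]])
X -= X.mean(axis=0)

S = np.cov(X, rowvar=False)
lam, V = np.linalg.eigh(S)              # eigenvalues (ascending) and orthonormal axes
scores = X @ V                          # the rotated variables (Principal Components)

print(np.allclose(np.trace(S), lam.sum()))                       # total variance unchanged
print(np.allclose(np.cov(scores, rowvar=False), np.diag(lam)))   # components uncorrelated
proportions = lam[::-1] / lam.sum()     # explained-variance proportions, largest first
```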

Some loss of information in the proportion of total variation explained is natural in the case of dimension reduction. How much information loss is acceptable, or at which component the cut-off is placed, depends only on the researcher and the topic of the research. In principle, the level of information loss depends on the correlation between the original variables under study: as the correlations between variables that are not derived from one another increase, dimension reduction becomes more successful.

RESULTS

For each n and p combination, 100,000 simulation experiments were performed. The mean of each eigenvalue was calculated separately for each sample size-variable combination, and each mean was divided by the sum of these means to obtain the proportions of total variance explained given in Table 1. For each Principal Component and all n, p combinations, the 95,000th value of the eigenvalue sequenced in ascending order, i.e. the limit value at significance level α=0.05, is given in Table 2. Because the acceptable information loss varies with the researcher and the topic of the research, the explained-variation proportions on which this decision is based are given in Table 1.
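Continuing the earlier sketch (with the hypothetical helpers `correlation_matrix` and `simulate_eigenvalues` defined above), one row of each table could be reproduced roughly as follows; the percentile convention (the 95,000th ascending value) follows the text:

```python
import numpy as np

n, p = 30, 5
eig = simulate_eigenvalues(n, p, correlation_matrix(p))   # shape (100_000, p), descending per row

# Table 1: mean of each ordered eigenvalue across runs, expressed as a
# percentage of the sum of these means (proportions of total variance explained).
means = eig.mean(axis=0)
table1_row = 100 * means / means.sum()

# Table 2: for each component, the 95,000th of the 100,000 ascending values,
# i.e. the empirical 95th percentile used as the alpha = 0.05 critical value.
table2_row = np.sort(eig, axis=0)[95_000 - 1]

# A cumulative sum of the Table 1 row gives the joint explained-variance
# proportion against which the researcher's satisfactory level can be checked.
cumulative = np.cumsum(table1_row)
```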

Most statistical package programs simply retain eigenvalues equal to or greater than 1. The empirical values in Table 2 instead define the limits above which eigenvalues are statistically significant at the α=0.05 level (the 95,000th of the 100,000 ordered values, i.e. the empirical 95th percentile), and are given so that a researcher can make a more accurate decision[6].

A satisfactory joint variance proportion for each sample size and number of variables (the threshold being determined by the researcher) is obtained by summing the explained proportions (λi) in the eigenvalue columns of the relevant n-p combination in Table 1.

Table 1: The proportions of total variance explained by the eigenvalues according to the variable and sample size combinations (%)

Table 2: Empirical table values of the eigenvalues according to variable and sample size combinations at level α=0.05

These satisfactory explanation proportions are selected subjectively. Thus, some of the eigenvalues that meet the limit the researcher has set may not be statistically significant. To prevent this inconvenience, it is necessary to designate the eigenvalues that may be significant at a specific error level (α=0.05 in this study). The criterion used in most statistical package programs for this purpose is to build the set of Principal Components from the eigenvalues equal to or greater than 1. Alternatively, the researcher may constitute the Principal Components set more objectively by comparing the eigenvalues of his own experiment with the critical values given in Table 2 (for n and p combinations different from those in Table 2, interpolation may be applied). For example, suppose that for p=5 and n=30 only the first two of the calculated eigenvalues are equal to or greater than 1. Using the general criterion, a researcher would then constitute his Principal Components set from the first two Principal Components. Comparing the same eigenvalues instead with the empirical critical values for the p=5 and n=30 combination in Table 2, calculated in this study, the researcher may find that the first three exceed their critical values and decide to constitute the Principal Components set from the first three Principal Components.
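As a purely hypothetical illustration of the two decision rules (the eigenvalues and critical values below are made up for the example and are not the paper's figures), the comparison could look like:

```python
import numpy as np

# Hypothetical calculated eigenvalues for p = 5, n = 30 and hypothetical
# alpha = 0.05 critical values in the style of Table 2 (illustrative only).
eigenvalues = np.array([2.60, 1.10, 0.85, 0.30, 0.15])
critical    = np.array([2.40, 1.05, 0.80, 0.55, 0.40])

# Both counts rely on the eigenvalues being in descending order, so the
# exceedances form a leading run.
kaiser_components    = int(np.sum(eigenvalues >= 1.0))        # "eigenvalue >= 1" rule -> 2
empirical_components = int(np.sum(eigenvalues >= critical))   # empirical Table 2 rule -> 3
```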

The results of this study may be used to determine, from the critical empirical eigenvalues given in Table 2, the number of Principal Components needed to constitute the set of Principal Components. They provide researchers with an alternative decision-making method and, because of their practical use and objective evaluation, may also serve as a second opinion.

REFERENCES

  • Kaiser, H.F., 1960. The application of electronic computers to factor analysis. Educ. Psychol. Measur., 20: 141-151.


  • Marriott, F.H.C., 1974. The Interpretation of Multiple Observations. Academic Press, New York.


  • Johnson, R.A. and D.W. Wichern, 2002. Applied Multivariate Statistical Analysis. 5th Edn., Prentice Hall Inc., Upper Saddle River, New Jersey.


  • Sharma, S., 1996. Applied Multivariate Techniques. John Wiley and Sons Inc., New York, USA.


  • Anonymous, 1994. FORTRAN Subroutines for Mathematical Applications. Vol. 1-2, IMSL Publisher, Houston, USA.


  • Tatsuoka, M.M., 1971. Multivariate Analysis: Techniques for Educational and Psychological Research. John Wiley and Sons, New York.
