A Test for Testing the Equality of Two Covariance Matrices for High-dimensional Data

Chaipitak, Saowapha; Chongcharoen, Samruam

ABSTRACT

This study proposes a test for the equality of two covariance matrices from two independent multivariate normal populations with high-dimensional data. The test statistic is based on an unbiased and consistent estimator of the ratio between the sums of squares of the covariance matrix elements. Under the null hypothesis, the proposed test statistic is asymptotically standard normal when the number of variables and the sample sizes go to infinity together. A simulation study is conducted to investigate the performance of the proposed test statistic. The results show that the proposed test is superior to three other tests from the literature for various patterns of the common covariance matrix. Finally, two real data sets are analyzed to illustrate the application of our theoretical results.


INTRODUCTION

Let x_ij = (x_ij1, ..., x_ijp)’, j = 1, ..., n_i, i = 1,2, be random samples drawn from independent multivariate normal populations N_p(μ_i, Σ_i), where all the parameters are unknown. Many statistical techniques, such as discriminant analysis, testing the equality of two mean vectors and testing the equality of two mean sub-vectors, require knowing whether the covariance matrices of the two populations are equal (Johnson, 1998; Krzanowski, 2000; Srivastava, 2002; Gamage and Mathew, 2008; Fujikoshi et al., 2010). This equality must therefore be tested before any further analysis is applied. When the sample sizes n_i are larger than the number of variables, p, the widely used traditional technique for testing H₀: Σ₁ = Σ₂ = Σ against H₁: Σ₁≠Σ₂, where Σ is the common unknown covariance matrix of the two populations, is the modified likelihood ratio test. However, in applications in the modern sciences and economics, the data often consist of a very large number of variables measured on small samples. For instance, DNA microarrays typically measure thousands to millions of gene expressions on small samples (Dudoit et al., 2002; Ibrahim et al., 2002; Sebastiani et al., 2006; Huang et al., 2009). When p≥n_i, the data are called high-dimensional; the sample covariance matrices S_i are then singular, so the modified likelihood ratio test is not valid. Tests for this setting were recently developed by Schott (2007), Srivastava (2007) and Srivastava and Yanagihara (2010). To provide a more powerful test of H₀ against H₁ when p≥n_i, a new test statistic is proposed. It is shown that the proposed test statistic is asymptotically standard normal for every type of common covariance matrix considered, for large p and n_i.

Let n_i be the size of the sample drawn from population i, i = 1, 2, and let n = n₁+n₂-2. The following assumptions are made:

[Assumptions (A1)-(A4): displayed as an image in the source; not available in this copy]

where, a_k = (trΣ^k)/p and a_ji = (trΣ^j_i)/p.

Let:

and

(1)

Since S₁ and S₂ are independent estimates of the covariance matrices Σ₁ and Σ₂, respectively, with (n_i-1) S_i ∼W_p (Σ_i, n_i-1), i = 1, 2, where W_p (Σ_i, n_i-1) denotes a Wishart distribution with n_i-1 degrees of freedom and covariance matrix Σ_i, the common covariance matrix Σ can be estimated by the pooled estimator S = [(n₁-1)S₁+(n₂-1)S₂]/n.

Let:

and

(2)
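The displayed definitions in (1) and (2) are not reproduced in this copy. As a rough sketch of the quantities involved, the Python code below computes the sample covariance matrices, the pooled estimate of Σ, and estimators of a₁ and a₂. The a₂ estimator is taken to be the unbiased form from Srivastava (2005); that form, and all function names, are assumptions made for illustration and are not the authors' code.

```python
import numpy as np


def sample_cov(x):
    """Unbiased sample covariance of an (n_i, p) data matrix (divisor n_i - 1)."""
    return np.atleast_2d(np.cov(x, rowvar=False))


def pooled_cov(s1, n1, s2, n2):
    """Pooled estimate of the common covariance matrix, with n = n1 + n2 - 2."""
    n = n1 + n2 - 2
    return ((n1 - 1) * s1 + (n2 - 1) * s2) / n


def a1_hat(s):
    """Estimator of a_1 = tr(Sigma)/p."""
    return np.trace(s) / s.shape[0]


def a2_hat(s, df):
    """Estimator of a_2 = tr(Sigma^2)/p when df * S ~ W_p(Sigma, df).

    Assumed form (Srivastava, 2005):
        df^2 / ((df - 1)(df + 2)) * (1/p) * [tr(S^2) - (tr S)^2 / df]
    """
    p = s.shape[0]
    tr_s2 = np.sum(s * s)  # for symmetric S, tr(S^2) is the sum of squared entries
    return df**2 / ((df - 1) * (df + 2)) * (tr_s2 - np.trace(s) ** 2 / df) / p


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # arbitrary seed
    n1, n2, p = 20, 20, 40          # high-dimensional case: p >= n_i
    x1 = rng.standard_normal((n1, p))
    x2 = rng.standard_normal((n2, p))
    s1, s2 = sample_cov(x1), sample_cov(x2)
    s = pooled_cov(s1, n1, s2, n2)
    # Ratio of the two a_2 estimates; close to 1 when Sigma_1 = Sigma_2.
    ratio = a2_hat(s1, n1 - 1) / a2_hat(s2, n2 - 1)
    print(a1_hat(s), a2_hat(s, n1 + n2 - 2), ratio)
```

The estimators only need traces of S and S², so they remain computable when p≥n_i and S_i is singular.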

The modified likelihood ratio test suggested by Bartlett (1937) on intuitive grounds is based on the statistic:

and is valid when p<n_i. In particular, if p is fixed, the asymptotic null distribution of -2 log L, as n_i→∞ for i = 1, 2, is the chi-squared distribution with p(p+1)/2 degrees of freedom.
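For illustration, the classical statistic can be sketched as follows. The display for L is missing from this copy, so the code assumes the usual form L = |S₁|^((n₁-1)/2) |S₂|^((n₂-1)/2) / |S|^(n/2) with S the pooled estimate; the function name is hypothetical. It also makes the restriction p<n_i explicit, since the log-determinants are undefined for singular S_i.

```python
import numpy as np


def neg2_log_modified_lr(x1, x2):
    """Return (-2 log L, degrees of freedom) for H0: Sigma_1 = Sigma_2.

    Assumes L = |S1|^((n1-1)/2) * |S2|^((n2-1)/2) / |S|^(n/2), n = n1 + n2 - 2.
    """
    n1, p = x1.shape
    n2 = x2.shape[0]
    if p >= min(n1, n2):
        raise ValueError("requires p < n_i; S_i is singular otherwise")
    n = n1 + n2 - 2
    s1 = np.atleast_2d(np.cov(x1, rowvar=False))
    s2 = np.atleast_2d(np.cov(x2, rowvar=False))
    s = ((n1 - 1) * s1 + (n2 - 1) * s2) / n

    def logdet(m):
        # slogdet returns (sign, log|det|); it is more stable than log(det(m))
        return np.linalg.slogdet(m)[1]

    stat = n * logdet(s) - (n1 - 1) * logdet(s1) - (n2 - 1) * logdet(s2)
    df = p * (p + 1) // 2
    return stat, df
```

The returned value would be compared with a chi-squared quantile on p(p+1)/2 degrees of freedom; the quantile lookup is left out to keep the sketch dependency-free.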

Because the modified likelihood ratio test L is unavailable when p≥n_i, Schott (2007) proposed a test for the equality of several covariance matrices. This study considers Schott’s test statistic only for the case of two covariance matrices. Based on a consistent estimator of the squared Frobenius norm of Σ₁-Σ₂, namely tr(Σ₁-Σ₂)², his test statistic is given by:

Under the null hypothesis, T_J is asymptotically distributed as N (0, 1) as (p, n₁, n₂)→∞.

Srivastava (2007) proposed a test based on a lower bound on the Frobenius norm. It is given by:

where:

and

where, c₀ = n (n³+6n²+21n+18), c₁ = 2n (2n²+6n+9), c₂ = 2n (3n+2) and c₃ = n (2n²+5n+7). Under the null hypothesis T_S is asymptotically distributed as N (0, 1) as (p, n)→∞.

Srivastava and Yanagihara (2010) proposed an alternative test based on a consistent estimator of a measure of distance by , where , i = 1, 2. The consistent estimators of γ_i are given by . The test statistic is given by:

where:

and

Under the null hypothesis, T_SY is asymptotically distributed as N (0, 1) as (p, n)→∞.

THE PROPOSED STATISTIC

To test the hypothesis H₀: Σ₁ = Σ₂ = Σ against H₁: Σ₁≠Σ₂ for p≥n_i, observe that if Σ₁ = Σ₂, then trΣ₁² = trΣ₂². Thus, under the null hypothesis, the measure b = a₂₁/a₂₂ = 1. Using lemma A3 in the Appendix, which extends lemma A1 of Srivastava (2005), a consistent estimator of b is b̂ = â₂₁/â₂₂. The following lemma gives the asymptotic distribution of the consistent estimators.

Lemma 1: Let (n_i-1) S_i∼W_p (Σ_i, n_i-1), â_2i, i = 1, 2, as defined in (1) and , then under the assumptions (A2) and (A4):

where, denotes x converges in distribution to y.

Proof: Since the random samples x_ij, j = 1, ..., n_i, i = 1,2, are drawn from two independent populations and the sample covariance matrices S_i are calculated from the corresponding independent random samples x_1j and x_2j, S₁ and S₂ are independent of each other. Moreover, the statistic â₂₁ is a function of S₁ alone and the statistic â₂₂ is a function of S₂ alone. Thus â₂₁ and â₂₂ are also independent, so that COV (â₂₁, â₂₂) = 0. By lemma A4 in the Appendix, â_2i, i = 1, 2, are asymptotically normally distributed with mean a_2i and variance:

Together with the fact that the covariance between â₂₁ and â₂₂ is zero, it follows that the joint asymptotic distribution of the statistics â₂₁ and â₂₂ is bivariate normal with mean vector and covariance matrix as given above. The proof is completed.

Note that the estimator of b is a ratio of two uncorrelated estimators. By the delta method (Lehmann and Romano, 2005), a smooth function of jointly asymptotically normal random variables is itself asymptotically normal. The following theorem establishes the asymptotic normality of this statistic.

Theorem 1: Let b and be as defined above. Then, under the assumptions (A1)-(A4), , where:

Proof: We note that . Hence the partial derivative of with respect to â₂₁ is:

Similarly, the partial derivative of with respect to â₂₂ is:

Thus, by applying the delta method, asymptotically with:

The proof is completed.
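The delta-method step can be written out as follows. This is a sketch that assumes b = a₂₁/a₂₂ and b̂ = â₂₁/â₂₂ (the ratio described in the abstract) and writes ξ_i² for the asymptotic variance of â_2i from lemma A4; the symbol ξ_i² is introduced here for illustration and may differ from the authors' notation.

```latex
\hat b = g(\hat a_{21}, \hat a_{22}) = \frac{\hat a_{21}}{\hat a_{22}},
\qquad
\frac{\partial g}{\partial \hat a_{21}} = \frac{1}{\hat a_{22}},
\qquad
\frac{\partial g}{\partial \hat a_{22}} = -\frac{\hat a_{21}}{\hat a_{22}^{2}} .

% First-order variance, evaluated at (a_{21}, a_{22}); the cross term vanishes
% because Cov(\hat a_{21}, \hat a_{22}) = 0.
\operatorname{Var}(\hat b) \approx
\frac{\xi_1^{2}}{a_{22}^{2}} + \frac{a_{21}^{2}\,\xi_2^{2}}{a_{22}^{4}}
= b^{2}\left( \frac{\xi_1^{2}}{a_{21}^{2}} + \frac{\xi_2^{2}}{a_{22}^{2}} \right).
```

Under H₀, a₂₁ = a₂₂ = a₂ and b = 1, so this first-order variance reduces to (ξ₁²+ξ₂²)/a₂².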

Corollary 1: Let be as defined above. Under H₀: Σ₁ = Σ₂ = Σ and the assumptions (A1)-(A4), then:

Proof: Under H₀, then a₂₁ = a₂₂ = a₂ and a₄₁ = a₄₂ = a₄. Thus:

The result then follows from Theorem 1, which completes the proof.

In order to use T in practice, δ² must be estimated, which requires estimates of a₂ and a₄. Under the null hypothesis, using the consistent estimators â₂ and â^*₄ of a₂ and a₄ given in lemmas A1 and A5 in the Appendix, respectively, and assumption (A2), we obtain a corresponding consistent estimator of δ², namely:

Thus a test of H₀ is based on the statistic:

Its asymptotic null distribution is also standard normal. At significance level α, the proposed test statistic T* rejects H₀ if |T*|>z_α/2, where z_α/2 denotes the upper α/2 quantile of the standard normal distribution.
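The rejection rule can be sketched in a few lines of Python using only the standard library. The function names are hypothetical, and the value of T* is assumed to have been computed elsewhere.

```python
from statistics import NormalDist


def critical_value(alpha=0.05):
    """Upper alpha/2 quantile z_{alpha/2} of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def reject_h0(t_star, alpha=0.05):
    """Two-sided rule: reject H0 when |T*| > z_{alpha/2}."""
    return abs(t_star) > critical_value(alpha)
```

At α = 0.05 the critical value is approximately 1.96, so H₀ is rejected for values of T* far from zero in either direction.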

SIMULATION STUDY

Here, the performance of the proposed test statistic T* is compared with that of the three tests T_J, T_S and T_SY through numerical simulation. To assess how well the normal approximation holds for each test, the Attained Significance Level (ASL) of each test was simulated; it is expected to be close to the nominal significance level. The empirical powers of these tests in different situations were also computed.
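As a minimal sketch of how an ASL can be estimated by Monte Carlo, the Python code below repeatedly draws two samples under the null hypothesis and records how often a two-sided test rejects. It is not the authors' Fortran/IMSL program. The statistic is passed in as a function, and `toy_stat` is a placeholder that is not from the paper: it is used only because it is exactly standard normal under H₀ when the first variable has unit variance.

```python
import numpy as np
from statistics import NormalDist


def attained_significance_level(stat_fn, sigma, n1, n2, reps=1000, alpha=0.05, seed=0):
    """Proportion of null replications in which |stat| > z_{alpha/2}."""
    rng = np.random.default_rng(seed)
    p = sigma.shape[0]
    z = NormalDist().inv_cdf(1 - alpha / 2)
    mean = np.zeros(p)
    rejections = 0
    for _ in range(reps):
        # Both samples share the same covariance matrix, so H0 holds.
        x1 = rng.multivariate_normal(mean, sigma, size=n1)
        x2 = rng.multivariate_normal(mean, sigma, size=n2)
        if abs(stat_fn(x1, x2)) > z:
            rejections += 1
    return rejections / reps


def toy_stat(x1, x2):
    """Placeholder statistic (not from the paper): z-test on the first variable."""
    n1, n2 = len(x1), len(x2)
    return (x1[:, 0].mean() - x2[:, 0].mean()) / np.sqrt(1 / n1 + 1 / n2)


def mc_standard_error(alpha, reps):
    """Monte Carlo standard error of an estimated rejection rate."""
    return (alpha * (1 - alpha) / reps) ** 0.5
```

With 10,000 replications and a true size of 0.05, the Monte Carlo standard error is about 0.0022, which gives a yardstick for judging whether an ASL is "close to" 0.05. Empirical power is obtained the same way, with the two samples drawn from different covariance matrices.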

Parameter selection: 10,000 independent replications of multivariate normal random datasets were generated using the multivariate normal random number generator subroutine (RNMVN) of the International Mathematics and Statistics Library (IMSL) in Fortran. The nominal significance level α was 0.05. Under the null hypothesis, the test statistics T*, T_J, T_S and T_SY were computed and the proportion of rejections for each statistic was recorded; this proportion is the Attained Significance Level (ASL). The ASL under the null hypothesis and the corresponding empirical power under the alternative hypothesis were computed for the following hypotheses under different patterns of the covariance matrix:

•	Unstructured pattern (UN): It is defined as Σ = (σ_ij)^p_{i,j=1}. We considered the hypotheses as follows:
	•	H₀: Σ₁ = Σ₂ = U₀ against
	•	H₁: Σ₁ = U₀ and Σ₂ = U₁

	where U₀ = (σ_ij)^p_{i,j=1} with σ_ij = 1 (if i = j); σ_ij = (-1)^(i+j) (0.10i)/j (if i ≠ j) and U₁ = (σ_ij)^p_{i,j=1} with σ_ij = 1 (if i = j); σ_ij = (-1)^(i+j) (0.05i)/j (if i ≠ j)
•	Compound Symmetry pattern (CS): It is defined as Σ = σ²I_p+k1_p1'_p, where σ²>0, k is an appropriate constant, I_p denotes the p×p identity matrix and 1_p denotes the p×1 vector of ones. The hypotheses were set as:
	•	H₀: Σ₁ = Σ₂ = C₀ = 0.99I_p+(0.01)1_p1'_p against
	•	H₁: Σ₁ = C₀ and Σ₂ = C₁ = 0.95I_p+(0.05)1_p1'_p

•	Heterogeneous compound symmetry pattern (CSH): It is defined as Σ = (σ_ij)^p_{i,j=1} where σ_ij = σ²_i >0 (if i = j); σ_ij = σ_iσ_jρ (if i ≠ j), where ρ is the correlation parameter satisfying |ρ|<1. The hypotheses were set as:
	•	H₀: Σ₁ = Σ₂ = M₀, where M₀ is a CSH matrix with σ_ij∼U (5,6) (if i = j), ρ = 0.5, against
	•	H₁: Σ₁ = M₀ and Σ₂ = M₁, where M₁ is a CSH matrix with σ_ij ∼U (4,5) (if i = j), ρ = 0.4.

•	Simple pattern (SIM): It is defined as Σ = σ²I. We set the hypotheses as:
	•	H₀: Σ₁ = Σ₂ = 2I against H₁: Σ₁ = 2I and Σ₂ = 1.5I
	•	H₀: Σ₁ = Σ₂ = Σ = I_p against H₁: Σ₁ = I_p and Σ₂ = Diag (1,1,1,2, ..., 1,1,1,2)
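The CS, CSH and diagonal matrices listed above can be built with a few lines of NumPy. This is a sketch with hypothetical function names; the seed is arbitrary, and the CSH diagonal entries are drawn once from the stated uniform ranges. The UN pattern is left out because its off-diagonal formula, as printed here, is not symmetric in i and j, so its intended form is unclear.

```python
import numpy as np


def compound_symmetry(p, sigma2, k):
    """CS pattern: sigma2 * I_p + k * 1_p 1_p'."""
    return sigma2 * np.eye(p) + k * np.ones((p, p))


def heterogeneous_cs(variances, rho):
    """CSH pattern: diagonal sigma_i^2, off-diagonal sigma_i * sigma_j * rho."""
    sd = np.sqrt(np.asarray(variances, dtype=float))
    sigma = rho * np.outer(sd, sd)
    np.fill_diagonal(sigma, sd**2)
    return sigma


def repeating_diagonal(p, block=(1, 1, 1, 2)):
    """Diag(1,1,1,2, ..., 1,1,1,2); assumes p is a multiple of the block length."""
    if p % len(block):
        raise ValueError("p must be a multiple of the block length")
    return np.diag(np.tile(np.asarray(block, dtype=float), p // len(block)))


rng = np.random.default_rng(0)  # arbitrary seed
p = 20
c0 = compound_symmetry(p, 0.99, 0.01)             # null matrix C0
c1 = compound_symmetry(p, 0.95, 0.05)             # alternative matrix C1
m0 = heterogeneous_cs(rng.uniform(5, 6, p), 0.5)  # null matrix M0
m1 = heterogeneous_cs(rng.uniform(4, 5, p), 0.4)  # alternative matrix M1
d = repeating_diagonal(p)                         # Diag(1,1,1,2, ...)
```

Note that C₀ and C₁ both have unit diagonal, so they differ only in their common off-diagonal value (0.01 versus 0.05).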

RESULTS AND DISCUSSIONS

Table 1 presents the ASL and empirical powers of T_J, T_S, T_SY and T* when all covariance matrices in the hypotheses followed the unstructured pattern (UN). The ASL of the tests T_S and T_SY were far below the nominal significance level 0.05 in all cases considered. The ASL of the test T_J was also generally not close to 0.05: it was around 0.060 when the sample sizes were small (n₁ = n₂ = 20 here) and tended to increase as the sample sizes became larger, for any p. For instance, when p = 80, the ASL of T_J was 0.057 (at n₁ = n₂ = 20) and increased to 0.061 (at n₁ = n₂ = 80). By contrast, the ASL of the proposed test T* was reasonably close to 0.05 and improved as p and the sample sizes increased. Thus the tests T_J, T_S and T_SY were not reasonable tests in this setting, whereas the proposed test T* was. Because the competing tests T_J, T_S and T_SY did not control the size in this situation, their empirical powers in Table 1 are not discussed. The empirical power of the proposed test T* increased to one as p and the sample sizes increased. In addition, for fixed p, the empirical power of T* increased with the sample sizes. For instance, when p = 160, the empirical power of T* was 0.288 (at n₁ = n₂ = 20) and increased to 0.944 (at n₁ = n₂ = 160).

Table 1:	ASL of T_J, T_S, T_SY and T* under Σ₁ = Σ₂ = U₀ and their empirical powers under Σ₁ = U₀ and Σ₂ = U₁, applied at α = 0.05

ASL: The attained significance level, U₀ and U₁: The matrices defined under unstructured pattern (UN), α: The nominal significance level

Table 2:	ASL of T_J, T_S, T_SY and T* under Σ₁ = Σ₂ = C₀ and their empirical powers under Σ₁ = C₀ and Σ₂ = C₁, applied at α = 0.05

ASL: The attained significance level, C₀ and C₁: The matrices defined under compound symmetry pattern (CS), α: The nominal significance level

Table 2 reports the ASL and empirical powers of T_J, T_S, T_SY and T* when all covariance matrices in the hypotheses followed the compound symmetry pattern (CS). Both T* and T_J gave satisfactory ASL, close to 0.05 in all cases considered, whereas those of T_S and T_SY were not close to 0.05. As seen in the table, the ASL of T_S and T_SY decreased as p and the sample sizes increased. For instance, when p = n₁ = n₂ = 80, the ASL of T_S and T_SY were 0.049 and 0.045, respectively, and they decreased to 0.043 and 0.031, respectively, when p = n₁ = n₂ = 160. Moreover, for fixed p, the ASL of T_S and T_SY dropped as the sample sizes increased. For example, when p = 160, the ASL of T_S and T_SY were 0.059 and 0.057, respectively (at n₁ = n₂ = 20), and decreased to 0.043 and 0.031, respectively (at n₁ = n₂ = 160). This indicates that the tests T_S and T_SY were not suitable, whereas the proposed test T* and the test T_J were appropriate. The empirical powers of T* and T_J were quite high and tended rapidly to one. Moreover, the empirical powers of both tests responded strongly to increases in p and the sample sizes. Furthermore, the empirical powers of the proposed test T* were slightly higher than those of T_J in the cases considered.

Table 3 displays the ASL and empirical powers of the tests T_J, T_S, T_SY and T* when all covariance matrices in the hypotheses followed the heterogeneous compound symmetry pattern (CSH). The ASL of all tests under the CSH pattern behaved similarly to those obtained under the UN pattern in Table 1. The ASL of the tests T_J, T_S and T_SY were not close to 0.05, whereas that of the proposed test T* approximated 0.05 well as p and the sample sizes increased. The ASL of the tests T_S and T_SY in this table were lower than those in Table 1 in all cases considered. This indicates that the convergence of T_S and T_SY to the standard normal distribution was very slow and was not achieved when the common covariance matrix followed the CSH or UN pattern. As displayed in this table, the empirical powers of the proposed test T* converged rapidly to one as p and the sample sizes increased.

Table 3:	ASL of T_J, T_S, T_SY and T* under Σ₁ = Σ₂ = M₀ and their empirical powers under Σ₁ = M₀ and Σ₂ = M₁, applied at α = 0.05

ASL: The attained significance level, M₀ and M₁: The matrices defined under heterogeneous compound symmetry pattern (CSH), α: The nominal significance level

Table 4:	ASL of T_J, T_S, T_SY and T* under Σ₁ = Σ₂ = 2I and their empirical powers under Σ₁ = 2I and Σ₂ = 1.5I, applied at α = 0.05

ASL: The attained significance level, α: The nominal significance level

Table 4 reports the ASL of the tests T_J, T_S, T_SY and T* under the common covariance matrix Σ = 2I (simple pattern) and their empirical powers under Σ₁ = 2I and Σ₂ = 1.5I. As expected, the ASL of T* and T_J were quite close to 0.05 in all cases considered, while those of T_S and T_SY were close to zero in all cases. The empirical powers of T* and T_J converged to one as p and the sample sizes increased. The empirical power of T* converged to one much faster than that of T_J, especially when n₁ = n₂≤40, for all p. For example, when p = n₁ = n₂ = 20, the empirical powers of T* and T_J were 0.757 and 0.117, respectively. This indicates that, under the simple pattern, T* was a reasonable test and more powerful than T_J, particularly for small samples.

Table 5 presents the ASL of tests T_J, T_S, T_SY and T* under the common covariance matrix Σ = I (simple pattern) and empirical powers under Σ₁ = I and a certain matrix Σ₂ = Diag (1,1,1,2, ..., 1,1,1,2).

Table 5:	ASL of T_J, T_S, T_SY and T* under Σ₁ = Σ₂ = I and their empirical powers under Σ₁ = I and Σ₂ = Diag (1,1,1,2, ..., 1,1,1,2), applied at α = 0.05

ASL: The attained significance level, α: The nominal significance level

Table 6:	ASL of T_J, T_S, T_SY and T* under Σ₁ = Σ₂ = I and their empirical powers under Σ₁ = I and Σ₂ = Diag (1,1,1,2, ..., 1,1,1,2) when n₂ = 2n₁, applied at α = 0.05

ASL: The attained significance level, α: The nominal significance level

As displayed in Table 5, the ASL of the proposed test T* and the test T_J were similar to those in Table 4 and reasonably close to 0.05 for all values of p and the sample sizes. This means that changing the scalar σ² in the simple pattern from σ² = 2 to σ² = 1 does not affect the convergence of T* and T_J to asymptotic normality. It does, however, greatly affect the convergence of T_S and T_SY, because the ASL of both tests were much better when Σ = I than when Σ = 2I in all cases considered. Nevertheless, the ASL of the tests T_S and T_SY still mostly failed to control the level at 0.05, particularly when the sample sizes were less than or equal to 40, for any p. As expected, the empirical powers of T_J and T* tended quickly to one as p and the sample sizes increased. In addition, the proposed test T* generally gave higher power than T_J.

We carried out additional simulations of the four tests T_J, T_S, T_SY and T* for unequal sample sizes (n₁≠n₂), choosing n₂ = 2n₁, under the null hypothesis Σ₁ = Σ₂ = I. The corresponding empirical powers of these tests were computed under the alternative hypothesis Σ₁ = I and Σ₂ = Diag (1,1,1,2, ..., 1,1,1,2). The results are provided in Table 6.

Table 6 shows that both the ASL and the empirical powers of these tests were not substantially different from those in Table 5. The tests T_S and T_SY still had ASL not close to 0.05, particularly for the small sample sizes (n₁ = 20 and n₂ = 40 here). The proposed test statistic T* and the test T_J remained appropriate even when the sample sizes were not the same. The empirical powers of the proposed test T* remained higher than, and converged to one faster than, those of T_J.

APPLICATION

In this section, we analyze the dataset of Notterman et al. (2001), available online at http://genomics-pubs.princeton.edu/oncology/Data/CarcinomaNormaldatasetCancerResearch.txt (last accessed: 9 October 2012). Two groups of colon tissues (adenocarcinoma and adenoma) were examined by oligonucleotide arrays. The expression levels of about 6500 human genes were probed in 18 colon adenocarcinomas and 4 colon adenomas. We restricted attention to a subset of 100 gene expression levels on 4 colon adenocarcinomas and 4 colon adenomas. Thus we had n₁ = 4, n₂ = 4 and p = 100. We examined whether the covariance matrices of the two groups are equal. The observed test statistic values were T_J = 0.908 and T* = -0.636. The corresponding p-values were 0.182 and 0.524, so the hypothesis of equality of the two covariance matrices was not rejected at any reasonable significance level.
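The reported p-values can be checked against the standard normal distribution. The sketch below assumes that the p-value for T_J is upper-tailed and the one for T* is two-sided; this is inferred from the reported numbers rather than stated in the text, and the function names are hypothetical.

```python
from statistics import NormalDist


def upper_tail_p(t):
    """P(Z > t) for a standard normal Z."""
    return 1 - NormalDist().cdf(t)


def two_sided_p(t):
    """P(|Z| > |t|) for a standard normal Z."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


p_tj = upper_tail_p(0.908)      # T_J = 0.908
p_tstar = two_sided_p(-0.636)   # T* = -0.636
```

Under these assumptions the two values come out at roughly 0.182 and 0.525, in line with the reported 0.182 and 0.524 up to rounding.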

CONCLUSIONS

In this study, we proposed an alternative test statistic for testing the equality of two covariance matrices of two independent multivariate normal populations when p≥n_i, i = 1,2. The test statistic T* is based on consistent estimators. Its asymptotic null distribution is standard normal as (p, n₁, n₂)→∞, even if p/n_i→c_i∈ (0, ∞), i = 1,2. The simulation results strongly supported the performance of the proposed test statistic T*: it accurately controls the size of the test and is not greatly affected by changes in the common covariance matrix under the null hypothesis. In the simulation study, the proposed test statistic T* had the highest power among the competing test statistics: T_J, a special case of the test for the equality of several covariance matrices proposed by Schott (2007), and T_S and T_SY, given by Srivastava (2007) and Srivastava and Yanagihara (2010), respectively.

ACKNOWLEDGMENT

We would like to thank the Commission on Higher Education (CHE) of Thailand for financial support through a grant fund under the Strategic Scholarships Fellowships Frontier Research Networks.

APPENDIX

Most of the work in this study can be viewed as an extension of some results of Srivastava (2005) and Fisher et al. (2010). In order to prove Lemma 1, we have taken the following two useful lemmas (lemmas A1 and A2) from Srivastava (2005).

Lemma A1: Let nS∼W_p(Σ,n) and a_k = (trΣ^k)/p, k = 1,...,4. Then under the assumptions (A1) and (A3), an unbiased and consistent estimator of a₂ as (p, n)→∞ is given by â₂.

Lemma A2: Let nS∼W_p(Σ,n), â₂ as defined in (2) and a_k = (trΣ^k)/p, k = 1,...,4. Then under the assumptions (A1) and (A3):

where, Φ(x) denotes the cumulative distribution function of a standard normal random variable and:

The extensions of lemmas A1 and A2 can be obtained without proofs as follows:

Lemma A3: Let (n_i-1)S_i∼W_p(Σ_i, n_i-1) and a_ji = (trΣ^j_i)/p, i = 1,2, j = 1,...,4. Then under the assumptions (A2) and (A4), unbiased and consistent estimators of a_2i as (p, n_i) → ∞ are given by â_2i, i = 1,2.

Lemma A4: Let â_2i, i = 1,2, be as defined in (1) and a_ji = (trΣ^j_i)/p, i = 1,2, j = 1,..., 4. Then under the assumptions (A2) and (A4):

where Φ(x) denotes the cumulative distribution function of a standard normal random variable and:

The following lemma is taken from Fisher et al. (2010). It is therefore also presented without proof.

Lemma A5: Let nS∼W_p(Σ, n) and a_k = (trΣ^k)/p, k = 1,...,16. Then under the assumptions (A1) and (A3), an unbiased and consistent estimator of a₄ as (p,n)→∞ is given by â^*₄, defined as:

where:

and

REFERENCES

Bartlett, M.S., 1937. Properties of sufficiency and statistical tests. Proc. R. Soc. Lond. A, 160: 268-282.
Dudoit, S., J. Fridlyand and T.P. Speed, 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Statist. Assoc., 97: 77-87.
Fisher, T., X. Sun and C.M. Gallagher, 2010. A new test for sphericity of the covariance matrix for high dimensional data. J. Multivariate Anal., 101: 2554-2570.
Fujikoshi, Y., V.V. Ulyanov and R. Shimizu, 2010. Multivariate Statistics: High-Dimensional and Large-Sample Approximations. John Wiley and Sons, New Jersey, ISBN-13: 9780470411698.
Gamage, J. and T. Mathew, 2008. Inference on mean sub-vectors of two multivariate normal populations with unequal covariance matrices. Stat. Probabil. Lett., 78: 420-425.
Huang, D., Y. Quan, M. He and B. Zhou, 2009. Comparison of linear discriminant analysis methods for the classification of cancer based on gene expression data. J. Exp. Clin. Cancer Res., Vol. 28.
Ibrahim, J.G., M. Chen and R.J. Gray, 2002. Bayesian models for gene expression with DNA microarray data. J. Am. Statist. Assoc., 97: 88-99.
Johnson, D.E., 1998. Applied Multivariate Methods for Data Analysis. Duxbury Press, New York, USA.
Krzanowski, W.J., 2000. Principles of Multivariate Analysis: A User's Perspective. 1st Edn., Oxford University Press, USA., ISBN-10: 0198507089, Pages: 608.
Lehmann, E.L. and J.P. Romano, 2005. Testing Statistical Hypotheses. 3rd Edn., Springer, New York, ISBN-13: 9780387988641, Pages: 786.
Notterman, D.A., U. Alon, A.J. Sierk and A.J. Levine, 2001. Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays. Cancer Res., 61: 3124-3130.
Schott, J.R., 2007. A test for the equality of covariance matrices when the dimension is large relative to the sample size. Comput. Stat. Data Anal., 51: 6535-6542.
Sebastiani, P., H. Xie and M.F. Ramoni, 2006. Bayesian analysis of comparative microarray experiments by model averaging. Bayesian Anal., 1: 707-732.
Srivastava, M.S., 2002. Methods of Multivariate Statistics. John Wiley and Sons, New York, ISBN-13: 9780471223818, Pages: 728.
Srivastava, M.S., 2005. Some tests concerning the covariance matrix in high dimensional data. J. Japan Statist. Soc., 35: 251-272.
Srivastava, M.S., 2007. Testing the equality of two covariance matrices and testing the independence of two subvectors with fewer observations than the dimension. Proceedings of the International Conference on Advances in Interdisciplinary Statistics and Combinatorics, October 12-14, 2007, Greensboro, North Carolina, USA.
Srivastava, M.S. and H. Yanagihara, 2010. Testing the equality of several covariance matrices with fewer observations than the dimension. J. Multivariate Anal., 101: 1319-1329.

Journal of Applied Sciences

Research Article
