ABSTRACT
DNA microarrays can measure the differential expression of hundreds and thousands of genes to identify changes in their expression between pair-wise biological states. Although, many methods are developed to determine the significance of these changes while accounting for the enormous number of genes, there is no uniform or standard method for different microarray experiments to date. Researches are often confused with diverse algorithm issues, especially the changing fold of differential expression genes, in the selection of suitable candidate genes. Here, a simulation was described to analyze the microarray experimental data correlated with corresponding biological phenotypes, aiming at to measure the proper fold changes of differential expression genes with rigorous statistics of p values and acceptable FDRs (false discovery rates) in permutated univariate tests. The result showed that the commonly used fold change method of candidate gene selection was relatively reasonable and the variation scope was suggested as fold changes of 1.5-2.1. The fold changes no less than 1.5 of gene expression would be a good choice of gene selection with stable gene numbers passed the filtering criteria. When the data filtering fold change was rigid, the number of genes passed the filtering criteria turned to decrease sharply. The present analysis provided a good basis for further gene expression studies.
PDF Abstract XML References Citation
How to cite this article
DOI: 10.3923/ajava.2012.745.753
URL: https://scialert.net/abstract/?doi=ajava.2012.745.753
INTRODUCTION
Recent years have seen an explosion of work on DNA microarray experiments. For biological problems, such data mainly arise from DNA microarrays (both genome-wide gene expression profile and Single Nucleotide Polymorphism (SNP) mutation assays), from which investigators try to classify disease categories, tumor types, response to drugs, gene expression regulation or other categories (Ludwig and Weinstein, 2005). DNA microarray is a kind of high-throughput biotechnology containing millions of cDNA or oligonucleotide probes for the measuring of gene expression and it can be used to detect the expression hundreds and thousands of genes in a hybridization experiment or single arrays. Despite the success and popularity of oligonucleotide arrays as a high-throughput technique for measuring mRNA expression levels, quantitative calibration studies have until now been limited due to unsuitable data. However, a new high-throughput technology produced by Affymetrix Corporation contains calibration data and generated their unique oligonucleotide arrays permitting the detailed study of intensity dependent gene expression. Affymetrix oligonucleotide arrays (GeneChip®, Affymetrix Corporation) have been applied to many gene expression studies in the past years (Tavazoie et al., 1999; Cho et al., 2001; Hakak et al., 2001). It adopted a set of perfect match and mismatch oligonucleotide probes, usually 11 to 20 pairs, to measure the mRNA concentrations of genes expressed in an array. Besides linear normalization, average difference and signal methods provided by MAS software (Affymetrix Corporation), researchers have now proposed alternative multi-level analysis methods such as feature extraction (Schadt et al., 2001a, b), normalization (Hill et al., 2001), statistical inferences of gene expression changes (Baldi and Long, 2001; Liu et al., 2005; Fox and Dimmic, 2006), expression index computation (Li and Wong, 2001a, b; Irizarry et al., 2003; Lazaridis et al., 2002; Zhou and Abagyan, 2002) and other aspects (Pavelka et al., 2004; Lu, 2004; Weng et al., 2006) in the attempts to improve these issues.
Although, massive amounts of data are generated and methods are developed to determine whether changes in gene expression are experimentally significant, there is no universal uniform method or justified filtering criterion for different microarray experiment results. Most of the efforts in method development have appropriately focused on what to do with their own datasets. Researches are often confused with diverse algorithms in practice. Many experiments were reported with 2 fold or at least a 1.5 fold changes with an estimated FDR (false discovery rate) (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001) and other experiments were analyzed with lower fold changes since few genes were selected out with 2 fold or 1.5 fold changes sometimes (Tusher et al., 2001; Naef et al., 2002). It seems arbitrary to determine the fold changes of candidate gene selection for further researches to freshmen or laymen. Moreover, most experiments reported only the diverse differential expression genes without correlation with biological phenotypes, especially phenotype data. In the study, a simulation analysis was made in an attempt to resolve the issue for candidate genes selection based on fold changes using their correlation with phenotype data in DNA microarray experiments.
MATERIALS AND METHODS
Data sets: The datasets were retrieved from GEO http://www.ncbi.nlm.nih.gov/geo/or Array Express http://www.ebi.ac.uk/arrayexpress/ (GSE12675) for the chicken embryo gene expression profiling in heart tissues at Hamburger-Hamilton stage 43. The data (GSE12675) were produced with Affymetrix GeneChip® Chicken Genome Arrays which contains 32773 transcripts corresponding to over 28000 chicken genes and 684 transcripts from 17 avian viruses. It included 60 animals and 12 chicken genome arrays in total. The definite animal assembly, as well as the procedures of sample preparation, RNA preparation, microarray hybridization and quantitative PCR experiment, were described in the previous study (Li and Zhao, 2009). In the simulation analysis, the phenotype data were the tender weights of organism muscle tissues (breast muscle and thigh muscle) produced by the corresponding 60 full or half sibs (i.e., mixed family) at the age of eight weeks in the same assembly as described (Li and Zhao, 2009) with proper statistical amendment based on the farms statistical data. It was aimed at probing into the effect of differential expression genes on muscle traits and assumed the chicken full or half sibs phenotype data were genetically the same as those animals used in previous studies (Li and Zhao, 2009; Song et al., 2010).
Data preprocessing: The raw datasets (*.CEL files) were re-analyzed by Affymetrix GCOS software (Affymetrix Corporation) for quality issues. Data preprocessing was performed using software packages developed in version 2.5.0 of Bioconductor (Gentleman et al., 2004) and R version 2.10.0 (R Development Core Team, 2009). Each Affymetrix dataset was background adjusted, normalized and log2 probe-set intensities calculated using the Robust Multichip Averaging (RMA) algorithm in affy package (Irizarry et al., 2003; Gautier et al., 2004). When the multiple probe sets target one gene, the probe set with largest variability was kept.
The correlation of differential expression genes with phenotype data was accomplished with a program written in the Fortran language and Bioconductor 2.5.0 (Gentleman et al., 2004) for permutated univariate tests of a random variance model (Wright and Simon, 2003). Two correlation algorithms, i.e. Pearson correlation and Spearman correlation, were used in the analysis with three levels of significance of p value of the permutated univariate tests (0.05, 0.01 and 0.001) and corresponding FDRs. Pearson correlation is computed between the measurements from two datasets. The Pearson correlation is commonly used in statistical models, but Spearman correlation is a more robust measure of correlation. The Spearman correlation is defined as a rank correlation like the Pearson correlation except that within each profile, each gene expression measurement is replaced by its rank within that profile and it is those ranks that are used in computing the correlation. The Spearman correlation is less sensitive to outlying observations and therefore tends to be more stable, but it is not clear how to handle assignment of rank if there are some missing expression measurements.
RESULTS AND DISCUSSION
In recent years, gene expression profile studies, including real-time quantitative PCR (quantitative RT-PCR or RT-PCR) and DNA microarray experiments, are increasingly common in domesticated animal researches (Salleh et al., 2004; Endrini et al., 2007; Chaudhry, 2008; Baatartsogt et al., 2009; Cossio-Bayugar et al., 2009; Chen et al., 2011; Huang et al., 2011; Long et al., 2011; Baatartsogt et al., 2011; Suzuki et al., 2011), especially in birds (Mekki et al., 2006; Lim et al., 2009; Goyal et al., 2010). However, few studies reported the comparative selection of suitable differential expression genes or candidate genes for further studies with dynamic fold changes of gene expression. Correlation of differential expression genes with biological phenotype information was also seldom reported and many of the experiments were designed concerning the gene interaction or interactome predicted from the genome-wide correlation (Ge et al., 2001). To the best of our knowledge, there were few researches or reports on the justification of candidate gene selection determined by fold changes in diverse microarray experiments. It is for the first time that the selection of differential expression genes based on fold changes had been experimentally investigated in chicken microarray experiments with reasonable and/or acceptable parameters correlated with specific phenotype data in the simulations.
The simulation analysis was time consuming since all the datasets were Affymetrix GeneChip® Chicken Genome Arrays which contains more than 28000 chicken genes transcript information. In total, twenty-four simulations were done in over 150 permutated univariate tests. The result was processed through permutated univariate tests of Pearson correlation and Spearman correlation with statistical significance (Table 1). Table 1 showed the summary of the prime result in detail. It was noticed that all the positive result was mainly from the Spearman correlation tests due to Spearman correlation is more robust and stable than Pearson correlation. Simulations using the Pearson correlation tests were almost unavailable and made no use later.
Table 1: | No. of significant genes detected with different filtering criteria in simulations (only Spearman correlations were stably available) |
As showed in Table 1, Pearson correlation tests with p value less than 0.001 found no differential expression genes. The fold changes ranged from 1.01 to 2.5 with detailed analyses from 1.01 to 1.09 due to tremendous variations in gene numbers (Table 1).
Although, p value of 0.05 or 0.01 is significant in the context of common experiments designed to evaluate small numbers of specimens or genes, microarray experiments designed for over 10,000 genes would identify 100 genes falsely by chance. This was why we gave a crucial consideration and took integrated correlation tests with p<0.001 (Table 1, Fig. 1). Spearman correlation tests with three significance levels revealed that number of genes passed the filtering criteria and correlation tests decreased rapidly before the fold change of 1.5. However, the number of genes passed the filtering criteria had changed more than 1.5 fold and turned into a relatively stable trend. Especially, the Spearman correlation tests with p<0.001 clearly produced no differential expression genes more than 1.5 fold change. These suggested that there would be no chance of mistake in the selection of differential expression genes correlated with the phenotype data. The fold change of 1.5 should be regarded as the lower limit of filtering criteria of candidate genes selection. Moreover, the selection of differential expression genes with more rigorous filtering criteria will result in less candidate genes available and the fold change of 2.1 were suggested as the suitable upper limit in candidate genes filtering (Table 1, Fig. 1). The variation scope should be 1.5-2.1 fold changes while the 1.8-2.0 fold changes of gene expression could be a better choice of gene selection.
Fig. 1: | Plot of the No. of significant genes detected with different filtering criteria in simulations (only Spearman correlations were stably available and showed), The lateral axis displayed the fold changes (i.e. filtering criteria) of differential expression genes and the vertical axes indicated the differential expression gene numbers passed the filtering criteria with Spearman correlations at the significance levels of p-values. The left vertical axis gave the gene numbers passed at the significance level of 0.05 of p-values while the right vertical axis provided the gene numbers passed at the significance levels of 0.01 and 0.001 of p- values |
When the data filtering fold changes were rigid, the number of genes passed the filtering criteria turned to decrease sharply. Figure 1 showed the plot of Spearman correlation test results.
The analyses above gave fine results and data with significant p values and acceptable FDRs in permutated univariate tests. Table 2 showed the first 37 genes sorted by the nominal 0.001 level of p value with Spearman correlation in a permutated univariate test. The significance of the p values and FDRs values were relatively acceptable (Benjamini and Yekutieli, 2001; Tusher et al., 2001; Wu, 2008) in permutated tests or multiple tests. Benjamini and Hochbergs method (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001) assumes independent tests and guarantees an upper bound for the FDR by a step-up or step-down procedure applied to the individual p values. Because of the limited number of permutations, the FDR is too granular. For data used in the simulations, the p value for each gene is calculated from permutations of only twelve microarray experiments and the FDR values (0.3-0.7) were relatively acceptable depending on how the p value was defined. Table 2 also indexed the simulation information and parametric p values and FDR values. Another important factor that might impact the FDR values was the selection of phenotype data since the simulation analysis was based on the correlation of differential expression genes with the full or half sibs organism muscle tender weights. The phenotype data were produced from full or half sibs genetically the same as those chicken in the previous study and the farms housing environment and management measures was constant and stable in the farm However, the full or half sibs were not the chicken used in the microarray experiments and the time was not same as that of gene expression.
Table 2: | The first 37 genes sorted by the nominal 0.001 level of p-value in the permutated univariate test with the 1.01 fold change of gene expression |
Positive coefficients showed the genes were significantly up-regulated and negative coefficients indicated the genes were significantly download-regulated |
Therefore, the experiments result could only be regarded as simulations and it remains to be confirmed by further studies in a more realistic experiment. The resulted data of the simulation analysis are available through e-mail upon demand.
CONCLUSION
So far, many studies appear as taking advantage of novel advanced biotechnologies such as DNA microarray to select suitable candidate genes and dissect the complex of gene regulation network. In this study, for the first time, simulation analysis was made to find out differential expression genes correlated with corresponding biological phenotypes with rigorous statistics of p values (p<0.05, 0.01, 0.001) and acceptable FDR values with stable correlation analyses in the permutated univariate tests. Generally speaking, the result showed that the traditional fold change method of candidate gene selection was relatively reasonable. The variation scope should be 1.5-2.1 fold changes and the 1.8-2.0 fold changes of gene expression should be the better choice of gene selection with stable gene numbers passed the filtering criteria. The simulation gave the first step to resolve the issue for candidate genes selection by means of observing fold changes and correlations with corresponding phenotype data. Potentially, this might offer a good reference and further connection to study gene regulatory network kinetics, as well as bridging the gap between the microarray experiments and individual biological phenotypes.
ACKNOWLEDGMENTS
We are grateful to the anonymous reviewers for their constructive comments and suggestions. This work was supported by Chinese Research Funds to LWY (2011PTFY04ZD, KJ2012A216).
REFERENCES
- Baatartsogt, O., K. Choi, P.K. Mandal, H.K. Lim and G.H. Li et al., 2009. Gene expression profile of synovial cells in experimental post-traumatic arthritis of knee in swine. Biotechnology, 8: 70-77.
CrossRefDirect Link - Baldi, P. and A.D. Long, 2001. A bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics, 17: 509-519.
CrossRefPubMedDirect Link - Benjamini, Y. and Y. Hochberg, 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B: (Methodol.), 57: 289-300.
Direct Link - Benjamini, Y. and D. Yekutieli, 2001. The control of the false discovery rate in multiple testing under dependency. Ann. Stat., 29: 1165-1188.
CrossRefDirect Link - Chaudhry, M.A., 2008. Induction of gene expression alterations by culture medium from trypsinized cells. J. Boil. Sci., 8: 81-87.
CrossRefDirect Link - Chen, Y., S.L. Yang, W.D. Deng and H.M. Mao, 2011. Differential gene expression in different tissues of black-bone sheep and normal sheep. Asian J. Anim. Vet. Adv., 5: 379-384.
CrossRefDirect Link - Cho, R.J., M. Huang, M.J. Campbell, H. Dong and L. Steinmetz et al., 2001. Transcriptional regulation and function during the human cell cycle. Nat. Genet., 27: 48-54.
CrossRefPubMedDirect Link - Cossio-Bayugar, R., E. Miranda-Miranda, D. Portilla-Salgado and J. Osorio-Miranda, 2009. Quantitative PCR detection of cholinesterase and carboxylesterase expression levels in acaricide resistant Rhipicephalus (Boophilus) microplus. J. Entomol., 6: 117-123.
CrossRefDirect Link - Mekki, D.M., B. Liang-yong, W. Jin-yu, Y. Yan, Y. Yabo and H.H. Musa, 2006. Avian E2A expression level variation associated with phenotypic variation of two Chinese indigenous duck breeds, (Anas platyrhynchos). Int. J. Poult. Sci., 5: 1023-1028.
CrossRefDirect Link - Fox, R.J. and M.W. Dimmic, 2006. A two-sample Bayesian t-test for microarray data. BMC Bioinformatics, Vol. 7.
CrossRefDirect Link - Gautier, L., L. Cope, B.M. Bolstad and R.A. Irizarry, 2004. Affy-analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20: 307-315.
CrossRefPubMedDirect Link - Ge, H., Z. Liu, G.M. Church and M. Vidal, 2001. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet., 29: 482-486.
CrossRefPubMedDirect Link - Gentleman, R.C., V.J. Carey, D.M. Bates, B. Bolstad and M. Dettling et al., 2004. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol., Vol. 5.
CrossRefDirect Link - Goyal, G., V. Upmanyu, S.K. Singh, S.K. Shukla, S. Mehra, V. Kumar and D. Sharma, 2010. Differential expression of IL-6 and IGF-II in guinea fowl and chicken. Int. J. Poult. Sci., 9: 756-760.
CrossRefDirect Link - Hakak, Y., J.R. Walker, C. Li, W.H. Wong and K.L. Davis et al., 2001. Genome-wide expression analysis reveals dysregulation of myelination-related genes in chronic schizophrenia. Proc. Natl. Acad. Sci. USA., 98: 4746-4751.
CrossRefPubMedDirect Link - Hill, A.A., E.L. Brown, M.Z. Whitley, G. Tucker-Kellogg, G.P. Hunter and D.K. Slonim, 2001. Evaluation of normalization procedures for oligonucleotide array data based on spiked cRNA controls. Genome Biol., Vol. 2, No. 12.
CrossRefDirect Link - Huang, J., L. Liu, H. Wang, C. Zhang and Z. Ju et al., 2011. Variants and gene expression of the TLR2 gene and susceptibility to mastitis in cattle. Asian J. Anim. Vet. Adv., 6: 51-61.
CrossRefDirect Link - Irizarry, R.A., B. Hobbs, F. Collin, Y.D. Beazer-Barclay, K.J. Antonellis, U. Scherf and T.P. Speed, 2003. Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics, 4: 249-264.
CrossRefPubMedDirect Link - Lazaridis, E.N., D. Sinibaldi, G. Bloom, S. Mane and R. Jove, 2002. A simple method to improve probe set estimates from oligonucleotide arrays. Math. Biosci., 176: 53-58.
CrossRefPubMedDirect Link - Li, C. and W.H. Wong, 2001. Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc. Natl. Acad. Sci. USA., 98: 31-36.
PubMedDirect Link - Li, M. and C. Zhao, 2009. Study on Tibetan chicken embryonic adaptability to chronic hypoxia by revealing differential gene expression in heart tissue. Sci. China. C. Life. Sci., 52: 284-295.
PubMed - Lim, H.K., K. Choi, P.K. Mandal, O. Baatartsogt, C.H. Lee, J.H. Lee and H.B. Kim, 2009. Transcriptional profiling of spleen lymphocyte in fowl typhoid of broilers. Asian J. Anim. Vet. Adv., 4: 66-75.
CrossRefDirect Link - Liu, X., A. Krishnan and A. Mondry, 2005. An Entropy-based gene selection method for cancer classification using microarray data. BMC Bioinformatics, Vol. 6.
CrossRefDirect Link - Long, X.Y., Y.X. Liu, H. Rocheleau, T. Ouellet and G.Y. Chen, 2011. Identification and validation of internal control genes for gene expression in wheat leaves infected by strip rust. Int. J. Plant Breed. Genet., 5: 255-267.
CrossRefDirect Link - Lu, C., 2004. Improving the scaling normalization for high-density oligonucleotide GeneChip expression microarrays. BMC Bioinformatics, Vol. 5.
CrossRefDirect Link - Ludwig, J.A. and J.N. Weinstein, 2005. Biomarkers in cancer staging, prognosis and treatment selection. Nat. Rev. Cancer, 5: 845-856.
CrossRefPubMedDirect Link - Salleh, M.N., P. Ismail, Y.H. Taufiq-Yap and P. Carmichael, 2004. Identification of Estrogen-responsive genes in p53+/- knockout and isogenic wild-type parent strain mice by CDNA macroarray analysis. J. Med. Sci., 4: 312-317.
CrossRefDirect Link - Naef, F., C.R. Hacker, N. Patil and M. Magnasco, 2002. Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays. Genome Biol., Vol. 3.
CrossRefDirect Link - Baatartsogt, O., P.K. Mandal, H.K. Lim, C.H. Lee, J.H. Lee and K. Choi, 2011. Differential expression of immuno-inflammatory genes in synovial cells from knee after inducing post-traumatic arthritis in swine. Asian J. Anim. Vet. Adv., 6: 333-343.
CrossRefDirect Link - Pavelka, N., M. Pelizzola, C. Vizzardelli, M. Capozzoli, A. Splendiani, F. Granucci and P. Ricciardi-Castagnoli, 2004. A power law global error model for the identification of differentially expressed genes in microarray data. BMC Bioinformatics, Vol. 5.
CrossRefDirect Link - Schadt, E.E., C. Li, C. Su and W.H. Wong, 2001. Analyzing high-density oligonucleotide gene expression array data. J. Cell. Biochem., 80: 192-202.
CrossRefDirect Link - Schadt, E.E., C. Li, B. Ellis and W.H. Wong, 2001. Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J. Cell. Biochem., 84: 120-125.
CrossRefPubMedDirect Link - Song, M., W. Han, H. Bao, C. Liu, C. Wu and C. Zhao, 2010. Physiological adaptability of tibet chicken embryo liver to hypoxia and its protein profile analysis. Asian J. Anim. Vet. Adv., 5: 547-556.
CrossRef - Suzuki, H., R. Ohtsuka and M. Takeda, 2011. Differences in gene expression profiles among the proximal, middle and distal peyer's patches in the mouse small intestine. Res. J. Immunol., 4: 19-24.
CrossRefDirect Link - Endrini, S., A. Rahmat, P. Ismail and Y.H. Taufiq-Yap, 2007. Comparing of the cytotoxicity properties and mechanism of Lawsonia inermis and Strobilanthes crispus extract against several cancer cell lines. J. Med. Sci., 7: 1098-1102.
CrossRefDirect Link - Weng, L., H.Y. Dai, Y.H. Zhan, Y.D. He, S.B. Stepaniants and D.E. Bassett, 2006. Rosetta error model for gene expression analysis. Bioinformatics, 22: 1111-1121.
CrossRefDirect Link - Wright, G.W. and R.M. Simon, 2003. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics, 19: 2448-2455.
CrossRefPubMedDirect Link - Wu, W.B., 2008. On false discovery control under dependence. Ann. Stat., 36: 364-380.
CrossRefDirect Link - Zhou, Y. and R. Abagyan, 2002. Match-Only Integral Distribution (MOID) algorithm for high-density oligonucleotide array analysis. BMC Bioinformatics, Vol. 3.
CrossRefDirect Link