Clustering techniques find interesting and previously unknown patterns in large-scale data embedded in a large multi dimensional space and are applied to a wide variety of problems like customer segmentation, biology, machine learning and geographical information systems. Clustering algorithms are used efficiently to scale up with the dimensionality of the data sets and the data base size. Hierarchical clustering methods in particular are widely used to find patterns in multi dimensional data. Since clustering is an unsupervised learning technique, fewer or greater numbers of clusters may be desired. A key step in the analysis of gene expression data is the identification of groups of genes that are similar in nature. The developments of micro array technologies provide a powerful tool by which the expression patterns of thousands of genes can be monitored simultaneously. In this research, we study some of the major statistical approaches in hierarchical clustering and compare the linkage methods that are used in gene expression data which can assist us to know functions of many genes for which information is not available currently.
PDF Abstract XML References Citation
How to cite this article
Clustering has been extensively studied in statistics, machine learning, pattern recognition and image processing (Kaufman, 1989). Clustering techniques find patterns previously unknown in large-scale data, embedded in a large, multi dimensional space. Efficient representation of the detected clusters is as important as cluster detection and improves its usability. Most of the earlier works in statistics operate and find clusters in the whole data space. The outputs of these algorithms are very sensitive to the input parameters. The scalability of these algorithms with the database size is important as their scalability with their dimensionality of the data sets. Noise present with data makes cluster detection harder. A wide range of techniques has been applied for clustering gene expression data (Eisen et al., 1998). Examples include hierarchical clustering, adaptive resonance theory, self-organizing map, k-means, graph-theoretic approaches and growing cell structures network. However, most of the above-mentioned clustering algorithms are heuristically motivated and the issues of determining the correct number of clusters and choosing a good clustering algorithm are not yet rigorously solved. Clustering gene expression data using hierarchical clustering and Self Organized Maps has been very popular among the bioinformatics community. Typically this will involve data processing using various statistical techniques to identify the patterns. In addition, data needs to be packaged, presented, archived and compared with other types of information.
Scope and relevance: The rapid advance of genome-scale sequencing has driven the development of methods to exploit this information by characterizing biological processes in new ways. The knowledge of the coding sequences of virtually every gene in an organism, for instance, invites development of technology to study the expression of all of them at once, because the study of gene expression of genes one by one has already provided a wealth of biological insight. To this end, a variety of techniques has evolved to monitor, rapidly and efficiently, transcript abundance for all of an organisms genes. The mass of numbers produced by these techniques, which amounts to hundreds of data points for thousands of genes, is an immense amount of biological information. In this study we address the problem of analyzing and presenting information on this genomic scale for gene expression analysis and finding out functions of genes that is not yet known.
Statistical analysis: The statistical analysis of micro array data is probably the most difficult problem associated with the use of clustering. The aim is to apply standard statistical approaches to determine gene expression, thus enabling the extraction of significant biological information from noise and variability. Statisticians are experiencing with handling data involving a limited number of variables, but a large number of samples. A number of different methods have been explored. The output of statistical analysis of a micro array experiment is usually a large data spreadsheet or dendrogram filled with numbers related to the signal intensity for each element on the chip. Further analysis is required to identify groups of elements that are similarly regulated across the biological samples under study. A variety of mathematical and statistical procedures have been developed that partition elements or samples into groups, or clusters, with maximum similarity and thus enabling the identification of elements.
The hierarchical clustering techniques create a hierarchy of clusters from small to big since clustering is an unsupervised learning technique. Hence depending on the particular application of the clustering, fewer or greater numbers of clusters may be desired. With a hierarchy of clusters defined it is possible to choose the number of clusters that are desired and required. Any clustering algorithm that ends up with as many clusters as there are records has not helped the user understand the data better. Exactly how many clusters should be formed is a matter of interpretation. The advantage of hierarchical clustering methods is that they allow the end user to choose from either many clusters or only a few. Hierarchical clustering has the advantage over non-hierarchical techniques in that the clusters are defined solely by the data (not by the users predetermining the number of clusters) and that the number of clusters can be increased or decreased by simple moving up and down the hierarchy (Ranjan, 2005).
All the hierarchical methods often work either in a top down manner, (by repeatedly partitioning the set of elements) or in a bottom up manner. More information on clustering techniques is available (Hartigan, 1975; Dopazo, 2002). Hierarchical clustering methods are represented usually by a dendrogram. We focus on top down manner, which is called agglomeration. In agglomerative clustering technique we start with as many clusters where each cluster contains just one record. The clusters that are nearest to each other are merged together to form the next largest cluster. This merging is continued until a hierarchy of clusters is built with just a single cluster containing all the records at the top of the hierarchy. Several agglomerative procedures are probably most widely used in hierarchical clustering. They produce a series of partitions of the data: the first consists of n single-member clusters; the last consists of a single group containing all n individuals (Everitt et al., 2001). At each stage the methods fuse individuals or groups of individuals which are closest or most similar (single linkage). Differences between the methods arise because of the different ways of defining the distance (or similarity) between an individual and a group containing several individuals or between two groups of individuals (Everitt et al., 2001). The standard agglomerative hierarchical clustering methods are Single Linkage, Complete Linkage, Average Linkage, Centroid Linkage, Median Linkage and Wards Method. A detailed description of these methods can be found in (Everitt et al., 2001; Kaufman, 1989). Researchers and scientists have developed several clustering software packages using agglomerative methods for gene expression data.
Background: The most frequently used analysis on gene expression data is clustering. It is an exploratory technique for gene expression data as it groups similar objects together and allows the biologist to identify meaningful relationships between the objects. There are numerous algorithms and programs associated with clustering like hierarchical methods, self-organized maps, k-means and model-based approaches (Yeung et al., 2003). In this research, we focus on hierarchical agglomerative clustering. Many similarity measures can be applied to calculate the similarity and dissimilarity between a pair of objects. We have chosen the most popular Euclidian distance and correlation co-efficient. High correlation implies high similarity (Yeung et al., 2003). Euclidian distance is a dissimilarity measure, high distance means low similarity. Most clustering algorithms uses similarity matrix as input and produces group of objects similar to each other as output. The output of hierarchical algorithms is usually in form of dendrogram. The cluster similarity can be computer from the similarity matrix or on the data matrix (Shannon et al., 2003). The Euclidean (and squared Euclidean) distances are usually computed from raw data and not from standardized data.
MICRO ARRAY DATA
Micro arrays exploit the preferential binding of complementary single stranded nucleic acid sequences. A micro array is typically a glass slide, onto which DNA molecules are attached at fixed locations. There may be tens of thousands of spots on an array each containing a huge number of identical DNA molecules. Micro array technology makes use of the sequences resources created by the genome projects and other sequencing efforts (Ranjan and Ahson, 2004a). For instance, they allow comparisons of gene expression between normal and diseased (cancerous) cells. There are several names for this technology-DNA micro arrays, DNA arrays, DNA chips and gene chips. The technology is very new; methodologies are still evolving. Micro array techniques find utility in Gene discovery, Gene regulation, Drug discovery and toxicology, Diagnosis and identification of patterns of gene expression that define disease states and they may represent prognostic indicators (Ranjan and Ahson, 2004b). The micro array technology is rapidly developing there fore it is natural that currently there are no established standards for micro array experiments and how the raw data can be processed. The repository for gene expression data is being developed at NCBI-National Center for Biotechnology Information in US. Array Express is storing at all the information, the details of which is called Minimum Information About A Micro array Experiment (MIAME) defined by the Micro array Gene Expression Database (MGED) consortium. We have taken-Acute Myeloid Leukemia (AML) and Lymphoblastic Leukemia (ALL) data at the cancer genomic center http://www.broad.mit.edu. We have also collected data sets from http://sdmc.lit.org.sg/ GEDatasets/Datasets.html#BreastCancer,http://www. genomebiology.com/2003/4/5/r34/suppl/,http://www.columbia.edu/~xy56/project.html and http://www.expression. washington.edu/public. In recent years there has been an explosion in the rate of acquisition of bio medical data (Shapiro and Tamayo, 2003). The major challenges in micro array data mining includes gene selection, classification of deceases and predicting outcomes and finding new biological classes or refining the existing ones (Shapiro and Tamayo, 2003).
RESULTS AND DISCUSSION
Microarray-based genomic surveys and other high-throughput approaches (ranging from genomics to combinatorial chemistry) are becoming increasingly important in biology and chemistry. As a result, we need to develop our ability to see the information in the massive tables of quantitative measurements that these approaches produce. Our approach to this problem can be generalized as follows. First, we use a common-sense approach to organize the data, based on order inherent in the data. Next, recognizing that the rate-limiting step in exploring and searching large tables of numerical data is a trivial one: reading the numbers (human brains are not well adapted to assimilating quantitative data by reading digits), we represent the quantitative values in the Table by using a naturalistic color scale rather than numbers. This alternative encoding preserves all the quantitative information, but transmits it to our brains by way of a much higher-bandwidth channel than the number-reading channel.
A natural way of viewing complex data sets is first to scan and survey the large-scale features and then to focus in on the interesting details. What we have found to be the most valuable feature of the approach described here is that it allows this natural and intuitive process to be applied to genomic data sets. The approach is a general one, with no inherent specificity to the particular method used to acquire data or even to gene-expression data. It is therefore likely that very similar approaches may be applied to many other kinds of very large data sets. In each case, it may be necessary to find alternative algorithms and computation methods to bring out inherent structures in the data and, equally important, to find dense naturalistic visual representations that convey the quantitative information effectively. We recognize that the particular clustering algorithm we used is not the only, or even the best, method available. We have used and are actively exploring alternatives such as parametric ordering of genes and supervised clustering methods based on representative hand-picked or computer-generated expression profiles (Eisen et al., 1998; Chu et al., 1998). The success of these very simple approaches has given us confidence to face the coming flood of functional genomic data.
We have implemented several agglomerative approaches and performed the comparison study by evaluating the clustering results using the data sets. The data are arranged in Table 1 where rows represent all genes and the columns represent the individual expression values obtained in each DNA array.
|Table 1:||Agglomeration schedule of clusters using median linkage|
|Fig. 1:||Dendrogram using median linkage method|
We studied the performance of agglomerative algorithms on our data sets. For assessing the cluster quality, we obtained desired number of clusters from the dendrogram by cutting the merging process. For efficient visualization of the patterns extraction from the micro array data sets, we used different tools like HCE (http://www.cs.umd.edu/hcil/hce), GEPAS-Gene Expression Pattern Analysis Suite (Dopazo, 2002), European Bio-Informatics Institutes Expression Profiler (Kapushesky et al., 2004), Clusfavor software (http://www.clusfavor.com), SPSS (http://www.spss.com), S-Plus (http://www.insightful.com/products/splus/default.asp) and Eisens Tree view (Eisen et al., 1998). We are interested in comparing the same type of data on agglomerative different methods.
We found our quantitative measures of cluster quality to be positively correlated with external standards of cluster quality. Once we have our agglomerative hierarchy of clustering, the best clustering can be chosen via domain-dependent preferences. One general method is by looking at the dendrogram for clusters and pick the clustering when all (or most) such clusters have formed as the natural clustering; other less subjective methods also exist, which use heuristics such as ensuring that the within-cluster variance is less than the between-clusters variance.
Hierarchical clustering has the advantage that the clusters are defined solely by the data (not by the users predetermining the number of clusters) and that the number of clusters can be increased or decreased by simple moving up and down the hierarchy (Kaufman, 1989).
We evaluated the performance of the various clustering methods to obtain hierarchical solutions using the available clustering tools from S-Plus, Eisens Cluster, European Bioinformatics Institutes expression Profiler, GEPAS-Gene Expression Pattern Analysis Suite, Hierarchical Clustering Explorer and Rousseau and Kaufmans Web clusters for single linkage, complete linkage and average linkage methods. The first step we did in analyzing micro array data is to filter out genes that are not expressed or do not show variation across samples. This usually reduces the data set by 3000-5000 genes. We have filtered genes using Eisens Cluster (Eisen et al., 1998).
We preprocessed the patterns by transforming the scale, handling the missing values, removing of flat gene expression patterns and the standardizing the remaining patterns. How ever these steps are optional and depend on the actual data set. All the algorithms to a certain extent impose their own structure of visualization. Of course there is no single best clustering procedure. Single linkage methods optimality is the oldest and simplest among all. Its useful property is that monotone transformations on the dissimilarities do not change the clustering (Han and Kamber, 2000).
Its nearest neighborhood property even works for large sets of data. Figure 1 shows dendrogram for median linkage and represents a hierarchical agglomerative clustering of classes and genes derived using Squared Euclidian Distance.
|Dendrogram pertaining to breast cancer data|
|Visualization for the complete linkage method|
The results of the median linkage method produced agglomerative clusters as in Table 1. We found that cluster centers are less sensitive to outliers. The median link method is invariant under monotone transformation of the distances like single linkage and complete linkage.(where as in average linkage, centroid methods the results show to be variant).
|Fig. 4:||Correlation coefficient between genes, the pointers representing blue colors consist of genes against the experiments|
|Fig. 5:||Hierarchical clustering results from expression profiler using UPGMA|
It was difficult to spot a structure in a data set by merely looking at its dissimilarity matrix. The complete linkage method produced compact clusters since they had small diameter. We tried to investigate the effectiveness of clustering gene expression data. Since genes are clustered, the experimental conditions are the variables.
We have investigated the linkage algorithms that would detect the number of clusters in the gene expression data we used for the experiments. Prior to clustering, the data were normalized to have zero mean and unit variance. As described before, the number of clusters at which the tree-based index reaches its highest peak can be used as an indicator of optimal number of clusters (Fig. 2). The results Expression Profiler (Kapushesky et al., 2004) indicated the distance of 2763.090 when the cluster size is 11. To evaluate the consistency of the clustering, the number of common genes in two clusters for different number of initial nodes was investigated. The results, which are represented in Fig. 3 and 4, describe the number of genes obtained using UPGMA and the correlation between the genes. The results using UPGMA indicated robustness in cluster quality (Fig. 5).
The distance matrix is recalculated to include the distance between genes not clustered and the new cluster formed by the two genes. Many algorithms follow the series of merging of number of genes to produce hierarchical clustering of data. Variations between the algorithms can lead to different dendrograms and hence different clusters. It should always be noted that different authors define average clustering in different ways. Average operates by iteratively merging the genes or gene clusters with the smallest distance between them followed by an updating of the distance matrix. Single linkage calculates the distances between each gene in the new cluster to each of the genes in another cluster and takes the smallest distance. Complete linkage uses the largest distance of all these distances as the distance between the clusters. We have found the average linkage algorithm generally works well with standardized micro array data in contrast to single linkage that generally performs poorly.
The statistical analysis of micro array data is probably the most difficult problem associated with the use of these techniques. We have applied standard statistical agglomerative approaches to determine gene expression and gene expression alteration, thus enabling the extraction of significant biological information. Statisticians are experienced with a large number of samples instead of variables. Micro arrays produces thousands of variables from a small number of samples to handle the large samples. We have investigated the linkage algorithms that would detect the number of clusters in the gene expression data we used for the experiments. The linkage methods are usually considered together because of their easy implementation with the same computer program. In this study, DNA micro array data from samples of primary breast tumors were analyzed using statistical clustering analysis to evaluate the patterns of interactions of groups of genes. Through Hierarchical clustering we compared the gene expression data. The methods used are very general and applies to any data type providing that they can be coded as a series of numbers and that a computable measure of similarity between data items can be used. These improvements of neighborhoods correlate well with the improvements of overall clustering solutions, which suggests that constrained agglomerative schemes benefit from starting with purer neighborhoods and hence lead to clustering solutions with better quality.
- Chu, S., J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P.O. Brown and I. Herskowitz, 1998. The transcriptional program of sporulation in budding yeast. Science, 282: 699-705.
- Kapushesky, K., P. Kemmeren, A.C. Culhane, S. Durinck and J. Ihmels et al., 2004. Expression Profiler: Next generation-an online platform for analysis of microarray data. Nucleic Acids Res., 32: W465-W470.
- Shannon, W., R. Culverhouse and J. Duncan, 2003. Analyzing micro array data using cluster analysis. Pharmacogenomics, 4: 41-51.
- Yeung, K.Y., M. Medvedovic and R.E. Bumgarner, 2003. Clustering gene expression data with repeated measurements. Genome Biol., 4: R34-R34.