Research Article
 

Variations of k-mean Algorithm: A Study for High-Dimensional Large Data Sets



Sanjay Garg and Ramesh Chandra Jain
 
ABSTRACT

Clustering algorithms are computationally intensive, particularly when they are used to analyze large amount of high dimensional data. Exploring faster algorithms for clustering is a vital and often encountered research problem. k-mean algorithm is a well known partition based clustering technique and so many variations of this basic algorithm are proposed by various researchers. In order to explore the strength and weaknesses an attempt has been made to compare some of the existing variations of k-mean algorithms using synthetic sets of high dimensional data as benchmark for evaluation and some criteria is also evolved for comparison of clustering algorithms.


 
  How to cite this article:

Sanjay Garg and Ramesh Chandra Jain, 2006. Variations of k-mean Algorithm: A Study for High-Dimensional Large Data Sets. Information Technology Journal, 5: 1132-1135.

DOI: 10.3923/itj.2006.1132.1135

URL: https://scialert.net/abstract/?doi=itj.2006.1132.1135

INTRODUCTION

The clustering problem is a classical problem in the database, knowledge discovery, artificial intelligence and theoretical literature and is used to find similar groups of records in very large databases. Clustering algorithms are used for similarity search, customer segmentation, pattern recognition, trend analysis and classification. Basically, the clustering problem is, for a given set of points in multidimensional space, to find a partition of the points into clusters so that the points within each cluster are similar to one another. Similarity can be determined using various distance functions (Han and Kamber, 2001). Clustering techniques are broadly classified into several categories such as partitioning techniques, hierarchical techniques, density-based techniques etc. (Pujari, 1999).

In the past three decades, cluster analysis has been extensively applied to many areas such as medicine for classification of diseases, chemistry for grouping of compounds, social studies for classification of statistical findings, and so on. Its main aim is to discover structures or clusters present in the data. While there is no universal definition of a cluster, algorithms have been developed to find several kinds of clusters: spherical, linear, drawn out, etc. Redefining the clustering problem in a different way for high-dimensional applications (Aggarwal and Yu, 2002) facilitates fast computation. Among all the available variations of the k-mean clustering algorithm, the k-means, k-medoid and h-k-mean clustering algorithms are chosen for this study. Unlike many other partitioning methods, k-means and k-medoid (Kaufman and Rousseeuw, 1990) are the most basic methods for clustering, and h-k-mean (Garg et al., 2004) is a heuristic-based hybrid model of these two algorithms.

The study introduces clustering algorithms based on partitioning and variations of the k-means algorithm, i.e., k-mean, k-medoid (PAM) and h-k-mean; presents experimental results comparing the performance of the k-means, k-medoid and h-k-mean clustering algorithms on an evolved set of criteria; and finally draws conclusions and outlines future scope in this field.

CLUSTERING ALGORITHMS BASED ON PARTITIONING

The basic concept of a partition-based clustering method is to construct a partition of a database D of n objects into a set of k clusters while minimizing an objective function. Exhaustively enumerating all possible partitions into k sets in order to find the global minimum is too expensive, so the following heuristic is used:

Choose k representatives for the clusters, e.g., randomly
Improve these initial representatives iteratively:
Assign each object to the cluster it "fits best" in the current clustering
Compute new cluster representatives based on these assignments
Repeat until the change in the objective function from one iteration to the next drops below a threshold

The most well-known and commonly used partitioning methods are k-means, k-medoid and their variants.

k-means: The algorithm first selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean. It then recomputes the new mean for each cluster.

Fig. 1: Algorithm k-mean

This process iterates until the criterion function converges. Typically the squared-error criterion is used, defined as

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} |p − mi|²

where E is the sum of the squared error over all objects, p is a point belonging to cluster Ci and mi is the mean of cluster Ci.

The k-mean algorithm (Kaufman and Rousseeuw, 1990) is described in Fig. 1.
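As an illustration of the procedure in Fig. 1, the iterative refinement and the squared-error convergence test can be sketched in Python as follows (a minimal sketch, not the authors' code; all names are our own):

```python
import random

def kmeans(points, k, tol=1e-6, max_iter=100):
    """Basic k-means: assign each point to the nearest mean, recompute the
    means, and stop when the squared-error criterion E stops improving."""
    means = random.sample(points, k)          # choose k initial centers randomly
    prev_error = float("inf")
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assign each object to nearest mean
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            clusters[i].append(p)
        for i, c in enumerate(clusters):      # recompute the mean of each cluster
            if c:
                means[i] = tuple(sum(xs) / len(c) for xs in zip(*c))
        # squared-error criterion E: sum of squared distances to cluster means
        error = sum(sum((a - b) ** 2 for a, b in zip(p, means[i]))
                    for i, c in enumerate(clusters) for p in c)
        if prev_error - error < tol:          # converged: E no longer decreases
            break
        prev_error = error
    return means, clusters
```

The stopping rule mirrors the heuristic stated earlier: iterate until the change in the objective function drops below a threshold.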

k-medoid: There are two well-known k-medoid methods, PAM and CLARA. To find k clusters, the approach of PAM (Partitioning Around Medoids) (Kaufman and Rousseeuw, 1990) is to determine a representative object for each cluster. This representative object, called a medoid, is meant to be the most centrally located object of its cluster. Once the medoids have been selected, each non-selected object is grouped with the medoid to which it is most similar. More precisely, if Oj is a non-selected object and Oi is a (selected) medoid, Oj belongs to the cluster represented by Oi if d(Oj, Oi) = min_Oe d(Oj, Oe), where min_Oe denotes the minimum over all medoids Oe and d(Oa, Ob) denotes the dissimilarity or distance between objects Oa and Ob. All the dissimilarity values are given as inputs to PAM. Finally, the quality of the chosen medoids is measured by the average dissimilarity between an object and the medoid of its cluster. Figure 2 illustrates the k-medoid (Kaufman and Rousseeuw, 1990) algorithm.

To find the k medoids, PAM begins with an arbitrary selection of k objects. Then in each step, a swap between a selected object Oi and a non-selected object Oh is made, as long as such a swap would result in an improvement of the quality of the clustering. In particular, to calculate the effect of such a swap between Oi and Oh, PAM computes a cost Cjih for every non-selected object Oj. Depending on which of the following cases Oj is in, Cjih is defined by one of the equations below.

First case: Oj currently belongs to the cluster represented by Oi. Furthermore, let Oj be more similar to Oj.2 than to Oh, i.e., d(Oj, Oh) >= d(Oj, Oj.2), where Oj.2 is the second most similar medoid to Oj. Thus, if Oi is replaced by Oh as a medoid, Oj would belong to the cluster represented by Oj.2. Hence, the cost of the swap as far as Oj is concerned is:

Cjih = d(Oj, Oj.2) − d(Oj, Oi)
(1)

This equation always gives a non-negative Cjih, indicating that there is a non-negative cost incurred in replacing Oi with Oh.

Fig. 2: Algorithm k-medoid (PAM)

Second case: Oj currently belongs to the cluster represented by Oi. But this time, Oj is less similar to Oj.2 than to Oh, i.e., d(Oj, Oh) < d(Oj, Oj.2). Then, if Oi is replaced by Oh, Oj would belong to the cluster represented by Oh. Thus, the cost for Oj is given by:

Cjih = d(Oj, Oh) − d(Oj, Oi)
(2)

Unlike in Eq. 1, Cjih here can be positive or negative, depending on whether Oj is more similar to Oi or to Oh.

Third case: Suppose that Oj currently belongs to a cluster other than the one represented by Oi. Let Oj.2 be the representative object of that cluster. Furthermore, let Oj be more similar to Oj.2 than to Oh. Then, even if Oi is replaced by Oh, Oj would stay in the cluster represented by Oj.2. Thus, the cost is:

Cjih = 0
(3)

Fourth case: Oj currently belongs to a cluster represented by Oj.2, other than the one represented by Oi. But Oj is less similar to Oj.2 than to Oh. Then replacing Oi with Oh would cause Oj to jump to the cluster of Oh from that of Oj.2. Thus, the cost is:

Cjih = d(Oj, Oh) − d(Oj, Oj.2)
(4)

and is always negative. Combining the four cases above, the total cost of replacing Oi with Oh is given by:

TCih = Σj Cjih
(5)
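The four cases and the total cost can be sketched in Python as follows (an illustrative sketch of the swap-cost computation described above, not the authors' code; objects are plain values and d(a, b) is any distance function):

```python
def swap_cost(d, medoids, Oi, Oh, Oj):
    """Cjih: change in clustering cost for non-selected object Oj when
    medoid Oi is swapped with non-selected object Oh (the four cases)."""
    rep = min(medoids, key=lambda m: d(Oj, m))        # Oj's current medoid
    if rep == Oi:
        # second most similar medoid Oj.2
        second = min((m for m in medoids if m != Oi), key=lambda m: d(Oj, m))
        if d(Oj, Oh) >= d(Oj, second):                # case 1: Oj moves to Oj.2
            return d(Oj, second) - d(Oj, Oi)
        return d(Oj, Oh) - d(Oj, Oi)                  # case 2: Oj moves to Oh
    if d(Oj, Oh) >= d(Oj, rep):                       # case 3: Oj stays put
        return 0.0
    return d(Oj, Oh) - d(Oj, rep)                     # case 4: Oj moves to Oh

def total_swap_cost(d, medoids, Oi, Oh, objects):
    """TCih: sum of Cjih over all non-selected objects Oj."""
    return sum(swap_cost(d, medoids, Oi, Oh, Oj)
               for Oj in objects if Oj not in medoids and Oj != Oh)
```

PAM accepts the swap with the most negative TCih and repeats until no swap improves the clustering.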

h-k-mean: A hybrid approach combining the k-mean and k-medoid algorithms is the h-k-mean algorithm, which can deal efficiently with the presence of noise and outliers. The heuristic followed by the h-k-mean algorithm (Garg et al., 2004) is as under:

Initially cluster centers are detected by using the strategy of k-mean algorithm in first iteration
Cluster centre means are recalculated after temporary removal of the most distant object from the cluster centre in each cluster, and for each cluster a reference point (medoid), the data point nearest to the recalculated mean, is chosen as the new cluster centre. The same process is followed for subsequent iterations until the algorithm converges.
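The centre-update step of this heuristic can be sketched in Python as follows (a paraphrase of the description above, not the authors' code; a single cluster's update is shown):

```python
import math

def hk_mean_step(cluster):
    """One h-k-mean centre update for a single cluster: temporarily drop
    the point farthest from the current mean (a possible outlier),
    recompute the mean, then return the data point nearest to that
    recalculated mean as the new representative (a medoid)."""
    mean = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    farthest = max(cluster, key=lambda p: math.dist(p, mean))   # possible outlier
    kept = [p for p in cluster if p != farthest]                # temporary removal
    new_mean = tuple(sum(xs) / len(kept) for xs in zip(*kept))
    return min(kept, key=lambda p: math.dist(p, new_mean))      # nearest data point
```

Because the representative is an actual data point chosen after discarding the farthest value, an outlier cannot drag the centre away as it can with a plain mean.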

The running time of this algorithm is a little more than that of the k-mean algorithm. This algorithm selects a reference point which is most centrally located after considering removal of a possible outlier, unlike the random reference point in the k-medoid algorithm. The h-k-mean algorithm is more robust than both the k-mean and k-medoid algorithms in the presence of noise and outliers, because the farthest values are temporarily removed while deciding the cluster-representing object.

RESULTS

The simulations were performed on synthetically generated datasets, varying their dimension from 2 to 30, their size from 1000 points to 100000 points and the number of included outliers from zero to 20.

Table 1: Summary of comparative study
#More sensitive than h-k-mean algorithm, +better than k-mean algorithm, ++close to spheres, better than k-medoid algorithm

The criteria used to study the k-mean, k-medoid and h-k-mean algorithms are as follows. For the computation of each parameter, each algorithm has been run on all the datasets (10 times each) and the average value/inference worked out. The Euclidean distance function (Han and Kamber, 2001) is used. Experimental findings are summarized in Table 1.

Average running time: Time complexity is always an important criterion for evaluation of an algorithm. Although these three algorithms have asymptotically the same time complexity, their actual running times differ.

Average distance of points in each cluster from the representing point/mean: This indicates the compactness of clusters and is one of the cluster-quality parameters.

Maximum intercluster distance: Maximum intercluster distance (Gonzalez, 1985) is also a cluster-quality parameter, and minimizing it is always an interesting issue in the theoretical as well as practical literature.

Sensitivity to outliers/noise: Outliers and noise present in data badly affect the quality of clusters, so it is always desirable that a clustering algorithm should not be sensitive to noise/outliers.

Cluster shapes: Distance-based clustering algorithms generate convex-shaped clusters. It is always important to have exactly spherical clusters for various applications, viz. indexing applications (Garg and Jain, 2005).

Average density of clusters: Most applications, e.g., clustering-based indexing for high-dimensional databases, require dense/compact clusters for better performance. The density of a cluster can be calculated by dividing the number of data points in the cluster by the area of the cluster.

Dependence on initial partitions: In distance-based clustering algorithms, initial partitions are made randomly, and the convergence of such algorithms is mostly dependent on the initial partitions, so it is always desirable to make the algorithm robust to initial partitions.

Effect of high dimensionality of data: An increase in the number of dimensions degrades the performance of a clustering algorithm. High-dimensional clustering (Garg et al., 2005) is required for various applications, viz. geographic/spatial information systems (Ng and Han, 1994), gene expression data applications, etc., so the effect of high dimensionality on the running time of the algorithms is studied.

Effect of size of dataset: Increase in the size of the dataset crushes the performance of clustering algoritlnn. Clustering of large datasets is required for various knowledge extraction applications so consequences of size of dataset are extracted out on nmning time of the algoritlnns.

DISCUSSION

Numerous variations of the most basic clustering algorithm, k-means, have been suggested by many researchers; we studied only the k-mean, k-medoid and h-k-mean algorithms for high-dimensional large datasets, i.e., the two most basic algorithms and a hybrid model thereof. This study indicates that on most of the criteria the h-k-mean algorithm outperforms the other two. A remarkable observation is that the h-k-mean algorithm provides better quality of clusters even in the presence of outliers and noise. It has also been observed that the h-k-mean algorithm is robust for large high-dimensional datasets.

The criteria used for comparison may not be sufficient; more criteria may be evolved, and a precise mathematical model is required to predict the quality of clusters. A high-performance clustering algorithm for large high-dimensional datasets with a variety of attributes is a fundamental and open-ended research area.

REFERENCES

  1. Aggarwal, C.C. and P.S. Yu, 2002. Redefining clustering for high dimensional applications. IEEE Trans. Knowledge Data Eng., 14: 210-225.


  2. Garg, S., P. Amit and R.C. Jain, 2004. H-k-mean algorithm: Integrating k-mean and k-medoid clustering algorithm. Proc. Intl. Conf. AIECT, 2: 70-76.


  3. Garg, S., A. Patel and R.C. Jain, 2005. A comparative study of clustering algorithms for high-dimensional data. J. Eng. Technol. SPU, 18: 120-125.


  4. Garg, S. and R.C. Jain, 2005. An experimental study on quality of cluster shapes for h-k-mean algorithm. ADIT J. Eng., 2: 120-125.


  5. Gonzalez, T., 1985. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci., 38: 293-306.


  6. Han, J. and M. Kamber, 2001. Data Mining: Concepts and Techniques. Morgan Kauffmann Publishers, New York, pp: 6-8


  7. Kaufman, L. and P.J. Rousseeuw, 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, ISBN: 0471878766, Pages: 342


  8. Ng, R. and J. Han, 1994. Efficient and effective clustering methods for spatial data mining. Proceedings of the 20th International Conference on Very Large Data Bases, September 12-15, 1994, San Francisco, CA., USA., pp: 144-155


  9. Pujari, A.K., 1999. Data Mining Techniques. University Press, Hyderabad

