
Research Article


A Comparative Study of Clustering Algorithms 

Satchidanandan Dehuri,
Chinmay Mohapatra,
Ashish Ghosh
and
Rajib Mall



ABSTRACT

Data clustering is an unsupervised task that can generate clusters of different shapes for a particular data set. Hence, choosing an algorithm for a particular type of data set is a difficult problem. This study addresses the choice of an appropriate clustering algorithm through a comparative study of three representative techniques, K-means, Kohonen's Self-Organizing Map (SOM) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), based on extensive simulation studies. The comparison is performed on the basis of the cluster quality index β, the percentage of samples correctly classified and CPU time. The experimental results show that if the clusters are of arbitrary shape, a density-based clustering algorithm such as DBSCAN is preferable, whereas if the clusters are hyperspherical or convex in shape and well separated, then SOM or K-means is preferable.





INTRODUCTION

Recently there has been an enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions and web logs (Fayyad et al., 1996). Why is so much data stored? People store data because they believe valuable assets are implicitly coded within it, but raw data is rarely of direct benefit. Discovering groups of similar data items in a data set, known as cluster analysis, is an interesting and challenging problem. Many data clustering algorithms have emerged from disciplines such as statistics, pattern recognition and machine learning (Jain et al., 1999). They produce different types of clusters for a particular type of data set and some require additional prior knowledge about the nature of the data. Each clustering algorithm works well only for certain types of data and certain applications, so the choice of an appropriate algorithm can be made from a comparative study of clustering algorithms. Such a comparison is also important because clustering is widely used in many areas, such as information retrieval and text mining (Cutting et al., 1992; Steinbach et al., 2000; Dhillon et al., 2001), spatial database applications, e.g., Geographical Information Systems (GIS) or astronomical data (Xu et al., 1998; Sander et al., 1998; Ester et al., 2000), sequence and heterogeneous data analysis (Cadez et al., 2001), web applications (Heer and Chi, 2001; Foss et al., 2001), DNA analysis in computational biology (Ben-Dor and Yakhini, 1999) and many others.
The clustering problem is that of dividing a given set {x_{1},..., x_{N}} of N data points into several non-overlapping homogeneous groups. Each such group, or cluster, should contain similar data items and data items from different groups should not be similar. A k-clustering of a data set X is a partition of X into k clusters C_{1},...,C_{k} such that the following three conditions are met:

i) C_{i} ≠ φ, i = 1,...,k
ii) ∪^{k}_{i=1} C_{i} = X (except outliers)
iii) C_{i} ∩ C_{j} = φ, i ≠ j, i, j = 1,...,k (sometimes violated)
The various data clustering algorithms can be classified into partitional, density-based and artificial neural network-based approaches. In this work, three representative algorithms, K-means (Hartigan and Wong, 1979; Hartigan, 1975), DBSCAN (Ester et al., 1996) and the Kohonen Self-Organizing Map (SOM) (Kohonen, 1990), corresponding respectively to the partitional, density-based and artificial neural network categories, are taken for the comparative study.
The comparison is performed on the basis of the cluster quality index β, the percentage of samples correctly clustered and CPU time. Mathematically, β is defined as

β = [Σ^{k}_{i=1} Σ^{n_i}_{j=1} (X_{ij} − X̄)^{T}(X_{ij} − X̄)] / [Σ^{k}_{i=1} Σ^{n_i}_{j=1} (X_{ij} − X̄_{i})^{T}(X_{ij} − X̄_{i})]

where n_{i} is the number of points in the ith cluster (i = 1,...,k), X_{ij} is the feature vector of the jth pattern (j = 1,...,n_{i}) in cluster i, X̄_{i} is the mean of the n_{i} patterns of the ith cluster, n is the total number of patterns and X̄ is the mean of the entire set of patterns. β is the ratio of the total variation to the within-cluster variation. For a given data set and number of clusters k, the higher the homogeneity within the clustered regions, the higher the β value.
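As a concrete illustration, the β index can be computed directly from this definition. The sketch below is plain Python, not the authors' code; the function name `beta_index` is ours and squared Euclidean distances are assumed for the variation terms.

```python
def beta_index(clusters):
    """Cluster quality index beta: total variation divided by
    within-cluster variation (squared Euclidean distances).
    `clusters` is a list of clusters, each a list of feature tuples."""
    all_points = [x for c in clusters for x in c]
    dim = len(all_points[0])

    def mean(pts):
        return tuple(sum(p[m] for p in pts) / len(pts) for m in range(dim))

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    grand_mean = mean(all_points)
    total = sum(sq_dist(p, grand_mean) for p in all_points)
    within = 0.0
    for c in clusters:
        m = mean(c)
        within += sum(sq_dist(p, m) for p in c)
    return total / within

# Two tight, well-separated clusters give a high beta value.
tight = [[(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)],
         [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]]
b = beta_index(tight)
```

Note that grouping all points into a single cluster makes the within-cluster variation equal the total variation, so β = 1; homogeneous, well-separated clusters drive β well above 1.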
CLUSTERING ALGORITHMS

Partitional clustering produces clusters by optimising a criterion function defined either locally or globally. The user must have prior knowledge of the number of clusters. The algorithm is typically run multiple times with different starting points and the best configuration obtained over all runs is used as the output clustering. We now discuss each of the clustering algorithms in turn.
K-means: The K-means algorithm (Hartigan and Wong, 1979; Hartigan, 1975) is the most popular clustering tool used in scientific and industrial applications. The name comes from representing each of the k clusters C_{j} by the mean (or weighted average) c_{j} of its points, the so-called centroid. While this obviously does not work well with categorical attributes, it makes good geometric and statistical sense for numerical attributes. The sum of discrepancies between each point and its centroid, expressed through an appropriate distance measure, is used as the criterion function. For example, the L_{2}-norm based objective function, the sum of squared errors between the points and the corresponding centroids,

E = Σ^{k}_{j=1} Σ_{x∈C_{j}} ||x − c_{j}||^{2},

is equal to the total intra-cluster variance, where k is the number of clusters and c_{j} is the centroid of cluster C_{j}.
The sum of squared errors (SSE) can be rationalised as (the negative of) the log-likelihood of a normally distributed mixture model and is widely used in statistics; the K-means algorithm can therefore be derived from a general probabilistic framework. The inter-cluster variance is measured as

Σ^{k}_{j=1} n_{j} ||c_{j} − x̄||^{2},

where k is the number of clusters, n_{j} is the number of points in cluster j and x̄ is the mean of the entire data set. Two versions of the K-means iterative optimisation are known. The first version is known as Forgy's algorithm (Forgey, 1965) and consists of the following steps:

1. Choose the number of clusters, k.
2. Set the initial cluster centres c_{1}, c_{2},..., c_{k} to k arbitrarily selected vectors from the data set.
3. Classify each vector x_{l} = [x_{l1}, x_{l2},..., x_{ld}] (d is the dimension of the input vectors) into the cluster with the closest centre c_{i}, using the Euclidean distance measure:

i = arg min_{j} ||x_{l} − c_{j}||, j = 1,...,k.

4. Recompute the estimates of the cluster centres c_{i}. Let c_{i} = [c_{i1}, c_{i2},..., c_{id}]; component c_{im} is computed by

c_{im} = (1/n_{i}) Σ_{x_{l} ∈ cluster i} x_{lm},

where n_{i} is the number of vectors in the ith cluster.
5. If none of the cluster centres (c_{i}, i = 1, 2,...,k) changed in step 4, stop; otherwise go to step 3.
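The five steps above can be sketched in plain Python. This is an illustrative implementation, not the authors' code; the function name `kmeans` and the toy data are ours.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Forgy-style K-means on a list of d-dimensional tuples."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)              # step 2: k arbitrary vectors
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign each vector to the closest centre (Euclidean).
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(x, centres[j])))
            clusters[i].append(x)
        # Step 4: recompute each centre as the mean of its cluster.
        new_centres = []
        for i, c in enumerate(clusters):
            if c:
                d = len(c[0])
                new_centres.append(tuple(sum(p[m] for p in c) / len(c)
                                         for m in range(d)))
            else:
                new_centres.append(centres[i])   # keep a centre that lost all points
        # Step 5: stop when no centre changes.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.2),
       (5.0, 5.0), (5.2, 5.1), (5.1, 5.2)]
centres, clusters = kmeans(pts, 2)
```

On these two well-separated groups the centres converge to the group means in a couple of iterations, regardless of which points are drawn as the initial centres.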
The second version (classic in iterative optimisation) reassigns points based on a more detailed analysis of the effect on the objective function of moving a point from its current cluster to a potentially new one. If a move has a positive effect, the point is relocated and the two centroids are recomputed. It is not obvious that this version is computationally feasible, because the outlined analysis requires an inner loop over all member points of the clusters affected by centroid shifts. However, in the L_{2} case it is known (Duda and Hart, 1973) that all computations can be algebraically reduced to computing a single distance, so in this case both versions have the same computational complexity.
The wide popularity of the K-means algorithm is well deserved: it is simple, straightforward and based on the firm foundation of analysis of variance. However, K-means also suffers from the usual drawbacks: i) the result depends strongly on the initial guess of centroids (or assignments), ii) it is not obvious what a good k is, iii) the process is sensitive to outliers, iv) the algorithm lacks scalability, v) only numerical attributes are covered and vi) resulting clusters can be unbalanced.
Kohonen’s Self-Organizing Map (SOM): The Kohonen self-organizing map (Kohonen, 1990) is an unsupervised, competitive-learning clustering network in which only one neuron (or only one neuron in a group) is “on” at a time. Self-organizing maps (SOMs) can be used as a data visualisation technique because they reduce the dimensionality of data through self-organizing neural networks while preserving its topological structure. The problem that data visualisation attempts to solve is that humans simply cannot visualise high-dimensional data directly, so techniques are needed to help us understand it. SOMs reduce dimensionality by producing a map, usually of one or two dimensions, which plots the similarities of the data by grouping similar data items together. The plot is drawn by giving each output neuron a grey-scale colour proportional to the average distance between its weight vector and the weight vectors of its immediate neighbours. Thus SOMs accomplish two things: they reduce dimensionality and display similarities.
Basic architecture: Figure 1 shows the architecture of the Kohonen SOM. It contains a single layer of neurons in addition to an input layer of branching nodes. There are k neurons in the neural layer and each has a parametric weight vector w_{k} of dimension n, the same as the dimension of the input feature vectors. The weight vectors are randomly initialised in the feature space at the beginning. An exemplar input vector x is selected from the sample and presented to the network and the squared Euclidean distances between x and each weight vector are computed. The minimum distance then determines the neuron that wins over the other neurons. In this network the weight vectors can be updated by three different strategies.
• In the first strategy, called the winner-take-all strategy, only the winning neuron updates its parametric weight vector; all other neurons keep their old values.
• In the second strategy, all neurons close to the winning neuron are updated positively (reinforced, or rewarded) and all neurons farther away from the winner are updated negatively.
• In the third strategy, when a vector x is presented to the Kohonen network, the dot product of x with each weight vector is computed as the output of each of the k neurons.

Fig. 1: 
Kohonen selforganizing map 

Fig. 3: 
2D rectangular grid topology 
Neighbourhoods of a unit designated by # with radii R = 2, 1 and 0 in a one-dimensional topology with 5 cluster units are shown in Fig. 2. A neighbourhood for a 2-D planar rectangular grid topology is shown in Fig. 3.
Algorithm SOM
1. Fix the number of output neurons k, each of which represents a cluster.
2. Initialise the weight vector of each output neuron with i) random values, or ii) the first k input vectors.
3. Set the topological neighbourhood parameters and the learning rate parameter α.
4. While the stopping condition is false
5.   For each input x
6.     Update the weight vector w_{j} of the closest output neuron and its neighbours as follows: w_{ji}(new) = w_{ji}(old) + α[x_{i} − w_{ji}(old)], i = 1,..., n.
7.   End for
8.   Reduce the learning rate.
9.   Reduce the radius of the neighbourhood at specified intervals.
10. End while
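The loop above can be sketched in plain Python for a one-dimensional map. This is an illustrative simplification, not the authors' code: the name `train_som`, the linear decay of α and radius and the toy data are all ours, and initialisation option ii (the first k input vectors) is used so the run is deterministic.

```python
def train_som(data, k, epochs=100, alpha0=0.5, radius0=1):
    """Sketch of the SOM algorithm above for a 1-D map of k output
    neurons, with decaying learning rate and neighbourhood radius."""
    w = [list(data[i]) for i in range(k)]                 # step 2, option ii
    for epoch in range(epochs):
        alpha = alpha0 * (1.0 - epoch / epochs)           # step 8: reduce rate
        radius = round(radius0 * (1.0 - epoch / epochs))  # step 9: shrink radius
        for x in data:
            # Winner: the output neuron with the closest weight vector.
            j = min(range(k),
                    key=lambda m: sum((xi - wi) ** 2
                                      for xi, wi in zip(x, w[m])))
            # Step 6: update the winner and its 1-D grid neighbours.
            for m in range(max(0, j - radius), min(k, j + radius + 1)):
                w[m] = [wm + alpha * (xi - wm) for xi, wm in zip(x, w[m])]
    return w

data = [(0.0, 0.0), (0.1, 0.1), (0.9, 0.9), (1.0, 1.0)]
weights = train_som(data, k=2, radius0=0)    # radius 0 = winner-take-all
```

With a radius of zero this reduces to the winner-take-all strategy: each weight vector settles near the mean of the inputs it wins, here one neuron per cluster.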
The learning process has two separate but related phases: the ordering phase and the convergence phase. During the ordering phase, the learning rate should be set close to unity and then gradually decreased, but not allowed to go below a certain threshold (usually 0.1). It is during this phase that the topological ordering of the weight vectors is carried out. The convergence phase is generally the longest part of the network learning (typically 80% of the epochs). The remaining iterations are used in this phase for fine adjustment of the map, i.e., of the weight vectors. Here the learning rate should keep very small values for a long time; typically it is slowly reduced from 0.1 to zero. The neighbourhood radius is kept relatively large at the beginning of training and then shrunk monotonically to zero over the epochs. The learning rule drags the weight vector associated with the winning unit towards the input vector and also drags the weights of the closest units (i.e., those within its neighbourhood radius) along with it. We can imagine the SOM as an elastic net in the input space that tries to come as close as possible to the inputs of the network; the net has the topology of the output array and its points can be thought of as having the weights as coordinates.

DBSCAN: In Euclidean space a set can be divided into a number of connected components. Implementing such a partition of a set of points requires the concepts of density, connectivity and boundary. A cluster, defined as a connected dense component, grows in any direction that density leads. Density-based algorithms are therefore capable of discovering clusters of arbitrary shapes and this also provides natural protection against outliers.
Figure 4 shows some cluster shapes that present a problem for partition-relocation clustering (e.g., K-means) but are handled properly by density-based algorithms, which also have good scalability.

Fig. 4: 
Irregular shapes of clusters 
The algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) (Ester et al., 1996), targeting low-dimensional spatial data, is the major representative of density-based connectivity clustering. Other algorithms in this category include GDBSCAN (Sander et al., 1998) and DBCLASD (Ester et al., 2000). The basic ideas of density-based clustering involve a number of definitions, which we present intuitively before giving a high-level algorithm.
• The ε-neighbourhood of a point p, denoted N_{ε}(p), is defined as N_{ε}(p) = {q ∈ D | d(p, q) ≤ ε}.
• There are two kinds of points in a cluster: points inside the cluster (core points) and points on the border of the cluster (border points). In general, the ε-neighbourhood of a border point contains significantly fewer points than the ε-neighbourhood of a core point.
• A point p is directly density-reachable from a point q with respect to ε and MinPts if
  i. p ∈ N_{ε}(q) and
  ii. |N_{ε}(q)| ≥ MinPts (core point condition).
• A point p is density-reachable from a point q with respect to ε and MinPts if there is a chain of points p_{1},...,p_{n}, with p_{1} = q and p_{n} = p, such that p_{i+1} is directly density-reachable from p_{i}.
• A point p is density-connected to a point q with respect to ε and MinPts if there is a point o such that both p and q are density-reachable from o with respect to ε and MinPts.
To find a cluster, DBSCAN starts with an arbitrary point p and retrieves all points density-reachable from p with respect to ε and MinPts. If p is a core point, this procedure yields a cluster. If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the data set.
Algorithm DBSCAN
Input: i) a data set SetOfPoints whose points are initially UNCLASSIFIED; ii) the global density parameters ε and MinPts.
Output: dense clusters and noise points.

1.  ClusterId = 1;
2.  For i = 1 to SetOfPoints.size do
3.    Point = SetOfPoints.get(i);
4.    If Point.Clid = UNCLASSIFIED then
5.      If ExpandCluster(SetOfPoints, Point, ClusterId, ε, MinPts) then
6.        ClusterId = ClusterId + 1;
7.      End If
8.    End If
9.  End For

ExpandCluster(SetOfPoints, Point, Clid, ε, MinPts)
1.  seeds = SetOfPoints.regionQuery(Point, ε);
2.  If seeds.size < MinPts then
3.    SetOfPoints.changeClid(Point, NOISE);
4.    Return FALSE;
5.  Else
6.    SetOfPoints.changeClid(seeds, Clid);
7.    seeds.delete(Point);
8.    While seeds ≠ Empty do
9.      currentP = seeds.first();
10.     result = SetOfPoints.regionQuery(currentP, ε);
11.     If result.size ≥ MinPts then
12.       For i = 1 to result.size do
13.         resultP = result.get(i);
14.         If resultP.Clid in {UNCLASSIFIED, NOISE} then
15.           If resultP.Clid = UNCLASSIFIED then
16.             seeds.append(resultP);
17.           End If
18.           SetOfPoints.changeClid(resultP, Clid);
19.         End If
20.       End For
21.     End If
22.     seeds.delete(currentP);
23.   End While
24.   Return TRUE;
25. End If
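The pseudocode above can be turned into a short runnable sketch. This is an illustrative plain-Python version, not the original implementation; the name `dbscan` is ours, cluster ids start at 0 here and -1 marks noise.

```python
def dbscan(points, eps, min_pts):
    """Plain-Python sketch of the DBSCAN pseudocode above."""
    UNCLASSIFIED, NOISE = None, -1
    labels = [UNCLASSIFIED] * len(points)

    def region_query(i):
        # Indices of all points within eps of points[i] (including i).
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNCLASSIFIED:
            continue
        seeds = region_query(i)
        if len(seeds) < min_pts:              # i is not a core point
            labels[i] = NOISE
            continue
        for j in seeds:                       # start a new cluster from i
            labels[j] = cluster_id
        seeds = [j for j in seeds if j != i]
        while seeds:                          # expand the cluster
            current = seeds.pop(0)
            result = region_query(current)
            if len(result) >= min_pts:        # current is also a core point
                for j in result:
                    if labels[j] is UNCLASSIFIED or labels[j] == NOISE:
                        if labels[j] is UNCLASSIFIED:
                            seeds.append(j)
                        labels[j] = cluster_id
        cluster_id += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

On this toy data the two dense groups become clusters 0 and 1, while the isolated point (50, 50) is labelled noise, exactly as the border/core definitions predict.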
EXPERIMENTAL STUDIES
Description of the data sets: Experimental studies are performed on two artificially created data sets, Art_data_1 and Art_data_2, and one real-life data set, IRIS (Blake and Merz, 1998).

Art_data_1: This data set contains two non-convex clusters in 2-D. Cluster 1 contains 7534 samples and cluster 2 contains 7346 samples; each sample has two features. The data set contains two non-overlapping clusters. Figure 5 shows the distribution of the data set.

Art_data_2: This data set contains two linearly separable clusters of 500 samples each. Each sample is a vector of three numeric feature values, i.e., all samples are in 3-D. The clusters are cube-shaped, one centred at (4, 4, 4) with sides of length 1 and the other centred at (6.5, 6.5, 6.5) with sides of length 1.5. The two clusters are fairly close to each other but do not overlap.

Fig. 5: 
Art_data_1 artificial data set 

Fig. 6: 
Art_data_2 artificial data set 
They are also of considerably different densities. Art_data_2 is shown in Fig. 6.

Iris data set: The Iris data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. There are 4 attributes: sepal length, sepal width, petal length and petal width. One class is linearly separable from the other two; the latter are not linearly separable from each other. To visualise the IRIS data set, the 4 attributes have been plotted as 2 triplets, as shown in Fig. 7.
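A data set matching the description of Art_data_2 can be regenerated as follows. This is a hypothetical generator consistent with the stated centres, side lengths and sample counts, not the authors' original data; the function name `make_art_data_2` and the uniform sampling inside each cube are our assumptions.

```python
import random

def make_art_data_2(seed=0):
    """Two linearly separable cube-shaped clusters in 3-D, 500 samples
    each: centre (4, 4, 4) with side 1 and centre (6.5, 6.5, 6.5)
    with side 1.5, per the Art_data_2 description."""
    rng = random.Random(seed)

    def cube(centre, side, n):
        # Uniform samples inside an axis-aligned cube.
        return [tuple(c + rng.uniform(-side / 2.0, side / 2.0)
                      for c in centre) for _ in range(n)]

    return cube((4.0, 4.0, 4.0), 1.0, 500) + cube((6.5, 6.5, 6.5), 1.5, 500)

data = make_art_data_2()
```

Because the first cube spans [3.5, 4.5] per axis and the second spans [5.75, 7.25], the two clusters are close but never overlap, as stated above.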
RESULTS

The percentage of samples correctly clustered is calculated using the true class labels of the samples as given in the data set. Each discovered cluster is assigned the class label of the class to which the majority of its samples belong; the percentage of samples whose cluster label matches their actual class is then calculated. The CPU times were obtained on a Pentium III 500 MHz workstation with 256 MB RAM.

K-means: Table 1 gives the results of the K-means algorithm on all three data sets using 1000 iterations.
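The majority-label accuracy measure just described can be computed as follows; this is an illustrative sketch and the function name `clustering_accuracy` is ours.

```python
from collections import Counter

def clustering_accuracy(cluster_labels, true_labels):
    """Percentage of samples correctly clustered, as defined above:
    each discovered cluster takes the majority true class of its
    members, then samples matching their cluster's class are counted."""
    correct = 0
    for c in set(cluster_labels):
        members = [t for cl, t in zip(cluster_labels, true_labels) if cl == c]
        majority = Counter(members).most_common(1)[0][0]
        correct += sum(1 for t in members if t == majority)
    return 100.0 * correct / len(true_labels)

# One pure cluster and one cluster with a single misplaced sample:
# 5 of 6 samples match their cluster's majority class.
acc = clustering_accuracy([0, 0, 0, 1, 1, 1], ['a', 'a', 'a', 'b', 'b', 'a'])
```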
Self-organizing feature map: Figures 8-10 show the topographical maps of a 30-by-30 rectangular grid of output neurons obtained from the three data sets. In this method, all weight vectors are initially set to the first sample of the data set and the weights are adjusted in successive iterations. Table 2 gives the results of the SOM algorithm on all three data sets using 1000 iterations.
DBSCAN: The sorted k-dist graphs for the three data sets are shown in Fig. 11-13. Based on the estimates made from the sorted k-dist graphs and some trial-and-error search around those values, fairly good density parameter values were chosen for each data set. The results of clustering by DBSCAN using those parameter values are shown in Table 3.
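The sorted k-dist graph used for this estimate is simple to compute: for each point, take the distance to its k-th nearest neighbour and sort the values in descending order; the "knee" of the curve suggests a value for ε. The sketch below is our illustration, not the authors' code.

```python
def sorted_kdist(points, k=4):
    """For every point, the Euclidean distance to its k-th nearest
    neighbour, sorted in descending order (the sorted k-dist graph)."""
    kdists = []
    for i, p in enumerate(points):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for j, q in enumerate(points) if j != i)
        kdists.append(dists[k - 1])      # distance to k-th nearest neighbour
    return sorted(kdists, reverse=True)

# Four points in a unit square plus one far-away point: the outlier
# shows up as a spike at the start of the sorted 2-dist graph.
graph = sorted_kdist([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], k=2)
```

Choosing ε just below the spike (here, anywhere above 1 and below the outlier's k-dist) would separate the dense square from the noise point.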
Comparative analysis: Table 4 shows the comparative performance of the three clustering algorithms, K-means, SOM and DBSCAN, on each of the three data sets. DBSCAN discovered the clusters in the Art_data_1 data set with good accuracy, while K-means and SOM were unable to cluster it properly. This shows that DBSCAN can detect non-convex clusters while K-means and SOM cannot. However, the DBSCAN algorithm takes considerably more time than K-means and SOM.
Table 1: 
Results of KMeans algorithm 

Table 2: 
Results of SOM algorithm 

Table 3: 
Results of DBSCAN algorithm 


Fig. 8: 
Topological similarity map for Art_data_1 (Number of epochs:
1000) 

Fig. 9: 
Topological similarity map for Art_data_2 (Number of epochs:
1000) 

Fig. 10: 
Topological similarity map for IRIS (Number of epochs: 1000) 

Fig. 11: 
Sorted 4-dist graph for Art_data_1 

Fig. 12: 
Sorted 4-dist graph for Art_data_2 

Fig. 13: 
Sorted 4-dist graph for IRIS 
Table 4: 
Comparative performance of clustering algorithms 

The cluster quality index β is lower for DBSCAN because the actual clusters were not very homogeneous. The well-separated convex clusters in the Art_data_2 data set were discovered equally well by all three algorithms, with DBSCAN taking the most CPU time, followed by SOM and then K-means. Again DBSCAN has a slightly lower β value, but that resulted in a higher percentage of samples being correctly clustered.
For the Iris data set, where two of the clusters are not linearly separable, DBSCAN has the lowest percentage of samples correctly classified, as it misclassified some of the samples as noise even though there is no noise in the data set. However, its β value is much higher than the others because it removed, as noise, some of the input data that were increasing the non-homogeneity. So, DBSCAN is able to cluster data better than or as well as K-means and SOM, particularly when the clusters are not convex or linearly separable. It is difficult to determine the actual number of clusters for K-means and SOM, particularly when the clusters are non-convex or not linearly separable. The topological maps produced by the SOM can be used to determine the number of clusters when the clusters are well separated and linearly separable, but they are unable to show clusters that are not linearly separable. The number of clusters in Art_data_1 is not apparent from its topological map, which shows 3 separate regions of high similarity. Similarly, for the Iris data set, only one region of high similarity is visible in its topological map, corresponding to the single linearly separable cluster in the data set; the other region does not show high similarity within it. The topological similarity map for the Art_data_2 data set correctly shows the two clusters in the data set.

CONCLUSIONS AND FUTURE WORK

This study presented a comparative study of the clustering techniques K-means, SOM and DBSCAN based on extensive simulations using the data sets Art_data_1, Art_data_2 and IRIS. Our study indicates that DBSCAN is better than K-means and SOM at discovering non-convex clusters and is as good as or better than K-means and SOM at extracting convex clusters. It can also detect noise. However, the DBSCAN algorithm takes much more CPU time for large data sets.
When the clusters are of arbitrary shape (e.g., the Art_data_1 data set), a density-based clustering algorithm such as DBSCAN is preferable. On the other hand, when the clusters are hyperspherical or convex in shape and well separated and the data set is large, SOM or K-means may be preferable as they are faster. Future work includes comparing the performance of K-means, SOM, DBSCAN and hierarchical clustering algorithms on incomplete data sets.

REFERENCES 
Ben-Dor, A. and Z. Yakhini, 1999. Clustering gene expression patterns. Proceedings of the 3rd Annual International Conference on Computational Molecular Biology, 1999, Lyon, France, pp: 11-14.
Blake, C.L. and C.J. Merz, 1998. UCI Repository of Machine Learning Databases. University of California, Irvine, CA.
Cadez, I.V., P. Smyth and H. Mannila, 2001. Probabilistic modeling of transaction data with applications to profiling, visualization and prediction. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 26-29, 2001, San Francisco, California, pp: 37-46.
Cutting, D.R., D.R. Karger, J.O. Pedersen and J.W. Tukey, 1992. Scatter/gather: A cluster-based approach to browsing large document collections. Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, June 21-24, 1992, Copenhagen, Denmark, pp: 318-329.
Dhillon, I., J. Fan and Y. Guan, 2001. Efficient Clustering of Very Large Document Collections. In: Data Mining for Scientific and Engineering Applications, Grossman, R.L., C. Kamath, P. Kegelmeyer, V. Kumar and R.R. Namburu (Eds.). Kluwer Academic Publishers, USA.
Duda, R. and P. Hart, 1973. Pattern Classification and Scene Analysis. John Wiley and Sons, New York.
Ester, M., H.P. Kriegel, J. Sander and X. Xu, 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, August 2-4, 1996, Portland, pp: 226-231.
Ester, M., A. Frommelt, H.P. Kriegel and J. Sander, 2000. Spatial data mining: Database primitives, algorithms and efficient DBMS support. Data Min. Knowl. Discov., 4: 193-216.
Fayyad, U.M., G. Piatetsky-Shapiro and P. Smyth, 1996. From Data Mining to Knowledge Discovery: An Overview. In: Advances in Knowledge Discovery and Data Mining, Fayyad, U.M., G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (Eds.). AAAI/MIT Press, Menlo Park, CA., USA., pp: 1-34.
Forgey, E., 1965. Cluster analysis of multivariate data: Efficiency versus interpretability of classification. Biometrics, 21: 768-780.
Foss, A., W. Wang and O. Zaane, 2001. A non-parametric approach to web log analysis. Proceedings of the 1st SIAM ICDM Workshop on Web Mining, 2001, Chicago, IL., pp: 41-50.
Hartigan, J.A., 1975. Clustering Algorithms. John Wiley, New York.
Hartigan, J.A. and M.A. Wong, 1979. Algorithm AS 136: A K-means clustering algorithm. J. R. Stat. Soc. Ser. C (Applied Stat.), 28: 100-108.
Heer, I. and E. Chi, 2001. Identification of web user traffic composition using multi-modal clustering and information scent. Proceedings of the 1st SIAM ICDM Workshop on Web Mining, April 5-7, 2001, Chicago, pp: 51-58.
Jain, A.K., M.N. Murty and P.J. Flynn, 1999. Data clustering: A review. ACM Comput. Surv., 31: 264-323.
Kohonen, T., 1990. The self-organizing map. Proc. IEEE, 78: 1464-1480.
Steinbach, M., G. Karypis and V. Kumar, 2000. A comparison of document clustering techniques. Proceedings of the 6th ACM SIGKDD World Text Mining Conference, August 20-23, 2000, Boston.
Xu, X., M. Ester, H.P. Kriegel and J. Sander, 1998. A distribution-based clustering algorithm for mining in large spatial databases. Proceedings of the 14th International Conference on Data Engineering, Feb. 23-27, 1998, Orlando, FL., pp: 324-331.
Sander, J., M. Ester, H.P. Kriegel and X. Xu, 1998. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Int. J. Data Min. Knowl. Discov., 2: 169-194.



