Abstract: Data clustering is an unsupervised task that can generate clusters of different shapes for a particular type of data set. Choosing an algorithm for a particular type of data set is therefore a difficult problem. This study addresses the choice of an appropriate clustering algorithm through a comparative study of three representative techniques, K-means, Kohonen's Self-Organizing Map (SOM) and Density Based Spatial Clustering of Applications with Noise (DBSCAN), based on extensive simulation studies. The comparison is performed on the basis of the cluster quality index β, the percentage of samples correctly classified and CPU time. The experimental results show that if the clusters are of arbitrary shape, a density-based clustering algorithm such as DBSCAN is preferable, whereas if the clusters are hyperspherical or convex and well separated, then SOM or K-means is preferable.
INTRODUCTION
Recently there has been an enormous growth in the amount of commercial and scientific data, such as protein sequences, retail transactions and web logs (Fayyad et al., 1996). Why is so much data stored? People store data because they believe valuable assets are implicitly coded within it, but raw data is rarely of direct benefit. Discovering groups of similar data items in a data set, known as cluster analysis, is an interesting and challenging issue. Many different data clustering algorithms have emerged from several disciplines, such as statistics, pattern recognition and machine learning (Jain et al., 1999). Different algorithms produce different types of clusters for the same data set, and some require additional prior knowledge about the nature of the data. Each clustering algorithm works well only for certain types of data and with certain applications. The choice of an appropriate clustering algorithm can therefore be made from a comparative study of the clustering algorithms.
Furthermore, the comparison is necessary as clustering is widely used in many areas like information retrieval and text mining (Cutting et al., 1992; Steinbach et al., 2000; Dhillon et al., 2001), spatial database applications, e.g., Geographical Information Systems (GIS) or astronomical data (Xu et al., 1998; Sander et al., 1998; Ester et al., 2000), sequence and heterogeneous data analysis (Cadez et al., 2001), web applications (Heer et al., 2001; Foss et al., 2001), DNA analysis in computational biology (Ben-Dor et al., 1999) and many others.
The clustering problem is the problem of dividing a given set {x1,..., xN} of N data points into several non-overlapping homogeneous groups. Each such group, or cluster, should contain similar data items, and data items from different groups should not be similar. A k-clustering of a data set X is a partition of X into k clusters C1,...,Ck such that the following three conditions are met:
i) Ci ≠ φ, i = 1,...,k,
ii) ∪i=1,...,k Ci = X (except outliers),
iii) Ci ∩ Cj = φ, i ≠ j, i, j = 1,...,k (sometimes violated).
The various data clustering algorithms can be classified into partitional, density-based and artificial-neural-network-based approaches. In this work, three representative algorithms, K-means (Hartigan et al., 1979; Hartigan, 1975), DBSCAN (Ester et al., 1996) and the Kohonen Self-Organizing Map (SOM) (Kohonen, 1990), corresponding to the partitional, density-based and artificial neural network categories respectively, are taken for the comparative study.
The comparison is performed on the basis of the cluster quality index β, the percentage of samples correctly clustered and CPU time. Mathematically, β is defined as the ratio of the total scatter of the data about the global mean to the pooled within-cluster scatter:

β = [ Σi=1,...,k Σj=1,...,ni (Xij − X̄)T (Xij − X̄) ] / [ Σi=1,...,k Σj=1,...,ni (Xij − X̄i)T (Xij − X̄i) ]

where ni is the number of points in the ith (i = 1,...,k) cluster, Xij is the feature vector of the jth pattern (j = 1,...,ni) in cluster i, X̄i is the mean of the points in cluster i and X̄ is the mean of the entire data set. A higher value of β indicates more homogeneous clusters.
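The printed equation for β did not survive reproduction. A minimal sketch of the index, assuming the commonly used definition of β as the ratio of total scatter about the global mean to pooled within-cluster scatter (so that larger values indicate more homogeneous clusters), might look like this; the function name is illustrative:

```python
import numpy as np

def beta_index(X, labels):
    """Cluster quality index beta: ratio of the total scatter about the
    global mean to the pooled within-cluster scatter. Larger values
    indicate more homogeneous clusters (assumed definition)."""
    X = np.asarray(X, dtype=float)
    grand_mean = X.mean(axis=0)
    total = ((X - grand_mean) ** 2).sum()        # scatter about global mean
    within = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        within += ((members - members.mean(axis=0)) ** 2).sum()
    return total / within
```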
CLUSTERING ALGORITHMS
Partitional clustering produces clusters by optimising a criterion function defined either locally or globally. The user should have prior knowledge of the number of clusters. The algorithm is run multiple times with different starting points and the best configuration obtained from all of the runs is used as the output clustering. Let us discuss each of the clustering algorithms in turn.
K-means: The k-means algorithm (Hartigan et al., 1979; Hartigan, 1975) is the most popular clustering tool used in scientific and industrial applications. The name comes from representing each of the k clusters Cj by the mean (or weighted average) cj of its points, the so-called centroid. While this obviously does not work well with categorical attributes, it makes good geometric and statistical sense for numerical attributes. The sum of discrepancies between each point and its centroid, expressed through an appropriate distance measure, is used as the criterion function. For example, the L2-norm based objective function, the sum of the squares of errors (SSE) between the points and the corresponding centroids, is equal to the total intra-cluster variance:

E(C) = Σj=1,...,k Σxl∈Cj || xl − cj ||2

where k is the number of clusters. The sum of the squares of errors can be rationalized as (the negative of) the log-likelihood of a normally distributed mixture model and is widely used in statistics. Therefore the k-means algorithm can be derived from a general probabilistic framework. Two versions of k-means iterative optimisation are known. The first version is known as Forgy's algorithm (Forgy, 1965) and consists of the following steps:
1. Choose the number of clusters, k.
2. Set the initial cluster centres c1, c2,..., ck to k arbitrarily selected vectors from the dataset.
3. Classify each vector xl = [xl1, xl2,..., xld] (d is the dimension of the input vectors) into the cluster with the closest centre ci under the Euclidean distance measure, i.e., assign xl to cluster i where i = arg minj || xl − cj ||.
4. Recompute the estimates for the cluster centres ci. Let ci = [ci1, ci2,..., cid]; the component cim is computed as the mean of the mth feature over the ni vectors currently in cluster i: cim = (1/ni) Σxl∈Ci xlm.
5. If none of the cluster centres (ci, i = 1, 2,..., k) changes in step 4, stop; otherwise go to step 3.
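The five steps above can be sketched directly in NumPy; the function name and the tie-breaking for empty clusters are illustrative choices, not part of the original description:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Forgy-style k-means: alternately assign each point to its nearest
    centre (Euclidean distance) and recompute centres as cluster means,
    stopping when the centres no longer change."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # step 2
    for _ in range(max_iter):
        # Step 3: distance of every point to every centre, then assign.
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: recompute each centre as the mean of its members
        # (an empty cluster keeps its old centre - an implementation choice).
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centres[i] for i in range(k)])
        if np.allclose(new, centres):  # step 5: stop when unchanged
            break
        centres = new
    return centres, labels
```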
The second version (classic in iterative optimisation) of k-means iterative optimisation reassigns points based on more detailed analysis of effects on the objective function caused by moving a point from its current cluster to a potentially new one. If a move has a positive effect, the point is relocated and the two centroids are recomputed. It is not clear that this version is computationally feasible, because the outlined analysis requires an inner loop over all member points of involved clusters affected by centroids shifts. However, in L2 case it is known (Duda et al., 1973) that all computations can be algebraically reduced to simply computing a single distance! Therefore, in this case both versions have the same computational complexity.
The wide popularity of the k-means algorithm is well deserved. It is simple and straightforward and is based on the firm foundation of analysis of variance. However, the K-means algorithm also suffers from all the usual suspects: i) the result strongly depends on the initial guess of centroids (or assignments), ii) it is not obvious what a good value of k is, iii) the process is sensitive to outliers, iv) the algorithm lacks scalability, v) only numerical attributes are covered and vi) the resulting clusters can be unbalanced.
Kohonen's Self-Organizing Map (SOM): The Kohonen self-organizing map (Kohonen, 1990) is an unsupervised, competitive-learning clustering network in which only one neuron (or only one neuron in a group) is on at a time. Self-organizing maps (SOMs) can be used as a data visualization technique, as they reduce the dimensionality of data through self-organizing neural networks while preserving its topological structure. The problem that data visualization attempts to solve is that humans simply cannot visualize high-dimensional data as is, so techniques are needed to help us understand it. SOMs reduce dimensionality by producing a map, usually of one or two dimensions, that plots the similarities of the data by grouping similar data items together. The plot is drawn by giving each output neuron a grey-scale colour proportionate to its weight vector's average distance from its immediate neighbours' weight vectors. Thus SOMs accomplish two things: they reduce dimensionality and display similarities.
Basic architecture: Figure 1 shows the architecture of the Kohonen SOM. It contains a single layer of neurons in addition to an input layer of branching nodes. There are k neurons in the neural layer and each has a parametric weight vector wk of dimension n, which is the same as the dimension of the input feature vectors. Three weight-update strategies are commonly distinguished:
• In the first strategy, called the winner-take-all strategy, only the winning neuron updates its parametric weight vector; all other neurons keep their old values.
• In the second strategy, all neurons that are close to the winning neuron are updated positively (reinforced, or rewarded) and all neurons that are farther away from the winner are updated negatively.
• In the third strategy, when an input vector is presented, the winning neuron and all the neurons within its topological neighbourhood update their weight vectors. This is the strategy adopted by the SOM algorithm described below.
Fig. 1: Kohonen self-organizing map
Fig. 2: Linear topology
Fig. 3: 2-D rectangular grid topology
Neighbourhoods of a unit designated by # of radii R = 2, 1 and 0 in a one-dimensional topology with 5 cluster units is shown in Fig. 2.
A neighbourhood for a 2D planar rectangular grid topology is shown in Fig. 3.
Algorithm SOM
1. Fix the number of output neurons k, each of which represents a cluster.
2. Initialise the weight vector of each output neuron with i) random values, or ii) the first k input vectors.
3. Set the topological neighbourhood parameters and the learning rate parameter α.
4. While the stopping condition is false
5.   For each input x
6.     Update the weight vector wj of the closest output neuron and its neighbours as follows: wji(new) = wji(old) + α·[xi − wji(old)], i = 1,..., n.
7.   End for
8.   Reduce the learning rate.
9.   Reduce the radius of the neighbourhood at specified intervals.
10. End while
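As a rough illustration of steps 1-10 for a one-dimensional topology, the following sketch shrinks both the learning rate and the neighbourhood radius linearly over the epochs; the linear schedules, the initialisation from k randomly chosen input vectors and the default parameter values are assumptions made for the example:

```python
import numpy as np

def som_1d(X, k=5, epochs=100, alpha0=0.9, radius0=2, seed=0):
    """1-D SOM following the listed steps: find the winning (closest)
    output neuron for each input, update it and its topological
    neighbours towards the input, then shrink the learning rate and
    the neighbourhood radius as the epochs proceed."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    W = X[rng.choice(len(X), size=k, replace=False)].copy()  # step 2-style init
    for t in range(epochs):
        alpha = alpha0 * (1.0 - t / epochs)                 # step 8: reduce rate
        radius = int(round(radius0 * (1.0 - t / epochs)))   # step 9: shrink radius
        for x in X:
            j = np.linalg.norm(W - x, axis=1).argmin()      # winning neuron
            lo, hi = max(0, j - radius), min(k, j + radius + 1)
            W[lo:hi] += alpha * (x - W[lo:hi])              # step 6 update rule
    return W
```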
The learning process has two separate but related phases: the ordering phase and the convergence phase. During the ordering phase, the learning rate should be set close to unity and then gradually decreased, but not allowed to go below a certain threshold value (usually 0.1). It is during this part of the learning process that the topological ordering of the weight vectors is carried out. The convergence phase is the second phase of the learning and is generally the longest part of the network learning (typically 80% of the epochs). The remaining iterations are necessary during this phase for carrying out fine adjustments of the map, i.e., of the weight vectors. In this phase the learning parameter should keep very small values for a long time; typically, it is slowly reduced from 0.1 to zero. The neighbourhood radius is kept relatively large at the beginning of the training and then shrunk monotonically with the epochs to zero.

The learning rule drags the weight vector associated with the winning unit towards the input vector and it also drags the weights of the closest units (i.e., those within its neighbourhood radius) along with it. We can imagine the working of the SOM as an elastic net in the input space that wants to come as close as possible to the inputs of the network. The elastic net has the topology of the output array and the points of the net can be thought of as having the weights as coordinates.
DBSCAN: In Euclidean space, a set of points can be divided into a set of connected components. Implementing this partitioning for a finite set of points requires the concepts of density, connectivity and boundary. A cluster, defined as a connected dense component, grows in any direction that density leads. Therefore density-based algorithms are capable of discovering clusters of arbitrary shape. This also provides natural protection against outliers. Figure 4 shows some cluster shapes that present a problem for partitioning relocation clustering (e.g., k-means) but are handled properly by density-based algorithms. Density-based algorithms also have good scalability.
Fig. 4: Irregular shapes of clusters
The algorithm DBSCAN (Density Based Spatial Clustering of Applications with Noise) (Ester et al., 1996), targeting low-dimensional spatial data, is the major representative of density-based connectivity. Other algorithms of this category include GDBSCAN (Sander et al., 1998) and DBCLASD (Ester et al., 2000). The basic ideas of density-based clustering involve a number of new definitions. We present these definitions intuitively and then follow up with a high-level algorithm.
• The ε-neighbourhood of a point p, denoted by Nε(p), is defined as Nε(p) = {q ∈ D | d(p, q) ≤ ε}.
• There are two kinds of points in a cluster: points inside the cluster (core points) and points on the border of the cluster (border points). In general, the ε-neighbourhood of a border point contains significantly fewer points than the ε-neighbourhood of a core point.
• A point p is directly density-reachable from a point q with respect to ε and MinPts if:
i. p ∈ Nε(q) and
ii. |Nε(q)| ≥ MinPts (core point condition).
• A point p is density-reachable from a point q with respect to ε and MinPts if there is a chain of points p1,...,pn, p1 = q, pn = p, such that pi+1 is directly density-reachable from pi.
• A point p is density-connected to a point q with respect to ε and MinPts if there is a point o such that both p and q are density-reachable from o with respect to ε and MinPts.
To find a cluster, DBSCAN starts with an arbitrary point p and retrieves all points density-reachable from p with respect to ε and MinPts. If p is a core point, this procedure yields a cluster. If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the dataset.
Algorithm DBSCAN
Input: i) A dataset SetOfPoints of points, all initially marked UNCLASSIFIED.
ii) The global density parameters ε and MinPts.
Output: Dense clusters and noise points.
1. ClusterId = 1;
2. For i = 1 to SetOfPoints.size do
3.   Point = SetOfPoints.get(i);
4.   If Point.Clid = UNCLASSIFIED then
5.     If ExpandCluster(SetOfPoints, Point, ClusterId, ε, MinPts) then
6.       ClusterId = ClusterId + 1;
7.     End If
8.   End If
9. End For
ExpandCluster(SetOfPoints, Point, Clid, ε, MinPts)
1. seeds = SetOfPoints.regionQuery(Point, ε);
2. If seeds.size < MinPts then
3.   SetOfPoints.changeClid(Point, NOISE);
4.   Return FALSE;
5. Else
6.   SetOfPoints.changeClid(seeds, Clid);
7.   seeds.delete(Point);
8.   While seeds ≠ Empty do
9.     currentP = seeds.first();
10.    result = SetOfPoints.regionQuery(currentP, ε);
11.    If result.size ≥ MinPts then
12.      For i = 1 to result.size do
13.        resultP = result.get(i);
14.        If resultP.Clid in (UNCLASSIFIED, NOISE) then
15.          If resultP.Clid = UNCLASSIFIED then
16.            seeds.append(resultP);
17.          End If
18.          SetOfPoints.changeClid(resultP, Clid);
19.        End If
20.      End For
21.    End If
22.    seeds.delete(currentP);
23.  End While
24.  Return TRUE;
25. End If
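The pseudocode above translates almost line for line into Python. In this sketch, the brute-force regionQuery and the sentinel values chosen for UNCLASSIFIED and NOISE are implementation choices, not part of the original pseudocode:

```python
import numpy as np

UNCLASSIFIED, NOISE = -2, -1  # sentinel cluster ids (implementation choice)

def dbscan(X, eps, min_pts):
    """DBSCAN: grow a cluster from every unclassified core point by
    repeatedly expanding the seed set with the eps-neighbourhoods of
    core points, marking non-core isolated points as noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    clid = np.full(n, UNCLASSIFIED)

    def region_query(i):
        # Brute-force eps-neighbourhood; fine for small data sets.
        return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

    cluster_id = 0
    for p in range(n):
        if clid[p] != UNCLASSIFIED:
            continue
        seeds = list(region_query(p))
        if len(seeds) < min_pts:      # not a core point: tentatively noise
            clid[p] = NOISE
            continue
        clid[seeds] = cluster_id      # assign the whole neighbourhood
        seeds = [s for s in seeds if s != p]
        while seeds:
            q = seeds.pop(0)
            result = region_query(q)
            if len(result) >= min_pts:   # q is a core point: absorb neighbours
                for r in result:
                    if clid[r] in (UNCLASSIFIED, NOISE):
                        if clid[r] == UNCLASSIFIED:
                            seeds.append(r)
                        clid[r] = cluster_id
        cluster_id += 1
    return clid
```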
EXPERIMENTAL STUDIES
Description of the data sets: Experimental studies are performed on two artificially created data sets, named Art_data_1 and Art_data_2, and one real-life data set, IRIS (Blake et al., 1998).
Art_data_1: This data set contains two non-convex clusters in 2D. Cluster 1 contains 7534 samples and cluster 2 contains 7346 samples. Each sample has two features. The underlying data set contains two non-overlapping clusters. Figure 5 shows the distribution of the data set.
Art_data_2: This data set contains two linearly separable clusters of 500 samples each. The samples are vectors of three numeric feature values, i.e., all the samples are in 3D. The classes are cube shaped, one centred at (4,4,4) with sides of length 1 and the other centred at (6.5,6.5,6.5) with sides of length 1.5. The two clusters are fairly close to each other but do not overlap.
Fig. 5: Art_data_1 artificial data set
Fig. 6: Art_data_2 artificial data set
Fig. 7: IRIS data set
The two clusters also differ considerably in density. The Art_data_2 data set is shown in Fig. 6.
Iris data set: The Iris data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. There are 4 attributes: sepal length, sepal width, petal length and petal width. One class is linearly separable from the other two; the latter are not linearly separable from each other. To visualize the IRIS data set, the 4 attributes have been plotted in 2 triplets, as shown in Fig. 7.
RESULTS
The percentage of samples correctly clustered is calculated by using the true class labels of the samples as given in the data set. Each discovered cluster is assigned the class label of the class to which the majority of its samples belong. Then the percentage of samples whose cluster label matches their actual class is calculated. The CPU time is obtained on a Pentium III 500 MHz workstation with 256 MB RAM.
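The majority-label procedure just described can be sketched as follows; treating a negative cluster id as noise (and hence as an error) is an assumption made for the example:

```python
import numpy as np

def percent_correct(true_labels, cluster_ids):
    """Label each discovered cluster with the majority true class among
    its members, then report the percentage of samples whose cluster
    label matches their actual class. Noise points count as errors."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    correct = 0
    for c in np.unique(cluster_ids):
        if c < 0:          # convention: negative id marks noise
            continue
        members = true_labels[cluster_ids == c]
        correct += np.bincount(members).max()  # majority class count
    return 100.0 * correct / len(true_labels)
```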
K-means: Table 1 gives the results of the K-means algorithm obtained from all three data sets by taking 1000 iterations.
Self-organizing feature map: Figures 8-10 show the topographical maps of a 30 by 30 rectangular grid of output neurons obtained from the three data sets. In this method, all the weight vectors are initialised to the first sample of the data set and in the successive iterations the weights are adjusted. Table 2 gives the results of the SOM algorithm using all three data sets by taking 1000 iterations.
DBSCAN: The sorted k-dist graphs for the three data sets are shown in Fig. 11-13.
Based on the estimates made using the sorted k-dist graphs and some trial-and-error search around those values, fairly good density parameter values were chosen for each data set. The results of clustering by DBSCAN using those parameter values are shown in Table 3.
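A sorted k-dist graph of the kind used here to estimate ε can be computed with a few lines of NumPy; the brute-force distance matrix is an illustrative choice suitable only for small data sets:

```python
import numpy as np

def sorted_k_dist(X, k=4):
    """Sorted k-dist graph used to estimate eps: for every point, compute
    the distance to its k-th nearest neighbour, then sort those distances
    in descending order. The 'knee' of the curve is a good eps candidate."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d.sort(axis=1)               # column 0 is each point's distance to itself
    kth = d[:, k]                # distance to the k-th nearest neighbour
    return np.sort(kth)[::-1]
```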
Comparative analysis: Table 4 shows the comparative performance of the three clustering algorithms, K-Means, SOM and DBSCAN, for each of the three data sets. DBSCAN could discover the clusters in the Art_data_1 data set with good accuracy while K-means and SOM were unable to cluster it properly. This shows that DBSCAN can detect non-convex clusters while K-means and SOM cannot. However, the DBSCAN algorithm takes considerably more time than K-means and SOM.
Table 1: Results of K-Means algorithm
Table 2: Results of SOM algorithm
Table 3: Results of DBSCAN algorithm
Fig. 8: Topological similarity map for Art_data_1 (Number of epochs: 1000)
Fig. 9: Topological similarity map for Art_data_2 (Number of epochs: 1000)
Fig. 10: Topological similarity map for IRIS (Number of epochs: 1000)
Fig. 11: Sorted 4-dist graph for Art_data_1
Fig. 12: Sorted 4-dist graph for Art_data_2
Fig. 13: Sorted 4-dist graph for IRIS
Table 4: Comparative performance of clustering algorithms
The cluster quality index β for DBSCAN is lower because the actual clusters were not very homogeneous. The well separated convex clusters in Art_data_2 data set were discovered by all three algorithms equally well with DBSCAN taking the most CPU time followed by SOM and then K-means. Again DBSCAN has a slightly lower β value but that resulted in a higher percentage of samples getting correctly clustered.
For the Iris data set, where two of the clusters are not linearly separable, DBSCAN has the lowest percentage of samples correctly classified, as it misclassified some of the samples as noise when there is no noise in the data set. However, its β value is much higher than the others because it removed, as noise, some of the input data that were increasing the non-homogeneity. So, DBSCAN is able to cluster non-linearly separable data better than or as well as K-means and SOM, particularly when the clusters are not convex or linearly separable.
It is difficult to determine the actual number of clusters for K-means and SOM, particularly when the clusters are non-convex or not linearly separable. The topological maps produced by the SOM can be used to determine the number of clusters when the clusters are well separated and linearly separable, but they are unable to show clusters that are not linearly separable. The number of clusters in Art_data_1 is not apparent from its topological map, which shows 3 separate regions of high similarity. Similarly, for the Iris data set, only one region of high similarity is visible in its topological map, corresponding to the single linearly separable cluster in the data set. The other region does not show a high similarity within it. The topological similarity map for the Art_data_2 data set correctly shows the two clusters in the data set.
CONCLUSIONS AND FUTURE WORK
This study presented a comparative study of three clustering techniques, K-means, SOM and DBSCAN, based on extensive simulation using the data sets Art_data_1, Art_data_2 and IRIS. Our study indicates that DBSCAN is better than K-means and SOM at discovering non-convex clusters and is as good as or better than them at extracting convex clusters. It can also detect noise. However, the DBSCAN algorithm takes much more CPU time on large data sets. When the clusters are of arbitrary shape (e.g., the Art_data_1 data set), a density-based clustering algorithm such as DBSCAN is preferable. On the other hand, when the clusters are hyperspherical or convex and well separated and the data set is large, SOM or K-means may be preferable as they are faster.
Future work includes comparing the performance of K-means, SOM, DBSCAN and hierarchical clustering algorithms using incomplete data sets.