
Research Article


Performance Evaluations of κ-Approximate Modal Haplotype-Type Algorithms for Clustering Categorical Data

Ali Seman,
Azizian Mohd Sapawi
and
Mohd Zaki Salleh



ABSTRACT

The effectiveness of κ-Approximate Modal Haplotype (κ-AMH)-type algorithms for clustering Y-short tandem repeat (Y-STR) categorical data has been demonstrated previously. More recently introduced κ-AMH-type algorithms, including the new κ-AMH I (N-κAMH I), the new κ-AMH II (N-κAMH II) and the new κ-AMH III (N-κAMH III), are derived from the same κ-AMH optimization and fuzzy procedures but incorporate two new methods, namely a new initial center selection method and a new dominant weighting method. This study evaluates and presents the performance of κ-AMH-type algorithms for clustering five categorical data sets, namely soybean, zoo, hepatitis, voting and breast. The performance criteria include accuracy, precision and recall analyses. Overall, the κ-AMH-type algorithms perform well when clustering all of the categorical data sets mentioned above. Specifically, the N-κAMH I algorithm exhibits the best performance when clustering the five categorical data sets: it obtained the highest combined mean accuracy score (0.9130), compared with those of κ-AMH (0.8971), N-κAMH II (0.8885) and N-κAMH III (0.9011). This high score is attributable to the newly introduced initial center selection method combined with the original dominant weighting method. These results present a new and significant benchmark, indicating that κ-AMH-type algorithms can be generalized to any categorical data.





Received: July 14, 2015;
Accepted: September 11, 2015;
Published: September 22, 2015


INTRODUCTION
Clustering algorithms have been used for categorical data since Huang (1998) introduced the hard κ-Modes algorithm. A year later, the fuzzy κ-Modes algorithm (Huang and Ng, 1999), based on fuzzy clustering procedures, was introduced. These algorithms use the same optimization procedure as the κ-means algorithm; however, they replace the mean with the mode as the cluster center and replace the Euclidean distance with a simple matching dissimilarity measure. Consequently, the κ-Modes-type algorithm became a pillar algorithm for clustering categorical data. Many extended κ-Modes-type algorithms were later developed, e.g., κ-Modes with four new attribute weightings (He et al., 2007), κ-Modes with a new dissimilarity measure (Ng et al., 2007), κ-Populations (Kim et al., 2005) and the new fuzzy κ-Modes algorithm (Ng and Jing, 2009).
Recently, an algorithm called κ-Approximate Modal Haplotype (κ-AMH) was introduced exclusively for clustering Y-Short Tandem Repeat (Y-STR) categorical data (Seman et al., 2012a). This algorithm is also based on the fuzzy clustering procedure; however, it is quite different. The main difference is its use of actual objects (also known as medoids) as cluster centers, instead of the mode mechanism (known as the centroid) imposed by the κ-Modes-type algorithms. In addition, the algorithm uses a different optimization technique: the maximization of its cost function. As a result, the medoid mechanism of the κ-AMH algorithm improved the overall accuracy scores for clustering Y-STR categorical data; in particular, the minimum and maximum accuracy scores improved compared with those of its competitors, the κ-Modes-type algorithms. Detailed comparisons are presented by Seman et al. (2012a). Furthermore, the κ-AMH algorithm was also tested for clustering other categorical data sets, e.g., soybean, voting, breast, zoo, mushroom, lymphography, credit, hepatitis and adult, with promising results (Seman et al., 2013b). In fact, in some cases (e.g., clustering the zoo data set with a higher number of clusters (seven)), the κ-AMH algorithm was more accurate than κ-Modes-type algorithms such as the fuzzy κ-Modes and new fuzzy κ-Modes algorithms. From these results, it is clear that the main advantage of the κ-AMH algorithm is that it is not overly sensitive to the initial center selections, even though they are randomly selected. It is well known that initial center selection is one of the factors determining clustering performance, particularly for κ-means-type algorithms (Li et al., 2008).
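The centroid/medoid distinction above can be made concrete with a short sketch. The simple matching dissimilarity is the measure of Huang (1998); the helper names are ours, and `medoid_center` shows only the common "minimum total dissimilarity" notion of a medoid, whereas κ-AMH selects its medoids by maximizing its own cost function:

```python
from collections import Counter

def simple_matching(x, y):
    """Simple matching dissimilarity (Huang, 1998): the number of
    attribute positions at which two categorical objects differ."""
    return sum(a != b for a, b in zip(x, y))

def mode_center(cluster):
    """Centroid in the κ-Modes sense: the most frequent category of each
    attribute. The result need not be an actual object of the cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def medoid_center(cluster):
    """Center in the object-as-center sense: the actual cluster member
    with the smallest total dissimilarity to all other members."""
    return min(cluster,
               key=lambda c: sum(simple_matching(c, x) for x in cluster))
```

The key behavioral difference is that `mode_center` can synthesize a combination of attribute values that no real object has, while `medoid_center` always returns one of the observed objects.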
Moreover, in a recent attempt to improve the clustering results for Y-STR data, three new κ-AMH algorithms were developed as extensions of the κ-AMH algorithm: the new κ-AMH I (N-κAMH I), the new κ-AMH II (N-κAMH II) and the new κ-AMH III (N-κAMH III) (Seman et al., 2015). These κ-AMH-type algorithms retain the same clustering procedure but introduce two new methods: (1) A new initial center selection method and (2) A new dominant weighting method. As a result, the N-κAMH III algorithm demonstrated a 2% improvement in clustering accuracy on Y-STR categorical data compared with the κ-AMH algorithm and the other two κ-AMH-type algorithms (viz., N-κAMH I and N-κAMH II). This improvement is attributable to the two methods above. Detailed results for the κ-AMH-type algorithms can be found in Seman et al. (2015).
This study aims to evaluate the performance of the newly introduced κ-AMH-type algorithms (N-κAMH I, II and III) when clustering five categorical data sets, namely soybean, zoo, hepatitis, voting and breast, comparing their performance with that of the original κ-AMH algorithm. These new κ-AMH algorithms, in particular N-κAMH III, have demonstrated superiority when clustering Y-STR data. However, the characteristics of Y-STR categorical data differ essentially from those of common categorical data such as soybean, zoo, hepatitis, voting and breast. The main difference is that the degree of similarity among common categorical data is not as high as it is within Y-STR data. Common categorical data sets typically comprise similar and distinct objects but not identical objects. By contrast, Y-STR data sets comprise identical, similar and distinct objects, because each identical object refers to a different individual with the same DNA characteristics. Further details regarding Y-STR data are available in Seman et al. (2013a). It is important to know whether the new κ-AMH algorithms, with the two new methods designed specifically for Y-STR categorical data, can also be applied efficiently to the common categorical data listed above. The favorable results obtained in this study thus present a new and significant benchmark for clustering categorical data.
MATERIALS AND METHODS
κ-AMH algorithm: The original κ-AMH algorithm begins by randomly selecting k objects from the categorical objects in X as the initial cluster centers H (the medoids). Each medoid h∈H is then tested against each object x∈X one by one. The final cluster centers H are obtained by replacing h with x whenever the cost function P(W, D), given in Eq. 1 and calculated as in Eq. 2, is maximized, where:
• w_{li}∈W is a (k×n) fuzzy membership matrix that denotes the degree of fuzzy membership of the object i in the l^{th} cluster, taking a value between 0 and 1, as in Eq. 3:
where, k (≤n) is the known number of clusters, H is the set of medoids, α∈[1, ∞) is a weighting exponent and d(X_{i}, H_{z}) is the distance measured between the object X_{i} and the medoid H_{z}:
• d_{li}∈D is another (k×n) partition matrix with dominant weighting values of 1.0 or 0.5. The dominant weighting value d_{li} is calculated as follows:
subject to the standard fuzzy-partition constraints (each w_{li}∈[0, 1], with each object's memberships summing to 1 across the k clusters). The descriptions above represent the simplified version of the κ-AMH algorithm reported by Seman et al. (2015); the original descriptions, which are similar to those above, can be found in Seman et al. (2012a).
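The procedure just described can be sketched in Python. This is a hedged illustration, not the authors' implementation: the membership update uses the standard fuzzy-partition form, and the cost P(W, D) is assumed here to be the membership-weighted sum of dominant weights (the exact Eq. 1-3 are in Seman et al., 2012a); all function names are ours:

```python
import random

def matching_dissim(x, y):
    """Simple matching dissimilarity between two categorical objects."""
    return sum(a != b for a, b in zip(x, y))

def memberships(X, H, alpha=1.1):
    """Fuzzy membership of each object in each medoid's cluster, in the
    standard fuzzy-partition form (one row per object; the paper's W is
    the k x n transpose)."""
    W = []
    for x in X:
        dists = [matching_dissim(x, h) for h in H]
        zeros = dists.count(0)
        if zeros:  # the object coincides with one or more medoids
            W.append([1.0 / zeros if d == 0 else 0.0 for d in dists])
        else:
            W.append([1.0 / sum((dl / dz) ** (1.0 / (alpha - 1))
                                for dz in dists) for dl in dists])
    return W

def dominant_weights(W):
    """Original two-level dominant weighting: 1.0 for the cluster holding
    an object's maximum membership, 0.5 for all others."""
    return [[1.0 if w == max(row) else 0.5 for w in row] for row in W]

def cost(W, D, alpha=1.1):
    """Cost P(W, D), assumed here as the sum of w^alpha * d terms;
    κ-AMH maximizes this quantity over medoid choices."""
    return sum(w ** alpha * d
               for rw, rd in zip(W, D) for w, d in zip(rw, rd))

def k_amh(X, k, alpha=1.1, seed=0):
    """One pass of the medoid search: each medoid is tentatively replaced
    by each object in turn, keeping any replacement that raises the cost."""
    rng = random.Random(seed)
    H = rng.sample(X, k)

    def p(medoids):
        W = memberships(X, medoids, alpha)
        return cost(W, dominant_weights(W), alpha)

    best = p(H)
    for l in range(k):
        for x in X:
            trial = H[:l] + [x] + H[l + 1:]
            c = p(trial)
            if c > best:
                H, best = trial, c
    W = memberships(X, H, alpha)
    labels = [row.index(max(row)) for row in W]
    return H, labels
```

Because candidate centers are drawn only from the data itself, the search is a greedy replacement over objects rather than an iterative centroid update, which is what makes the method comparatively insensitive to the random initial selection.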
κ-AMH-type algorithms: N-κAMH I, II and III are new κ-AMH algorithms that retain the original κ-AMH optimization procedure but introduce two new methods:
• New initial center selection method: This method replaces the randomized initial center selection with a fixed selection based on identical or similar objects. The detailed steps and descriptions of this method can be found in Seman et al. (2015)
• New dominant weighting method: This method replaces the original dominant weighting method of the κ-AMH algorithm, based on 50 or 100% probabilities, with a weighting method based on three probabilities: 50, 75 and 100%. The additional value of 0.75 (75%) was proposed to appropriately handle maximum membership values ranging from 1/κ to 1.0, where κ is the number of clusters. The new dominant weighting method is described in Eq. 7
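The three-level idea can be sketched as follows. The exact thresholds are those of Eq. 7 in Seman et al. (2015); the midpoint split of the [1/κ, 1] membership range used here is purely an assumption for illustration:

```python
def new_dominant_weights(W, k):
    """Three-level dominant weighting (sketch). Non-dominant clusters
    keep 0.5; the dominant (maximum-membership) cluster receives 1.0
    when its membership is high and 0.75 when it is closer to the lower
    bound 1/k. The split point used here, the midpoint of [1/k, 1],
    is an assumption and not necessarily the paper's Eq. 7."""
    mid = (1.0 / k + 1.0) / 2.0
    D = []
    for row in W:  # one membership row per object
        w_max = max(row)
        top = 1.0 if w_max >= mid else 0.75
        D.append([top if w == w_max else 0.5 for w in row])
    return D
```

The intermediate 0.75 level softens the penalty on objects whose dominant membership is weak, i.e., objects lying near cluster boundaries.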
The detailed steps and descriptions for this method can be found in Seman et al. (2015). For clarity, Table 1 shows the main differences between the κ-AMH-type algorithms.
Benchmark data sets: For benchmarking the results, five categorical data sets were used to compare the clustering performance of the algorithms: the soybean, zoo, hepatitis, voting and breast data sets from the UCI repository (Lichman, 2013). Table 2 provides a summary of all the data sets, presenting the number of objects, classes and attributes for each.
Evaluation methods: The analyses were primarily based on mean accuracy scores obtained from experiments that were run 100 times for each algorithm and data set. Each data set was randomly reordered before each run.
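A minimal sketch of such an evaluation, assuming each cluster is matched to its majority class (the a_l, b_l and c_l counts used later in this section; the function name is ours):

```python
from collections import Counter

def cluster_scores(labels, truth):
    """Accuracy, mean precision and mean recall, matching each cluster
    to its majority class. Per cluster l: a_l = correctly grouped
    objects, b_l = cluster size - a_l, c_l = class size - a_l."""
    class_sizes = Counter(truth)
    correct = 0
    precisions, recalls = [], []
    for l in set(labels):
        members = [t for lab, t in zip(labels, truth) if lab == l]
        cls, a_l = Counter(members).most_common(1)[0]
        correct += a_l
        precisions.append(a_l / len(members))    # a_l / (a_l + b_l)
        recalls.append(a_l / class_sizes[cls])   # a_l / (a_l + c_l)
    k = len(set(labels))
    return correct / len(labels), sum(precisions) / k, sum(recalls) / k
```

Repeating this over 100 randomized runs and averaging yields the mean scores reported in the tables.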
Table 1: 
κ-AMH-type algorithms

Table 2: 
Summary of categorical data sets 

In order to calculate the accuracy scores, a misclassification matrix was used. The accuracy r was calculated as defined by Huang (1998):

r = (Σ^{κ}_{l=1} a_{l})/n

where, κ is the number of clusters, a_{l} is the number of instances occurring in both cluster l and its corresponding group and n is the number of instances in the data set. As a secondary analysis, the precision (PR) and recall (RE) scores were calculated as:

PR = (1/κ) Σ^{κ}_{l=1} a_{l}/(a_{l}+b_{l}), RE = (1/κ) Σ^{κ}_{l=1} a_{l}/(a_{l}+c_{l})

where, b_{l} is the number of incorrectly classified objects in cluster l and c_{l} is the number of objects in the corresponding class but not in cluster l. Accuracy, precision and recall scores close to 1 indicate the best matching for each cluster and corresponding class pair.
RESULTS
Table 3 lists the clustering accuracy for each data set. In general, clustering categorical data using these κ-AMH-type algorithms produced very promising results. Overall, the score differences among the κ-AMH-type algorithms were merely in the range of 1-2%. However, the N-κAMH I algorithm obtained the highest combined mean accuracy score, 0.9130, which is slightly higher than those of κ-AMH (0.8971), N-κAMH II (0.8885) and N-κAMH III (0.9011).
Table 3: 
Mean, minimum and maximum clustering accuracy scores for all data sets after 100 experimental runs performed on each data set 

This excellent performance was essentially demonstrated by its consistency: over the 100 runs, N-κAMH I obtained identical accuracy scores for data sets 1, 2, 3 and 5 (Table 3). This was because of the incorporation of the new initial center selection method (a fixed selection) into the N-κAMH I algorithm. From the experimental results, it can be seen that the new initial center selection method, when combined with the original dominant weighting method, resulted in significantly improved clustering results. In fact, this combination also produced the highest combined mean minimum accuracy score (0.9126).
Table 4 lists the results of a secondary analysis using precision and recall scores, which provides further insight into the clustering results. In the precision analysis, the N-κAMH I algorithm produced the highest mean precision scores among the new κ-AMH-type algorithms for all data sets except data set 3: approximately 85.71% of the objects were correctly clustered, a precision score slightly lower than that of the κ-AMH algorithm (0.8639). Thus, only 14.29% of the objects assigned to clusters did not belong there, the second-best score after κ-AMH (13.61%) and considerably better than N-κAMH II (17.59%) and N-κAMH III (16.51%). By contrast, in the recall analysis, the N-κAMH I algorithm had a slightly higher recall score (0.8712) than the κ-AMH algorithm (0.8674). This means that N-κAMH I yielded the lowest proportion of objects missing from their proper clusters, at 12.88%, compared with κ-AMH (13.26%), N-κAMH II (16.37%) and N-κAMH III (15.21%). Table 5 shows a group comparison of clustering accuracy scores between N-κAMH I and the other three κ-AMH-type algorithms using one-way ANOVA. The assumption of homogeneity of variance was violated (Levene F(3, 1996) = 20.14, p<0.05); therefore, the Welch F-ratio is reported. There was a significant difference in clustering accuracy scores among the four algorithms (F(3, 1107.42) = 8.382, p<0.05); therefore, the Games-Howell procedure was used for multiple group comparisons. The performance of N-κAMH I (M = 0.9130, 95% CI [0.9062, 0.9199]) differed significantly from that of the κ-AMH and N-κAMH II algorithms (p<0.05). However, the results of the N-κAMH I algorithm did not differ significantly from those of N-κAMH III (p = 0.0647).
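The Welch F-ratio used above does not assume equal group variances. A pure-Python sketch of the standard Welch one-way ANOVA statistic follows (for illustration; not necessarily the exact software the authors used):

```python
def welch_anova(groups):
    """Welch's one-way ANOVA F statistic for groups with unequal
    variances. Returns (F, df1, df2)."""
    k = len(groups)
    ns = [len(g) for g in groups]
    means = [sum(g) / n for g, n in zip(groups, ns)]
    variances = [sum((x - m) ** 2 for x in g) / (n - 1)
                 for g, m, n in zip(groups, means, ns)]
    w = [n / v for n, v in zip(ns, variances)]   # precision weights n_i/s_i^2
    sw = sum(w)
    grand = sum(wi * mi for wi, mi in zip(w, means)) / sw
    num = sum(wi * (mi - grand) ** 2 for wi, mi in zip(w, means)) / (k - 1)
    lam = sum((1 - wi / sw) ** 2 / (ni - 1) for wi, ni in zip(w, ns))
    F = num / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df2 = (k ** 2 - 1) / (3 * lam)   # Welch's approximate error df
    return F, k - 1, df2
```

The fractional second degree of freedom produced by this approximation is why the text reports a non-integer value such as F(3, 1107.42).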
Table 4:  Clustering precision and recall scores for all data sets and the combined data set 

Table 5:  One-way ANOVA comparison between N-κAMH I and the other three κ-AMH-type algorithms 

 Fig. 1:  Box plot comparison of the clustering accuracy performance of the κ-AMH-type algorithms. The N-κAMH I algorithm shows excellent performance, with the highest median and minimum accuracy scores 
Statistically, the N-κAMH I algorithm outperformed both κ-AMH and N-κAMH II, and its performance was comparable to that of N-κAMH III. In addition, Fig. 1 shows a visual representation of the accuracy scores for the κ-AMH-type algorithms. The N-κAMH I algorithm was clearly superior, with the highest median and minimum accuracy scores in comparison with the other κ-AMH-type algorithms. These highest median and minimum accuracy scores were effectively backed by the identical accuracy scores obtained for data sets 1, 2, 3 and 5. The N-κAMH I and N-κAMH III algorithms had similar box plot representations, indicating similar mean performances, as confirmed by the ANOVA analysis above, although the median value of N-κAMH III was relatively low. In fact, the diagram shows that both algorithms (N-κAMH I and III) are more consistent than the other two (κ-AMH and N-κAMH II), producing no outliers and the highest minimum accuracy scores. This indicates that the new initial center selection contributes significantly to the overall performance of both algorithms.
DISCUSSION
The overall performance indicates that κ-AMH-type algorithms, including the original κ-AMH (Seman et al., 2012a) and its extensions, viz., N-κAMH I, II and III (Seman et al., 2015), can cluster any categorical data. For the common categorical data used in the evaluation above, the new κ-AMH algorithms, such as N-κAMH I and III, demonstrated an improvement of 1-2% in clustering accuracy over the original κ-AMH algorithm. This indicates that the improved methods in the new κ-AMH algorithms contributed significantly to the overall improvement. In general, the new initial center selection method and the new dominant weighting method introduced by Seman et al. (2015) were not restricted to highly similar categorical data such as the Y-STR data (Seman et al., 2013a) to which they had previously been applied (Seman et al., 2010a-f, 2012a, b, 2015). These methods can also be extended to common categorical data with promising results, particularly for a high number of clusters, such as the seven clusters of the zoo data set (Seman et al., 2013b). In addition, the N-κAMH I algorithm, which combines the new initial center selection method with the original dominant weighting method, achieved a clustering accuracy score approximately 2% higher than those of the other algorithms, as evinced by the experiments above. By contrast, when clustering Y-STR data, N-κAMH III, which combines the new initial center selection method with the new dominant weighting method, achieved the highest overall clustering accuracy score (Seman et al., 2015). The new initial center selection is, in reality, a fixed initial selection method; thus, it differs fundamentally from the original procedure of seeding the centers through randomized initial selection. Moreover, from the experimental results, it can be concluded that the original dominant weighting method with two conditions (100 and 50%), incorporated in N-κAMH I, is the most suitable method for common categorical data, whereas the new dominant weighting method with three conditions (100, 75 and 50%), incorporated in N-κAMH III and combined with the fixed initial center selection, is the most suitable method for clustering Y-STR categorical data. This indicates that the new initial center selection method (i.e., the fixed initial center selection method) plays an important role in the performance of clustering categorical data, regardless of the type of data (whether Y-STR or common categorical data). The previous results reported for Y-STR data, coupled with the experimental results presented above, confirm the algorithms' generality when clustering categorical data.
In other words, the algorithms are not exclusively limited to applications using Y-STR data; rather, they can be used for any categorical data application. The κ-AMH-type algorithms are approximation algorithms that seek an optimal solution; therefore, an improvement of approximately 1-2%, as obtained by N-κAMH I and III, is considerable. Both N-κAMH I and N-κAMH III can therefore be generalized and used for clustering any categorical data. Notably, all κ-AMH-type algorithms can be used to further develop clustering tools. This is because κ-AMH-type algorithms, characterized by the use of an object as the cluster center and highlighted in Seman et al. (2013b), have their own advantages over κ-Modes-type algorithms (e.g., the κ-Modes (Huang, 1998), fuzzy κ-Modes (Huang and Ng, 1999) and new fuzzy κ-Modes (Ng and Jing, 2009) algorithms), which are characterized by the mode mechanism.
CONCLUSION
From the analyses of the experimental results, it can be seen that the κ-AMH variants contributed significantly to the clustering of categorical data. The combinations of the newly introduced methods incorporated into the κ-AMH procedures demonstrated their advantages on certain categorical data. Although the combination of the two methods represented by the N-κAMH III algorithm did not result in a consistently superior performance for all data sets, as it did for the Y-STR data sets, this merely indicates that each method has its own advantages. Thus, the κ-AMH algorithm and its variants offer unique advantages for clustering categorical data. These algorithms have the potential to be generalized, that is, they can be used to cluster Y-STR data as well as any other categorical data, such as the data sets presented above.
ACKNOWLEDGMENT
This study was supported by the Fundamental Research Grant Scheme, Ministry of Education, Malaysia. We would like to thank IRMI and UiTM for their support of this research. We also extend our gratitude to those who contributed toward the completion of this study.

REFERENCES 
1: He, Z., X. Xu and S. Deng, 2007. Attribute value weighting in K-modes clustering. Report No. Tr060615, Cornell University Library, Cornell University, Ithaca, NY, USA. http://arxiv.org/abs/cs/0701013.
2: Huang, Z., 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining Knowledge Discovery, 2: 283-304.
3: Huang, Z. and M.K. Ng, 1999. A fuzzy k-Modes algorithm for clustering categorical data. IEEE Trans. Fuzzy Syst., 7: 446-452.
4: Kim, D.W., K.Y. Lee, D. Lee and K.H. Lee, 2005. A k-populations algorithm for clustering categorical data. Pattern Recognition, 38: 1131-1134.
5: Li, M.J., M.K. Ng, Y.M. Cheung and J.Z. Huang, 2008. Agglomerative fuzzy K-means clustering algorithm with selection of number of clusters. IEEE Trans. Knowledge Data Eng., 20: 1519-1534.
6: Lichman, M., 2013. UCI machine learning repository. University of California, School of Information and Computer Science, Irvine, CA.
7: Ng, M.K. and L. Jing, 2009. A new fuzzy k-Modes clustering algorithm for categorical data. Int. J. Granular Comput. Rough Sets Intell. Syst., 1: 105-119.
8: Ng, M.K., M.J. Li, J.Z. Huang and Z. He, 2007. On the impact of dissimilarity measure in k-Modes clustering algorithm. IEEE Trans. Pattern Anal. Machine Intell., 29: 503-507.
9: Seman, A., Z.A. Bakar and A.M. Sapawi, 2010. Centre-based clustering for Y-Short Tandem Repeats (Y-STR) as numerical and categorical data. Proceedings of the International Conference on Information Retrieval and Knowledge Management, March 17-18, 2010, Shah Alam, Selangor, pp: 28-33.
10: Seman, A., Z.A. Bakar and A.M. Sapawi, 2010. Attribute value weighting in k-modes clustering for Y-Short Tandem Repeats (Y-STR) surname. Proceedings of the International Symposium on Information Technology, June 15-17, 2010, Kuala Lumpur, Malaysia, pp: 1531-1536.
11: Seman, A., Z.A. Bakar and A.M. Sapawi, 2010. Hard and soft updating centroids for clustering Y-Short Tandem Repeats (Y-STR) data. Proceedings of the IEEE Conference on Open Systems, December 5-7, 2010, Kuala Lumpur, Malaysia, pp: 6-11.
12: Seman, A., Z.A. Bakar and A.M. Sapawi, 2010. Modeling centre-based hard and soft clustering for Y chromosome Short Tandem Repeats (Y-STR) data. Proceedings of the International Conference on Science and Social Research, December 5-7, 2010, Kuala Lumpur, Malaysia, pp: 68-73.
13: Seman, A., Z.A. Bakar and A.M. Sapawi, 2010. Centre-based hard and soft clustering approaches for Y-STR data. J. Genet. Geneal., 6: 1-9.
14: Seman, A., Z.A. Bakar and A.M. Sapawi, 2010. Centre-based hard clustering algorithm for Y-STR data. Malaysian J. Comput., 1: 62-73.
15: Seman, A., Z.A. Bakar and M.N. Isa, 2012. An efficient clustering algorithm for partitioning Y-short tandem repeats data. BMC Research Notes, Vol. 5. 10.1186/1756-0500-5-557.
16: Seman, A., Z.A. Bakar and M.N. Isa, 2012. Evaluation of k-Modes-type algorithms for clustering Y-Short Tandem Repeats data. Trends Bioinform., 5: 47-52.
17: Seman, A., Z.A. Bakar and M.N. Isa, 2013. First Y-short tandem repeat categorical dataset for clustering applications. Dataset Papers Sci., 10.7167/2013/364725.
18: Seman, A., Z.A. Bakar, A.M. Sapawi and I.R. Othman, 2013. A medoid-based method for clustering categorical data. J. Artif. Intell., 6: 257-265.
19: Seman, A., A.M. Sapawi and M.Z. Salleh, 2015. Towards development of clustering applications for large-scale comparative genotyping and kinship analysis using Y-short tandem repeats. J. Integr. Biol., 19: 361-367.



