INTRODUCTION
Clustering categorical data is one of the main clustering areas focused on by
many researchers. It began when Huang (1998) published
a new algorithm, called the k-Modes algorithm, designed specifically for categorical
data. The k-Modes algorithm was introduced due to the ineffectiveness of the k-Means
algorithm (MacQueen, 1967) for clustering categorical
data. Another effort, proposed by Ralambondrainy (1995)
using a hybrid numeric-symbolic method, also faced the problem of high computational
cost. As a result, the k-Modes algorithm has become the center of attention
for clustering categorical data.
Many variants of k-Modes-type algorithms have been proposed
to improve clustering performance. Modifying the existing
simple matching dissimilarity measure is one common route to better clustering
results. For example, He et al. (2007) modified
the original k-Modes algorithm (hereafter denoted the Extended
k-Modes algorithm) with four attribute weighting schemes: Relative Value
Frequency (RVF), Uncommon Attribute Value Matches (UAVM), a hybrid of
RVF and UAVM (denoted Hybrid I or HI) and a simplified
version of Hybrid I (denoted Hybrid II or HII). A similar
idea to the RVF method was also proposed by Ng et al.
(2007). The results reported by He et al. (2007)
show average clustering accuracy scores that almost always exceed
those of the original k-Modes algorithm
for the voting, breast cancer, mushroom, soybean, lymphography and zoo data sets.
On top of that, another significant result was obtained by Ng
and Jing (2009). They proposed a new Fuzzy k-Modes algorithm with relative
value frequency in fuzzy form and reported better clustering results for four
categorical data sets: Soybean, Zoo, Credit and Adult. Furthermore, Kim
et al. (2005) proposed a different idea in the k-Population algorithm
by introducing soft centroids into its updating process;
no exact centroid is assigned explicitly as in the other k-Modes-type algorithms.
This idea also produced impressive overall performance for categorical
data sets such as Soybean, Zoo, Credit and Hepatitis.
Recently, a new algorithm, called the k-Approximate Modal Haplotype (k-AMH) algorithm,
was proposed for clustering Y-Short Tandem Repeat (Y-STR) categorical data
(Seman et al., 2012). The algorithm was proposed
because the existing k-Modes-type algorithms could not handle the Y-STR data well.
The Y-STR data are unique and different from the other categorical
data (e.g., soybean, voting, mushroom, zoo): they are composed
of many similar and almost-similar objects within and between classes. This uniqueness
of the Y-STR data exposes the existing clustering algorithms to two problems.
The first problem is that the obtained centroids are not unique, resulting in empty
clusters. The second problem is that the obtained centroids are insufficient to represent
their clusters, which leads to a local minima problem. A detailed
description of these problems in clustering Y-STR data is given in Seman
et al. (2012).
From observations during the Y-STR clustering experiments, it was found
that the mode mechanism as the centroid is ineffective for highly similar data.
The centroid-based method (the mode/mean mechanism) is always prone to bad initial
center selection, especially when handling highly similar data. Initial center
selection is typically the main issue inherited from the k-Means
algorithm (Li et al., 2008). As a solution for
clustering Y-STR data, a medoid-based method was introduced into
the k-AMH algorithm to solve the two problems above. As a result, the
k-AMH algorithm successfully produced better results for clustering
Y-STR categorical data; e.g., the k-AMH algorithm recorded the highest overall mean accuracy
score of 0.93, compared to that of the other algorithms: k-Population (0.91),
k-Modes-RVF (0.81), New Fuzzy k-Modes (0.80), k-Modes (0.76), k-Modes-Hybrid
I (0.76), k-Modes-Hybrid II (0.75), Fuzzy k-Modes (0.74) and k-Modes-UAVM (0.70).
Detailed results for the k-AMH algorithm are given in Seman et
al. (2012).
This study reports the performance of the medoid-based k-AMH algorithm on
the other categorical data sets commonly used in experiments with the k-Modes-type
algorithms, e.g., the Extended k-Modes algorithms (He et
al., 2007), the k-Population algorithm (Kim et al.,
2005) and the New Fuzzy k-Modes algorithm (Ng and Jing,
2009). These algorithms were chosen for comparison with the k-AMH
algorithm because they obtained impressive results when clustering several
common categorical data sets.
METHODS
Classically, clustering is divided into hierarchical and non-hierarchical methods.
The non-hierarchical method is also known as the partitional method. The main difference
between the two is that the hierarchical method breaks the data up into
a hierarchy of clusters, whereas the partitional method divides the data set
into mutually disjoint partitions. Partitional methods often adopt one of two popular
heuristics: a centroid-based technique such as k-Means, or a representative
object-based (medoid) technique such as the k-Medoids algorithm (Han
and Kamber, 2001; Tan et al., 2006).
Centroid-based methods: In centroid-based methods, the centroid of a cluster
can be statistically represented by a measure of central tendency of the data
points: the mean, mode or median. Consequently, three
categories of algorithms exist: the k-Means-type, the k-Modes-type
and the k-Median-type algorithms. However, the most well-known centroid-based
technique is the k-Means algorithm (Han and Kamber, 2001).
The k-Means algorithm begins by initializing k clusters, where k is a predefined
number of clusters, and uses the mean values to calculate the distance between
objects and the k clusters. The distance measure is normally based on Euclidean
distance. The algorithm recalculates the mean of each cluster from
the objects that belong to it and minimizes the intra-cluster dissimilarity.
The centroid update by the mean is calculated as Eq. 1:

c_{l} = (1/m_{l}) Σ_{x∈C_{l}} x    (1)

where, c_{l} is the centroid of cluster C_{l}, x is an object
and m_{l} is the number of objects in the lth cluster.
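For illustration, the mean update of Eq. 1 can be sketched in a few lines of NumPy. This is a minimal sketch; the function and variable names are illustrative, not from the original paper:

```python
import numpy as np

def update_centroids(X, labels, k):
    """Recompute each centroid as the mean of the objects assigned to it (Eq. 1)."""
    return np.array([X[labels == l].mean(axis=0) for l in range(k)])

# Two well-separated clusters of 2-D points
X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.2, 9.1]])
labels = np.array([0, 0, 1, 1])
print(update_centroids(X, labels, 2))  # centroids are (1.1, 1.9) and (8.1, 9.05)
```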
When Huang (1998) proposed the k-Modes algorithm for
clustering categorical data, he used a mode mechanism as the centroid. Thus,
the centroid update by the mode is calculated as Eq. 2 and
subject to (2a):

q_{lj} = a_{j}^{(r)}    (2)

where, a_{j}^{(r)} is the mode of the values of attribute A_{j}
in cluster C_{l} such that:

f(A_{j} = a_{j}^{(r)} | C_{l}) ≥ f(A_{j} = a_{j}^{(t)} | C_{l}), for all categories a_{j}^{(t)} of A_{j}    (2a)
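The mode update can be sketched as follows: for each attribute, take the most frequent category among the cluster's objects. The names here are illustrative:

```python
from collections import Counter

def update_mode(cluster):
    """Mode update (Eq. 2): for each attribute, pick the most frequent
    category among the objects currently in the cluster."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*cluster)]

cluster = [("red", "round"), ("red", "square"), ("blue", "round")]
print(update_mode(cluster))  # ['red', 'round']
```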
Medoid-based methods: Instead of using the mean, mode or median, the medoid
method uses an actual object as the chosen center of a cluster. The most well-known medoid
method is the k-Medoids algorithm, or Partitioning Around Medoids (PAM), introduced
by Kaufman and Rousseeuw (1987). The algorithm works
with the objects that are the most centrally located within a cluster.
The basic idea of this algorithm is to find k clusters in n objects by first
arbitrarily choosing a representative object, often known as the medoid, for
each cluster. The next step is to iteratively replace one of the medoids by one
of the non-medoids as long as the replacement improves the clustering quality.
The swapping technique exchanges the current medoid, t_{i}, with
a non-medoid, t_{h}. The replacement by a new medoid must satisfy the
total cost TC_{ih} < 0 as in Eq. 3:

TC_{ih} = Σ_{j} C_{jih}    (3)

where, C_{jih} is the cost change for an object t_{j} when swapping
medoid t_{i} with non-medoid t_{h}.
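The swap cost of Eq. 3 can be sketched as follows, assuming a generic distance function. The helper names are illustrative, and each object's cost change is taken as the change in distance to its nearest medoid:

```python
def nearest_cost(x, medoid_idx, X, dist):
    """Distance from object x to its nearest medoid."""
    return min(dist(x, X[m]) for m in medoid_idx)

def total_swap_cost(X, medoids, i, h, dist):
    """TC_ih of Eq. 3: summed cost change over all objects when medoid
    index i is replaced by non-medoid index h.  PAM accepts the swap
    only when the total cost is negative (TC_ih < 0)."""
    new_medoids = [h if m == i else m for m in medoids]
    return sum(nearest_cost(x, new_medoids, X, dist) -
               nearest_cost(x, medoids, X, dist) for x in X)

X = [0.0, 1.0, 2.0, 10.0, 11.0]                # 1-D objects for simplicity
dist = lambda a, b: abs(a - b)
print(total_swap_cost(X, [0, 3], 0, 1, dist))  # -1.0, so the swap is accepted
```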
However, the medoid mechanism used by the k-Medoids algorithm suffers from a major
problem of high computational cost (Ng and Han, 1994;
Han et al., 2001; Dunham, 2003).
Consequently, many extended medoid-based methods have been proposed to improve
on the k-Medoids algorithm, e.g., Clustering Large Applications (CLARA)
by Kaufman and Rousseeuw (1990) and Clustering Large Applications
based on Randomized Search (CLARANS) by Ng and Han (1994).
All the efforts above were proposed for clustering numerical data; no medoid-based
method had been proposed for clustering categorical data until the k-AMH algorithm.
k-AMH algorithm: The main differences between the k-AMH algorithm and
the other k-Modes-type algorithms are that (1) the objects themselves are
used as the centers (medoids) instead of modes and (2) the cost function
is maximized rather than minimized as in the k-Modes-type
algorithms. The basic idea of the k-AMH algorithm is to find k clusters in n
objects by first randomly selecting an object to be the medoid, h, for each cluster.
The next step is to iteratively test the objects x, one by one, against the
medoid, h. The replacement is based on Eq. 4 if the cost function
P, as described in Eq. 5 and subject to Eq. 5a, 6, 6a-b and 7, 7a-c,
is maximized:
The cost function P is described in Eq. 5:
Subject to:
where, W^{α}_{li} is a (k×n) partition matrix that denotes
the degree of membership of object i in the lth cluster, taking values
in [0, 1] as described in Eq. 6, subject to Eq. 6a and 6b:
subject to:
and:
Where:
• k (≤ n) is a known number of clusters
• H is the medoid such that [H_{1}, H_{2},…,H_{k}] ∈ X
• α∈[1, ∞) is a weighting exponent; this alpha is typically between 1.1 and 2.0 as introduced by Huang and Ng (1999)
• d(X_{i}, H_{l}) is the distance measure between the object X_{i} and the medoid H_{l}
D_{li} is another (k×n) partition matrix in which d_{li} contains
a dominant weighting value of 1.0 or 0.5. The dominant weighting values are
based on the value of w^{α}_{li} above. D_{li}
is described in Eq. 7, subject to Eq. 7a-c:
subject to:
The detailed descriptions and the proof of convergence of the medoid-based
k-AMH algorithm are provided in Seman et al. (2012).
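Since the full cost function P and its update rules (Eq. 4-7) are detailed in Seman et al. (2012), the following is only a hedged, generic sketch of a medoid-based categorical clustering loop in the spirit described above: random initial medoids, objects tried one by one as replacement medoids, and a swap kept only when an objective improves. A simple matching objective stands in for the actual k-AMH cost function, and all names are illustrative:

```python
import random

def simple_matching(a, b):
    """Simple matching dissimilarity: number of mismatched attributes."""
    return sum(x != y for x, y in zip(a, b))

def medoid_cluster(X, k, seed=0):
    """Hill-climbing medoid search: try every non-medoid object as a
    replacement medoid and keep a swap whenever the objective improves.
    The objective here is the negated total dissimilarity, so a larger
    value is better (mirroring the maximization in k-AMH, but NOT the
    actual k-AMH cost function)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(X)), k)

    def objective(meds):
        return -sum(min(simple_matching(x, X[m]) for m in meds) for x in X)

    best = objective(medoids)
    improved = True
    while improved:
        improved = False
        for h in range(len(X)):
            if h in medoids:
                continue
            for pos in range(k):
                trial = medoids[:]
                trial[pos] = h
                score = objective(trial)
                if score > best:
                    medoids, best, improved = trial, score, True
    return [X[m] for m in medoids]

X = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
print(sorted(medoid_cluster(X, 2)))  # [('a', 'x'), ('b', 'y')]
```

Note that, as in the k-AMH description, the final centers are actual objects from X rather than constructed modes, so no artificial centroid ever appears.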
RESULTS AND DISCUSSION
Evaluation methods: Since the results of the chosen k-Modes-type algorithms
have previously been reported and published, the discussion here
is restricted to comparing the k-AMH algorithm against the reported results
as follows:
• The results obtained by the Extended k-Modes algorithms as reported in He et al. (2007) for the Voting, Breast Cancer, Mushroom, Soybean, Lymphography and Zoo data sets
• The results obtained by the k-Population algorithm as reported in Kim et al. (2005) for the Soybean, Zoo, Credit and Hepatitis data sets
• The results obtained by the New Fuzzy k-Modes algorithm as reported in Ng and Jing (2009) for the Soybean, Zoo, Credit and Adult data sets
This evaluation is based on the average clustering accuracy scores obtained
from 100 experiments for each data set.
Table 1: Average clustering accuracy scores between the k-AMH algorithm and the Extended k-Modes algorithms
RVF: Relative Value Frequency, UAVM: Uncommon Attribute Value Matches, HI: Hybrid I, a combination of RVF and UAVM, HII: Hybrid II, a simplified version of Hybrid I

Table 2: Average clustering accuracy scores between the k-AMH algorithm and the k-Population algorithm
Thus, the misclassification matrix proposed by Huang (1998)
was used to obtain the clustering accuracy scores, and the performance of
the algorithms was measured by the clustering accuracy r defined by Huang
(1998) as described in Eq. 8:

r = (Σ_{i=1}^{k} a_{i})/n    (8)

where, k is the number of clusters, a_{i} is the number of instances
occurring in both cluster i and its corresponding class and n is the number
of instances in the data set.
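The clustering accuracy r of Eq. 8 is straightforward to compute (a minimal sketch; the names and counts are illustrative):

```python
def clustering_accuracy(a, n):
    """Clustering accuracy r (Eq. 8): the number of instances occurring in
    both cluster i and its corresponding class, summed over the k clusters
    and divided by the total number of instances n."""
    return sum(a) / n

# e.g., 3 clusters correctly capturing 47, 40 and 6 of 100 instances
print(clustering_accuracy([47, 40, 6], 100))  # 0.93
```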
k-AMH vs. Extended k-Modes algorithms: Table 1 shows
a comparison between the k-AMH algorithm and the Extended k-Modes algorithms.
This result was based on six categorical data sets: Voting, Breast Cancer, Mushroom,
Soybean, Lymphography and Zoo, most of which contain only two classes.
Overall, the k-AMH algorithm performs better than the Extended k-Modes
algorithms: it obtained the highest clustering accuracy scores
for four out of the six data sets and produced an equal performance on the
Lymphography data set. However, the k-AMH algorithm produced a slightly lower
clustering accuracy score on the Breast Cancer data set. The k-AMH algorithm also
performs better when clustering a data set with a large number of clusters,
e.g., the Zoo data set (seven clusters). In conclusion, the medoid-based
method implemented by the k-AMH algorithm shows impressive performance over
these data sets.
k-AMH vs. k-Population algorithm: Table 2 shows a
comparison between the k-AMH algorithm and the k-Population algorithm,
based on four categorical data sets: Soybean, Zoo, Credit and
Hepatitis. Overall, the k-Population algorithm performs better
than the k-AMH algorithm, which obtained the highest clustering
accuracy score for only one of the four data sets: for the Zoo data set,
the k-AMH algorithm scored 0.94 compared to only 0.87 for the k-Population
algorithm. This result again indicates that the k-AMH algorithm
performs better on data sets with a large number of clusters. In contrast,
the algorithm struggled to cluster data sets with two clusters/classes,
such as the Credit and Hepatitis data sets.
k-AMH vs. New Fuzzy k-Modes algorithm: Table 3 shows a comparison
between the k-AMH algorithm and the New Fuzzy k-Modes algorithm,
based on four categorical data sets: Soybean, Zoo, Credit and Adult.
Overall, the New Fuzzy k-Modes algorithm performs better than
the k-AMH algorithm, which obtained the highest clustering
accuracy score for only one of the four data sets; again the
Zoo data set (0.94 compared to only 0.82 for the New Fuzzy k-Modes
algorithm). For the Soybean data set, the New Fuzzy k-Modes algorithm obtained
the optimum clustering accuracy score of 100%, the same performance
as the k-Population algorithm.
Big O comparison: The algorithms' performance can also be evaluated in
terms of Big O analysis, which measures the efficiency of an algorithm's
execution time as a function of the input size. For clustering
categorical data, the time efficiency of the algorithms typically involves
the following factors:
• The number of clusters, k
• The number of attributes, m
• The number of categories, j
• The number of iterations, t
• The number of objects, n
Table 4 shows a comparison of categorical clustering algorithms
including the k-AMH algorithm. It shows that the k-AMH algorithm requires only
three factors, namely k, m and n, without factors j and t in its clustering
process.
Table 3: Average clustering accuracy scores between the k-AMH algorithm and the New Fuzzy k-Modes algorithm

Table 4: Big O comparison between the k-AMH algorithm and the other algorithms
The absence of j is because the algorithm uses the medoid-based method. Furthermore, the
t factor is not required because the number of iterations is fixed at (n-k). This
indicates that the time complexity of the k-AMH algorithm is still linear; therefore,
the algorithm is acceptable for clustering any categorical data.
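The resulting complexity O(km(n-k)) can be illustrated with a trivial operation count. This is only an illustrative sketch of the stated bound, with hypothetical parameter values, not a measurement from the paper:

```python
def kamh_operation_count(n, k, m):
    """Rough operation count for the stated O(k*m*(n-k)) bound:
    (n-k) fixed iterations, each comparing m attributes against
    each of the k medoids (illustrative only)."""
    return k * m * (n - k)

# e.g., 100 objects, 7 clusters (as in Zoo), 16 attributes
print(kamh_operation_count(100, 7, 16))  # 7 * 16 * 93 = 10416
```

For fixed k and m, the count grows linearly in n, which is the point made above.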
CONCLUSION
From the experimental results, the k-AMH algorithm can in general be used for
clustering any categorical data. Furthermore, the algorithm shows better
clustering performance when dealing with a large number of clusters. This
can be seen in the results for the Zoo data set with seven clusters, where the k-AMH
algorithm produced the best clustering accuracy scores against the other
two algorithms. This performance had already been demonstrated for clustering
Y-STR categorical data: the k-AMH algorithm produced the highest clustering
accuracy scores even for Y-STR data sets with eight and 14 clusters,
compared to the other eight clustering algorithms. In addition, the k-AMH algorithm
is also efficient for clustering categorical data, as its time complexity is
only O(km(n-k)). In conclusion, the k-AMH algorithm makes a significant contribution
to clustering categorical data.
ACKNOWLEDGMENT
This research was supported by the Fundamental Research Grant Scheme, Ministry
of Higher Education, Malaysia. We would like to thank RMI, UiTM for their support
of this research. We would also like to extend our gratitude to the many contributors
to the completion of this study, our research assistants: Syahrul, Azhari,
Kamal, Hasmarina, Nurin, Soleha, Mastura, Fadzila, Syukriah and Hazira.