INTRODUCTION
Data mining, a synonym for knowledge discovery in data, is an analytical
process of exploring data (usually large amounts of data) from various perspectives
and summarizing it so that we can focus on the most important information. Data
mining is also useful for discovering patterns in the data and establishing the
relations between them. Data mining techniques are used in a wide
variety of research areas including, but not limited to, biology, medicine, mathematics,
cybernetics, engineering, economics and marketing (Ayanzadeh
et al., 2011a; Webb, 2002). Data mining problems
including regression, prediction, classification, clustering, pattern mining,
feature selection and feature extraction have been the subject of extensive research
for several decades (Theodorodis and Koutroumbus, 2006;
Webb, 2002). Clustering and classification are two significant
and frequently used techniques which are extremely powerful for their respective
goals (Ayanzadeh et al., 2011a; Theodorodis
and Koutroumbus, 2006; Webb, 2002).
In the terminology of data mining, the core objective of classification and
clustering is to utilize a data set to train a system. Although classification
and clustering are often mentioned in the same breath, they are different analytical
methods. Most of the learning algorithms used in data mining perform either supervised
learning or unsupervised learning. Clustering is a form of unsupervised learning
and involves grouping data into categories based on some measure of inherent
similarity, in such a way that objects in the same category are more similar to
each other than to those in other categories. Classification is the corresponding
supervised procedure: a set of predefined categories exists
and the problem is to identify to which category a new observation belongs
(Theodorodis and Koutroumbus, 2006; Webb,
2002).
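As a purely illustrative sketch (not part of the proposed method), the distinction can be seen in a few lines of Python: k-means groups unlabeled points on its own, while a nearest-neighbour classifier assigns new observations to predefined classes. The data and parameters below are made up for the example.

```python
# Illustrative sketch: clustering (unsupervised) vs. classification (supervised).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]])

# Unsupervised: no labels are given; k-means discovers two groups by similarity.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: predefined labels train the model; a new observation is assigned a class.
y = np.array([0, 0, 1, 1])
classifier = KNeighborsClassifier(n_neighbors=1).fit(X, y)
prediction = classifier.predict([[7.5, 8.0]])
```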
Clustering algorithms fall into two major categories with respect
to how the number of clusters in the data set is determined. In the first group, the number
of clusters is a fixed parameter specified in advance
by the analyst, and the clusters are formed either from an initial guess at the
cluster centres or from random initial conditions. In the
second group, by contrast, the number of clusters is not specified beforehand; these clustering
algorithms detect the number of clusters from the structure and nature
of the training data set (Theodorodis and Koutroumbus, 2006;
Webb, 2002).
A large number of clustering methods have been introduced in the literature.
Hierarchical clustering algorithms, which produce nested clusters, are renowned
clustering methods used in a wide range of applications. Hierarchical clustering
can be either agglomerative, with fewer clusters at the higher levels, or divisive,
separating the n objects into more and finer groups in sequential steps
(Webb, 2002). Because the high computational complexity of
hierarchical clustering methods makes them too slow for large data sets,
more efficient variants must be found in order to cluster large data sets with
high dimensionality (Du, 2010; Chen,
2013; Nanda and Panda, 2013; Hongwei
et al., 2011). Well-known examples include the single-link hierarchical
algorithm, the complete-link hierarchical algorithm and the sum-of-squares hierarchical
algorithm (Nanda and Panda, 2013; Hongwei
et al., 2011).
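For reference, these three linkage criteria correspond to the "single", "complete" and "ward" options of SciPy's agglomerative clustering routines. The snippet below is only an illustrative sketch on synthetic data, not part of the experiments in this study.

```python
# Illustrative sketch: agglomerative hierarchical clustering with three linkage criteria.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

for method in ("single", "complete", "ward"):        # single-link, complete-link, sum-of-squares
    Z = linkage(X, method=method)                    # build the cluster tree (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two flat clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```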
Although these methods improve the efficiency of clustering, they still
perform poorly in real world scenarios where very
large data sets have to be processed (Nanda and
Panda, 2013; Hongwei et al., 2011). To address
this issue, intelligent numerical methods are used to handle the high complexity
and dimensionality of real world applications (Ayanzadeh
et al., 2009b; Setayeshi and Fadaei, 2011;
Shahamatnia et al., 2011). Evolutionary algorithms
have proven to be a very powerful mechanism for finding good solutions to
difficult problems (Hongwei et al., 2011). The
flexibility of Genetic Algorithms (GA) is one important aspect
to take into account; GAs are the best known intelligent techniques
applied to clustering problems (Hongwei et al.,
2011). Given the quality of the solutions that evolutionary algorithms
have produced across different fields and problems, it makes sense to apply
them to clustering applications more seriously.
Artificial Neural Networks (ANN) are a graph based numerical and intelligent
approach that has proved very useful in data mining, particularly
in classification and clustering applications. Owing to the desirable performance
of ANN in dealing with large scale data sets, they are widely used for real
world clustering applications (Du, 2010).
In the same way, Neuro-Fuzzy systems are also utilized for data mining purposes.
In fact, Neuro-Fuzzy systems are ANN based implementations of fuzzy systems that
can represent knowledge more properly. On this basis, ANFIS delivers outstanding
performance in real world clustering applications (Chen,
2013). However, it is worth noting that ANFIS cannot handle large scale
data sets and size reduction pre-processing is required (Engelbrecht,
2007; Jang et al., 2005; Chiou
and Lan, 2001). Cellular Automata (CA) and Learning Automata have also been
tried for data mining applications (Torkestani and Meybodi,
2010; Ayanzadeh et al., 2009a; Ayanzadeh
et al., 2010).
Honey Bees Foraging Optimization (HBFO) is a novel swarm intelligence approach
that can be employed to solve many types of problems. Owing
to the balance between local and global search in HBFO, it can potentially deliver
remarkable performance in real world problem solving (Ayanzadeh
et al., 2011b). Furthermore, fuzzy sets provide linguistic variables
to represent knowledge more properly (Ayanzadeh et
al., 2012). On this basis, in this study the honey bees foraging optimization
algorithm is enhanced with fuzzy concepts in evaluating the nectar amount
of food sources, yielding a novel hybrid intelligent clustering method. From
another perspective, fuzzy operators are injected into the structure of the HBFO
algorithm to discover more accurate clusters. Owing to the stable infrastructure
of HBFO and the capability of fuzzy linguistic variables in knowledge
representation, the proposed clustering approach provides more desirable results
than other meta-heuristics.
HONEY BEES FORAGING OPTIMIZATION
Honey bees are among the most popular and beneficial insects in the
world thanks to their ability to produce honey, which is used in various
foods and beverages as a sweetener and flavoring and in various medicinal traditions
for treatment. Honey bees build their hives in forests, mountains and cracks
of rocks. Each hive includes vertical combs made of wax, which consist of many
regular hexagonal chambers (Afshar
et al., 2007; Pham et al., 2006a).
Honey bees are social insects with a complex social life. Each hive might
host hundreds or thousands of individuals, yet there is only one queen per hive.
More interestingly, in honey bee societies only queens can be fertilized and
lay eggs for reproduction. This fertilization is performed during a
mating dance which occurs once in a lifetime (Ayanzadeh
et al., 2011b; Sunil and Craig, 2004).
Apart from the queen and a few hundred males, the main fraction of the population
consists of non-reproducing worker bees. For security reasons, the queen, males and workers must
identify each other so that stranger insects cannot enter the hive. This also
helps honey bees communicate, share knowledge about the location of food
sources and warn of attacks by foreign bees. All these message passing operations
are carried out through special dances (Afshar et al., 2007;
Ayanzadeh et al., 2011b).
The queen and male bees are responsible only for reproduction. Worker bees, on
the other hand, which have no mating capability, carry out a considerable
amount of duties including foraging, cleaning, building and maintaining the hive,
taking care of infant bees and so on. Foraging is one of the most important
of these tasks and must be performed well by the
workers (Afshar et al., 2007). To this end, honey
bees have developed two different foraging roles among the worker bees:
scout bees and forager bees. These bees use various body
language techniques to communicate with one another. Scout bees are selected
to search the whole environment for food sources. Afterwards, scout bees return
to the hive and share their findings with the other worker bees. It is
worth noting that scout bees have no initial information about the environment,
so they look for new food sources stochastically, by chance. During the
exploration flight, scout bees test the quality of the food sources (flowers). The
food sources are then evaluated based on different parameters such as the quality and amount
of nectar, the distance from the hive and the type of flowers (Pham
et al., 2006b; Haddad and Afshar, 2004).
After the first exploration flight, each scout bee is followed by some worker
bees to gather nectar. The number of followers is directly related
to the quality of the food source found by the scout: scout bees that find
better food sources are followed by more worker bees. Scout
bees, with the help of worker bees, gather the nectar of the flowers in the area (Ayanzadeh
et al., 2011b; Haddad and Afshar, 2004).
Worker bees that accompany the scout bees while collecting nectar also
exploit the neighbourhood to find better food sources in that area. If they
discover places with better food sources during this exploitation, they
take on the role of scout bees. These worker bees also share
their knowledge about the environment when they return to the hive, and they
will therefore be followed by other worker bees on the next flight (Afshar
et al., 2007).
Like other bio-inspired optimization algorithms, honey bees foraging
optimization is an optimization algorithm inspired by the natural foraging behaviour
of honey bees. Both honey bees optimization algorithms
(mating and foraging) are swarm intelligence techniques that have been extensively
used as search and optimization tools in various problem domains (Ayanzadeh
et al., 2011b). The honey bees foraging optimization technique is a computational
model of honey bee behaviour when searching for food sources (Afshar
et al., 2007).
In this method, solutions are treated as food sources and artificial honey
bees explore and exploit the search space to find the optimum solution. As in
traditional meta-heuristic methods such as genetic algorithms, potential
solutions are formulated as vectors of numbers. These food sources are initialized
randomly, usually with a uniform distribution (Afshar et
al., 2007; Ayanzadeh et al., 2011b).
Afterwards, the iterative optimization process begins and continues until one
of the termination criteria is met. Food sources are
evaluated based on their nectar amount (fitness value) and are then
categorized into two groups, desired and undesired: the K best food sources
are considered desired and the rest are called undesired
(Ayanzadeh et al., 2011b; Jaberi
et al., 2012).
Desirable food sources are exploited by a local search algorithm such as hill climbing
or simulated annealing, whereas undesirable food
sources are explored by global search techniques. In both the local and the global search,
when a better food source is found, it replaces the original one.
In other words, initial food sources can be replaced by the output of the exploration
and exploitation search (Ayanzadeh et al., 2011b;
Jaberi et al., 2011; Khosravani-Rad
et al., 2014; Marzban et al., 2014).
Local and global searches are conducted until a termination condition
is met; this condition might be either a number of iterations or finding
a food source with a predefined nectar amount. The number of worker bees that
follow scout bees during exploration is directly related to the quality
of the food sources found by the scouts. In standard simulations of honey bees
foraging optimization, the parameters of the local and global search algorithms
are assumed to be the same for desirable and undesirable food sources (Ayanzadeh
et al., 2011b; Haddad and Afshar, 2004).
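A minimal sketch of this loop is given below, assuming real-valued food sources, a hill-climbing style step for the desired group and a random restart for the undesired group; the function names, parameters and step sizes are illustrative assumptions, not the authors' implementation.

```python
# Sketch of honey bees foraging optimization (illustrative; parameter values are assumptions).
import numpy as np

def hbfo(nectar, dim, n_sources=20, n_desired=5, n_iters=100,
         lower=0.0, upper=1.0, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Food sources (candidate solutions) initialized uniformly at random.
    sources = rng.uniform(lower, upper, (n_sources, dim))
    fitness = np.array([nectar(s) for s in sources])

    for _ in range(n_iters):
        order = np.argsort(fitness)[::-1]          # higher nectar amount is better
        for rank, i in enumerate(order):
            if rank < n_desired:
                # Desired sources: local search (hill climbing) in the neighbourhood.
                candidate = sources[i] + rng.normal(0.0, step, dim)
            else:
                # Undesired sources: global search over the whole space.
                candidate = rng.uniform(lower, upper, dim)
            candidate = np.clip(candidate, lower, upper)
            f = nectar(candidate)
            if f > fitness[i]:                     # replace only if a better source is found
                sources[i], fitness[i] = candidate, f

    best = int(np.argmax(fitness))
    return sources[best], fitness[best]
```

For instance, `hbfo(lambda x: -np.sum((x - 0.3)**2), dim=4)` would drive the food sources toward the point 0.3 in every dimension.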
FUZZY HONEY BEES FORAGING OPTIMIZATION BASED CLUSTERING
As mentioned earlier, owing to the importance of clustering in
solving real world data processing problems, a wide variety of statistical and
intelligent clustering methods have been introduced. Techniques from different
branches of computer science and artificial intelligence, including hierarchical
clustering algorithms, Genetic Algorithms, Particle Swarm Optimization, Artificial
Neural Networks and Neuro-Fuzzy systems, have been applied for clustering purposes.
In this section, a novel hybrid clustering method based on honey bees foraging
optimization and fuzzy sets is proposed.
Structure: The proposed clustering method is founded on the honey bees foraging
optimization algorithm and is employed for clustering with a
pre-defined, fixed number of clusters. The algorithm searches the problem state
space to find the optimum cluster centres. The number of clusters, as a system parameter,
specifies the size of the food sources: in order to find the optimum positions of
C clusters in a problem with an R^n state space, each food source consists
of C×n elements. Food sources are initialized with random values and are then
evaluated in each iteration using the nectar amount function, which plays the same
role as the fitness function in genetic algorithms (Ayanzadeh
et al., 2011b; Jaberi et al., 2013).
More specifically, each food source is divided into C groups, each of which
represents the position of one cluster centre. In this sense, the clustering
problem is treated as an optimization problem, and the honey bees foraging
optimization algorithm is expected to find the minimum error (maximum likelihood)
of the nectar amount function. The nectar amount function therefore plays a key role,
turning an optimization process into a clustering paradigm.
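For example, decoding a food source of C×n elements into C candidate cluster centres in R^n can be as simple as a reshape; the snippet below is only a sketch of this encoding with made-up values of C and n.

```python
# Sketch: a food source of C*n real values encodes C cluster centres in R^n.
import numpy as np

C, n = 3, 4                                   # e.g., 3 clusters in a 4-dimensional space
food_source = np.random.default_rng(0).uniform(0.0, 1.0, C * n)
centres = food_source.reshape(C, n)           # row j is the centre of cluster j
```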
Nectar amount function: The main difference between the proposed method
and standard honey bees foraging optimization lies in how the amount of
nectar is evaluated. To this end, fuzzy operators are employed to map qualitative values
onto a quantitative scale; in other words, linguistic variables are used to
enhance the evaluation of the nectar amount.
In traditional clustering methods, each input pattern is assigned to the nearest
cluster. In the proposed method, however, input patterns belong
to all clusters with different membership rates. To achieve this, a Gaussian
fuzzifier is utilized to calculate the membership values. For example, to assess
the amount of nectar of a 1×6 food source vector in an R^3 state space, three
cluster centres are represented by the food source. The membership values of a
pattern X belonging to C1, C2 and C3 are calculated by Eq.
1:
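Assuming a standard Gaussian fuzzifier built on the Euclidean distance (the spread parameter σ is not given here and is an assumption), Eq. 1 can be written as:

$$\mu_{C_j}(X_i)=\exp\left(-\frac{\left\|X_i-C_j\right\|^{2}}{2\sigma^{2}}\right),\qquad j=1,\ldots,C \tag{1}$$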
where Xi is the ith pattern and Cj represents the centre
of the jth cluster. After calculating the Euclidean distances between the training data and
the cluster centres, the ith input sample is assigned to the cluster with the maximum
nectar amount (maximum membership value).
This type of clustering is clearly different from fuzzy clustering:
multi-membership is only taken into account in evaluating
the amounts of nectar, and every pattern is finally assigned to a single cluster.
The total clustering error for each cluster is the summation of the Euclidean distances
between each pair of patterns in that cluster (total internal distances).
Finally, the clustering error for a given food source is the summation
of the errors (internal distances) of all its clusters. Using this
technique, the algorithm converges to food sources with minimum clustering
error.
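A sketch of this nectar amount evaluation is given below. Only the Gaussian fuzzifier, the Euclidean distance and the sum of intra-cluster pairwise distances are specified above; the spread parameter sigma and the sign convention (higher nectar amount for lower error) are assumptions for illustration.

```python
# Sketch of the nectar amount evaluation (assumed form; sigma is an illustrative parameter).
import numpy as np

def nectar_amount(food_source, patterns, n_clusters, sigma=1.0):
    centres = food_source.reshape(n_clusters, -1)
    # Gaussian fuzzifier: membership of every pattern in every cluster.
    dists = np.linalg.norm(patterns[:, None, :] - centres[None, :, :], axis=2)
    memberships = np.exp(-dists**2 / (2.0 * sigma**2))
    # Each pattern is finally assigned to the cluster with maximum membership.
    labels = np.argmax(memberships, axis=1)
    # Clustering error: sum of pairwise Euclidean distances inside each cluster.
    error = 0.0
    for j in range(n_clusters):
        members = patterns[labels == j]
        if len(members) > 1:
            pair = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2)
            error += pair.sum() / 2.0           # each unordered pair counted once
    return -error                               # higher nectar amount = lower clustering error
```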
EXPERIMENTAL RESULTS
In this section, the results of implementing and testing the proposed clustering
algorithm are reported and compared with the results of an enhanced version
of the Genetic Algorithm. Three clustering problems
from the UCI data repository are used to evaluate the performance of the algorithm. The
selected datasets are Iris, Glass and Wine, in which all patterns carry a class
label.
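As an illustration only, Iris and Wine ship with scikit-learn, while Glass would typically be fetched from the UCI repository; the local file name used for Glass below is an assumption.

```python
# Sketch: loading the benchmark data (the Glass file location is an assumption).
import numpy as np
from sklearn.datasets import load_iris, load_wine

iris = load_iris().data            # 150 patterns, 4 features, 3 classes
wine = load_wine().data            # 13 features, 3 classes
# Glass is not bundled with scikit-learn; download glass.data from the UCI repository
# and load it locally, e.g.:
# glass = np.loadtxt("glass.data", delimiter=",")[:, 1:-1]   # drop Id and class columns
```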
Iris: In this test, the Iris data set is used because of its familiarity
and remarkable fame in classification and clustering studies; nearly every data
mining package comes with this data set. The Iris data set consists of three
types of plants used as class labels (Setosa, Versicolor and Virginica), with
50 instances per class, each represented by 4 features.
In this experiment, patterns are categorized into three clusters. Clustering
is performed using the proposed method and an enhanced version of the genetic algorithm.
Figures 1 and 2 illustrate the convergence of the
genetic algorithm and of the fuzzy honey bees foraging optimization clustering method,
respectively.
Fig. 1: Convergence of the enhanced genetic algorithm in clustering the Iris data set

Fig. 2: Convergence of fuzzy honey bees foraging optimization clustering in clustering the Iris data set

Fig. 3: Convergence of the enhanced genetic algorithm in clustering the Glass data set

Table 1: Maximum clustering errors obtained by the enhanced GA and FHBFO methods for the Iris data set

Table 2: Maximum clustering errors obtained by the enhanced GA and FHBFO methods for the Glass data set
Maximum clustering errors are also given in Table 1. It is
evident that both the enhanced genetic algorithm and the fuzzy honey bees foraging optimization
clustering technique demonstrate similar behaviour in clustering the Iris data
set.
Glass: In this simulation, another benchmark dataset from the UCI repository,
Glass, is chosen to evaluate the proposed clustering method.
The Glass dataset has 10 attributes used as features and 7 types of glass
that are here considered to be class labels.
For this dataset, patterns were categorized into 6 clusters. Figures
3 and 4 show the convergence of the enhanced genetic algorithm
and of the fuzzy honey bees foraging optimization clustering method,
respectively. Maximum clustering errors for both methods are given
in Table 2. As can be seen, the proposed method tends to be
more robust and efficient in clustering than the enhanced genetic algorithm.
Wine: The performance of the proposed algorithm was also tested on the Wine data
set from the UCI repository. This database was chosen because it has a greater
number of dimensions than Iris and Glass, so the validation is carried out on a
more complicated data set. The Wine dataset has 3 classes, 13 features and 178
instances.
In this evaluation, patterns are clustered into three groups.
Figures 5 and 6 depict the convergence of the enhanced
genetic algorithm and of the proposed hybrid method for clustering of the Wine data
set.
Fig. 4: Convergence of fuzzy honey bees foraging optimization clustering in clustering the Glass data set

Fig. 5: Convergence of the enhanced genetic algorithm in clustering the Wine data set

Fig. 6: Convergence of fuzzy honey bees foraging optimization clustering in clustering the Wine data set

Table 3: Maximum clustering errors obtained by the enhanced GA and FHBFO methods for the Wine data set
Table 3 presents the maximum clustering errors obtained on the
Wine data set by both methods. Fuzzy honey bees foraging optimization clustering
clearly performs better and provides more accurate clusters than the enhanced
genetic algorithm.
CONCLUSION
Considering the importance of clustering techniques for exploratory data analysis
in real world applications, in this study a novel hybrid method for
clustering was proposed. In this method, fuzzy operators were injected into the
infrastructure of the honey bees foraging optimization algorithm, and the
clustering problem was formulated as an optimization problem.
In the proposed method, the nectar amount function, which plays the same role
as the fitness function in genetic algorithms, is in charge of evaluating the clustering
error, converting the clustering problem into an optimization problem. In other words,
food sources represent the cluster centres and the amount of nectar corresponds
to the total clustering error. For each food source, the nectar
amount function calculates the summation of the Euclidean distances between every
pair of patterns in each cluster, and the clustering error is taken
as the total of these internal distances. Furthermore, fuzzy operators were utilized
to map qualitative linguistic variables into quantitative values.
To test the efficiency and reliability of the proposed clustering method, several
experiments were conducted on three datasets from the UCI repository, ranging
from a small number of classes and low-dimensional features to a larger number
of classes and higher-dimensional features. The experiments were intended to
assess the clustering accuracy of the proposed method and compare the results
with an enhanced version of the genetic algorithm. The experimental results show
that the proposed hybrid method produces more accurate and higher quality clusters
than the enhanced genetic algorithm.