Clustering is one of the most important steps in data mining; it is known for its phenomenal functionalities in complex real world applications including biology, basic science, medicine, engineering and social science. In this sense, owing to the remarkable effects of clustering on data mining area, wide varieties of clustering approaches have been introduced to cluster data into significant subsets in order to obtain useful information. In this study, a novel clustering method based on honey bees foraging optimization algorithm and fuzzy rules is proposed. In the proposed method, fine shade of local and global search in honey bees optimization algorithm is schemed to be applied to improve the clustering efficiency. Furthermore, fuzzy operators are employed to enhance the performance of new proposed approach and prevent premature convergence. To verify and validate the functionality proposed of method, new method is run on three known data sets of the UCI Machine Learning Repository. Results of clustering reveal that proposed method estimate more desirable clusters compared to the state of the art clustering methods. Moreover, this method appears very stable in multiple tests.
PDF Abstract XML References Citation
How to cite this article
Data mining, a synonym to data knowledge discovery, is an analytical process of exploring data (usually large amounts of data) from various perspectives and summarizing it so that we can focus on the most important information. Data mining is also useful to discover the various patterns between the data and establish the relations between them. Data mining techniques are used in wide variety of research areas including but not limited to biology, medicine, mathematics, cybernetics, engineering, economics and marketing (Ayanzadeh et al., 2011a; Webb, 2002). Data mining problems including regression, prediction, classification, clustering, pattern mining, feature selection and feature extraction have been subject of extensive research for several decades (Theodorodis and Koutroumbus, 2006; Webb, 2002). Clustering and classification are two significant and frequently used techniques which are extremely powerful in their respective goals (Ayanzadeh et al., 2011a; Theodorodis and Koutroumbus, 2006; Webb, 2002).
In the terminology of data mining, the core objective of classification and clustering is to utilize a data set to train a system. Although classification and clustering are often mentioned in the same breath, they are different analytical methods. Most of the learning algorithms used in data mining are either supervised learning or unsupervised learning. Clustering is one form of unsupervised learning and involves grouping data into categories based on some measure of inherent similarity such a way that objects in the same category are more similar to each other than to those in other categories. Classification is the corresponding supervised procedure. In classification, a set of predefined categories exists and the problem is identifying to which categories the new observation belongs (Theodorodis and Koutroumbus, 2006; Webb, 2002).
Clustering algorithms are sorted out into two major categories with respect to determining the number of clusters in data set. In first group, the number of clusters is assumed to be a fixed parameter which is specified in advance by the analyst and the clusters are formed from either an initial guess at the cluster centres, or from random initial conditions. On the contrary, in the second group, the number of clusters is not specified beforehand. These clustering algorithms detect the number of clusters concerning the structure and nature of training data set (Theodorodis and Koutroumbus, 2006; Webb, 2002).
There are large number of clustering methods introduced in the literature. Hierarchical clustering algorithms are renowned clustering methods used in wide range of applications in which nested clusters are produced. Hierarchical clustering can be either agglomerative, with fewer clusters at the higher level, or divisive which separate the n objects into more and finer groups in sequential steps (Webb, 2002). Due to high computational complexity of hierarchical clustering methods which makes them too slow for large data sets, optimal efficient must be found in order to cluster the large data sets with high dimensionality (Du, 2010; Chen, 2013; Nanda and Panda, 2013; Hongwei et al., 2011). Some of these good methods are single connection hierarchical algorithm, complete connection hierarchical algorithm and sum of squares hierarchical algorithm (Nanda and Panda, 2013; Hongwei et al., 2011).
Although the proposed methods improve the efficiency of clustering, they still suffer from undesirable functionalities in real world scenarios in which very large data sets are supposed to be processed (Nanda and Panda, 2013; Hongwei et al., 2011). To address this issue, intelligent numerical methods are used to handle high complexity and dimensionality of real world applications (Ayanzadeh et al., 2009b; Setayeshi and Fadaei, 2011; Shahamatnia et al., 2011). Evolutionary algorithms have been proven to be a very powerful mechanism in finding good solutions to difficult problems (Hongwei et al., 2011). The flexibility associated with Genetic Algorithms (GA) is one important aspect to take into account. GAs are the most well known intelligent techniques that are applied for clustering problems (Hongwei et al., 2011). With regard to the quality of the solutions that evolutionary algorithms have shown in different types of fields and problems, it makes sense to try to use them in clustering applications more seriously.
Artificial Neural Networks (ANN) are graph based numerical and intelligent approach that have served significant functionalities in data mining more preciously in classification and clustering applications. Owing to the desirable performance of ANN in dealing with large scale data sets, they are widely used for real world clustering applications (Du, 2010).
In the same way, Neuro-Fuzzy systems are also utilized for data mining purposes. In fact, Neuro-Fuzzy systems are ANN based implementation of fuzzy systems that can represent the knowledge more properly. On this basis, ANFIS serves ourstaning performance in real world clustering applications (Chen, 2013). However, its worth noting that ANFIS can not handle large scale data setes and size reduction pre-processes are required (Engelbrecht, 2007; Jang et al., 2005; Chiou and Lan, 2001). Cellular Automata (CA) and Learning Automata are also being tried for data-mining applications (Torkestani and Meybodi, 2010; Ayanzadeh et al., 2009a; Ayanzadeh et al., 2010).
Honey Bees Foraging Optimization (HBFO) is a novel swarm intelligence approach that can be employed for solving many types of problems. In other words, due to the harmony between local and global search in HBFO, it potentialy can serve phenomenal functionality in real world problem solving (Ayanzadeh et al., 2011b). Furthermore, fuzzy sets provide linguistic variables to represent the knowledge more properly (Ayanzadeh et al., 2012). In this sense, in this study honey bees foraging optimization algorithms are enhanced using fuzzy concepts in evaluating the nectar amount of food sources to propose a novel hybrid intelligent clustering method. From another perspective fuzzy operators are injected into the structure of HBFO algorithm to explore more accurate clusters. Regarding to the stable infrastructure of HBFO and due to the capability of fuzzy linguistic variables in knowledge representation, proposed clustering approach provides more desirable results comparing to other meta-heuristics.
HONEY BEES FORAGING OPTIMIZATION
Honey bees are one of the most popular and beneficial insects all over the world thanks to their capabilities to produce honey which is used in various foods and beverages as a sweetener and flavorings and in various medicinal traditions for treatment. Honey bees build their hives in forests, mountains and crack of rocks. Each hive includes vertical parts which are made up of wax. These vertical parts consist of several regular hexagons chambers (Afshar et al., 2007; Pham et al., 2006a). Honey bees are social insects and have complex social life. Each hive might host hundreds or thousands of individuals. There is only one queen in each hive. More interestingly, in honey bee societies, only queens can be fertilized and lay eggs for reproduction purposes. This fertilization is performed during a mating dance which is occurred once in a lifetime (Ayanzadeh et al., 2011b; Sunil and Craig, 2004).
Other than queen and few hundred males, main fraction of the population is unproductive worker bees. For security reasons, queen, males and workers must identify each other not to let stranger insects to come in the hive. This also helps honey bees to communicate, share knowledge about the location of food sources and alert any attacks by foreign bees. All these message passing operations are done by some special dances (Afshar et al., 2007; Ayanzadeh et al., 2011b).
Queen and male bees are responsible only for reproduction. On the other side, worker bees that have no mailting capability are supposed to do a considerable amount of duties including foraging, cleaning, building and maintaining hive, taking care of infant bees and so on. Foraging appears as one of the most important aspects of the system of animate nature which must be undertaken perfectly by workers (Afshar et al., 2007). To this end, honey bees have developed two different responsibilities among worker bees in regards to food production: Scout bees and forager bees. These bees use various body language techniques to communicate with one another. Scout bees are selected to search whole environment to seek food sources. Afterward, scout bees return to the hive and share the exploration findings among other worker bees. It is worth to note that scout bees have no initial information about the environment; so, they look for new food sources by chance and stochastically. During the exploration flight, scout bees test the quality of food sources (flowers). Then, food sources are evaluated based on different parameters like quality and amount of nectars, distance from the hive and type of the flowers (Pham et al., 2006b; Haddad and Afshar, 2004).
After the first exploration flight, each scout bee is followed by some worker bees to gather the nectar. Apparently, number of followers is directly related to the quality of food sources that have been found by scouts. That is, those scout bees that find better food sources are followed by more worker bees. Scout bees with the help of worker bees gather nectar of flowers in the area (Ayanzadeh et al., 2011b; Haddad and Afshar, 2004).
Worker bees that accompany the scout bees when they are collecting nectar, exploit the neighbourhood to find better food soutces in that area. Accordingly, if they discover places with better food sources during the exploitation, they would play the role of scout bees. On the other side, worker bees also share their knowledge about the environment when they return to the hive. Thus, these worker bees will be followed by other worker bees in the next flight (Afshar et al., 2007).
Similar to all other bio-inspired optimization algorithms, honey bees foraging optimization is an optimisation algorithm inspired by the natural foraging behaviour of honey bees to find the optimal solution. Both honey bees optimization algorithms (maiting and foraging) are considered swarm intelligence that have been extensively used as search and optimization tools in various problem domains (Ayanzadeh et al., 2011b). Honey bees foraging optimization technique is a computational model of honey bees behaviour when they are searching for food sources (Afshar et al., 2007).
In this method, solutions are assumed as food sources and artificial honey bees explore and exploit the search spave to find optimum solution. In the same way, with traditional meta-heuristic methods, such as genetic algorithms, potential solutions are formulated by vectors of numbers. These food sources are initialized randomly usually with uniform distribution (Afshar et al., 2007; Ayanzadeh et al., 2011b).
Afterward, iterative optimization process is begun. The iteration process will continue until one of the criterias for termination is met. Food sources are evaluated based on nectar amount (fitness value). Afterward, food sources are categorized into two desired and undesired groups. From another perspective, first K best food sources are assumed as desired and rests are called undesired (Ayanzadeh et al., 2011b; Jaberi et al., 2012).
Desirable food sources are obtained by a local search algorithm like hill climbing or simulated annealing heuristics and in the opposite side, undesirable food sources are explored by global search techniques. In both local and global searches, when a better food source is found, it is substituted with primary food sources. In better words, initial food sources can be replaced by the output of exploration and exploitation search (Ayanzadeh et al., 2011b; Jaberi et al., 2011; Khosravani-Rad et al., 2014; Marzban et al., 2014).
Local and global searches will be conducted until one termination condition is met. Termination condition might be either the number of iterations or finding a food source with predefined nectar amount. The number of worker bees that follow scout bees during exploring has a direct relationship with the quality of food sources that have been found by scouts. Parameters of local and global search algorithms in standard simulations of honey bees foraging optimization are supposed to be the same for desirable and undesirable food sources (Ayanzadeh et al., 2011b; Haddad and Afshar, 2004).
FUZZY HONEY BEES FORAGING OPTIMIZATION BASED CLUSTERING
As mentioned earlier, owing to the important functionality of clustering in solving real world data processing problems, wide variety of statistical and intelligent clustering methods have been introduced. On this basis, different branches of computer science and artificial intelligence including hierarchical clustering algorithms, Genetic Algorithm, Particle Swarm Optimization, Artificial Neural Networks and Neuro-Fuzzy systems are applied for clustering purposes. In this section, a novel hybrid clustering method based on honey bees foraging optimization and fuzzy sets is proposed.
Structure: Proposed clustering method is founded on honey bees foraging optimization algorithm. Indeed, this method is employed for clustering with pre-defined and fixed number of clusters. This algorithm seeks problem state space to find optimum clusters canters. Cluster numbers, as a system parameter specifies the size of food sources. In order to find the optimum positions of C clusters in a problem in Rn state space, food sources will consist of Cxn elements. Food sources are initialized by random values. Then, food sources are evaluated in each iteration using nectar amount function which is performed similar to fitness function in genetic algorithms (Ayanzadeh et al., 2011b; Jaberi et al., 2013).
More specifically, food sources are divided into C groups in which each group represents the position of a cluster. Apparently, in this sense, the clustering problem is considered as an optimization problem. Thus, honey bees foraging optimization algorithm is supposed to find the minimum error (maximum likelihood) of nectar amount function). Undoubtedly, nectar amount function serves key functionality and changes an optimization process to a clustering paradigm.
Nectar amount function: The main difference between proposed method and standard honey bees foraging optimization lies in examining the amount of nectar. To this end, fuzzy operators are employed to map qualitative values into a quantitative scale. In other words, linguistic variables are used to enhance the evaluation of nectar amount.
In traditional clustering methods, each input pattern is assigned to be blong to the nearest cluster. In proposed method, however, input patterns are belong to all clusters with different membership rates. To achieve this goal, Gaussian fuzzifier is utilized to calculate the membership values. For example, to assess amount of nectar in vector X1x6 in R3 state space, three cluster centers are represented as food sources. Thus, membership value of belonging X to C1, C2 and C3 are calculated by Eq. 1:
where, Xi is ith pattern and Cj represents the centre of jth cluster. After calculating Euclidean distance between training data and centres of clusters, ith input sample will be assigned to the cluster with maximum nectar amount (maximum membership value).
Apparently, this type of clustering is completely different from fuzzy clustering aspects. More narrowly, multi-belonging is taken into account in evaluating the amounts of nectar and every pattern is finally assigned to a unique cluster.
Total clustering error for each cluster is summation of Euclidean distances between each pair of patterns in the specified cluster (total internal distances). Finally, clustering error for an instance food source will be the summation of all corresponding clusters errors (internal distances). Using this technique, the algorithm will be converged to food sources with minimum clustering error.
In this section, results of implementing and testing the proposed clustering algorithm along with its comparison with the results of an enhanced version of Genetics Algorithm are mentioned. In this work, three clustering problems of UCI data repository are used to evaluate the performance of algorithm. The selected datasets are Iris, Glass and Wine in which, all patterns have class title.
Iris: In this test, Iris data set is used because of its familiarity and remarkable fame in classification and clustering methods. Nearly every data mining package comes with this data set. The Iris data set consists of three types of plants which are class labels (Setosa, Versicolor and Virginica), with 50 instances per each class, represented by 4 features.
In this experiment, patterns are categorized into three clusters. Clustering is performed using proposed method and an enhanced version of genetic algorithm. Figure 1 and 2 illustrate convergence of genetic algorithm and fuzzy honey bees foraging optimization clustering method for clustering, respectively.
|Fig. 1:||Convergence of enhanced genetic algorithm in clustering of Iris data set|
|Fig. 2:||Convergence of fuzzy honey bees foraging optimization clustering in clustering of Iris data set|
|Fig. 3:||Convergence of enhanced genetic algorithm in clustering of Glass data set|
|Table 1:||Maximum clustering errors obtained by enhanced GA and FHBFO methods for Iris data set|
|Table 2:||Maximum clustering errors obtained by enhanced GA and FHBFO methods for Glass data set|
Maximum clustering errors are also given in Table 1. It is obvious that both enhanced genetic algorithm and fuzzy honey bees foraging optimization clustering techniques demonstrate similar behaviour in clustering Iris data set.
Glass: In this simulation another benchmark dataset from UCI repository named Glass is chosen to evaluate the functionality of proposed clustering method. The Glass dataset has 10 attributes which are features and 7 types of glasses that here are considered to be class labels.
For this dataset patterns were categorized into 6 clusters. Figure 3 and 4 show convergence of enhanced genetic algorithm and fuzzy honey bees foraging optimization clustering methods for clustering, respectively. Again, maximum clustering errors for both methods are demonstrated in Table 2. As can be seen, the proposed method tends to be more robust and efficient in clustering than enhanced genetic algorithm.
Wine: Performance of proposed algorithm was tested based on Wine data set from UCI repository. The decision for choosing this database was motivated by the fact that this dataset has greater number of dimensions compare to Iris and Wine. In this way, validation is carried out on more complicated data set. Wine dataset is known to have 3 classes with 13 features and 187 numbers of instances.
In this evaluation, patterns are supposed to be clustered into three groups. Figure 5 and 6 depict convergence of enhanced genetic algorithm and the proposed hybrid method for clustering of Wine data set.
|Fig. 4:||Convergence of fuzzy honey bees foraging optimization clustering in clustering of Glass data set|
|Fig. 5:||Convergence of enhanced genetic algorithm in clustering of Wine data set|
|Fig. 6:||Convergence of fuzzy honey bees foraging optimization clustering in clustering of Wine data set|
|Table 3:||Maximum clustering errors obtained by enhanced GA and FHBFO methods for Wine data set|
Table 3 presents the maximum clustering errors of clustering the Wine data set using both aforementioned methods. Apparently, fuzzy honey bees foraging optimization clustering serves more desirable functionality and provides more accurate clusters than enhance genetic algorithm.
Considering the importance of clustering techniques for exploratory data analysis in solving real world applications, in this study a novel hybrid method for clustering was proposed. In this method fuzzy operators were injected into the infrastructure of honey bees foraging optimization algorithm. In some sense, clustering challenges are formulated as optimization problems.
In this proposed method, nectar amount function which serves same functionality as fitness function in genetic algorithms, is in charge of evaluating clustering error to convert clustering problem in to optimization problem. In other words, food sources represent centres of clusters and amount of nectar was supposed to be total clustering error. From another attitude, for each food source, nectar amount function calculates the summation of euclidean distances between every two patterns for all patterns in a cluster and clustering error was assumed as total value of internal distances. Furthermore, fuzzy operators were utilized to map qualitative linguistic variables into quantitative values.
To test the efficiency and reliability of our proposed clustering method, several experiments were conducted with three datasets from the UCI database ranging from a small number of classes and small-dimensional features to a large number of classes and large-dimensional features. The experiments were intended to assess the clustering accuracy of the proposed method and compare the results with an enhanced version of genetic algorithms. The experimental results show that the proposed hybrid method produces more accurate and higher quality clusters in comparison with an enhanced genetic algorithm.
- Afshar, A., O.B. Haddad, M.A. Marino and B.J. Adams, 2007. Honey-bee mating optimization (HBMO) algorithm for optimal reservoir operation. J. Franklin Inst., 344: 452-462.
- Torkestani, J.A. and M.R. Meybodi, 2010. Clustering the wireless Ad Hoc networks: A distributed learning automata approach. J. Parallel Distrib. Comput., 70: 394-405.
- Ayanzadeh, R., K. Hassani, Y. Moghaddas, H. Gheiby and S. Setayeshi, 2009. Innovative approach to generate uniform random numbers based on a novel cellular automata. J. Applied Sci., 9: 4071-4075.
- Ayanzadeh, R., E. Shahamatnia and S. Setayeshi, 2009. Determining optimum queue length in computer networks by using memetic algorithms. J. Applied Sci., 9: 2847-2851.
- Ayanzadeh, R., A.S.Z. Mousavi and S. Setayeshi, 2011. Fossil fuel consumption prediction using emotional learning in Amygdala. Proceedings of the 19th Iranian Conference on Electrical Engineering, May 17-19, 2011, Tehran, Iran, pp: 1-6.
- Ayanzadeh, R., A.S.Z. Mousavi and E. Shahamatnia, 2012. Fuzzy cellular automata based random numbers generation. Trends Applied Sci. Res., 7: 96-102.
- Ayanzadeh, R., A.S.Z. Mousavi and H. Navidi, 2011. Honey bees foraging optimization for mixed Nash equilibrium estimation. Trends Applied Sci. Res., 6: 1352-1359.
- Du, K.L., 2010. Clustering: A neural network approach. Neural Networks, 23: 89-107.
- Jaberi, A., R. Ayanzadeh, E. Raisi and A. Sadighi, 2013. Central limit theorem based cellular automata for generating normal random numbers. Inform. Technol. J., 12: 2440-2446.
- Jaberi, A., R. Ayanzadeh and A.S.Z. Mousavi, 2012. Two-layer cellular automata based cryptography. Trends Applied Sci. Res., 7: 68-77.
- Chen, M.Y., 2013. A hybrid ANFIS model for business failure prediction utilizing particle swarm optimization and subtractive clustering. Inform. Sci., 220: 180-195.
- Pham, D.T., A. Ghanbarzadeh, E. Koc, S. Otri, S. Rahim and M. Zaidi, 2006. The bees algorithm-a novel tool for complex optimization problems. Proceedings of the 2nd International Virtual Conference on Intelligent Production Machines and Systems, July 3-14, 2006, Cardiff University, UK., pp: 454-459.
- Nanda, S.J. and G. Panda, 2013. Automatic clustering algorithm based on multi-objective Immunized PSO to classify actions of 3D human models. Eng. Appl. Artifi. Intell., 26: 1429-1441.
- Shahamatnia, E., R. Ayanzadeh, R.A. Ribeiro and S. Setayeshi, 2011. Adaptive imitation scheme for memetic algorithms. Adv. Inform. Commun. Technol., 349: 109-116.
- Chiou, Y.C. and L.W. Lan, 2001. Genetic clustering algorithms. Eur. J. Operat. Res., 135: 413-427.
- Hongwei, Z., Y Zhenyu and Z Shurong, 2011. Multi-objective supervised clustering GA and wintry climate forecast. Proc. Eng., 15: 1580-1584.