Clustering is one of the most effective methods for extracting texts which is in the direction of the effective organization and summarization of developed documents.
Two general classifications of clustering methods are available: Agglomerative
hierarchical method (Jain and Dubes, 1998) and partitional
method (Li and Luo, 2008). In initialization, the algorithms
of Agglomerative Hierarchical Clustering (AHC) considers each document as a
cluster and use different kinds of distance function to calculate the similarity
between pairs of clusters then merges them together. The step of merging is
repeated so that a suitable number is obtained. In comparison with the bottom-up
method of AHC, the family of k-means algorithms which belongs to partitional
algorithms creates a single-surface division of documents. Every document is
assigned to a cluster based on a distance criterion (between the document and
each central k) after selecting k of initial center and the central k is calculated
again. We repeat this step so that we can obtain an optimum collection of k
cluster based on a criterion function.
Considering the high volume of operations of hierarchical algorithms on large
text database, partitional clustering algorithms are used due to high quality
and low computational requirements. a key feature of partitional clustering
is that as criterion function is general the optimization obtained is a product
of the whole clustering process. The purpose of the criterion function is to
optimize different aspects of intra-cluster similarity and inter-cluster distinction
and their combinations.
A famous similarity criterion is cosine function which is widely used in the
clustering algorithms of documents (Steinbach et al.,
2000). A cosine function can be used in k-means algorithms for assigning
every document to a cluster or to the most similar cluster center. As the cosine
function measures just the similarity between two documents, when clusters are
not well separated they will not have suitable performance. To prevent these
problems we use the concepts of neighbor and link to cluster documents (Guha
et al., 1999).
Lack of attention of the existing retrieval solutions to semantic similarity
among words the use of just similar words among documents are among big challenges
in clustering texts. We need background knowledge for overcoming this problem
and enriching the existing algorithms which our source of such knowledge is
ontology (Bloehdom and Hotho, 2004; Hirst
and St-Onge, 1997; Hotho et al., 2003; Mao
and Chu, 2002; Jurisica et al., 2004). Ontology
is a description of concepts and relationships which was defined for sharing
In this study, we used semantic relations among words in weighing words frequency again in the vector space model of documents and then applied the concepts of neighbor and link on the obtained model and studied its performance on k-means algorithms in different aspects. The results obtained from applying the proposed method on a set of real data show that its performance is very suitable in relation to previous methods.
The issue of clustering documents is defined as follows: we plan to divide a collection of documents into a predetermined collection of clusters in a way that the documents assigned to each cluster have the most similarity compared with the documents assigned to other different clusters. In other words, the documents of a cluster present a similar issue and the documents of different clusters present different issues.
Vector space model: In most document clustering algorithms, we put forward
vector space model using documents (Van Rijsbergen, 1979).
In this model every document d is considered as a vector in a word space and
presented with word frequency vector:
We can define center vector cj using a set Cj of documents and its corresponding vector as follows:
Cosine similarity criterion: One of the most famous criteria in obtaining
the similarity among documents is cosine function (Steinbach
et al., 2000) which its value is zero when the two documents are
completely similar and is one if there is no similarity and in other cases the
value is between zero and one. If we suppose the two documents di
and dj it can be calculated as follows:
|| K-means algorithm for text document clustering
Concepts of neighbor and link: To calculate the neighbor of two documents,
we say the two documents are neighbors if their similarity is more than a threshold
limit like this (Guha et al., 1999):
In which the threshold value O is a parameter which is defined by the user.
Then we use a neighbor matrix for a set of data containing n document which
is a neighbor matrix nxn every input M (i,j) is either one or zero depending
on the point that whether di and dj are neighbors or not.
The value of link function link(di,dj) is defined as number
of joint neighbor between di and dj based on the following
relation (Guha et al., 1999):
K-means algorithm for text clustering: Figure 1 shows operation of K-means algorithm on text document clustering briefly.
Semantic similarity between documents based on ontology
Semantic vector space model: Disuse of semantic relations among words in presenting text data is the main difficulty of vector space model based on word. In this model, the importance of each word is weighted based on its existence in the documents and words behave independently, however they are semantically related. To overcome the weakness of the above model and for better organization of documents, we used semantic vector space model.
Conceptual features were introduced in (Bloehdom and Hotho,
2004; Hotho et al., 2003) before. It is a
collection of a few words which describes the same high level concept.
Three methods are introduced (Bloehdom and Hotho, 2004;
Hotho et al., 2003) to apply the conceptual features
in the vector model as below:
||Adding conceptual features to the term space (term + concepts)
||Substituting the related terms with conceptual features
||Reducing the vector space model dimension by only conceptual
features instead of terms
Then a phrase based model is used by Jurisica et al.
(2004). It includes extracting the phrases of the documents and calculating
the similarity between every couple of phrases. The both methods (based on the
term and based on the phrase) change the main dimension of vector space model.
The results of the previous researches show that only the (term + concepts)
method between the term based models improves the performance rather than the
base model but this method has some problems such as redundancy of the information
and the dimensions. Besides that the problem of the phrase based model is extracting
meaningful terms of the document which is a difficult process practically.
The model used in this article doesnt change the main dimensions of the
vector space model (Jing et al., 2010). In contrast,
it uses semantic knowledge effectively in weighing words again in the vector
space based on the term. So the new weight of each word in the ith document
is calculated as follows:
where, χij1 is the main weight of the word tj1 and χij2 is the main weight of all word tj2 in the vector space model which are related to tj1 and δj1j2 is the similarity between the word tj1 and tj2.
This relationship shows that if a word is related to other words a lot, its weight increases considerably. So based on this information, the dependence among words is created in semantic vector space model. Now we describe how to calculate δ.
Calculation of dependence of two words: We calculated the semantic similarity
between two words using WordNet ontology which is an internet semantic dictionary
(Miller, 1995). We need the similarity among concepts
related to each word to obtain the similarity among words. The semantic similarity
between c1 and c2 can be calculated through tree hierarchical structure in WordNet
which there are different attitudes about it. Our proposed method in this article
is a combination of the weight of the position of the concepts (Zhao,
1996) and the weight of the distance among concepts (Kolodner,
1993) which you can see in Eq. 7:
where, cd(c1,c2) equals the depth of common parent of c1 and c2 and l(c1,c2) is the length of the distance between the two concepts in ontology tree. Obviously, the more the depth of the common parent of the two concepts and the less the distance between them, the more the similarity. The denominator of the fraction is for making distinctions between the states with equal distance but with different positions relating to their common parent. Look at Fig. 2 for better explaining this relation. we obtain similarity of three couple (F,O) (K,N) and (M,N) according to the formula (7).
|| A sample ontology
As it is clear in this tree, couples (F, O) and (K, N) have the same common parent depth and the same length path. But they are different as the amount of the similarity. Therefore to make difference between these models, we valuate the denominator of the fraction by the maximum length of the node to the common parent.
The couple nodes (M, N) are more similar rather than the other couple nodes because they are in the lower level of the ontology tree and therefore they have more special values. The diversity between the different models are shown in the concluded results clearly.
Finally we make them normal to use the obtained similarity value better in
Calculating the similarity among words is the next step. The semantic similarity among words is presented using the maximum similarity of the concepts related to those words . If we suppose that words w1 and w2 contain a and b concepts, respectively the relationship (9) can show the semantic similarity between the two words:
And if the two documents d1 and d2 have n and m words of WordNet, their semantic similarity is defined as follows:
Using semantic neighbors in clustering with k-means algorithm
Semantic neighbors: As in the semantic vector space model, we took into consideration the semantic relationships among words in the vector space, so this similarity among documents is not just based on their common words but contains their common concepts despite the existence of different words. Therefore the obtained neighbor is more accurate than the previous methods. The results obtained from applying semantic vector space model in calculating inter-document neighbor, called as semantic neighbor, in order to use them in k-means clustering algorithm, shows that it has a better performance than the previous methods.
Use of neighbor and link in k-means clustering: In this section we explain how to use suggested method for applying in the k-means clustering algorithm steps in short.
Determining the centers of initial clusters: As the efficiency of k-means algorithms are very sensitive to the selection of initial centers, their suitable distribution will be very effective in the obtained results.
We suppose that a clusters documents are more similar to each other rather than different clusters documents in the clustering. Therefore a good choice of a point to be the clusters center, should be close enough to a special group of the documents. More ever, it should be separated from other centers truly. To evaluate the number of the documents which are more similar to the current document, we use the neighbors' numbers of a document in a data set with adjusting a proper similarity threshold θ.
As both cosine and link functions can calculate the amount of the similarity of two documents, we apply both of them to evaluate the dissimilarity of two documents which are the candidates of the initial centers.
So selecting initial centers is done based on three values: pairwise similarity
calculated by cosine function, link function value and number of the neighbors
of the documents in the set of data. This combination leads to a high quality
selection of initial centers (Luo et al., 2009).
Determining similarity criterion: In order to determine the nearest cluster center for each document during the step of cluster identification, a criterion of assessment is needed. This similarity criterion is a combination of the functions of cosine and link. If we add the link criterion, the pairwise similarity of documents is improved and the documents available in the neighbor can improve the precision between the document and the cluster center. This relation is shown in formula (11):
where, Lmax is the largest value of link(di,cj)
and α, that is between 0 and 1, is the coefficient set by the user (Luo
et al., 2009).
EXPERIMENTS AND RESULTS
We used a set of real-world text data to show the improved performance of our
proposed method and compared the results of the obtained clustering with the
basic k-means algorithm and previous methods based on neighbors.
|| Features of data sets
We also used Visual C# 64-bit operating system of windows vista with a 6.00
GB memory and a 2.40 GHz processor.
Data sets: We used 7 sets of data obtained from two different kinds
of text databases. The first two sets were chosen from the database of Reuter-21578
Distribution 1.0 (Lewis, 2004) with the subject of TOPIC.
The next 5 sets were extracted from a Classic text database which the first
two sets related to MEDLINE and the next 3 sets related to the dataset CISI.
Their specifications are summarized in Table 1.
In addition we used a number of steps for preprocessing data which are as follows:
conversion of documents into a set of words, omission of stop words and finding
the origin of the words using porter stemmer (Porter, 1980).
Moreover, any word stem that occurred fewer than two documents in one collection
was eliminated. Finally, we use word stems as features or dimensions of the
document vector space model.
Clustering results: We used F-measure criterion for the assessment of
the accuracy of the proposed algorithm and for its comparison with the previous
methods. The F-measure is a harmonic combination of the precision and recall
values used in information retrieval (Van Rijsbergen, 1979).
Precision of cluster Ci is defined to be the number of texts correctly
assigned divided by the total number of texts in cluster Ci. We define
Recall of cluster Ci to be the number of texts correctly assigned
divided by the total number of texts that should be assigned. Let P be average
precision. Let R be average recall. F-measure is defined to be 2 RP/(R+P).
The obtained results are shown in Fig. 3 in which we had an average of 10 runs for each method and each dataset.
The first column of the diagram (KM) is the results of basic k-means algorithm and the next three columns are related to the previous studies in the field of using neighbor and link in clustering through k-means algorithm. the first (NC), second (NS) and third (NCS) columns are related to the use of neighbor in determining initial centers, similarity criterion and determining initial centers and similarity criterion in combination, respectively.
The last three columns show the results obtained from our proposed method. The fifth column (SNC) is the result of using semantic neighbors in determining initial centers, the sixth column (SNS) the result of using semantic neighbors in similarity criterion and the last column (SNCS) is the result of using semantic neighbors in the two methods mentioned in combination.
As you can see in the diagram, in all cases, applying the concept of semantic neighbor in determining initial centers and also in its combination with similarity criterion, a considerable improvement was observed compared with the all mentioned methods. Using semantic neighbor alone in determining similarity criterion is comparable with the best previous work in most of cases. And it is more accurate than other previous methods.
||Comparison of the result of suggested method with basic K-means
algorithm and previous methods
In this study, we presented semantic neighbor method to improve the performance of k-means algorithms in which we used the concept of semantic similarity in reweighing word frequency in the vector space model of documents. The results obtained from our proposed method in the set of real-world text data shows a considerable increase in the precision of clustering compared with the basic clustering algorithm and previous methods. In future studies, we plan to investigate the concept of neighbor more detailed and present a method to make it more accurate in order to improve the clustering results as much as possible.