INTRODUCTION
Text Categorization (TC) is one of the important problems in the Information
Retrieval (IR) and data mining communities, owing to the significance of
natural language text, the huge amount of text stored on the Internet and
the availability of digital libraries and document corpora. Moreover, the
importance of TC continues to grow because it concerns the processing and
classification of natural language text using different techniques, which
in turn makes retrieval and other text manipulation processes easier to
carry out.
Many TC techniques exist, such as decision trees (Quinlan, 1993), Support
Vector Machines (SVM) (Joachims, 1998), rule induction (Moulinier et al.,
1996), Neural Networks (Wiener et al., 1995) and k-Nearest Neighbor (KNN)
(Yang, 1999).
In this study, we focus on a text similarity strategy known as the Vector
Space Model (VSM) in order to compute the similarity between incoming text
(new test cases) and the pre-categorized text in the training data set. We
use the KNN algorithm to classify incoming text into one of the categories.
Generally, TC based on text similarity goes through two steps: similarity
measurement and classification assignment.
Term weighting is one of the well-known concepts in TC; it can be defined
as a factor given to a term in order to reflect the importance of that
term. There are many term weighting approaches, including Inverse Document
Frequency (IDF) and Weighted Inverse Document Frequency (WIDF) (Tokunaga
and Iwayama, 1994). Both IDF and WIDF focus on term occurrences inside a
text corpus; however, WIDF distinguishes between two terms that have
different occurrence counts, whereas IDF treats both terms equally.
Since TC stands at the crossroads of modern IR and Machine Learning (ML),
several research papers have focused on it, each concentrating on one or
more issues related to the task. Some works (Deng et al., 2004; Debole and
Sebastiani, 2003) have focused on the different term weighting approaches
related to TC, such as Term Frequency (TF), WIDF, IDF, Chi-square (Deng et
al., 2004) and ITF (Leopold and Kindermann, 2002). For example, Tokunaga
and Iwayama (1994) achieved a good improvement in retrieval accuracy using
WIDF on Japanese text compared to the TF.IDF approach, using KNN (Yang,
1999) and a Bayesian model (Tzeras and Hartman, 1993). Specifically, the
KNN with WIDF implementation achieved accuracy 7.4% higher than that of
TF.IDF.
Yang and Liu (1999) tested five categorization algorithms, SVM (Joachims,
1998), KNN (Yang, 1999), NNet (Wiener et al., 1995), LLSF (Yang and Chute,
1994) and NB based on a network (Tzeras and Hartman, 1993), on the
Reuters-21578 TC data set. The results showed that SVM, KNN and LLSF
outperformed NNet and NB-network when the number of positive training
instances per category is less than ten. Further, all the methods performed
well when the categories are well distributed in the training data set.
Lan et al. (2005) proposed a term weighting method called tf*rf and
compared it, using a traditional SVM, with other term weighting methods,
i.e., tf.x2, tf.ig and tf.or, on two widely used data sets from (20News,
1999). The experimental results showed that the methods based on
information theory, i.e., tf.x2, tf.ig and tf.or, perform poorly compared
with their proposed term weighting method in terms of accuracy. In Lan et
al. (2006), a comprehensive comparative study of different term weighting
methods using SVM showed that the term weighting method developed in Lan et
al. (2005) achieved better accuracy than other term weighting methods such
as tf.ig and tf.or.
Finally, Hadi et al. (2007) conducted a comparative study on IDF and WIDF
term weighting with KNN. Experimental results against eight different 20
Newsgroups data sets provided evidence that the Cosine coefficient
outperformed the Jaccard and Dice coefficient approaches with regard to F1
measure results and that the Cosine-based IDF achieved the highest average
scores.
In this study, we compare different variations of the VSM (Dice, Jaccard,
Cosine) with the KNN (Yang, 1999) algorithm using IDF and WIDF. The basis
of our comparison between the different implementations of KNN is the F1
measure (Van Rijsbergan, 1979). In other words, we want to determine the
best VSM variation which, when combined with KNN, produces good results
with reference to the F1 measure.
To the best of the authors' knowledge, no comparisons have been conducted
against English language data collections using different variations of the
VSM with KNN.
TEXT CATEGORIZATION PROBLEM
TC, also known as text classification, is the task of automatically sorting
a set of documents into categories (or classes, or topics) from a
predefined set. This task is related to the IR and ML communities.
Automated text classification tools are attractive since they free
organizations from the need to categorize documents manually, which can be
too expensive, or simply not feasible given the constraints of the
application or the number of documents involved (Sebastiani, 2005).
TC involves many applications such as automated indexing of scientific
articles according to predefined thesauri of technical terms, filing patents
into patent directories, selective dissemination of information to information
consumers, automated population of hierarchical catalogues of web resources,
spam filtering, identification of document genre, authorship attribution,
survey coding and even automated essay grading.
After preprocessing, indexing and transforming documents into the vector
model representation, we have a data set D = {d1, ..., dn} represented as a
set of vectors for n documents, together with a set of categories
C = {C1, ..., Cm}.
The TC problem can be defined according to Sebastiani (1999) as follows:
the documents are divided into two data sets, one for training and one for
testing. Let the training data set be {d1, d2, ..., dg}, where the g
documents are used as examples for the classifier and must contain a
sufficient number of positive examples for all the categories involved.
Table 1: Representation of the text categorization problem
The testing data set {dg+1, dg+2, ..., dn} is used to test the classifier's
effectiveness. Table 1 represents this splitting of the data into training
and testing parts; a document dk is considered a positive example of
category Cy if cky = 1 and a negative example if cky = 0.
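As an illustration of this representation, the following is a minimal
sketch (in Python; the documents, categories and membership values are
invented for the example and are not taken from our data sets):

```python
# Illustrative (invented) documents, categories and membership matrix.
documents = ["d1", "d2", "d3", "d4", "d5"]   # D = {d1, ..., dn}
categories = ["C1", "C2"]                     # C = {C1, ..., Cm}

g = 3                                         # first g documents train the classifier
training_docs = documents[:g]                 # {d1, ..., dg}
testing_docs = documents[g:]                  # {dg+1, ..., dn}

# c[k][y] = 1 if document dk is a positive example of category Cy, else 0.
c = [
    [1, 0],   # d1 -> C1
    [0, 1],   # d2 -> C2
    [1, 0],   # d3 -> C1
    [0, 1],   # d4 -> C2 (test)
    [1, 0],   # d5 -> C1 (test)
]
```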
Term Weighting
Term weighting is one of the important issues in TC, which has been
widely investigated in IR (Salton and McGill, 1983; Salton, 1988). Term
weighting corresponds to a value given to a term in order to reflect the
importance of that term in a document.
Although there are different weighting approaches for text indexing,
they all share the following two observations:
• The more times a term occurs in documents that belong to some category,
the more relevant it is to that category.
• The more a term appears in different documents representing different
categories, the less useful the term is for discriminating between
documents as belonging to different categories (Table 1).
Term Frequency (TF)
One of the simplest term weighting methods used to measure the importance
of each term in a given document is TF (Tokunaga and Iwayama, 1994). Using
this method, each term is assumed to have a value proportional to the
number of times it occurs in a text. Generally, for a document d and a term
t, the weight of t in d is given simply by TF(d, t), the number of
occurrences of t in d.
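For illustration, a minimal Python sketch of this counting follows (the
whitespace tokenization is a simplifying assumption; this is not the
implementation used in our experiments):

```python
from collections import Counter

def term_frequencies(document: str) -> Counter:
    """Raw term frequency: TF(d, t) is the number of times term t occurs in document d."""
    tokens = document.lower().split()  # naive whitespace tokenization (an assumption)
    return Counter(tokens)

tf = term_frequencies("the sea is calm and the sea is clear")
# tf["sea"] == 2, tf["calm"] == 1
```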
TF can help in improving an IR and TC evaluation measure named recall (Van
Rijsbergan, 1979); however, since frequent terms tend to appear in many
documents, such terms have little discriminative power. Recall is the
fraction of the relevant documents that have been retrieved and is
represented in Eq. 11 according to Table 3. To some extent, we can say that
TF follows a normal distribution curve with regard to the importance of
terms to the retrieval process, which means that too high or too low a
frequency does not improve the retrieval process.
Figure 1 shows that the term frequencies in the intervals [0, 5] and
[15, 20] are too low, so we remove those terms from the term list. We also
remove the stop words, which often have high frequencies. We keep the
intervals [5, 8.5] and [11.5, 15], since these term frequencies are ideal
for the retrieval process.
Inverse Document Frequency (IDF)
TF reflects the importance of a term in a single document; however, we may
also be interested in the frequency of a term across the set of documents.
This is called the Inverse Document Frequency (IDF) (Sparck, 1972): the
importance of each term is inversely proportional to the number of
documents that contain that term.
Table 2: The relation between n documents and the importance of the terms they contain
Table 2 shows that, for a given corpus of documents, as the number of
documents containing a given term increases, the importance of that term
decreases according to IDF. In other words, when the term occurs in only a
small number of documents, this makes it significant (for example, when n
equals ten), whereas when the term occurs within a large number of
documents it has insignificant importance according to IDF.
For a collection of N documents, if n documents contain the term t, IDF is
given as:

IDF(t) = log(N / n)
Sometimes n is written as the document frequency of t (the number of
documents that contain t), i.e., df(t). This approach follows Salton's
definition (Salton, 1988), which combined TF and IDF to weight terms;
Salton showed that this combination gives better performance with reference
to accuracy than IDF or TF alone. The product of TF and IDF is given in
Eq. 3 as:

TF.IDF(d, t) = TF(d, t) × IDF(t)
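A minimal sketch of IDF and the TF.IDF product under the standard
definitions above follows (the logarithm base and the handling of terms
that appear in no document are implementation choices assumed here):

```python
import math
from collections import Counter

def idf(term: str, collection: list[str]) -> float:
    """IDF(t) = log(N / n): N documents in the collection, n of which contain t."""
    N = len(collection)
    n = sum(1 for doc in collection if term in doc.lower().split())
    return math.log(N / n) if n > 0 else 0.0

def tf_idf(term: str, document: str, collection: list[str]) -> float:
    """TF.IDF weight of term t in document d: TF(d, t) * IDF(t)."""
    tf = Counter(document.lower().split())[term]
    return tf * idf(term, collection)

docs = ["the sea is calm", "the sea is rough", "the sky is clear"]
weight = tf_idf("sea", docs[0], docs)  # TF = 1, IDF = log(3/2)
```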
Weighted Inverse Document Frequency (WIDF)
One of the drawbacks of IDF is that all documents containing a certain term
are treated equally, due to the binary counting. In other words, if a term
such as "sea" occurred in 4 documents with different frequencies in each of
these documents, IDF does not consider the number of times "sea" has
occurred in these 4 documents; rather, it mainly considers the fact that
"sea" has occurred at all. The WIDF of a term t in a document d is given
by:

WIDF(d, t) = TF(d, t) / Σi∈D TF(i, t)
where TF(d, t) is the number of occurrences of t in d and i ranges over the
documents in the collection D. WIDF thus corresponds to the term frequency
normalized over the collection, and this normalized value is used as the
weight of the term under WIDF.
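A corresponding sketch of the WIDF normalization (again illustrative only,
with the same naive tokenization assumption as above):

```python
from collections import Counter

def widf(term: str, document: str, collection: list[str]) -> float:
    """WIDF(d, t): TF(d, t) divided by the total frequency of t over all documents in D."""
    tf_d = Counter(document.lower().split())[term]
    total = sum(Counter(doc.lower().split())[term] for doc in collection)
    return tf_d / total if total > 0 else 0.0

docs = ["sea sea sky", "sea sky sky", "sun sand sea"]
w = widf("sea", docs[0], docs)  # 2 occurrences out of 4 in the collection -> 0.5
```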
Similarity Measurements
There are several well-known similarity techniques, such as the VSM and the
Probabilistic Model (PM) (Tokunaga and Iwayama, 1994). In this study, we
focus on the VSM, adopting the Cosine coefficient as shown in Eq. 6, the
Jaccard coefficient as shown in Eq. 7 and the Dice coefficient as shown in
Eq. 8.
where Wik corresponds to the weight of the k-th element of the term vector
Vi (i.e., a pre-categorized document) and Wjk is the weight of the k-th
element of the term vector Vj (i.e., the incoming text). The greater the
value of Sim(Vi, Vj), the more similar the two texts are.
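Equations 6-8 correspond to the standard weighted forms of these
coefficients; a minimal sketch over two aligned weight vectors follows
(assuming neither vector is all zeros):

```python
import math

def cosine(vi: list[float], vj: list[float]) -> float:
    """Cosine coefficient between two weighted term vectors."""
    dot = sum(a * b for a, b in zip(vi, vj))
    return dot / (math.sqrt(sum(a * a for a in vi)) * math.sqrt(sum(b * b for b in vj)))

def jaccard(vi: list[float], vj: list[float]) -> float:
    """Extended (weighted) Jaccard coefficient."""
    dot = sum(a * b for a, b in zip(vi, vj))
    return dot / (sum(a * a for a in vi) + sum(b * b for b in vj) - dot)

def dice(vi: list[float], vj: list[float]) -> float:
    """Dice coefficient for weighted term vectors."""
    dot = sum(a * b for a, b in zip(vi, vj))
    return 2 * dot / (sum(a * a for a in vi) + sum(b * b for b in vj))
```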
KNN Algorithm
There are many approaches for assigning a category to incoming text, such
as (Quinlan, 1993; Thabtah et al., 2004; Tzeras and Hartman, 1993). In the
present study, we implemented Text-to-Text Comparison (TTC), which is also
known as the k-Nearest Neighbour (KNN) approach (Yang, 1999). KNN is a
statistical classification approach which has been intensively studied in
pattern recognition for over four decades. KNN has been successfully
applied to the TC problem (Yang and Liu, 1999; Yang, 1999) and has shown
promising results compared with other statistical approaches such as the
Bayesian-based network (Tzeras and Hartman, 1993).
The KNN algorithm is quite simple. Given a test document to be classified,
the algorithm searches for the k nearest neighbours among the
pre-classified training documents based on some similarity measure and
ranks those k neighbours by their similarity scores. The categories of the
k nearest neighbours are then used to predict the category of the test
document, using the ranked score of each neighbour as the weight of its
candidate category; if more than one neighbour belongs to the same
category, the sum of their scores is used as the weight of that category.
The category with the highest score is assigned to the test document,
provided that it exceeds a predefined threshold; more than one category can
be assigned to the test document. One drawback of KNN is the difficulty of
determining the value of k: a series of experiments with different k values
should be conducted to determine the best value. Another disadvantage of
KNN is the computation time needed to traverse all the training documents.
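A minimal sketch of this voting scheme follows (the similarity function, k
and the threshold are parameters; this is an illustration only, not our
VB.NET implementation):

```python
from collections import defaultdict

def knn_classify(test_vec, training_set, k, similarity, threshold=0.0):
    """Assign categories to a test document vector.

    training_set: list of (term_vector, category) pairs for the pre-classified
    training documents; similarity: e.g. cosine, jaccard or dice from above."""
    # Score every training document against the test document and keep the k nearest.
    scored = sorted(((similarity(test_vec, vec), cat) for vec, cat in training_set),
                    key=lambda pair: pair[0], reverse=True)[:k]
    # Neighbours that share a category pool their scores; the sum is the
    # weight of that candidate category.
    weights = defaultdict(float)
    for score, cat in scored:
        weights[cat] += score
    # Assign every category whose weight exceeds the predefined threshold,
    # highest-weighted first (more than one category may be assigned).
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [cat for cat, weight in ranked if weight > threshold]
```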
EXPERIMENT RESULTS
Experiments on the 20 Newsgroups (20NG) data sets (20News, 1999) using
three TC techniques based on vector model similarity (Cosine, Jaccard and
Dice) have been conducted. We used the F1 evaluation measure as the basis
of our comparison, where F1 is computed based on the following equation:

F1 = (2 × Precision × Recall) / (Precision + Recall)

Precision and recall are widely used evaluation measures in IR and ML and
are defined according to Table 3.
To explain precision and recall, suppose someone has 5 blue and 7 red
tickets in a set and submits a query to retrieve the blue ones. If he
retrieves 6 tickets of which 4 are blue and 2 are red, it means that he got
4 out of 5 blue tickets (1 false negative) and 2 red ones (2 false
positives). Based on these results, precision = 4/6 (4 blue out of 6
retrieved tickets) and recall = 4/5 (4 blue out of the 5 in the initial
set).
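The same counts expressed as a short computation (the F1 form used here is
the standard balanced F-measure, 2PR/(P + R)):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """tp: relevant items retrieved; fp: irrelevant items retrieved;
    fn: relevant items that were missed (cf. Table 3)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Ticket example above: 4 blue retrieved, 2 red retrieved, 1 blue missed.
p, r, f1 = precision_recall_f1(tp=4, fp=2, fn=1)  # p = 4/6, r = 4/5, f1 ~ 0.73
```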
Three TC techniques based on vector model similarity (Cosine, Jaccard and
Dice) have been compared in terms of the F1 measure. These methods use the
same strategy to classify incoming text, i.e., KNN. We have several options
for constructing a text classification method; we compared techniques using
the different term weightings IDF and WIDF. KNN was implemented using
VB.NET on a 2.8 GHz Pentium IV machine with 256 MB of RAM.
The data set is organized into 20 different newsgroups, each corresponding
to a different topic. Some of the newsgroups are very closely related to
each other (e.g., comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while
others are highly unrelated (e.g., misc.forsale / soc.religion.christian).
The data set is sorted by date into training (60%) and test (40%) sets,
does not include cross-posts (duplicates) and does not include
newsgroup-identifying headers (Xref, Newsgroups, Path, Followup-To, Date)
(20News, 1999).
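Our experiments used our own VB.NET implementation; purely as a convenience
for readers who wish to reproduce the same "bydate" organization of the
corpus described above, the split can be obtained, for example, through
scikit-learn (this is an assumption of library availability, not part of
our experimental setup):

```python
from sklearn.datasets import fetch_20newsgroups

# The "bydate" version is sorted by date into training and test sets with
# cross-posts (duplicates) removed; header stripping is optional.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))
print(len(train.data), len(test.data), len(train.target_names))
```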
Tables 4 and 5 show the F1 results of the text categorizers generated
against the 20 data sets; the K parameter in the KNN algorithm is varied
from 3 to 11 in steps of 2.
After analyzing Tables 4 and 5, we found that there is consistency between
the Dice based on TF.IDF and Jaccard based on TF.IDF algorithms, in that
both of them outperformed Cosine based on TF.IDF, Cosine based on WIDF,
Dice based on WIDF and Jaccard based on WIDF. In particular, Dice based on
TF.IDF outperformed Dice based on WIDF, Jaccard based on WIDF, Cosine based
on TF.IDF and Cosine based on WIDF on 10, 10, 19 and 12 data sets,
respectively.
There are similarities between (1) Dice based on TF.IDF and Jaccard based
on TF.IDF and (2) Dice based on WIDF and Jaccard based on WIDF with respect
to the average F1 measure results.
Finally, larger values of k reduce the effect of noise on the
classification; this result is entirely consistent with that reported by
Tokunaga and Iwayama (1994).
Table 3: Possible sets of documents based on a query in IR
Table 4: F1 results of the Cosine implementations with KNN
Table 5: F1 results of the Jaccard and Dice implementations with KNN
CONCLUSIONS
In this study, we investigated different variations of the VSM with the KNN
algorithm; these variations are the Cosine coefficient, the Jaccard
coefficient and the Dice coefficient, using the IDF and WIDF term weighting
measures. The basis of our comparisons is the F1 evaluation measure. The
average F1 results obtained against 20 data sets indicated that Dice based
on TF.IDF and Jaccard based on TF.IDF outperformed Cosine based on TF.IDF,
Cosine based on WIDF, Dice based on WIDF and Jaccard based on WIDF. In the
near future, we plan to experiment with other TC data collections,
especially Arabic data sets. We also plan to propose a new TC technique
based on association rule mining.