
Asian Journal of Information Management

Year: 2008 | Volume: 2 | Issue: 1 | Page No.: 14-22
DOI: 10.3923/ajim.2008.14.22
A Comprehensive Comparative Study Using Vector Space Model with K-Nearest Neighbor on Text Categorization Data
Wa`el Musa Hadi, Fadi Thabtah, Salahideen Mousa, Samer Al Hawari, Ghassan Kanaan and Jafar Ababnih

Abstract: On 20 text categorization data sets, this research investigated different variations of the Vector Space Model (VSM) with the KNN algorithm and compared different term weighting approaches in terms of the F1 measure. The experimental results provide evidence that the Dice and Jaccard coefficients outperformed the Cosine coefficient with respect to F1 results and that the Dice-based TF.IDF achieved the highest average scores.


How to cite this article
Wa`el Musa Hadi, Fadi Thabtah, Salahideen Mousa, Samer Al Hawari, Ghassan Kanaan and Jafar Ababnih, 2008. A Comprehensive Comparative Study Using Vector Space Model with K-Nearest Neighbor on Text Categorization Data. Asian Journal of Information Management, 2: 14-22.

Keywords: vector space model, term weighting, data mining, text categorization

INTRODUCTION

Text Categorization (TC) is one of the important problems in the Information Retrieval (IR) and data mining communities. This is because of the significance of natural language text, the huge amount of text stored on the Internet and the availability of digital libraries and document corpora. Further, the importance of TC rises because it concerns the processing and classification of natural language text using different techniques, which makes retrieval and other text manipulation processes easier to execute.

There are many TC techniques, such as decision trees (Quinlan, 1993), Support Vector Machines (SVM) (Joachims, 1998), rule induction (Moulinier et al., 1996), Neural Networks (Wiener et al., 1995) and k-Nearest Neighbor (KNN) (Yang, 1999).

In this study, we focus on a text similarity strategy, known as the VSM, in order to compute the similarity between incoming text (new test cases) and the pre-categorized text in the training data set. We use the KNN algorithm to classify incoming text into one of the categories. Generally, TC based on text similarity goes through two steps: similarity measurement and classification assignment.

Term weighting is one of the well-known concepts in TC and can be defined as a factor given to a term in order to reflect the importance of that term. There are many term weighting approaches, including Inverse Document Frequency (IDF) and Weighted Inverse Document Frequency (WIDF) (Tokunaga and Iwayama, 1994). IDF and WIDF focus on term occurrences inside a text corpus. WIDF distinguishes between two terms that have different occurrences, whereas IDF treats both terms equally.

Since TC stands at the crossroads of modern IR and ML, several research papers have focused on it, but each has concentrated on one or more issues related to the task. Some research works (Deng et al., 2004; Debole and Sebastiani, 2003) have focused on the different term weighting approaches related to TC, such as Term Frequency (TF), WIDF, IDF, Chi-square (Deng et al., 2004) and ITF (Leopold and Kindermann, 2002). For example, Tokunaga and Iwayama (1994) achieved a good improvement in retrieval accuracy using WIDF on Japanese-language text, compared to the TF.IDF approach, using KNN (Yang, 1999) and a Bayesian model (Tzeras and Hartman, 1993). Specifically, the KNN.WIDF implementation achieved results 7.4% higher than those of TF.IDF.

Yang and Liu (1999) tested five categorization algorithms, SVM (Joachims, 1998), KNN (Yang, 1999), NNet (Wiener et al., 1995), LLSF (Yang and Chute, 1994) and a network-based NB (Tzeras and Hartman, 1993), on the Reuters-21578 TC data set. The results showed that SVM, KNN and LLSF outperformed NNet and the NB network when the number of positive training instances per category was less than ten. Further, all the methods performed well when the categories were well distributed in the training data set.

Lan et al. (2005) proposed a term weighting method called tf*rf and compared it, using a traditional SVM, with other term weighting methods, i.e., tf.x2, tf.ig and tf.or, on two widely used data sets from (20News, 1999). The experimental results showed that the methods based on information theory, i.e., tf.x2, tf.ig and tf.or, perform poorly compared with the proposed term weighting method in terms of accuracy. In Lan et al. (2006), a comprehensive comparative study conducted on different term weighting methods using SVM showed that the term weighting method developed in Lan et al. (2005) achieved better accuracy than other term weighting methods such as tf.ig and tf.or.

Finally, Hadi et al. (2007) conducted a comparative study on IDF and WIDF term weighting with KNN. Experimental results against eight different 20 newsgroup data sets provided evidence that the Cosine coefficient outperformed the Jaccard and Dice coefficient approaches with regard to F1 measure results and that the Cosine-based IDF achieved the highest average scores.

In this study, we compare different variations of VSM (Dice, Jaccard, Cosine) with the KNN (Yang, 1999) algorithm using IDF and WIDF. The basis of our comparison between the different implementations of KNN is the F1 measure (Van Rijsbergan, 1979). In other words, we want to determine the best VSM variation which, when merged with KNN, produces good results with reference to the F1 measure.

To the best of the authors' knowledge, no comparisons have been conducted against English-language data collections using different variations of VSM with KNN.

TEXT CATEGORIZATION PROBLEM

TC, also known as text classification, is the task of automatically sorting a set of documents into categories (or classes, or topics) from a predefined set. This task is related to the IR and ML communities. Automated text classification tools are attractive since they free organizations from the need for manual categorization of documents, which can be too expensive, or simply not feasible given the constraints of the application or the number of documents involved (Sebastiani, 2005).

TC involves many applications such as automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories, selective dissemination of information to information consumers, automated population of hierarchical catalogues of web resources, spam filtering, identification of document genre, authorship attribution, survey coding and even automated essay grading.

After preprocessing, indexing and transforming documents into the vector model representation, we have a data set D = {d1,...,dn} represented as a set of vectors for n documents, together with a set of categories C = {C1,...,Cm}.

The TC problem can be defined according to Sebastiani (1999) as follows: the documents are divided into two data sets, one for training and one for testing. Let the training data set be {d1, d2,..., dg}, where the g documents are used as examples for the classifier and must contain a sufficient number of positive examples for all the categories involved.

Table 1: Representation of text categorization problem

The testing data set {dg+1, dg+2,..., dn} is used to test the classifier's effectiveness. The matrix in Table 1 represents the split of the data into training and testing parts; a document dk is considered a positive example of category Cy if cky = 1 and a negative example if cky = 0.
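As a concrete illustration, the sketch below (ours, not from the paper; the documents, categories and membership values are made up) shows one way to hold the Table 1 representation in code: a binary matrix over documents and categories, with the first g documents reserved for training.

```python
# Toy sketch of the Table 1 representation: c[k][y] records whether document dk
# is a positive example of category Cy; the first g documents are used for training.
documents = ["d1", "d2", "d3", "d4", "d5"]
categories = ["C1", "C2"]

c = [
    [1, 0],  # d1 is a positive example of C1 only
    [1, 0],  # d2 is a positive example of C1 only
    [0, 1],  # d3 is a positive example of C2 only
    [0, 1],  # d4 is a positive example of C2 only
    [1, 1],  # d5 is a positive example of both categories (multi-label)
]

g = 4
training, testing = documents[:g], documents[g:]
print(training, testing)  # ['d1', 'd2', 'd3', 'd4'] ['d5']
```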

Term Weighting
Term weighting is one of the important issues in TC, which has been widely investigated in IR (Salton and McGill, 1983; Salton, 1988). Term weighting corresponds to a value given to a term in order to reflect the importance of that term in a document.

Although there are different weighting approaches for text indexing, they all share the following two observations:

The more times a term occurs in documents that belong to some category, the more relevant it is to that category.
The more a term appears in different documents representing different categories, the less useful it is for discriminating between documents belonging to different categories (Table 1).

Term Frequency (TF)
One of the simplest term weighting methods used to measure the importance of each term in a given document is TF (Tokunaga and Iwayama, 1994). Using this method, each term is assumed to have a weight proportional to the number of times it occurs in the text. Generally, for a document d and a term t, the weight of t in d is given as:

W (d, t) = TF (d, t)
(1)

TF can help in improving an IR and TC evaluation measure named recall (Van Rijsbergan, 1979); however, since frequent terms tend to appear in many documents, such terms have little discriminative power. Recall is the fraction of the relevant documents that have been retrieved and is represented in Eq. 11 according to Table 3. To some extent, we can say that the importance of terms to the retrieval process follows a normal distribution curve over term frequency, which means that either too high or too low a frequency does not improve the retrieval process.

Figure 1 shows that terms with frequencies in the intervals [0, 5] and [15, 20] are of too little use to the retrieval process, so we remove them from the term list. We also remove stop words, which often have high frequencies. We keep the intervals [5, 8.5] and [11.5, 15], since term frequencies in these ranges are ideal for the retrieval process.

Fig. 1: Term frequency
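To make Eq. 1 concrete, here is a minimal sketch (ours, not the authors' VB.NET implementation) that computes raw term frequencies for two toy, already-tokenized documents:

```python
from collections import Counter

def term_frequencies(document_tokens):
    """Raw term frequency TF(d, t) of Eq. 1: how many times each term t occurs in document d."""
    return Counter(document_tokens)

# Two toy, already-tokenized documents (made up for illustration).
doc1 = ["ship", "sea", "sea", "port"]
doc2 = ["car", "road", "sea"]

print(term_frequencies(doc1))  # Counter({'sea': 2, 'ship': 1, 'port': 1})
print(term_frequencies(doc2))  # Counter({'sea': 1, 'car': 1, 'road': 1})
```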

Inverse Document Frequency (IDF)
TF reflects the importance of a term within a single document; however, we may also be interested in how a term is distributed across the set of documents. This is captured by the Inverse Document Frequency (IDF), in which the importance of each term is inversely proportional to the number of documents that contain it (Sparck, 1972). Table 2 shows that, for a given corpus of documents, as a term appears in more documents its importance decreases according to IDF. In other words, when a term occurs in a small number of documents it is considered significant (e.g., when n equals ten in Table 2), whereas a term that occurs in a large number of documents has insignificant importance according to IDF.

Table 2: The relation between n documents and the importance of the terms they contain

For a collection of N documents, if n documents contain the term t, IDF is given as follows:

IDF (t) = log (N/n)
(2)

Sometimes n is replaced by the document frequency (the number of documents that contain t), i.e., df(t). This approach follows Salton's definition (Salton, 1988), which combined TF and IDF to weight the terms; he showed that this combination gives better performance with reference to accuracy than IDF or TF alone. The product of TF and IDF is given in Eq. 3 as:

W (d, t) = TF (d, t).IDF (t)
(3)
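The sketch below illustrates Eq. 2 and 3 on the same kind of toy tokenized collection (an illustration of the definitions, not the paper's implementation):

```python
import math

def idf(term, documents):
    """IDF(t) = log(N / n) of Eq. 2: N is the collection size, n the number of documents containing t."""
    n = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / n) if n else 0.0

def tfidf_weight(document, term, documents):
    """W(d, t) = TF(d, t).IDF(t) of Eq. 3, with TF taken as the raw count of t in d."""
    return document.count(term) * idf(term, documents)

docs = [["ship", "sea", "sea", "port"],
        ["car", "road", "sea"],
        ["ship", "ocean", "port"]]
print(tfidf_weight(docs[0], "sea", docs))  # 2 * log(3/2), roughly 0.81
```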

Weighted Inverse Document Frequency (WIDF)
One of the drawbacks of IDF is that all documents containing a certain term are treated equally due to binary counting. In other words, if a term such as "sea" occurred in 4 documents with different frequencies in each of these documents, IDF does not consider the number of times "sea" occurred in these 4 documents; it only considers the fact that "sea" occurred. The WIDF of a term t in document d is given by:

WIDF (d, t) = TF (d, t) / Σi TF (di, t)
(4)

where, TF(d, t) is the occurrence of t in d and i ranges over the documents in the collection D. WIDF corresponds to the normalized term frequency over the collection. The weight of a term with reference to WIDF is given as:

W (d, t) = WIDF (d, t)
(5)
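A corresponding sketch for Eq. 4 and 5, again on a toy tokenized collection rather than the authors' code:

```python
def widf_weight(document, term, documents):
    """WIDF(d, t) of Eq. 4 and 5: the count of t in d normalized by the total count of t
    over the whole collection, so repeated occurrences in d raise the weight."""
    total = sum(doc.count(term) for doc in documents)
    return document.count(term) / total if total else 0.0

docs = [["ship", "sea", "sea", "port"],
        ["car", "road", "sea"],
        ["ship", "ocean", "port"]]
print(widf_weight(docs[0], "sea", docs))  # 2 of the 3 collection-wide occurrences, i.e., 0.666...
```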

Similarity Measurements
There are several well-known similarity techniques, such as: VSM and Probabilistic Model (PM) (Tokunaga and Iwayama, 1994). In this study, we focus on VSM by adapting Cosine as shown in Eq. 6, Jaccard as shown in Eq. 7 and Dice as shown in Eq. 8.

Simcosine (Vi, Vj) = Σk Wik Wjk / (sqrt(Σk Wik²) · sqrt(Σk Wjk²))
(6)

Simjaccard (Vi, Vj) = Σk Wik Wjk / (Σk Wik² + Σk Wjk² − Σk Wik Wjk)
(7)

Simdice (Vi, Vj) = 2 Σk Wik Wjk / (Σk Wik² + Σk Wjk²)
(8)

where, Wik corresponds to the weight of the k-th element of the term vector Vi, i.e., the pre-categorized document and Wjk is the weight of the k-th element of the term vector Vj, i.e., the incoming text. The greater the value of Sim (Vi, Vj), the more similar the two texts are.
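The following sketch implements the three coefficients as reconstructed in Eq. 6-8 (the extended, weighted forms of Jaccard and Dice are assumed); vectors are plain lists of term weights over a shared vocabulary:

```python
import math

def cosine(vi, vj):
    """Eq. 6: dot product normalized by the vectors' Euclidean lengths."""
    dot = sum(wi * wj for wi, wj in zip(vi, vj))
    return dot / (math.sqrt(sum(w * w for w in vi)) * math.sqrt(sum(w * w for w in vj)))

def jaccard(vi, vj):
    """Eq. 7: the extended (weighted) Jaccard coefficient."""
    dot = sum(wi * wj for wi, wj in zip(vi, vj))
    return dot / (sum(w * w for w in vi) + sum(w * w for w in vj) - dot)

def dice(vi, vj):
    """Eq. 8: the extended (weighted) Dice coefficient."""
    dot = sum(wi * wj for wi, wj in zip(vi, vj))
    return 2 * dot / (sum(w * w for w in vi) + sum(w * w for w in vj))

# Two toy weight vectors over the same five-term vocabulary.
vi = [0.8, 0.0, 0.4, 0.0, 0.2]
vj = [0.6, 0.1, 0.0, 0.0, 0.3]
print(cosine(vi, vj), jaccard(vi, vj), dice(vi, vj))
```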

KNN Algorithm
There are many approaches to assigning a category to incoming text, such as (Quinlan, 1993; Thabtah et al., 2004; Tzeras and Hartman, 1993). In the present study, we implemented Text-to-Text Comparison (TTC), which is also known as k-Nearest Neighbour (KNN) (Yang, 1999). KNN is a statistical classification approach which has been intensively studied in pattern recognition for over four decades. KNN has been successfully applied to the TC problem (Yang and Liu, 1999; Yang, 1999) and has shown promising results when compared with other statistical approaches such as the Bayesian-based network (Tzeras and Hartman, 1993).

The KNN algorithm is quite simple. Given a test document to be classified, the algorithm searches for the k nearest neighbours among the pre-classified training documents based on some similarity measure and ranks those k neighbours by their similarity scores. The categories of the k nearest neighbours are then used to predict the category of the test document, using the ranked score of each neighbour as the weight of its candidate category; if more than one neighbour belongs to the same category, the sum of their scores is used as the weight of that category. The category with the highest score is assigned to the test document provided that it exceeds a predefined threshold, and more than one category can be assigned to the test document. One drawback of KNN is the difficulty of determining the value of k: a series of experiments with different k values must be conducted to find the best value. Another disadvantage is the computation time needed to traverse all the training documents.
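A compact sketch of the procedure just described (ours; the data layout, the default k and the threshold value are assumptions rather than details taken from the paper). Any of the similarity functions above, e.g., the Dice coefficient, can be passed in as the similarity argument.

```python
from collections import defaultdict

def knn_classify(test_vector, training_set, similarity, k=5, threshold=0.0):
    """Toy KNN text categorizer.
    training_set: list of (weight_vector, category) pairs for the pre-classified documents."""
    # Score every training document against the test document and keep the k most similar.
    scored = sorted(((similarity(test_vector, vec), cat) for vec, cat in training_set),
                    reverse=True)[:k]
    # Neighbours sharing a category pool their similarity scores into that category's weight.
    weights = defaultdict(float)
    for score, cat in scored:
        weights[cat] += score
    # Assign every category whose weight exceeds the threshold (possibly more than one).
    return [cat for cat, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
            if w > threshold]
```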

EXPERIMENT RESULTS

Experiments on the 20 Newsgroups (20NG) data sets (20News, 1999) using three TC techniques based on vector model similarity (Cosine, Jaccard, Dice) have been conducted. We used the F1 evaluation measure as the basis of our comparison, where F1 is computed by the following equation:

F1 = 2 · Precision · Recall / (Precision + Recall)
(9)

Precision and recall are widely used evaluation measures in IR and ML; according to the document sets in Table 3, they are defined as:

Precision = (relevant documents retrieved) / (total documents retrieved)
(10)

Recall = (relevant documents retrieved) / (total relevant documents)
(11)

To explain precision and recall, let's say someone has 5 blue and 7 red tickets in a set and submits a query to retrieve the blue ones. If 6 tickets are retrieved, where 4 of them are blue and 2 are red, it means that 4 out of the 5 blue tickets were retrieved (1 false negative) along with 2 red ones (2 false positives). Based on these results, precision = 4/6 (4 blue out of 6 retrieved tickets) and recall = 4/5 (4 blue out of the 5 in the initial set).
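The same ticket example can be checked with a few lines of code (an illustration only; the ticket identifiers are made up):

```python
def precision_recall_f1(retrieved, relevant):
    """Precision and recall (Eq. 10 and 11) and their harmonic mean F1 (Eq. 9)."""
    true_positives = len(set(retrieved) & set(relevant))
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

relevant = ["b1", "b2", "b3", "b4", "b5"]          # the 5 blue tickets
retrieved = ["b1", "b2", "b3", "b4", "r1", "r2"]   # 6 retrieved: 4 blue, 2 red
print(precision_recall_f1(retrieved, relevant))    # (0.666..., 0.8, about 0.727)
```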

Three TC techniques based on vector model similarity (Cosine, Jaccard and Dice) have been compared in terms of the F1 measure. These methods use the same strategy, i.e., KNN, to classify incoming text. Since there are several options for constructing a text classification method, we compared the techniques using the two term weighting schemes IDF and WIDF. KNN was implemented in VB.NET on a 2.8 GHz Pentium IV machine with 256 MB of RAM.

The data set is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g., comp.sys.ibm.pc.hardware/comp.sys.mac.hardware), while others are highly unrelated (e.g., misc.forsale/soc.religion.christian). The data set is sorted by date into training (60%) and test (40%) sets, does not include cross-posts (duplicates) and does not include newsgroup-identifying headers (Xref, Newsgroups, Path, Followup-To, Date) (20News, 1999).
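The original experiments used a VB.NET implementation; for readers who want to reproduce the by-date split today, scikit-learn ships a copy of the 20NG corpus (a convenience not used in the paper):

```python
from sklearn.datasets import fetch_20newsgroups

# The "by date" version of 20NG: roughly 60% training / 40% test split by posting date,
# with cross-posted duplicates removed; headers, footers and quotes are stripped here as well.
train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

print(len(train.data), len(test.data))  # roughly 11,000 training and 7,500 test documents
print(train.target_names[:3])           # e.g., ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc']
```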

Tables 4 and 5 show the F1 results of the text categorizers generated against the 20 data sets; the k parameter in the KNN algorithm is varied from 3 to 11 in steps of 2.

After analyzing Tables 4 and 5, we found that there is consistency between the Dice-based TF.IDF and Jaccard-based TF.IDF algorithms, in that both of them outperformed Cosine-based TF.IDF, Cosine-based WIDF, Dice-based WIDF and Jaccard-based WIDF. In particular, Dice-based TF.IDF outperformed Dice-based WIDF, Jaccard-based WIDF, Cosine-based TF.IDF and Cosine-based WIDF on 10, 10, 19 and 12 data sets, respectively.

There are similarities between (1) Dice based TF.IDF and Jaccard based TF.IDF and (2) Dice based WIDF and Jaccard based WIDF with respect to the average results of F1 measure.

Finally, larger values of k reduce the effect of noise on the classification; this result is entirely consistent with that in (Tokunaga and Iwayama, 1994).

Table 3: Possible document sets based on a query in IR

Table 4: F1 results of the Cosine implementations with KNN

Table 5: F1 results of the Jaccard and Dice implementations with KNN

CONCLUSIONS

In this study, we investigated different variations of VSM using the KNN algorithm; these variations are the Cosine, Jaccard and Dice coefficients, combined with the IDF and WIDF term weighting measures. The basis of our comparisons is the F1 evaluation measure. The average F1 results obtained against 20 data sets indicated that Dice-based TF.IDF and Jaccard-based TF.IDF outperformed Cosine-based TF.IDF, Cosine-based WIDF, Dice-based WIDF and Jaccard-based WIDF. We plan in the near future to experiment with other TC data collections, especially Arabic data sets. We also plan to propose a new TC technique based on association rule mining.

REFERENCES

  • Debole, F. and F. Sebastiani, 2003. Supervised term weighting for automated text categorization. Proceedings of the 2003 ACM Symposium on Applied Computing. March 9-12, 2003, ACM Press, USA., pp: 784-788.


  • Deng, Z.H., S.W. Tang, D.Q. Yang, M. Zhang, L.Y. Li and K.Q. Xie, 2004. A comparative study on feature weight in text categorization. Lecture Notes Comput., 3007: 588-597.


  • Hadi, W., F. Thabtah and H. Abdeljaber, 2007. A comparative study using vector space model with K-nearest neighbor on text categorization data. Proceedings of the 2007 International Conference of Data Mining and Knowledge Engineering, October 28-31, 2007, London, UK, pp: 296-300.


  • Joachims, T., 1998. Text categorization with support vector machines: Learning with many relevant features. Proceedings of the 10th European Conference on Machine Learning, Chemnitz, Germany, April 21-23, 1998, Springer, Berlin, Heidelberg, pp: 137-142.


  • Lan, M., S.Y. Sung, H.B. Low and C.L. Tan, 2005. A comparative study on term weighting schemes for text categorization. Proceedings of the International Joint Conference on Neural Networks, May 10-14, 2005, ACM New York, USA., pp: 1032-1033.


  • Lan, M., C.L. Tan and H.B. Low, 2006. Proposing a new term weighting scheme for text categorization. Proceedings of the 21st National Conference on Artificial Intelligence, pp: 763-768.


  • Leopold, E. and J. Kindermann, 2002. Text categorization with support vector machines. How to represent texts in input space. Mach. Learn., 46: 423-444.


  • Moulinier, I., G. Raskinis and J. Ganascia, 1996. Text categorization: A symbolic approach. Proceedings of the 5th Annual Symposium on Document Analysis and Information Retrieval, pp: 87-99.


  • Quinlan, R., 1993. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.


  • Salton, G. and M.J. McGill, 1983. Introduction to Modern Information Retrieval. McGraw Hill Inc., New York, USA., ISBN-10: 0070544840, Pages: 448


  • Salton, G., 1988. Automatic Text Processing. In: The Transformation, Analysis Retrieval of Information by Computer, Salton, G. (Ed.). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA., pp: 450


  • Sebastiani, F., 1999. A tutorial on automated text categorisation. Proceedings of the ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, pp: 7-35.


  • Sebastiani, F., 2005. Text Categorization. In: Text Mining and its Applications, Zanasi, A. (Ed.). WIT Press, Southampton, UK., pp: 109-129


  • Sparck Jones, K., 1972. A statistical interpretation of term specificity and its application in retrieval. J. Documentation, 28: 11-21.


  • Thabtah, F., P. Cowling and Y. Peng, 2004. MMAC: A new multi-class, multi-label associative classification approach. Proceedings of the 4th International Conference on Data Mining, November 1-4, 2004, Brighton, UK., pp: 217-224.


  • Tokunaga, T. and M. Iwayama, 1994. Text categorization based on weighted inverse document frequency. Technical Report 94 TR0001, Department of Computer Science, Tokyo Institute of Technology: Tokyo, Japan.


  • Tzeras, K. and S. Hartman, 1993. Automatic indexing based on bayesian inference networks. Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, June 27-July 01, 1993, ACM New York, USA., pp: 22-35.


  • Van Rijsbergan, C., 1979. Information Retrieval. 2nd Edn., Butterworths, London, Pages: 224


  • Wiener, E., J.O. Pedersen and A.S. Weigend, 1995. A neural network approach to topic spotting. Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval, April 24-26, 1995, Las Vegas, Nevada, pp: 317-332.


  • Yang, Y. and C.G. Chute, 1994. An example-based mapping method for text categorization and retrieval. ACM Trans. Inform. Syst., 12: 252-277.


  • Yang, Y., 1999. An evaluation of statistical approaches to text categorization. Inform. Retrieval, 1: 69-90.


  • Yang, Y. and X. Liu, 1999. A re-examination of text categorization methods. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 15-19, 1999, New York, USA., pp: 42-49.
