
Information Technology Journal

Year: 2006 | Volume: 5 | Issue: 1 | Page No.: 7-12
DOI: 10.3923/itj.2006.7.12
Consolidation of Diversifying Terms Weighting Impact on IR System Performances
A. Kasam and Kwon Hyuk-Chul

Abstract: Search engines and Internet crawlers automatically and accurately present a user's relevant data from a high-dimensional word basket, or data bag, which we call a high-dimensional vector space model; documents are stored in that large data bag as a number of indexed vectors in term space. When a document is searched for, a query is given through the search engine and two major computing operations are performed, one for the query vector and another for the document vector. The components of the vectors are determined by the term weights, a function of the frequencies of the terms in the document or query as well as throughout the collection. Searching for documents therefore ultimately comes down to term weighting in the high-dimensional data space, which is a difficult task in the data industry. At the same time, using a single term weighting method suffers from certain limitations in application. A highly diversified and consolidated term weighting approach can therefore be applied as an interesting tool for improving retrieval performance. In this study, a consolidation of diversified term weighting approaches is proposed as a cost-effective method for improving retrieval performance. Under the proposed approach, a certain amount of metadata has been tested and the results obtained strongly suggest that our approach is effective, has positive value and is applicable for further improving retrieval performance.


How to cite this article
A. Kasam and Kwon Hyuk-Chul, 2006. Consolidation of Diversifying Terms Weighting Impact on IR System Performances. Information Technology Journal, 5: 7-12.

Keywords: term weighting and information retrieval

INTRODUCTION

Search engines and Internet crawlers automatically and accurately present a user's relevant data from a high-dimensional word basket, or data bag, which we usually call a high-dimensional vector space model; the preprocessed documents are stored in that multi-dimensional data bag as a number of indexed vectors in term space. The definition of a term is not inherent in the model, but terms are typically words and phrases[1]. If words are chosen as terms, then every word in the basket becomes an independent dimension in the high-dimensional vector space. A relevance feedback policy is used to construct a personalized query or user profile[2]; that is, when a document is searched for, a query is given through the search engine. Two major computing operations are then performed separately, one for the query vector and another for the document vector (Table 1 and 2). The components of the vectors are determined by the term weights, a function of the frequencies of the terms in the document or query as well as throughout the collection. Searching for documents therefore ultimately comes down to term weighting in the high-dimensional data space, which is a difficult problem in the data industry[3,4]. People use different types of IR systems, in the form of Internet search engines, to retrieve the required information[5,6]. Search engines and Internet crawlers are ultimately based on IR models such as the Boolean Model (BM), the Probabilistic Model (PM) and the Vector Space Model (VSM).

The state of the art in IR assists users in storing, manipulating and retrieving a volume of useful data in the form of documents[7]. The similarity between two documents is traditionally measured by the cosine of the angle between their vectors; it is based on the inner product operation and document length normalization[8]. It is useful to give a geometric interpretation to the vector space notion of similarity by considering the dot product equation.
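In standard notation, where di and qi are the term-weight components of the document and query vectors, the dot product equation and the resulting similarity score are:

$$ d \cdot q = \|d\|\,\|q\|\cos\theta, \qquad \mathrm{sim}(d, q) = \cos\theta = \frac{d \cdot q}{\|d\|\,\|q\|} = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2}\,\sqrt{\sum_i q_i^2}} $$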

In the vector space model, where d is a document vector, q is a query vector and θ is the angle between them, if d and q are normalized so that their magnitudes are one, the preceding equation reduces to cos θ = d·q, so the similarity score is a measure of the cosine of the angle between the vectors.

Table 1: Documents from the MED test collection

Table 2: Query from the MED test collection

If we rank the documents according to their similarity score from highest to lowest, the highest scoring document has the smallest angle between itself and the query.

Example 1: Here, the common terms between Doc 1 and the query are level (1050), low (1725), high (1820), glucose (2461) and lactoz (3560) and the similarity score of Doc 1 is:

(0.18*0.30) + (0.09*0.30) + (0.09*0.30)
+ (0.09*0.30) + (0.07*0.36) = 0.1602

The common terms between Doc 2 and the query are high (51), fatty (1790), glucose (2168), levels (2450) and toxin (2591) and the similarity score of Doc 2 is:

(0.07*0.30) + (0.07*0.30) + (0.07*0.25)
+ (0.07*0.30) + (0.29*0.30) + (0.29*0.30) = 0.2545

Doc 2 has a higher similarity score than Doc 1, so Doc 2 would be retrieved before Doc 1.
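Since the experiments reported later were implemented in C, a minimal C sketch of this computation may be helpful; the array names and layout are ours and assume the five common-term weights of Doc 1 listed above:

#include <stdio.h>

/* Example 1, Doc 1 vs. the query: the similarity is the inner product of
   the weights of the five common terms (level, low, high, glucose, lactoz). */
int main(void)
{
    double doc1[]  = {0.18, 0.09, 0.09, 0.09, 0.07};   /* Doc 1 term weights  */
    double query[] = {0.30, 0.30, 0.30, 0.30, 0.36};   /* query term weights  */
    double sim = 0.0;

    for (int i = 0; i < 5; i++)
        sim += doc1[i] * query[i];                     /* accumulate dot product */

    printf("similarity(Doc 1, query) = %.4f\n", sim);  /* prints 0.1602 */
    return 0;
}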

METHODS OF TERMS WEIGHTING

Proper term weighting greatly impacts IR system performance[9]. Here, we briefly explain the fundamental term weighting methods and their modifications, along with their retrieval contribution on the data sets. A list of popular term weighting methods is given in Table 3.

Table 3: List of popular term weighting methods

Generally, three different types of term weighting methods[10], local, global and length normalization, are used for practical purposes. The term weight is given by:

$$ L_{ij}\, G_i\, N_j $$

Where, Lij is the local weight for term i in document j, Gi is the global weight for term i and Nj is the length normalization factor for document j. Local weights are functions of how many times each term appears in a document, global weights are functions of how many times each term appears in the entire collection and the normalization factor compensates for discrepancies in the lengths of the documents. The local weight is computed from the terms in the given document or query. The global weight, however, is based on the document collection, regardless of whether we are weighting documents or queries. Normalization is done after the local and global weighting of the document vectors, but it is not necessary for the query vectors because it does not affect the relative order of the ranked document lists. Local term weighting schemes perform well if they follow the basic principle that terms with a higher within-document frequency are more relevant to that document. Generally, the binary weight (BINARY) and the within-document frequency (FREQ)[11] are used as local weights, given respectively by:
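In the usual notation these are the presence/absence weight and the raw frequency:

$$ L_{ij}^{\mathrm{BINARY}} = \begin{cases} 1, & f_{ij} > 0 \\ 0, & \text{otherwise,} \end{cases} \qquad L_{ij}^{\mathrm{FREQ}} = f_{ij} $$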

Where, fij is the frequency of term i in document j. Binary weights are typically used for query weighting, where terms appear only once or twice. The principal problem with these local weights is that BINARY does not differentiate between terms that appear frequently and terms that appear only once in a document, while FREQ gives too much weight to terms that appear very frequently. The logarithm offers a middle ground for adjusting the within-document frequency, because a term that appears ten times in a document is not necessarily ten times as important as a term that appears once in that document. The two modified local weighting formulas below are similar because each of them uses a logarithm.

Table 4: Initial local term weighting formulas

Table 5: Global term weighting formulas

They are the log weight (LOGA) and the normalized log weight (LOGN) (Table 4), given respectively by:
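Following Chisholm and Kolda[10], these are presumably:

$$ L_{ij}^{\mathrm{LOGA}} = \begin{cases} 1 + \log f_{ij}, & f_{ij} > 0 \\ 0, & \text{otherwise,} \end{cases} \qquad L_{ij}^{\mathrm{LOGN}} = \begin{cases} \dfrac{1 + \log f_{ij}}{1 + \log a_j}, & f_{ij} > 0 \\ 0, & \text{otherwise} \end{cases} $$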

Where, aj is the average frequency of the terms that appear in document j. Because LOGN is normalized by the (1 + log aj) term, the weight given by LOGN will always be lower than the weight given by LOGA for the same term and document; both are suitable as local weights for documents and queries. Another modified local weight, a middle ground between the binary and term frequency weights, is the augmented normalized term frequency (ATF1)[12]:
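The standard augmented form, consistent with the 0.5-to-1 range noted below, is:

$$ L_{ij}^{\mathrm{ATF1}} = 0.5 + 0.5\,\frac{f_{ij}}{x_j} \quad (f_{ij} > 0) $$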

Where, xj is the maximum frequency of any term in document j. ATF1 awards weight to a term for appearing in the document and then awards additional weight for appearing frequently; the formula varies only between 0.5 and 1 for terms that appear in the document. The global weights try to give a discrimination value to each term; many schemes are based on the idea that the less frequently a term appears in the whole collection, the better it discriminates between documents. A commonly used global weight is the inverse document frequency measure, or IDF, derived by Sparck Jones. We have used two variations, IDFB[12] and IDFP[13], given respectively by:
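These are presumably variants of the familiar IDF forms (the exact versions tabulated in Table 5 may differ, for example by an additive constant):

$$ G_i^{\mathrm{IDFB}} = \log_2\!\left(\frac{N}{n_i}\right), \qquad G_i^{\mathrm{IDFP}} = \log_2\!\left(\frac{N - n_i}{n_i}\right) $$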


Where, N is the number of documents in the collection and ni is the number of documents in which term i appears. IDFB is the logarithm of the inverse of the probability that term i appears in a random document; IDFP is the logarithm of the inverse of the odds that term i appears in a random document. IDFB and IDFP are similar in that they both award high weight to terms appearing in few documents and low weight to terms appearing in many documents in the collection; however, they differ because IDFP actually awards negative weight to terms appearing in more than half of the documents in the collection, whereas the lowest weight of IDFB is one. In addition, we have used the entropy weight (ENPY)[14,15], given by:
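A common form of the entropy weight, consistent with the description below, is the following, where pij denotes the share of term i's occurrences that fall in document j:

$$ G_i^{\mathrm{ENPY}} = 1 + \sum_{j=1}^{N} \frac{p_{ij}\,\log_2 p_{ij}}{\log_2 N}, \qquad p_{ij} = \frac{f_{ij}}{F_i} $$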

Where, Fi is the frequency of term i throughout the entire collection. Entropy is a useful weight because it gives higher weight to terms that appear several times in a small number of documents.

We also use the global frequency IDF weight (IGFF) (Table 5), given by:
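IGFF is presumably the ratio of the collection frequency of term i to its document frequency:

$$ G_i^{\mathrm{IGFF}} = \frac{F_i}{n_i} $$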

This weight often works better when combined with different global weights on the query vector (Table 8-11).

The third component of the weighting scheme is the normalization factor. It is useful to normalize the document vectors, so the documents are retrieved independently of their lengths.

Table 6: Diversified local term weighting formulas

Table 7: Diversified global term weighting formulas

Cosine normalization (COSN) is a popular form of normalization, given by:
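In its usual form, COSN divides the weighted document vector by its Euclidean length:

$$ N_j^{\mathrm{COSN}} = \frac{1}{\sqrt{\sum_i \left(G_i\, L_{ij}\right)^2}} $$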

With COSN, longer documents are given smaller individual term weights, so that shorter documents are favored over longer ones in retrieval. Pivoted Unique Normalization (PUQN)[16] is a relatively new normalization method used to correct this problem of favoring short documents, given by:
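One common statement of pivoted unique normalization, following Singhal et al.[16], is the form below, where uj is the number of unique terms in document j, ū is the collection-wide average of uj (the pivot) and s is the slope, typically 0.2 (these symbols are ours):

$$ N_j^{\mathrm{PUQN}} = \frac{1}{(1 - s)\,\bar{u} + s\,u_j} $$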

The basic principle behind PUQN is to correct the discrepancy, caused by document length, between the probability that a document is relevant and the probability that the document will be retrieved.

CONSOLIDATION OF THE DIVERSIFYING TERMS WEIGHTING FORMULAS

Here, the diversification and consolidation of various term weighting methods and their impact on the datasets are explained (Table 6). Two diversified local weighting formulas are the changed-coefficient ATF1 (ATFC) and the augmented average term frequency (ATFA), given respectively by:
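The forms below follow the report of Chisholm and Kolda[10], on which these weights appear to be based; the exact coefficients used here may differ:

$$ L_{ij}^{\mathrm{ATFC}} = 0.2 + 0.8\,\frac{f_{ij}}{x_j}, \qquad L_{ij}^{\mathrm{ATFA}} = 0.9 + 0.1\,\frac{f_{ij}}{a_j} \quad (f_{ij} > 0) $$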

ATFC is developed using a generalized version of ATF1.

The changed-coefficient ATF1 works well because it assigns weight to a term merely for appearing in a document and then adds more weight if the term appears frequently in the document. ATFA is similar to ATF1 but is normalized differently: ATF1 is normalized by the maximum within-document frequency of a particular document, whereas ATFA is normalized by the average within-document frequency of the document, and the coefficients are different. ATFA gives more weight to a term for just appearing and adds less weight when a term appears frequently. Note that the maximum value of ATFC is one, whereas one is the average value for ATFA. Another new local weight is the augmented log (LOGG), a variation of ATFC, given by:
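Taking ATFC and replacing fij/xj with log(fij + 1), as described in the next paragraph, gives the presumable form:

$$ L_{ij}^{\mathrm{LOGG}} = 0.2 + 0.8\,\log(f_{ij} + 1) \quad (f_{ij} > 0) $$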

We simply modified fij/xj to log(fij + 1) because the logarithm seems to be a better local weight than the raw within-document frequency. Note that Lij can now be greater than one. Another new local weight is the square root weight (SQRT), given by:
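The form reported by Chisholm and Kolda[10], consistent with the development described below, is:

$$ L_{ij}^{\mathrm{SQRT}} = \begin{cases} \sqrt{f_{ij} - 0.5} + 1, & f_{ij} > 0 \\ 0, & \text{otherwise} \end{cases} $$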

In developing SQRT, we modelled the formula on LOGA (Table 4), a top performer (Table 8-11) among the established local weight formulas. Looking at the graph of LOGA, we noted that SQRT would have a similar shape, although as fij gets large, SQRT takes a larger value than LOGA. We have three new global weights (Table 5); the first is the log-global frequency IDF (IGFL), given by:
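IGFL is presumably the logarithm of the IGFF ratio:

$$ G_i^{\mathrm{IGFL}} = \log\!\left(\frac{F_i}{n_i} + 1\right) $$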


Table 8: MED data testing under VTWS

Table 9: CISI data testing under VTWS

IGFL is simply a combination of the IDF and IGFF weights. We had also observed that the IGFF weight was working well (Table 8-11).

The second new global weight is square root global frequency IDF (IGFS), given by:
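Consistent with the remark below about subtracting a number smaller than one, IGFS presumably takes the form used by Chisholm and Kolda[10]:

$$ G_i^{\mathrm{IGFS}} = \sqrt{\frac{F_i}{n_i} - 0.9} $$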


Table 10: CARN data testing under VTWS

Table 11: CACAM data testing under VTWS

Like IGFL, IGFS is a combination of formulas. We found that subtracting a larger number from Fi/ni improved performance, but we did not subtract one because that could produce a global weight of zero for some terms.

The third new global weight is the incremented global frequency IDF (IGFI), given by:
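IGFI is presumably IGFS incremented by one:

$$ G_i^{\mathrm{IGFI}} = \sqrt{\frac{F_i}{n_i} - 0.9} + 1 $$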

For the local weights we had observed that adding one to a formula significantly improves its performance, so we thought the same might carry over to the global weights. Since IGFS already performed well, we tried adding one to it and the result was IGFI.

RESULTS AND DISCUSSION

To evaluate the reliability of the consolidation of diversified term weighting approach, we implemented the vector space model in C and tested the proposed methods on several datasets that include the correct answers (relevance judgements). For a given term weighting method, we computed the similarities between the documents and each query in the test collection and returned a list of documents ranked in order of their similarity scores.

Four sets of well-known data (MED, CISI, CARN, CACAM) were used to perform the whole set of experiments. Two scores were computed: the Interpolated Average Precision (IAP) and the top twenty-three (from the highest to the lowest values). The new diversified term weighting formulas are denoted by an asterisk (*). From the testing results (Table 8-11), we found that the consolidation of these new diversified term weighting methods offers an improvement over the fundamental local and global methods.

The new weights work well both in combination with the other new weights and with the fundamental weights. The particular pairing of local and global weights also affects performance: a given local weight may perform well when combined with one global weight but poorly when combined with another.

CONCLUSIONS

IR performance is always comparative and model selection is paramount. In the real world, designing an IR system means reformulating term weighting and similarity measurement. For the last few years, retrieval performance has depended largely on upgrading conventional term-weighting schemes[8,9]; this concept alone no longer offers a good recipe for today's problems, which must be solved in intelligent ways. From the testing results of the various term weighting schemes (Table 8-11), we found that the consolidation of the new diversified term weighting methods increased retrieval performance dramatically. We believe that our proposed methods could be used as a complement to others that work better on high-precision tasks. This is still important because, if a user understands how to intensify the selected model for data retrieval purposes, a great deal of time can be saved.

REFERENCES

  • Singhal, A., 2001. Modern Information Retrieval: A Brief Overview. Google Inc., California


  • Hiemstra, D. and S. Robertson, 2001. Relevance Feedback for Best Match Term Weighting Algorithms in Information Retrieval. Microsoft Research, Cambridge, UK


  • Jung, Y., H. Park and D. Du, 2000. An effective term weighting scheme for IR. CS Technical Report, University of Minnesota.


  • Keogh, E., K. Chakrabarti, S. Mehrotra and M. Pazzani, 2002. Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans. Database Syst., 27: 188-228.


  • Moffat, A. and J. Zobel, 2002. Information retrieval systems from large document collections. NIST Special Publication No. 500225, pp: 85-93. http://cat.inist.fr/?aModele=afficheN&cpsidt=2484605.


  • Shankar, S. and G. Karypis, 2000. Weight adjustment schemes for a centroid based classifier. CS Technical Report TR00-035, University of Minnesota, Minnesota.


  • Kowalski, G., 1997. Information Retrieval Systems: Theory and Implementation. 1st Edn., Kluwer Academic Publishers, Norwell, MA, USA, pp: 296


  • Lee, D. and H. Chung, 1997. Document Ranking and the Vector Space Model. HKUST, Hong Kong, China


  • Sanderson, M. and C.J. van Rijsbergen, 1999. The Impact on Retrieval Effectiveness of Skewed Frequency Distributions. ACM, USA, pp: 440-465


  • Chisholm, E. and T.G. Kolda, 1999. New term weighting formulas for the vector space method in information retrieval. Technical Report of Oak Ridge National Laboratory, Oak Ridge, TN 37831-6367.


  • Gregory, B., 1992. Information Space Gets Normal. Frakes and Baeza-Yates, USA, pp: 372-375


  • Salton, G. and C. Buckley, 1988. Term-weighting approaches in automatic text retrieval. Inform. Process. Manage., 24: 513-523.


  • Croft, W.B. and D. Harper, 1979. Using probabilistic models of document retrieval without relevance information. J. Documentat., 35: 285-295.


  • Greiff, W.R. and J.M. Ponte, 2000. The maximum entropy approach and probabilistic IR models. ACM Trans. Inform. Syst., 18: 246-287.


  • Pavlov, D., H. Mannila and P. Smyth, 2000. Maximum entropy techniques for analyzing large transaction datasets. Project Report. University of California, Irvine, CA 92697-3425.


  • Singhal, A., C. Buckley, M. Mitra and G. Salton, 1995. Pivoted document length normalization. Technical Report TR 95-1560, Cornell University, Ithaca, NY, USA.
