ABSTRACT
The goal of this study was to develop a dictionary searching technique designed specifically for the World Wide Web, which we call "two-layer dictionary retrieval". Through a "signature extraction process" that extracts a "signature" from every vocabulary entry in the dictionary, two-layer dictionary retrieval generates a dictionary signature file and distributes it to every user. When the user wants to look up an entry in the dictionary, the same process extracts the signature of the desired entry, and the user first searches for that signature in the downloaded dictionary signature file. Our technique not only minimizes the area of the dictionary that must be searched, but also minimizes the space required to establish an index, greatly improving the efficiency of dictionary retrieval over the Internet. According to the experimental results, our method is five times more efficient than the other methods available.
DOI: 10.3923/jas.2006.1066.1070
URL: https://scialert.net/abstract/?doi=jas.2006.1066.1070
INTRODUCTION
Since 1995, the year in which the World Wide Web began to advance so rapidly, more and more data files have been created each day. Quickly retrieving the desired information for users from such enormous data files is an important issue in dictionary retrieval.
A number of dictionary searching techniques have been proposed (Angell et al., 1983; Burkhard and Keller, 1973; Doster, 1977; Hall and Dowling, 1980; Kohonen, 1987; Morgan, 1970; Owolabi, 1996; Owolabi and McGregor, 1988; Salton and Wong, 1978; Szanzer, 1969; Szanzer, 1973), which primarily concern themselves with dictionary organization; that is, their strategy lies in how the index is organized. For example, the length of a vocabulary entry may be taken as the basis for classification (Morgan, 1970; Szanzer, 1969), or its first letter may be (Hall and Dowling, 1980; Szanzer, 1973). As O. Owolabi's experiments showed (Owolabi, 1996), both of these classification methods result in an uneven distribution that impedes searching efficiency. Owolabi therefore combined the two: a primary index is classified by the length of an entry, and a secondary index then classifies by first letter within each length class. This corrects the uneven distribution, but raises another issue: the space required for indexing grows. For example, if the primary index covers lengths from 2 to 50, giving 49 classes, and each length class is subdivided by first letter into 26 classes (A to Z), then the secondary index requires a total of 1,274 classes; the resulting large file is difficult to transfer over the Internet. Our goal in this study is therefore to achieve an even distribution over the dictionary while minimizing the additional space required to establish the indexes.
Furthermore, our technique not only greatly improves the efficiency of dictionary retrieval over the Internet, but can also be applied to spelling checking: it can quickly consult the stored lexicon to verify whether a word is actually spelled correctly, i.e., whether it appears in the dictionary.
The concept of two-layer dictionary retrieval: This study proposes a two-layer dictionary organization method, also called two-layer dictionary retrieval, whose architecture is shown in Fig. 1.
Table 1: Statistical table of the selected medical dictionary, listing the number of occurrences of each distinct character
Fig. 1: The architecture of two-layer dictionary retrieval
The architecture in Fig. 1 can be applied to any current type of dictionary, such as a medical dictionary, Webster's dictionary, etc. During pre-processing, the signature extraction process first extracts a dictionary signature file from each dictionary; this file can then be downloaded according to the user's specific needs. When the user wants to look up a vocabulary entry, he makes his initial retrieval directly on the specific dictionary signature file at layer 1. Since the dictionary signature file produced by the signature extraction process is small, this greatly reduces the cost of sending information back to the user over the Internet.
There are only two possible circumstances of retrieval. In the first, the signature of the entry to be retrieved, produced by the same signature extraction process, does not exist in the dictionary signature file. In this case the user receives a definite reply that the entry does not exist in the original dictionary. In the second, the signature does exist in the dictionary signature file, implying that the entry may exist in the original dictionary. The system then remotely queries the dictionary database at the back end for further retrieval at layer 2. Because the dictionary is pre-classified by signature, the search can be restricted to a few specific areas of the dictionary without scanning the entire dictionary. This is what makes our dictionary retrieval well suited to current networks.
The signature extraction process
The encoding table of the dictionary: In the architecture of two-layer dictionary retrieval described in the last section, each vocabulary entry in the dictionary must first be replaced by group codes from the encoding table of the dictionary.
Table 2: The encoding table of the dictionary
Fig. 2: An example illustrating the process of replacing vocabulary entries with group codes according to Table 2
To ensure that the vocabulary can be distributed evenly over the group codes, the number of occurrences of each character must be recorded. Table 1 is a statistical table for the selected medical dictionary, listing the distinct characters and their numbers of occurrences in decreasing order. For the reason mentioned above, high-frequency and low-frequency characters are placed under the same group code to balance the distribution.
In the encoding table of the dictionary used in this paper, four group codes are pre-determined: 00, 01, 10 and 11. Following the order given in Table 1, the characters are distributed over the group codes; the result is shown in Table 2. From Table 2 we can see that the totals of the numbers of occurrences differ only slightly; in other words, the distribution of characters over the group codes is quite even. Example 1 shows how a vocabulary entry is replaced by its group codes. The whole process introduced in this section is shown in Fig. 2.
Example 1: Suppose east is an entry in the dictionary. Replacing east with group codes according to Table 2 gives e-00, a-01, s-10 and t-01, so the group-coded east is 00011001.
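This encoding step can be sketched in Python. The greedy table builder below is our own reading of the frequency-balancing idea (each character, most frequent first, goes to the currently least-loaded group code), and the four-character table is a toy excerpt chosen to reproduce Example 1, not the paper's actual Table 2:

```python
GROUP_CODES = ["00", "01", "10", "11"]

def build_encoding_table(frequencies):
    """Greedy balancing sketch: assign each character (most frequent
    first) to the group code whose running occurrence total is smallest,
    so high- and low-frequency characters end up mixed in each group."""
    totals = {g: 0 for g in GROUP_CODES}
    table = {}
    for ch, count in sorted(frequencies.items(), key=lambda kv: -kv[1]):
        g = min(GROUP_CODES, key=totals.get)   # least-loaded group so far
        table[ch] = g
        totals[g] += count
    return table

def encode(word, table):
    """Replace every character with its group code and concatenate."""
    return "".join(table[ch] for ch in word)

# Toy excerpt of Table 2, chosen so that Example 1 is reproduced:
TABLE2_EXCERPT = {"e": "00", "a": "01", "s": "10", "t": "01"}
print(encode("east", TABLE2_EXCERPT))  # -> 00011001
```

Note that `encode` assumes every character of the word appears in the table; a production version would need a policy for unseen characters.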
Fig. 3: The signature extraction algorithm
Dictionary signature file: After all vocabulary entries in the dictionary have been replaced with group codes from the encoding table, we must, in order to reduce the space required for the dictionary signature file, partition these group-coded entries and extract their signatures. Our design starts by partitioning each entry, and the number of partitions is C. Too few partitions lead to low distinguishability; too many take up too much space and defeat the purpose of signature extraction.
Fig. 4: Establishment of the index for C = 8
Since the group codes of different entries have different lengths, the partitioned sub-strings of different entries also differ in length. The signature extraction algorithm below is designed to extract a signature bit from each partitioned sub-string; the signature file of the dictionary is then composed of the signatures extracted from all group-coded entries. The signature extraction process is given as the algorithm shown in Fig. 3, and Example 2 is a simple illustration of its progress.
Example 2: Let TX = 001001100110110101 and set the number of partitions C = 4.
Step 1: |TX| = 18.
Step 2: d = (4 - (|TX| mod 4)) mod 4 = 2, so TX = 001001100110110101 + 00 = 00100110011011010100.
Step 3: TX is evenly partitioned into four sub-strings: TX1 = 00100, TX2 = 11001, TX3 = 10110 and TX4 = 10100.
Step 4: Count the occurrences of 0 and 1 separately in each of the four sub-strings and take the more frequent bit as the signature code of that sub-string; if 0 and 1 occur equally often, 1 is preferred. The result is K = 0110.
Step 5: The signature of TX is 0110.
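The five steps can be sketched as follows; this is a minimal Python reading of the Fig. 3 algorithm (zero-padding, even partitioning, majority bit per sub-string with ties going to 1), and the function name is ours:

```python
def extract_signature(tx, c):
    """Our reading of the Fig. 3 algorithm: pad TX with zeros up to a
    multiple of c, split it into c equal sub-strings, and take the
    majority bit of each sub-string (1 is preferred on a tie)."""
    d = (c - (len(tx) % c)) % c        # Step 2: zeros needed for padding
    tx += "0" * d
    width = len(tx) // c               # Step 3: length of each sub-string
    sig = []
    for i in range(c):
        sub = tx[i * width:(i + 1) * width]
        ones = sub.count("1")
        sig.append("1" if ones >= len(sub) - ones else "0")   # Step 4
    return "".join(sig)                # Step 5

print(extract_signature("001001100110110101", 4))  # -> 0110 (Example 2)
```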
Figure 4 shows how the dictionary signature file is generated by collecting the signature of every entry produced by the signature extraction process, and how these signatures, together with the entries they cover, are organized to establish the index. When the signature of the entry the user wants to look up exists in the dictionary signature file, the user can retrieve the entry from the established index according to that signature; in this way the overall dictionary searching efficiency is improved.
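A hypothetical end-to-end sketch of the Fig. 4 index and the two retrieval circumstances described earlier; the toy encoding table and helper names below are our own, not the paper's actual Table 2:

```python
from collections import defaultdict

# Toy encoding table of our own (not Table 2) and number of partitions.
TABLE = {"e": "00", "a": "01", "s": "10", "t": "01", "w": "11"}
C = 4

def extract_signature(tx, c):
    """Pad with zeros, split into c sub-strings, take majority bits."""
    tx += "0" * ((c - (len(tx) % c)) % c)
    width = len(tx) // c
    return "".join("1" if tx[i*width:(i+1)*width].count("1") * 2 >= width
                   else "0" for i in range(c))

def sig_of(word):
    """Group-code the word (as in Example 1), then extract its signature."""
    return extract_signature("".join(TABLE[ch] for ch in word), C)

def build_index(vocabulary):
    """Layer 2: bucket every entry under its signature (Fig. 4)."""
    index = defaultdict(list)
    for word in vocabulary:
        index[sig_of(word)].append(word)
    return index

def lookup(word, index):
    """Layer 1: a signature absent from the signature file is a definite
    miss; a present one narrows the search to one small bucket."""
    sig = sig_of(word)
    if sig not in index:
        return False              # first circumstance: definitely absent
    return word in index[sig]     # second circumstance: search the bucket
```

For instance, with `idx = build_index(["east", "seat", "teas", "wet"])`, `lookup("w", idx)` is rejected at layer 1 without touching the dictionary, while `lookup("east", idx)` finds its signature and searches only that bucket.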
RESULTS AND DISCUSSION
The experimental data were collected from a medical dictionary that contains 142,709 vocabulary entries and 2,156,957 characters; the average entry length is 16 characters. The number of partitions, C, can be chosen according to the size of the dictionary or the average entry length, bearing in mind that the dictionary signature file must be of a size suitable for sending to the user over the Internet. Here C ranges from 8 to 16, and the outcomes of the various partitionings are shown in Table 3. We randomly picked 3,000 entries for our experiment, of which 2,000 exist in the dictionary and the other 1,000 do not, and we averaged the results over these two circumstances of retrieval.
From Table 3, as C gets bigger, the larger number of signature indices not only shrinks the range that must actually be searched but also reduces the search time to under 1 millisecond. Yet this brings another disadvantage: the memory space occupied increases. Determining the value of C that best balances the number of signature indices against the memory space occupied is the critical point of our method. From these experimental results, the best value is C = 14 (shaded part): it occupies little memory while minimizing the number of words searched, the words searched amount to only 0.09% of the whole vocabulary and the search time stays below 1 millisecond. Another reason for picking C = 14 is that 60% of the entries in the dictionary are between 7 and 16 characters long. With 14 partitions, the partition points of an entry fall almost exactly on the boundaries between characters, so each signature bit is extracted from roughly one character; this enhances the representativeness of the index and thus increases the overall search efficiency.
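The trade-off described above can be observed on synthetic data. The sketch below uses random "words" and a toy cyclic encoding table (neither the medical dictionary nor the frequency-balanced Table 2) and simply reports, for each C, how many distinct signatures arise and the average bucket size:

```python
import random
import string
from collections import Counter

random.seed(0)

# Toy cyclic encoding table (a..z -> 00/01/10/11 in rotation); the
# paper's real Table 2 is frequency-balanced instead.
GROUP = {ch: format(i % 4, "02b")
         for i, ch in enumerate(string.ascii_lowercase)}

def extract_signature(tx, c):
    """Pad with zeros, split into c sub-strings, take majority bits."""
    tx += "0" * ((c - (len(tx) % c)) % c)
    width = len(tx) // c
    return "".join("1" if tx[i*width:(i+1)*width].count("1") * 2 >= width
                   else "0" for i in range(c))

def sweep(vocab, cs):
    """For each partition count c, report how many distinct signatures
    arise and the average bucket size (the range actually searched)."""
    for c in cs:
        buckets = Counter(
            extract_signature("".join(GROUP[ch] for ch in w), c)
            for w in vocab)
        print(f"C={c:2d}: {len(buckets):5d} signatures, "
              f"avg bucket {len(vocab) / len(buckets):.1f} words")

# Synthetic stand-in for the dictionary: ~10,000 random 'words'.
vocab = sorted({"".join(random.choices(string.ascii_lowercase,
                                       k=random.randint(4, 20)))
                for _ in range(10000)})
sweep(vocab, range(8, 17, 2))
```

Larger C yields more distinct signatures and smaller buckets, at the cost of a longer signature (and hence a larger signature file) per entry, which is exactly the tension behind the choice of C = 14.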
In addition, Table 4 compares our method with the previously mentioned methods.
Table 3: The impact of the chosen value of C
Table 4: Summary comparison of the four methods
These methods are Partition by String Length, Partition by First Letter, Two-Level Indexing and Two-Layer Dictionary Retrieval (C = 14). Partition by String Length and Partition by First Letter are commonly used dictionary classification methods, while Two-Level Indexing (Owolabi, 1996) had shown the best efficiency; that is why we chose these three methods for comparison with our two-layer dictionary retrieval. We compare the average number of words searched, the average search time and the false drop probability of the four methods in the circumstance where the entry to be retrieved does not exist in the dictionary, to see whether each method is effective at filtering the dictionary and reducing the search range. The formula for the false drop probability is as follows:
From Table 4 it may be concluded that the present method is far better than the other three dictionary search methods: the average range to be searched in the dictionary is less than 0.1% and the average search time is under 1 msec. As for the false drop probability, our method shows a lower value than the others, and even when a false drop does occur, the range to be searched is greatly reduced thanks to the better filtering capability, which in turn reduces the search time over the Internet.
CONCLUSIONS
In this study, we proposed a dictionary retrieval system appropriate for current distributed networks: two-layer dictionary retrieval. Thanks to the signature extraction concept, dictionary retrieval over the Internet no longer needs as much space for indexing, so the file to be transferred over the Internet is small; the method also reduces the false drop probability and, by means of its dictionary filtering capability, reduces the range to be searched in the dictionary, yielding much better overall efficiency of dictionary retrieval in Internet applications.
REFERENCES
- Angell, R.C., G.E. Freund and P. Willett, 1983. Automatic spelling correction using a trigram similarity measure. Inform. Process. Manage., 19: 255-261.
- Burkhard, W.A. and R.M. Keller, 1973. Some approaches to best-match file searching. Commun. ACM, 16: 230-236.
- Doster, W., 1977. Contextual postprocessing system for cooperation with a multiple-choice character-recognition system. IEEE Trans. Comput., 26: 1090-1101.
- Hall, P.A.V. and G.R. Dowling, 1980. Approximate string matching. ACM Comput. Surv., 12: 381-402.
- Owolabi, O., 1996. Dictionary organizations for efficient similarity retrieval. J. Syst. Software, 34: 127-132.
- Salton, G. and A. Wong, 1978. Generation and search of clustered files. ACM Trans. Databases Syst., 3: 321-346.