Abstract: Background: A huge reliance on computer usage in everyday life leads to the continuous increase of large data applications in the form of textual data. The data are reposited to produce meaningful information. Therefore, databases become a backbone in most application software for organizing data into structured form. The structured information provides users with comprehensible knowledge. However, dealing with a large amount of textual data leads to two basic issues; insufficient query processing performance and inaccurate information retrieval. Attempts have been made to resolve both issues by database clustering techniques and textual document clustering. Nevertheless, most of the attempts require several stages of tedious programming scripts in constructing software applications that are external to databases. Materials and Methods: Therefore, this study proposes a Textual Virtual Schema Model (TVSM) to structure extracted textual data, while performing automatic column based information clustering in the internal structure of a relational database. Furthermore, a similarity measurement method is introduced to obtain high accuracy data clusters. An experiment has been conducted on textual Reuterss corpus, WAP and classic dataset. Then, the clustering results are validated by measuring F-measure, entropy and purity. Results: The results show linkages between structured textual data and unstructured information, high performance of query processing and time improvement in document clustering with accurate clusters. Conclusion: This model envisages a beneficial and useful approach for various domains that involves a large amount of textual data such as document clustering, topic detecting and tracking, document summarization, personal data management and information retrieval.
INTRODUCTION
A huge reliance on computer applications in everyday life leads to the continuous increase of large textual data usage. Commonly, textual data will increase in size at a tremendous rate1. The data is a source of knowledge transfer that can be found in online media such as news articles, discussion forums and personal pages. Such data normally contains unstructured information which is vital and necessary for many applications2. However, it is hard to discover, detect and extract the useful information.
Information extraction techniques play a significant role in performing the extraction of structured information (data). Many approaches have been introduced which can be categorized into a knowledge engineering/rule-based system and training-based3-5. In most cases, the extracted information are reposited in an organized form in relational databases6,7.
There are many issues in managing unstructured textual data in relational database8,4. The most critical issue is converting such huge unstructured data8 into structured1. There are several trends in managing unstructured data in a database that include developing new data model9-11 within the database and query based techniques in database applications. Query based techniques can be in the form of "Extract-Then-Query"12,13Query-Then-Extract" or "Keyword Search14,15. Recently, another trend has emerged for a new generation of databases called NoSQL (Not only SQL known as non-relational databases). These databases have been concerned with managing unstructured data in distributed environments. The examples of such databases include Cassandra and DBmond.
Relational database (RDB) stores structured information and provides significant knowledge to users. It is the best repository for data retrieval, deletion and modification8. Textual data (structured or unstructured) are stored in a systematic manner based on meta-data. However, the increase in textual data raises two problems; insufficiency query processing performance and inaccurate query results. First, insufficiency query performance has gained more attention in reducing access of a secondary storage (hard disk) for data retrieval. Second, inaccurate results in query due to a massive unstructured information within the textual data. Query performance can be enhanced by database partitioning techniques that concern on meta-data (attributes and records)16.
Currently, there are two methods in order to retrieve the required structured information of actual data in the database. The first method is via an extra application (external to the database) that performs a database query to retrieve the stored unstructured information. After gathering all the required information, document clustering techniques17 are carried out to cluster and convert such information into a structured form for later retrieval. The second method is by having a clustering application (internal to the database) that gathers all the information and performs document clustering in a batch mode for database query. Both methods are time consuming and perform same clustering process18,19 repeatedly6.
The common clustering techniques for document clustering are term frequent20-28, concept-based22,29-31, named entity-based32-34 and classical document clustering35. Clustering is desirable in document organization, especially for dynamic information environments such as the world wide web or stream of newspaper articles36,37.
In this study, we introduce a novel technique, namely a Textual Virtual Schema Model (TVSM) for handling textual data in a relational databases extended38,39. Any textual data will be automatically extracted from online information sources and at the same time, automatic document clustering is performed on the actual data within the database structure. Therefore, this study contributes in three folds. Firstly, extracted unstructured data is transformed into structured data. Secondly, the actual data in the database are clustered for faster text mining. Thirdly, a novel similarity measure is introduced in the document clustering for query accuracy. As a case study, an experiment is conducted on Reuters data set with more than one thousand documents. This study is one of its kind which able to automatically extract and dynamically cluster any information before storing such information in the database. It is part of the internal structure of a relational database management system. Thus, TVSM produces faster query performance with better data accuracy as compared to the previous approaches.
MATERIALS AND METHODS
This study proposes a novel approach for managing relationship between actual data of unstructured information in RDBMS. It is introduced as a layer to be added to the database schema design in order to organize textual data called a Textual Virtual Scheme Model (TVSM). It can perform automatic semantic textual data linking and clustering on a storage medium for relational databases. In addition, TVSM discovers hidden semantic relation between textual documents. It can be developed as a package to be used in any database scheme. Furthermore, TVSM provides quick extraction, data arrangement and data clustering based on pattern similarities.
Fig. 1: | TVSM and traditional methods |
The clustering process on the textual documents is executed by using a proposed similarity measurement. Additionally, it achieves high quality data clusters and improves the efficiency of query processing through back end query clustering. Moreover, TVSM converts unstructured information into a structured data form. Figure 1 illustrated the difference between a normal logical model and TVSM.
The TVSM model consists of three levels, user, application and database. At the user level, there is no difference between the two because the raw data from a user is always necessary. However, at the application level, the clustering application is not needed in the TVSM model due to its automatic clustering when any record is created. In the database level, the normal model only store and provide data for extraction while the TVSM performs all clustering activities based on name entities and frequent terms for future data extraction.There are some commercial databases perform internal data clustering such as Oracle and Sybase, which can referred to as traditional textual data clustering. Figure 2a and b illustrated the difference in clustering between traditional textual data and TVSM methods in relational databases.
The traditional textual clustering works as for batch mode, where it performs clustering iterations repeatedly on the same dataset. In addition, it needs a lot of coding and tedious steps to perform the clustering process. Furthermore, traditional methods suffer from the high dimensionality of data due to referring to all terms exist in the textual document. Additionally, the number of clusters should be set as a predefine parameter prior to the clustering process. In contrast, TVSM performs textual data clustering. In addition, it does not need any predefine parameters from users. Furthermore, TVSM uses Named entity and most term frequent with minimum support of words40. User can execute SQL query to display content of the cluster with the percentage of similarity between textual documents or with cluster description. Cluster description is a set of named entities and term frequents that represent all textual documents. Often, the last step of clustering is cluster representation; the cluster representation can be partitional or hierarchical shape by using some simple SQL commands.
Fig. 2(a-b): | Clustering methods with/without TVSM, (a) Clustering in traditional methods and (b) Clustering with TVSM |
System architecture: The TVSM consists of two main components which are Data Acquisition (DA) as the first phase and its second phase is Textual Data Management (TDM) as shown in Fig. 3. In DA, ordinary users enter any format of textual data (structured or unstructured) to database table. The storing process involves several steps in TDM. In TDM, there are two main steps, term mining and data clustering. Term mining performs document pre-processing, named entity extraction, frequent term search and semantic linkage. Next, data clustering clusters the textual input data by executing two processes, similarity measure and cluster description.
Document pre-processing is a process of cleaning and rearranging the textual data. In document cleaning filtration, stop words removal, stemming and document representation is carried out. Filtration removes any format or noise from textual document such as HTML tags or XML. Stop words removal removes the list of stopwords such as a, an, the according to standard list41. Stemming converts words to source by using porter algorithm42. Document representation uses vector space model, which used for represent document in form of words and its frequencies with added columns for named entity and frequent term. The output of this process will be used in named entity extraction process.
Named entity extraction is a process to extract the named entity such as person, organization and place, based on NER-Stanford43,44. The extracted named entity is stored along with its frequencies. Next, frequent term search process mines and selects the frequent terms from textual data according to its frequencies known as minimum support. All selected frequent terms are used in next stage to get the semantic.
Semantic phase deals with the synonym for selected frequent terms from word net database45 that enables TVSM to discover the hidden semantic relation between textual data (documents). Thus, the synonym of selected frequent terms and named entity are used as input for matching process to find an existing textual data cluster. If the cluster found, it is considered as the selected cluster for that textual document. Otherwise, a new textual data cluster is created. The matching process is based on the similarity measure which will be discussed in this study.
Fig. 3: | TVSM system architecture |
The last process is adding the selected frequent terms and named entities to the cluster description. Cluster description is the collection of selected frequent terms and named entities that describe the textual data cluster. It is used in the process of similarity measure for incoming text data.
Mathematical framework: Clustering textual documents requires similarity measure for producing a highly accurate results46,47. Traditionally, the similarity measure is usually performed by statistical methods that based on the VSM, such as cosine similarity48, Euclidean distance and Jaccard coefcient45. In addition, the similarity can also be measured by frequent item/term set in document25,27. However, term frequent suffers from a high dimensionality of data that is later reduced by maximal frequent terms25,49. All the aforementioned similarity measures concentrate to solve a high dimensionality problem but the "goodness" of data clusters quality is not considered. Therefore, Malik et al.50 and Shehata et al.51 focus on producing a high quality of textual data clusters. Malik et al.50 use a closed interesting terms while Shehata et al.51 use concept-based approach. Another similarity measures focus only on named enitiy32-34,52. Montalvo et al.52 use a named entity to cluster bilingual textual documents and their results show that named entity outperforms the results of traditional methods. Meanwhile, Cao et al.34 concentrate on the ambiguity of the named entity, specifically on geographical information to show that the mentioned name belongs to the appropriate entities. Unfortunately, relying only on the named entity as the similarity measure is not sufficient enough to produce a high accuracy of textual document clusters53. Therefore, we propose a new similarity measure based on all available named entities along with the frequent terms in textual document (data). Our approach ensures that the quality of textual document clusters can be obtained.
In the clustering process, there a collection of documents, denoted by D that needs to be clustered into one or more clusters represented by C. The documents in a cluster should similar properties. Thus, clustering process depends on similarity measurement between pair documents or document clusters. Our similar measurement can be represented by the following terms.
Definition 1: There exists a document cluster (C) and a collection of document (Dall) where, Dall can be clustered into one or several C. Therefore, Dall = C1, C2, C3, ......, Cn, where, n is the total number.
Definition 2: There exists a document cluster (Ci) and textual document (D) where, Ci consists of one or several D. Therefore, Ci = D1, D2, D3, ....., Dm, where, m is the total number of documents in a cluster.
Definition 3: There also exists a set of words (W) in Di that can be represented as Di = W1, W2, W3......, Wr, where, W is word/term and r is the number of words.
Definition 4: Some of the words (Wselected) have repeated occurrence in Dall and they can be Frequent Term (FT) or Name Entity (NE). All words that represent the FT is Wfrequent and words for name entity name is Wname. Therefore:
Wfrequent⊂Wselected and Wname⊂Wselected
FT = Wfrequent = {W1, W2, W3, ...., Ws} and NE = Wname = {W1, W2, W3, ...., Wt}
where, s and t are the total number of words in the respective sets FT∩NE = Ø. In other words:
∃W∈FT∪NE
Definition 5: Some of the frequent term (Wfrequent) have Maximal Frequent Terms (MFT) (WMaxFrequent) in textual document (Di), Therefore:
FT = Wfrequent = MFT = WMaxFrequent = {W1, W2, W3, ...., Wq}, WMaxFrequent⊂Wfrequent
where, q is the total number of maximum frequent words in the respective sets.
Definition 6: Based on the Maximal Frequent Terms (MFT) and Name Entity (NE), a cluster description (P) is constructed to describe the actual data cluster. According to definition 1, Dall can be clustered into one or more C. Thus, Dall has more than one cluster descriptions represented as follows:
Dall = {P1, P2, P3, ......, Py}
where, y is the total number of cluster description.
Based on the definitions the similarity measure is constructed.There are two phases in performing the similarity measure; creating and retrieving data clusters. In creating data clusters, two types of similarity measures are used for document-to-document and document-to-cluster descriptions, in which either one or both FT and NE can be applied. All similarity measures provide S (FT), which is the semantic (synonym) of frequently occurring words. The S (FT) is provided by using the WordNet database.
In measuring similarity of a document-to-document, clustering is executed on a pair of textual documents with minimum support of words (The number of words determine whether textual document belongs to the existing cluster, where the words should exist in both, Di and Dj/CDESj, this is called minimum support), as indicated in mathematical Eq. 1. The similarity measure between document (Di) and document (Dj) with all frequent terms can be represented as the following:
Sim (Di, Dj) = (S (TF) ∨ NE)Di∩ (S (TF) ∨ NE)Dj≤min_support |
(1) |
The clustering process can be executed in only a maximal number of occurrences in frequently terms to produce good-quality data clusters, as indicated in mathematical Eq. 2. The similarity measure between document (Di) and document (Dj) with maximal frequent terms can be represented as the following:
Sim (Di, Dj) = (S (TFMax) ∨ NE)Di∩ (S (TFMax) ∨ NE)Dj≤min_support |
(2) |
In document-to-cluster description, clustering process is executed on a single textual document with the cluster description that holds the FT and NE that has the minimum support of words, as indicated in mathematical Eq. 3. The similarity measure between documents (Di) and cluster description (Pj) with all frequent terms can be represented as the following:
Sim (Di, CDESj) = (S (TF) ∨ NE)Di∩ (S (TF) ∨ NE)Pj≤min_support |
(3) |
The clustering process can be executed in only a maximal number of frequently occurring terms to produce good-quality data clusters, as shown in mathematical Eq. 4. The similarity measure between documents (Di) and cluster description (Pj) with maximal frequent terms can be represented as the following:
Sim (Di, Pj) = (S (TFMax) ∨ NE)Di∩ (S (TFMax) ∨ NE)Pj≤min_support |
(4) |
Conversely to creating data clusters, retrieving data clusters can be used when users retrieve textual data. Equation 5 represents the percentage similarity between textual document description and cluster description. The percentage similarity between document and cluster description can be weighed as the following:
(5) |
where, v, g number are frequent terms and number of named entity respectively.
RESULTS AND DISCUSSION
There are two experiments conducted to evaluate performance and accuracy of textual document clustering for TVSM. The TVSM developed by Oracle 11G and Java Development Kit (JDK) 1.6.
Table 1: | Summary of data description |
The experiment is performed on a PC with AMD Phenom ii x4 B93 2.8 GHz and 4 GB RAM operating under windows 7. In order to compare TVSM with the most popular clustering algorithms for document clustering such as k-mean, bisecting k-mean (BKM), agglomerative hierarchical clustering with UPGMA, Oracle Text and frequent term-based hierarchical clustering (FTHC)21, the RapidMiner54 tool and CLUTO55,56 kit tool are used to extract and cluster textual documents from three different datasets.
Dataset: The experiment is conducted on three types of datasets: Reuters news stories (Reuters corpus, volume 1) (RCV1)57, a classic dataset58 and Web Application Project (WAP). The RCV1 contains hundreds of thousands of textual documents in many different categories that comprise news articles. The selected collection of textual documents can be classified into six categories: Economics, corporate (industrial), government (social), politics, market and sports. Classic datasets, which contain abstracts of papers are classified into four categories, namely, CACM, CRAN, CISI and MED. The WAP dataset, which contains a variety of news articles consisting of 20 classes. Table 1 shows a summary of all dataset descriptions. The evaluation process of VSM can be viewed from two perspectives: Quality (excellence) of textual data clusters and efficiency (performance).
Cluster quality: Cluster quality is measured by the overall F-measure (F-score)59, entropy46,51 and purity28,46, which are these the standard measures for cluster quality. Equation 6 represented the calculated overall F-measure on the basis of the obtained F-measure from Eq. 7, which is a mix of recall and precision as Eq. 8 and 9. These measures generate accurate results in information retrieval:
(6) |
(7) |
(8) |
(9) |
where, Mij denotes the number of documents of class i in cluster j, N i denotes the total number of documents in class i and Nj denotes the total number of documents in cluster j. The second measurements that used is entropy. Entropy measure homogeneous and distribution of documents over all data clusters. The total entropy can be calculated from mathematical Eq. 10, ej can be calculated from mathematical Eq. 11:
(10) |
(11) |
where, Pi, j is probability of member of cluster j belong to class i, the entropy is the precision. It can be calculated as the following mathematical Eq. 12:
(12) |
The third measures is purity, it assesses the purity of data clusters. It can be calculated by the following mathematical Eq. 13:
(13) |
where, nj is number of documents in cluster j and N denotes total number of documents in all data clusters and purityj is entropy of cluster j. The maximize F-measure and minimize entropy that shows the best and high quality of data clusters. In addition, the high value of purity is indicates better clustering process, where Purityj = Maxp (i, j).
Based on the mathematical equations the data cluster quality is measured. Table 2 shows the comparison of cluster qualities.
In the performance of data cluster query retrieval experiment, the textual documents are divided into seven groups to differentiate the performance of creating data clusters on each group. These groups contain 1000, 2000, 4000, 6000, 8000, 10,000 and 50,000 textual documents. All the textual data clustering algorithms used in the experiments predefine the number of clusters that is the user should first enter the number of clusters.
Fig. 4: | Quality of data clusters |
Table 2: | Assessment of data cluster quality |
In this study, the predefined number of clusters is 5, 10 and 20. However, TVSM does not require such predefinition. Performance is measured on the basis of the Reuters dataset and the unit of measurement for execution time is seconds. Table 3 shows the performance based on retrieving data clusters which created using TVSM and most common used textual data clustering algorithms.
The result of the experiment shows that the performance of TVSM outperforms K-mean, BKM, UPGMA, Oracle text and FICH clustering algorithms. Due to TVSM works as incremental clustering process, there is no need to re-cluster textual data every time. On the other hand, the goodness (quality of clusters) reach to better score as indicated in Fig. 4 and 5. Specifically, when dataset contains news articles, they often hold named entities. The TVSM has been executed used two similarity measures are document to document that represented by TVSM1 and document to cluster description that represented by TVSM2.
Figure 4 and 5 show the comparison of cluster qualities for the most popular algorithms and the proposed model.
Fig. 5(a-b): | Quality of data clusters, (a) Classic dataset and (b) WAP dataset |
Table 3: | Data cluster query retrieval performance |
Using the F-measure, entropy and purity, experiments are conducted on the basis of the two proposed similarity measures. In the first similarity, the results show that the TVSM has an F-measure higher than those of BKM and FICH when executing the classic dataset. The second similarity measure achieves a higher F-measure when executed on Reuters news articles due to it clustering based on words that are included in cluster description. Entropy and purity also show that TVSM produces high-quality clusters by generating high entropy and low purity results. All of the experimental measurements for cluster quality were conducted on three datasets (Reuters WAP and classic) with 1000 documents, which are a mix of all classes The BKM and UPGMA cannot function on large datasets.
CONCLUSION
In this study, we have presented the study of data extraction, unstructured data management and textual clustering on relational DBMS and desktop application. Due to the common phases in the data clustering process that requires tedious coding and scripting. In addition, relational database contains huge unstructured data thus user encounter difficulty in finding useful information. We propose a Textual Virtual Scheme Model (TVSM) for automatic clustering for textual data at column level in relational database. The TVSM will provide quick extraction, data arrangement and grouping for data similar pattern, find linkage between textual documents and improve the query processing performance in relational databases. In addition, converts unstructured information into structured by semantic linking. The system architecture can be applied in many application areas such as topic detection and tracking, textual document clustering, news clustering and web record. The results of experiments showed that quality of data clusters is acceptable when F-measure, entropy and purity are used. In addition, entropy and purity measurement are applied to evaluate quality of data clusters.
SIGNIFICANT STATEMENT
The proposed model focuses on an automatic and incremental strategy to transforms unstructured textual data into structured form. As a result of this strategy, the textual data is automatically represented by the most common words before storing it to database records. By using these common words the clustering assignment process can be achieved incrementally. In clustering assignment processes,the textual document belongs to an existing data cluster. Once the existing textual data cluster is determined, the data are considered as a cluster for the textual document. Otherwise, the process considers creating new textual data clusters. In addition, these common words can be a guide for the next textual document to be stored if the documents bear any similarity with textual data.
ACKNOWLEDGMENT
This study based on work support by Universiti Teknologi MARA (UiTM), Malaysia. The author would like to thanks UiTM.