Automated Text Categorization (ATC) is the task of building software tools
capable of classifying text (or hypertext) documents under predefined categories
or subject codes. ATC has witnessed a booming interest in recent times, due
to the availability of ever larger numbers of text documents in digital form
and to the ensuing need to organize them for easier use. The dominant approach
is nowadays one of building text classifiers automatically by learning the characteristics
of the categories from a training set of pre-classified documents (http://mason.gmu.edu/~kersch/JIIS/Special/Issues/TextCategory.html).
Data mining is a relatively new technology that aims to find patterns in data. Similarly, text mining aims to find patterns in text. Some authors define it as the analysis of text in order to extract information that is useful for different applications. Text is mostly unstructured and formless, and it is relatively difficult to deal with in comparison to the data stored in databases.
Natural language corpora are primary sources of information about language use. They represent a huge linguistic knowledge bank that can be tapped through the use of various data analysis tools to discover trends, patterns or other linguistic phenomena which may be incorporated into other language processing tasks. For example, corpora can support detailed studies of how particular words are used, by providing extensive examples of natural language sentences in context. Information about word frequency, co-occurrence, collocations etc. can be derived from corpora and used to build statistical language models, for word sense disambiguation or speech recognition.
The Holy Quran exhibits what is called unity of subject. It is divided into 114 chapters (Surahs), and each chapter consists of a number of verses (Ayat). This study aims to classify any verse under one of a set of predefined subjects, since the Quran as a book is not organized by subject.
The algorithm is fully implemented in Microsoft Visual Basic. Visual Basic was chosen because it supports Unicode and therefore the Arabic language, so no Arabization software is needed.
Figure 1 shows a diagram of the major components of our system.
|| Major components of the system
Figure 2 describes the algorithm used in this study:
|| Algorithm of the system
The present methodology can be summarized by the following steps:
||Select the desired Sura (chapter) of the Holy Quran.
||Select the verse you want to classify.
||Subdivide the verse into features (keywords).
||Find the recurrences of the keywords (features) in other Surahs.
||For each verse extracted in the previous step, determine the subject that the verse is talking about.
||Collecting such information requires a Holy Quran corpus that contains each word along with the verse and Sura in which it is mentioned.
||The corpus in step 6 was built manually, relying on: http://www.alnoor-world.com/
||The system aims to build the following table (Table 1):
||Table 1 shows that we have to take the maximum summation over the subjects (for example S1), which indicates the subject or class of the verse.
||The general subjects that we found are consistent with the classifications of Muslim scholars.
|| Simple view of the resulting table
After the user selects the Sura (chapter) and the verse, the system normalizes the verse by removing diacritical marks, punctuation and stop words, and then parses the verse into separate tokens.
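The normalization step can be sketched as follows. This is only an illustration in Python, not the Visual Basic code of the system: the function names and the stop-word list are hypothetical, and diacritics are assumed to be the Unicode nonspacing marks (category `Mn`), which covers the Arabic harakat.

```python
import unicodedata

# Hypothetical stop-word list, for illustration only. The system described
# here relies on a separate stop-word remover whose list is not given.
STOP_WORDS = {"من", "في", "على", "إلى", "عن"}


def strip_diacritics(text):
    """Drop nonspacing marks (Unicode category Mn), e.g. Arabic harakat.

    The text is deliberately not decomposed first, so precomposed letters
    such as alef with hamza are left unchanged.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def normalize_verse(verse, stop_words=STOP_WORDS):
    """Return the verse's tokens without diacritics, punctuation or stop words."""
    text = strip_diacritics(verse)
    # Replace every punctuation character (categories P*) with a space.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # Split on whitespace and filter out the stop words.
    return [token for token in text.split() if token not in stop_words]
```

The tokens returned by `normalize_verse` play the role of the features (keywords) used in the later steps.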
The contents of our specialized corpus come from the alnoor-world web site. The corpus was constructed with the help of a freeware program from this site, which facilitates searching for any word in the Quran.
The system uses executable code written by Hilat E. to remove stop words during normalization.
This system depends on a Holy Quran corpus. Since such a corpus has not been built yet, we built a specialized corpus for the Fatiha and Yaseen Surahs (chapters).
Figure 3 shows the database tables and the entity relationships of the system:
|| Entity relationship of the system
This corpus is used to collect all the verses of the Holy Quran that contain a tokenized word (feature) from the verse under classification, together with the subject classification(s) of each verse according to Islamic scholars.
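A corpus lookup of this kind can be sketched with an inverted index from words to the verses that contain them. The row layout, the names `CORPUS`, `build_index` and `verses_for`, and all of the data below are placeholders chosen for illustration; the actual database schema is the one shown in Figure 3.

```python
from collections import defaultdict

# Hypothetical corpus rows: (sura, verse number, normalized tokens, subjects).
# Tokens and subject codes are placeholders, not real Quranic data.
CORPUS = [
    ("sura_a", 1, ["w1", "w2"], ["S1"]),
    ("sura_a", 2, ["w2", "w3"], ["S1", "S2"]),
    ("sura_b", 5, ["w3"], ["S3"]),
]


def build_index(corpus):
    """Map each word to the verses that contain it, with their subjects."""
    index = defaultdict(list)
    for sura, verse_no, tokens, subjects in corpus:
        # set() ensures a verse is listed once per word even if it repeats.
        for word in set(tokens):
            index[word].append((sura, verse_no, tuple(subjects)))
    return index


def verses_for(index, word):
    """Return every (sura, verse number, subjects) entry for a word."""
    return index.get(word, [])
```

Looking up each feature of the verse under classification with `verses_for` yields the subject labels that the later percentage computation needs.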
Afterward, assume that:
||The number of words remaining in the verse after normalization is N.
||The number of subjects these N words talk about is M.
||The total number of the predefined subjects of the holy Quran
||P(Si) represents the percentage of a specific subject relative
to the other resulting subjects and is computed by the following formula:
The system also uses a two-dimensional table of the verse's words (features) after normalization and the subject percentages of each word. This table correlates the main features (keywords) of a verse with other verses in the same chapter (Sura) and in other chapters (Surahs).
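One way such a table could be built is sketched below. The formula for P(Si) is not reproduced in this text, so the sketch assumes a plausible reading: for each word, the share of each subject among all subject labels of the verses that contain the word, expressed as a percentage. The function names and the input layout are hypothetical.

```python
from collections import Counter


def subject_percentages(subject_labels):
    """Percentage of each subject among the labels collected for one word.

    Assumed reading of P(Si): 100 * count(Si) / total number of labels.
    """
    counts = Counter(subject_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {subject: 100.0 * n / total for subject, n in counts.items()}


def build_table(labels_by_word):
    """Build the words-by-subjects table as a dict of dicts.

    labels_by_word maps each feature (word) to the list of subject labels
    of all corpus verses in which the word occurs.
    """
    return {
        word: subject_percentages(labels)
        for word, labels in labels_by_word.items()
    }
```

A subject that never occurs with a word is simply absent from that word's row, which is equivalent to a percentage of zero.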
In order to determine the class of the verse, the following computation has to be done based on Table 1:
||Sum the subject percentages over all the words in the verse, as shown in the following equation:
We will get 15 values (number of subjects M) from the above equation.
||The predicted class of the verse is the subject with the maximum summation.
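The computation above can be sketched as a column-wise sum over the table followed by a maximum. The equations themselves are not reproduced in this text, so this is a sketch under the assumption that the table is held as a dict of per-word subject percentages; the name `classify` and the tie-breaking behaviour are not taken from the original system.

```python
def classify(table):
    """Sum each subject's percentages over all words and pick the maximum.

    table maps each word to a dict of {subject: percentage}.
    Returns (best_subject, totals); best_subject is None for an empty table.
    """
    totals = {}
    for percentages in table.values():
        for subject, pct in percentages.items():
            totals[subject] = totals.get(subject, 0.0) + pct
    if not totals:
        return None, totals
    # On a tie, max() keeps the first subject encountered; how the original
    # system resolves ties is not stated.
    best = max(totals, key=totals.get)
    return best, totals


# Hypothetical table with two words and three placeholder subjects.
example = {
    "w1": {"S1": 50.0, "S2": 50.0},
    "w2": {"S1": 75.0, "S3": 25.0},
}
best_subject, subject_totals = classify(example)
```

In this example S1 accumulates 125.0 against 50.0 for S2 and 25.0 for S3, so the verse would be assigned to S1.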
Next, we show the execution of the system based on our algorithm. Figure
4 shows the interface of the system. The interface enables the user to choose
the Sura (chapter) and the verse.
|| User interface
Figure 5 shows the 2D table, along with the statistics related to the verse.
|| Classification details
In this study we have described the design and successful implementation of a new text classifier suitable for classifying different verses of the Holy Quran. The text classifier has been implemented using Microsoft Visual Basic 6.0.
This work needs a full corpus for the Holy Quran in order to get more precise results. The Yaseen Sura was selected due to its size and the variety of subjects it discusses.
The system has been tested on the Fatiha and Yaseen Surahs (chapters) and showed 91% accuracy in classifying their verses. The results of the system were compared with the classifications that Islamic scholars have assigned to the verses of the Quran.
The accuracy of this system can be improved substantially if a full corpus is built and a better stemmer is used.