Automated Text Categorization (ATC) is the task of building software tools
capable of classifying text (or hypertext) documents under predefined categories
or subject codes. ATC has witnessed a booming interest in recent times, due
to the availability of ever larger numbers of text documents in digital form
and to the ensuing need to organize them for easier use. The dominant approach
nowadays is to build text classifiers automatically by learning the characteristics
of the categories from a training set of pre-classified documents (http://mason.gmu.edu/~kersch/JIIS/Special_Issues/).
Sahih Bukhari is a collection of the sayings and deeds of the Prophet Muhammad (PBUH),
also known as the Sunnah. The reports of the Prophet's sayings and deeds are
called hadith. A hadith consists of two main parts: the Sanad (the chain of narrators) and the Matn (the text itself). Bukhari
lived a couple of centuries after the Prophet's death and worked extremely hard
to collect his hadith. Each report in his collection was checked for compatibility
with the Qur'an and the veracity of the chain of reporters had to be painstakingly
established. Bukhari's collection is recognized by the overwhelming majority
of the Muslim world to be one of the most authentic collections of the Sunnah
of the Prophet (PBUH) (http://www.usc.edu/dept/MSA/).
Bukhari spent sixteen years compiling it and ended up with 2,602 hadith without
repetition (9,082 with repetition). His criteria for acceptance into the collection
were amongst the most stringent of all the scholars of hadith (http://www.usc.edu/dept/MSA/fundamentals/).
Each hadith is preceded by a chain of the names of those who have transmitted it in each generation, leading all the way back to the companion who reported it from the Prophet. These isnads (Sanad) guarantee the authenticity and verbal accuracy of hadith. For the first few generations, the hadiths are believed to have been transmitted mainly orally rather than in writing.
Many algorithms and techniques have been applied to text categorization and classification over the years, including decision tree learning, Bayesian learning, nearest neighbor learning and artificial neural networks. Early work of this kind may be found in Hassan et al. and Bensaid et al.
A good comparative study of document categorization algorithms can be found in Yang and Liu. Hassan et al. also present experimental results on document clustering and classification obtained on an Arabic corpus using statistical methods.
Concerning Arabic, one automatic categorizer has been reported to have been
put into operational use to classify Arabic documents; it is referred to as
"Sakhr's categorizer" (http://www.Sakhr.com).
Our approach extracts the main terms from each hadith and computes their term frequencies. The TF/IDF (Term Frequency-Inverse Document Frequency) method is used for text searching and term weighting: document weights are computed for the selected terms and then used to classify non-vocalized sayings, after the inserted hadith have been filtered out.
Our corpus contains 8 books, stored in 8 separate files (one book per file).
The present methodology can be summarized as follows:
Table: Term frequency table
tf: the frequency of a term in the document
df: the number of documents in the corpus that contain the term
N: the number of documents in the corpus
Table: Term frequency table with threshold
Table: Cumulative weight table
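To make the quantities tf, df and N concrete, the following is a minimal Python sketch of TF/IDF term weighting. It assumes the common form weight = tf * log10(N / df); the exact formula, the log base, and the function name `tf_idf_weights` are assumptions and are not taken from the original Visual Basic implementation. The English tokens are hypothetical stand-ins for stemmed Arabic terms.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return one {term: weight} dict per document.

    documents: list of token lists (assumed already cleaned and stemmed).
    Assumed weighting: tf * log10(N / df).
    """
    n_docs = len(documents)

    # df: number of documents that contain each term
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    weights = []
    for tokens in documents:
        tf = Counter(tokens)  # tf: raw count of each term in this document
        weights.append(
            {term: count * math.log10(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


# Hypothetical toy corpus of three tokenized documents
docs = [
    ["zakat", "charity", "zakat"],
    ["prayer", "mosque"],
    ["charity", "prayer", "fasting"],
]
weights = tf_idf_weights(docs)
```

Under this weighting, a term that occurs often in one document but in few documents overall receives a high weight, while a term that occurs in every document receives a weight of zero.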
The algorithm used in this paper was implemented in the Microsoft Visual Basic programming language, which supports Arabic text, provides a variety of string functions and handles files conveniently.
Figure 1 shows the sequence of this classification process:
Figure 1: Classification process
Natural language processing systems rarely give fully accurate results. The accuracy of our system depends heavily on the accuracy of the stop-word list and the stemmer, each of which has its own flaws.
We chose hadiths that contain frequent terms; hadiths that do not contain such terms, and whose classification therefore depends on their semantics, were skipped.
In Sahih al-Bukhari, the same hadith may belong to more than one book. Our system handles such cases by displaying the two books with the highest ranks.
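The two-book ranking can be sketched as follows in Python. This is an illustration only: it assumes that each book has a table of term weights and that a hadith is scored against a book by summing the weights of the terms it shares with that book. The function name `rank_books`, the book "Fasting" and all weight values are hypothetical and do not come from the paper.

```python
def rank_books(hadith_terms, book_weights, top_n=2):
    """Score a hadith against every book and return the top_n (book, score) pairs.

    hadith_terms: list of (stemmed) terms in the hadith.
    book_weights: {book: {term: weight}}, assumed to be precomputed.
    """
    scores = {}
    for book, weights in book_weights.items():
        # Cumulative weight: sum of the weights of the terms found in this book
        scores[book] = sum(weights.get(term, 0.0) for term in hadith_terms)

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical weights, for illustration only
book_weights = {
    "Almsgiving": {"charity": 0.9, "date": 0.4},
    "Faith": {"faith": 1.1, "modesty": 0.7},
    "Fasting": {"fast": 1.0, "date": 0.3},
}
top_two = rank_books(["charity", "date", "fire"], book_weights)
```

Returning the two best-scoring books, rather than only the best one, lets a hadith that legitimately belongs to two books be shown under both.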
Our corpus is limited (it contains only 8 books); it should be enlarged to include more books and hadiths.
One of the main drawbacks of our system lies in its inability to classify
hadiths according to their semantics, so our system cannot correctly classify
the following hadith:
'Adi b. Hatim reported that he heard Allah's Messenger (may peace and blessings be upon him) as saying: He who among you can protect himself against Fire, he should do so, even if it should be with half a date.
Classification: Almsgiving book
To test the accuracy of our system, we selected 80 hadiths that reside in 8 books (Fig. 2). Table 4 summarizes the accuracy measures; the average accuracy for this sample is approximately 83.2%.
Table 4: Accuracy measures
Figure: Hadith classification form
Figure: Term weight computation form
As a training set we collected about 15 hadiths for each book, with 5 hadiths for each test set. Normally, classifier accuracy increases as the training set grows.
We can see that each book has its own terms, so the accuracy of the classifier varies from one book to another.
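As a sketch of how per-book and average accuracy figures of this kind might be computed, the Python snippet below takes (true book, predicted book) pairs, computes the fraction classified correctly for each book, and then averages over books. Averaging per book (rather than over all hadiths) is an assumption about how the reported average was obtained. The function names and the sample pairs are hypothetical and are not the paper's results.

```python
from collections import defaultdict


def accuracy_by_book(pairs):
    """Return {book: fraction of its hadiths that were classified correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for true_book, predicted_book in pairs:
        total[true_book] += 1
        if predicted_book == true_book:
            correct[true_book] += 1
    return {book: correct[book] / total[book] for book in total}


def average_accuracy(per_book):
    """Unweighted mean of the per-book accuracies."""
    return sum(per_book.values()) / len(per_book)


# Hypothetical classification outcomes: (true book, predicted book)
pairs = [
    ("Faith", "Faith"),
    ("Faith", "Almsgiving"),
    ("Almsgiving", "Almsgiving"),
    ("Almsgiving", "Almsgiving"),
]
per_book = accuracy_by_book(pairs)
avg = average_accuracy(per_book)
```

Reporting the per-book values alongside the average makes it visible which books have distinctive vocabularies and which are harder to separate.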
We now show the execution of the system based on our algorithm. The inputs to the system are text files of hadiths.
Figures 3-5 show the interface of the system. The interface enables
the user to choose the interface language, open a hadith file, remove the Sanad
and stop words, find the stems of the terms and then calculate the weights.
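The preprocessing steps offered by the interface (Sanad removal, stop-word removal, stemming) can be sketched in Python as below. Everything specific here is an assumption made for illustration: the Sanad is taken to end at a fixed marker string, the stop-word list is a small hand-made set, and lowercasing stands in for a real Arabic stemmer. The English text is used only for readability; the actual system operates on Arabic.

```python
def preprocess(text, stop_words, stem, sanad_end_marker):
    """Return the stemmed, stop-word-free terms of the Matn.

    sanad_end_marker: string assumed to close the Sanad (hypothetical heuristic).
    stem: callable mapping a token to its stem (placeholder for a real stemmer).
    """
    # Drop everything up to and including the marker; keep the full text if absent
    _, found, matn = text.partition(sanad_end_marker)
    if not found:
        matn = text

    # Split on whitespace and strip surrounding punctuation
    tokens = [token.strip(".,;:!?\"'()") for token in matn.split()]

    # Remove empty tokens and stop words, then stem what remains
    return [stem(token) for token in tokens if token and token.lower() not in stop_words]


# Hypothetical inputs for illustration
hadith = (
    "It is narrated on the authority of Abu Huraira that the Messenger of Allah "
    "(PBUH) said: Faith has over sixty branches and modesty is the branch of faith."
)
stop_words = {"has", "over", "and", "is", "the", "of"}
terms = preprocess(hadith, stop_words, stem=str.lower, sanad_end_marker="said:")
```

The resulting term list is what a term-frequency or weighting step would then consume.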
In this paper we have described the design and successful implementation of a new method for classifying the sayings (hadiths) of the Prophet Muhammad (PBUH) in Arabic. The method has been implemented using Microsoft Visual Basic 6.0.
Future work will concentrate on enhancing the method so that it can perform
nested classification of each hadith within the same book. For example, the following
hadith is classified among all books as belonging to the Faith book and, among
the hadiths within that book, as belonging to the Faith matters book.
It is narrated on the authority of Abu Huraira that the Messenger of Allah (PBUH) said: Faith has over sixty branches and modesty is the branch of faith.
Classification: Faith book, faith matters book.