Al-Hadith Text Classifier

INTRODUCTION

Automated Text Categorization (ATC) is the task of building software tools capable of classifying text (or hypertext) documents under predefined categories or subject codes. Interest in ATC has grown rapidly in recent times, due to the availability of ever larger numbers of text documents in digital form and to the ensuing need to organize them for easier use. The dominant approach today is to build text classifiers automatically by learning the characteristics of the categories from a training set of pre-classified documents (http://mason.gmu.edu/~kersch/JIIS/Special_Issues/TextCategory.html).

Sahih Bukhari is a collection of sayings and deeds of Prophet Muhammad (PBUH), also known as the Sunnah. The reports of the Prophet's sayings and deeds are called hadith. A hadith consists of two main parts: the Sanad (the chain of narrators) and the Matn (the text of the report). Bukhari lived a couple of centuries after the Prophet's death and worked extremely hard to collect the Prophet's hadith. Each report in his collection was checked for compatibility with the Qur'an, and the veracity of the chain of reporters had to be painstakingly established. Bukhari's collection is recognized by the overwhelming majority of the Muslim world to be one of the most authentic collections of the Sunnah of the Prophet (PBUH) (http://www.usc.edu/dept/MSA/fundamentals/hadithsunnah/bukhari/sbtintro.html).

Bukhari spent sixteen years compiling it and ended up with 2,602 hadith without repetition (9,082 with repetition). His criteria for acceptance into the collection were amongst the most stringent of all the scholars of hadith (http://www.usc.edu/dept/MSA/fundamentals/hadithsunnah/bukhari/sbtintro.html).

Each hadith is preceded by a chain of the names of those who have transmitted it in each generation, leading all the way back to the companion who reported it from the Prophet. These isnads (Sanad) guarantee the authenticity and verbal accuracy of hadith. For the first few generations, the hadiths are believed to have been transmitted mainly orally rather than in writing^[1].

Many algorithms and techniques have been applied over the years to text categorization and classification. They include decision tree learning, Bayesian learning, nearest neighbor learning and artificial neural networks. Early works of this kind may be found in Hassan et al.^[2] and Bensaid et al.^[3]

A good study comparing document categorization algorithms can be found in Yang and Liu^[4]. Also, Hassan et al.^[2] present experimental results on document clustering and classification achieved on the Arabic corpus using statistical methods.

Concerning Arabic, one automatic categorizer is reported to be in operational use for classifying Arabic documents; it is referred to as "Sakhr's categorizer" (http://www.Sakhr.com).

METHODOLOGY

Our approach extracts the main terms from each hadith and computes their term frequencies. The TF/IDF (Term Frequency-Inverse Document Frequency) method is used for text searching and term weighting: document weights are computed for the selected terms and are then used to classify non-vocalized sayings, after the input hadith has been filtered.

Our corpus contains 8 books, stored in 8 separate files (i.e. one book per file).

The present methodology can be summarized as follows:

Table 1:	Term frequency table

Where:

	tf: the term's frequency in the document
	df: the number of documents in the corpus that contain the term
	N: the number of documents in the corpus
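
As an illustration only, the quantities tf, df and N defined above can be combined in a short Python sketch. The weighting formula w = tf × log10(N / df) is the common TF-IDF form and is an assumption here, since the exact formula is not stated in the text; the function name `tf_idf_weights` and the sample token lists are likewise hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return one {term: weight} dict per document.

    documents: list of token lists (assumed already filtered and stemmed).
    Assumed weighting (not taken from the paper): w = tf * log10(N / df).
    """
    n_docs = len(documents)  # N: number of documents in the corpus
    # df: number of documents that contain each term at least once
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # tf: occurrences of each term in this document
        weights.append({t: tf[t] * math.log10(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus: three tiny "documents" of English tokens.
docs = [
    ["prayer", "prayer", "mosque"],
    ["fasting", "ramadan", "prayer"],
    ["zakat", "charity", "charity"],
]
w = tf_idf_weights(docs)
```

In this sketch a term that occurs in every document receives a weight of zero, so only terms that discriminate between documents contribute to the classification.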

Table 2:	Term frequency table with threshold

Table 3:	Cumulative weight table

ALGORITHM

The algorithm used in this paper was implemented in the Microsoft Visual Basic programming language, which supports Arabic text, provides a variety of string functions and offers convenient file handling.

Figure 1 shows the sequence of this classification process:


Fig. 1:	Classification process

CHALLENGES

Natural language processing systems rarely give fully accurate results. The accuracy of our system depends heavily on the accuracy of the stop-word list and the stemmer, each of which has its own flaws.

We selected hadiths that contain frequent terms and skipped those that do not; whether a hadith contains such terms depends on its semantics.

In Sahih al-Bukhari, the same hadith may belong to more than one book. Our system handles such cases by displaying the two books with the highest ranks.
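
The ranking step can be sketched as follows. This is a minimal illustration, assuming that each book is scored by summing the weights of the hadith's terms (a cumulative weight) and that the two best-scoring books are reported; the function `rank_books`, the book name "Prayer" and all weight values are invented for the example and are not taken from the system.

```python
def rank_books(hadith_terms, book_weights, top_k=2):
    """Rank books by the cumulative weight of the hadith's terms.

    hadith_terms: list of (filtered, stemmed) terms from one hadith.
    book_weights: {book_name: {term: weight}}, e.g. TF-IDF weights per book.
    Returns the top_k (book_name, score) pairs, highest score first.
    """
    scores = {
        book: sum(weights.get(term, 0.0) for term in hadith_terms)
        for book, weights in book_weights.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Hypothetical weights, for illustration only.
book_weights = {
    "Faith": {"faith": 0.9, "modesty": 0.4},
    "Almsgiving": {"charity": 0.8, "date": 0.3},
    "Prayer": {"prayer": 0.7, "faith": 0.1},
}
top_two = rank_books(["faith", "modesty", "branch"], book_weights)
```

Terms that are unknown to a book simply contribute nothing to that book's score, so a hadith sharing vocabulary with two books surfaces both of them in the result.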

Our corpus is limited (it contains only 8 books); it should be enlarged to contain more books and hadiths.

One of the main drawbacks of our system lies in its inability to classify hadiths according to their semantics. For example, our system cannot correctly classify the following hadith:

'Adi b. Hatim reported that he heard Allah's Messenger (may peace and blessings be upon him) as saying: “He who among you can protect himself against Fire, he should do so, even if it should be with half a date”.

Classification: Almsgiving book

EVALUATION

In order to test the accuracy of our system, we selected 80 hadiths that reside in 8 books (Fig. 2). Table 4 summarizes the accuracy measures; the average accuracy for this sample is approximately 83.2%.


Fig. 2:	Accuracy measures


Fig. 3:	Main form


Fig. 4:	Hadith classification form


Fig. 5:	Term weight computation form

Table 4:	Accuracy measures

As a training set we collected about 15 hadiths for each book, with 5 hadiths for each test set. In general, classifier accuracy improves as the training set grows.

We can see that each book has its own terms, so the accuracy of the classifier varies from one book to another.
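
A per-book accuracy measure of this kind can be sketched as below. The sketch assumes that accuracy for a book is the fraction of its test hadiths assigned to the correct book, and that the reported average is the unweighted mean over books; both are assumptions, and the labels in the example are invented and are not the results of the evaluation.

```python
def per_book_accuracy(results):
    """Compute accuracy per book and the unweighted mean over books.

    results: list of (true_book, predicted_book) pairs.
    Returns ({book: accuracy}, average_accuracy).
    """
    totals, correct = {}, {}
    for true_book, predicted_book in results:
        totals[true_book] = totals.get(true_book, 0) + 1
        if predicted_book == true_book:
            correct[true_book] = correct.get(true_book, 0) + 1
    accuracy = {book: correct.get(book, 0) / totals[book] for book in totals}
    average = sum(accuracy.values()) / len(accuracy)
    return accuracy, average

# Invented labels, for illustration only.
sample = [
    ("Faith", "Faith"), ("Faith", "Faith"),
    ("Faith", "Prayer"), ("Faith", "Faith"),
    ("Almsgiving", "Almsgiving"), ("Almsgiving", "Faith"),
]
accuracy, average = per_book_accuracy(sample)
```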

Next, we show the execution of the system based on our algorithm. The inputs to this system are text files of hadiths. Figures 3-5 show the interface of the system. The interface enables the user to choose the interface language, open a hadith file, remove the Sanad, remove stop words, find the stems of the terms and then calculate the weights.
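
The processing steps offered by the interface (removing the Sanad, removing stop words, stemming) can be sketched as a small pipeline. Everything specific in this sketch is a hypothetical placeholder: the stop-word set, the `SANAD_MARKER` boundary rule and the toy suffix-stripping `stem` function stand in for the Arabic stop-word list and stemmer that the system relies on, and an English translation is used only to keep the example readable.

```python
import re

# Hypothetical placeholders: a real system would load an Arabic stop-word
# list and call a proper Arabic stemmer instead.
STOP_WORDS = {"the", "of", "and", "is", "a", "has", "over"}
SANAD_MARKER = "said:"  # assumed boundary between Sanad and Matn

def remove_sanad(text, marker=SANAD_MARKER):
    """Drop everything up to and including the first marker.

    If the marker is absent, the text is returned unchanged (stripped).
    """
    _, found, matn = text.partition(marker)
    return matn.strip() if found else text.strip()

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def stem(token):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Remove the Sanad, drop stop words, then stem the remaining tokens."""
    matn = remove_sanad(text)
    tokens = [t for t in tokenize(matn) if t not in STOP_WORDS]
    return [stem(t) for t in tokens]

example = (
    "It is narrated on the authority of Abu Huraira that the Messenger of "
    "Allah (PBUH) said: Faith has over sixty branches and modesty is the "
    "branch of faith."
)
terms = preprocess(example)
```

The resulting term list is what a weighting step would consume; keeping each stage as a separate function makes it easy to swap in a better stop-word list or stemmer without touching the rest.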

CONCLUSIONS

In this paper we have described the design and successful implementation of a new method suitable for classifying the sayings of the Prophet Mohammed (PBUH) (hadiths) in Arabic. The method has been implemented using Microsoft Visual Basic 6.0.

Future work will concentrate on enhancing the method so that it can perform nested classification of each hadith within a book. For example, the following hadith is classified at the level of all books as belonging to the Faith book and, at the level of the hadiths within that book, as belonging to the Faith matters book.

It is narrated on the authority of Abu Huraira that the Messenger of Allah (PBUH) said: Faith has over sixty branches and modesty is the branch of faith.

Classification: Faith book, faith matters book.


Journal of Applied Sciences

Year: 2005 | Volume: 5 | Issue: 3 | Page No.: 584-587
DOI: 10.3923/jas.2005.584.587

Mohammed Naji Al- Kabi, Ghassan Kanaan, Riyad Al- Shalabi, Saja I. Al- Sinjilawi and Ronza S. Al- Mustafa

How to cite this article

Mohammed Naji Al- Kabi, Ghassan Kanaan, Riyad Al- Shalabi, Saja I. Al- Sinjilawi and Ronza S. Al- Mustafa, 2005. Al-Hadith Text Classifier. Journal of Applied Sciences, 5: 584-587.

Keywords: data-mining, statistical analysis, Arabic Text categorization, Arabic text classification, hadith (prophet sayings) text classifier, Arabic text mining and classification

REFERENCES