pyHadith

A package which automatically segments, categorizes, and extracts narrators from ahadith.


Keywords
ahadith, hadith, isnad, nlp, arabic, ai, cnn, islam, ml, spacy
License
GPL-3.0
Install
pip install pyHadith==0.1.4


pyHadith is a Python package for the automatic analysis of ahadith.

pyHadith performs three key functions: categorization (as either athar or khabar); segmentation (of a hadith into a matn and an isnad); and narrator identification/extraction (that is, the identification of the names of narrators within an isnad).

As of 18 November 2020, pyHadith achieves the following precision, recall and F-scores. Each function was evaluated against a dataset of 20,430 manually annotated ahadith (1,322,035 tokens) that was withheld from training:

| Function | Precision | Recall | F-Score |
| --- | --- | --- | --- |
| Categorization | 0.8785 | 0.7889 | 0.8313 |
| Segmentation | 0.9709 | 0.9961 | 0.9833 |
| Narrator Identification/Extraction | 0.9706 | 0.9758 | 0.9732 |
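
The F-scores reported above are consistent with the usual F1 definition, the harmonic mean of precision and recall. The sketch below assumes that definition (the README does not state which F-measure was used) and recomputes the last column from the first two. `f_score` is an illustrative helper, not part of pyHadith.

```python
def f_score(precision: float, recall: float) -> float:
    """Return the F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall figures copied from the evaluation results above.
REPORTED = {
    "Categorization": (0.8785, 0.7889),
    "Segmentation": (0.9709, 0.9961),
    "Narrator Identification/Extraction": (0.9706, 0.9758),
}

for function_name, (precision, recall) in REPORTED.items():
    print(f"{function_name}: {f_score(precision, recall):.4f}")
```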

1. How It Works

1.1 Statistical Natural Language Processing Models

pyHadith uses four statistical natural language processing (NLP) models to segment ahadith, extract their narrators, and categorize them. These are: a Text Classification model known as masdar (responsible for categorization); a Part-of-Speech tagging model known as muqasim (responsible for segmentation); a Named Entity Recognition (NER) model known as musaid (also responsible for segmentation); and a NER model known as rawa (responsible for narrator extraction; trained only on asnad).

These models were trained on ahadith manually annotated by the Saudi Arabian Permanent Committee for Scholarly Research and Ifta.

These models were trained with spaCy version 2.2.4. The training corpus contained 102,153 annotated ahadith, taken from sunnah.alifta.gov.sa. For the rawa model, duplicate asnad were removed from the corpus, leaving 96,887 asnad. 20% of the ahadith in the datasets were withheld and used to evaluate the models. After training the models for 100 iterations, the best-performing models were selected.

The results of the final models are displayed in the table below:

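| Model | Model Type | Accuracy | Precision | Recall | F-Score |
| --- | --- | --- | --- | --- | --- |
| Asl | Text Classification | 97.32 | | | |
| Ajza | Part-of-Speech Tagging | 99.48 | | | |
| Musaid | Named Entity Recognition | | 98.99 | 99.20 | 99.10 |
| Rawa | Named Entity Recognition | | 99.05 | 99.44 | 99.25 |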

1.2 Pre-Processing

Before a hadith is analysed, it is first "cleaned" by a pre-processor.

The pre-processor strips away punctuation and extra white space.

The pre-processor also uses Motaz Saad's split-waw-arabic method to identify the word "وَ" (Eng. and) and add whitespace after it. This is necessary to differentiate between the letter "و" and the word "وَ", which effectively tokenizes the words.
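
As a rough illustration of the cleaning step, the sketch below strips punctuation and collapses extra white space. It is not pyHadith's own pre-processor: the function name `strip_punctuation_and_spaces` is hypothetical, the use of Unicode categories to decide what counts as punctuation is an assumption, and the waw-splitting step is not reproduced.

```python
import unicodedata


def strip_punctuation_and_spaces(text: str) -> str:
    """Replace punctuation with spaces, then collapse runs of white space.

    Assumption: "punctuation" means any character in a Unicode "P" category,
    plus invisible formatting marks (category "Cf") such as the
    right-to-left mark. Diacritics (category "Mn") are left untouched.
    """
    pieces = []
    for char in text:
        category = unicodedata.category(char)
        if category.startswith("P") or category == "Cf":
            pieces.append(" ")
        else:
            pieces.append(char)
    # str.split() with no arguments splits on any run of white space.
    return " ".join("".join(pieces).split())


print(strip_punctuation_and_spaces("عَنْ رَجُلٍ،   قَالَ ."))
```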

1.3 Rawa Post-Processor

To ensure that the names extracted by the rawa model are accurate, a post-processor looks for common joining terms at the beginning of each name (for example, where the word "من" (Eng. from) is included as part of the name of a narrator). If a common joining term is found, it is removed from the name.
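
The idea can be sketched as follows. This is a minimal illustration rather than pyHadith's implementation: `JOINING_TERMS` contains only the one term named above (the full list used by the package is not given here), and both helper names are hypothetical.

```python
import unicodedata

# Illustrative only: "من" is the single joining term named in this README.
JOINING_TERMS = {"من"}


def strip_diacritics(word: str) -> str:
    """Drop combining marks so that "مِنْ" compares equal to "من"."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def strip_leading_joining_terms(name: str) -> str:
    """Remove joining terms from the start of an extracted narrator name."""
    tokens = name.split()
    # Keep at least one token so that a name is never emptied entirely.
    while len(tokens) > 1 and strip_diacritics(tokens[0]) in JOINING_TERMS:
        tokens = tokens[1:]
    return " ".join(tokens)


print(strip_leading_joining_terms("من جابر بن عبد الله"))
```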

1.4 Segmentation Algorithm

To segment a hadith into a matn and an isnad, a custom algorithm is employed. This algorithm relies on the muqasim and musaid models. It first splits a hadith at the last occurrence of a "BEGINMATN" tag (identified by muqasim). It then searches for narrators within the text before the last "BEGINMATN" tag. If a narrator has not been found after 6 or more tokens, it assumes that the last narrator identified is the actual last narrator. All the text before the token immediately succeeding that narrator is then labelled as the isnad. The text after it is labelled the matn.
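
One possible reading of this algorithm is sketched below. It is not pyHadith's code: the function name, the input shapes (one tag per token, narrators as token-index spans) and the interpretation of the 6-token rule as a gap measured from the end of the previous narrator are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

MAX_GAP = 6  # tokens without a narrator before the search stops


def split_isnad_matn(
    tokens: Sequence[str],
    tags: Sequence[str],
    narrator_spans: Sequence[Tuple[int, int]],
) -> Tuple[List[str], List[str]]:
    """Split tokens into (isnad, matn).

    tags holds one tag per token (hypothetical output of muqasim);
    narrator_spans holds (start, end) token indices with an exclusive end
    (hypothetical output of musaid).
    """
    # Step 1: the last "BEGINMATN" tag bounds the narrator search.
    begin_matn = [i for i, tag in enumerate(tags) if tag == "BEGINMATN"]
    boundary = begin_matn[-1] if begin_matn else len(tokens)

    # Step 2: walk the narrators before the boundary and stop at a long gap.
    last_end = None
    for start, end in sorted(narrator_spans):
        if start >= boundary:
            break
        if last_end is not None and start - last_end >= MAX_GAP:
            break
        last_end = min(end, boundary)

    # Step 3: everything up to the last accepted narrator is the isnad.
    split_at = boundary if last_end is None else last_end
    return list(tokens[:split_at]), list(tokens[split_at:])


# Placeholder tokens: A, B and C stand in for narrator names.
tokens = "haddathana A an B qala w1 w2 w3 w4 w5 w6 C qala m1 m2".split()
tags = ["O"] * len(tokens)
tags[4] = tags[12] = "BEGINMATN"
isnad, matn = split_isnad_matn(tokens, tags, [(1, 2), (3, 4), (11, 12)])
print(isnad)
print(matn)
```

In this example the third narrator span is ignored because seven narrator-free tokens separate it from the second, so the split falls immediately after "B".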

1.5 Asnad Reconstruction Algorithm

An Asnad Reconstruction algorithm is employed to standardize narrational relationships in a tree-like data structure.

There are two possible relationships recognized by the algorithm: "A from B" and "A from B and C". Thus, where a term joins two or more narrators to a single narratee, that narratee will have multiple "parent" narrators. Multiple "parents" are identified by looking for the Arabic word "وَ".
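
The two relationships can be sketched with a small tree builder. This is an illustration under stated assumptions, not pyHadith's algorithm: the input format (a name plus a flag saying whether the narrator was joined to the previous one by "وَ") is hypothetical, and attaching the narrator that follows a joined group to every member of that group is a design choice made here for simplicity.

```python
from typing import Dict, List, Sequence, Tuple


def build_tree(narrators: Sequence[Tuple[str, bool]]) -> List[Dict]:
    """Build a list of narrator dictionaries with 'parents' id lists.

    narrators is given in isnad order as (name, joined_by_waw) pairs, where
    joined_by_waw is True when the narrator follows the previous one via "وَ".
    """
    tree: List[Dict] = []
    current_group: List[int] = []  # ids of narrators sharing one narratee
    children: List[int] = []       # ids of the narratees of current_group
    for idx, (name, joined_by_waw) in enumerate(narrators):
        tree.append({"id": idx, "name": name, "parents": []})
        if joined_by_waw and current_group:
            # "A from B and C": C joins B as another parent of A.
            current_group.append(idx)
        else:
            # "A from B": the previous group becomes this narrator's children.
            children = current_group
            current_group = [idx]
        for child_id in children:
            tree[child_id]["parents"].append(idx)
    return tree


# A from B and C, then B and C both from D.
example = build_tree([("A", False), ("B", False), ("C", True), ("D", False)])
for node in example:
    print(node)
```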

2. Installation

pyHadith is available on PyPI. You can install it with pip using the following command:

pip install pyHadith

The following python packages will be automatically installed as dependencies of pyHadith:

| Package | Version | Description |
| --- | --- | --- |
| spaCy | 2.2.4 | Used to interact with the rawa and asl models. |
| pyArabic | >= 0.6.10 | Used to remove diacritics from Arabic strings. |
| nltk | >= 3.4.5 | Used to tokenize Arabic strings. |

pyHadith also requires the NLTK punkt tokenizer. Punkt can be installed by executing the following commands (in Python):

# Import the "download" function from nltk.
from nltk import download
# Download punkt.
download("punkt")

3. Usage

3.1 Import pyHadith

The first step in using pyHadith is to import the package into your code. You can do so with the following line:

# Import the pyHadith package.
import pyhadith

3.2 Create a 'Hadith' Object

Before you can segment and analyse a hadith, you must first create a 'Hadith' object. The constructor does not require any arguments.

The code below demonstrates how a 'Hadith' object can be created:

# Continue on from example code in 3.1.
# Create a 'Hadith' object.
hadithObj = pyhadith.Hadith()

3.3 Preprocess a Hadith

To analyse a hadith, you must preprocess it. This requires the passing of a single argument, a UTF-8 encoded Arabic string with diacritics.

The code below demonstrates how you can preprocess a hadith:

# Continue on from example code in 3.2.
# Set the hadith to be pre-processed.
text = u'حَدَّثَنَا مُحَمَّدُ بْنُ بَشَّارٍ، حَدَّثَنَا أَبُو أَحْمَدَ، حَدَّثَنَا سُفْيَانُ، عَنْ يَزِيدَ أَبِي خَالِدٍ الدَّالاَنِيِّ، عَنْ رَجُلٍ، عَنْ جَابِرِ بْنِ عَبْدِ اللَّهِ، قَالَ صَنَعَ أَبُو الْهَيْثَمِ بْنُ التَّيْهَانِ لِلنَّبِيِّ صلى الله عليه وسلم طَعَامًا فَدَعَا النَّبِيَّ صلى الله عليه وسلم وَأَصْحَابَهُ فَلَمَّا فَرَغُوا قَالَ ‏"‏ أَثِيبُوا أَخَاكُمْ ‏"‏ ‏.‏ قَالُوا يَا رَسُولَ اللَّهِ وَمَا إِثَابَتُهُ قَالَ ‏"‏ إِنَّ الرَّجُلَ إِذَا دُخِلَ بَيْتُهُ فَأُكِلَ طَعَامُهُ وَشُرِبَ شَرَابُهُ فَدَعَوْا لَهُ فَذَلِكَ إِثَابَتُهُ ‏"‏ ‏.‏'
# Pre-process the hadith.
hadithObj.preprocess(text)
# Print the resulting attributes.
print({
    "raw" : hadithObj.raw,
    "clean" : hadithObj.clean
})

Once you have pre-processed a hadith, the following attributes will become available:

| Attribute | Data Type | Description |
| --- | --- | --- |
| raw | String | The original raw text. |
| clean | String | The cleaned raw text. |

3.4 Segment a Hadith

To segment a hadith into a matn and an isnad, you must call the 'segment' function of a 'Hadith' object. The 'segment' function does not require the passing of any arguments.

The code below demonstrates how this is done:

# Continue on from example code in 3.3.
# Call the 'segment' function.
hadithObj.segment()
# Print the resulting attributes.
print({
    "matn" : hadithObj.matn,
    "isnad" : hadithObj.isnad
})

Once the function has been called, the following attributes will become available:

| Attribute | Data Type | Description |
| --- | --- | --- |
| matn | Dictionary | A dictionary containing the 'raw' text, 'start_char' index (in the cleaned text), and 'end_char' index (in the cleaned text) of the matn. |
| isnad | Dictionary | A dictionary containing the 'raw' text, 'start_char' index (in the cleaned text), and 'end_char' index (in the cleaned text) of the isnad, along with a 'narrators' list which contains the names and character indices of narrators. |

3.5 Categorize a Hadith

To categorize a hadith, you must call the 'categorize' function of your 'Hadith' object. Like the 'segment' function, this function does not require the passing of any arguments. This function also does not require you to have previously called the 'segment' function.

The code below demonstrates how you can call the function:

# Continue on from example code in 3.4.
# Call the 'categorize' function.
hadithObj.categorize()
# Print the resulting attributes.
print(hadithObj.category)

Once the 'categorize' function has been called, the 'category' attribute will become available.

| Attribute | Data Type | Description |
| --- | --- | --- |
| category | Dictionary | A dictionary containing the 'name' (either 'athar' or 'khabar') and 'score' (from 0.5 to 1) of the assigned category. |

3.6 Reconstruct an Isnad

To reconstruct the isnad of a hadith, you must call the 'treeify' function of your 'Hadith' object. Before calling the function, however, you must have already called the 'segment' function.

The code below demonstrates how you can call the 'treeify' function:

# Continue on from example code in 3.5.
# Call the 'treeify' function.
hadithObj.treeify()
# Print the resulting attributes.
print(hadithObj.tree)

Once the 'treeify' function has been called, the 'tree' attribute will be created. This attribute is a list which contains 'narrator' dictionaries.

A 'narrator' dictionary in the 'tree' list will contain the following keys:

| Key | Data Type | Description |
| --- | --- | --- |
| id | Integer | A unique identifier number. |
| name | String | The raw text of the narrator's name. |
| start_char | Integer | The character index in the cleaned text where the name begins. |
| end_char | Integer | The character index in the cleaned text where the name ends. |
| parents | List | A list of the ids of the narrator's parents within the isnad. |
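
Because each 'narrator' dictionary refers to its parents by id, a small lookup is enough to turn the 'tree' list into readable relationships. The helper below is a sketch that relies only on the keys documented above; `describe_tree` is a hypothetical name, and the sample list is hand-written placeholder data, not real pyHadith output.

```python
from typing import Dict, List


def describe_tree(tree: List[Dict]) -> List[str]:
    """Return one "X from Y and Z" line for each narrator that has parents."""
    names = {narrator["id"]: narrator["name"] for narrator in tree}
    lines = []
    for narrator in tree:
        if narrator["parents"]:
            parents = " and ".join(names[pid] for pid in narrator["parents"])
            lines.append(f"{narrator['name']} from {parents}")
    return lines


# Hand-written placeholder data shaped like the 'tree' attribute.
sample_tree = [
    {"id": 0, "name": "A", "start_char": 0, "end_char": 1, "parents": [1, 2]},
    {"id": 1, "name": "B", "start_char": 5, "end_char": 6, "parents": []},
    {"id": 2, "name": "C", "start_char": 9, "end_char": 10, "parents": []},
]

for line in describe_tree(sample_tree):
    print(line)
```

With a real object, the same helper could be called as `describe_tree(hadithObj.tree)` once 'treeify' has run.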

4. Changelog

The changelog for pyHadith is available in the CHANGELOG.md file.

5. License

pyHadith is licensed under GPLv3. pyArabic, spaCy and NLTK are licensed under GPLv3, MIT, and Apache 2.0, respectively. These licenses are all GPL-compatible.

6. Citation

You may cite pyHadith using the following citation:

Butler, U. (2020). PyHadith (Version 0.1.3) [Computer software]. Retrieved from https://pypi.org/project/pyHadith/0.1.3/