pyHadith
pyHadith is a Python package for the automatic analysis of ahadith.
pyHadith performs three key functions: categorization (as either athar or khabar); segmentation (of a hadith into a matn and an isnad); and, narrator identification/extraction (that is, the identification of the names of narrators within an isnad).
As of 18 November 2020, pyHadith achieves the following precision, recall and F-scores. Each function was evaluated against a dataset of 20,430 manually annotated ahadith (1,322,035 tokens) that was withheld from training:
Function | Precision | Recall | F-Score |
---|---|---|---|
Categorization | 0.8785 | 0.7889 | 0.8313 |
Segmentation | 0.9709 | 0.9961 | 0.9833 |
Narrator Identification/Extraction | 0.9706 | 0.9758 | 0.9732 |
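As a quick consistency check, the F-scores in the table can be reproduced from the reported precision and recall. This sketch assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall); the source does not say which F-measure was used, and the helper name `f1` is illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
reported = {
    "Categorization": (0.8785, 0.7889),
    "Segmentation": (0.9709, 0.9961),
    "Narrator Identification/Extraction": (0.9706, 0.9758),
}

# Assumption: F-score = F1, rounded to four decimal places like the table.
scores = {name: round(f1(p, r), 4) for name, (p, r) in reported.items()}

for name, score in scores.items():
    print(f"{name}: {score}")
```

Under that assumption, the computed values agree with the F-Score column to four decimal places.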
1. How It Works
1.1 Statistical Natural Language Processing Models
pyHadith uses four statistical natural language processing (NLP) models to categorize ahadith, segment them, and extract their narrators. These are: a Text Classification model known as masdar (responsible for categorization); a Part-of-Speech tagging model known as muqasim (responsible for segmentation); a Named Entity Recognition (NER) model known as musaid (responsible for segmentation); and a Named Entity Recognition (NER) model known as rawa (responsible for narrator extraction; trained only on asnad).
These models were trained on ahadith that were manually annotated by the Saudi Arabian Permanent Committee for Scholarly Research and Ifta.
These models were generated by spaCy version 2.2.4. The training corpus contained 102,153 annotated ahadith, taken from sunnah.alifta.gov.sa. For the rawa model, duplicate asnad were removed from the corpus, resulting in the inclusion of 96,887 asnad. 20% of the ahadith in the datasets were withheld and used for evaluating models. After training models for 100 iterations, the best performing models were selected.
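The 20% hold-out can be sketched as follows. The shuffling, the fixed seed and the `holdout_split` helper are assumptions made for illustration; the source does not describe how the withheld ahadith were chosen. Truncating 20% of 102,153 gives 20,430, which matches the size of the evaluation set quoted in the introduction.

```python
import random
from typing import List, Sequence, Tuple


def holdout_split(
    items: Sequence, test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List, List]:
    """Shuffle a copy of the items and withhold a fraction for evaluation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_held_out = int(len(shuffled) * test_fraction)  # truncates, not rounds
    return shuffled[n_held_out:], shuffled[:n_held_out]


# Stand-in ids for the 102,153 annotated ahadith in the corpus.
corpus = range(102_153)
train, held_out = holdout_split(corpus)

print(len(train), len(held_out))
```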
The results of the final models are displayed in the table below:
Model | Model Type | Accuracy (%) | Precision (%) | Recall (%) | F-Score (%) |
---|---|---|---|---|---|
Asl | Text Classification | 97.32 | | | |
Ajza | Part-of-Speech Tagging | 99.48 | | | |
Musaid | Named Entity Recognition | | 98.99 | 99.20 | 99.10 |
Rawa | Named Entity Recognition | | 99.05 | 99.44 | 99.25 |
1.2 Pre-Processing
Before a hadith is analysed, it is first "cleaned" by a pre-processor.
The pre-processor strips away punctuation and extra white space.
The pre-processor also uses Motaz Saad's split-waw-arabic method to identify the word "وَ" (Eng. and) and insert whitespace after it. This is necessary to differentiate between the letter "و" at the start of a word and the standalone word "وَ", so that the latter is tokenized as a word in its own right.
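The punctuation and whitespace step can be approximated with the standard library alone. This is a minimal sketch, not pyHadith's own pre-processor: the function name is hypothetical, "punctuation" is assumed to mean any character in a Unicode punctuation category, and the split-waw step is not reproduced.

```python
import re
import unicodedata


def strip_punctuation_and_spaces(text: str) -> str:
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    no_punct = "".join(
        # Unicode categories starting with "P" cover punctuation, including
        # the Arabic comma; diacritics are marks ("M") and are left alone.
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    return re.sub(r"\s+", " ", no_punct).strip()


sample = 'عَنْ رَجُلٍ،  قَالَ " أَثِيبُوا أَخَاكُمْ " .'
cleaned = strip_punctuation_and_spaces(sample)
print(cleaned)
```

Replacing punctuation with a space, rather than deleting it, avoids fusing two words that were separated only by a comma.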
1.3 Rawa Post-Processor
To ensure that the names extracted by the rawa model are accurate, a post-processor looks for common joining terms at the beginning of each name (for example, where the word "من" (Eng. from) has been included as part of a narrator's name). If a common joining term is found, it is removed from the name.
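A post-processing step of this kind might look like the sketch below. The helper names and the contents of `JOINING_TERMS` are assumptions: only "من" is taken from the description above, and the list pyHadith actually uses is not given in this document.

```python
import unicodedata

# "من" comes from the description above; "عن" is an assumed extra entry.
JOINING_TERMS = {"من", "عن"}


def strip_diacritics(word: str) -> str:
    """Drop combining marks (such as Arabic harakat) from a word."""
    return "".join(ch for ch in word if not unicodedata.combining(ch))


def drop_leading_joining_terms(name: str) -> str:
    """Remove joining terms from the start of an extracted narrator name."""
    tokens = name.split()
    # Compare without diacritics, and never strip the name down to nothing.
    while len(tokens) > 1 and strip_diacritics(tokens[0]) in JOINING_TERMS:
        tokens = tokens[1:]
    return " ".join(tokens)


print(drop_leading_joining_terms("مِنْ جَابِرِ بْنِ عَبْدِ اللَّهِ"))
```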
1.4 Segmentation Algorithm
To segment a hadith into a matn and an isnad, a custom algorithm is employed. This algorithm relies on the muqasim and musaid models. It first splits a hadith at the last occurrence of a "BEGINMATN" tag (identified by muqasim). It then searches for narrators within the text before the last "BEGINMATN" tag. If a narrator has not been found after 6 or more tokens, it assumes that the last narrator identified is the actual last narrator. All the text before the token immediately succeeding that narrator is then labelled as the isnad. The text after it is labelled the matn.
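The steps above can be sketched over pre-tagged tokens. This is an illustration under stated assumptions rather than pyHadith's implementation: the function name and its inputs (one tag and one narrator flag per token) are hypothetical, the gap counter is assumed to start only once a first narrator has been seen, and a hadith with no "BEGINMATN" tag or no narrators falls back to splitting at the tag position.

```python
from typing import List, Tuple

MAX_GAP = 6  # tokens without a narrator before the search stops


def split_isnad_matn(
    tokens: List[str], tags: List[str], is_narrator: List[bool]
) -> Tuple[List[str], List[str]]:
    """Split tokens into (isnad, matn) following the described heuristic."""
    # 1. Find the last token tagged BEGINMATN (assumed fallback: end of text).
    matn_start = len(tokens)
    for i, tag in enumerate(tags):
        if tag == "BEGINMATN":
            matn_start = i

    # 2. Scan the text before that tag for narrators, stopping once
    #    MAX_GAP consecutive non-narrator tokens follow a narrator.
    last_narrator = None
    gap = 0
    for i in range(matn_start):
        if is_narrator[i]:
            last_narrator, gap = i, 0
        elif last_narrator is not None:
            gap += 1
            if gap >= MAX_GAP:
                break

    # 3. The isnad ends with the last narrator; the rest is the matn.
    split = matn_start if last_narrator is None else last_narrator + 1
    return tokens[:split], tokens[split:]


# Latin stand-ins for Arabic tokens; "A" and "B" are narrators.
tokens = ["haddathana", "A", "an", "B", "qala", "m1", "m2"]
tags = ["O", "O", "O", "O", "BEGINMATN", "O", "O"]
flags = [False, True, False, True, False, False, False]
isnad, matn = split_isnad_matn(tokens, tags, flags)
print(isnad, matn)
```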
1.5 Asnad Reconstruction Algorithm
An Asnad Reconstruction algorithm is employed to standardize narrational relationships in a tree-like data structure.
There are two possible relationships recognized by the algorithm: A from B, and, A from B and C. Thus, where a term joins two or more narrators to a single narratee, that narratee will have multiple "parent" narrators. Multiple "parents" are identified by looking for the Arabic word "وَ".
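One way to picture the reconstruction is to group narrators joined by "وَ" into the same level and make each level the parents of the level before it. The sketch below assumes that interpretation; the function name and its inputs (narrator names in order, plus the joining text between each consecutive pair) are hypothetical, and pyHadith's own algorithm may differ.

```python
import unicodedata
from typing import Dict, List

WAW = "و"  # the conjunction, compared without diacritics


def is_waw(joiner: str) -> bool:
    """True if the joining text is the word 'and' (with or without harakat)."""
    bare = "".join(ch for ch in joiner.strip() if not unicodedata.combining(ch))
    return bare == WAW


def build_tree(names: List[str], joiners: List[str]) -> List[Dict]:
    """joiners[i] is the text between names[i] and names[i + 1]."""
    # Narrators linked by "and" share a level; any other joiner starts a new one.
    levels = [[0]]
    for i, joiner in enumerate(joiners, start=1):
        if is_waw(joiner):
            levels[-1].append(i)
        else:
            levels.append([i])

    tree = []
    for depth, level in enumerate(levels):
        parents = levels[depth + 1] if depth + 1 < len(levels) else []
        for idx in level:
            tree.append({"id": idx, "name": names[idx], "parents": list(parents)})
    return tree


# "A from B and C, from D", using Latin stand-ins for narrator names.
tree = build_tree(["A", "B", "C", "D"], ["عن", "وَ", "عن"])
for node in tree:
    print(node)
```

Here A receives two parents (B and C), which is the "A from B and C" relationship; treating D as the parent of both B and C is a further assumption of this sketch.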
2. Installation
pyHadith is available on PyPI. You can install it with pip using the following command:
pip install pyHadith
The following Python packages will be automatically installed as dependencies of pyHadith:
Package | Version | Description |
---|---|---|
spaCy | 2.2.4 | Used to interact with the rawa and asl models. |
pyArabic | >= 0.6.10 | Used to remove diacritics from Arabic strings. |
nltk | >= 3.4.5 | Used to tokenize Arabic strings. |
pyHadith also requires the NLTK punkt tokenizer. Punkt can be installed by executing the following commands (in Python):
# Import the "download" function from nltk.
from nltk import download
# Download punkt.
download("punkt")
3. Usage
3.1 Import pyHadith
The first step in using pyHadith is to import the package into your code. You can do so with the following line:
# Import the pyHadith package.
import pyhadith
3.2 Create a 'Hadith' Object
Before you can segment and analyse a hadith, you must first create a 'Hadith' object. This does not require the passing of any arguments.
The code below demonstrates how a 'Hadith' object can be created:
# Continue on from example code in 3.1.
# Create a 'Hadith' object.
hadithObj = pyhadith.Hadith()
3.3 Preprocess a Hadith
Before a hadith can be analysed, it must be pre-processed by calling the 'preprocess' function, which takes a single argument: a UTF-8 encoded Arabic string with diacritics.
The code below demonstrates how you can preprocess a hadith:
# Continue on from example code in 3.2.
# Set the hadith to be pre-processed.
text = u'حَدَّثَنَا مُحَمَّدُ بْنُ بَشَّارٍ، حَدَّثَنَا أَبُو أَحْمَدَ، حَدَّثَنَا سُفْيَانُ، عَنْ يَزِيدَ أَبِي خَالِدٍ الدَّالاَنِيِّ، عَنْ رَجُلٍ، عَنْ جَابِرِ بْنِ عَبْدِ اللَّهِ، قَالَ صَنَعَ أَبُو الْهَيْثَمِ بْنُ التَّيْهَانِ لِلنَّبِيِّ صلى الله عليه وسلم طَعَامًا فَدَعَا النَّبِيَّ صلى الله عليه وسلم وَأَصْحَابَهُ فَلَمَّا فَرَغُوا قَالَ " أَثِيبُوا أَخَاكُمْ " . قَالُوا يَا رَسُولَ اللَّهِ وَمَا إِثَابَتُهُ قَالَ " إِنَّ الرَّجُلَ إِذَا دُخِلَ بَيْتُهُ فَأُكِلَ طَعَامُهُ وَشُرِبَ شَرَابُهُ فَدَعَوْا لَهُ فَذَلِكَ إِثَابَتُهُ " .'
# Pre-process the hadith.
hadithObj.preprocess(text)
# Print the resulting attributes.
print({
"raw" : hadithObj.raw,
"clean" : hadithObj.clean
})
Once you have pre-processed a hadith, the following attributes will become available:
Attribute | Data Type | Description |
---|---|---|
raw | String | The original raw text. |
clean | String | The cleaned text produced by the pre-processor. |
3.4 Segment a Hadith
To segment a hadith into a matn and an isnad, you must call the 'segment' function of a 'Hadith' object. The 'segment' function does not require the passing of any arguments.
The code below demonstrates how this is done:
# Continue on from example code in 3.3.
# Call the 'segment' function.
hadithObj.segment()
# Print the resulting attributes.
print({
"matn" : hadithObj.matn,
"isnad" : hadithObj.isnad
})
Once the function has been called, the following attributes will become available:
Attribute | Data Type | Description |
---|---|---|
matn | Dictionary | A dictionary containing the 'raw' text, 'start_char' index (in the cleaned text), and 'end_char' index (in the cleaned text), of the matn. |
isnad | Dictionary | A dictionary containing the 'raw' text, 'start_char' index (in the cleaned text), and 'end_char' index (in the cleaned text), of the isnad, along with a 'narrators' list which contains the names and character indices of narrators. |
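The character indices can be used to locate each segment within the cleaned text. The sketch below uses hypothetical stand-in values rather than real pyHadith output, and it assumes that 'end_char' is exclusive in the manner of a Python slice; the source does not say whether the end index is inclusive.

```python
def segment_text(clean: str, segment: dict) -> str:
    """Slice the cleaned text using a segment's character indices."""
    # Assumption: end_char is exclusive, like the end of a Python slice.
    return clean[segment["start_char"]:segment["end_char"]]


# Hypothetical values standing in for hadithObj.clean, .isnad and .matn.
clean = "haddathana A an B qala matn text"
isnad = {"raw": "haddathana A an B", "start_char": 0, "end_char": 17}
matn = {"raw": "qala matn text", "start_char": 18, "end_char": 32}

print(segment_text(clean, isnad))
print(segment_text(clean, matn))
```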
3.5 Categorize a Hadith
To categorize a hadith, you must call the 'categorize' function of your 'Hadith' object. Like the 'segment' function, this function does not require the passing of any arguments. This function also does not require you to have previously called the 'segment' function.
The code below demonstrates how you can call the function:
# Continue on from example code in 3.4.
# Call the 'categorize' function.
hadithObj.categorize()
# Print the resulting attributes.
print(hadithObj.category)
Once the 'categorize' function has been called, the 'category' attribute will become available.
Attribute | Data Type | Description |
---|---|---|
category | Dictionary | A dictionary containing the 'name' (either 'athar' or 'khabar') and 'score' (from 0.5 to 1) of the assigned category. |
3.6 Reconstruct an Isnad
To reconstruct the isnad of a hadith, call the 'treeify' function of your 'Hadith' object. You must have already called the 'segment' function before doing so.
The code below demonstrates how you can call the 'treeify' function:
# Continue on from example code in 3.5.
# Call the 'treeify' function.
hadithObj.treeify()
# Print the resulting attributes.
print(hadithObj.tree)
Once the 'treeify' function has been called, the 'tree' attribute will be created. This attribute is a list which contains 'narrator' dictionaries.
A 'narrator' dictionary in the 'tree' list will contain the following keys:
Key | Data Type | Description |
---|---|---|
id | Integer | A unique identifier number. |
name | String | The raw text of the narrator's name. |
start_char | Integer | The character index in the cleaned text where the name begins. |
end_char | Integer | The character index in the cleaned text where the name ends. |
parents | List | A list of the ids of the narrator's parents within the isnad. |
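Because each narrator refers to its parents by id, a lookup table makes the list easy to read back as "A from B" statements. The sketch below relies only on the keys documented above; the sample data and the `describe_tree` helper are hypothetical and do not come from pyHadith.

```python
from typing import Dict, List


def describe_tree(tree: List[Dict]) -> List[str]:
    """Render each narrator with its parents as 'A from B and C'."""
    names = {node["id"]: node["name"] for node in tree}
    lines = []
    for node in tree:
        if node["parents"]:
            parents = " and ".join(names[pid] for pid in node["parents"])
            lines.append(f"{node['name']} from {parents}")
    return lines


# Hypothetical 'tree' value with Latin stand-ins for narrator names.
sample_tree = [
    {"id": 0, "name": "A", "start_char": 0, "end_char": 1, "parents": [1, 2]},
    {"id": 1, "name": "B", "start_char": 5, "end_char": 6, "parents": [3]},
    {"id": 2, "name": "C", "start_char": 9, "end_char": 10, "parents": [3]},
    {"id": 3, "name": "D", "start_char": 14, "end_char": 15, "parents": []},
]

for line in describe_tree(sample_tree):
    print(line)
```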
4. Changelog
The changelog for pyHadith is available in the CHANGELOG.md file.
5. License
pyHadith is licensed under GPLv3. pyArabic, spaCy and NLTK are licensed under GPLv3, MIT, and Apache 2.0, respectively. These licenses are all GPL-compatible.
6. Citation
You may cite pyHadith using the following citation:
Butler, U. (2020). PyHadith (Version 0.1.3) [Computer software]. Retrieved from https://pypi.org/project/pyHadith/0.1.3/