epiclean

A package for cleaning inscription data from Epigraphia Carnatica


License
MIT
Install
pip install epiclean==0.0.5

Documentation

'Cleaning' Epigraphia Carnatica for Knowledge Graphs

This repo will contain code and documentation for the data cleaning done on Epigraphia Carnatica PDFs sourced from archive.org

The database of 'cleaned' information, and the APIs to access it, are being developed at REST APIs for Epigraphia Carnatica database

Why does data need to be cleaned?

The entire series of Epigraphia Carnatica is available online as part of a digitization project by Microsoft.

The files have been prepared with some basic OCR that recognizes most of the typeset English characters. However, the books record inscriptions in several Indian languages and many scripts. The English transliterations also make use of an extended character set with diacritics to accurately represent the entire phonetic range of the Indian languages. Unfortunately, these characters are not correctly recognized by the OCR system. In addition, the character set used by the publishers at that time differs from modern transliteration schemes.

Setting aside the inscription text in Indic scripts for now, the purpose is to make the text of the translations and transliterations available with 'corrections'.

Process

This process (currently manual) involves

  1. Copying the text from the PDFs to a text editor.
  2. Correcting the text by substituting wrongly identified characters with the correct character from a standard. To validate, the inscription text in Indic script might need to be consulted.
  3. Cleaning up any other distortions caused during copying.
  4. Saving the corrected text in a database.
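The manual steps above can be sketched as a small pipeline. This is an illustrative sketch only: the substitution table below is a hypothetical subset of the real one, and the in-memory SQLite store merely stands in for whatever database the project actually uses.

```python
import sqlite3

# Illustrative subset of the EC -> ISO 15919 substitutions (step 2).
EC_TO_ISO = {"â": "ā", "î": "ī", "û": "ū", "é": "ē", "ô": "ō"}

def correct(text: str) -> str:
    """Step 2: substitute wrongly identified characters."""
    for ec_char, iso_char in EC_TO_ISO.items():
        text = text.replace(ec_char, iso_char)
    return text

def clean(text: str) -> str:
    """Step 3: remove distortions such as stray whitespace from copying."""
    return " ".join(text.split())

def save(conn: sqlite3.Connection, source: str, text: str) -> None:
    """Step 4: store the corrected text."""
    conn.execute("CREATE TABLE IF NOT EXISTS inscriptions (source TEXT, text TEXT)")
    conn.execute("INSERT INTO inscriptions VALUES (?, ?)", (source, text))
    conn.commit()

# Step 1 (copying text out of the PDF) remains manual.
conn = sqlite3.connect(":memory:")
save(conn, "EC Vol 2", clean(correct("Malliséna  tîrttha")))
```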

The repo will eventually contain code for automation, once substantial insights have been gained from the manual process.

This project uses the relatively large ISO 15919 character set for transliteration, with future extensibility in mind.

The table below lists the common substitutions from EC, along with observations.

| Indic Character(s) | EC character | ISO 15919 character | Comments |
| --- | --- | --- | --- |
| ಆ/आ | â | ā | |
| ಈ/ई | î | ī | |
| ಊ/ऊ | û | ū | |
| ಋ/ऋ | | | |
| | é | ē | Long vowel in Dravidian languages |
| | ô | ō | Long vowel in Dravidian languages |
| ಂ/ं | ṃ | ṁ | Final anusvara |
| ಃ/ः | | | Same character |
| ಙ/ङ | | | Same character |
| ಞ/ञ | ñ | ñ | Same character |
| ಟ/ट | | | Same character |
| ಡ/ड | | | Same character |
| ಣ/ण | | | Same character |
| | r_with_two_dots | | Old Kannada character |
| | l_with_two_dots | | Old Kannada character |
| ಶ/श | s_with_left_acute/ś/š | ś | Used in EC Vol 2; the character has fallen into disuse. EC Vol 3 uses ś. EC Vol 5 uses š |
| ಷ/ष | ś | | Used in EC Vol 2. EC Vol 3 and later use the compound character sh |
| ಜ್ಞ (ಜ್ + ಞ)/ज्ञ (ज् + ञ) | | | Same characters |
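The one-to-one rows of the table can be collapsed into a single translation table. A minimal sketch, covering only the unambiguous single-character rows (the sample word is hypothetical):

```python
# One-to-one rows of the substitution table as a str.translate mapping (sketch).
EC_TRANSLATE = str.maketrans({
    "â": "ā", "î": "ī", "û": "ū",  # long vowels
    "é": "ē", "ô": "ō",            # Dravidian long vowels
    "š": "ś",                      # EC Vol 5 spelling of ಶ/श
})

def translate_ec(text: str) -> str:
    """Apply the single-character EC -> ISO 15919 substitutions."""
    return text.translate(EC_TRANSLATE)

print(translate_ec("Chârukîrtti"))  # Chārukīrtti
```

Multi-character EC forms (such as s_with_left_acute) cannot go through str.translate and would need a separate replacement pass.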

Example

(image: text sample scanned from an EC PDF)

is copied as

Malliséna-bhatarara guddain Charengayyam tirtthamain bandisidam

and after correction becomes

Mallisēna-bhaṭārara guḍḍaṁ Chaṟeṅgayyaṁ tīrtthamaṁ bandisidaṁ
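Only the mechanical part of that correction can be scripted today. A sketch, applying just the é → ē rule from the substitution table to the raw OCR line:

```python
# Raw OCR output from the example above.
raw = "Malliséna-bhatarara guddain Charengayyam tirtthamain bandisidam"

# Apply the one mechanical rule that fires here: é -> ē.
partially_corrected = raw.replace("é", "ē")
print(partially_corrected)
# Mallisēna-bhatarara guddain Charengayyam tirtthamain bandisidam

# The remaining fixes (ṭ, ḍ, ī, ṟ, ṅ, ṁ) depend on consulting the
# Indic-script inscription and are applied manually for now.
```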

Note:

Only character-level substitutions and corrections are made, to render the text human-readable and partially standards-compliant. No attempt has been made to bring the entire translation/transliteration text fully in line with the standard. This might need to be done in the future.

For example,

  1. ಋ/ऋ is written in EC as ṛi, such as in kṛiṣṇa. This is retained, instead of changing the original text to kṛṣṇa.

  2. ಶ/श is written in EC Vol 3 as ś, while ಷ/ष is written as sh. During data cleaning, ś is corrected but sh is retained.

  3. ಚ/च is represented in EC as ch. This is retained instead of correcting to the standards compliant c. Similarly ಛ/छ is retained as chh and not corrected to ch.

  4. The anusvāra/anunāsikā poses problems. In Sanskrit, this character changes depending on the succeeding consonant. The usage of the ಂ/ं is also common. Correctness would require consulting the inscription in Indic characters to make sure that the anusvāra carries over accurately. This is done in a few cases, but ignored in the vast majority.

Thoughts on automation

  1. Some observations from repeatedly cleaning up texts manually could help with automation.

    For Kannada inscriptions, the following observations hold:

    • The OCR text consistently copies ē as é. Thus, this could be automated.
    • Similarly, it sometimes copies ô/ó instead of ō.
    • In some texts, â is copied as à or á. This could also be automated to replace these two characters with ā.
    • The $ character is inserted by the OCR in place of ś. This can be easily identified and corrected. Similarly, S' might actually represent Ś.
    • The character sequence śrī is very common; the few variants of it that the OCR produces can be identified and replaced with śrī.
    • Largely, the anusvāra is used properly by the ancient and medieval scribes. So when n is followed by a consonant, the appropriate cluster can be inserted properly.
    • OCR copies ti for ū quite often. A possible rule would be to search for ti between two consonants. (to implement)
    • Similarly, fi for ñ. A rule might be to look for fi between a preceding vowel and a few consonants (c, j).
    • An m at the end of a word, or of a word within a compound-word sequence, is possibly an ṁ if it is followed by a consonant.
  2. A large enough corpus of uncorrected and corrected texts could allow ML/DL approaches to have a go at it.

  3. A UI should provide a diff between the original and the automatically corrected text, along with an approve/reject mechanism. (to implement)
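The mechanical observations in item 1 can be prototyped as an ordered list of regex rules. The patterns below are assumptions drawn directly from those observations, not part of the released epiclean package:

```python
import re

# Candidate automation rules, in application order (sketch).
RULES = [
    (re.compile(r"é"), "ē"),       # OCR consistently yields é for ē
    (re.compile(r"[ôó]"), "ō"),    # ô/ó in place of ō
    (re.compile(r"[àá]"), "ā"),    # à/á in place of â (modern ā)
    (re.compile(r"\$"), "ś"),      # $ inserted in place of ś
    (re.compile(r"S'"), "Ś"),      # S' may represent Ś
    # A word-final m followed by a consonant is probably an anusvāra.
    (re.compile(r"m\b(?=[\s-][bcdgjklmnpstvy])"), "ṁ"),
]

def apply_rules(text: str) -> str:
    """Run every candidate rule over the text, in order."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(apply_rules("guddam bandisidam"))  # guddaṁ bandisidam
```

The ti → ū and fi → ñ rules need context-sensitive patterns (surrounding vowels and consonants) and are left out of this sketch.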

Using the software (in-development)

As the automation code reaches milestones, it will be released for use as a UI tool, as part of REST APIs for Epigraphia Carnatica database.

The current version is epiclean==0.0.5 and can be found on PyPI.

To install, run

pip install epiclean

Optionally, create a requirements.txt with the content

epiclean==<version>

and install with

pip install -r requirements.txt