Natural Language Toolkit for Indian Languages (iNLTK)


Keywords
data-augmentation, deep-learning, indic-languages, nlp, pytorch, sentence-embeddings, sentence-encoding, sentence-similarity, word-embeddings
License
MIT
Install
pip install inltk==0.9

Natural Language Toolkit for Indic Languages (iNLTK)


iNLTK aims to provide out-of-the-box support for the various NLP tasks that an application developer might need for Indic languages.

Documentation

Check out the detailed docs, along with installation instructions, at https://inltk.readthedocs.io

Supported languages

Language Code
Hindi hi
Punjabi pa
Sanskrit sa
Gujarati gu
Kannada kn
Malayalam ml
Nepali ne
Odia or
Marathi mr
Bengali bn
Tamil ta
Urdu ur
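
The codes in the table above are the identifiers used to select a language. As a minimal sketch (the names `SUPPORTED_LANGUAGES` and `language_code` are hypothetical helpers for illustration, not part of iNLTK), the table can be kept as a small lookup so that an unsupported language is caught before any model is requested:

```python
# Language codes copied from the "Supported languages" table.
# SUPPORTED_LANGUAGES and language_code are illustrative names, not iNLTK API.
SUPPORTED_LANGUAGES = {
    "Hindi": "hi",
    "Punjabi": "pa",
    "Sanskrit": "sa",
    "Gujarati": "gu",
    "Kannada": "kn",
    "Malayalam": "ml",
    "Nepali": "ne",
    "Odia": "or",
    "Marathi": "mr",
    "Bengali": "bn",
    "Tamil": "ta",
    "Urdu": "ur",
}


def language_code(name: str) -> str:
    """Return the code for a language name, ignoring case and outer spaces."""
    lookup = {key.lower(): code for key, code in SUPPORTED_LANGUAGES.items()}
    try:
        return lookup[name.strip().lower()]
    except KeyError:
        raise ValueError(f"{name!r} is not in the supported-languages table") from None


if __name__ == "__main__":
    print(language_code("Tamil"))  # prints: ta
```

The returned code is the value to pass wherever iNLTK asks for a language code; the exact function signatures are described in the docs linked above.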

Repositories containing models used in iNLTK

Language | Repository | Dataset used for Language modeling | Perplexity of ULMFiT LM | Perplexity of TransformerXL LM | Dataset used for Classification | Classification Accuracy | Classification Kappa score | ULMFiT Embeddings visualization | TransformerXL Embeddings visualization
Hindi | NLP for Hindi | Hindi Wikipedia Articles - 172k / Hindi Wikipedia Articles - 55k | 34.06 / 35.87 | 26.09 / 34.78 | Hindi Movie Reviews Dataset / BBC Hindi News Dataset | 61.66 / 79.79 | 42.29 / 73.01 | Hindi Embeddings projection | Hindi Embeddings projection
Punjabi | NLP for Punjabi | Punjabi Wikipedia Articles | 24.40 | 14.03 | Punjabi News Dataset | 89.17 | 54.5 | Punjabi Embeddings projection | Punjabi Embeddings projection
Sanskrit | NLP for Sanskrit | Sanskrit Wikipedia Articles | ~6 | ~3 | Sanskrit Shlokas Dataset | 84.3 | 76.1 | Sanskrit Embeddings projection | Sanskrit Embeddings projection
Gujarati | NLP for Gujarati | Gujarati Wikipedia Articles | 34.12 | 28.12 | Gujarati News Dataset | 92.4 | 87.9 | Gujarati Embeddings projection | Gujarati Embeddings projection
Kannada | NLP for Kannada | Kannada Wikipedia Articles | 70.10 | 61.97 | Kannada News Dataset | 95.9 | 93.04 | Kannada Embeddings projection | Kannada Embeddings projection
Malayalam | NLP for Malayalam | Malayalam Wikipedia Articles | 26.39 | 25.79 | Malayalam News Dataset | 94.36 | 91.54 | Malayalam Embeddings projection | Malayalam Embeddings projection
Nepali | NLP for Nepali | Nepali Wikipedia Articles | 31.5 | 29.3 | Nepali News Dataset | 98.5 | 97.7 | Nepali Embeddings projection | Nepali Embeddings projection
Odia | NLP for Odia | Odia Wikipedia Articles | 26.57 | 26.81 | Odia News Dataset | 95.52 | 93.02 | Odia Embeddings Projection | Odia Embeddings Projection
Marathi | NLP for Marathi | Marathi Wikipedia Articles | 18 | 17.42 | Marathi News Dataset | 93.55 | 87.50 | Marathi Embeddings projection | Marathi Embeddings projection
Bengali | NLP for Bengali | Bengali Wikipedia Articles | 41.2 | 39.3 | Bengali News Dataset | 93.8 | 92 | Bengali Embeddings projection | Bengali Embeddings projection
Tamil | NLP for Tamil | Tamil Wikipedia Articles | 19.80 | 17.22 | Tamil News Dataset | 96.78 | 95.09 | Tamil Embeddings projection | Tamil Embeddings projection
Urdu | NLP for Urdu | Urdu Wikipedia Articles | 13.19 | 12.55 | Urdu News Dataset | 95.28 | 91.58 | Urdu Embeddings projection | Urdu Embeddings projection
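
The table reports perplexity and kappa without defining them. Assuming they follow the usual definitions (perplexity as the exponential of the mean negative log-probability per token, and Cohen's kappa as agreement corrected for chance), they can be sketched as below. The function names are illustrative, and this is not the evaluation code from the repositories above. The accuracy and kappa values in the table also look as if they are scaled by 100; that reading is an assumption.

```python
import math
from collections import Counter


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each true token."""
    # Mean negative log-likelihood per token, then exponentiate.
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected by chance, from the two marginal label distributions.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Undefined (division by zero) when expected == 1, i.e. a single label throughout.
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every true token.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
    # Three of four predictions agree with the labels.
    print(cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0]))
```

Lower perplexity means the language model is less surprised by held-out text. Kappa is a stricter number than accuracy because it discounts the agreement a classifier would reach by chance on imbalanced classes.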

Contributing

Add a new language support

If you would like to add support for a language of your choice to iNLTK, please start by checking for an existing issue, or raising a new one, here

Please check out the steps I mentioned here for Telugu to begin with. They should be nearly the same for other languages.

Improving models/using models for your own research

If you would like to take iNLTK's models and refine them with your own dataset, or build your own custom models on top of them, please check out the repository for your language in the table above. Each repository contains links to the datasets, pretrained models, and classifiers, along with all of the code used to build them.

Add new functionality

If you would like a particular functionality in iNLTK, start by checking for an existing issue, or raising a new one, here

What's next

..and being worked upon

Shout out if you want to help :)

  • Add Telugu and Maithili support
  • Add NER support
  • Add Textual Entailment support
  • Add English to iNLTK

..and NOT being worked upon

Shout out if you want to lead :)

iNLTK's Appreciation