bltk

A lightweight but robust toolkit for Bengali Natural Language Processing.


Keywords
pos-tagger, pos, tagger, phrase, chunker, phrase-chunker, stemmer, bengali, natural, language, processing, Machine, learning, NLP
License
MIT
Install
pip install bltk==1.2

Documentation

BLTK: The Bengali Natural Language Processing Toolkit

A lightweight but robust toolkit for processing the Bengali language.


Overview

BLTK is a lightweight but robust language processing toolkit for the Bengali language. I, Mr. Saimon Hossain, together with my friend, Mr. Liton Shil, conducted this research as part of our undergraduate thesis under the supervision of our respected teacher, Mr. Sowmitra Das. BLTK is the outcome of our six-month-long research and development project.

I chose the name after taking inspiration from the popular natural language processing toolkit, NLTK.

BLTK is still young and maturing every day; it will continue to receive updates in the days to come.

If you want to contribute to BLTK's growth, please read the contribution section at the end of this page.

Supported Functionalities

  • Word Tokenization
  • Sentence Tokenization
  • Sentence Splitting
  • Stopwords Filtering
  • Statistical Part-of-speech Tagging
  • Phrase Chunking/Named-Entity Recognition
  • Stemming

Installation

To get BLTK up and running, run the following command:

pip install bltk

Usage

1) The Bengali Characters

In BLTK, the banglachars module contains seven lists of characters specific to the Bengali language:

  1. vowels
  2. vowel_signs
  3. consonants
  4. digits
  5. punctuations (punctuation marks)
  6. operators
  7. others

Code

from bltk.langtools.banglachars import (vowels,
                                        vowel_signs,
                                        consonants,
                                        digits,
                                        operators,
                                        punctuations,
                                        others)
print(f'Vowels: {vowels}')
print(f'Vowel signs: {vowel_signs}')
print(f'Consonants: {consonants}')
print(f'Digits: {digits}')
print(f'Operators: {operators}')
print(f'Punctuation marks: {punctuations}')
print(f'Others: {others}')

Output

Vowels: ['āĻ…', 'āĻ†', 'āĻ‡', 'āĻˆ', 'āĻ‰', 'āĻŠ', 'āĻ‹', 'āĻŒ', 'āĻ', 'āĻ', 'āĻ“', 'āĻ”']
Vowel signs: ['āĻž', 'āĻŋ', 'ā§€', 'ā§', 'ā§‚', 'ā§ƒ', 'ā§„', 'ā§‡', 'ā§ˆ', 'ā§‹', 'ā§Œ']
Consonants: ['āĻ•', 'āĻ–', 'āĻ—', 'āĻ˜', 'āĻ™', 'āĻš', 'āĻ›', 'āĻœ', 'āĻ', 'āĻž', 'āĻŸ', 'āĻ ', 'āĻĄ', 'āĻĸ', 'āĻŖ', 'āĻ¤', 'āĻĨ', 'āĻĻ', 'āĻ§', 'āĻ¨', 'āĻĒ', 'āĻĢ', 'āĻŦ', 'āĻ­', 'āĻŽ', 'āĻ¯', 'āĻ°', 'āĻ˛', 'āĻļ', 'āĻˇ', 'āĻ¸', 'āĻš', 'ā§œ', 'ā§', 'ā§Ÿ', 'ā§Ž', 'āĻ‚', 'āĻƒ', 'āĻ']
Digits: ['ā§Ļ', 'ā§§', 'ā§¨', 'ā§Š', 'ā§Ē', 'ā§Ģ', 'ā§Ŧ', 'ā§­', 'ā§Ž', 'ā§¯']
Operators: ['=', '+', '-', '*', '/', '%', '<', '>', '×', 'Ãˇ']
Punctuation marks: ['āĨ¤', ',', ';', ':', '?', '!', "'", '.', '"', '-', '[', ']', '{', '}', '(', ')', '–', '—', '―', '~']
Others: ['ā§ŗ', 'ā§ē', 'ā§', 'āĻ€', 'āĻŊ', '#', '$']

2) Word Tokenization

In BLTK, the word_tokenizer(text: str) method of the Tokenizer class performs word tokenization. It takes a text string and returns a list of word tokens. The following code shows how it is done.

Code

from bltk.langtools import Tokenizer

# Sample text
text = "āĻ†āĻŽāĻŋ āĻœāĻžāĻ¨āĻŋ āĻ†āĻŽāĻžāĻ° āĻāĻ‡ āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ° āĻœāĻ¨ā§āĻ¯ āĻ†āĻŽāĻžāĻ•ā§‡ āĻ…āĻ¨ā§‡āĻ• āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ āĻļā§āĻ¨āĻ¤ā§‡ āĻšāĻŦā§‡, āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“ āĻ˛āĻŋāĻ–āĻ›āĻŋāĨ¤ "\
       "āĻ˛āĻŋāĻ–ā§‡ āĻ–ā§āĻŦ āĻ•āĻžāĻœ āĻšā§Ÿ āĻ¸ā§‡ āĻ°āĻ•āĻŽ āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ āĻ†āĻŽāĻžāĻ° āĻšāĻžāĻ¤ā§‡ āĻ–ā§āĻŦ āĻŦā§‡āĻļā§€ āĻ¨ā§‡āĻ‡ āĻ•āĻŋāĻ¨ā§āĻ¤ā§ āĻ…āĻ¨ā§āĻ¤āĻ¤ āĻ¨āĻŋāĻœā§‡āĻ° āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ° āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§ āĻŦā§‡āĻ° āĻ•āĻ°āĻž " \
       "āĻ¯āĻžā§Ÿ āĻ¸ā§‡āĻŸāĻžāĻ‡ āĻ†āĻŽāĻžāĻ° āĻœāĻ¨ā§āĻ¯ā§‡ āĻ…āĻ¨ā§‡āĻ•āĨ¤"

# Creating an instance
tokenizer = Tokenizer()

# Tokenizing words
print('TOKENIZED WORDS')
words = tokenizer.word_tokenizer(text)
print(words)

Output

TOKENIZED WORDS
['āĻ†āĻŽāĻŋ', 'āĻœāĻžāĻ¨āĻŋ', 'āĻ†āĻŽāĻžāĻ°', 'āĻāĻ‡', 'āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ°', 'āĻœāĻ¨ā§āĻ¯', 'āĻ†āĻŽāĻžāĻ•ā§‡', 'āĻ…āĻ¨ā§‡āĻ•', 'āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ', 'āĻļā§āĻ¨āĻ¤ā§‡', 'āĻšāĻŦā§‡', ',', 'āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“', 'āĻ˛āĻŋāĻ–āĻ›āĻŋ', 'āĨ¤', 'āĻ˛āĻŋāĻ–ā§‡', 'āĻ–ā§āĻŦ', 'āĻ•āĻžāĻœ', 'āĻšā§Ÿ', 'āĻ¸ā§‡', 'āĻ°āĻ•āĻŽ', 'āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ', 'āĻ†āĻŽāĻžāĻ°', 'āĻšāĻžāĻ¤ā§‡', 'āĻ–ā§āĻŦ', 'āĻŦā§‡āĻļā§€', 'āĻ¨ā§‡āĻ‡', 'āĻ•āĻŋāĻ¨ā§āĻ¤ā§', 'āĻ…āĻ¨ā§āĻ¤āĻ¤', 'āĻ¨āĻŋāĻœā§‡āĻ°', 'āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ°', 'āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§', 'āĻŦā§‡āĻ°', 'āĻ•āĻ°āĻž', 'āĻ¯āĻžā§Ÿ', 'āĻ¸ā§‡āĻŸāĻžāĻ‡', 'āĻ†āĻŽāĻžāĻ°', 'āĻœāĻ¨ā§āĻ¯ā§‡', 'āĻ…āĻ¨ā§‡āĻ•', 'āĨ¤']

3) Sentence Tokenization

In Bengali, most sentence delimiters are the same as in English, except for the full stop. Statements and imperative sentences are terminated by āĨ¤ (the danda). Questions and exclamatory sentences are terminated by ? and ! respectively.

In BLTK, the sentence_tokenizer(text: str) method of the Tokenizer class performs sentence tokenization. It takes a text string and returns a list of sentences. The following code shows how it is done.

Code

from bltk.langtools import Tokenizer

# Sample text
text = "āĻ†āĻŽāĻŋ āĻœāĻžāĻ¨āĻŋ āĻ†āĻŽāĻžāĻ° āĻāĻ‡ āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ° āĻœāĻ¨ā§āĻ¯ āĻ†āĻŽāĻžāĻ•ā§‡ āĻ…āĻ¨ā§‡āĻ• āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ āĻļā§āĻ¨āĻ¤ā§‡ āĻšāĻŦā§‡, āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“ āĻ˛āĻŋāĻ–āĻ›āĻŋāĨ¤ " \
       "āĻ˛āĻŋāĻ–ā§‡ āĻ–ā§āĻŦ āĻ•āĻžāĻœ āĻšā§Ÿ āĻ¸ā§‡ āĻ°āĻ•āĻŽ āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ āĻ†āĻŽāĻžāĻ° āĻšāĻžāĻ¤ā§‡ āĻ–ā§āĻŦ āĻŦā§‡āĻļā§€ āĻ¨ā§‡āĻ‡ āĻ•āĻŋāĻ¨ā§āĻ¤ā§ āĻ…āĻ¨ā§āĻ¤āĻ¤ āĻ¨āĻŋāĻœā§‡āĻ° āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ° āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§ āĻŦā§‡āĻ° āĻ•āĻ°āĻž " \
       "āĻ¯āĻžā§Ÿ āĻ¸ā§‡āĻŸāĻžāĻ‡ āĻ†āĻŽāĻžāĻ° āĻœāĻ¨ā§āĻ¯ā§‡ āĻ…āĻ¨ā§‡āĻ•āĨ¤"

# Creating an instance
tokenizer = Tokenizer()


# Tokenizing Sentences
print("TOKENIZED SENTENCES")
sentences = tokenizer.sentence_tokenizer(text)
print(sentences)

Output

TOKENIZED SENTENCES
['āĻ†āĻŽāĻŋ āĻœāĻžāĻ¨āĻŋ āĻ†āĻŽāĻžāĻ° āĻāĻ‡ āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ° āĻœāĻ¨ā§āĻ¯ āĻ†āĻŽāĻžāĻ•ā§‡ āĻ…āĻ¨ā§‡āĻ• āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ āĻļā§āĻ¨āĻ¤ā§‡ āĻšāĻŦā§‡, āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“ āĻ˛āĻŋāĻ–āĻ›āĻŋāĨ¤', 'āĻ˛āĻŋāĻ–ā§‡ āĻ–ā§āĻŦ āĻ•āĻžāĻœ āĻšā§Ÿ āĻ¸ā§‡ āĻ°āĻ•āĻŽ āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ āĻ†āĻŽāĻžāĻ° āĻšāĻžāĻ¤ā§‡ āĻ–ā§āĻŦ āĻŦā§‡āĻļā§€ āĻ¨ā§‡āĻ‡ āĻ•āĻŋāĻ¨ā§āĻ¤ā§ āĻ…āĻ¨ā§āĻ¤āĻ¤ āĻ¨āĻŋāĻœā§‡āĻ° āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ° āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§ āĻŦā§‡āĻ° āĻ•āĻ°āĻž āĻ¯āĻžā§Ÿ āĻ¸ā§‡āĻŸāĻžāĻ‡ āĻ†āĻŽāĻžāĻ° āĻœāĻ¨ā§āĻ¯ā§‡ āĻ…āĻ¨ā§‡āĻ•āĨ¤']

4) Sentence Splitting

The sentence_splitter(sentence: list) method takes a list of tokenized sentences and splits each one into its corresponding list of word tokens with the help of the word_tokenizer() method. The return value is a list of word-token lists.

Code

from bltk.langtools import Tokenizer

# Sample text
text = "āĻ†āĻŽāĻŋ āĻœāĻžāĻ¨āĻŋ āĻ†āĻŽāĻžāĻ° āĻāĻ‡ āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ° āĻœāĻ¨ā§āĻ¯ āĻ†āĻŽāĻžāĻ•ā§‡ āĻ…āĻ¨ā§‡āĻ• āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ āĻļā§āĻ¨āĻ¤ā§‡ āĻšāĻŦā§‡, āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“ āĻ˛āĻŋāĻ–āĻ›āĻŋāĨ¤ " \
       "āĻ˛āĻŋāĻ–ā§‡ āĻ–ā§āĻŦ āĻ•āĻžāĻœ āĻšā§Ÿ āĻ¸ā§‡ āĻ°āĻ•āĻŽ āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ āĻ†āĻŽāĻžāĻ° āĻšāĻžāĻ¤ā§‡ āĻ–ā§āĻŦ āĻŦā§‡āĻļā§€ āĻ¨ā§‡āĻ‡ āĻ•āĻŋāĻ¨ā§āĻ¤ā§ āĻ…āĻ¨ā§āĻ¤āĻ¤ āĻ¨āĻŋāĻœā§‡āĻ° āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ° āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§ āĻŦā§‡āĻ° āĻ•āĻ°āĻž " \
       "āĻ¯āĻžā§Ÿ āĻ¸ā§‡āĻŸāĻžāĻ‡ āĻ†āĻŽāĻžāĻ° āĻœāĻ¨ā§āĻ¯ā§‡ āĻ…āĻ¨ā§‡āĻ•āĨ¤"

# Creating an instance
tokenizer = Tokenizer()


# Tokenizing Sentences
sentences = tokenizer.sentence_tokenizer(text)

print("SPLIT SENTENCES")
sentence_list = tokenizer.sentence_splitter(sentences)
print(sentence_list)

print("INDIVIDUAL SENTENCE")
for i in sentence_list:
    print(i)

Output

SPLIT SENTENCES
[['āĻ†āĻŽāĻŋ', 'āĻœāĻžāĻ¨āĻŋ', 'āĻ†āĻŽāĻžāĻ°', 'āĻāĻ‡', 'āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ°', 'āĻœāĻ¨ā§āĻ¯', 'āĻ†āĻŽāĻžāĻ•ā§‡', 'āĻ…āĻ¨ā§‡āĻ•', 'āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ', 'āĻļā§āĻ¨āĻ¤ā§‡', 'āĻšāĻŦā§‡', ',', 'āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“', 'āĻ˛āĻŋāĻ–āĻ›āĻŋ', 'āĨ¤'], ['āĻ˛āĻŋāĻ–ā§‡', 'āĻ–ā§āĻŦ', 'āĻ•āĻžāĻœ', 'āĻšā§Ÿ', 'āĻ¸ā§‡', 'āĻ°āĻ•āĻŽ', 'āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ', 'āĻ†āĻŽāĻžāĻ°', 'āĻšāĻžāĻ¤ā§‡', 'āĻ–ā§āĻŦ', 'āĻŦā§‡āĻļā§€', 'āĻ¨ā§‡āĻ‡', 'āĻ•āĻŋāĻ¨ā§āĻ¤ā§', 'āĻ…āĻ¨ā§āĻ¤āĻ¤', 'āĻ¨āĻŋāĻœā§‡āĻ°', 'āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ°', 'āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§', 'āĻŦā§‡āĻ°', 'āĻ•āĻ°āĻž', 'āĻ¯āĻžā§Ÿ', 'āĻ¸ā§‡āĻŸāĻžāĻ‡', 'āĻ†āĻŽāĻžāĻ°', 'āĻœāĻ¨ā§āĻ¯ā§‡', 'āĻ…āĻ¨ā§‡āĻ•', 'āĨ¤']]

INDIVIDUAL SENTENCE

['āĻ†āĻŽāĻŋ', 'āĻœāĻžāĻ¨āĻŋ', 'āĻ†āĻŽāĻžāĻ°', 'āĻāĻ‡', 'āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ°', 'āĻœāĻ¨ā§āĻ¯', 'āĻ†āĻŽāĻžāĻ•ā§‡', 'āĻ…āĻ¨ā§‡āĻ•', 'āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ', 'āĻļā§āĻ¨āĻ¤ā§‡', 'āĻšāĻŦā§‡', ',', 'āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“', 'āĻ˛āĻŋāĻ–āĻ›āĻŋ', 'āĨ¤']
['āĻ˛āĻŋāĻ–ā§‡', 'āĻ–ā§āĻŦ', 'āĻ•āĻžāĻœ', 'āĻšā§Ÿ', 'āĻ¸ā§‡', 'āĻ°āĻ•āĻŽ', 'āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ', 'āĻ†āĻŽāĻžāĻ°', 'āĻšāĻžāĻ¤ā§‡', 'āĻ–ā§āĻŦ', 'āĻŦā§‡āĻļā§€', 'āĻ¨ā§‡āĻ‡', 'āĻ•āĻŋāĻ¨ā§āĻ¤ā§', 'āĻ…āĻ¨ā§āĻ¤āĻ¤', 'āĻ¨āĻŋāĻœā§‡āĻ°', 'āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ°', 'āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§', 'āĻŦā§‡āĻ°', 'āĻ•āĻ°āĻž', 'āĻ¯āĻžā§Ÿ', 'āĻ¸ā§‡āĻŸāĻžāĻ‡', 'āĻ†āĻŽāĻžāĻ°', 'āĻœāĻ¨ā§āĻ¯ā§‡', 'āĻ…āĻ¨ā§‡āĻ•', 'āĨ¤']


5) Stopwords Filtering

BLTK's remove_stopwords(words: list, level: str = 'soft') function performs a soft stopword elimination by default. It takes two parameters: a list of words and a keyword argument level, which can be 'soft', 'moderate', or 'hard'. If level is not given, a soft elimination is performed.

Filtering stopwords is not always an ideal choice. No language has a universal list of stopwords, and different researchers use different elimination methods. If you are not sure which level to use, use the default.

Code

from bltk.langtools import remove_stopwords
from bltk.langtools import Tokenizer

tokenizer = Tokenizer()

text = "āĻ†āĻŽāĻŋ āĻœāĻžāĻ¨āĻŋ āĻ†āĻŽāĻžāĻ° āĻāĻ‡ āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ° āĻœāĻ¨ā§āĻ¯ āĻ†āĻŽāĻžāĻ•ā§‡ āĻ…āĻ¨ā§‡āĻ• āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ āĻļā§āĻ¨āĻ¤ā§‡ āĻšāĻŦā§‡, āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“ āĻ˛āĻŋāĻ–āĻ›āĻŋāĨ¤ " \
       "āĻ˛āĻŋāĻ–ā§‡ āĻ–ā§āĻŦ āĻ•āĻžāĻœ āĻšā§Ÿ āĻ¸ā§‡ āĻ°āĻ•āĻŽ āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ āĻ†āĻŽāĻžāĻ° āĻšāĻžāĻ¤ā§‡ āĻ–ā§āĻŦ āĻŦā§‡āĻļā§€ āĻ¨ā§‡āĻ‡ āĻ•āĻŋāĻ¨ā§āĻ¤ā§ āĻ…āĻ¨ā§āĻ¤āĻ¤ āĻ¨āĻŋāĻœā§‡āĻ° āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ° āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§ āĻŦā§‡āĻ° āĻ•āĻ°āĻž " \
       "āĻ¯āĻžā§Ÿ āĻ¸ā§‡āĻŸāĻžāĻ‡ āĻ†āĻŽāĻžāĻ° āĻœāĻ¨ā§āĻ¯ā§‡ āĻ…āĻ¨ā§‡āĻ•āĨ¤"

tokened_words = tokenizer.word_tokenizer(text)

print(f"Len of words: {len(tokened_words)}")
print(f"After soft elimination: {(remove_stopwords(tokened_words))}")
print(f"Length after soft elimination: {len(remove_stopwords(tokened_words))}")
print(f"After moderate elimination: {(remove_stopwords(tokened_words, level='moderate'))}")
print(f"Length after moderate elimination: {len(remove_stopwords(tokened_words, level='moderate'))}")
print(f"After hard elimination: {(remove_stopwords(tokened_words, level='hard'))}")
print(f"Length after hard elimination: {len(remove_stopwords(tokened_words, level='hard'))}")

Output

Len of words: 40
After soft elimination: ['āĻœāĻžāĻ¨āĻŋ', 'āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ°', 'āĻ…āĻ¨ā§‡āĻ•', 'āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ', 'āĻļā§āĻ¨āĻ¤ā§‡', 'āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“', 'āĻ˛āĻŋāĻ–āĻ›āĻŋ', 'āĻ˛āĻŋāĻ–ā§‡', 'āĻ•āĻžāĻœ', 'āĻ°āĻ•āĻŽ', 'āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ', 'āĻšāĻžāĻ¤ā§‡', 'āĻŦā§‡āĻļā§€', 'āĻ…āĻ¨ā§āĻ¤āĻ¤', 'āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ°', 'āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§', 'āĻŦā§‡āĻ°', 'āĻ•āĻ°āĻž', 'āĻ¸ā§‡āĻŸāĻžāĻ‡', 'āĻ…āĻ¨ā§‡āĻ•']
Length after soft elimination: 20
After moderate elimination: ['āĻœāĻžāĻ¨āĻŋ', 'āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ°', 'āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ', 'āĻļā§āĻ¨āĻ¤ā§‡', 'āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“', 'āĻ˛āĻŋāĻ–āĻ›āĻŋ', 'āĻ˛āĻŋāĻ–ā§‡', 'āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ', 'āĻšāĻžāĻ¤ā§‡', 'āĻŦā§‡āĻļā§€', 'āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ°', 'āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§', 'āĻŦā§‡āĻ°', 'āĻ¯āĻžā§Ÿ', 'āĻœāĻ¨ā§āĻ¯ā§‡']
Length after moderate elimination: 15
After hard elimination: ['āĻœāĻžāĻ¨āĻŋ', 'āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ°', 'āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ', 'āĻļā§āĻ¨āĻ¤ā§‡', 'āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“', 'āĻ˛āĻŋāĻ–āĻ›āĻŋ', 'āĻ˛āĻŋāĻ–ā§‡', 'āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ', 'āĻšāĻžāĻ¤ā§‡', 'āĻŦā§‡āĻļā§€', 'āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ°', 'āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§', 'āĻŦā§‡āĻ°']
Length after hard elimination: 13
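To make the three levels concrete, tiered elimination can be sketched as follows. The tiny stopword sets and the function name here are illustrative assumptions, not BLTK's actual lists or API:

```python
# Illustrative tiers: each level removes everything the previous one does, plus more.
SOFT = {'āĻ†āĻŽāĻŋ', 'āĻ†āĻŽāĻžāĻ°', 'āĻāĻ‡'}
MODERATE = SOFT | {'āĻ–ā§āĻŦ', 'āĻ…āĻ¨ā§‡āĻ•'}
HARD = MODERATE | {'āĻ•āĻŋāĻ¨ā§āĻ¤ā§'}

def remove_stopwords_sketch(words: list, level: str = 'soft') -> list:
    # Pick the stopword set for the requested level and filter it out.
    stop = {'soft': SOFT, 'moderate': MODERATE, 'hard': HARD}[level]
    return [w for w in words if w not in stop]

words = ['āĻ†āĻŽāĻŋ', 'āĻœāĻžāĻ¨āĻŋ', 'āĻ–ā§āĻŦ', 'āĻ•āĻžāĻœ']
print(remove_stopwords_sketch(words))                    # soft keeps 'āĻ–ā§āĻŦ'
print(remove_stopwords_sketch(words, level='moderate'))  # moderate removes 'āĻ–ā§āĻŦ' as well
```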

6) Statistical Part-of-speech Tagging

BLTK includes a statistical POS tagger with an overall accuracy of 95.9%. The tagger works at the sentence level: instead of tagging each word in isolation, it tags words within a sentence or phrase, taking features such as the previous and next word into consideration. It relies on a Logistic Regression classifier.

BLTK's PosTagger class has a pos_tag() method that takes a list of split sentences and returns a list of tagged sentences. Each tagged sentence is a list of 2-tuples, where the first element of each tuple is the word itself and the second element is its corresponding tag.

Code

from bltk.langtools import PosTagger
from bltk.langtools import Tokenizer

pos_tagger = PosTagger()
tokenizer = Tokenizer()

text = "āĻ†āĻŽāĻŋ āĻœāĻžāĻ¨āĻŋ āĻ†āĻŽāĻžāĻ° āĻāĻ‡ āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ° āĻœāĻ¨ā§āĻ¯ āĻ†āĻŽāĻžāĻ•ā§‡ āĻ…āĻ¨ā§‡āĻ• āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ āĻļā§āĻ¨āĻ¤ā§‡ āĻšāĻŦā§‡, āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“ āĻ˛āĻŋāĻ–āĻ›āĻŋāĨ¤ " \
       "āĻ˛āĻŋāĻ–ā§‡ āĻ–ā§āĻŦ āĻ•āĻžāĻœ āĻšā§Ÿ āĻ¸ā§‡ āĻ°āĻ•āĻŽ āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ āĻ†āĻŽāĻžāĻ° āĻšāĻžāĻ¤ā§‡ āĻ–ā§āĻŦ āĻŦā§‡āĻļā§€ āĻ¨ā§‡āĻ‡ āĻ•āĻŋāĻ¨ā§āĻ¤ā§ āĻ…āĻ¨ā§āĻ¤āĻ¤ āĻ¨āĻŋāĻœā§‡āĻ° āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ° āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§ āĻŦā§‡āĻ° āĻ•āĻ°āĻž " \
       "āĻ¯āĻžā§Ÿ āĻ¸ā§‡āĻŸāĻžāĻ‡ āĻ†āĻŽāĻžāĻ° āĻœāĻ¨ā§āĻ¯ā§‡ āĻ…āĻ¨ā§‡āĻ•āĨ¤"

token_text = tokenizer.sentence_tokenizer(text)


pos_tags = []
for text in token_text:
    tokened = tokenizer.word_tokenizer(text)
    tagged = pos_tagger.pos_tag(tokened)
    pos_tags.append(tagged)
print(pos_tags)

Output

[[('āĻ†āĻŽāĻŋ', 'PPR'), ('āĻœāĻžāĻ¨āĻŋ', 'VM'), ('āĻ†āĻŽāĻžāĻ°', 'PPR'), ('āĻāĻ‡', 'DAB'), ('āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ°', 'NC'), ('āĻœāĻ¨ā§āĻ¯', 'PP'), ('āĻ†āĻŽāĻžāĻ•ā§‡', 'PPR'), ('āĻ…āĻ¨ā§‡āĻ•', 'JQ'), ('āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ', 'NC'), ('āĻļā§āĻ¨āĻ¤ā§‡', 'VM'), ('āĻšāĻŦā§‡', 'VA'), (',', 'PU'), ('āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“', 'ALC'), ('āĻ˛āĻŋāĻ–āĻ›āĻŋ', 'VM'), ('āĨ¤', 'PU')], [('āĻ˛āĻŋāĻ–ā§‡', 'VM'), ('āĻ–ā§āĻŦ', 'JQ'), ('āĻ•āĻžāĻœ', 'NC'), ('āĻšā§Ÿ', 'VM'), ('āĻ¸ā§‡', 'PPR'), ('āĻ°āĻ•āĻŽ', 'NC'), ('āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ', 'NC'), ('āĻ†āĻŽāĻžāĻ°', 'PPR'), ('āĻšāĻžāĻ¤ā§‡', 'NC'), ('āĻ–ā§āĻŦ', 'JQ'), ('āĻŦā§‡āĻļā§€', 'JJ'), ('āĻ¨ā§‡āĻ‡', 'VM'), ('āĻ•āĻŋāĻ¨ā§āĻ¤ā§', 'CSB'), ('āĻ…āĻ¨ā§āĻ¤āĻ¤', 'CSB'), ('āĻ¨āĻŋāĻœā§‡āĻ°', 'PRF'), ('āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ°', 'NST'), ('āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§', 'NC'), ('āĻŦā§‡āĻ°', 'VM'), ('āĻ•āĻ°āĻž', 'NV'), ('āĻ¯āĻžā§Ÿ', 'VM'), ('āĻ¸ā§‡āĻŸāĻžāĻ‡', 'PPR'), ('āĻ†āĻŽāĻžāĻ°', 'PPR'), ('āĻœāĻ¨ā§āĻ¯ā§‡', 'PP'), ('āĻ…āĻ¨ā§‡āĻ•', 'JQ'), ('āĨ¤', 'PU')]]
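Each tagged sentence above is an ordinary list of (word, tag) tuples, so it can be consumed with plain Python. The sample below is hardcoded from the output rather than produced by calling the tagger:

```python
# A tagged sentence fragment, copied from the output above.
tagged = [('āĻ†āĻŽāĻŋ', 'PPR'), ('āĻœāĻžāĻ¨āĻŋ', 'VM'), ('āĻšāĻŦā§‡', 'VA'), ('āĻ˛āĻŋāĻ–āĻ›āĻŋ', 'VM'), ('āĨ¤', 'PU')]

# Collect all main verbs (tag 'VM').
verbs = [word for word, tag in tagged if tag == 'VM']
print(verbs)  # ['āĻœāĻžāĻ¨āĻŋ', 'āĻ˛āĻŋāĻ–āĻ›āĻŋ']
```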


7) Phrase Chunking/Named-Entity Recognition

BLTK's phrase chunker can extract all of the phrases in a given text, as long as a correct grammatical pattern for the phrase is provided in the form of a regular expression. The chunker builds on BLTK's POS tagger, described above, and on NLTK's powerful regular-expression parser.

BLTK's Chunker class takes two parameters: a grammar in the form of a regular expression, and a tokenized text from which phrases will be extracted. Its chunk() method returns the parse tree of the chunked sentence.

This section explains how to create a noun phrase chunker using BLTK's Chunker class and a regular expression grammar. A noun phrase begins with an optional demonstrative, followed by zero or more adjectives/quantifiers and terminates with a noun. Some examples of Bangla noun phrases are given below:

NP: (NP āĻ—āĻŖāĻ¤āĻ¨ā§āĻ¤ā§āĻ°/NC) - a noun phrase with only one noun.

NP: (NP āĻŽāĻžāĻ¨āĻŦāĻŋāĻ•/JJ āĻŦā§‹āĻ§/NC) - a noun phrase with an adjective followed by a noun.

NP: (NP āĻāĻ‡/DAB āĻ¸ā§āĻ¨ā§āĻĻāĻ°/JJ āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ°/NC) - a noun phrase with a demonstrative, followed by an adjective and terminated by a noun.

The grammar for extracting Bangla noun phrases can be constructed with the following regular expression.

 NP: {<DAB|DRL>?<JJ|JQ>*<N.>} 

The tags used to construct the grammar, as well as to train the POS tagger, are explained in the following table, as specified by researchers at Microsoft Research India.

Name                      Tag   Example
COMMON NOUN               NC    āĻŽāĻžāĻ¨ā§āĻˇ
PROPER NOUN               NP    āĻ°āĻŦā§€āĻ¨ā§āĻĻā§āĻ°āĻ¨āĻžāĻĨ
VERBAL NOUN               NV    āĻ˜āĻŸāĻžāĻ¨ā§‹
SPATIO-TEMPORAL NOUN      NST   āĻ‰āĻĒāĻ°ā§‡
MAIN VERB                 VM    āĻ•āĻ°āĻ›āĻŋāĻ˛ā§‡āĻ¨
AUXILIARY VERB            VA    āĻāĻ¸ā§‡
PRONOMINAL PRONOUN        PPR   āĻ†āĻŽāĻžāĻĻā§‡āĻ°
REFLEXIVE PRONOUN         PRF   āĻ¨āĻŋāĻœ
RECIPROCAL PRONOUN        PRC   āĻĒāĻ°āĻ¸ā§āĻĒāĻ°
RELATIVE PRONOUN          PRL   āĻ¯āĻžāĻšāĻžāĻ°
WH-PRONOUN                PWH   āĻ•ā§‡āĻ¨
ADJECTIVE                 JJ    āĻ—ā§āĻ°ā§āĻ¤ā§āĻŦāĻĒā§‚āĻ°ā§āĻŖ
QUANTIFIER                JQ    āĻ•ā§Ÿā§‡āĻ•āĻŸāĻŋ
ABSOLUTE DEMONSTRATIVE    DAB   āĻāĻ‡
RELATIVE DEMONSTRATIVE    DRL   āĻ¯ā§‡
WH-DEMONSTRATIVE          DWH   āĻ•ā§€
ADVERB OF MANNER          AMN   āĻ†āĻŦāĻžāĻ°
ADVERB OF LOCATION        ALC   āĻ¯āĻ–āĻ¨
CONDITIONAL PARTICIPLE    LC    āĻšāĻ˛ā§‡āĻ‡
VERBAL PARTICIPLE         LV    āĻŦāĻ‡āĻ¤ā§‡-āĻŦāĻ‡āĻ¤ā§‡āĻ‡
POSTPOSITION              PP    āĻœāĻ¨ā§āĻ¯
COORDINATING PARTICLE     CCD   āĻāĻŦāĻ‚
SUBORDINATING PARTICLE    CSB   āĻ¸ā§āĻ¤āĻ°āĻžāĻ‚
CLASSIFIER PARTICLE       CCL   āĻĒā§āĻ°āĻŽā§āĻ–
INTERJECTION              CIN   āĻ†āĻ°ā§‡
OTHER PARTICLE            CX    āĻ¤āĻžāĻ‡
PUNCTUATION               PU    ā§ˇ
FOREIGN WORD              RDF   Schedule
SYMBOL                    RDS   $
OTHER                     RDX   ā§Šā§Ģā§Ŧ

Like the grammar for noun phrases, grammars for verb phrases, postpositional phrases, etc. can be constructed with valid regular expressions.

Code

from bltk.langtools import Tokenizer
from bltk.langtools import Chunker


grammar = r"""
  NP: {<DAB>?<JJ|JQ>*<N.>}      
  """
text = "āĻ†āĻŽāĻŋ āĻœāĻžāĻ¨āĻŋ āĻ†āĻŽāĻžāĻ° āĻāĻ‡ āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ° āĻœāĻ¨ā§āĻ¯ āĻ†āĻŽāĻžāĻ•ā§‡ āĻ…āĻ¨ā§‡āĻ• āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ āĻļā§āĻ¨āĻ¤ā§‡ āĻšāĻŦā§‡, āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“ āĻ˛āĻŋāĻ–āĻ›āĻŋāĨ¤ " \
       "āĻ˛āĻŋāĻ–ā§‡ āĻ–ā§āĻŦ āĻ•āĻžāĻœ āĻšā§Ÿ āĻ¸ā§‡ āĻ°āĻ•āĻŽ āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ āĻ†āĻŽāĻžāĻ° āĻšāĻžāĻ¤ā§‡ āĻ–ā§āĻŦ āĻŦā§‡āĻļā§€ āĻ¨ā§‡āĻ‡ āĻ•āĻŋāĻ¨ā§āĻ¤ā§ āĻ…āĻ¨ā§āĻ¤āĻ¤ āĻ¨āĻŋāĻœā§‡āĻ° āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ° āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§ āĻŦā§‡āĻ° āĻ•āĻ°āĻž " \
       "āĻ¯āĻžā§Ÿ āĻ¸ā§‡āĻŸāĻžāĻ‡ āĻ†āĻŽāĻžāĻ° āĻœāĻ¨ā§āĻ¯ā§‡ āĻ…āĻ¨ā§‡āĻ•āĨ¤"

tokenizer = Tokenizer()
sentences = tokenizer.sentence_tokenizer(text)
tokened_text = [tokenizer.word_tokenizer(sentence) for sentence in sentences]

noun_phrases = []
for t in tokened_text:
    chunky = Chunker(grammar=grammar, tokened_text=t)
    chunk_tree = chunky.chunk()
    for i in chunk_tree.subtrees():
        if i.label() == "NP":
            print(i)
            noun_phrases.append(i)

Output

(NP āĻāĻ‡/DAB āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ°/NC)
(NP āĻ…āĻ¨ā§‡āĻ•/JQ āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ/NC)
(NP āĻ–ā§āĻŦ/JQ āĻ•āĻžāĻœ/NC)
(NP āĻ°āĻ•āĻŽ/NC)
(NP āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ/NC)
(NP āĻšāĻžāĻ¤ā§‡/NC)
(NP āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§/NC)
(NP āĻ•āĻ°āĻž/NV)

Note: BLTK's phrase chunker relies on BLTK's POS tagger and NLTK's RegexpParser. For complete documentation on NLTK's Tree class, which RegexpParser uses, see the NLTK documentation.
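The chunk grammar is, at heart, a regular expression over tag sequences. As a rough stdlib-only sketch of that idea (not NLTK's RegexpParser), the tags can be joined into a string and the NP pattern matched directly:

```python
import re

# A tagged fragment, taken from the noun-phrase examples above.
tagged = [('āĻāĻ‡', 'DAB'), ('āĻ¸ā§āĻ¨ā§āĻĻāĻ°', 'JJ'), ('āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ°', 'NC'), ('āĻšā§Ÿ', 'VM')]

# Encode the tag sequence as a space-delimited string: 'DAB JJ NC VM'.
tag_string = ' '.join(tag for _, tag in tagged)

# NP: {<DAB|DRL>?<JJ|JQ>*<N.>} rendered as a plain regex over that string.
np_pattern = r'(?:(?:DAB|DRL) )?(?:(?:JJ|JQ) )*N.'
match = re.search(np_pattern, tag_string)
print(match.group())  # 'DAB JJ NC'
```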

8) Stemming

BLTK currently supports one stemmer: the Ugra stemmer. It relies on pre-arranged lists of suffixes and on BLTK's POS tagger for stemming Bangla words. POS tagging is performed before any stemming because eliminating suffixes without first determining the part of speech of a word leads to serious mis-stemming.

The inflectional morphemes 'āĻ“' and 'āĻ‡' modify a word, as in 'āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“'. Ugra eliminates a trailing 'āĻ“' or 'āĻ‡' and ensures that the remaining word is at least two characters long. If 'āĻ“' or 'āĻ‡' is an independent word, it is never removed.

Code

from bltk.langtools import UgraStemmer
from bltk.langtools import Tokenizer


text = "āĻ†āĻŽāĻŋ āĻœāĻžāĻ¨āĻŋ āĻ†āĻŽāĻžāĻ° āĻāĻ‡ āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ° āĻœāĻ¨ā§āĻ¯ āĻ†āĻŽāĻžāĻ•ā§‡ āĻ…āĻ¨ā§‡āĻ• āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ āĻļā§āĻ¨āĻ¤ā§‡ āĻšāĻŦā§‡, āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“ āĻ˛āĻŋāĻ–āĻ›āĻŋāĨ¤ " \
       "āĻ˛āĻŋāĻ–ā§‡ āĻ–ā§āĻŦ āĻ•āĻžāĻœ āĻšā§Ÿ āĻ¸ā§‡ āĻ°āĻ•āĻŽ āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ āĻ†āĻŽāĻžāĻ° āĻšāĻžāĻ¤ā§‡ āĻ–ā§āĻŦ āĻŦā§‡āĻļā§€ āĻ¨ā§‡āĻ‡ āĻ•āĻŋāĻ¨ā§āĻ¤ā§ āĻ…āĻ¨ā§āĻ¤āĻ¤ āĻ¨āĻŋāĻœā§‡āĻ° āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ° āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§ āĻŦā§‡āĻ° āĻ•āĻ°āĻž " \
       "āĻ¯āĻžā§Ÿ āĻ¸ā§‡āĻŸāĻžāĻ‡ āĻ†āĻŽāĻžāĻ° āĻœāĻ¨ā§āĻ¯ā§‡ āĻ…āĻ¨ā§‡āĻ•āĨ¤"

stemmer = UgraStemmer()
tokenizer = Tokenizer()
tokenized_text = tokenizer.word_tokenizer(text)

stem = stemmer.stem(tokenized_text)

print(f"Before stemming: {tokenized_text}")
print(f'After stemming: {stem}')

Output

Before stemming: ['āĻ†āĻŽāĻŋ', 'āĻœāĻžāĻ¨āĻŋ', 'āĻ†āĻŽāĻžāĻ°', 'āĻāĻ‡', 'āĻ˛ā§‡āĻ–āĻžāĻŸāĻŋāĻ°', 'āĻœāĻ¨ā§āĻ¯', 'āĻ†āĻŽāĻžāĻ•ā§‡', 'āĻ…āĻ¨ā§‡āĻ•', 'āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ', 'āĻļā§āĻ¨āĻ¤ā§‡', 'āĻšāĻŦā§‡', ',', 'āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡āĻ“', 'āĻ˛āĻŋāĻ–āĻ›āĻŋ', 'āĨ¤', 'āĻ˛āĻŋāĻ–ā§‡', 'āĻ–ā§āĻŦ', 'āĻ•āĻžāĻœ', 'āĻšā§Ÿ', 'āĻ¸ā§‡', 'āĻ°āĻ•āĻŽ', 'āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ', 'āĻ†āĻŽāĻžāĻ°', 'āĻšāĻžāĻ¤ā§‡', 'āĻ–ā§āĻŦ', 'āĻŦā§‡āĻļā§€', 'āĻ¨ā§‡āĻ‡', 'āĻ•āĻŋāĻ¨ā§āĻ¤ā§', 'āĻ…āĻ¨ā§āĻ¤āĻ¤', 'āĻ¨āĻŋāĻœā§‡āĻ°', 'āĻ­ā§‡āĻ¤āĻ°ā§‡āĻ°', 'āĻ•ā§āĻˇā§‹āĻ­āĻŸā§āĻ•ā§', 'āĻŦā§‡āĻ°', 'āĻ•āĻ°āĻž', 'āĻ¯āĻžā§Ÿ', 'āĻ¸ā§‡āĻŸāĻžāĻ‡', 'āĻ†āĻŽāĻžāĻ°', 'āĻœāĻ¨ā§āĻ¯ā§‡', 'āĻ…āĻ¨ā§‡āĻ•', 'āĨ¤']

After stemming: ['āĻ†āĻŽāĻŋ', 'āĻœāĻžāĻ¨āĻŋ', 'āĻ†āĻŽāĻŋ', 'āĻāĻ‡', 'āĻ˛ā§‡āĻ–āĻž', 'āĻœāĻ¨ā§āĻ¯', 'āĻ†āĻŽāĻŋ', 'āĻ…āĻ¨ā§‡āĻ•', 'āĻ—āĻžāĻ˛āĻŽāĻ¨ā§āĻĻ', 'āĻļā§āĻ¨āĻ¤ā§‡', 'āĻšāĻŦā§‡', ',', 'āĻ¤āĻžāĻ°āĻĒāĻ°ā§‡', 'āĻ˛āĻŋāĻ–āĻ›āĻŋ', 'āĨ¤', 'āĻ˛āĻŋāĻ–ā§‡', 'āĻ–ā§āĻŦ', 'āĻ•āĻžāĻœ', 'āĻšā§Ÿ', 'āĻ¸ā§‡', 'āĻ°āĻ•āĻŽ', 'āĻ‰āĻĻāĻžāĻšāĻ°āĻŖ', 'āĻ†āĻŽāĻŋ', 'āĻšāĻžāĻ¤ā§‡', 'āĻ–ā§āĻŦ', 'āĻŦā§‡āĻļ', 'āĻ¨ā§‡', 'āĻ•āĻŋāĻ¨ā§āĻ¤ā§', 'āĻ…āĻ¨ā§āĻ¤āĻ¤', 'āĻ¨āĻŋāĻœā§‡āĻ°', 'āĻ­ā§‡āĻ¤āĻ°', 'āĻ•ā§āĻˇā§‹āĻ­', 'āĻŦā§‡āĻ°', 'āĻ•āĻ°āĻž', 'āĻ¯āĻžā§Ÿ', 'āĻ¸ā§‡āĻŸāĻŋ', 'āĻ†āĻŽāĻŋ', 'āĻœāĻ¨ā§āĻ¯ā§‡', 'āĻ…āĻ¨ā§‡āĻ•', 'āĨ¤']


Contribution

If you want to contribute, please open a pull request and wait for it to be reviewed. You can also email me at saimoncse19@gmail.com with the subject Contributing to BLTK, briefly describing what you are interested in contributing.

You can also contribute by opening issues.