sltkpy

Sinhala Language Tool Kit


Keywords
python, Sinhala, Tokenizer
License
MIT
Install
pip install sltkpy==0.1.0b1

Documentation


SLTK: A Comprehensive Tokenizer for Sinhala Language

Welcome to the GitHub repository for SLTK, a powerful tokenizer designed to enhance Sinhala Natural Language Processing (NLP) tasks. SLTK implements Grapheme Pair Encoding (GPE) for tokenization. Although the first version of SLTK was based on our own research, the current implementation is inspired by the research paper by Velayuthan et al. (2024).
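To illustrate the idea (a conceptual sketch only, not SLTK's actual internals), grapheme pair encoding works like byte pair encoding but merges Unicode grapheme clusters instead of bytes or characters, so a Sinhala base consonant is never separated from its combining vowel signs. The grapheme_clusters and most_frequent_pair helpers below are hypothetical and rely on the third-party regex module:

from collections import Counter
import regex  # third-party module; its \X pattern matches grapheme clusters

def grapheme_clusters(text):
    # \X matches one extended grapheme cluster, so combining vowel signs
    # stay attached to their base consonant instead of becoming stray tokens
    return regex.findall(r'\X', text)

def most_frequent_pair(sequences):
    # count adjacent grapheme pairs across all sequences; in GPE the most
    # frequent pair is merged into a single vocabulary entry, and the
    # process repeats until the vocabulary reaches the requested size
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0] if pairs else None

corpus = ['ශ්‍රී ලංකාව', 'ශ්‍රී ලංකා']
sequences = [grapheme_clusters(s) for s in corpus]
print(most_frequent_pair(sequences))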

Installation

To install SLTK, run the following command:

pip install sltkpy

Usage

You can train the tokenizer on a custom dataset to create your own vocabulary and use it to tokenize your text data. First, import SLTK:

from sltkpy import GPETokenizer

Now initialize the tokenizer:

tokenizer = GPETokenizer()

Train new vocab

To train a new vocabulary, pass a corpus to the train method. Optionally, you can set the maximum vocabulary size with vocab_size and the minimum frequency a pair must reach to be added to the vocabulary with min_freq.

vocab = tokenizer.train(corpus=corpus, vocab_size=3000)

Note: The default value of min_freq is 3.

Once training finishes, the method returns the vocabulary as a dictionary. You can save it as a JSON file for future use.
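For example (a sketch only: it assumes corpus is an iterable of Sinhala strings, and the file name sinhala_vocab.json is arbitrary):

import json

corpus = [
    'ශ්‍රී ලංකාව සිලෝන් ලෙස ද හැඳින් වේ.',
    # ... more Sinhala text
]

vocab = tokenizer.train(corpus=corpus, vocab_size=3000, min_freq=3)

# ensure_ascii=False keeps the Sinhala tokens readable in the saved file
with open('sinhala_vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False)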

Load vocab

There are two ways to load a vocabulary into the tokenizer: use your own vocabulary, or load the pre-trained vocabulary bundled with the SLTK library, which was trained on the Sinhala Wikipedia dataset from Hugging Face Datasets.

  1. Load the pre-trained vocab:
tokenizer.pre_load()
  2. Load your own trained vocab:
tokenizer.load_vocab('<path_to_your_vocab>.json')

Tokenize text

Once you have loaded a vocabulary using either method above, you can tokenize your text as follows:

tokens = tokenizer.tokenize('ශ්‍රී ලංකාව සිලෝන් ලෙස ද හැඳින් වේ.')

Encode tokens

To encode tokens, use the following method:

encoded_tokens = tokenizer.encode(tokens)

Decode tokens

To decode tokens, use the following method:

decoded_text = tokenizer.decode(encoded_tokens)
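Putting it all together, a minimal round trip might look like this (a sketch: it assumes the loaded vocabulary covers the input text, in which case decoding the encoded tokens should reproduce the original string):

from sltkpy import GPETokenizer

tokenizer = GPETokenizer()
tokenizer.pre_load()  # use the bundled pre-trained vocab

text = 'ශ්‍රී ලංකාව සිලෝන් ලෙස ද හැඳින් වේ.'
tokens = tokenizer.tokenize(text)
encoded_tokens = tokenizer.encode(tokens)
decoded_text = tokenizer.decode(encoded_tokens)

print(decoded_text)  # expected to match `text` when coverage is complete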