Welcome to the GitHub repository for SLTK, a powerful tokenizer designed to enhance Sinhala Natural Language Processing (NLP) tasks. SLTK implements Grapheme Pair Encoding (GPE) for tokenization. While the first version of SLTK was built on our own research, the current implementation is inspired by the research paper by Velayuthan et al. (2024).
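GPE applies pair merging in the spirit of Byte Pair Encoding, but over grapheme clusters (user-perceived characters, which in Sinhala often combine a base consonant with vowel signs) rather than raw bytes or code points. As a rough illustration of these units, not SLTK's internal code, the third-party `regex` package can split Sinhala text into grapheme clusters:

```python
import regex  # third-party package: pip install regex

# \X matches one extended grapheme cluster, so a vowel sign such as 'ා'
# stays attached to its base consonant instead of being split off.
graphemes = regex.findall(r'\X', 'ලංකාව')
print(graphemes)  # ['ලං', 'කා', 'ව']: three clusters from five code points
```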
To install SLTK, run the following command:
pip install sltkpy
You can train the tokenizer on a custom dataset to create your own vocabulary and use it to tokenize your text data. First, import SLTK:
from sltkpy import GPETokenizer
Now initialize the tokenizer:
tokenizer = GPETokenizer()
To train a new vocabulary, provide a corpus to the `train` method. Optionally, you can set the maximum vocabulary size with `vocab_size` and the minimum frequency a pair must reach to qualify for the vocabulary with `min_freq`.
vocab = tokenizer.train(corpus=corpus, vocab_size=3000)
Note: The default value of `min_freq` is 3.
Once training is finished, the method returns the vocabulary as a dictionary. You can save it as a JSON file for future use.
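For example, a minimal end-to-end training run might look like the sketch below, assuming `train` accepts an iterable of raw Sinhala strings; the sample corpus and the output file name are placeholders:

```python
import json

from sltkpy import GPETokenizer

# Placeholder corpus: replace with your own list of Sinhala text samples
corpus = ['ශ්‍රී ලංකාව සිලෝන් ලෙස ද හැඳින් වේ.']

tokenizer = GPETokenizer()
vocab = tokenizer.train(corpus=corpus, vocab_size=3000, min_freq=3)

# Save the returned dictionary as JSON; ensure_ascii=False keeps Sinhala readable
with open('sinhala_vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False)
```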
There are two ways to load a vocabulary into the tokenizer: use your own trained vocabulary, or load the pre-trained vocabulary shipped with the SLTK library, which was trained on the Sinhala Wikipedia dataset from Hugging Face Datasets.
- Load pre-trained vocab:
tokenizer.pre_load()
- Load your own trained vocab:
tokenizer.load_vocab('<path_to_your_vocab>.json')
Once you have loaded a vocabulary using either method above, you can tokenize your text as follows:
tokens = tokenizer.tokenize('ශ්‍රී ලංකාව සිලෝන් ලෙස ද හැඳින් වේ.')
To encode the tokens, use the following method:
encoded_tokens = tokenizer.encode(tokens)
To decode tokens, use the following method:
decoded_text = tokenizer.decode(encoded_tokens)
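Putting the steps together, a typical round trip with the pre-trained vocabulary looks like this sketch:

```python
from sltkpy import GPETokenizer

tokenizer = GPETokenizer()
tokenizer.pre_load()  # load the bundled pre-trained vocabulary

tokens = tokenizer.tokenize('ශ්‍රී ලංකාව සිලෝන් ලෙස ද හැඳින් වේ.')
encoded_tokens = tokenizer.encode(tokens)        # tokens -> integer IDs
decoded_text = tokenizer.decode(encoded_tokens)  # IDs back to text
print(decoded_text)
```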