Welcome to the GitHub repository for SLTK, a powerful tokenizer designed to enhance Sinhala Natural Language Processing (NLP) tasks. SLTK implements Grapheme Pair Encoding (GPE) for tokenization. While the first version of SLTK was built on our own research, the current implementation is inspired by the research paper by Velayuthan et al. (2024).
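GPE applies pair merging in the spirit of Byte Pair Encoding, but over grapheme clusters (user-perceived characters, which in Sinhala often combine a base consonant with vowel signs) rather than raw bytes or code points. As a rough illustration of these units, not SLTK's internal code, the third-party `regex` package can split Sinhala text into grapheme clusters:

```python
import regex  # third-party package: pip install regex

# \X matches one extended grapheme cluster, so a vowel sign such as 'ා'
# stays attached to its base consonant instead of being split off.
graphemes = regex.findall(r'\X', 'ලංකාව')
print(graphemes)  # ['ලං', 'කා', 'ව']: three clusters from five code points
```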
To install SLTK, run the following command:
pip install sltkpy
You can train the tokenizer on a custom dataset to create your own vocabulary and use it to tokenize your text data. First, import SLTK:
from sltkpy import GPETokenizer
Now initialize the tokenizer:
tokenizer = GPETokenizer()
To train a new vocabulary, provide a corpus to the `train` method. Optionally, you can set the maximum vocabulary size with `vocab_size` and the minimum frequency a pair must reach to qualify for the vocabulary with `min_freq`.
vocab = tokenizer.train(corpus=corpus, vocab_size=3000)
Note: The default value of `min_freq` is 3.
Once training is finished, the method returns the vocabulary as a dictionary. You can save it as a JSON file for future use.
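For example, a minimal end-to-end training run might look like the sketch below, assuming `train` accepts an iterable of raw Sinhala strings; the sample corpus and the output file name are placeholders:

```python
import json

from sltkpy import GPETokenizer

# Placeholder corpus: replace with your own list of Sinhala text samples
corpus = ['ශ්‍රී ලංකාව සිලෝන් ලෙස ද හැඳින් වේ.']

tokenizer = GPETokenizer()
vocab = tokenizer.train(corpus=corpus, vocab_size=3000, min_freq=3)

# Save the returned dictionary as JSON; ensure_ascii=False keeps Sinhala readable
with open('sinhala_vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False)
```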
There are two ways to load a vocabulary into the tokenizer: use your own trained vocabulary, or load the pre-trained vocabulary shipped with the SLTK library, which was trained on the Sinhala Wikipedia dataset from Hugging Face Datasets.
- Load pre-trained vocab:
tokenizer.pre_load()
- Load your own trained vocab:
tokenizer.load_vocab('<path_to_your_vocab>.json')
Once you have loaded a vocabulary using either method above, you can tokenize your text as follows:
tokens = tokenizer.tokenize('ශ්‍රී ලංකාව සිලෝන් ලෙස ද හැඳින් වේ.')
To encode the tokens, use the following method:
encoded_tokens = tokenizer.encode(tokens)
To decode tokens, use the following method:
decoded_text = tokenizer.decode(encoded_tokens)
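Putting the steps together, a typical round trip with the pre-trained vocabulary looks like this sketch:

```python
from sltkpy import GPETokenizer

tokenizer = GPETokenizer()
tokenizer.pre_load()  # load the bundled pre-trained vocabulary

tokens = tokenizer.tokenize('ශ්‍රී ලංකාව සිලෝන් ලෙස ද හැඳින් වේ.')
encoded_tokens = tokenizer.encode(tokens)        # tokens -> integer IDs
decoded_text = tokenizer.decode(encoded_tokens)  # IDs back to text
print(decoded_text)
```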