tok

A fast, complete, and customizable tokenizer for Python.

It is roughly 25x faster than spaCy's and NLTK's regex-based tokenizers.

Using the Aho-Corasick algorithm makes it a novelty: it is both fast and able to explain how and why it splits the text the way it does.

The heavy lifting is done by textsearch and pyahocorasick, allowing the tokenizer itself to be written in only ~200 lines of code.

Unlike regex-based approaches, it goes over each character in the text only once. Read below for how this works.
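
To give a feel for the underlying idea, here is a minimal sketch of single-pass, replacement-based tokenization built directly on pyahocorasick. This is not tok's actual code; the pattern set and the naive_tokenize helper are made up for illustration, and tok itself handles many more cases (longest match, contractions, currencies, protected words, etc.):

import ahocorasick

# Each pattern maps to a replacement; a space in the replacement creates a split.
automaton = ahocorasick.Automaton()
for pattern, replacement in {"n't": " not", "....": " ... ", "?": " ? "}.items():
    automaton.add_word(pattern, (pattern, replacement))
automaton.make_automaton()

def naive_tokenize(text):
    # Walk over the text once, replacing every (non-overlapping) match.
    out, last = [], 0
    for end, (pattern, replacement) in automaton.iter(text):
        start = end - len(pattern) + 1
        if start < last:  # simplistic overlap handling; tok keeps the longest match
            continue
        out.append(text[last:start])
        out.append(replacement)
        last = end + 1
    out.append(text[last:])
    return "".join(out).split()

naive_tokenize("I wouldn't do that.... would you?")
['I', 'would', 'not', 'do', 'that', '...', 'would', 'you', '?']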

Installation

pip install tok

Usage

By default it handles contractions, http(s) URLs, (floating-point) numbers and currencies.

from tok import word_tokenize
word_tokenize("I wouldn't do that.... would you?")
['I', 'would', 'not', 'do', 'that', '...', 'would', 'you', '?']
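
URLs and decimal numbers are likewise expected to come out as single tokens (the output below is illustrative, based on the defaults described above):

word_tokenize("It costs 3.5 dollars, see https://example.com")
['It', 'costs', '3.5', 'dollars', ',', 'see', 'https://example.com']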

Or configure it yourself:

from tok import Tokenizer
tokenizer = Tokenizer(protected_words=["some.thing"]) # still using the defaults
tokenizer.word_tokenize("I want to protect some.thing")
['I', 'want', 'to', 'protect', 'some.thing']

Split by sentences:

from tok import sent_tokenize
sent_tokenize("I wouldn't do that.... would you?")
[['I', 'would', 'not', 'do', 'that', '...'], ['would', 'you', '?']]

For more options, check the documentation of the Tokenizer class.

Further customization

Given:

from tok import Tokenizer
t = Tokenizer(protected_words=["some.thing"]) # still using the defaults

You can add your own ideas to the tokenizer by using:

  • t.keep(x, reason): Whenever it finds x, it will not add whitespace, so x stays intact as (part of) a token.
  • t.split(x, reason): Whenever it finds x, it will surround it with whitespace, turning it into a token of its own.
  • t.drop(x, reason): Whenever it finds x, it will remove it but add a split.
  • t.strip(x, reason): Whenever it finds x, it will remove it without splitting.

For example:

t.drop("bla", "bla is not needed")
t.word_tokenize("Please remove bla, thank you")
['Please', 'remove', ',', 'thank', 'you']
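
In the same spirit, split and strip can be used like this (a sketch following the descriptions above; the exact outputs are illustrative, not verified against every default rule):

t.split("-", "treat hyphens as separate tokens")
t.word_tokenize("a well-known fact")
['a', 'well', '-', 'known', 'fact']

t.strip("®", "registered trademark signs are not needed")
t.word_tokenize("tok® is neat")
['tok', 'is', 'neat']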

Explainable

Explain what happened:

t.explain("bla")
[{'from': 'bla', 'to': ' ', 'explanation': 'bla is not needed'}]

See everything in there (will help you understand how it works):

t.explain_dict

How it works

It always keeps only the longest match. Introducing a space in one of your token rules is what makes the text split at that point.

Consider how the tokenization of . works:

  • When it finds A. it leaves it as A., adding no whitespace (single-letter abbreviations)
  • When it finds .0 it leaves it as .0 (numbers)
  • When it finds a . on its own, it surrounds it with whitespace, " . ", thus making a split
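
Putting these rules together, an input like the one below would be expected to tokenize as follows (illustrative; the exact output depends on the full default rule set):

from tok import word_tokenize
word_tokenize("A. Smith paid 1.5 dollars.")
['A.', 'Smith', 'paid', '1.5', 'dollars', '.']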

If you want to make sure that something containing a dot stays intact, you can for example use:

t.keep("cool.")

Contributing

Contributions to this library are greatly appreciated.

Adding contractions for other languages would be especially welcome.