- THE MODULE IS IN ALPHA VERSION; SOME FUNCTIONS AREN'T BACKWARD COMPATIBLE.
One of the things NLP suffers from, at least in English, is that tokenizing functions have a hard time extracting multi-word expressions, such as "New York", from input text.
However, in practical use, such expressions often need to be treated as a single token.
- Install the module using pip:

```
pip install multiword_tokenization
```
- Before using the module, you need to prepare a list of multi-words to extract. If you have a CSV file split by newline characters (one keyword per line), you can feed them to the instance.
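For example, a minimal sketch of loading such a file (the file name `keywords.txt` is a placeholder):

```python
# Load newline-separated keywords from a file;
# 'keywords.txt' is a hypothetical file name.
with open('keywords.txt', encoding='utf-8') as f:
    keywords = [line.strip() for line in f if line.strip()]
```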
- For the user's convenience, the class allows you to specify a Python function to tokenize the input text, so you can use your favorite tokenizer. In the example below, the nltk package is used.
```python
from multiword_tokenization import MultiWordTokenizer
from nltk.tokenize import word_tokenize

keywords = ['New York', 'San Francisco', 'New Orleans']
mword_tokenizer = MultiWordTokenizer(word_tokenize, keywords)
```
- Then, all you have to do is call the tokenize() method:
```python
text = 'A Python event was held in New York.'
tokenized = mword_tokenizer.tokenize(text)
print(tokenized)
>>> ['A', 'Python', 'event', 'was', 'held', 'in', 'New York', '.']
```
What happens under the hood
- The user-specified Python function first tokenizes the input text:

```python
['A', 'Python', 'event', 'was', 'held', 'in', 'New', 'York', '.']
```
- A sliding "window" is then applied across the tokens; any window that matches one of the keywords is replaced with a single token, repeating until there is nothing left to replace:

```
'A Python', 'Python event', ..., 'New York'
```
While "connecting" the words of a matched window, the private method `__replace_one_word` is called.
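The following is a minimal sketch of that sliding-window replacement, assuming a window that shrinks from the longest keyword down to two words; it illustrates the idea and is not the module's actual implementation:

```python
def replace_keywords(tokens, keywords):
    # Illustrative sketch only, not the module's actual code.
    # Slide a window of each possible size over the token list and
    # merge any window that exactly matches a keyword.
    keyword_set = set(keywords)
    max_len = max(len(k.split()) for k in keywords)
    for size in range(max_len, 1, -1):  # longest windows first
        i = 0
        while i + size <= len(tokens):
            window = ' '.join(tokens[i:i + size])
            if window in keyword_set:
                tokens[i:i + size] = [window]  # "connect" the words
            else:
                i += 1
    return tokens

tokens = ['A', 'Python', 'event', 'was', 'held', 'in', 'New', 'York', '.']
print(replace_keywords(tokens, ['New York', 'San Francisco', 'New Orleans']))
# ['A', 'Python', 'event', 'was', 'held', 'in', 'New York', '.']
```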
Let w denote the word count of the input text and m the word count of the longest keyword/entity; the complexity can then be represented as O(w × m).
Therefore, the longer the input text is, or the longer the longest keyword is, the more time it takes.
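As a rough illustration of that growth (the numbers below are hypothetical, chosen only to make the O(w × m) cost concrete):

```python
# Rough count of window comparisons for a w-word text and keywords
# of up to m words: one window per start position, for each window
# size from 2 to m.
def window_comparisons(w, m):
    return sum(w - size + 1 for size in range(2, m + 1))

print(window_comparisons(1000, 3))  # 1997, roughly 2x the text length
```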
Contribution / Wanted
- The module is computationally expensive, since it relies on many for loops. A C/C++ extension would be nice to implement to make it faster.
- Integration with a CI tool, such as Travis CI
Apache License 2.0