nlpTurk - Turkish NLP library
nlpTurk is an open source Turkish NLP library consisting of machine learning based sentence boundary detection, lemmatization and POS tagging models.
Installation & Usage
nlpTurk can be installed from PyPI.
pip install nlpturk
nlpTurk offers a simple API to extract sentences, lemmas and POS tags.
import nlpturk
text = "Sosyal medya hayatımıza hızlı girdi.ama yazım kurallarına dikkat eden pek yok :)"
doc = nlpturk(text)
# iterate over tokens
for token in doc:
print(f"token: {token.text}, lemma: {token.lemma}, pos: {token.pos}")
"""
Prints:
token: Sosyal, lemma: sosyal, pos: ADJ
token: medya, lemma: medya, pos: NOUN
...
"""
# or get tokens by token ids
token = doc[5]
print(f"token: {token.text}, sent_start: {token.is_sent_start}, sent_end: {token.is_sent_end}")
token = doc[6]
print(f"token: {token.text}, sent_start: {token.is_sent_start}, sent_end: {token.is_sent_end}")
"""
Prints:
token: ., sent_start: False, sent_end: True
token: ama, sent_start: True, sent_end: False
"""
# iterate over sentences
for i, sent in enumerate(doc.sents):
print(f"sentence #{i+1}: {sent.text}")
for token in sent:
print(f" token: {token.text}, lemma: {token.lemma}, pos: {token.pos}")
"""
Prints:
sentence #1: Sosyal medya hayatımıza hızlı girdi.
token: Sosyal, lemma: sosyal, pos: ADJ
...
sentence #2: ama yazım kurallarına dikkat eden pek yok :)
token: ama, lemma: ama, pos: CCONJ
...
"""
Performance
The evaluation was performed on test dataset. Detailed evaluation and benchmarking results can be found here.
accuracy | precision | recall | f1-score | |
---|---|---|---|---|
Sentence Segmenter | - | 98.09 | 96.05 | 97.06 |
POS Tagger | - | 95.75 | 96.26 | 96.01 |
Lemmatizer | 96.87 | - | - | - |
You can perform benchmarking on your own dataset.
git clone https://github.com/nlpturk/nlpturk.git
cd nlpturk
pip install -r requirements.txt
python -m nlpturk benchmark --data_path path/to/data --output_path path/to/output