fitrat

An NLP library for Uzbek. It includes morphological analysis, language identification, transliterators and tokenizers.


Keywords
language, morphology, nlp, tahrirchi, transliteration, uzbek
License
Other
Install
pip install fitrat==0.0.9

Documentation


An NLP library for Uzbek. It includes morphological analysis, transliterators, language identifiers, tokenizers and more.

It is named after the historian and linguist Abdurauf Fitrat, one of the creators of modern Uzbek and the first Uzbek professor.


Usage

Installation

pip install fitrat

Transliteration

We used the hfst library to create the transliterators. This library provides finite-state transducers, finite-state machines that come in very handy for efficiently mapping one text to another.

from fitrat import Transliterator, WritingType

t = Transliterator(to=WritingType.LAT)
result = t.convert("Кеча циркка бордим.")
print(result)
# Kecha sirkka bordim.

t2 = Transliterator(to=WritingType.CYR)
result = t2.convert("Kecha sirkka bordim.")
print(result)
# Кеча циркка бордим.

While Cyrillic-to-Latin conversion is rule-based and simple, the reverse direction is not. We include a special pre-compiled exceptions transducer for Latin-to-Cyrillic conversion that handles all exceptions known to us. We will keep improving the exceptions list.
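
To illustrate why the Latin-to-Cyrillic direction needs an exceptions list, here is a minimal pure-Python sketch. It is not how fitrat works internally (the library uses compiled HFST transducers), and the mapping tables, the `EXCEPTIONS` dictionary and all function names are hypothetical, covering only the letters of the example sentence. The point is that Cyrillic "ц" and "с" both become Latin "s" in this example, so the reverse rule cannot recover "ц" without help.

```python
import re

# Hand-picked subset of letters, enough for the example sentence only.
CYR_TO_LAT = {
    "а": "a", "б": "b", "д": "d", "е": "e", "и": "i", "к": "k",
    "м": "m", "о": "o", "р": "r", "с": "s", "ц": "s", "ч": "ch",
}

# Reverse rules: "s" is ambiguous, so the rule picks the common case "с".
LAT_TO_CYR = {
    "a": "а", "b": "б", "d": "д", "e": "е", "i": "и", "k": "к",
    "m": "м", "o": "о", "r": "р", "s": "с", "ch": "ч",
}

# Hypothetical exceptions list: stems the reverse rules would get wrong.
EXCEPTIONS = {"sirk": "цирк"}


def _match_case(source, target):
    """Capitalize target if the source starts with an uppercase letter."""
    return target.capitalize() if source[0].isupper() else target


def cyr_to_lat(text):
    """Character-by-character rule-based conversion."""
    out = []
    for ch in text:
        mapped = CYR_TO_LAT.get(ch.lower())
        out.append(ch if mapped is None else _match_case(ch, mapped))
    return "".join(out)


def _lat_word_to_cyr(word):
    lower = word.lower()
    out, i = "", 0
    # Exceptions take priority over the general rules.
    for stem, cyr in EXCEPTIONS.items():
        if lower.startswith(stem):
            out, i = cyr, len(stem)
            break
    while i < len(lower):
        # Try the two-letter digraph ("ch") before single letters.
        for size in (2, 1):
            chunk = lower[i:i + size]
            if chunk in LAT_TO_CYR:
                out += LAT_TO_CYR[chunk]
                i += size
                break
        else:
            out += lower[i]  # unknown letter: pass through unchanged
            i += 1
    return _match_case(word, out)


def lat_to_cyr(text):
    """Convert each Latin word, leaving spaces and punctuation as they are."""
    return re.sub(r"[A-Za-z]+", lambda m: _lat_word_to_cyr(m.group()), text)


print(cyr_to_lat("Кеча циркка бордим."))
print(lat_to_cyr("Kecha sirkka bordim."))
```

Without the `EXCEPTIONS` entry, the reverse conversion would produce "сиркка" instead of "циркка".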

If you want to compile the transliterators from source, you need the hfst-dev or hfst library. The package itself uses only pre-compiled binaries and the hfstol library for efficient lookup.

Language Identification

We can recognize Uzbek text in both Latin and Cyrillic script. Additionally, we can recognize other major languages, such as Russian, English and Arabic.

from fitrat import LanguageDetector

lang_detector = LanguageDetector()

print(lang_detector.is_uzbek("bu o'zbekchada yozilgan matn"))
# True

print(lang_detector.is_uzbek("бу нотугри йозилган булсаям, лекин узбекча матн"))
# True

print(lang_detector.is_uzbek("Текст на русском языке"))
# False
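
Since Uzbek text may arrive in either script, downstream code sometimes needs to know which script it is looking at, for example to pick a transliteration direction. The sketch below uses only the standard library; `dominant_script` is a hypothetical helper and not part of fitrat, and it says nothing about how `LanguageDetector` works internally.

```python
import unicodedata


def dominant_script(text):
    """Return "LATIN" or "CYRILLIC", whichever has more letters, or None."""
    counts = {"LATIN": 0, "CYRILLIC": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        # Unicode character names start with the script name,
        # e.g. "CYRILLIC SMALL LETTER TE" or "LATIN SMALL LETTER B".
        name = unicodedata.name(ch, "")
        for script in counts:
            if name.startswith(script):
                counts[script] += 1
    if not any(counts.values()):
        return None
    return max(counts, key=counts.get)


print(dominant_script("bu o'zbekchada yozilgan matn"))
print(dominant_script("Текст на русском языке"))
```

Note that the script alone does not identify the language: the second sentence is Cyrillic but Russian, which is why a dedicated detector is still needed.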

Tokenization

from fitrat import word_tokenize

s = "Bugun o'zbekchada gapirishga qaror qildim!"
print(word_tokenize(s))
# ['Bugun', "o'zbekchada", 'gapirishga', 'qaror', 'qildim', '!']
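
The output above keeps the apostrophe inside "o'zbekchada" while splitting off the final "!". A rough regex-based approximation of that behavior, assuming only the standard library, might look like the following. `simple_word_tokenize` is a hypothetical stand-in and not fitrat's tokenizer, which may handle many more cases.

```python
import re

# A word is a run of word characters, optionally continued across
# an apostrophe (straight or curly); anything else that is not
# whitespace becomes a single-character token.
TOKEN_RE = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def simple_word_tokenize(text):
    """Split text into words and punctuation marks."""
    return TOKEN_RE.findall(text)


print(simple_word_tokenize("Bugun o'zbekchada gapirishga qaror qildim!"))
```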

Authors

  • Mukhammadsaid Mamasaidov
  • Jasur Yusupov