Chulalongkorn University Natural Language Processing Library (Beta)


Keywords
nlp, thai, machine-learning, tokenization, wordembeddings
License
GPL-3.0
Install
pip install cunlp==0.3.2

Documentation

CUnlp v0.3.2 (beta)

What is CUnlp?

      A Python library for NLP tasks in the Thai language, using a machine-learning approach built on TensorFlow and Keras.

Features List

Model

  • Word tokenization
  • POS tagging
  • Sentiment analysis (soon)
  • Topic analysis (soon)
  • Latent analysis (soon)
  • Review analysis (soon)

Embedding

  • Word2Vector
  • Compare word similarity
  • K-nearest word similarity
  • Word substitution
  • Initialize embeddings (see the sketch after this list)

and more...
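
The per-word vectors can also be used to seed a Keras Embedding layer. The sketch below is only an illustration under assumptions: it uses a made-up vocabulary and assumes cu.embedding.vectorize returns a fixed-length 1-D vector per word; the library's own embedding-initialization helper is not shown here.

import numpy as np
import cunlp as cu
from keras.layers import Embedding

# Example vocabulary (illustrative only)
vocab = ["หมา", "แมว", "นก"]

# Assumed: vectorize returns a fixed-length 1-D vector per word
weights = np.asarray([cu.embedding.vectorize(w) for w in vocab])

# Seed a Keras Embedding layer with the pre-trained vectors
embedding_layer = Embedding(
    input_dim=weights.shape[0],
    output_dim=weights.shape[1],
    weights=[weights],
)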

Requirement

       Currently the library only supports Python 3, with TensorFlow v1.4.0rc0+ and Keras v2.1.5 installed.
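
To verify that a compatible environment is present, the installed versions can be checked from Python (a quick sanity check, not part of the library itself):

import tensorflow as tf
import keras

print(tf.__version__)     # should be 1.4.0rc0 or newer
print(keras.__version__)  # should be 2.1.5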

Installation

Install directly from PyPI:

$ pip install cunlp

Usage

import cunlp as cu

# Word tokenization
sentence = "สวัสดีชาวโลกเรามาช่วยตัดคำให้"
tokens_in_list = cu.model.tokenize(sentence)
tokens_in_string = cu.model.tokenize(sentence, listing=False)

# POS tagging
sentence = "ฉันชอบกินอาหารจีน"
tokens_of_sentence = cu.model.tokenize(sentence)
pos_of_words = cu.model.pos(tokens_of_sentence)

# Word embedding
word_a = "หมา"
word_b = "แมว"
word_c = "เสื้อฮาวาย"
vector_of_word_a = cu.embedding.vectorize(word_a)
vector_and_word_of_word_a = cu.embedding.vectorize_in_depth(word_a)
substituted_word_c = cu.embedding.substitute(word_c)

similarity_score = cu.embedding.compare_similarity(word_a, word_b)
similarity_score_with_substitution = cu.embedding.compare_similarity(word_a, word_b, sub=True)

top_three_similar = cu.embedding.most_similarity(word_c, rank=3)
top_three_similar_with_substitution = cu.embedding.most_similarity(word_c, rank=3, sub=True)
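
Putting the two models together, and assuming tokenize returns a list of tokens and pos returns one tag per token (as the example above suggests), word/tag pairs can be printed like this:

import cunlp as cu

sentence = "ฉันชอบกินอาหารจีน"
tokens = cu.model.tokenize(sentence)   # assumed: list of Thai tokens
tags = cu.model.pos(tokens)            # assumed: one POS tag per token

for token, tag in zip(tokens, tags):
    print(token, tag)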

API

https://cunlp-api.herokuapp.com/tokenize?sentence=ทดสอบการตัดคำอย่างง่าย
https://cunlp-api.herokuapp.com/pos?sentence=ฉันชอบกินอาหารจีนมาก
https://cunlp-api.herokuapp.com/vectorize?word=แมว
https://cunlp-api.herokuapp.com/compare_similarity?word1=แมว&word2=หมา

*For testing only!
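
A minimal sketch of calling the test API from Python with the requests package; the response format is not documented here, so the raw body is printed as-is:

import requests

resp = requests.get(
    "https://cunlp-api.herokuapp.com/tokenize",
    params={"sentence": "ทดสอบการตัดคำอย่างง่าย"},
)
resp.raise_for_status()
print(resp.text)  # inspect the raw response body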

Benchmark

Task              | Precision | Recall  | F1-score | Detail
Word tokenization | 0.97072   | 0.97052 | 0.97062  | on BEST2010
Word embedding    | -         | -       | -        | view
POS tagging       | 0.81327   | 0.75963 | 0.78554  | view

Contributors

  • Danupat Khamnuansin (jrkns)
  • Nuttasit Mahakusolsirikul (nattasit-m)