newmm-tokenizer

A standalone, dictionary-based Maximum Matching + Thai Character Cluster (newmm) tokenizer, extracted from PyThaiNLP.
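
For readers unfamiliar with dictionary-based maximum matching, the idea can be illustrated with a simplified greedy longest-match sketch. This is not the newmm implementation: it ignores Thai Character Cluster boundaries entirely, and the function name `longest_match_tokenize` and the toy dictionary are hypothetical, chosen only for illustration.

```python
def longest_match_tokenize(text, dictionary):
    """Greedy forward longest matching over a toy dictionary (illustrative only)."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy dictionary; the real tokenizer ships its own word list.
toy_dictionary = {"พระเอก", "พระรอง", "หรือ", "พระ"}
print(longest_match_tokenize("พระเอกหรือพระรอง", toy_dictionary))
# ['พระเอก', 'หรือ', 'พระรอง']
```

Because the longest candidate is tried first, "พระเอก" wins over the shorter dictionary entry "พระ" in this sketch.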

Objectives

This repository was created to reduce the overall size of the original PyThaiNLP tokenizer module. Its main objective is to segment Thai sentences into a list of words.

Supports

The module supports Python 3.7+, following the original PyThaiNLP repository.

Installation

pip install newmm-tokenizer

How to Use

from newmm_tokenizer.tokenizer import word_tokenize

text = 'เป็นเรื่องแรกที่ร้องไห้ตั้งแต่ ep 1 แล้วก็เป็นเรื่องแรกที่เลือกไม่ได้ว่าจะเชียร์พระเอกหรือพระรองดี 19...'
words = word_tokenize(text)

print(words)
# ['เป็นเรื่อง', 'แรก', 'ที่', 'ร้องไห้', 'ตั้งแต่', ' ', 'ep', ' ', '1', ' ', 'แล้วก็', 'เป็นเรื่อง', 'แรก', 'ที่', 'เลือกไม่ได้', 'ว่า', 'จะ', 'เชียร์', 'พระเอก', 'หรือ', 'พระรอง', 'ดี', ' ', '19', '...']
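
As the output above shows, whitespace is returned as separate tokens. If only the words are needed, the list can be filtered afterwards. The snippet below is a minimal sketch that works on any token list; `drop_whitespace_tokens` is a hypothetical helper, not part of the package, and the sample tokens are copied from the output above.

```python
def drop_whitespace_tokens(tokens):
    """Return the tokens that contain at least one non-whitespace character."""
    return [token for token in tokens if token.strip()]


# A slice of the token list shown above.
words = ['ตั้งแต่', ' ', 'ep', ' ', '1', ' ', 'แล้วก็']
print(drop_whitespace_tokens(words))
# ['ตั้งแต่', 'ep', '1', 'แล้วก็']
```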

LICENSE

Please refer to the original PyThaiNLP license.

newmm-tokenizer
Release 0.2.2
