pythainlp-rust-modules

pythainlp-rust-modules is now nlpo3


Keywords
hacktoberfest, natural-language-processing, nodejs, python, rust, text-processing, thai-language, tokenizer
License
Apache-2.0
Install
pip install pythainlp-rust-modules==0.2.2

Documentation

nlpO3

Thai Natural Language Processing library in Rust, with Python and Node bindings. Formerly oxidized-thainlp.

Features

  • Thai word tokenizer
    • uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
      • 2.5x faster than a comparable pure-Python implementation (PyThaiNLP's newmm)
    • loads a dictionary from a plain text file (one word per line) or from a Vec<String>

Dictionary file

  • In the interest of library size, nlpO3 does not assume which dictionary the developer would like to use, and it does not come with one. A dictionary is required for the dictionary-based word tokenizer; it is a plain text file with one word per line (a minimal sketch follows this list).
  • For a tokenization dictionary, try
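
A dictionary file is plain UTF-8 text with one word per line. The sketch below is illustrative only: the file name and the placeholder words are assumptions, and a real application should load a full Thai word list instead. It writes a tiny dictionary to disk and loads it with the Rust tokenizer described later in this document; the same file also works with the command-line interface and the Python binding.

use std::fs;

use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

fn main() -> std::io::Result<()> {
    // Hypothetical dictionary: one word per line, UTF-8 encoded.
    fs::write("demo_dict.txt", "āļ‰āļąāļ™\nāļāļīāļ™\nāļ‚āđ‰āļēāļ§\n")?;

    // Load the dictionary file into a tokenizer and segment a sentence.
    let tokenizer = NewmmTokenizer::new("demo_dict.txt");
    let tokens = tokenizer.segment("āļ‰āļąāļ™āļāļīāļ™āļ‚āđ‰āļēāļ§", true, false).unwrap();
    println!("{:?}", tokens); // with this dictionary, expect ["āļ‰āļąāļ™", "āļāļīāļ™", "āļ‚āđ‰āļēāļ§"]
    Ok(())
}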

Usage

Command-line interface

echo "āļ‰āļąāļ™āļāļīāļ™āļ‚āđ‰āļēāļ§" | nlpo3 segment

As Python library

from nlpo3 import load_dict, segment

# Load a dictionary file and register it under the name "dict_name",
# then tokenize a string using that dictionary.
load_dict("path/to/dict.file", "dict_name")
segment("āļŠāļ§āļąāļŠāļ”āļĩāļ„āļĢāļąāļš", "dict_name")

As Rust library

Available on crates.io.

In Cargo.toml:

[dependencies]
# ...
nlpo3 = "1.3.2"

Create a tokenizer using a dictionary from a file, then use it to tokenize a string (safe mode = true, parallel mode = false):

use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

let tokenizer = NewmmTokenizer::new("path/to/dict.file");
let tokens = tokenizer.segment("āļŦāđ‰āļ­āļ‡āļŠāļĄāļļāļ”āļ›āļĢāļ°āļŠāļēāļŠāļ™", true, false).unwrap();

Create a tokenizer using a dictionary from a vector of Strings:

let words = vec!["āļ›āļēāļĨāļīāđ€āļĄāļ™āļ•āđŒ".to_string(), "āļ„āļ­āļ™āļŠāļ•āļīāļ•āļīāļ§āļŠāļąāđˆāļ™".to_string()];
let tokenizer = NewmmTokenizer::from_word_list(words);

Add words to an existing tokenizer:

tokenizer.add_word(&["āļĄāļīāļ§āđ€āļ‹āļĩāļĒāļĄ"]);

Remove words from an existing tokenizer:

tokenizer.remove_word(&["āļāļĢāļ°āđ€āļžāļĢāļē", "āļŠāļēāļ™āļŠāļĨāļē"]);
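
Putting the pieces above together, the following self-contained sketch (placeholder words, not an official example) builds a tokenizer from an in-memory word list, modifies its dictionary, and segments a string:

use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

fn main() {
    // Build a tokenizer from an in-memory word list (no dictionary file needed).
    let words = vec!["āļ‰āļąāļ™".to_string(), "āļāļīāļ™".to_string(), "āļ‚āđ‰āļēāļ§".to_string()];
    let mut tokenizer = NewmmTokenizer::from_word_list(words);

    // The dictionary can be changed after construction.
    tokenizer.add_word(&["āļœāļąāļ”"]);
    tokenizer.remove_word(&["āļ‚āđ‰āļēāļ§"]);

    // safe mode = true, parallel mode = false, as in the examples above.
    let tokens = tokenizer.segment("āļ‰āļąāļ™āļāļīāļ™āļ‚āđ‰āļēāļ§āļœāļąāļ”", true, false).unwrap();
    println!("{:?}", tokens);
}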

Build

Requirements

Steps

Generic test:

cargo test

Build the API documentation and open it in a browser to check:

cargo doc --open

Build (remove --release to keep debug information):

cargo build --release

Check target/ for build artifacts.

Development documents

Issues

Please report issues at https://github.com/PyThaiNLP/nlpo3/issues