tiniestsegmenter

Compact Japanese segmenter


Keywords
tokenizer, NLP, ngram, Japanese
License
MIT
Install
pip install tiniestsegmenter==0.2.0

Documentation

TiniestSegmenter

A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.

TinySegmenter is an n-gram word tokenizer for Japanese text originally built by Taku Kudo (2008).

Usage

Python

tiniestsegmenter can be installed from PyPI: pip install tiniestsegmenter

import tiniestsegmenter

tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")
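
tokenize returns a plain list of strings. As a rough illustration (the exact token boundaries come from TinySegmenter's model, so the segmentation shown in the comment is indicative rather than guaranteed):

print(tokens)  # e.g. ['ジャガイモ', 'が', '好き', 'です', '。']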

With the GIL released on the Rust side, multi-threading is also possible.

import functools
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

tokenizer = functools.partial(tiniestsegmenter.tokenize)

documents = ["ジャガイモが好きです。"] * 10_000
with ThreadPoolExecutor(4) as e:
    list(e.map(tokenizer, documents))

Rust

Add the crate to your project: cargo add tiniestsegmenter

Usage:

use tiniestsegmenter as ts;

fn main() {
    let tokens: Result<Vec<&str>, ts::TokenizeError> = ts::tokenize("ジャガイモが好きです。");
}

Performance

tiniestsegmenter can process 2 GB of text in under 90 seconds on a MacBook Pro, at roughly 20 MB/s on a single thread.
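
A throughput figure like this is easy to sanity-check locally. A minimal sketch, assuming a large UTF-8 Japanese text file at a placeholder path (corpus_ja.txt is not shipped with the package):

import time
import tiniestsegmenter

# Hypothetical corpus path; substitute any large UTF-8 Japanese text file.
with open("corpus_ja.txt", encoding="utf-8") as f:
    text = f.read()

start = time.perf_counter()
tokens = tiniestsegmenter.tokenize(text)
elapsed = time.perf_counter() - start

# Throughput in MB/s, measured against the UTF-8 byte size of the input.
print(f"{len(text.encode('utf-8')) / elapsed / 1e6:.1f} MB/s, {len(tokens)} tokens")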

Comparison with similar codebases

Each codebase was benchmarked on the timemachineu8j dataset, a Japanese translation of The Time Machine by Herbert George Wells.

Repo                           Lang    Time (ms)
jwnz/tiniestsegmenter          Rust    11.996
jwnz/tiniestsegmenter          Python  14.803
nyarla/go-japanese-segmenter   Go      36.869
woxtu/rust-tinysegmenter       Rust    44.535
JuliaStrings/TinySegmenter.jl  Julia   45.691
ikawaha/tinysegmenter.go       Go      58.694
SamuraiT/tinysegmenter         Python  219.604

System:
Chip: Apple M2 Pro (MacBook Pro 14-inch, 2023)
Cores: 10
Memory: 16 GB