A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.
TinySegmenter is an n-gram word tokenizer for Japanese text originally built by Taku Kudo (2008).
Python
tiniestsegmenter can be installed from PyPI: pip install tiniestsegmenter
import tiniestsegmenter
tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")
With the GIL released on the Rust side, multi-threading is also possible.
import functools
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

tokenizer = functools.partial(tiniestsegmenter.tokenize)
documents = ["ジャガイモが好きです。"] * 10_000

with ThreadPoolExecutor(4) as e:
    list(e.map(tokenizer, documents))
Rust
Add the crate to your project: cargo add tiniestsegmenter
Usage:
use tiniestsegmenter as ts;
fn main() {
    let tokens: Result<Vec<&str>, ts::TokenizeError> = ts::tokenize("ジャガイモが好きです。");
}
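Since tokenize returns a Result, the token list can be used once the error case is handled. A minimal sketch (the error formatting assumes ts::TokenizeError derives Debug):

use tiniestsegmenter as ts;

fn main() {
    // Handle the Result before consuming the tokens.
    match ts::tokenize("ジャガイモが好きです。") {
        Ok(tokens) => println!("{tokens:?}"),
        Err(e) => eprintln!("tokenization failed: {e:?}"),
    }
}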
tiniestsegmenter can process 2 GB of text in under 90 seconds on a MacBook Pro, at around 20 MB/s on a single thread.
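For a rough check on other hardware, throughput can be estimated by timing a single tokenization pass over a local file. A minimal sketch; the file name is a placeholder, and the error handling mirrors the Result-based API above:

use std::fs;
use std::time::Instant;

use tiniestsegmenter as ts;

fn main() {
    // Placeholder path: point this at any large Japanese text file.
    let text = fs::read_to_string("corpus.txt").expect("failed to read input");

    let start = Instant::now();
    let tokens = ts::tokenize(&text).expect("tokenization failed");
    let secs = start.elapsed().as_secs_f64();

    // Throughput = bytes processed / elapsed time.
    let mb = text.len() as f64 / (1024.0 * 1024.0);
    println!("{} tokens at {:.1} MB/s", tokens.len(), mb / secs);
}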
Comparison with similar codebases
Each codebase was benchmarked using the timemachineu8j dataset, a Japanese translation of The Time Machine by Herbert George Wells.
| Repo | Lang | Time (ms) |
|---|---|---|
| jwnz/tiniestsegmenter | Rust | 11.996 |
| jwnz/tiniestsegmenter | Python | 14.803 |
| nyarla/go-japanese-segmenter | Go | 36.869 |
| woxtu/rust-tinysegmenter | Rust | 44.535 |
| JuliaStrings/TinySegmenter.jl | Julia | 45.691 |
| ikawaha/tinysegmenter.go | Go | 58.694 |
| SamuraiT/tinysegmenter | Python | 219.604 |
System:
Chip: Apple M2 Pro (MacBook Pro 14-inch, 2023)
Cores: 10
Memory: 16 GB