A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.
TinySegmenter is an n-gram word tokenizer for Japanese text originally built by Taku Kudo (2008).
Python
tiniestsegmenter can be installed from PyPI: pip install tiniestsegmenter
import tiniestsegmenter
tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")
With the GIL released on the Rust side, multi-threading is also possible.
import functools
from concurrent.futures import ThreadPoolExecutor

import tiniestsegmenter

tokenizer = functools.partial(tiniestsegmenter.tokenize)
documents = ["ジャガイモが好きです。"] * 10_000

with ThreadPoolExecutor(4) as e:
    list(e.map(tokenizer, documents))
Rust
Add the crate to your project: cargo add tiniestsegmenter
Usage:
use tiniestsegmenter as ts;
fn main() {
    let tokens: Result<Vec<&str>, ts::TokenizeError> = ts::tokenize("ジャガイモが好きです。");
}
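Since tokenize returns a Result, the token list can be used once the error case is handled. A minimal sketch (the error formatting assumes ts::TokenizeError derives Debug):

use tiniestsegmenter as ts;

fn main() {
    // Handle the Result before consuming the tokens.
    match ts::tokenize("ジャガイモが好きです。") {
        Ok(tokens) => println!("{tokens:?}"),
        Err(e) => eprintln!("tokenization failed: {e:?}"),
    }
}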
tiniestsegmenter can process 2 GB of text in under 90 seconds on a MacBook Pro, at around 20 MB/s on a single thread.
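For a rough check on other hardware, throughput can be estimated by timing a single tokenization pass over a local file. A minimal sketch; the file name is a placeholder, and the error handling mirrors the Result-based API above:

use std::fs;
use std::time::Instant;

use tiniestsegmenter as ts;

fn main() {
    // Placeholder path: point this at any large Japanese text file.
    let text = fs::read_to_string("corpus.txt").expect("failed to read input");

    let start = Instant::now();
    let tokens = ts::tokenize(&text).expect("tokenization failed");
    let secs = start.elapsed().as_secs_f64();

    // Throughput = bytes processed / elapsed time.
    let mb = text.len() as f64 / (1024.0 * 1024.0);
    println!("{} tokens at {:.1} MB/s", tokens.len(), mb / secs);
}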
Comparison with similar codebases
Each codebase was benchmarked using the timemachineu8j dataset, a Japanese translation of The Time Machine by Herbert George Wells.
| Repo | Lang | Time (ms) |
|---|---|---|
| jwnz/tiniestsegmenter | Rust | 11.996 |
| jwnz/tiniestsegmenter | Python | 14.803 |
| nyarla/go-japanese-segmenter | Go | 36.869 |
| woxtu/rust-tinysegmenter | Rust | 44.535 |
| JuliaStrings/TinySegmenter.jl | Julia | 45.691 |
| ikawaha/tinysegmenter.go | Go | 58.694 |
| SamuraiT/tinysegmenter | Python | 219.604 |
System:
Chip: Apple M2 Pro (MacBook Pro 14-inch, 2023)
Cores: 10
Memory: 16 GB