rakutenma

morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese


Keywords
morphological, analyzer, chinese, japanese-language, nlp, part-of-speech-tagger, python, word-segmentation
License
Apache-2.0
Install
pip install rakutenma==0.3.3

Documentation

Rakuten MA Python

Rakuten MA Python (morphological analyzer) is a Python version of Rakuten MA (word segmentor + PoS Tagger) for Chinese and Japanese.

For details about Rakuten MA, see https://github.com/rakuten-nlp/rakutenma

See also http://qiita.com/yukinoi/items/925bc238185aa2fad8a7 (in Japanese)

Contributions are welcome!

Installation

pip install rakutenma

Example

from rakutenma import RakutenMA

# Initialize a RakutenMA instance with an empty model
# (the default Japanese feature set is already applied)
rma = RakutenMA()

# Let's analyze a sample sentence (from http://tatoeba.org/jpn/sentences/show/103809)
# With a disastrous result, since the model is empty!
print(rma.tokenize("ε½Όγ―ζ–°γ—γ„δ»•δΊ‹γ§γγ£γ¨ζˆεŠŸγ™γ‚‹γ γ‚γ†γ€‚"))

# Feed the model with ten sample sentences from tatoeba.com
# "tatoeba.json" is available at https://github.com/rakuten-nlp/rakutenma
import json
with open("tatoeba.json", encoding="utf-8") as f:
    tatoeba = json.load(f)
for i in tatoeba:
    rma.train_one(i)

# Now what does the result look like?
print(rma.tokenize("ε½Όγ―ζ–°γ—γ„δ»•δΊ‹γ§γγ£γ¨ζˆεŠŸγ™γ‚‹γ γ‚γ†γ€‚"))

# Initialize a RakutenMA instance with a pre-trained model
rma = RakutenMA(phi=1024, c=0.007812)  # Specify hyperparameters for SCW (for demonstration purposes)
rma.load("model_ja.json")

# Set the feature hash function (15-bit)
rma.hash_func = rma.create_hash_func(15)

# Tokenize one sample sentence
print(rma.tokenize("γ†γ‚‰γ«γ‚γ«γ―γ«γ‚γ«γ‚γ¨γ‚ŠγŒγ„γ‚‹"))

# Re-train the model feeding the right answer (pairs of [token, PoS tag])
res = rma.train_one(
       [["うらにわ","N-nc"],
        ["に","P-k"],
        ["は","P-rj"],
        ["にわ","N-n"],
        ["γ«γ‚γ¨γ‚Š","N-nc"],
        ["が","P-k"],
        ["いる","V-c"]])
# The result of train_one contains:
#   sys: the system output (using the current model)
#   ans: answer fed by the user
#   update: whether the model was updated
print(res)

# Now what does the result look like?
print(rma.tokenize("γ†γ‚‰γ«γ‚γ«γ―γ«γ‚γ«γ‚γ¨γ‚ŠγŒγ„γ‚‹"))
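
# The dictionary returned by train_one above can also be inspected field by
# field; a minimal sketch, assuming the keys are exactly "sys", "ans" and
# "update" as described in the comments above
print(res["ans"])     # the gold-standard [token, PoS tag] pairs fed in
print(res["sys"])     # what the model produced before the update
print(res["update"])  # whether the model weights were adjusted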

NOTE

Added API

Compared to the original Rakuten MA, the following methods are added (a short usage sketch follows the list):

  • RakutenMA::load(model_path) - Load a model from a JSON file
  • RakutenMA::save(model_path) - Save the model to the given path
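
A minimal usage sketch for these two methods (the file names are illustrative):

from rakutenma import RakutenMA

rma = RakutenMA(phi=1024, c=0.007812)
rma.load("model_ja.json")                 # load the model from a JSON file
rma.hash_func = rma.create_hash_func(15)  # re-apply the feature hash, as in the example above
# ... tokenize or train_one as needed ...
rma.save("model_ja_backup.json")          # write the current model back out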

misc

By default, the following values are set (a sketch for overriding them follows the list):

  • rma.featset = CTYPE_JA_PATTERNS # RakutenMA.default_featset_ja
  • rma.hash_func = rma.create_hash_func(15)
  • rma.tag_scheme = "SBIEO" # if using Chinese, set "IOB2"
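
These defaults can be overridden after construction. A minimal sketch for Chinese input; note that the attribute RakutenMA.default_featset_zh is borrowed from the original JavaScript library and is an assumption here:

from rakutenma import RakutenMA

rma = RakutenMA()
rma.tag_scheme = "IOB2"                   # IOB2 tagging scheme for Chinese, per the note above
# rma.featset = RakutenMA.default_featset_zh  # assumed attribute name; verify against the source
rma.hash_func = rma.create_hash_func(15)  # the default 15-bit feature hash, set explicitly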

LICENSE

Apache License version 2.0

Copyright

Rakuten MA Python (c) 2015- Yukino Ikegami. All Rights Reserved.

Rakuten MA (original) (c) 2014 Rakuten NLP Project. All Rights Reserved.