parasol-nlp

Korean tokenizer with character decomposition


Keywords
hangul, korean, nlp, tokenizer, decomposition
License
Apache-2.0
Install
pip install parasol-nlp==0.0.4

Documentation

Parasol Tokenizer

Parasol tokenizes Hangul text after decomposing each syllable into its jamo (consonants and vowels).

  • Original text : κ³ κ°€λ„λ‘œμ— μ‚μ Έλ‚˜μ˜¨ 초둝잎 μ•„λ§ˆ 이 λ„μ‹œμ—μ„œ 유일히 적응 λͺ»ν•œ λ‚­λ§ŒμΌ κ±°μ•Ό
  • Decomposed text : ㄱㅗㄱㅏㄷㅗㄹㅗㅇㅔ γ…ƒγ…£γ…ˆγ…•γ„΄γ…γ…‡γ…—γ„΄ γ…Šγ…—γ„Ήγ…—γ„±γ…‡γ…£γ… ㅇㅏㅁㅏ γ…‡γ…£ γ„·γ…—γ……γ…£γ…‡γ…”γ……γ…“ γ…‡γ… γ…‡γ…£γ„Ήγ…Žγ…£ γ…ˆγ…“γ„±γ…‡γ…‘γ…‡ γ…γ…—γ……γ…Žγ…γ„΄ ㄴㅏㅇㅁㅏㄴㅇㅣㄹ γ„±γ…“γ…‡γ…‘
  • Tokens : ▁ㄱㅗㄱㅏ / γ„·γ…—γ„Ήγ…— / γ…‡γ…” / ▁ㅃㅣ / γ…ˆγ…•γ„΄ / ㅏㅇㅗㄴ / β–γ…Š / γ…—γ„Ή / γ…—γ„± / γ…‡γ…£ / ㅍ / ▁ㅇㅏㅁㅏ / ▁ㅇㅣ / ▁ㄷㅗㅅㅣ / γ…‡γ…”γ……γ…“ / ▁ㅇㅠㅇㅣㄹ / γ…Žγ…£ / β–γ…ˆγ…“γ„±γ…‡γ…‘γ…‡ / β–γ…γ…—γ……γ…Žγ…γ„΄ / ▁ㄴㅏㅇㅁㅏㄴ / γ…‡γ…£γ„Ή / ▁ㄱㅓㅇㅑ
  • Composed tokens : ▁고가 / λ„λ‘œ / 에 / ▁삐 / μ Ό / γ…μ˜¨ / β–γ…Š / γ…—γ„Ή / γ…—γ„± / 이 / ㅍ / β–μ•„λ§ˆ / ▁이 / β–λ„μ‹œ / μ—μ„œ / β–μœ μΌ / 히 / ▁적응 / ▁λͺ»ν•œ / β–λ‚­λ§Œ / 일 / ▁거야

Installation

pip install parasol-nlp

Experiment

The figure shows the results of the perplexity comparison experiment: "with decomposition" means the text was tokenized after character decomposition, while "no decomposition" means it was tokenized as-is. The experiment source code is here.

[Figure: perplexity comparison, with decomposition vs. no decomposition]
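For reference, perplexity is the exponentiated mean negative log-likelihood per token, so a lower value means the model finds the tokenized corpus more predictable. The helper below sketches only the metric itself; the function name and inputs are illustrative, not the experiment's actual code.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over a token sequence."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Four tokens each assigned probability 1/4 give a perplexity of 4.
print(perplexity([math.log(0.25)] * 4))
```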

Usage

Tokenizer

The tokenizer uses a SentencePiece BPE model, with hgtk handling the jamo decomposition.

from parasol import Tokenizer

# tokenize after decomposition  
t1 = Tokenizer(decompose=True)
# tokenize without decomposition
t2 = Tokenizer(decompose=False)

Then:

>>> t1.tokenize("κ³ κ°€λ„λ‘œμ— μ‚μ Έλ‚˜μ˜¨ 초둝잎 μ•„λ§ˆ 이 λ„μ‹œμ—μ„œ 유일히 적응 λͺ»ν•œ λ‚­λ§ŒμΌ κ±°μ•Ό")
['▁고가', 'λ„λ‘œ', '에', '▁삐', 'μ Ό', 'γ…μ˜¨', 'β–γ…Š', 'ㅗ둝', '잎', 'β–μ•„λ§ˆ', '▁이', 'β–λ„μ‹œ', 'μ—μ„œ', 'β–μœ μΌ', '히', '▁적응', '▁λͺ»ν•œ', 'β–λ‚­λ§Œ', '일', '▁거야']
>>> t2.tokenize("κ³ κ°€λ„λ‘œμ— μ‚μ Έλ‚˜μ˜¨ 초둝잎 μ•„λ§ˆ 이 λ„μ‹œμ—μ„œ 유일히 적응 λͺ»ν•œ λ‚­λ§ŒμΌ κ±°μ•Ό")
['▁고가', 'λ„λ‘œ', '에', '▁삐', 'μ Έ', 'λ‚˜μ˜¨', 'β–μ΄ˆλ‘', '잎', 'β–μ•„λ§ˆ', '▁이', 'β–λ„μ‹œ', 'μ—μ„œ', 'β–μœ μΌ', '히', '▁적응', '▁λͺ»ν•œ', 'β–λ‚­λ§Œ', '일', '▁거야']

# Output as vocabulary id
>>> t1.tokenize("κ³ κ°€λ„λ‘œμ— μ‚μ Έλ‚˜μ˜¨ 초둝잎 μ•„λ§ˆ 이 λ„μ‹œμ—μ„œ 유일히 적응 λͺ»ν•œ λ‚­λ§ŒμΌ κ±°μ•Ό", as_id=True)
[17687, 2135, 36, 8351, 3904, 3842, 52, 12256, 27398, 3469, 30, 6105, 160, 3767, 198, 8953, 2345, 13164, 89, 6872]

Composer

Composes decomposed Hangul jamo back into syllables.

from parasol import Composer

c = Composer()

Then:

>>> c.compose("ㄷㅏㄹㅇㅣ γ„±γ…£γ…‡γ…œγ„΄ ㅂㅏㅁ γ…γ…“γ„Ήγ…“γ„΄γ…‚γ…£γ…Šγ…‡γ…£ ㅅㅑㅁㅕㄷㅑㄴ ㄱㅗㄹㅁㅗㄱㅇㅑㄹ ㄱㅓㄹㅇㅓㄱㅏㄷㅓㄴ γ„±γ…£γ„Ήγ…‡γ…”")
'달이 기운 λ°€ νΌλŸ°λΉ›μ΄ μŠ€λ©°λ“  골λͺ©μ„ κ±Έμ–΄κ°€λ˜ 길에'

Composition is not perfect, however:

>>> c.compose("γ…Žγ…γ…‡γ…‡γ…œγ„΄γ…‡γ…‘γ„Ή γ…‚γ…£γ„Ήγ…‡γ…“γ…‡γ…›γ…Žγ…Ž")
'ν–‰μš΄μ„ λΉŒμ–΄μš―γ…Ž'

where the original text was ν–‰μš΄μ„ λΉŒμ–΄μš”γ…Žγ…Ž. The standalone γ…Žγ…Ž (laughter) is indistinguishable in the jamo stream from a syllable-final consonant, so the sequence cannot always be recomposed uniquely.