Parasol Tokenizer
Parasol tokenizes Hangul after decomposing each syllable into its consonant and vowel jamo.
- Original text : κ³ κ°λλ‘μ μμ Έλμ¨ μ΄λ‘μ μλ§ μ΄ λμμμ μ μΌν μ μ λͺ»ν λλ§μΌ κ±°μΌ
- Decomposed text : γ±γ γ±γ γ·γ γΉγ γ γ γ γ £γ γ γ΄γ γ γ γ΄ γ γ γΉγ γ±γ γ £γ γ γ γ γ γ γ £ γ·γ γ γ £γ γ γ γ γ γ γ γ £γΉγ γ £ γ γ γ±γ γ ‘γ γ γ γ γ γ γ΄ γ΄γ γ γ γ γ΄γ γ £γΉ γ±γ γ γ
- Tokens : βγ±γ γ±γ / γ·γ γΉγ / γ γ / βγ γ £ / γ γ γ΄ / γ γ γ γ΄ / βγ / γ γΉ / γ γ± / γ γ £ / γ / βγ γ γ γ / βγ γ £ / βγ·γ γ γ £ / γ γ γ γ / βγ γ γ γ £γΉ / γ γ £ / βγ γ γ±γ γ ‘γ / βγ γ γ γ γ γ΄ / βγ΄γ γ γ γ γ΄ / γ γ £γΉ / βγ±γ γ γ
- Composed tokens : βκ³ κ° / λλ‘ / μ / βμ / μ Ό / γ μ¨ / βγ / γ γΉ / γ γ± / μ΄ / γ / βμλ§ / βμ΄ / βλμ / μμ / βμ μΌ / ν / βμ μ / βλͺ»ν / βλλ§ / μΌ / βκ±°μΌ
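The decomposition step shown above can be sketched with plain Unicode arithmetic. This is a minimal illustration, not Parasol's implementation (Parasol delegates decomposition to hgtk). The `decompose` helper and the table names are hypothetical, and the sketch assumes the output uses Hangul compatibility jamo and that non-Hangul characters pass through unchanged.

```python
# Jamo tables in Unicode order (Hangul compatibility jamo).
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"  # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"  # 21 vowels
# 28 finals; index 0 means "no final consonant".
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

SYLLABLE_BASE = 0xAC00            # code point of the first precomposed syllable
SYLLABLE_COUNT = 19 * 21 * 28     # number of precomposed Hangul syllables


def decompose(text: str) -> str:
    """Split every precomposed Hangul syllable into its jamo (hypothetical helper)."""
    out = []
    for ch in text:
        offset = ord(ch) - SYLLABLE_BASE
        if 0 <= offset < SYLLABLE_COUNT:
            # offset = (cho * 21 + jung) * 28 + jong
            cho, rest = divmod(offset, 21 * 28)
            jung, jong = divmod(rest, 28)
            out.append(CHO[cho] + JUNG[jung] + JONG[jong])
        else:
            # Spaces, punctuation, Latin text, and bare jamo are left as-is.
            out.append(ch)
    return "".join(out)


print(decompose("한글 토큰"))  # ㅎㅏㄴㄱㅡㄹ ㅌㅗㅋㅡㄴ
```

Because the decomposed string carries no syllable boundary markers, a subword model trained on it can learn units smaller than a syllable, which is what the token list above shows.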
Installation
pip install parasol-nlp
Experiment
The figure shows the results of the perplexity comparison experiment. "with decomposition" means the text is tokenized after character decomposition; "no decomposition" means the text is tokenized directly.
Experiment source code is here.
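For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below only illustrates that formula; the `perplexity` function is hypothetical, and the language model and evaluation data used in the actual experiment are not described in this README.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities (hypothetical helper)."""
    if not log_probs:
        raise ValueError("need at least one token")
    # exp of the mean negative log-likelihood
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that assigns probability 0.25 to each of four tokens
# has a perplexity of approximately 4.
print(perplexity([math.log(0.25)] * 4))
```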
Usage
Tokenizer
The tokenizer uses SentencePiece's BPE model for tokenization and hgtk for Hangul decomposition.
from parasol import Tokenizer
# tokenize after decomposition
t1 = Tokenizer(decompose=True)
# tokenize without decomposition
t2 = Tokenizer(decompose=False)
then
>>> t1.tokenize("κ³ κ°λλ‘μ μμ Έλμ¨ μ΄λ‘μ μλ§ μ΄ λμμμ μ μΌν μ μ λͺ»ν λλ§μΌ κ±°μΌ")
['βκ³ κ°', 'λλ‘', 'μ', 'βμ', 'μ Ό', 'γ
μ¨', 'βγ
', 'γ
λ‘', 'μ', 'βμλ§', 'βμ΄', 'βλμ', 'μμ', 'βμ μΌ', 'ν', 'βμ μ', 'βλͺ»ν', 'βλλ§', 'μΌ', 'βκ±°μΌ']
>>> t2.tokenize("κ³ κ°λλ‘μ μμ Έλμ¨ μ΄λ‘μ μλ§ μ΄ λμμμ μ μΌν μ μ λͺ»ν λλ§μΌ κ±°μΌ")
['βκ³ κ°', 'λλ‘', 'μ', 'βμ', 'μ Έ', 'λμ¨', 'βμ΄λ‘', 'μ', 'βμλ§', 'βμ΄', 'βλμ', 'μμ', 'βμ μΌ', 'ν', 'βμ μ', 'βλͺ»ν', 'βλλ§', 'μΌ', 'βκ±°μΌ']
# Output as vocabulary id
>>> t1.tokenize("κ³ κ°λλ‘μ μμ Έλμ¨ μ΄λ‘μ μλ§ μ΄ λμμμ μ μΌν μ μ λͺ»ν λλ§μΌ κ±°μΌ", as_id=True)
[17687, 2135, 36, 8351, 3904, 3842, 52, 12256, 27398, 3469, 30, 6105, 160, 3767, 198, 8953, 2345, 13164, 89, 6872]
Composer
The composer recombines decomposed Hangul jamo into syllables.
from parasol import Composer
c = Composer()
then
>>> c.compose("γ·γ
γΉγ
γ
£ γ±γ
£γ
γ
γ΄ γ
γ
γ
γ
γ
γΉγ
γ΄γ
γ
£γ
γ
γ
£ γ
γ
‘γ
γ
γ·γ
‘γ΄ γ±γ
γΉγ
γ
γ±γ
γ
‘γΉ γ±γ
γΉγ
γ
γ±γ
γ·γ
γ΄ γ±γ
£γΉγ
γ
")
'λ¬μ΄ κΈ°μ΄ λ°€ νΌλ°λΉμ΄ μ€λ©°λ 골λͺ©μ κ±Έμ΄κ°λ κΈΈμ'
Composition is not perfect, however. For example:
>>> c.compose("γ
γ
γ
γ
γ
γ΄γ
γ
‘γΉ γ
γ
£γΉγ
γ
γ
γ
γ
γ
")
'νμ΄μ λΉμ΄μ―γ
'
whereas the original text is νμ΄μ λΉμ΄μγ
γ
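One plausible reason for this kind of mismatch is that a jamo stream without syllable boundary markers is ambiguous: a bare consonant that follows a syllable can be read either as that syllable's final consonant or as a standalone jamo. The sketch below shows the effect with a simple greedy composer. It is not Parasol's `Composer`; the `compose` function and table names are hypothetical.

```python
# Jamo tables in Unicode order (Hangul compatibility jamo).
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"  # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"  # 21 vowels
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
JONG_INDEX = {c: k for k, c in enumerate(JONG) if c}


def compose(jamo: str) -> str:
    """Greedily rebuild syllables from a jamo stream (hypothetical helper)."""
    out = []
    i, n = 0, len(jamo)
    while i < n:
        # A syllable needs an initial consonant followed by a vowel.
        if i + 1 < n and jamo[i] in CHO and jamo[i + 1] in JUNG:
            cho = CHO.index(jamo[i])
            jung = JUNG.index(jamo[i + 1])
            jong = 0
            i += 2
            # Take the next consonant as a final unless a vowel follows it,
            # in which case it must start the next syllable.
            starts_next = i + 1 < n and jamo[i + 1] in JUNG
            if i < n and jamo[i] in JONG_INDEX and not starts_next:
                jong = JONG_INDEX[jamo[i]]
                i += 1
            out.append(chr(0xAC00 + (cho * 21 + jung) * 28 + jong))
        else:
            # Anything that cannot start a syllable is copied through.
            out.append(jamo[i])
            i += 1
    return "".join(out)


print(compose("ㅎㅏㄴㄱㅡㄹ"))  # 한글
# "아ㅋㅋ" decomposes to "ㅇㅏㅋㅋ", but the first ㅋ is absorbed as a final:
print(compose("ㅇㅏㅋㅋ"))      # 앜ㅋ
```

Text that ends a word with standalone jamo therefore cannot always be restored exactly unless the decomposed form records where each syllable ends.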