kiwipiepy-model

Model for kiwipiepy


Keywords
Korean, morphological, analysis, korean-nlp, korean-tokenizer, morphological-analysis, nlp, python-library, word-segmentation
License
xpp
Install
pip install kiwipiepy-model==0.17.0

Documentation

Kiwipiepy, the Kiwi package for Python

https://github.com/bab2min/kiwipiepy

Python3 API documentation: https://bab2min.github.io/kiwipiepy

Since version 0.5, Kiwi provides a Python3 API. You can build this project yourself and import the resulting module into Python, or, more simply, install the prebuilt kiwipiepy module with pip.

$ pip install --upgrade pip
$ pip install kiwipiepy

or

$ pip3 install --upgrade pip
$ pip3 install kiwipiepy

The kiwipiepy package currently supports Windows (Vista or later), Linux, and macOS 10.12 or later.

In environments for which no binary distribution is provided, such as macOS on M1, installation compiles the source code and therefore requires cmake 3.12 or later.

$ pip install cmake
$ pip install --upgrade pip
$ pip install kiwipiepy

Trying it out

Since version 0.6.3, Kiwi provides an interactive interface so that you can test it right after installation. Once the pip installation has finished, run the following to try out the morphological analyzer.

$ python -m kiwipiepy

or

$ python3 -m kiwipiepy

Once the interactive interface starts, type any sentence to see its morphological analysis right away.

>> 안녕?
[Token(form='안녕', tag='IC', start=0, len=2), Token(form='?', tag='SF', start=2, len=3)]

To exit the interface, press Ctrl + C.

The part-of-speech tags used by Kiwi are based on the tag set of the Sejong corpus, with some tags refined. For the full tag set, refer to the project documentation.

Quick examples

>>> from kiwipiepy import Kiwi
>>> kiwi = Kiwi()
# The tokenize method returns the result of morphological analysis.
>>> kiwi.tokenize("안녕하세요 형태소 분석기 키위입니다.")
[Token(form='안녕', tag='NNG', start=0, len=2),
 Token(form='하', tag='XSA', start=2, len=1),
 Token(form='시', tag='EP', start=4, len=1),
 Token(form='어요', tag='EC', start=3, len=2),
 Token(form='형태소', tag='NNG', start=6, len=3),
 Token(form='분석', tag='NNG', start=10, len=2),
 Token(form='기', tag='NNG', start=12, len=1),
 Token(form='키위', tag='NNG', start=14, len=2),
 Token(form='이', tag='VCP', start=16, len=1),
 Token(form='ᆸ니다', tag='EF', start=17, len=2),
 Token(form='.', tag='SF', start=19, len=1)]

# The normalize_coda option keeps the analysis from breaking
# when an extra consonant has been attached to a syllable as its final consonant.
>>> kiwi.tokenize("ㅋㅋㅋ 이런 것도 분석이 될까욬ㅋㅋ?", normalize_coda=True)
[Token(form='ㅋㅋㅋ', tag='SW', start=0, len=3),
 Token(form='이런', tag='MM', start=4, len=2),
 Token(form='것', tag='NNB', start=7, len=1),
 Token(form='도', tag='JX', start=8, len=1),
 Token(form='분석', tag='NNG', start=10, len=2),
 Token(form='이', tag='JKS', start=12, len=1),
 Token(form='되', tag='VV', start=14, len=1),
 Token(form='ᆯ까요', tag='EC', start=15, len=2),
 Token(form='ㅋㅋㅋ', tag='SW', start=17, len=2),
 Token(form='?', tag='SF', start=19, len=1)]

# A Stopwords class is also provided for managing stopwords.
>>> from kiwipiepy.utils import Stopwords
>>> stopwords = Stopwords()
>>> kiwi.tokenize("분석 결과에서 불용어만 제외하고 출력할 수도 있다.", stopwords=stopwords)
[Token(form='분석', tag='NNG', start=0, len=2),
 Token(form='결과', tag='NNG', start=3, len=2),
 Token(form='불', tag='XPN', start=8, len=1),
 Token(form='용어', tag='NNG', start=9, len=2),
 Token(form='제외', tag='NNG', start=13, len=2),
 Token(form='출력', tag='NNG', start=18, len=2)]

# The add and remove methods add words to or delete words from the stopword list.
>>> stopwords.add(('결과', 'NNG'))
>>> kiwi.tokenize("분석 결과에서 불용어만 제외하고 출력할 수도 있다.", stopwords=stopwords)
[Token(form='분석', tag='NNG', start=0, len=2),
 Token(form='불', tag='XPN', start=8, len=1),
 Token(form='용어', tag='NNG', start=9, len=2),
 Token(form='제외', tag='NNG', start=13, len=2),
 Token(form='출력', tag='NNG', start=18, len=2)]

>>> tokens = kiwi.tokenize("각 토큰은 여러 정보를 담고 있습니다.")
>>> tokens[0]
Token(form='각', tag='MM', start=0, len=1)
>>> tokens[0].form # surface form of the morpheme
'각'
>>> tokens[0].tag # part-of-speech tag of the morpheme
'MM'
>>> tokens[0].start # start and end positions (in characters)
0
>>> tokens[0].end
1
>>> tokens[0].word_position # index of the word (eojeol) within the current sentence
0
>>> tokens[0].sent_position # index of the sentence the morpheme belongs to
0
>>> tokens[0].line_number # number of the line the morpheme belongs to
0
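
The start and len values in the outputs above are character offsets into the input string. The following minimal sketch checks this with plain Python slicing. It does not call kiwipiepy: the (form, start, len) triples are copied by hand from the stopword example, and plain tuples stand in for Token objects.

```python
# Input sentence and (form, start, len) triples copied from the stopword example.
# Plain tuples are used here as stand-ins for kiwipiepy Token objects.
text = "분석 결과에서 불용어만 제외하고 출력할 수도 있다."
tokens = [
    ("분석", 0, 2),
    ("결과", 3, 2),
    ("불", 8, 1),
    ("용어", 9, 2),
    ("제외", 13, 2),
    ("출력", 18, 2),
]

for form, start, length in tokens:
    end = start + length          # corresponds to the `end` attribute
    surface = text[start:end]     # characters covered in the input
    print(form, (start, end), surface, form == surface)
```

For these tokens the slice equals the form. That is not guaranteed in general: when a morpheme is restored from a contracted or misspelled surface form, the covered substring can differ from the form.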

# Sentence splitting is also supported.
>>> kiwi.split_into_sents("여러 문장으로 구성된 텍스트네 이걸 분리해줘")
[Sentence(text='여러 문장으로 구성된 텍스트네', start=0, end=16, tokens=None),
 Sentence(text='이걸 분리해줘', start=17, end=24, tokens=None)]

# Sentence splitting and morphological analysis can be performed together.
>>> kiwi.split_into_sents("여러 문장으로 구성된 텍스트네 이걸 분리해줘", return_tokens=True)
[Sentence(text='여러 문장으로 구성된 텍스트네', start=0, end=16, tokens=[
  Token(form='여러', tag='MM', start=0, len=2),
  Token(form='문장', tag='NNG', start=3, len=2),
  Token(form='으로', tag='JKB', start=5, len=2),
  Token(form='구성', tag='NNG', start=8, len=2),
  Token(form='되', tag='XSV', start=10, len=1),
  Token(form='ᆫ', tag='ETM', start=11, len=0),
  Token(form='텍스트', tag='NNG', start=12, len=3),
  Token(form='이', tag='VCP', start=15, len=1),
  Token(form='네', tag='EF', start=15, len=1)]),
 Sentence(text='이걸 분리해줘', start=17, end=24, tokens=[
  Token(form='이거', tag='NP', start=17, len=2),
  Token(form='ᆯ', tag='JKO', start=19, len=0),
  Token(form='분리', tag='NNG', start=20, len=2),
  Token(form='하', tag='XSV', start=22, len=1),
  Token(form='어', tag='EC', start=22, len=1),
  Token(form='주', tag='VX', start=23, len=1),
  Token(form='어', tag='EF', start=23, len=1)])]

# New words can be added to the dictionary.
>>> kiwi.add_user_word("김갑갑", "NNP")
True
>>> kiwi.tokenize("김갑갑이 누구야")
[Token(form='김갑갑', tag='NNP', start=0, len=3),
 Token(form='이', tag='JKS', start=3, len=1),
 Token(form='누구', tag='NP', start=5, len=2),
 Token(form='야', tag='JKV', start=7, len=1)]

# New in v0.11.0
# Since version 0.11.0, adding a verb or adjective to the user dictionary also registers its conjugated forms.
# Analyzing the verb `팅기다`, which is not in the dictionary, gives a wrong result.
>>> kiwi.tokenize('팅겼다')
[Token(form='팅기', tag='NNG', start=0, len=2),
 Token(form='하', tag='XSA', start=2, len=0),
 Token(form='다', tag='EF', start=2, len=1)]

# Once the morpheme `팅기/VV` is registered, all of its conjugated forms are added automatically,
# so forms such as `팅겼다` and `팅길` can all be analyzed.
>>> kiwi.add_user_word('팅기', 'VV')
True
>>> kiwi.tokenize('팅겼다')
[Token(form='팅기', tag='VV', start=0, len=2),
 Token(form='었', tag='EP', start=1, len=1),
 Token(form='다', tag='EF', start=2, len=1)]

# Altered morphemes can also be added in bulk to tune the analyzer to the target text.
>>> kiwi.tokenize("안녕하세영, 제 이름은 이세영이에영. 님도 학생이세영?")
[Token(form='안녕', tag='NNG', start=0, len=2),
 Token(form='하', tag='XSA', start=2, len=1),
 Token(form='시', tag='EP', start=3, len=1),
 Token(form='어', tag='EC', start=3, len=1),
 Token(form='영', tag='MAG', start=4, len=1), # misanalysis
 Token(form=',', tag='SP', start=5, len=1),
 Token(form='저', tag='NP', start=7, len=1),
 Token(form='의', tag='JKG', start=7, len=1),
 Token(form='이름', tag='NNG', start=9, len=2),
 Token(form='은', tag='JX', start=11, len=1),
 Token(form='이세영', tag='NNP', start=13, len=3),
 Token(form='이', tag='JKS', start=16, len=1),
 Token(form='에', tag='IC', start=17, len=1),
 Token(form='영', tag='NR', start=18, len=1),
 Token(form='.', tag='SF', start=19, len=1),
 Token(form='님', tag='NNG', start=21, len=1),
 Token(form='도', tag='JX', start=22, len=1),
 Token(form='학생', tag='NNG', start=24, len=2),
 Token(form='이세영', tag='NNP', start=26, len=3), # misanalysis
 Token(form='?', tag='SF', start=29, len=1)]

# Take every sentence-final ending (EF) that ends in '요', replace the '요' with '영', and insert the results in bulk.
# The altered endings get a penalty of -3 so that they do not take priority over the original morphemes.
# The newly inserted morphemes are returned.
>>> kiwi.add_re_rule('EF', '요$', '영', -3)
['어영', '에영', '지영', '잖아영', '거든영', 'ᆯ까영', '네영', '구영', '나영', '군영', ..., '으니깐영']

# Analyzing the same sentence again shows the improved result.
>>> kiwi.tokenize("안녕하세영, 제 이름은 이세영이에영. 님도 학생이세영?")
[Token(form='안녕', tag='NNG', start=0, len=2),
 Token(form='하', tag='XSA', start=2, len=1),
 Token(form='시', tag='EP', start=3, len=1),
 Token(form='어영', tag='EF', start=3, len=2), # improved result
 Token(form=',', tag='SP', start=5, len=1),
 Token(form='저', tag='NP', start=7, len=1),
 Token(form='의', tag='JKG', start=7, len=1),
 Token(form='이름', tag='NNG', start=9, len=2),
 Token(form='은', tag='JX', start=11, len=1),
 Token(form='이세영', tag='NNP', start=13, len=3),
 Token(form='이', tag='VCP', start=16, len=1),
 Token(form='에영', tag='EF', start=17, len=2),
 Token(form='.', tag='SF', start=19, len=1),
 Token(form='님', tag='NNG', start=21, len=1),
 Token(form='도', tag='JX', start=22, len=1),
 Token(form='학생', tag='NNG', start=24, len=2),
 Token(form='이', tag='VCP', start=26, len=1),
 Token(form='시', tag='EP', start=27, len=1),
 Token(form='어영', tag='EF', start=27, len=2), # improved result
 Token(form='?', tag='SF', start=29, len=1)]
 
# Registering a pre-analyzed form corrects strings that are not analyzed the way you want.
# In the next sentence, `사겼대` contains a typo, so it is not analyzed correctly.
>>> kiwi.tokenize('걔네 둘이 사겼대')
[Token(form='걔', tag='NP', start=0, len=1),
 Token(form='네', tag='XSN', start=1, len=1),
 Token(form='둘', tag='NR', start=3, len=1),
 Token(form='이', tag='JKS', start=4, len=1),
 Token(form='사', tag='NR', start=6, len=1),
 Token(form='기', tag='VV', start=7, len=1),
 Token(form='었', tag='EP', start=7, len=1),
 Token(form='대', tag='EF', start=8, len=1)]
# The add_pre_analyzed_word method can correct this as follows.
>>> kiwi.add_pre_analyzed_word('사겼대', ['사귀/VV', '었/EP', '대/EF'], -3)
True
# Analyzing the same sentence again afterwards shows that the result has changed.
>>> kiwi.tokenize('걔네 둘이 사겼대')
[Token(form='걔', tag='NP', start=0, len=1),
 Token(form='네', tag='XSN', start=1, len=1),
 Token(form='둘', tag='NR', start=3, len=1),
 Token(form='이', tag='JKS', start=4, len=1),
 Token(form='사귀', tag='VV', start=6, len=3),
 Token(form='었', tag='EP', start=6, len=3),
 Token(form='대', tag='EF', start=6, len=3)]
# However, 사귀/VV, 었/EP and 대/EF are all wrongly reported with start position 6 and length 3.
# To fix this, pass the position of each morpheme to add_pre_analyzed_word as well.
>>> kiwi = Kiwi()
>>> kiwi.add_pre_analyzed_word('사겼대', [('사귀', 'VV', 0, 2), ('었', 'EP', 1, 2), ('대', 'EF', 2, 3)], -3)
True
>>> kiwi.tokenize('걔네 둘이 사겼대')
[Token(form='걔', tag='NP', start=0, len=1),
 Token(form='네', tag='XSN', start=1, len=1),
 Token(form='둘', tag='NR', start=3, len=1),
 Token(form='이', tag='JKS', start=4, len=1),
 Token(form='사귀', tag='VV', start=6, len=2),
 Token(form='었', tag='EP', start=7, len=1),
 Token(form='대', tag='EF', start=8, len=1)]

# New in v0.12.0
# Since version 0.12.0, morphemes can be joined back into a sentence.
>>> kiwi.join([('길', 'NNG'), ('을', 'JKO'), ('묻', 'VV'), ('어요', 'EF')])
'길을 물어요'
>>> kiwi.join([('흙', 'NNG'), ('이', 'JKS'), ('묻', 'VV'), ('어요', 'EF')])
'흙이 묻어요'

# New in v0.13.0
# A more powerful language model, SkipBigram (sbg), has been added.
# Unlike the existing knlm, it can take distant morphemes into account.
>>> kiwi = Kiwi(model_type='knlm')
>>> kiwi.tokenize('이 번호로 전화를 이따가 꼭 반드시 걸어.')
[Token(form='이', tag='MM', start=0, len=1),
 Token(form='번호', tag='NNG', start=2, len=2),
 Token(form='로', tag='JKB', start=4, len=1),
 Token(form='전화', tag='NNG', start=6, len=2),
 Token(form='를', tag='JKO', start=8, len=1),
 Token(form='이따가', tag='MAG', start=10, len=3),
 Token(form='꼭', tag='MAG', start=14, len=1),
 Token(form='반드시', tag='MAG', start=16, len=3),
 Token(form='걷', tag='VV-I', start=20, len=1),  # wrongly chose '걷다' (to walk) over '걸다' (to place a call)
 Token(form='어', tag='EF', start=21, len=1),
 Token(form='.', tag='SF', start=22, len=1)]

>>> kiwi = Kiwi(model_type='sbg')
>>> kiwi.tokenize('이 번호로 전화를 이따가 꼭 반드시 걸어.')
[Token(form='이', tag='MM', start=0, len=1),
 Token(form='번호', tag='NNG', start=2, len=2),
 Token(form='로', tag='JKB', start=4, len=1),
 Token(form='전화', tag='NNG', start=6, len=2),
 Token(form='를', tag='JKO', start=8, len=1),
 Token(form='이따가', tag='MAG', start=10, len=3),
 Token(form='꼭', tag='MAG', start=14, len=1),
 Token(form='반드시', tag='MAG', start=16, len=3),
 Token(form='걸', tag='VV', start=20, len=1), # correctly chose '걸다' over '걷다'
 Token(form='어', tag='EC', start=21, len=1),
 Token(form='.', tag='SF', start=22, len=1)]

# Typo correction has also been added.
# Correcting simple typos keeps a minor typo from derailing the whole analysis.
>>> kiwi = Kiwi(model_type='sbg', typos='basic')
>>> kiwi.tokenize('외않됀대?') # with typo correction enabled, loading takes about 5 to 10 seconds
[Token(form='왜', tag='MAG', start=0, len=1),
 Token(form='안', tag='MAG', start=1, len=1),
 Token(form='되', tag='VV', start=2, len=1),
 Token(form='ᆫ대', tag='EF', start=2, len=2),
 Token(form='?', tag='SF', start=4, len=1)]

>>> kiwi.tokenize('장례희망이 뭐냐는 선섕님의 질문에 벙어리가 됫따')
[Token(form='장래', tag='NNG', start=0, len=2),
 Token(form='희망', tag='NNG', start=2, len=2),
 Token(form='이', tag='JKS', start=4, len=1),
 Token(form='뭐', tag='NP', start=6, len=1),
 Token(form='이', tag='VCP', start=7, len=0),
 Token(form='냐는', tag='ETM', start=7, len=2),
 Token(form='선생', tag='NNG', start=10, len=2),
 Token(form='님', tag='XSN', start=12, len=1),
 Token(form='의', tag='JKG', start=13, len=1),
 Token(form='질문', tag='NNG', start=15, len=2),
 Token(form='에', tag='JKB', start=17, len=1),
 Token(form='벙어리', tag='NNG', start=19, len=3),
 Token(form='가', tag='JKC', start=22, len=1),
 Token(form='되', tag='VV', start=24, len=1),
 Token(form='엇', tag='EP', start=24, len=1),
 Token(form='다', tag='EF', start=25, len=1)]

# Version 0.17.1 adds typo correction for liaison spellings (연철).
# It corrects cases where a final consonant followed by an initial ㅇ/ㅎ was wrongly written as carried over to the next syllable.
>>> kiwi = Kiwi(typos='continual')
>>> kiwi.tokenize('오늘사무시레서')
[Token(form='오늘', tag='NNG', start=0, len=2),
 Token(form='사무실', tag='NNG', start=2, len=4),
 Token(form='에서', tag='JKB', start=5, len=2)]
>>> kiwi.tokenize('지가캤어요')
[Token(form='지각', tag='NNG', start=0, len=3),
 Token(form='하', tag='XSV', start=2, len=1),
 Token(form='었', tag='EP', start=2, len=1),
 Token(form='어요', tag='EF', start=3, len=2)]

# Basic typo correction and liaison typo correction can be used together.
>>> kiwi = Kiwi(typos='basic_with_continual')
>>> kiwi.tokenize('웨 지가캤니?')
[Token(form='왜', tag='MAG', start=0, len=1),
 Token(form='지각', tag='NNG', start=2, len=3),
 Token(form='하', tag='XSV', start=4, len=1),
 Token(form='었', tag='EP', start=4, len=1),
 Token(form='니', tag='EC', start=5, len=1),
 Token(form='?', tag='SF', start=6, len=1)]

# Since version 0.17.0, words containing spaces can be added to the user dictionary.
>>> kiwi = Kiwi()
# Register the word '대학생 선교회'.
>>> kiwi.add_user_word('대학생 선교회', 'NNP')
True

# Text in exactly the registered form
# is, of course, analyzed correctly.
>>> kiwi.tokenize('대학생 선교회에서')
[Token(form='대학생 선교회', tag='NNP', start=0, len=7),
 Token(form='에서', tag='JKB', start=7, len=2)]

# In addition, the form without the space also matches.
>>> kiwi.tokenize('대학생선교회에서')
[Token(form='대학생 선교회', tag='NNP', start=0, len=6),
 Token(form='에서', tag='JKB', start=6, len=2)]

# Tab and newline characters in the gap also match.
# A run of consecutive whitespace characters is treated the same as a single space.
>>> kiwi.tokenize('대학생 \t \n 선교회에서')
[Token(form='대학생 선교회', tag='NNP', start=0, len=11),
 Token(form='에서', tag='JKB', start=11, len=2)]

# However, whitespace at a point where the registered word
# had none prevents a match.
>>> kiwi.tokenize('대학 생선 교회에서')
[Token(form='대학', tag='NNG', start=0, len=2),
 Token(form='생선', tag='NNG', start=3, len=2),
 Token(form='교회', tag='NNG', start=6, len=2),
 Token(form='에서', tag='JKB', start=8, len=2)]

# Setting space_tolerance to 2
# allows up to two misplaced spaces,
# so '대학 생선 교회' now matches '대학생 선교회' as well.
>>> kiwi.space_tolerance = 2
>>> kiwi.tokenize('대학 생선 교회에서')
[Token(form='대학생 선교회', tag='NNP', start=0, len=8),
 Token(form='에서', tag='JKB', start=8, len=2)]
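
The len values reported for the space-containing word (7, 6 and 11) can be reproduced with a small standard-library sketch. This is not how Kiwi matches dictionary words internally. `matched_span_length` is a hypothetical helper that lets each space of the registered word match any run of whitespace, including none. It does not model space_tolerance.

```python
import re

def matched_span_length(word, text):
    """Return how many characters at the start of `text` match `word`
    when every space in `word` may match zero or more whitespace
    characters; return None when there is no match.

    Hypothetical helper for illustration only, not part of kiwipiepy.
    """
    # Each space of the registered word becomes \s*; the remaining
    # pieces are escaped so that they match literally.
    pattern = r"\s*".join(re.escape(part) for part in word.split(" "))
    m = re.match(pattern, text)
    return len(m.group(0)) if m else None

word = "대학생 선교회"
print(matched_span_length(word, "대학생 선교회에서"))        # 7
print(matched_span_length(word, "대학생선교회에서"))         # 6
print(matched_span_length(word, "대학생 \t \n 선교회에서"))  # 11
print(matched_span_length(word, "대학 생선 교회에서"))       # None
```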

Getting started

If the kiwipiepy package was installed successfully, importing the package and creating a Kiwi object as follows raises no error.

from kiwipiepy import Kiwi, Match
kiwi = Kiwi()

The Kiwi constructor is as follows.

Kiwi(num_workers=0, model_path=None, load_default_dict=True, integrate_allomorph=True, model_type='knlm', typos=None, typo_cost_threshold=2.5)
  • num_workers: when 2 or greater, word extraction and morphological analysis use multiple cores and run somewhat faster.
    When 1, only a single core is used. When 0, all cores available in the current environment are used.
    The default is 0.
  • model_path: path to the morphological analysis model. When omitted, the model path is taken from the kiwipiepy_model package.
  • load_default_dict: loads the additional dictionary, which consists of Wikipedia article titles. Loading and analysis take slightly longer with it, but a wider range of proper nouns is recognized. Set this to False to keep unwanted proper nouns out of the results.
  • integrate_allomorph: automatically merges allomorphs of endings, such as '아/어' and '았/었', that are the same morpheme but take different forms depending on the phonological environment.
  • model_type: the language model used for morphological analysis, either 'knlm' or 'sbg'. 'sbg' is relatively slow but can capture relations between distant morphemes.
  • typos: corrects simple typos during morphological analysis. When set to None, no correction is performed.
  • typo_cost_threshold: the maximum typo cost up to which a correction is allowed.

A Kiwi object can perform three broad kinds of tasks.

  • Extracting unregistered words from a corpus
  • Managing the user dictionary
  • Morphological analysis

Extracting unregistered words from a corpus

This feature was added in Kiwi 0.5. It identifies patterns of frequently occurring strings and extracts the strings that are likely to be words. The basic idea follows the Word Extraction technique of https://github.com/lovit/soynlp, combined with a string-based noun probability so that only words predicted to be nouns are extracted.

Kiwi provides the following two methods for extracting unregistered words.

Kiwi.extract_words(texts, min_cnt, max_word_len, min_score)
Kiwi.extract_add_words(texts, min_cnt, max_word_len, min_score, pos_score)
extract_words(texts, min_cnt=10, max_word_len=10, min_score=0.25, pos_score=-3.0, lm_filter=True)
  • texts: the text to analyze, given as an Iterable[str]. See the example below for details.
  • min_cnt: the minimum number of times a word must occur in the input text to be extracted. The larger the input text, the higher this value should be.
  • max_word_len: the maximum length of a word to extract. Setting this too high makes the word scan take longer, so adjust it appropriately.
  • min_score: the minimum word score of a word to extract. Lowering it makes it more likely that non-words are extracted, while raising it reduces the number of extracted words, so choose a suitable value. The default is 0.25.
  • pos_score: the minimum noun score of a word to extract. Lowering it makes it more likely that non-nouns are extracted, while raising it reduces the number of extracted nouns. The default is -3.
  • lm_filter: whether to filter candidates using part-of-speech information and the language model.
# When a list of str is given as input
inputs = list(open('test.txt', encoding='utf-8'))
kiwi.extract_words(inputs, min_cnt=10, max_word_len=10, min_score=0.25)

'''
The code above stores the whole input in a list beforehand,
so it can consume a lot of memory when test.txt is large.

To read only the needed part of the file at a time (streaming),
use the following instead.
'''

class IterableTextFile:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Use self.path here; a bare `path` is not defined inside this method.
        yield from open(self.path, encoding='utf-8')

kiwi.extract_words(IterableTextFile('test.txt'), min_cnt=10, max_word_len=10, min_score=0.25)
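
A wrapper class like the one above differs from a bare generator in that it can be iterated more than once, because every call to `__iter__` reopens the file. The sketch below shows that property without calling kiwipiepy. Whether `extract_words` actually makes more than one pass over `texts` is not stated in this document and is only an assumption here. The temporary file and its contents are made up for the demonstration.

```python
import os
import tempfile

class IterableTextFile:
    """Re-iterable view over the lines of a UTF-8 text file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every iteration so that each pass starts
        # from the first line; the handle is closed when the pass ends.
        with open(self.path, encoding='utf-8') as f:
            yield from f

# Write a temporary two-line file for the demonstration.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('첫 번째 줄\n두 번째 줄\n')
    tmp_path = tmp.name

texts = IterableTextFile(tmp_path)
first_pass = list(texts)
second_pass = list(texts)   # an exhausted generator would yield nothing here
os.remove(tmp_path)

print(first_pass == second_pass)
print(len(first_pass))
```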

extract_add_words(texts, min_cnt=10, max_word_len=10, min_score=0.25, pos_score=-3, lm_filter=True)

Like extract_words, this method extracts only words that are nouns. In addition, it automatically registers the extracted noun candidates in the user dictionary as NNP so that they can be used in morphological analysis. If you do not use this method, you must register the extracted words in the user dictionary yourself with the add_user_word method.


Managing the user dictionary

To analyze a word that is not in the built-in dictionary correctly, the word must be registered in the user dictionary. This can happen automatically through extract_add_words, or words can be added by hand. The following methods manage the user dictionary.

Kiwi.add_user_word(word, tag, score, orig_word=None)
Kiwi.add_pre_analyzed_word(form, analyzed, score)
Kiwi.add_rule(tag, replacer, score)
Kiwi.add_re_rule(tag, pattern, repl, score)
Kiwi.load_user_dictionary(user_dict_path)
add_user_word(word, tag='NNP', score=0.0, orig_word=None)

Registers a new morpheme in the user dictionary.

  • word: the surface form of the morpheme to register. Originally only strings without whitespace could be registered; as the quick examples show, words containing spaces are accepted since version 0.17.0.
  • tag: the part-of-speech tag of the morpheme to register. The default is NNP (proper noun).
  • score: the score of the morpheme to register. When the same surface form can be analyzed in several ways, a larger value gives this morpheme higher priority.
  • orig_word: when the morpheme being added is a variant of an existing morpheme, pass the original morpheme here. It can be omitted otherwise. If given, a morpheme with the orig_word/tag combination must already exist in the dictionary; otherwise a ValueError is raised. When an original morpheme exists, specifying orig_word gives more accurate analysis results.

Returns True if the morpheme was inserted, and False if insertion failed because the same morpheme already exists.


add_pre_analyzed_word(form, analyzed, score=0.0)

Registers a pre-analyzed form in the user dictionary. This steers the analyzer toward the analysis you want for a particular form.

  • form: the surface form to pre-analyze.
  • analyzed: the morphological analysis of form. It must be an Iterable of tuples shaped (form, tag) or (form, tag, start, end). Every morpheme specified here must already exist in the dictionary; otherwise a ValueError is raised.
  • score: the weight of the morpheme sequence being added. When several morpheme combinations fit the form, the one with the higher score takes priority.

Returns True if the insertion succeeded, and False if it failed because the same form already exists.

This method is useful for adding irregular analyses to the analyzer. For example, the correct past tense of the verb 사귀다 is 사귀었다, but it is often misspelled as 사겼다. This method can make 사겼다 analyze correctly as 사귀/VV + 었/EP + 다/EF.

kiwi.add_pre_analyzed_word('사겼다', ['사귀/VV', '었/EP', '다/EF'], -3)
kiwi.add_pre_analyzed_word('사겼다', [('사귀', 'VV', 0, 2), ('었', 'EP', 1, 2), ('다', 'EF', 2, 3)], -3)

The latter form states exactly which part of the original string each morpheme covers, so that the start, end and length of those morphemes come out correctly in Kiwi's results.
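
To see what the (start, end) pairs mean, the sketch below slices the surface form with them and shifts them by the offset at which the form begins in a longer text. `surface_spans` is a hypothetical helper written for this illustration; it does not call kiwipiepy. The offset of 6 is assumed from the quick example, where the pre-analyzed form starts at character 6.

```python
def surface_spans(form, analyzed, offset=0):
    """For each (morpheme, tag, start, end) entry, return the substring of
    `form` it covers together with the (start, len) pair it would have
    when `form` begins at character `offset` of a longer text.

    Hypothetical helper for illustration only, not part of kiwipiepy.
    """
    rows = []
    for morph, tag, start, end in analyzed:
        if not (0 <= start <= end <= len(form)):
            raise ValueError(f"span {(start, end)} lies outside {form!r}")
        rows.append((morph, tag, form[start:end], offset + start, end - start))
    return rows

form = '사겼다'
analyzed = [('사귀', 'VV', 0, 2), ('었', 'EP', 1, 2), ('다', 'EF', 2, 3)]

for row in surface_spans(form, analyzed, offset=6):
    print(row)
# ('사귀', 'VV', '사겼', 6, 2)
# ('었', 'EP', '겼', 7, 1)
# ('다', 'EF', '다', 8, 1)
```

The spans may overlap: 사귀 and 었 share the character 겼, because the contraction merges the end of the stem with the past-tense marker.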


add_rule(tag, replacer, score)

Adds, in bulk, morphemes altered by a rule.

  • tag: the part-of-speech tag of the morphemes to add
  • replacer: the rule that alters a morpheme. It must be a Callable that takes the original morpheme as a str and returns the altered morpheme as a str. If it returns a value identical to its input, that result is ignored.
  • score: the weight of the altered morphemes being added. When several morpheme combinations fit the form, the one with the higher score takes priority.

Returns the list of morphemes newly created by replacer.
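
A replacer is an ordinary function from str to str. The sketch below defines one and imitates the "ignore unchanged results" behaviour on a hand-written list, so it runs without kiwipiepy. `apply_replacer` is a hypothetical stand-in for what add_rule does with the callable, and the list of endings is made up for the illustration.

```python
def yo_to_yeom(morph: str) -> str:
    """Replace a final '요' with '염'; return the input unchanged otherwise."""
    if morph.endswith('요'):
        return morph[:-1] + '염'
    return morph

def apply_replacer(morphs, replacer):
    """Hypothetical stand-in for the bulk step: keep only outputs that
    differ from their input, mirroring the rule that identical results
    are ignored."""
    created = []
    for morph in morphs:
        altered = replacer(morph)
        if altered != morph:
            created.append(altered)
    return created

print(apply_replacer(['어요', '지요', '다', '네요'], yo_to_yeom))
# ['어염', '지염', '네염']
```

With a real Kiwi object, the same function would be passed following the signature above, for example as `kiwi.add_rule('EF', yo_to_yeom, -3.0)`.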


add_re_rule(tag, pattern, repl, score)

Does the same job as the add_rule method, but expresses the alteration rule as a regular expression.

  • tag: the part-of-speech tag of the morphemes to add
  • pattern: the rule selecting what to alter in a morpheme. It must be a regular expression that re.compile can compile.
  • repl: the text matched by pattern is replaced with this value. It is the same as the repl argument of re.sub in Python3's regular expression module.
  • score: the weight of the altered morphemes being added. When several morpheme combinations fit the form, the one with the higher score takes priority.

Returns the list of morphemes newly created by pattern and repl.

This method is very handy for adding, in bulk, allomorphs produced by a regular rule. For example, to register all sentence-final endings in which -요 has been replaced by -염 (먹어염, 뛰었구염, 배불러염 and so on), run the following.

kiwi.add_re_rule('EF', r'์š”$', r'์—ผ', -3.0)

When registering such allomorphs in large numbers, a score of -3 or lower is recommended so that the allomorphs do not take priority over the original forms in the analysis.
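
Because repl behaves like the repl argument of re.sub, a pattern can be previewed with the standard library before it is registered. The sketch below applies the rule to a hand-written list of strings and shows why the `$` anchor matters. The list is made up for the illustration and kiwipiepy is not called.

```python
import re

# Hand-written sample strings; '요구' contains '요' but does not end in it.
endings = ['어요', '지요', '네요', '다', '요구']

# With the $ anchor, only a '요' at the very end is replaced.
anchored = [re.sub(r'요$', r'염', e) for e in endings]

# Without the anchor, any '요' is replaced, including the one in '요구'.
unanchored = [re.sub(r'요', r'염', e) for e in endings]

print(anchored)    # ['어염', '지염', '네염', '다', '요구']
print(unanchored)  # ['어염', '지염', '네염', '다', '염구']
```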


load_user_dictionary(user_dict_path)

Reads a user dictionary from a file. The user dictionary file must be encoded in UTF-8 and laid out as follows. Fields must be separated by a tab character (\t), and the word score may be omitted.

# Lines starting with # are treated as comments.
# Fields are separated by a Tab (\t) character.
#
# <Adding a single morpheme>
# (form) \t (POS tag) \t (score)
# * (score) defaults to 0 when omitted.
키위	NNP	-5.0
#
# <Adding an allomorph of an existing morpheme>
# (allomorph) \t (original morpheme/POS tag) \t (score)
# * (score) defaults to 0 when omitted.
기위	키위/NNG	-3.0
#
# <Adding a pre-analyzed form>
# (form) \t (original morpheme/POS tag + original morpheme/POS tag + ...) \t (score)
# * (score) defaults to 0 when omitted.
사겼다	사귀/VV + 었/EP + 다/EF	-1.0
#
# Multi-word forms containing spaces cannot currently be registered.

If the dictionary file is read successfully, the number of morphemes newly added from it is returned.

For a real example, see the default dictionary file bundled with Kiwi.
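
The file layout described above is simple enough to check with a few lines of standard-library Python before handing a file to load_user_dictionary. `parse_user_dict` below is a hypothetical validator written for this illustration, not kiwipiepy's own loader. It assumes that morphemes contain no '+' character.

```python
def parse_user_dict(lines):
    """Parse user-dictionary lines into (form, analysis, score) tuples.

    `analysis` is the bare tag string for a single morpheme, or a list of
    (morpheme, tag) pairs when the second field uses the morpheme/tag form.
    Hypothetical helper for illustration only, not part of kiwipiepy.
    """
    entries = []
    for raw in lines:
        line = raw.rstrip('\n')
        if not line.strip() or line.startswith('#'):
            continue  # skip blank lines and comment lines
        fields = line.split('\t')
        if len(fields) < 2:
            raise ValueError(f'expected at least 2 tab-separated fields: {line!r}')
        form, second = fields[0], fields[1]
        # The score field is optional and defaults to 0.
        score = float(fields[2]) if len(fields) > 2 and fields[2].strip() else 0.0
        if '/' in second:
            analysis = [tuple(part.strip().rsplit('/', 1))
                        for part in second.split('+')]
        else:
            analysis = second.strip()
        entries.append((form, analysis, score))
    return entries

sample = [
    '# comment line\n',
    '키위\tNNP\t-5.0\n',
    '기위\t키위/NNG\t-3.0\n',
    '사겼다\t사귀/VV + 었/EP + 다/EF\t-1.0\n',
    '김갑갑\tNNP\n',
]
for entry in parse_user_dict(sample):
    print(entry)
```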


Analysis

Once a Kiwi object has been created and words have been added to the user dictionary, the following methods perform morphological analysis, sentence splitting, spacing correction, sentence restoration and similar tasks.

Kiwi.tokenize(text, match_option, normalize_coda=False, z_coda=True, split_complex=False, blocklist=None)
Kiwi.analyze(text, top_n, match_option, normalize_coda=False, z_coda=True, split_complex=False, blocklist=None)
Kiwi.split_into_sents(text, match_options=Match.ALL, normalize_coda=False, z_coda=True, split_complex=False, blocklist=None, return_tokens=False)
Kiwi.glue(text_chunks, insert_new_lines=None, return_space_insertions=False)
Kiwi.space(text, reset_whitespace=False)
Kiwi.join(morphs, lm_search=True)
Kiwi.template(format_str, cache=True)
tokenize(text, match_option=Match.ALL, normalize_coda=False)

Performs morphological analysis on the input text and returns the result in a simple form: a list of Token objects, as shown below.

>> kiwi.tokenize('테스트입니다.')
[Token(form='테스트', tag='NNG', start=0, len=3), Token(form='이', tag='VCP', start=3, len=1), Token(form='ᆸ니다', tag='EF', start=4, len=2)]

normalize_coda addresses analysis failures that occur when a run of bare consonants such as ㅋㅋㅋ or ㅎㅎㅎ follows a word and one of them has been absorbed into the preceding syllable as its final consonant (coda).

>> kiwi.tokenize("안 먹었엌ㅋㅋ", normalize_coda=False)
[Token(form='안', tag='NNP', start=0, len=1), 
 Token(form='먹었엌', tag='NNP', start=2, len=3), 
 Token(form='ㅋㅋ', tag='SW', start=5, len=2)]
>> kiwi.tokenize("안 먹었엌ㅋㅋ", normalize_coda=True)
[Token(form='안', tag='MAG', start=0, len=1), 
 Token(form='먹', tag='VV', start=2, len=1), 
 Token(form='었', tag='EP', start=3, len=1), 
 Token(form='어', tag='EF', start=4, len=1), 
 Token(form='ㅋㅋㅋ', tag='SW', start=5, len=2)]
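
To see what "absorbed as a final consonant" means at the character level, the sketch below splits a precomposed Hangul syllable into its coda-less form and its coda using standard Unicode arithmetic. This only illustrates the phenomenon; it is not Kiwi's normalization algorithm, and `split_coda` is a hypothetical helper.

```python
from typing import Tuple

# Compatibility jamo for the 27 possible final consonants, in Unicode order.
# Index 0 (no final consonant) is represented by the empty string.
CODAS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
HANGUL_BASE = ord("가")
HANGUL_LAST = ord("힣")


def split_coda(syllable: str) -> Tuple[str, str]:
    """Split one precomposed Hangul syllable into (syllable without coda, coda)."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        # Not a precomposed Hangul syllable: nothing to split.
        return syllable, ""
    # Each (initial, vowel) pair owns a block of 28 code points, one per coda.
    coda_index = (code - HANGUL_BASE) % 28
    return chr(code - coda_index), CODAS[coda_index]


print(split_coda("엌"))  # the last syllable of the word in the example above
print(split_coda("어"))
```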

analyze(text, top_n=1, match_option=Match.ALL, normalize_coda=False)

Performs morphological analysis on the input text and returns the top_n candidate analyses in detail. The return value has the following structure.

[(analysis 1, score), (analysis 2, score), ... ]

Each analysis is a list of Token objects, as in tokenize.

An actual example:

>> kiwi.analyze('테스트입니다.', top_n=5)
[([Token(form='테스트', tag='NNG', start=0, len=3), Token(form='이', tag='VCP', start=3, len=1), Token(form='ᆸ니다', tag='EF', start=4, len=2)], -25.217018127441406), 
 ([Token(form='테스트입니', tag='NNG', start=0, len=5), Token(form='다', tag='EC', start=5, len=1)], -40.741905212402344), 
 ([Token(form='테스트입니', tag='NNG', start=0, len=5), Token(form='다', tag='MAG', start=5, len=1)], -41.81024932861328), 
 ([Token(form='테스트입니', tag='NNG', start=0, len=5), Token(form='다', tag='EF', start=5, len=1)], -42.300254821777344), 
 ([Token(form='테스트', tag='NNG', start=0, len=3), Token(form='입', tag='NNG', start=3, len=1), Token(form='니다', tag='EF', start=4, len=2)], -45.86524200439453)
]
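
The nested return structure can be unpacked as sketched below. To keep the snippet runnable without the library, `Token` here is a hypothetical stand-in that mirrors only the fields printed above, and the candidate list is a hand-copied subset of that output with rounded scores.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    """Hypothetical stand-in mirroring the fields shown in the output above."""
    form: str
    tag: str
    start: int
    len: int


Analysis = Tuple[List[Token], float]

# Hand-copied subset of the output above (scores rounded for brevity).
candidates: List[Analysis] = [
    ([Token("테스트", "NNG", 0, 3), Token("이", "VCP", 3, 1), Token("ᆸ니다", "EF", 4, 2)], -25.217),
    ([Token("테스트입니", "NNG", 0, 5), Token("다", "EC", 5, 1)], -40.742),
]

# Each candidate is a (tokens, score) pair; a higher score means a better analysis.
best_tokens, best_score = max(candidates, key=lambda pair: pair[1])
print("/".join(f"{t.form}:{t.tag}" for t in best_tokens), best_score)
```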

If text is an iterable of str, the inputs are processed in parallel. The return value is then an iterable of the results that a single text would produce. Work is spread across multiple threads according to the num_workers argument given when Kiwi() was created. Results are returned in the same order as the inputs.

>> result_iter = kiwi.analyze(['테스트입니다.', '테스트가 아닙니다.', '사실 맞습니다.'])
>> next(result_iter)
[([Token(form='테스트', tag='NNG', start=0, len=3), Token(form='이', tag='VCP', start=3, len=1), Token(form='ᆸ니다', tag='EF', start=4, len=2), Token(form='.', tag='SF', start=6, len=1)], -20.441545486450195)]
>> next(result_iter)
[([Token(form='테스트', tag='NNG', start=0, len=3), Token(form='가', tag='JKC', start=3, len=1), Token(form='아니', tag='VCN', start=5, len=2), Token(form='ᆸ니다', tag='EF', start=7, len=2), Token(form='.', tag='SF', start=9, len=1)], -30.23870277404785)]
>> next(result_iter)
[([Token(form='사실', tag='MAG', start=0, len=2), Token(form='맞', tag='VV', start=3, len=1), Token(form='습니다', tag='EF', start=4, len=3), Token(form='.', tag='SF', start=7, len=1)], -22.232769012451172)]
>> next(result_iter)
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
StopIteration

A for loop makes parallel processing simpler and more convenient, which is useful when analyzing large amounts of text.

>> for result in kiwi.analyze(long_list_of_text):
      tokens, score = result[0]
      print(tokens)

When text is given as an iterable of str, the iterable may be read after the call to analyze has returned. If the argument is tied to another I/O resource (such as a file), do not close that resource until all analysis has finished.

>> file = open('long_text.txt', encoding='utf-8')
>> result_iter = kiwi.analyze(file)
>> file.close() # the file is closed here
>> next(result_iter) # tries to read the next text to analyze from the closed file
ValueError: I/O operation on closed file.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: <built-in function next> returned a result with an error set
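
The same pitfall, and one way around it, can be reproduced without the library. In this sketch `lazy_analyze` is a hypothetical stand-in for any analyzer that reads its input lazily, and `io.StringIO` plays the role of the file; the safe variant consumes every result before the resource is closed.

```python
import io
from typing import Iterable, Iterator, List


def lazy_analyze(lines: Iterable[str]) -> Iterator[List[str]]:
    """Hypothetical stand-in for a lazy analyzer: reads input only when iterated."""
    for line in lines:
        yield line.split()


# Unsafe: the source is closed before any result has been consumed.
source = io.StringIO("첫 번째 줄\n두 번째 줄\n")
results = lazy_analyze(source)
source.close()
failed = False
try:
    next(results)
except ValueError as exc:
    failed = True
    print("failed as expected:", exc)

# Safe: consume every result while the source is still open.
with io.StringIO("첫 번째 줄\n두 번째 줄\n") as source:
    collected = list(lazy_analyze(source))
print(collected)
```

With a real file, the equivalent is to keep the loop over the results inside the `with open(...)` block.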

split_into_sents(text, match_options=Match.ALL, normalize_coda=False, return_tokens=False)

Splits the input text into sentences and returns them. Because this method uses morphological analysis internally while splitting, it can also be used to obtain the analysis results along with the sentence boundaries. Setting `return_tokens` to `True` returns the morphological analysis results together with the split sentences.

>> kiwi.split_into_sents("여러 문장으로 구성된 텍스트네 이걸 분리해줘")
[Sentence(text='여러 문장으로 구성된 텍스트네', start=0, end=16, tokens=None),
 Sentence(text='이걸 분리해줘', start=17, end=24, tokens=None)]
>> kiwi.split_into_sents("여러 문장으로 구성된 텍스트네 이걸 분리해줘", return_tokens=True)
[Sentence(text='여러 문장으로 구성된 텍스트네', start=0, end=16, tokens=[
  Token(form='여러', tag='MM', start=0, len=2), 
  Token(form='문장', tag='NNG', start=3, len=2), 
  Token(form='으로', tag='JKB', start=5, len=2), 
  Token(form='구성', tag='NNG', start=8, len=2), 
  Token(form='되', tag='XSV', start=10, len=1), 
  Token(form='ᆫ', tag='ETM', start=11, len=0), 
  Token(form='텍스트', tag='NNG', start=12, len=3), 
  Token(form='이', tag='VCP', start=15, len=1), 
  Token(form='네', tag='EF', start=15, len=1)
 ]),
 Sentence(text='이걸 분리해줘', start=17, end=24, tokens=[
  Token(form='이거', tag='NP', start=17, len=2), 
  Token(form='ᆯ', tag='JKO', start=19, len=0), 
  Token(form='분리', tag='NNG', start=20, len=2), 
  Token(form='하', tag='XSV', start=22, len=1), 
  Token(form='어', tag='EC', start=22, len=1), 
  Token(form='주', tag='VX', start=23, len=1), 
  Token(form='어', tag='EF', start=23, len=1)
 ])]
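
The start and end values in the output above line up with Python slice bounds on the original string. The sketch below checks this on the example data; `Sentence` is a hypothetical stand-in that mirrors only the printed fields, so the snippet runs without the library.

```python
from typing import List, NamedTuple, Optional


class Sentence(NamedTuple):
    """Hypothetical stand-in mirroring the fields shown in the output above."""
    text: str
    start: int
    end: int
    tokens: Optional[list] = None


source = "여러 문장으로 구성된 텍스트네 이걸 분리해줘"
sentences: List[Sentence] = [
    Sentence("여러 문장으로 구성된 텍스트네", 0, 16),
    Sentence("이걸 분리해줘", 17, 24),
]

for sent in sentences:
    # start/end behave like slice bounds: start is inclusive, end is exclusive.
    assert source[sent.start:sent.end] == sent.text
    print(sent.start, sent.end, repr(source[sent.start:sent.end]))
```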

glue(text_chunks, return_space_insertions=False)

Merges several text chunks into one, inserting spaces between them where the context calls for it. This is handy for stitching together text that contains forced line breaks, such as OCR output or text copied from a PDF.
  • text_chunks: the list of text chunks to merge.
  • return_space_insertions: if True, also returns a List[bool] indicating whether a space was inserted at each join.

>> kiwi.glue([
    "그러나  알고보니 그 봉",
    "지 안에 있던 것은 바로",
    "레몬이었던 것이다."])
"그러나  알고보니 그 봉지 안에 있던 것은 바로 레몬이었던 것이다."

>> kiwi.glue([
    "그러나  알고보니 그 봉",
    "지 안에 있던 것은 바로",
    "레몬이었던 것이다."], return_space_insertions=True)
("그러나  알고보니 그 봉지 안에 있던 것은 바로 레몬이었던 것이다.", [False, True])
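
The relationship between the returned flags and the merged string can be sketched as follows. The assumption here, based only on the example above, is that flag i tells whether a space was inserted between chunk i and chunk i+1; `apply_space_insertions` is a hypothetical helper, not part of the library.

```python
from typing import List, Sequence


def apply_space_insertions(chunks: Sequence[str], insertions: Sequence[bool]) -> str:
    """Join chunks, adding a space at boundary i when insertions[i] is True."""
    if len(insertions) != len(chunks) - 1:
        raise ValueError("expected one flag per boundary between chunks")
    parts: List[str] = [chunks[0]]
    for chunk, insert_space in zip(chunks[1:], insertions):
        parts.append((" " if insert_space else "") + chunk)
    return "".join(parts)


chunks = [
    "그러나  알고보니 그 봉",
    "지 안에 있던 것은 바로",
    "레몬이었던 것이다.",
]
merged = apply_space_insertions(chunks, [False, True])
print(merged)
```

Keeping the flags makes it possible to map positions in the merged text back to the original chunks.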

space(text, reset_whitespace=False)

Corrects the spacing of the input text and returns the result.
  • text: the string to analyze. A single str is processed on a single thread; an Iterable of str is distributed across multiple threads.
  • reset_whitespace: if True, the method also aggressively joins parts that are already separated by spaces. The default is False, in which case correction focuses on inserting spaces between words that are run together.

The spacing correction in this method is based on morphological analysis, so results may be inaccurate when a space falls in the middle of a morpheme. In that case, results can be improved either by adjusting Kiwi.space_tolerance so that spaces inside morphemes are ignored, or by setting reset_whitespace=True so that the existing spaces are ignored altogether.

>> kiwi.space("띄어쓰기없이작성된텍스트네이걸교정해줘")
"띄어쓰기 없이 작성된 텍스트네 이걸 교정해 줘."
>> kiwi.space("띄 어 쓰 기 문 제 가 있 습 니 다")
"띄어 쓰기 문 제 가 있 습 니 다"
>> kiwi.space_tolerance = 2 # allow up to 2 spaces inside a morpheme
>> kiwi.space("띄 어 쓰 기 문 제 가 있 습 니 다")
"띄어 쓰기 문제가 있습니다"
>> kiwi.space("띄 어 쓰 기 문 제 가 있 습 니 다", reset_whitespace=True) # ignore all existing spaces
"띄어쓰기 문제가 있습니다"

join(morphs, lm_search=True)

Combines morphemes and restores them into a sentence. Particles and endings are changed to the form that fits the preceding morpheme.
  • morphs: the list of morphemes to combine. Each must be a Token obtained from Kiwi.tokenize or a tuple of (form, POS tag).
  • lm_search: when a morpheme is ambiguous and could be restored in two or more ways, True selects the best form through a language model search. False skips the search, which makes restoration faster.

When combining morphemes, this method inserts spaces using rules similar to those of space. Morphemes themselves carry no spacing information, so analyzing a text with tokenize and recombining it with join does not necessarily reproduce the original text exactly.

>> kiwi.join([('덥', 'VA'), ('어', 'EC')])
'더워'
>> tokens = kiwi.tokenize("분석된결과를 다시합칠수있다!")
# Restore the analysis result.
# Spaces are inserted by rule, so the original text is not reproduced exactly.
>> kiwi.join(tokens)
'분석된 결과를 다시 합칠 수 있다!'
>> tokens[3]
Token(form='결과', tag='NNG', start=4, len=2)
>> tokens[3] = ('내용', 'NNG') # replace the 4th morpheme: 결과 -> 내용
>> kiwi.join(tokens) # joining again shows 결과를 -> 내용을
'분석된 내용을 다시 합칠 수 있다!'

# When it is ambiguous whether a conjugation is irregular, lm_search=True picks the best candidate from context.
>> kiwi.join([('길', 'NNG'), ('을', 'JKO'), ('묻', 'VV'), ('어요', 'EF')])
'길을 물어요'
>> kiwi.join([('흙', 'NNG'), ('이', 'JKS'), ('묻', 'VV'), ('어요', 'EF')])
'흙이 묻어요'
# With lm_search=False, no search is performed.
>> kiwi.join([('길', 'NNG'), ('을', 'JKO'), ('묻', 'VV'), ('어요', 'EF')], lm_search=False)
'길을 묻어요'
>> kiwi.join([('흙', 'NNG'), ('이', 'JKS'), ('묻', 'VV'), ('어요', 'EF')], lm_search=False)
'흙이 묻어요'
# Appending -R (regular) or -I (irregular) to a verb/adjective tag states the conjugation explicitly.
>> kiwi.join([('묻', 'VV-R'), ('어요', 'EF')])
'묻어요'
>> kiwi.join([('묻', 'VV-I'), ('어요', 'EF')])
'물어요'

# Since version 0.15.2, a third tuple element can specify spacing.
# True forces a space; False forces the morpheme to attach without one.
>> kiwi.join([('길', 'NNG'), ('을', 'JKO', True), ('묻', 'VV'), ('어요', 'EF')])
'길 을 물어요'
>> kiwi.join([('길', 'NNG'), ('을', 'JKO'), ('묻', 'VV', False), ('어요', 'EF')])
'길을물어요'

# Example: removing the past-tense pre-final ending
>> remove_past = lambda s: kiwi.join(t for t in kiwi.tokenize(s) if t.tagged_form != '었/EP')
>> remove_past('먹었다')
'먹다'
>> remove_past('먼 길을 걸었다')
'먼 길을 걷다'
>> remove_past('전화를 걸었다.')
'전화를 걸다.'

template(format_str, cache=True) ํ˜•ํƒœ์†Œ๋“ค์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋ฌธ์žฅ์œผ๋กœ ๋ณต์›ํ•ฉ๋‹ˆ๋‹ค. ์กฐ์‚ฌ๋‚˜ ์–ด๋ฏธ๋Š” ์•ž ํ˜•ํƒœ์†Œ์— ๋งž์ถฐ ์ ์ ˆํ•œ ํ˜•ํƒœ๋กœ ๋ณ€๊ฒฝ๋ฉ๋‹ˆ๋‹ค.
  • format_str: ํ…œํ”Œ๋ฆฟ ๋ฌธ์ž์—ด์ž…๋‹ˆ๋‹ค. Python์˜ str.format(https://docs.python.org/ko/3/library/string.html#formatstrings )๊ณผ ๋™์ผํ•œ ๋ฌธ๋ฒ•์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • cache: ํ…œํ”Œ๋ฆฟ์˜ ์บ์‹œ ์—ฌ๋ถ€์ž…๋‹ˆ๋‹ค.

์ด ๋ฉ”์†Œ๋“œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด Kiwi.join์˜ ํ˜•ํƒœ์†Œ ๊ฒฐํ•ฉ ๊ธฐ๋Šฅ์„ ๋”์šฑ ๊ฐ„ํŽธํ•˜๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋„์™€์ค๋‹ˆ๋‹ค.

# ๋นˆ์นธ์€ {}๋กœ ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค. 
# ์ด ์ž๋ฆฌ์— ํ˜•ํƒœ์†Œ ํ˜น์€ ๊ธฐํƒ€ Python ๊ฐ์ฒด๊ฐ€ ๋“ค์–ด๊ฐ€์„œ ๋ฌธ์ž์—ด์„ ์™„์„ฑ์‹œํ‚ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
>>> tpl = kiwi.template("{}๊ฐ€ {}์„ {}์—ˆ๋‹ค.")

# template ๊ฐ์ฒด๋Š” format ๋ฉ”์†Œ๋“œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. 
# ์ด ๋ฉ”์†Œ๋“œ๋ฅผ ํ†ตํ•ด ๋นˆ ์นธ์„ ์ฑ„์šธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
# ํ˜•ํƒœ์†Œ๋Š” `kiwipiepy.Token` ํƒ€์ž…์ด๊ฑฐ๋‚˜ 
# (ํ˜•ํƒœ, ํ’ˆ์‚ฌ) ํ˜น์€ (ํ˜•ํƒœ, ํ’ˆ์‚ฌ, ์™ผ์ชฝ ๋„์–ด์“ฐ๊ธฐ ์œ ๋ฌด)๋กœ ๊ตฌ์„ฑ๋œ tuple ํƒ€์ž…์ด์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
>>> tpl.format(("๋‚˜", "NP"), ("๊ณต๋ถ€", "NNG"), ("ํ•˜", "VV"))
'๋‚ด๊ฐ€ ๊ณต๋ถ€๋ฅผ ํ–ˆ๋‹ค.'

>>> tpl.format(("๋„ˆ", "NP"), ("๋ฐฅ", "NNG"), ("๋จน", "VV"))
'๋„ค๊ฐ€ ๋ฐฅ์„ ๋จน์—ˆ๋‹ค.'

>>> tpl.format(("์šฐ๋ฆฌ", "NP"), ("๊ธธ", "NNG"), ("๋ฌป", "VV-I"))
'์šฐ๋ฆฌ๊ฐ€ ๊ธธ์„ ๋ฌผ์—ˆ๋‹ค.'

# ํ˜•ํƒœ์†Œ๊ฐ€ ์•„๋‹Œ Python ๊ฐ์ฒด๊ฐ€ ์ž…๋ ฅ๋˜๋Š” ๊ฒฝ์šฐ `str.format`๊ณผ ๋™์ผํ•˜๊ฒŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค.
>>> tpl.format(5, "str", {"dict":"dict"})
"5๊ฐ€ str๋ฅผ {'dict': 'dict'}์—ˆ๋‹ค."

# ์ž…๋ ฅํ•œ ๊ฐ์ฒด๊ฐ€ ํ˜•ํƒœ์†Œ๊ฐ€ ์•„๋‹Œ Python ๊ฐ์ฒด๋กœ ์ฒ˜๋ฆฌ๋˜๊ธธ ์›ํ•˜๋Š” ๊ฒฝ์šฐ !s ๋ณ€ํ™˜ ํ”Œ๋ž˜๊ทธ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
>>> tpl = kiwi.template("{!s}๊ฐ€ {}์„ {}์—ˆ๋‹ค.")
>>> tpl.format(("๋‚˜", "NP"), ("๊ณต๋ถ€", "NNG"), ("ํ•˜", "VV"))
"('๋‚˜', 'NP')๊ฐ€ ๊ณต๋ถ€๋ฅผ ํ–ˆ๋‹ค."

# Python ๊ฐ์ฒด์— ๋Œ€ํ•ด์„œ๋Š” `str.format`๊ณผ ๋™์ผํ•œ ์„œ์‹ ์ง€์ •์ž๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
>>> tpl = kiwi.template("{:.5f}๊ฐ€ {!r}์„ {}์—ˆ๋‹ค.")
>>> tpl.format(5, "str", {"dict":"dict"})
"5.00000๊ฐ€ 'str'๋ฅผ {'dict': 'dict'}์—ˆ๋‹ค."

# ์„œ์‹ ์ง€์ •์ž๊ฐ€ ์ฃผ์–ด์ง„ ์นธ์— ํ˜•ํƒœ์†Œ๋ฅผ ๋Œ€์ž…ํ•  ๊ฒฝ์šฐ ValueError๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.
>>> tpl.format(("์šฐ๋ฆฌ", "NP"), "str", ("๋ฌป", "VV-I"))
ValueError: cannot specify format specifier for Kiwi Token

# ์น˜ํ™˜ ํ•„๋“œ์— index๋‚˜ name์„ ์ง€์ •ํ•˜์—ฌ ๋Œ€์ž… ์ˆœ์„œ๋ฅผ ์„ค์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
>>> tpl = kiwi.template("{0}๊ฐ€ {obj}๋ฅผ {verb}\ใ„ด๋‹ค. {1}๋Š” {obj}๋ฅผ ์•ˆ {verb}์—ˆ๋‹ค.")
>>> tpl.format(
    [("์šฐ๋ฆฌ", "NP"), ("๋“ค", "XSN")], 
    [("๋„ˆํฌ", "NP"), ("๋“ค", "XSN")], 
    obj=("๊ธธ", "NNG"), 
    verb=("๋ฌป", "VV-I")
)
'์šฐ๋ฆฌ๋“ค์ด ๊ธธ์„ ๋ฌป๋Š”๋‹ค. ๋„ˆํฌ๋“ค์€ ๊ธธ์„ ์•ˆ ๋ฌผ์—ˆ๋‹ค.'

# ์œ„์˜ ์˜ˆ์‹œ์ฒ˜๋Ÿผ ์ข…์„ฑ ์ž์Œ์€ ํ˜ธํ™˜์šฉ ์ž๋ชจ ์ฝ”๋“œ ์•ž์— \\๋กœ ์ด์Šค์ผ€์ดํ”„๋ฅผ ์‚ฌ์šฉํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค.
# ๊ทธ๋ ‡์ง€ ์•Š์œผ๋ฉด ์ข…์„ฑ์ด ์•„๋‹Œ ์ดˆ์„ฑ์œผ๋กœ ์ธ์‹๋ฉ๋‹ˆ๋‹ค.
>>> tpl = kiwi.template("{0}๊ฐ€ {obj}๋ฅผ {verb}ใ„ด๋‹ค. {1}๋Š” {obj}๋ฅผ ์•ˆ {verb}์—ˆ๋‹ค.")
>>> tpl.format(
    [("์šฐ๋ฆฌ", "NP"), ("๋“ค", "XSN")], 
    [("๋„ˆํฌ", "NP"), ("๋“ค", "XSN")], 
    obj=("๊ธธ", "NNG"), 
    verb=("๋ฌป", "VV-I")
)
'์šฐ๋ฆฌ๋“ค์ด ๊ธธ์„ ๋ฌป แ„‚์ด๋‹ค. ๋„ˆํฌ๋“ค์€ ๊ธธ์„ ์•ˆ ๋ฌผ์—ˆ๋‹ค.'
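
For comparison, the plain `str.format` behaviour that the template mirrors for non-morpheme arguments can be checked with the standard library alone. This sketch uses only built-in Python; no Kiwi call is involved, so particles are not adjusted here.

```python
value = ("나", "NP")

# !s applies str() to the argument before it is inserted.
as_str = "{!s}".format(value)
# !r applies repr(); a format spec such as .5f controls number formatting.
with_specs = "{:.5f} / {!r} / {}".format(5, "str", {"dict": "dict"})
# Positional indexes and names can be mixed, and a field can be reused.
mixed = "{0}-{name}-{1}-{name}".format("a", "b", name="n")

print(as_str)
print(with_specs)
print(mixed)
```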

POS tags

The tagset is based on the Sejong POS tags, with some tags added or modified.

| Category | Tag | Description |
| --- | --- | --- |
| Nominals (N) | NNG | Common noun |
| | NNP | Proper noun |
| | NNB | Dependent noun |
| | NR | Numeral |
| | NP | Pronoun |
| Predicates (V) | VV | Verb |
| | VA | Adjective |
| | VX | Auxiliary predicate |
| | VCP | Positive copula (이다) |
| | VCN | Negative copula (아니다) |
| Determiner | MM | Determiner |
| Adverbs (MA) | MAG | General adverb |
| | MAJ | Conjunctive adverb |
| Interjection | IC | Interjection |
| Particles (J) | JKS | Subject case particle |
| | JKC | Complement case particle |
| | JKG | Adnominal case particle |
| | JKO | Object case particle |
| | JKB | Adverbial case particle |
| | JKV | Vocative case particle |
| | JKQ | Quotative case particle |
| | JX | Auxiliary particle |
| | JC | Conjunctive particle |
| Endings (E) | EP | Pre-final ending |
| | EF | Sentence-final ending |
| | EC | Connective ending |
| | ETN | Nominalizing ending |
| | ETM | Adnominalizing ending |
| Prefix | XPN | Nominal prefix |
| Suffixes (XS) | XSN | Noun-deriving suffix |
| | XSV | Verb-deriving suffix |
| | XSA | Adjective-deriving suffix |
| | XSM | Adverb-deriving suffix* |
| Root | XR | Root |
| Symbols, foreign words, special characters (S) | SF | Sentence-final punctuation (. ! ?) |
| | SP | Separator punctuation (, / : ;) |
| | SS | Quotation marks and brackets (' " ( ) [ ] < > { } ― ‘ ’ “ ” ≪ ≫ etc.) |
| | SSO | Opening marks among SS* |
| | SSC | Closing marks among SS* |
| | SE | Ellipsis (…) |
| | SO | Hyphen and tilde (- ~) |
| | SW | Other special characters |
| | SL | Latin letters (A-Z a-z) |
| | SH | Chinese characters (Hanja) |
| | SN | Digits (0-9) |
| | SB | Ordered list markers (가. 나. 1. 2. 가) 나) etc.)* |
| Unanalyzable | UN | Unanalyzable* |
| Web (W) | W_URL | URL* |
| | W_EMAIL | Email address* |
| | W_HASHTAG | Hashtag (#abcd)* |
| | W_MENTION | Mention (@abcd)* |
| | W_SERIAL | Serial number (phone number, bank account number, IP address, etc.)* |
| Other | Z_CODA | Appended final consonant (coda)* |
| | USER0~4 | User-defined tags* |

* A Kiwi-specific tag that is not part of the Sejong tagset.

Since version 0.12.0, the VV, VA, VX, and XSA tags may carry a suffix, -R or -I, that states whether the conjugation is irregular. -R denotes regular conjugation and -I denotes irregular conjugation.
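
When post-processing tags, it can help to separate the base tag from the conjugation suffix. The following is a minimal sketch under the rule stated above; `split_conjugation` and `CONJUGATING_TAGS` are hypothetical names, not part of the kiwipiepy API.

```python
from typing import Optional, Tuple

# Tags that, per the paragraph above, may carry a conjugation suffix.
CONJUGATING_TAGS = {"VV", "VA", "VX", "XSA"}


def split_conjugation(tag: str) -> Tuple[str, Optional[bool]]:
    """Return (base tag, irregular?); the flag is None when no suffix is present."""
    base, sep, suffix = tag.partition("-")
    if sep and base in CONJUGATING_TAGS and suffix in ("R", "I"):
        return base, suffix == "I"
    return tag, None


for tag in ("VV-I", "VA-R", "NNG", "W_URL"):
    print(tag, split_conjugation(tag))
```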

Sentence splitting

Sentence splitting has been supported experimentally since version 0.10.3, and its accuracy improved considerably in version 0.11.0. For the performance of the sentence splitting feature, see this page.

Disambiguation performance

Kiwi has been shown to achieve high accuracy in cases where a word admits several morphological analyses and context is essential for choosing among them. For disambiguation performance, see this page.

Citation

For how to cite, see Kiwi#인용하기.