pyko

Korean Text Processor


Keywords
natural, language, processing, text, korean, korean-nlp, korean-text-processing, korean-tokenizer, machine-learning, natural-language-processing, nlp, python, python3
Install
pip install pyko==0.4.3

Documentation

pyko

pyko (파이코) is a Python library for Korean text processing. It is built around the characteristics that set Korean apart in natural language processing.

Installation

The package is published on PyPI and can be installed as follows.

pip install pyko

Sejong Corpus

The Sejong corpus can be read in much the same way as with an NLTK CorpusReader. The corpus itself is available from the National Institute of Korean Language's language information sharing site (국립국어원 언어정보나눔터).

Usage example:

from pyko.reader import SejongCorpusReader

# root and fileids are placeholders for the corpus directory and its files
세종말뭉치 = SejongCorpusReader(root, fileids)
파일목록 = 세종말뭉치.fileids()

# Each item pairs a word with its (morpheme, tag) analysis
형태분석목록 = 세종말뭉치.words(tagged=True)
print(형태분석목록)
"""
[('뭐', (('뭐', 'NP'),)), ('타고', (('타', 'VV'), ('고', 'EC'))), ('가?', (('가', 'VV'), ('ㅏ', 'EF'), ('?', 'SF'))), ('지하철.', (('지하철', 'NNG'), ('.', 'SF'))), ('기차?', (('기차', 'NNG'), ('?', 'SF'))), ('아침에', (('아침', 'NNG'), ('에', 'JKB'))), ...]
"""

# Each sentence is a list of such tagged words
형태분석문장목록 = 세종말뭉치.sents(tagged=True)
print(형태분석문장목록[0])
"""
[('뭐', (('뭐', 'NP'),)),
 ('타고', (('타', 'VV'), ('고', 'EC'))),
 ('가?', (('가', 'VV'), ('ㅏ', 'EF'), ('?', 'SF')))]
"""

Morpheme Segmentation and Part-of-Speech Prediction

v0.4.0+

The morphological analyzer internally uses kakao/khaiii, Kakao's deep-learning-based morphological analyzer. pyko assumes that this package is already installed on the system.

Using a Docker image with the whole environment preconfigured is convenient.

pyko Docker image: codebasic/pyko

Docker image usage example

$ docker run -it codebasic/pyko
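
Outside the Docker image, the assumption that khaiii is already installed can be checked before importing the tokenizer. The sketch below uses only the standard library; is_module_available is a hypothetical helper, and the module name khaiii is an assumption about how the khaiii Python binding is exposed.

```python
import importlib.util


def is_module_available(name):
    """Return True if a top-level module called `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


# Assumption: the khaiii Python binding is importable as `khaiii`.
if is_module_available("khaiii"):
    print("khaiii found")
else:
    print("khaiii not found; install it or use the codebasic/pyko Docker image")
```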

Usage example:

from pyko import tokenizer as 형태소_분석기

예문 = '한국어를 잘 처리하는지 궁금합니다.'

# Morphemes only
형태소목록 = 형태소_분석기.tokenize(예문)
print(형태소목록)
"""
['한국어', '를', '잘', '처리', '하', '는지', '궁금', '하', 'ㅂ니다', '.']
"""

# (morpheme, POS tag) pairs
형태분석결과 = 형태소_분석기.tokenize(예문, tagged=True)
print(형태분석결과)
"""
[('한국어', 'NNP'),
 ('를', 'JKO'),
 ('잘', 'MAG'),
 ('처리', 'NNG'),
 ('하', 'XSV'),
 ('는지', 'EC'),
 ('궁금', 'XR'),
 ('하', 'XSA'),
 ('ㅂ니다', 'EF'),
 ('.', 'SF')]
"""

NLTK Integration

For corpus management, pyko can be used together with an NLTK CorpusReader.

Usage example:

from pyko import tokenizer as 형태소_분석기
from nltk.corpus import PlaintextCorpusReader

# root and fileids are placeholders for the corpus directory and its files.
# The pyko tokenizer is passed in as NLTK's word tokenizer.
reader = PlaintextCorpusReader(root, fileids, word_tokenizer=형태소_분석기)
형태분석결과 = reader.words()
print(형태분석결과)
"""
['세종', '(', '世宗', ',', '1397', '년', '5', '월', '7', '일', '(', '음력', '4', '월', ...]
"""