Unsupervised Korean Natural Language Processing Toolkits


Keywords
korean-nlp, korean-text-processing, nlp, tokenizer, postagging, word-extraction
License
GPL-3.0
Install
pip install soynlp==0.0.493

Documentation

soynlp

ํ•œ๊ตญ์–ด ๋ถ„์„์„ ์œ„ํ•œ pure python code ์ž…๋‹ˆ๋‹ค. ํ•™์Šต๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•˜์ง€ ์•Š์œผ๋ฉด์„œ ๋ฐ์ดํ„ฐ์— ์กด์žฌํ•˜๋Š” ๋‹จ์–ด๋ฅผ ์ฐพ๊ฑฐ๋‚˜, ๋ฌธ์žฅ์„ ๋‹จ์–ด์—ด๋กœ ๋ถ„ํ•ด, ํ˜น์€ ํ’ˆ์‚ฌ ํŒ๋ณ„์„ ํ•  ์ˆ˜ ์žˆ๋Š” ๋น„์ง€๋„ํ•™์Šต ์ ‘๊ทผ๋ฒ•์„ ์ง€ํ–ฅํ•ฉ๋‹ˆ๋‹ค.

Guide

Usage guide

soynlp ์—์„œ ์ œ๊ณตํ•˜๋Š” WordExtractor ๋‚˜ NounExtractor ๋Š” ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋ฌธ์„œ๋กœ๋ถ€ํ„ฐ ํ•™์Šตํ•œ ํ†ต๊ณ„ ์ •๋ณด๋ฅผ ์ด์šฉํ•˜์—ฌ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ๋น„์ง€๋„ํ•™์Šต ๊ธฐ๋ฐ˜ ์ ‘๊ทผ๋ฒ•๋“ค์€ ํ†ต๊ณ„์  ํŒจํ„ด์„ ์ด์šฉํ•˜์—ฌ ๋‹จ์–ด๋ฅผ ์ถ”์ถœํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ•˜๋‚˜์˜ ๋ฌธ์žฅ ํ˜น์€ ๋ฌธ์„œ์—์„œ ๋ณด๋‹ค๋Š” ์–ด๋Š ์ •๋„ ๊ทœ๋ชจ๊ฐ€ ์žˆ๋Š” ๋™์ผํ•œ ์ง‘๋‹จ์˜ ๋ฌธ์„œ (homogeneous documents) ์—์„œ ์ž˜ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ์˜ํ™” ๋Œ“๊ธ€๋“ค์ด๋‚˜ ํ•˜๋ฃจ์˜ ๋‰ด์Šค ๊ธฐ์‚ฌ์ฒ˜๋Ÿผ ๊ฐ™์€ ๋‹จ์–ด๋ฅผ ์ด์šฉํ•˜๋Š” ์ง‘ํ•ฉ์˜ ๋ฌธ์„œ๋งŒ ๋ชจ์•„์„œ Extractors ๋ฅผ ํ•™์Šตํ•˜์‹œ๋ฉด ์ข‹์Šต๋‹ˆ๋‹ค. ์ด์งˆ์ ์ธ ์ง‘๋‹จ์˜ ๋ฌธ์„œ๋“ค์€ ํ•˜๋‚˜๋กœ ๋ชจ์•„ ํ•™์Šตํ•˜๋ฉด ๋‹จ์–ด๊ฐ€ ์ž˜ ์ถ”์ถœ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

Parameter naming

soynlp=0.0.46 ๊นŒ์ง€๋Š” min_score, minimum_score, l_len_min ์ฒ˜๋Ÿผ ์ตœ์†Œ๊ฐ’์ด๋‚˜ ์ตœ๋Œ€๊ฐ’์„ ์š”๊ตฌํ•˜๋Š” parameters ์˜ ์ด๋ฆ„๋“ค์— ๊ทœ์น™์ด ์—†์—ˆ์Šต๋‹ˆ๋‹ค. ์ง€๊ธˆ๊นŒ์ง€ ์ž‘์—…ํ•˜์‹  ์ฝ”๋“œ๋“ค ์ค‘์—์„œ ์ง์ ‘ parameters ๋ฅผ ์„ค์ •ํ•˜์‹  ๋ถ„๋“ค์—๊ฒŒ ํ˜ผ๋ž€์„ ๋“œ๋ฆด ์ˆ˜ ์žˆ์œผ๋‚˜, ๋” ๋Šฆ๊ธฐ์ „์— ์ดํ›„์— ๋ฐœ์ƒํ•  ๋ถˆํŽธํ•จ์„ ์ค„์ด๊ธฐ ์œ„ํ•˜์—ฌ ๋ณ€์ˆ˜ ๋ช…์„ ์ˆ˜์ •ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

0.0.47 ์ดํ›„ minimum, maximum ์˜ ์˜๋ฏธ๊ฐ€ ๋“ค์–ด๊ฐ€๋Š” ๋ณ€์ˆ˜๋ช…์€ min, max ๋กœ ์ค„์—ฌ ๊ธฐ์ž…ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ ๋’ค์— ์–ด๋–ค ํ•ญ๋ชฉ์˜ threshold parameter ์ธ์ง€ ์ด๋ฆ„์„ ๊ธฐ์ž…ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํŒจํ„ด์œผ๋กœ parameter ์ด๋ฆ„์„ ํ†ต์ผํ•ฉ๋‹ˆ๋‹ค. {min, max}_{noun, word}_{score, threshold} ๋“ฑ์œผ๋กœ ์ด๋ฆ„์„ ํ†ต์ผํ•ฉ๋‹ˆ๋‹ค. ํ•ญ๋ชฉ์ด ์ž๋ช…ํ•œ ๊ฒฝ์šฐ์—๋Š” ์ด๋ฅผ ์ƒ๋žตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

soynlp ์—์„œ๋Š” substring counting ์„ ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ๋นˆ๋„์ˆ˜์™€ ๊ด€๋ จ๋œ parameter ๋Š” count ๊ฐ€ ์•„๋‹Œ frequency ๋กœ ํ†ต์ผํ•ฉ๋‹ˆ๋‹ค.

index ์™€ idx ๋Š” idx ๋กœ ํ†ต์ผํ•ฉ๋‹ˆ๋‹ค.

์ˆซ์ž๋ฅผ ์˜๋ฏธํ•˜๋Š” num ๊ณผ n ์€ num ์œผ๋กœ ํ†ต์ผํ•ฉ๋‹ˆ๋‹ค.

Setup

$ pip install soynlp

Python version

  • Python 3.5+ is supported. Development is done mainly on 3.x, so using 3.x is recommended.
  • Python 2.x has not been fully tested for all features.

Requires

  • numpy >= 1.12.1
  • psutil >= 5.0.1
  • scipy >= 1.1.0
  • scikit-learn >= 0.20.0

Noun Extractor

๋ช…์‚ฌ ์ถ”์ถœ์„ ํ•˜๊ธฐ ์œ„ํ•ด ์—ฌ๋Ÿฌ ์‹œ๋„๋ฅผ ํ•œ ๊ฒฐ๊ณผ, v1, news, v2 ์„ธ ๊ฐ€์ง€ ๋ฒ„์ „์ด ๋งŒ๋“ค์–ด์กŒ์Šต๋‹ˆ๋‹ค. ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๊ฒƒ์€ v2 ์ž…๋‹ˆ๋‹ค.

WordExtractor ๋Š” ํ†ต๊ณ„๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹จ์–ด์˜ ๊ฒฝ๊ณ„ ์ ์ˆ˜๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ผ ๋ฟ, ๊ฐ ๋‹จ์–ด์˜ ํ’ˆ์‚ฌ๋ฅผ ํŒ๋‹จํ•˜์ง€๋Š” ๋ชปํ•ฉ๋‹ˆ๋‹ค. ๋•Œ๋กœ๋Š” ๊ฐ ๋‹จ์–ด์˜ ํ’ˆ์‚ฌ๋ฅผ ์•Œ์•„์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ๋‹ค๋ฅธ ํ’ˆ์‚ฌ๋ณด๋‹ค๋„ ๋ช…์‚ฌ์—์„œ ์ƒˆ๋กœ์šด ๋‹จ์–ด๊ฐ€ ๊ฐ€์žฅ ๋งŽ์ด ๋งŒ๋“ค์–ด์ง‘๋‹ˆ๋‹ค. ๋ช…์‚ฌ์˜ ์˜ค๋ฅธ์ชฝ์—๋Š” -์€, -๋Š”, -๋ผ๋Š”, -ํ•˜๋Š” ์ฒ˜๋Ÿผ ํŠน์ • ๊ธ€์ž๋“ค์ด ์ž์ฃผ ๋“ฑ์žฅํ•ฉ๋‹ˆ๋‹ค. ๋ฌธ์„œ์˜ ์–ด์ ˆ (๋„์–ด์“ฐ๊ธฐ ๊ธฐ์ค€ ์œ ๋‹›)์—์„œ ์™ผ์ชฝ์— ์œ„์น˜ํ•œ substring ์˜ ์˜ค๋ฅธ์ชฝ์— ์–ด๋–ค ๊ธ€์ž๋“ค์ด ๋“ฑ์žฅํ•˜๋Š”์ง€ ๋ถ„ํฌ๋ฅผ ์‚ดํŽด๋ณด๋ฉด ๋ช…์‚ฌ์ธ์ง€ ์•„๋‹Œ์ง€ ํŒ๋‹จํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. soynlp ์—์„œ๋Š” ๋‘ ๊ฐ€์ง€ ์ข…๋ฅ˜์˜ ๋ช…์‚ฌ ์ถ”์ถœ๊ธฐ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๋‘˜ ๋ชจ๋‘ ๊ฐœ๋ฐœ ๋‹จ๊ณ„์ด๊ธฐ ๋•Œ๋ฌธ์— ์–ด๋–ค ๊ฒƒ์ด ๋” ์šฐ์ˆ˜ํ•˜๋‹ค ๋งํ•˜๊ธฐ๋Š” ์–ด๋ ต์Šต๋‹ˆ๋‹ค๋งŒ, NewsNounExtractor ๊ฐ€ ์ข€ ๋” ๋งŽ์€ ๊ธฐ๋Šฅ์„ ํฌํ•จํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ถ”ํ›„, ๋ช…์‚ฌ ์ถ”์ถœ๊ธฐ๋Š” ํ•˜๋‚˜์˜ ํด๋ž˜์Šค๋กœ ์ •๋ฆฌ๋  ์˜ˆ์ •์ž…๋‹ˆ๋‹ค.

Noun Extractor ver 1 & News Noun Extractor

from soynlp.noun import LRNounExtractor
noun_extractor = LRNounExtractor()
nouns = noun_extractor.train_extract(sentences) # sentences: list of str

from soynlp.noun import NewsNounExtractor
noun_extractor = NewsNounExtractor()
nouns = noun_extractor.train_extract(sentences) # sentences: list of str

2016-10-20 ์˜ ๋‰ด์Šค๋กœ๋ถ€ํ„ฐ ํ•™์Šตํ•œ ๋ช…์‚ฌ์˜ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค.

๋ด๋งˆํฌ  ์›ƒ๋ˆ  ๋„ˆ๋ฌด๋„ˆ๋ฌด๋„ˆ๋ฌด  ๊ฐ€๋ฝ๋™  ๋งค๋‰ด์–ผ  ์ง€๋„๊ต์ˆ˜
์ „๋ง์น˜  ๊ฐ•๊ตฌ  ์–ธ๋‹ˆ๋“ค  ์‹ ์‚ฐ์—…  ๊ธฐ๋ขฐ์ „  ๋…ธ์Šค
ํ• ๋ฆฌ์šฐ๋“œ  ํ”Œ๋ผ์ž  ๋ถˆ๋ฒ•์กฐ์—…  ์›”์ŠคํŠธ๋ฆฌํŠธ์ €๋„  2022๋…„  ๋ถˆํ—ˆ
๊ณ ์”จ  ์–ดํ”Œ  1987๋…„  ๋ถˆ์”จ  ์ ๊ธฐ  ๋ ˆ์Šค
์Šคํ€˜์–ด  ์ถฉ๋‹น๊ธˆ  ๊ฑด์ถ•๋ฌผ  ๋‰ด์งˆ๋žœ๋“œ  ์‚ฌ๊ฐ  ํ•˜๋‚˜์”ฉ
๊ทผ๋Œ€  ํˆฌ์ž์ฃผ์ฒด๋ณ„  4์œ„  ํƒœ๊ถŒ  ๋„คํŠธ์›์Šค  ๋ชจ๋ฐ”์ผ๊ฒŒ์ž„
์—ฐ๋™  ๋Ÿฐ์นญ  ๋งŒ์„ฑ  ์†์งˆ  ์ œ์ž‘๋ฒ•  ํ˜„์‹คํ™”
์˜คํ•ด์˜  ์‹ฌ์‚ฌ์œ„์›๋“ค  ๋‹จ์   ๋ถ€์žฅ์กฐ๋ฆฌ  ์ฐจ๊ด€๊ธ‰  ๊ฒŒ์‹œ๋ฌผ
์ธํ„ฐํฐ  ์›ํ™”  ๋‹จ๊ธฐ๊ฐ„  ํŽธ๊ณก  ๋ฌด์‚ฐ  ์™ธ๊ตญ์ธ๋“ค
์„ธ๋ฌด์กฐ์‚ฌ  ์„์œ ํ™”ํ•™  ์›Œํ‚น  ์›ํ”ผ์Šค  ์„œ์žฅ  ๊ณต๋ฒ”

๋” ์ž์„ธํ•œ ์„ค๋ช…์€ ํŠœํ† ๋ฆฌ์–ผ์— ์žˆ์Šต๋‹ˆ๋‹ค.

Noun Extractor ver 2

soynlp=0.0.46+ ์—์„œ๋Š” ๋ช…์‚ฌ ์ถ”์ถœ๊ธฐ version 2 ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์ด์ „ ๋ฒ„์ „์˜ ๋ช…์‚ฌ ์ถ”์ถœ์˜ ์ •ํ™•์„ฑ๊ณผ ํ•ฉ์„ฑ๋ช…์‚ฌ ์ธ์‹ ๋Šฅ๋ ฅ, ์ถœ๋ ฅ๋˜๋Š” ์ •๋ณด์˜ ์˜ค๋ฅ˜๋ฅผ ์ˆ˜์ •ํ•œ ๋ฒ„์ „์ž…๋‹ˆ๋‹ค. ์‚ฌ์šฉ๋ฒ•์€ version 1 ๊ณผ ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.

from soynlp.utils import DoublespaceLineCorpus
from soynlp.noun import LRNounExtractor_v2

corpus_path = '2016-10-20-news'
sents = DoublespaceLineCorpus(corpus_path, iter_sent=True)

noun_extractor = LRNounExtractor_v2(verbose=True)
nouns = noun_extractor.train_extract(sents)

์ถ”์ถœ๋œ nouns ๋Š” {str:namedtuple} ํ˜•์‹์ž…๋‹ˆ๋‹ค.

print(nouns['๋‰ด์Šค']) # NounScore(frequency=4319, score=1.0)

_compounds_components ์—๋Š” ๋ณตํ•ฉ๋ช…์‚ฌ๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ๋‹จ์ผ๋ช…์‚ฌ๋“ค์˜ ์ •๋ณด๊ฐ€ ์ €์žฅ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. '๋Œ€ํ•œ๋ฏผ๊ตญ', '๋…น์ƒ‰์„ฑ์žฅ'๊ณผ ๊ฐ™์ด ์‹ค์ œ๋กœ๋Š” ๋ณตํ•ฉํ˜•ํƒœ์†Œ์ด์ง€๋งŒ, ๋‹จ์ผ ๋ช…์‚ฌ๋กœ ์ด์šฉ๋˜๋Š” ๊ฒฝ์šฐ๋Š” ๋‹จ์ผ ๋ช…์‚ฌ๋กœ ์ธ์‹ํ•ฉ๋‹ˆ๋‹ค.

list(noun_extractor._compounds_components.items())[:5]

# [('์ž ์ˆ˜ํ•จ๋ฐœ์‚ฌํƒ„๋„๋ฏธ์‚ฌ์ผ', ('์ž ์ˆ˜ํ•จ', '๋ฐœ์‚ฌ', 'ํƒ„๋„๋ฏธ์‚ฌ์ผ')),
#  ('๋ฏธ์‚ฌ์ผ๋Œ€์‘๋Šฅ๋ ฅ์œ„์›ํšŒ', ('๋ฏธ์‚ฌ์ผ', '๋Œ€์‘', '๋Šฅ๋ ฅ', '์œ„์›ํšŒ')),
#  ('๊ธ€๋กœ๋ฒŒ๋…น์ƒ‰์„ฑ์žฅ์—ฐ๊ตฌ์†Œ', ('๊ธ€๋กœ๋ฒŒ', '๋…น์ƒ‰์„ฑ์žฅ', '์—ฐ๊ตฌ์†Œ')),
#  ('์‹œ์นด๊ณ ์˜ต์…˜๊ฑฐ๋ž˜์†Œ', ('์‹œ์นด๊ณ ', '์˜ต์…˜', '๊ฑฐ๋ž˜์†Œ')),
#  ('๋Œ€ํ•œ๋ฏผ๊ตญํŠน์ˆ˜์ž„๋ฌด์œ ๊ณต', ('๋Œ€ํ•œ๋ฏผ๊ตญ', 'ํŠน์ˆ˜', '์ž„๋ฌด', '์œ ๊ณต')),

LRGraph ๋Š” ํ•™์Šต๋œ corpus ์— ๋“ฑ์žฅํ•œ ์–ด์ ˆ์˜ L-R ๊ตฌ์กฐ๋ฅผ ์ €์žฅํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. get_r ๊ณผ get_l ์„ ์ด์šฉํ•˜์—ฌ ์ด๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

noun_extractor.lrgraph.get_r('์•„์ด์˜ค์•„์ด')

# [('', 123),
#  ('์˜', 47),
#  ('๋Š”', 40),
#  ('์™€', 18),
#  ('๊ฐ€', 18),
#  ('์—', 7),
#  ('์—๊ฒŒ', 6),
#  ('๊นŒ์ง€', 2),
#  ('๋ž‘', 2),
#  ('๋ถ€ํ„ฐ', 1)]

๋” ์ž์„ธํ•œ ์„ค๋ช…์€ ํŠœํ† ๋ฆฌ์–ผ 2์— ์žˆ์Šต๋‹ˆ๋‹ค.

Word Extraction

2016 ๋…„ 10์›”์˜ ์—ฐ์˜ˆ๊ธฐ์‚ฌ ๋‰ด์Šค์—๋Š” 'ํŠธ์™€์ด์Šค', '์•„์ด์˜ค์•„์ด' ์™€ ๊ฐ™์€ ๋‹จ์–ด๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ง๋ญ‰์น˜๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šต๋œ ํ’ˆ์‚ฌ ํŒ๋ณ„๊ธฐ / ํ˜•ํƒœ์†Œ ๋ถ„์„๊ธฐ๋Š” ์ด๋Ÿฐ ๋‹จ์–ด๋ฅผ ๋ณธ ์ ์ด ์—†์Šต๋‹ˆ๋‹ค. ๋Š˜ ์ƒˆ๋กœ์šด ๋‹จ์–ด๊ฐ€ ๋งŒ๋“ค์–ด์ง€๊ธฐ ๋•Œ๋ฌธ์— ํ•™์Šตํ•˜์ง€ ๋ชปํ•œ ๋‹จ์–ด๋ฅผ ์ œ๋Œ€๋กœ ์ธ์‹ํ•˜์ง€ ๋ชปํ•˜๋Š” ๋ฏธ๋“ฑ๋ก๋‹จ์–ด ๋ฌธ์ œ (out of vocabulry, OOV) ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด ์‹œ๊ธฐ์— ์ž‘์„ฑ๋œ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์—ฐ์˜ˆ ๋‰ด์Šค ๊ธฐ์‚ฌ๋ฅผ ์ฝ๋‹ค๋ณด๋ฉด 'ํŠธ์™€์ด์Šค', '์•„์ด์˜ค์•„์ด' ๊ฐ™์€ ๋‹จ์–ด๊ฐ€ ๋“ฑ์žฅํ•จ์„ ์•Œ ์ˆ˜ ์žˆ๊ณ , ์‚ฌ๋žŒ์€ ์ด๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฌธ์„œ์ง‘ํ•ฉ์—์„œ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ์—ฐ์†๋œ ๋‹จ์–ด์—ด์„ ๋‹จ์–ด๋ผ ์ •์˜ํ•œ๋‹ค๋ฉด, ์šฐ๋ฆฌ๋Š” ํ†ต๊ณ„๋ฅผ ์ด์šฉํ•˜์—ฌ ์ด๋ฅผ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ†ต๊ณ„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‹จ์–ด(์˜ ๊ฒฝ๊ณ„)๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋‹ค์–‘ํ•ฉ๋‹ˆ๋‹ค. soynlp๋Š” ๊ทธ ์ค‘, Cohesion score, Branching Entropy, Accessor Variety ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

from soynlp.word import WordExtractor

word_extractor = WordExtractor(min_frequency=100,
    min_cohesion_forward=0.05, 
    min_right_branching_entropy=0.0
)
word_extractor.train(sentences) # sentences: list of str
words = word_extractor.extract()

words ๋Š” Scores ๋ผ๋Š” namedtuple ์„ value ๋กœ ์ง€๋‹ˆ๋Š” dict ์ž…๋‹ˆ๋‹ค.

words['์•„์ด์˜ค์•„์ด']

Scores(cohesion_forward=0.30063636035733476,
        cohesion_backward=0,
        left_branching_entropy=0,
        right_branching_entropy=0,
        left_accessor_variety=0,
        right_accessor_variety=0,
        leftside_frequency=270,
        rightside_frequency=0
)
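A single word score can be derived from the Scores tuple. The sketch below uses the product of cohesion and right branching entropy, the same combination as in the ranking shown next; the Scores stand-in models only the two fields it uses:

```python
from collections import namedtuple

Scores = namedtuple('Scores', 'cohesion_forward right_branching_entropy')

# Stand-in for the {str: Scores} dict returned by extract() above.
words = {'촬영': Scores(1.000, 1.823),
         '서울': Scores(0.657, 2.241),
         '한국': Scores(0.286, 2.729)}

def word_score(s):
    # Combine the two statistics into one score.
    return s.cohesion_forward * s.right_branching_entropy

ranked = sorted(words, key=lambda w: -word_score(words[w]))
print(ranked[0])  # '촬영' — highest cohesion * branching entropy
```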

2016-10-26 ์˜ ๋‰ด์Šค ๊ธฐ์‚ฌ๋กœ๋ถ€ํ„ฐ ํ•™์Šตํ•œ ๋‹จ์–ด ์ ์ˆ˜ (cohesion * branching entropy) ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•œ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค.

๋‹จ์–ด   (๋นˆ๋„์ˆ˜, cohesion, branching entropy)

์ดฌ์˜     (2222, 1.000, 1.823)
์„œ์šธ     (25507, 0.657, 2.241)
๋“ค์–ด     (3906, 0.534, 2.262)
๋กฏ๋ฐ     (1973, 0.999, 1.542)
ํ•œ๊ตญ     (9904, 0.286, 2.729)
๋ถํ•œ     (4954, 0.766, 1.729)
ํˆฌ์ž     (4549, 0.630, 1.889)
๋–จ์–ด     (1453, 0.817, 1.515)
์ง„ํ–‰     (8123, 0.516, 1.970)
์–˜๊ธฐ     (1157, 0.970, 1.328)
์šด์˜     (4537, 0.592, 1.768)
ํ”„๋กœ๊ทธ๋žจ  (2738, 0.719, 1.527)
ํด๋ฆฐํ„ด   (2361, 0.751, 1.420)
๋›ฐ์–ด     (927, 0.831, 1.298)
๋“œ๋ผ๋งˆ   (2375, 0.609, 1.606)
์šฐ๋ฆฌ     (7458, 0.470, 1.827)
์ค€๋น„     (1736, 0.639, 1.513)
๋ฃจ์ด     (1284, 0.743, 1.354)
ํŠธ๋Ÿผํ”„   (3565, 0.712, 1.355)
์ƒ๊ฐ     (3963, 0.335, 2.024)
ํŒฌ๋“ค     (999, 0.626, 1.341)
์‚ฐ์—…     (2203, 0.403, 1.769)
10      (18164, 0.256, 2.210)
ํ™•์ธ     (3575, 0.306, 2.016)
ํ•„์š”     (3428, 0.635, 1.279)
๋ฌธ์ œ     (4737, 0.364, 1.808)
ํ˜์˜     (2357, 0.962, 0.830)
ํ‰๊ฐ€     (2749, 0.362, 1.787)
20      (59317, 0.667, 1.171)
์Šคํฌ์ธ     (3422, 0.428, 1.604)

์ž์„ธํ•œ ๋‚ด์šฉ์€ word extraction tutorial ์— ์žˆ์Šต๋‹ˆ๋‹ค. ํ˜„์žฌ ๋ฒ„์ „์—์„œ ์ œ๊ณตํ•˜๋Š” ๊ธฐ๋Šฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

Tokenizer

WordExtractor ๋กœ๋ถ€ํ„ฐ ๋‹จ์–ด ์ ์ˆ˜๋ฅผ ํ•™์Šตํ•˜์˜€๋‹ค๋ฉด, ์ด๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹จ์–ด์˜ ๊ฒฝ๊ณ„๋ฅผ ๋”ฐ๋ผ ๋ฌธ์žฅ์„ ๋‹จ์–ด์—ด๋กœ ๋ถ„ํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. soynlp ๋Š” ์„ธ ๊ฐ€์ง€ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์ž˜ ๋˜์–ด ์žˆ๋‹ค๋ฉด LTokenizer ๋ฅผ ์ด์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•œ๊ตญ์–ด ์–ด์ ˆ์˜ ๊ตฌ์กฐ๋ฅผ "๋ช…์‚ฌ + ์กฐ์‚ฌ" ์ฒ˜๋Ÿผ "L + [R]" ๋กœ ์ƒ๊ฐํ•ฉ๋‹ˆ๋‹ค.

LTokenizer

L parts ์—๋Š” ๋ช…์‚ฌ/๋™์‚ฌ/ํ˜•์šฉ์‚ฌ/๋ถ€์‚ฌ๊ฐ€ ์œ„์น˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์–ด์ ˆ์—์„œ L ๋งŒ ์ž˜ ์ธ์‹ํ•œ๋‹ค๋ฉด ๋‚˜๋จธ์ง€ ๋ถ€๋ถ„์ด R parts ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. LTokenizer ์—๋Š” L parts ์˜ ๋‹จ์–ด ์ ์ˆ˜๋ฅผ ์ž…๋ ฅํ•ฉ๋‹ˆ๋‹ค.

from soynlp.tokenizer import LTokenizer

scores = {'๋ฐ์ด':0.5, '๋ฐ์ดํ„ฐ':0.5, '๋ฐ์ดํ„ฐ๋งˆ์ด๋‹':0.5, '๊ณต๋ถ€':0.5, '๊ณต๋ถ€์ค‘':0.45}
tokenizer = LTokenizer(scores=scores)

sent = '๋ฐ์ดํ„ฐ๋งˆ์ด๋‹์„ ๊ณต๋ถ€ํ•œ๋‹ค'

print(tokenizer.tokenize(sent, flatten=False))
# [['데이터마이닝', '을'], ['공부', '한다']]

print(tokenizer.tokenize(sent))
# ['데이터마이닝', '을', '공부', '한다']

๋งŒ์•ฝ WordExtractor ๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹จ์–ด ์ ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜์˜€๋‹ค๋ฉด, ๋‹จ์–ด ์ ์ˆ˜ ์ค‘ ํ•˜๋‚˜๋ฅผ ํƒํ•˜์—ฌ scores ๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” Forward cohesion ์˜ ์ ์ˆ˜๋งŒ์„ ์ด์šฉํ•˜๋Š” ๊ฒฝ์šฐ์ž…๋‹ˆ๋‹ค. ๊ทธ ์™ธ์—๋„ ๋‹ค์–‘ํ•˜๊ฒŒ ๋‹จ์–ด ์ ์ˆ˜๋ฅผ ์ •์˜ํ•˜์—ฌ ์ด์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

from soynlp.word import WordExtractor
from soynlp.utils import DoublespaceLineCorpus

file_path = 'your file path'
corpus = DoublespaceLineCorpus(file_path, iter_sent=True)

word_extractor = WordExtractor(
    min_frequency=100, # example
    min_cohesion_forward=0.05,
    min_right_branching_entropy=0.0
)

word_extractor.train(corpus)
words = word_extractor.extract()

cohesion_score = {word:score.cohesion_forward for word, score in words.items()}
tokenizer = LTokenizer(scores=cohesion_score)

๋ช…์‚ฌ ์ถ”์ถœ๊ธฐ์˜ ๋ช…์‚ฌ ์ ์ˆ˜์™€ Cohesion ์„ ํ•จ๊ป˜ ์ด์šฉํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•œ ์˜ˆ๋กœ, "Cohesion ์ ์ˆ˜ + ๋ช…์‚ฌ ์ ์ˆ˜"๋ฅผ ๋‹จ์–ด ์ ์ˆ˜๋กœ ์ด์šฉํ•˜๋ ค๋ฉด ์•„๋ž˜์ฒ˜๋Ÿผ ์ž‘์—…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

from soynlp.noun import LRNounExtractor_v2
noun_extractor = LRNounExtractor_v2()
nouns = noun_extractor.train_extract(corpus) # corpus: list of str

noun_scores = {noun:score.score for noun, score in nouns.items()}
combined_scores = {noun:score + cohesion_score.get(noun, 0)
    for noun, score in noun_scores.items()}
combined_scores.update(
    {subword:cohesion for subword, cohesion in cohesion_score.items()
    if not (subword in combined_scores)}
)

tokenizer = LTokenizer(scores=combined_scores)

MaxScoreTokenizer

๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์ œ๋Œ€๋กœ ์ง€์ผœ์ง€์ง€ ์•Š์€ ๋ฐ์ดํ„ฐ๋ผ๋ฉด, ๋ฌธ์žฅ์˜ ๋„์–ด์“ฐ๊ธฐ ๊ธฐ์ค€์œผ๋กœ ๋‚˜๋‰˜์–ด์ง„ ๋‹จ์œ„๊ฐ€ L + [R] ๊ตฌ์กฐ๋ผ ๊ฐ€์ •ํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์‚ฌ๋žŒ์€ ๋„์–ด์“ฐ๊ธฐ๊ฐ€ ์ง€์ผœ์ง€์ง€ ์•Š์€ ๋ฌธ์žฅ์—์„œ ์ต์ˆ™ํ•œ ๋‹จ์–ด๋ถ€ํ„ฐ ๋ˆˆ์— ๋“ค์–ด์˜ต๋‹ˆ๋‹ค. ์ด ๊ณผ์ •์„ ๋ชจ๋ธ๋กœ ์˜ฎ๊ธด MaxScoreTokenizer ์—ญ์‹œ ๋‹จ์–ด ์ ์ˆ˜๋ฅผ ์ด์šฉํ•ฉ๋‹ˆ๋‹ค.

from soynlp.tokenizer import MaxScoreTokenizer

scores = {'ํŒŒ์Šค': 0.3, 'ํŒŒ์Šคํƒ€': 0.7, '์ข‹์•„์š”': 0.2, '์ข‹์•„':0.5}
tokenizer = MaxScoreTokenizer(scores=scores)

print(tokenizer.tokenize('๋‚œํŒŒ์Šคํƒ€๊ฐ€์ข‹์•„์š”'))
# ['๋‚œ', 'ํŒŒ์Šคํƒ€', '๊ฐ€', '์ข‹์•„', '์š”']

print(tokenizer.tokenize('๋‚œํŒŒ์Šคํƒ€๊ฐ€ ์ข‹์•„์š”'), flatten=False)
# [[('๋‚œ', 0, 1, 0.0, 1), ('ํŒŒ์Šคํƒ€', 1, 4, 0.7, 3),  ('๊ฐ€', 4, 5, 0.0, 1)],
#  [('์ข‹์•„', 0, 2, 0.5, 2), ('์š”', 2, 3, 0.0, 1)]]

MaxScoreTokenizer ์—ญ์‹œ WordExtractor ์˜ ๊ฒฐ๊ณผ๋ฅผ ์ด์šฉํ•˜์‹ค ๋•Œ์—๋Š” ์œ„์˜ ์˜ˆ์‹œ์ฒ˜๋Ÿผ ์ ์ ˆํžˆ scores ๋ฅผ ๋งŒ๋“ค์–ด ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฏธ ์•Œ๋ ค์ง„ ๋‹จ์–ด ์‚ฌ์ „์ด ์žˆ๋‹ค๋ฉด ์ด ๋‹จ์–ด๋“ค์€ ๋‹ค๋ฅธ ์–ด๋–ค ๋‹จ์–ด๋ณด๋‹ค๋„ ๋” ํฐ ์ ์ˆ˜๋ฅผ ๋ถ€์—ฌํ•˜๋ฉด ๊ทธ ๋‹จ์–ด๋Š” ํ† ํฌ๋‚˜์ด์ €๊ฐ€ ํ•˜๋‚˜์˜ ๋‹จ์–ด๋กœ ์ž˜๋ผ๋ƒ…๋‹ˆ๋‹ค.

RegexTokenizer

๊ทœ์น™ ๊ธฐ๋ฐ˜์œผ๋กœ๋„ ๋‹จ์–ด์—ด์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์–ธ์–ด๊ฐ€ ๋ฐ”๋€Œ๋Š” ๋ถ€๋ถ„์—์„œ ์šฐ๋ฆฌ๋Š” ๋‹จ์–ด์˜ ๊ฒฝ๊ณ„๋ฅผ ์ธ์‹ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด "์•„์ด๊ณ ใ…‹ใ…‹ใ…œใ…œ์ง„์งœ?" ๋Š” [์•„์ด๊ณ , ใ…‹ใ…‹, ใ…œใ…œ, ์ง„์งœ, ?]๋กœ ์‰ฝ๊ฒŒ ๋‹จ์–ด์—ด์„ ๋‚˜๋ˆ•๋‹ˆ๋‹ค.

from soynlp.tokenizer import RegexTokenizer

tokenizer = RegexTokenizer()

print(tokenizer.tokenize('์ด๋ ‡๊ฒŒ์—ฐ์†๋œ๋ฌธ์žฅ์€์ž˜๋ฆฌ์ง€์•Š์Šต๋‹ˆ๋‹ค๋งŒ'))
# ['์ด๋ ‡๊ฒŒ์—ฐ์†๋œ๋ฌธ์žฅ์€์ž˜๋ฆฌ์ง€์•Š์Šต๋‹ˆ๋‹ค๋งŒ']

print(tokenizer.tokenize('์ˆซ์ž123์ด์˜์–ดabc์—์„ž์—ฌ์žˆ์œผ๋ฉดใ…‹ใ…‹์ž˜๋ฆฌ๊ฒ ์ฃ '))
# ['์ˆซ์ž', '123', '์ด์˜์–ด', 'abc', '์—์„ž์—ฌ์žˆ์œผ๋ฉด', 'ใ…‹ใ…‹', '์ž˜๋ฆฌ๊ฒ ์ฃ ']

Part of Speech Tagger

๋‹จ์–ด ์‚ฌ์ „์ด ์ž˜ ๊ตฌ์ถ•๋˜์–ด ์žˆ๋‹ค๋ฉด, ์ด๋ฅผ ์ด์šฉํ•˜์—ฌ ์‚ฌ์ „ ๊ธฐ๋ฐ˜ ํ’ˆ์‚ฌ ํŒ๋ณ„๊ธฐ๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹จ, ํ˜•ํƒœ์†Œ๋ถ„์„์„ ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์— 'ํ•˜๋Š”', 'ํ•˜๋‹ค', 'ํ•˜๊ณ '๋Š” ๋ชจ๋‘ ๋™์‚ฌ์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค. Lemmatizer ๋Š” ํ˜„์žฌ ๊ฐœ๋ฐœ/์ •๋ฆฌ ์ค‘์ž…๋‹ˆ๋‹ค.

pos_dict = {
    'Adverb': {'๋„ˆ๋ฌด', '๋งค์šฐ'}, 
    'Noun': {'๋„ˆ๋ฌด๋„ˆ๋ฌด๋„ˆ๋ฌด', '์•„์ด์˜ค์•„์ด', '์•„์ด', '๋…ธ๋ž˜', '์˜ค', '์ด', '๊ณ ์–‘'},
    'Josa': {'๋Š”', '์˜', '์ด๋‹ค', '์ž…๋‹ˆ๋‹ค', '์ด', '์ด๋Š”', '๋ฅผ', '๋ผ', '๋ผ๋Š”'},
    'Verb': {'ํ•˜๋Š”', 'ํ•˜๋‹ค', 'ํ•˜๊ณ '},
    'Adjective': {'์˜ˆ์œ', '์˜ˆ์˜๋‹ค'},
    'Exclamation': {'์šฐ์™€'}    
}

from soynlp.postagger import Dictionary
from soynlp.postagger import LRTemplateMatcher
from soynlp.postagger import LREvaluator
from soynlp.postagger import SimpleTagger
from soynlp.postagger import UnknowLRPostprocessor

dictionary = Dictionary(pos_dict)
generator = LRTemplateMatcher(dictionary)    
evaluator = LREvaluator()
postprocessor = UnknowLRPostprocessor()
tagger = SimpleTagger(generator, evaluator, postprocessor)

sent = '๋„ˆ๋ฌด๋„ˆ๋ฌด๋„ˆ๋ฌด๋Š”์•„์ด์˜ค์•„์ด์˜๋…ธ๋ž˜์ž…๋‹ˆ๋‹ค!!'
print(tagger.tag(sent))
# [('๋„ˆ๋ฌด๋„ˆ๋ฌด๋„ˆ๋ฌด', 'Noun'),
#  ('๋Š”', 'Josa'),
#  ('์•„์ด์˜ค์•„์ด', 'Noun'),
#  ('์˜', 'Josa'),
#  ('๋…ธ๋ž˜', 'Noun'),
#  ('์ž…๋‹ˆ๋‹ค', 'Josa'),
#  ('!!', None)]

๋” ์ž์„ธํ•œ ์‚ฌ์šฉ๋ฒ•์€ ์‚ฌ์šฉ๋ฒ• ํŠœํ† ๋ฆฌ์–ผ ์— ๊ธฐ์ˆ ๋˜์–ด ์žˆ์œผ๋ฉฐ, ๊ฐœ๋ฐœ๊ณผ์ • ๋…ธํŠธ๋Š” ์—ฌ๊ธฐ์— ๊ธฐ์ˆ ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

Vectorizer

ํ† ํฌ๋‚˜์ด์ €๋ฅผ ํ•™์Šตํ•˜๊ฑฐ๋‚˜, ํ˜น์€ ํ•™์Šต๋œ ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ฌธ์„œ๋ฅผ sparse matrix ๋กœ ๋งŒ๋“ญ๋‹ˆ๋‹ค. minimum / maximum of term frequency / document frequency ๋ฅผ ์กฐ์ ˆํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Verbose mode ์—์„œ๋Š” ํ˜„์žฌ์˜ ๋ฒกํ„ฐ๋ผ์ด์ง• ์ƒํ™ฉ์„ print ํ•ฉ๋‹ˆ๋‹ค.

from soynlp.vectorizer import BaseVectorizer

vectorizer = BaseVectorizer(
    tokenizer=tokenizer,
    min_tf=0,
    max_tf=10000,
    min_df=0,
    max_df=1.0,
    stopwords=None,
    lowercase=True,
    verbose=True
)

corpus.iter_sent = False
x = vectorizer.fit_transform(corpus)

๋ฌธ์„œ์˜ ํฌ๊ธฐ๊ฐ€ ํฌ๊ฑฐ๋‚˜, ๊ณง๋ฐ”๋กœ sparse matrix ๋ฅผ ์ด์šฉํ•  ๊ฒƒ์ด ์•„๋‹ˆ๋ผ๋ฉด ์ด๋ฅผ ๋ฉ”๋ชจ๋ฆฌ์— ์˜ฌ๋ฆฌ์ง€ ์•Š๊ณ  ๊ทธ๋Œ€๋กœ ํŒŒ์ผ๋กœ ์ €์žฅํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. fit_to_file() ํ˜น์€ to_file() ํ•จ์ˆ˜๋Š” ํ•˜๋‚˜์˜ ๋ฌธ์„œ์— ๋Œ€ํ•œ term frequency vector ๋ฅผ ์–ป๋Š”๋Œ€๋กœ ํŒŒ์ผ์— ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค. BaseVectorizer ์—์„œ ์ด์šฉํ•  ์ˆ˜ ์žˆ๋Š” parameters ๋Š” ๋™์ผํ•ฉ๋‹ˆ๋‹ค.

vectorizer = BaseVectorizer(min_tf=1, tokenizer=tokenizer)
corpus.iter_sent = False

matrix_path = 'YOURS'
vectorizer.fit_to_file(corpus, matrix_path)

ํ•˜๋‚˜์˜ ๋ฌธ์„œ๋ฅผ sparse matrix ๊ฐ€ ์•„๋‹Œ list of int ๋กœ ์ถœ๋ ฅ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ์ด ๋•Œ vectorizer.vocabulary_ ์— ํ•™์Šต๋˜์ง€ ์•Š์€ ๋‹จ์–ด๋Š” encoding ์ด ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

vectorizer.encode_a_doc_to_bow('์˜ค๋Š˜ ๋‰ด์Šค๋Š” ์ด๊ฒƒ์ด ์ „๋ถ€๋‹ค')
# {3: 1, 258: 1, 428: 1, 1814: 1}

list of int ๋Š” list of str ๋กœ decoding ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

vectorizer.decode_from_bow({3: 1, 258: 1, 428: 1, 1814: 1})
# {'๋‰ด์Šค': 1, '๋Š”': 1, '์˜ค๋Š˜': 1, '์ด๊ฒƒ์ด': 1}

dict ํ˜•์‹์˜ bag of words ๋กœ๋„ encoding ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

vectorizer.encode_a_doc_to_list('์˜ค๋Š˜์˜ ๋‰ด์Šค๋Š” ๋งค์šฐ ์‹ฌ๊ฐํ•ฉ๋‹ˆ๋‹ค')
# [258, 4, 428, 3, 333]

dict ํ˜•์‹์˜ bag of words ๋Š” decoding ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

vectorizer.decode_from_list([258, 4, 428, 3, 333])
['์˜ค๋Š˜', '์˜', '๋‰ด์Šค', '๋Š”', '๋งค์šฐ']

Normalizer

๋Œ€ํ™” ๋ฐ์ดํ„ฐ, ๋Œ“๊ธ€ ๋ฐ์ดํ„ฐ์— ๋“ฑ์žฅํ•˜๋Š” ๋ฐ˜๋ณต๋˜๋Š” ์ด๋ชจํ‹ฐ์ฝ˜์˜ ์ •๋ฆฌ ๋ฐ ํ•œ๊ธ€, ํ˜น์€ ํ…์ŠคํŠธ๋งŒ ๋‚จ๊ธฐ๊ธฐ ์œ„ํ•œ ํ•จ์ˆ˜๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

from soynlp.normalizer import *

emoticon_normalize('ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œใ…œใ…œใ…œ', num_repeats=3)
# 'ใ…‹ใ…‹ใ…‹ใ…œใ…œใ…œ'

repeat_normalize('์™€ํ•˜ํ•˜ํ•˜ํ•˜ํ•˜ํ•˜ํ•˜ํ•˜ํ•˜ํ•ซ', num_repeats=2)
# '์™€ํ•˜ํ•˜ํ•ซ'

only_hangle('๊ฐ€๋‚˜๋‹คใ…ใ…‘ใ…“ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œabcd123!!์•„ํ•ซ')
# '๊ฐ€๋‚˜๋‹คใ…ใ…‘ใ…“ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œ ์•„ํ•ซ'

only_hangle_number('๊ฐ€๋‚˜๋‹คใ…ใ…‘ใ…“ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œabcd123!!์•„ํ•ซ')
# '๊ฐ€๋‚˜๋‹คใ…ใ…‘ใ…“ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œ 123 ์•„ํ•ซ'

only_text('๊ฐ€๋‚˜๋‹คใ…ใ…‘ใ…“ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œabcd123!!์•„ํ•ซ')
# '๊ฐ€๋‚˜๋‹คใ…ใ…‘ใ…“ใ…‹ใ…‹์ฟ ใ…œใ…œใ…œabcd123!!์•„ํ•ซ'

๋” ์ž์„ธํ•œ ์„ค๋ช…์€ ํŠœํ† ๋ฆฌ์–ผ์— ์žˆ์Šต๋‹ˆ๋‹ค.

Point-wise Mutual Information (PMI)

์—ฐ๊ด€์–ด ๋ถ„์„์„ ์œ„ํ•œ co-occurrence matrix ๊ณ„์‚ฐ๊ณผ ์ด๋ฅผ ์ด์šฉํ•œ Point-wise Mutual Information (PMI) ๊ณ„์‚ฐ์„ ์œ„ํ•œ ํ•จ์ˆ˜๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

์•„๋ž˜ sent_to_word_contexts_matrix ํ•จ์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ (word, context words) matrix ๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. x ๋Š” scipy.sparse.csr_matrix ์ด๋ฉฐ, (n_vocabs, n_vocabs) ํฌ๊ธฐ์ž…๋‹ˆ๋‹ค. idx2vocab ์€ x ์˜ ๊ฐ row, column ์— ํ•ด๋‹นํ•˜๋Š” ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋œ list of str ์ž…๋‹ˆ๋‹ค. ๋ฌธ์žฅ์˜ ์•ž/๋’ค windows ๋‹จ์–ด๋ฅผ context ๋กœ ์ธ์‹ํ•˜๋ฉฐ, min_tf ์ด์ƒ์˜ ๋นˆ๋„์ˆ˜๋กœ ๋“ฑ์žฅํ•œ ๋‹จ์–ด์— ๋Œ€ํ•ด์„œ๋งŒ ๊ณ„์‚ฐ์„ ํ•ฉ๋‹ˆ๋‹ค. dynamic_weight ๋Š” context ๊ธธ์ด์— ๋ฐ˜๋น„๋ก€ํ•˜์—ฌ weighting ์„ ํ•ฉ๋‹ˆ๋‹ค. windows ๊ฐ€ 3 ์ผ ๊ฒฝ์šฐ, 1, 2, 3 ์นธ ๋–จ์–ด์ง„ ๋‹จ์–ด์˜ co-occurrence ๋Š” 1, 2/3, 1/3 ์œผ๋กœ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค.

from soynlp.vectorizer import sent_to_word_contexts_matrix

x, idx2vocab = sent_to_word_contexts_matrix(
    corpus,
    windows=3,
    min_tf=10,
    tokenizer=tokenizer, # (default) lambda x:x.split(),
    dynamic_weight=False,
    verbose=True
)
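The dynamic weighting rule can be spelled out directly: a context word d positions away contributes (windows - d + 1) / windows to the co-occurrence count.

```python
windows = 3

# Weight for each distance d = 1 .. windows.
weights = [(windows - d + 1) / windows for d in range(1, windows + 1)]
print([round(w, 3) for w in weights])  # [1.0, 0.667, 0.333]
```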

Co-occurrence matrix ์ธ x ๋ฅผ pmi ์— ์ž…๋ ฅํ•˜๋ฉด row ์™€ column ์„ ๊ฐ ์ถ•์œผ๋กœ PMI ๊ฐ€ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค. pmi_dok ์€ scipy.sparse.dok_matrix ํ˜•์‹์ž…๋‹ˆ๋‹ค. min_pmi ์ด์ƒ์˜ ๊ฐ’๋งŒ ์ €์žฅ๋˜๋ฉฐ, default ๋Š” min_pmi = 0 ์ด๊ธฐ ๋•Œ๋ฌธ์— Positive PMI (PPMI) ์ž…๋‹ˆ๋‹ค. alpha ๋Š” PMI(x,y) = p(x,y) / ( p(x) * ( p(y) + alpha ) ) ์— ์ž…๋ ฅ๋˜๋Š” smoothing parameter ์ž…๋‹ˆ๋‹ค. ๊ณ„์‚ฐ ๊ณผ์ •์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฌ๊ธฐ ๋•Œ๋ฌธ์— verbose = True ๋กœ ์„ค์ •ํ•˜๋ฉด ํ˜„์žฌ์˜ ์ง„ํ–‰ ์ƒํ™ฉ์„ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค.

from soynlp.word import pmi

pmi_dok = pmi(
    x,
    min_pmi=0,
    alpha=0.0001,
    verbose=True
)
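The PMI matrix can then be queried for the strongest context words of a given word. The toy pmi_dok and idx2vocab below stand in for the objects computed above:

```python
import scipy.sparse as sp

# Toy stand-ins for the pmi_dok / idx2vocab produced above.
idx2vocab = ['아이오아이', '너무너무너무', '뉴스']
pmi_dok = sp.dok_matrix((3, 3))
pmi_dok[0, 1] = 2.5   # PMI(아이오아이, 너무너무너무)
pmi_dok[0, 2] = 0.3   # PMI(아이오아이, 뉴스)

vocab2idx = {vocab: idx for idx, vocab in enumerate(idx2vocab)}
query = vocab2idx['아이오아이']

# Take the query word's row and sort its contexts by PMI.
row = pmi_dok.tocsr()[query]
related = sorted(zip(row.indices, row.data), key=lambda x: -x[1])
print([(idx2vocab[idx], float(v)) for idx, v in related])
```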

๋” ์ž์„ธํ•œ ์„ค๋ช…์€ ํŠœํ† ๋ฆฌ์–ผ์— ์žˆ์Šต๋‹ˆ๋‹ค.

Notes

Slides

  • slide files์— ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์˜ ์›๋ฆฌ ๋ฐ ์„ค๋ช…์„ ์ ์–ด๋’€์Šต๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ์•ผ๋†€์ž์—์„œ ๋ฐœํ‘œํ–ˆ๋˜ ์ž๋ฃŒ์ž…๋‹ˆ๋‹ค.
  • textmining tutorial ์„ ๋งŒ๋“ค๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. soynlp project ์—์„œ ๊ตฌํ˜„ ์ค‘์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์˜ ์„ค๋ช… ๋ฐ ํ…์ŠคํŠธ ๋งˆ์ด๋‹์— ์ด์šฉ๋˜๋Š” ๋จธ์‹  ๋Ÿฌ๋‹ ๋ฐฉ๋ฒ•๋“ค์„ ์„ค๋ช…ํ•˜๋Š” slides ์ž…๋‹ˆ๋‹ค.

Blogs

  • The github io blog posts textual explanations of the contents of the slides. It is recommended reading for a more detailed look at the slides.

ํ•จ๊ป˜ ์ด์šฉํ•˜๋ฉด ์ข‹์€ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋“ค

์„ธ์ข… ๋ง๋ญ‰์น˜ ์ •์ œ๋ฅผ ์œ„ํ•œ utils

์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋ชจ๋ธ ํ•™์Šต์„ ์œ„ํ•˜์—ฌ ์„ธ์ข… ๋ง๋ญ‰์น˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ •์ œํ•˜๊ธฐ ์œ„ํ•œ ํ•จ์ˆ˜๋“ค์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ํ˜•ํƒœ์†Œ/ํ’ˆ์‚ฌ ํ˜•ํƒœ๋กœ ์ •์ œ๋œ ํ•™์Šต์šฉ ๋ฐ์ดํ„ฐ๋ฅผ ๋งŒ๋“œ๋Š” ํ•จ์ˆ˜, ์šฉ์–ธ์˜ ํ™œ์šฉ ํ˜•ํƒœ๋ฅผ ์ •๋ฆฌํ•˜์—ฌ ํ…Œ์ด๋ธ”๋กœ ๋งŒ๋“œ๋Š” ํ•จ์ˆ˜, ์„ธ์ข… ๋ง๋ญ‰์น˜์˜ ํ’ˆ์‚ฌ ์ฒด๊ณ„๋ฅผ ๋‹จ์ˆœํ™” ์‹œํ‚ค๋Š” ํ•จ์ˆ˜๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

soyspacing

๋„์–ด์“ฐ๊ธฐ ์˜ค๋ฅ˜๊ฐ€ ์žˆ์„ ๊ฒฝ์šฐ ์ด๋ฅผ ์ œ๊ฑฐํ•˜๋ฉด ํ…์ŠคํŠธ ๋ถ„์„์ด ์‰ฌ์›Œ์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ถ„์„ํ•˜๋ ค๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋„์–ด์“ฐ๊ธฐ ์—”์ง„์„ ํ•™์Šตํ•˜๊ณ , ์ด๋ฅผ ์ด์šฉํ•˜์—ฌ ๋„์–ด์“ฐ๊ธฐ ์˜ค๋ฅ˜๋ฅผ ๊ต์ •ํ•ฉ๋‹ˆ๋‹ค.

KR-WordRank

ํ† ํฌ๋‚˜์ด์ €๋‚˜ ๋‹จ์–ด ์ถ”์ถœ๊ธฐ๋ฅผ ํ•™์Šตํ•  ํ•„์š”์—†์ด, HITS algorithm ์„ ์ด์šฉํ•˜์—ฌ substring graph ์—์„œ ํ‚ค์›Œ๋“œ๋ฅผ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.

soykeyword

ํ‚ค์›Œ๋“œ ์ถ”์ถœ๊ธฐ์ž…๋‹ˆ๋‹ค. Logistic Regression ์„ ์ด์šฉํ•˜๋Š” ๋ชจ๋ธ๊ณผ ํ†ต๊ณ„ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ, ๋‘ ์ข…๋ฅ˜์˜ ํ‚ค์›Œ๋“œ ์ถ”์ถœ๊ธฐ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. scipy.sparse ์˜ sparse matrix ํ˜•์‹๊ณผ ํ…์ŠคํŠธ ํŒŒ์ผ ํ˜•์‹์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
