KR-WordRank: Korean Unsupervised Word/Keyword Extractor


Install
pip install krwordrank==1.0.1

Documentation


Keyword extraction

To build the substring graph, you must specify the minimum frequency of a substring (min_count) and the maximum length of a substring (max_length).

from krwordrank.word import KRWordRank

min_count = 5   # minimum frequency of a word (used when building the graph)
max_length = 10 # maximum length of a word
wordrank_extractor = KRWordRank(min_count=min_count, max_length=max_length)
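To make the role of the two parameters concrete, here is a minimal sketch that counts token prefixes and prunes them with a frequency threshold and a length cap. The helper `count_prefix_substrings` and the toy sentences are hypothetical, and the sketch is a simplification for illustration; it does not reproduce how KR-WordRank builds its substring graph internally.

```python
from collections import Counter

def count_prefix_substrings(texts, min_count=2, max_length=10):
    """Count prefixes (up to max_length characters) of whitespace-separated tokens.

    Simplified illustration only: it shows how min_count and max_length
    prune candidate substrings, not how KR-WordRank builds its graph.
    """
    counter = Counter()
    for text in texts:
        for token in text.split():
            # Prefixes of length 1 .. min(len(token), max_length)
            for end in range(1, min(len(token), max_length) + 1):
                counter[token[:end]] += 1
    # Drop substrings that appear fewer than min_count times
    return {sub: n for sub, n in counter.items() if n >= min_count}

toy_texts = ['영화 음악이 좋다', '영화가 좋았다', '음악도 좋다']
counts = count_prefix_substrings(toy_texts, min_count=2, max_length=2)
```

A larger min_count removes rare substrings and shrinks the graph; a smaller max_length caps how long a candidate word can be.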

KR-WordRank extracts words with a graph ranking algorithm similar to PageRank (it uses the HITS algorithm). To compute the rank of each node (substring) in the substring graph, the parameters of the graph ranking algorithm must be provided.

beta = 0.85    # decaying factor beta of PageRank
max_iter = 10
texts = ['예시 문장 입니다', '여러 문장의 list of str 입니다', ... ]  # list of str: example sentences
keywords, rank, graph = wordrank_extractor.extract(texts, beta, max_iter)
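The effect of beta and max_iter can be illustrated with a generic damped power iteration on a toy graph. This is a PageRank-style sketch under stated assumptions, not KR-WordRank's internal HITS-based implementation; the function `rank_nodes` and the graph `toy_graph` are hypothetical.

```python
def rank_nodes(edges, beta=0.85, max_iter=10):
    """Damped power iteration over a weighted directed graph.

    edges: dict mapping source node -> {target node: weight}.
    Generic PageRank-style update for illustration; it is not
    KR-WordRank's internal implementation.
    """
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    rank = {node: 1.0 for node in nodes}
    bias = 1.0 - beta
    for _ in range(max_iter):
        # Every node starts each round with the constant bias term
        new_rank = {node: bias for node in nodes}
        for source, targets in edges.items():
            total = sum(targets.values())
            for target, weight in targets.items():
                # Each source passes on beta * its rank, split by edge weight
                new_rank[target] += beta * rank[source] * weight / total
        rank = new_rank
    return rank

toy_graph = {
    'a': {'b': 1.0, 'c': 1.0},
    'b': {'c': 1.0},
    'c': {'a': 1.0},
}
ranks = rank_nodes(toy_graph, beta=0.85, max_iter=10)
```

In this sketch beta controls how much rank flows along edges versus the constant bias, and max_iter bounds the number of update rounds.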

Nodes (substrings) with a high graph rank are output as words after post-processing. An example of keyword (word) extraction from user reviews of the movie 'La La Land' is in the tutorials.

for word, r in sorted(keywords.items(), key=lambda x:x[1], reverse=True)[:30]:
    print('%8s:\t%.4f' % (word, r))
      영화:    229.7889
     관람객:   112.3404
      너무:    78.4055
      음악:    37.6247
      정말:    37.2504
        ....

With the Python wordcloud package, you can draw a word cloud figure of the keywords.

Remove common words (stopwords) that should not appear in the figure to build passwords. It must be a dict of the form {word: score}.

stopwords = {'영화', '관람객', '너무', '정말', '보고'}
passwords = {word:score for word, score in sorted(
    keywords.items(), key=lambda x:-x[1])[:300] if not (word in stopwords)}

Alternatively, the steps above can be done in a single call to the summarize_with_keywords function.

from krwordrank.word import summarize_with_keywords

keywords = summarize_with_keywords(texts, min_count=5, max_length=10,
    beta=0.85, max_iter=10, stopwords=stopwords, verbose=True)
keywords = summarize_with_keywords(texts) # with default arguments

wordcloud can be installed with the command below.

pip install wordcloud

The default font used by wordcloud does not support Hangul. Find a font on your system that supports Hangul and set font_path to it. After specifying the figure size (width, height), the background color (background_color), and so on, draw the figure with the generate_from_frequencies() function.

from wordcloud import WordCloud

# Set your font path
font_path = 'YOUR_FONT_DIR/truetype/nanum/NanumBarunGothic.ttf'

krwordrank_cloud = WordCloud(
    font_path = font_path,
    width = 800,
    height = 800,
    background_color="white"
)

krwordrank_cloud = krwordrank_cloud.generate_from_frequencies(passwords)

When drawing the figure in a Jupyter notebook, you must enter %matplotlib inline as shown below. Do not include it when writing a .py file.

%matplotlib inline
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 10))
plt.imshow(krwordrank_cloud, interpolation="bilinear")
plt.show()

The figure can be saved.

fig.savefig('./lalaland_wordcloud.png')

The saved figure is shown below.

[Figure: word cloud saved as lalaland_wordcloud.png]

Key-sentence extraction

KR-WordRank >= 1.0.0 provides key sentence extraction. Because KR-WordRank has a Korean tokenizer function built in, it is difficult to use the TextRank approach, which relies on similarity between tokenized sentences. Instead, KR-WordRank selects sentences that contain many keywords as key sentences. The principle is to build a keyword vector from the rank values of the extracted keywords, and then select the input sentences whose sentence vectors are similar to the keyword vector under cosine similarity.
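The principle can be sketched with sparse vectors stored as dicts. The helpers `cosine` and `sentence_vector`, the binary substring-match encoding, and the toy data are assumptions made for illustration; the library's actual vectorization may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def sentence_vector(sentence, keywords):
    """Binary vector marking which keywords occur in the sentence (substring match)."""
    return {word: 1.0 for word in keywords if word in sentence}

# Hypothetical keyword ranks and sentences
toy_keywords = {'음악': 40.0, '마지막': 38.0, '노래': 18.0}
toy_sents = ['음악과 마지막 장면이 좋았다', '노래가 좋았다', '그냥 그랬다']

scores = [cosine(sentence_vector(s, toy_keywords), toy_keywords) for s in toy_sents]
best = toy_sents[scores.index(max(scores))]
```

Sentences that contain several high-rank keywords point in nearly the same direction as the keyword vector, so they score highest.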

Passing texts to the summarize_with_sentences function trains KR-WordRank, then selects keywords and, using them, key sentences.

from krwordrank.sentence import summarize_with_sentences

texts = [] # La La Land movie reviews
keywords, sents = summarize_with_sentences(texts, num_keywords=100, num_keysents=10)

keywords holds the num_keywords keywords learned by KR-WordRank and their rank values, as a dict{str:float}.

{'영화': 201.02402099523516,
 '너무': 81.53699026386887,
 '정말': 40.53709233921311,
 '음악': 40.43446188536923,
 '마지막': 38.598509495213484,
 '뮤지컬': 23.198810378709844,
 '최고': 21.810147306627464,
 '사랑': 20.638511587426862,
 '꿈을': 20.43744237599688,
 '아름': 20.324710458174806,
 '영상': 20.283994278960186,
 '여운이': 19.471356929084546,
 '진짜': 19.06433920013137,
 '노래': 18.732801785265316,
 ...
}

sents holds the num_keysents key sentences, as a list of str.

['์—ฌ์šด์ด ํฌ๊ฒŒ๋‚จ๋Š”์˜ํ™” ์— ๋งˆ์Šคํ†ค ๋„ˆ๋ฌด ์‚ฌ๋ž‘์Šค๋Ÿฝ๊ณ  ๋ผ์ด์–ธ๊ณ ์Šฌ๋ง ๋‚จ์ž๊ฐ€๋ด๋„ ์ •๋ง ๋งค๋ ฅ์ ์ธ ๋ฐฐ์šฐ์ธ๋“ฏ ์˜์ƒ๋ฏธ ์Œ์•… ์—ฐ๊ธฐ ๊ตฌ์„ฑ ์ „๋ถ€ ์ข‹์•˜๊ณ  ๋งˆ์ง€๋ง‰ ์—”๋”ฉ๊นŒ์ง€ ์‹ ์„ ํ•˜๋ฉด์„œ ์• ํ‹‹ํ•˜๊ตฌ์š” 30์ค‘๋ฐ˜์— ๊ฐ์ •์ด ๋งŽ์ด ๋ฉ”๋ง๋ผ์žˆ์—ˆ๋Š”๋ฐ ์˜ค๋žœ๋งŒ์— ๊ฐ€์Šด์ด ์ด‰์ด‰ํ•ด์ง€๋„ค์š”',
 '์˜์ƒ๋ฏธ๋„ ๋„ˆ๋ฌด ์•„๋ฆ„๋‹ต๊ณ  ์‹ ๋‚˜๋Š” ์Œ์•…๋„ ์ข‹์•˜๋‹ค ๋งˆ์ง€๋ง‰ ์„ธ๋ฐ”์Šค์ฐฌ๊ณผ ๋ฏธ์•„์˜ ๋ˆˆ๋น›๊ตํ™˜์€ ์ •๋ง ๋งˆ์Œ ์•„ํŒ ์Œ ์˜ํ™”๊ด€์— ๊ณ ๋”ฉ๋“ค์ด ์—„์ฒญ ๋งŽ๋˜๋ฐ ๊ณ ๋”ฉ๋“ค์€ ์˜ํ™” ๋‚ด์šฉ ์ดํ•ด๋ฅผ ๋ชปํ•˜๋”๋ผใ…กใ…ก์‚ฌ๋ž‘์„ ๊นŠ๊ฒŒ ํ•ด๋ณธ ์‚ฌ๋žŒ์ด๋ผ๋ฉด ๋ˆ„๊ตฌ๋‚˜ ๋Š๊ปด๋ณผ์ˆ˜์žˆ๋Š” ๋จน๋จนํ•จ์ด ์žˆ๋‹ค',
 '์ •๋ง ์˜์ƒ๋ฏธ๋ž‘ ์Œ์•…์€ ์ตœ๊ณ ์˜€๋‹ค ๊ทธ๋ฆฌ๊ณ  ์‹ ์„ ํ–ˆ๋‹ค ์Œ์•…์ด ๋„ˆ๋ฌด ๋ฉ‹์žˆ์–ด์„œ ์—ฐ๊ธฐ๋ฅผ ๋ด์•ผ ํ• ์ง€ ๋…ธ๋ž˜๋ฅผ ๋“ค์–ด์•ผ ํ• ์ง€ ๋ชจ๋ฅผ ์ •๋„๋กœ ๊ทธ๋ฆฌ๊ณ  ๋ณด๊ณ  ๋‚˜์„œ ์ƒ๊ฐ ์ข€ ๋งŽ์•„์ง„ ์˜ํ™” ์ •๋ง ์ด ์—ฐ๋ง์— ๋ณด๊ธฐ ์ข‹์€ ์˜ํ™” ์ธ ๊ฒƒ ๊ฐ™๋‹ค',
 '๋ฌด์–ธ์˜ ๋งˆ์ง€๋ง‰ ํ”ผ์•„๋…ธ์—ฐ์ฃผ ์™„์ „ ์Šฌํ””ใ… ๋ณด๋Š”์ด๋“ค์—๊ฒŒ ๊ฟˆ์„ ์ƒ๊ธฐ์‹œ์ผœ์ค„๋“ฏ ๋˜ ๋ณด๊ณ  ์‹ถ์€ ๋‚ด์ƒ์— ์ตœ๊ณ ์˜ ๋ฎค์ง€์ปฌ์˜ํ™”์˜€์Œ ๋‹จ์ˆœํ• ์ˆ˜ ์žˆ๋Š” ๋‚ด์šฉ์— ๋ฎค์ง€์ปฌ์„ ๊ฐ€๋ฏธ์‹œ์ผœ์งธ์ฆˆ์Œ์•…๊ณผ ์ถค์œผ๋กœ ์ง€๋ฃจํ• ํ‹ˆ์—†์ด ๋น ์ ธ์„œ๋ด„ ost๋„ˆ๋ฌด์ข‹์•˜์Œ',
 '์ฒ˜์Œ์—” ์ดˆ๋”ฉ๋“ค ๋ณด๋Š” ๊ทธ๋ƒฅ ๊ทธ๋Ÿฐ์˜ํ™”์ธ์ค„ ์•Œ์•˜๋Š”๋ฐ ์ •๋ง๋กœ ๋ˆˆ๊ณผ ๊ท€๊ฐ€ ์ฆ๊ฑฐ์šด ์˜ํ™”์˜€์Šต๋‹ˆ๋‹ค ์–ด์ฐŒ๋ณด๋ฉด ๋ป”ํ•œ ์Šคํ† ๋ฆฌ์ผ์ง€ ๋ชฐ๋ผ๋„ ๊ทธ๋ƒฅ ๋ณด๊ณ  ๋“ฃ๋Š”๊ฒŒ ์ฆ๊ฑฐ์šด ๊ทธ๋Ÿฌ๋‹ค๊ฐ€ ์ •๋ง ๋งˆ์ง€๋ง‰์—” ๋„ˆ๋ฌด ์•„๋ฆ„๋‹ต๊ณ  ์Šฌํ”ˆ ์Œ์•…์ด ๋˜์–ด๋ฒ„๋ฆฐ',
 '์ •๋ง ๋ฉ‹์ง„ ๋…ธ๋ž˜์™€ ์Œ์•…๊ณผ ์˜์ƒ๋ฏธ๊นŒ์ง€ ์ •๋ง ๋„ˆ๋ฌด ๋ฉ‹์žˆ๋Š” ์˜ํ™” ๋ˆˆ๋ฌผ์„ ํ˜๋ฆฌ๋ฉด์„œ ๋ดค์Šต๋‹ˆ๋‹ค ์˜ํ™”๊ฐ€ ๋๋‚œ ์ˆœ๊ฐ„ ๊ฐํƒ„๊ณผ ๋™์‹œ์— ์—ฌ์šด์ด ๊ธธ๊ฒŒ ๋‚จ์•„ ๋˜ ๋ˆˆ๋ฌผ์„ ํ˜๋ ธ๋˜๋‚ด ์ธ์ƒ ์ตœ๊ณ ์˜ ๋ฎค์ง€์ปฌ ์˜ํ™”',
 'ํ‰์†Œ ๋ฎค์ง€์ปฌ ์˜ํ™” ์ข‹์•„ํ•˜๋Š” ํŽธ์ธ๋ฐ๋„ ํ‰์ ์— ๋น„ํ•ด ๋„ˆ๋ฌด๋‚˜ ๋ณ„๋กœ์˜€๋˜ ์˜ํ™” ์žฌ์ฆˆ์Œ์•…์ด๋‚˜ ์˜์ƒ๋ฏธ ๊ฐ™์€ ๊ฑด ์ข‹์•˜์ง€๋งŒ ์ค„๊ฑฐ๋ฆฌ๋„ ๊ธ€์Ž„ ๊ฒฐ๋ง์€ ์ •๋ง ๋ณ„๋กœ 6 7์  ์ •๋„ ์ฃผ๋Š”๊ฒŒ ๋งž๋‹ค๊ณ  ์ƒ๊ฐํ•˜์ง€๋งŒ ๊ฐœ์ธ์ ์œผ๋กœ ํ›„๋ฐ˜๋ถ€๊ฐ€ ๋„ˆ๋ฌด ๋ณ„๋กœ์—ฌ์„œ',
 '์˜ค๋žœ๋งŒ์— ์ข‹์€ ์˜ํ™”๋ดค๋‹ค๋Š” ์ƒ๊ฐ๋“ค์—ˆ๊ตฌ์š” ์Œ์•…๋„ ์˜์ƒ๋„ ์Šคํ† ๋ฆฌ๋„ ๋„ˆ๋ฌด๋‚˜์ข‹์•˜๊ณ  ๋ฌด์—‡๋ณด๋‹ค ์ง„ํ•œ ์—ฌ์šด์ด ๋‚จ๋Š” ์˜ํ™”๋Š” ์ •๋ง ์˜ค๋žœ๋งŒ์ด์—ˆ์–ด์š” ์—ฐ์ธ๋ผ๋ฆฌ ๊ฐ€์„œ ๋ณด๊ธฐ ์ •๋ง ์ข‹์€์˜ํ™” ๋„ˆ๋ฎค๋„ˆ๋ฎค๋„ˆ๋ฎค ์žฌ๋ฐŒ๊ฒŒ ์ž˜ ๋ดค์Šต๋‹ˆ๋‹ค',
 '์Œ์•… ๋ฏธ์ˆ  ์—ฐ๊ธฐ ๋“ฑ ๋ชจ๋“  ๊ฒƒ์ด ์ข‹์•˜์ง€๋งŒ ๋งˆ์ง€๋ง‰ ๊ฒฐ๋ง์ด ๋„ˆ๋ฌด ํ˜„์‹ค์— ๋’ค๋–จ์–ด์ง„ ๊ฟˆ๋งŒ ๊ฐ™๋‹ค ๊ฟˆ์„ ์ด์•ผ๊ธฐํ•˜๋Š” ์˜ํ™”์ง€๋งŒ ๊ณผ์ •๊ณผ ๊ฒฐ๊ณผ์— ์žˆ์–ด ์˜ˆ์ˆ ๊ฐ€๋“ค์˜ ํ˜„์‹ค์„ ๋„ˆ๋ฌด ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ•œ ๊ฒƒ์ด ์•„๋‹Œ๊ฐ€ํ•˜๋Š” ์ƒ๊ฐ์ด๋“ ๋‹ค ๊ทธ๋ž˜์„œ ๋ณด๊ณ  ๋‚œ ๋’ค ๋‚˜๋Š” ๊ฟˆ์„ ๊ฟ”์•ผํ•˜๋Š”๋ฐ ํ—ˆํƒˆํ–ˆ๋‹ค',
 '๋งˆ์ง€๋ง‰ ํšŒ์ƒ์”ฌ์˜ ๊ฐ๋™์ด ์žŠํ˜€์ง€์งˆ์•Š๋Š”๋‹ค๋งˆ์ง€๋ง‰ ์‹ญ๋ถ„๋งŒ์œผ๋กœ ํ‹ฐ์ผ“๊ฐ’์ด ์•„๊น์ง€์•Š์€ ์˜ํ™” ์Œ์•…๋“ค๋„ ๋„ˆ๋ฌด ์•„๋ฆ„๋‹ค์› ๋‹ค์˜›๋‚  ๋ฎค์ง€์ปฌ ๊ฐ™์€ ๋นˆํ‹ฐ์ง€์˜์ƒ๋ฏธ๋„ ์ตœ๊ณ ']

Several parameters can be added. Define a penalty function to remove sentences that are too long or too short. The one below means that sentences between 25 and 80 characters long are preferred. stopwords are removed from the keywords; they are also not used when building the keyword vector. In addition, when sentences similar to the keyword vector are selected first, sentences that duplicate previously selected ones are sometimes chosen. This can be controlled with diversity. diversity is the minimum distance between key sentences, measured by cosine similarity. The larger this value, the more diverse the selected sentences.

penalty = lambda x:0 if (25 <= len(x) <= 80) else 1
stopwords = {'영화', '관람객', '너무', '정말', '진짜'}

keywords, sents = summarize_with_sentences(
    texts,
    penalty=penalty,
    stopwords = stopwords,
    diversity=0.5,
    num_keywords=100,
    num_keysents=10,
    verbose=False
)
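The interplay of penalty and diversity can be sketched as a greedy selection. The function `select_diverse`, the token-overlap similarity `jaccard` (a stand-in for cosine similarity), the English toy sentences, and the choice to subtract the penalty from the score are all assumptions for illustration; how the library combines these values internally is not stated here.

```python
def select_diverse(candidates, similarity, penalty, diversity=0.5, topk=2):
    """Greedy selection: best score first, skipping near-duplicates.

    candidates: list of (sentence, score) pairs, higher score is better.
    similarity: function(sent_a, sent_b) -> value in [0, 1].
    penalty:    function(sentence) -> value subtracted from the score.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1] - penalty(pair[0]), reverse=True)
    chosen = []
    for sent, _ in ranked:
        # Keep a sentence only if its distance (1 - similarity) to every
        # already chosen sentence is at least `diversity`.
        if all(1 - similarity(sent, other) >= diversity for other in chosen):
            chosen.append(sent)
        if len(chosen) == topk:
            break
    return chosen

def jaccard(a, b):
    """Token-overlap similarity, used here as a stand-in for cosine similarity."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Penalize sentences shorter than 5 or longer than 20 characters
toy_penalty = lambda s: 0 if 5 <= len(s) <= 20 else 1
toy_candidates = [('music was great', 0.9), ('music was great too', 0.8),
                  ('the ending was sad', 0.7), ('ok', 0.95)]
picked = select_diverse(toy_candidates, jaccard, toy_penalty, diversity=0.5, topk=2)
```

Here the too-short sentence drops to the bottom because of the penalty, and the near-duplicate of the first pick is skipped because it is closer than the diversity threshold.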

This time, stopwords such as '영화' (movie), '관람객' (viewer), and '너무' (too) have been removed from the extracted keywords.

{'음악': 40.43446188536923,
 '마지막': 38.598509495213484,
 '뮤지컬': 23.198810378709844,
 '최고': 21.810147306627464,
 '사랑': 20.638511587426862,
 '꿈을': 20.43744237599688,
 '아름': 20.324710458174806,
 '영상': 20.283994278960186,
 '여운이': 19.471356929084546,
 '노래': 18.732801785265316,
 ...
}

The key sentences also favor sentences that are 25 to 80 characters long.

['최고라는 말밖엔 음악 연출 영상 스토리 모두완벽 마지막 10분잊을수없다 한편의 뮤지컬을본듯한 느낌인생영화',
 '기대했었는데 저한텐 스토리도 음악도 평범했어요 영화보는내내 지루하다는 느낌을 많이 받았는데 신기하게도 마지막 씬을 보고나니 여운이 남네요',
 '슬펐지만 아름다웠던 두사람의 사랑과 갈등 그리고 음악 마지막 오버랩은 그냥 할말을 잃었습니다 여운이 남는 영화',
 '마지막 회상신에서 눈물이 왈칵 쏟아질뻔했다 올해중 최고의 영화를 본거 같다음악이며 배우들이며 영상이며 다시 또 보고싶은 그런 영화이다',
 '예쁜 영상과 아름다운 음악 꿈을 쫒는 두사람의 선택이 달랐다면 어땠을까 상상하는 장면이 인상깊었다 쓸쓸하지만 현실적인 사랑이랄까',
 '음악도 좋고 미아와 세바스티안의 아름다운 사랑과 예술에 대한 열정이 감동적이었습니다 재즈음악을 사랑하고 뮤지컬을 좋아하는 사람들에게 강추합니다',
 '생각보다 굉장히 재미있는 뻔한 결말도 아니고 아름다운 음악과 현실적인 스토리구성 모두에게 와닿을법한 울림들이 차 좋았어요 추천',
 '최고입니다 마지막 장면을 위해 음악과 함께 달려왔고현실적이지만 모두의 가슴을 뭉클하게 만드는 멋진 결말입니다 노래가 머리속에서 떠나질않네요',
 '먼저 음악이 너무 좋고아름다운 영상미만으로도 최고네요 아름답지만 짠내도 나구요 별 생각없이 봤는데 강추입니다 영화보고 계속 음악이 귀에 맴돌아요',
 '초반에 좀 지루하나 음악도 좋고 영상도 좋아서 보는 맛이 있어요 마지막이 좋았어요']

If you also want to exclude sentences containing the word '마지막' (last) from the key sentences, you can change the penalty function as shown below.

# No trailing comma here: a trailing comma would turn penalty into a tuple
penalty = lambda x: 0 if (25 <= len(x) <= 80 and '마지막' not in x) else 1
keywords, sents = summarize_with_sentences(
    texts,
    penalty=penalty,
    stopwords = stopwords,
    diversity=0.5,
    num_keywords=100,
    num_keysents=10,
    verbose=False
)

print(sents)
['예쁜 영상과 아름다운 음악 꿈을 쫒는 두사람의 선택이 달랐다면 어땠을까 상상하는 장면이 인상깊었다 쓸쓸하지만 현실적인 사랑이랄까',
 '음악도 좋고 미아와 세바스티안의 아름다운 사랑과 예술에 대한 열정이 감동적이었습니다 재즈음악을 사랑하고 뮤지컬을 좋아하는 사람들에게 강추합니다',
 '생각보다 굉장히 재미있는 뻔한 결말도 아니고 아름다운 음악과 현실적인 스토리구성 모두에게 와닿을법한 울림들이 차 좋았어요 추천',
 '먼저 음악이 너무 좋고아름다운 영상미만으로도 최고네요 아름답지만 짠내도 나구요 별 생각없이 봤는데 강추입니다 영화보고 계속 음악이 귀에 맴돌아요',
 '사랑 꿈 현실 모든걸 다시한번 생각하게 하는 영화였어요 영상미도 너무 예쁘고 주인공도 예쁘고 내용도 아름답네요ㅠㅠ 인생 영화',
 '너무 좋은 영화 스토리는 비숫한것같아요 그래도 음악 영상 이루어지지않는 사랑을 더 매력적으로 전달한영화인것같아요 보고나서도 여운이 남는',
 '노래도 좋고 영상미도 좋고 그리고 배우들 연기까지 정말 좋았어요 개인적으로 뮤지컬 형식 영화를 안좋아하는 편인데 재밌게 봤습니다',
 '16년 최고의영화 인생영화입니다 영상미 색감 음악 감정선 다좋았는데 엔딩이 참현실적이네요 ㅎㅎ 참 공감되고 감동받았습니다 씁쓸하니 정말잘봤어요',
 '사실 두번째 보는 영화입니다 영상 편집과 음악이 너무 좋아요 어떻게 보면 너무나 현실적일 수 있는 결말이 슬프기하지만 아름답습니다',
 '영화사에 남을 최고의 뮤지컬영화입니다 음악과 영상이 너무 아름답고 두 주연배우의 연기는 매우 감동적입니다 무조건 보세요 최고']

For a more detailed key sentence extraction tutorial, see the krwordrank_keysentence.ipynb file in the tutorials folder.

Setup

pip install krwordrank

tested in

  • python 3.5.9
  • python 3.7.7

Requirements

  • Python >= 3.5
  • numpy
  • scipy
