Lucene Nori, Korean Morphological Analyzer, in Python


Pynori

Pynori is a Python version of Nori, the Korean analyzer in Apache Lucene and Elasticsearch.

  • Nori
    • A Korean morphological analyzer plugin included in Apache Lucene and Elasticsearch (written in Java)
    • A Korean morphological analyzer based on mecab / kuromoji (uses mecab-ko-dic-2.1.1-20180720)
    • A Korean morphological analyzer tied to the Lucene or Elasticsearch engine
  • Pynori
    • A Python version of Nori, written in pure Python (see Property and Comparison Study)
    • Runs the same unit tests as the original and obtains identical results (see Test)
    • Usable in Python projects as a standalone module
    • Adds improvements over the original Nori (see Property)

For details on the Nori morphological analyzer, see the Nori Deep Dive blog post.

Please report pynori problems on the project's issue tracker.

Install

pip install pynori

Usage

# Default options: see /pynori/config.ini

from pynori.korean_analyzer import KoreanAnalyzer
nori = KoreanAnalyzer(decompound_mode='DISCARD', # DISCARD or MIXED or NONE
                      infl_decompound_mode='DISCARD', # DISCARD or MIXED or NONE
                      discard_punctuation=True,
                      output_unknown_unigrams=False,
                      pos_filter=False, stop_tags=['JKS', 'JKB', 'VV', 'EF'],
                      synonym_filter=False, mode_synonym='NORM') # NORM or EXTENSION

print(nori.do_analysis("아빠가 방에 들어가신다."))  # "Dad goes into the room."
{'termAtt': ['아빠', '가', '방', '에', '들어가', '시', 'ᆫ다'],
 'offsetAtt': [(0, 2), (2, 3), (4, 5), (5, 6), (7, 10), (10, 12), (10, 12)],
 'posLengthAtt': [1, 1, 1, 1, 1, 1, 1],
 'posTypeAtt': ['MORP', 'MORP', 'MORP', 'MORP', 'MORP', 'MORP', 'MORP'],
 'posTagAtt': ['NNG', 'JKS', 'NNG', 'JKB', 'VV', 'EP', 'EF'],
 'dictTypeAtt': ['KN', 'KN', 'KN', 'KN', 'KN', 'KN', 'KN']}
  • KoreanAnalyzer arguments
    • decompound_mode / infl_decompound_mode - how compound nouns / inflected words are handled
      • 'MIXED': output both the original form and its sub-words
      • 'DISCARD': output only the sub-words
      • 'NONE': output only the original form
    • discard_punctuation - whether to remove punctuation
    • output_unknown_unigrams - whether to split unknown words into single syllables
    • pos_filter - whether to run the POS filter
    • stop_tags - list of POS tags to filter out (only active when pos_filter=True)
    • synonym_filter - whether to run the synonym filter
    • mode_synonym - synonym handling mode (NORM or EXTENSION) (only active when synonym_filter=True)
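
As a rough illustration of the three decompound modes, the sketch below applies them to a token whose sub-words are already known. `apply_decompound_mode` is a hypothetical helper written for this README, not part of pynori, and the split itself is assumed to be given.

```python
def apply_decompound_mode(surface, parts, mode):
    """Return the tokens emitted for one compound under the given mode.

    surface: the original (undecomposed) form, e.g. '냉장고'
    parts:   its sub-words, e.g. ['냉장', '고'] (assumed to be known already)
    """
    if mode == 'MIXED':      # original form followed by its sub-words
        return [surface] + list(parts)
    if mode == 'DISCARD':    # sub-words only
        return list(parts)
    if mode == 'NONE':       # original form only
        return [surface]
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ('MIXED', 'DISCARD', 'NONE'):
    print(mode, apply_decompound_mode('냉장고', ['냉장', '고'], mode))
```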

KoreanAnalyzer options can also be changed dynamically, as follows.

print(nori.do_analysis("가벼운 냉장고")['termAtt'])  # "light refrigerator"
# ['가볍', 'ᆫ', '냉장', '고']

## Set tokenizer options
nori.set_option_tokenizer(decompound_mode='MIXED', infl_decompound_mode='MIXED')
print(nori.do_analysis("가벼운 냉장고")['termAtt'])
# ['가벼운', '가볍', 'ᆫ', '냉장고', '냉장', '고']

## Set POS filter options
nori.set_option_filter(pos_filter=True, stop_tags=['ETM', 'VA'])
print(nori.do_analysis("가벼운 냉장고")['termAtt'])
# ['냉장고', '냉장', '고']

## Set synonym filter options
nori.set_option_filter(synonym_filter=True, mode_synonym='NORM')
print(nori.do_analysis("NLP 개발자")['termAtt'])  # "NLP developer"
# ['자연어처리', '자연어', '처리', '개발자', '개발', '자']

nori.set_option_tokenizer(decompound_mode='DISCARD', infl_decompound_mode='DISCARD') # switch back to DISCARD
nori.set_option_filter(mode_synonym='EXTENSION')
print(nori.do_analysis("AI 개발자")['termAtt'])  # "AI developer"
# ['인공', '지능', 'ai', 'aritificial', 'intelligence', '개발', '자', 'developer']
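
The dictionary returned by do_analysis holds parallel lists, so the effect of stop_tags can be reproduced by hand. The sketch below is an illustration only, assuming the result layout shown in the first Usage output; `drop_stop_tags` is a hypothetical helper, not pynori's own filter.

```python
def drop_stop_tags(result, stop_tags):
    """Remove every token whose POS tag is in stop_tags from all parallel lists."""
    stop = set(stop_tags)
    keep = [i for i, tag in enumerate(result['posTagAtt']) if tag not in stop]
    return {key: [values[i] for i in keep] for key, values in result.items()}


# Sample copied from the first Usage output (two of the six keys, for brevity).
sample = {
    'termAtt': ['아빠', '가', '방', '에', '들어가', '시', 'ᆫ다'],
    'posTagAtt': ['NNG', 'JKS', 'NNG', 'JKB', 'VV', 'EP', 'EF'],
}
filtered = drop_stop_tags(sample, ['JKS', 'JKB', 'VV', 'EF'])
print(filtered['termAtt'])    # ['아빠', '방', '시']
print(filtered['posTagAtt'])  # ['NNG', 'NNG', 'EP']
```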

Usage - Multiprocessing

# Default options: see /pynori/config.ini

from pynori.multiprocessor import KoreanAnalyzerMultiprocessing
nori_mp = KoreanAnalyzerMultiprocessing(decompound_mode='MIXED', # DISCARD or MIXED or NONE
                                        infl_decompound_mode='DISCARD', # DISCARD or MIXED or NONE
                                        #discard_punctuation=True,
                                        #output_unknown_unigrams=False,
                                        #pos_filter=False, stop_tags=['JKS', 'JKB', 'VV', 'EF'],
                                        #synonym_filter=False, mode_synonym='NORM'
)

nori_mp.run(num_workers=3, 
            read_path="your/read/file/path", 
            write_path="your/write/file/path")

# Multiprocessing runs file-to-file.
# num_workers sets the number of parallel processes.
# The input is read line by line, and each line is treated as one sentence (whole documents take a long time to process, so running a sentence splitter first is recommended).
# Only termAtt is written, as a string of space-separated tokens (to write a different key, edit line 24 of multiprocessor.py).
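
The file-to-file pattern described in these notes can be sketched with the standard library alone. This is not the code in multiprocessor.py: `tokenize_line` is a whitespace-splitting placeholder for the analyzer call, and `tokenize_file` and its parameters are names assumed for this example.

```python
from multiprocessing import Pool


def tokenize_line(line):
    """Placeholder tokenizer: whitespace split joined by single spaces.

    With pynori this would presumably be
    ' '.join(nori.do_analysis(line)['termAtt']) (assumption).
    """
    return ' '.join(line.split())


def tokenize_file(read_path, write_path, num_workers=1):
    """Read one sentence per line and write one tokenized line per sentence."""
    with open(read_path, encoding='utf-8') as src:
        lines = [line.rstrip('\n') for line in src]
    if num_workers <= 1:
        results = [tokenize_line(line) for line in lines]
    else:
        # Pool.map returns results in input order, so output lines stay aligned.
        with Pool(processes=num_workers) as pool:
            results = pool.map(tokenize_line, lines, chunksize=64)
    with open(write_path, 'w', encoding='utf-8') as dst:
        for tokens in results:
            dst.write(tokens + '\n')
    return len(results)
```

When num_workers is greater than 1, call tokenize_file from under an `if __name__ == '__main__':` guard so that worker processes can import the module safely on platforms that spawn rather than fork.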

Resources

  • The system dictionary is edited in ~/pynori/resources/mecab-ko-dic-2.1.1-20180720
    • Dictionary changes take effect immediately once the following two steps are done
      • Edit/delete an existing csv file or add a new csv file (caution: follow the mecab word-entry rules)
      • Delete the existing ~/pynori/resources/pkl_mecab_csv/mecab_csv.pkl
      • (Note: if mecab_csv.pkl is missing, it is regenerated from the latest csv files when KoreanAnalyzer is initialized)
      • (Note: do not modify or delete ~/pynori/resources/pkl_mecab_matrix/matrix_def.pkl)
      • (Note: applying a different version of mecab-ko-dic requires changing the path in the code)
  • The user dictionary is edited in ~/pynori/resources/userdict_ko.txt (changes apply immediately)
  • The synonym dictionary is edited in ~/pynori/resources/synonyms.txt.txt (changes apply immediately)
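
The reason the pickle file has to be deleted after editing the csv files follows from the usual "rebuild only when the cache is missing" pattern. The sketch below shows that pattern in general terms; `load_entries` and its arguments are hypothetical and do not reproduce pynori's loader.

```python
import csv
import pickle
from pathlib import Path


def load_entries(csv_dir, cache_path):
    """Load dictionary rows, rebuilding the pickle cache only when it is missing."""
    cache = Path(cache_path)
    if cache.exists():
        # The cache is trusted as-is, so edits to the csv files stay invisible
        # until the cache file is deleted.
        with cache.open('rb') as f:
            return pickle.load(f)
    rows = []
    for csv_file in sorted(Path(csv_dir).glob('*.csv')):
        with csv_file.open(encoding='utf-8', newline='') as f:
            rows.extend(csv.reader(f))
    cache.parent.mkdir(parents=True, exist_ok=True)
    with cache.open('wb') as f:
        pickle.dump(rows, f)
    return rows
```

Under this pattern, adding a csv file changes nothing until the cache is removed; the next load then rebuilds it from every csv file in the directory.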

Test

git clone https://github.com/gritmind/python-nori.git
cd python-nori
python -m unittest -v pynori.tests.test_korean_analyzer
python -m unittest -v pynori.tests.test_korean_tokenizer

Property

  • [Original] Lucene Nori morphological analyzer (ref.1)
  • Implemented to follow the original code as closely as possible (variable/file names, code patterns, etc.)
  • Uses mecab-ko-dic-2.1.1-20180720 as the language resource
  • Uses a Trie data structure for dictionary lookup (still to be supplemented with an FST)
  • token & dictionary objects modified
  • circular buffer & wordID not used
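
For readers unfamiliar with trie-based dictionary lookup, here is a minimal sketch of a common-prefix search, the operation a tokenizer needs to find every dictionary word starting at a given position. It is a generic trie, not pynori's implementation; the class, method names and POS tags are illustrative assumptions.

```python
class Trie:
    """Character trie mapping dictionary words to arbitrary values."""

    def __init__(self):
        self.children = {}
        self.value = None
        self.is_word = False

    def insert(self, word, value=None):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True
        node.value = value

    def common_prefixes(self, text, start=0):
        """Yield (word, value) for every dictionary word that starts at text[start]."""
        node = self
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                return
            if node.is_word:
                yield text[start:end + 1], node.value


trie = Trie()
for word, tag in [('냉장', 'NNG'), ('냉장고', 'NNG'), ('고', 'NNG')]:
    trie.insert(word, tag)

print(list(trie.common_prefixes('냉장고')))
# [('냉장', 'NNG'), ('냉장고', 'NNG')]
print(list(trie.common_prefixes('냉장고', start=2)))
# [('고', 'NNG')]
```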

Improvements over the original Nori

  • Outputs token information (Unknown/Known/User, POS type)
  • Fixes a synonym parsing error that occurred when a user word starts with or contains special characters
  • Adds infl_decompound_mode
  • Adds dynamic control of KoreanAnalyzer options
  • Synonym filtering: adds representative-term handling
  • Fixes Unknown tokens growing indiscriminately long
  • Supports parallel processing

TODO

  • Rearrange token index/position after filtering
  • KoreanTokenizer TODO List (MAX_BACKTRACE_GAP, isLowSurrogate, UnicodeScript ...)
  • Optimize algorithms and data structures for speed

Comparison Study

| Items  | Hannanum 0.8.4 | Kkma 2.0      | Twitter 1.14.7 | Pynori 0.1.0  |
|--------|----------------|---------------|----------------|---------------|
| 1      | 0.00138 sec    | 0.00244 sec   | 0.00051 sec    | 0.00279 sec   |
| 10     | 0.03467 sec    | 0.07546 sec   | 0.01188 sec    | 0.09655 sec   |
| 100    | 0.28960 sec    | 0.70480 sec   | 0.09319 sec    | 0.72207 sec   |
| 1000   | 2.59061 sec    | 6.38031 sec   | 0.94029 sec    | 6.46660 sec   |
| 10000  | 27.61180 sec   | 77.73616 sec  | 11.43677 sec   | 68.20249 sec  |
| 100000 | 262.72305 sec  | 699.70416 sec | 95.79926 sec   | 672.83272 sec |
  • Processing speed was compared against several Korean morphological analyzers while increasing the amount of data (see ./tests/test_compare_morphs.py).
  • All of the compared analyzers are available through the Python library konlpy, but internally they run on the JVM.
  • pynori runs as pure Python, yet apart from Twitter the gap is modest (about 2.6 times Hannanum's time at 100,000 items), and it is faster than Kkma 2.0 at 10,000 items and above.
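
A comparison of this shape can be approximated with a small timing loop. The sketch below is not the script in ./tests/test_compare_morphs.py; `time_analyzer` is a hypothetical helper, and `str.split` stands in for a real analyzer call such as nori.do_analysis.

```python
import time


def time_analyzer(analyze, sentences, sizes=(1, 10, 100)):
    """Return {size: elapsed seconds} for analyzing the first `size` sentences."""
    timings = {}
    for size in sizes:
        batch = sentences[:size]
        start = time.perf_counter()
        for sentence in batch:
            analyze(sentence)
        timings[size] = time.perf_counter() - start
    return timings


# str.split is only a placeholder for a real analyzer.
sentences = ['가벼운 냉장고'] * 100
for size, seconds in time_analyzer(str.split, sentences).items():
    print(f'{size} items: {seconds:.5f} sec')
```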

Release History

| Version      | Changes                                                                        | Date         |
|--------------|--------------------------------------------------------------------------------|--------------|
| pynori 0.1.0 | Ported the basic Nori modules to Python and implemented the unit tests         | Nov 17, 2019 |
| pynori 0.1.1 | Faster KoreanAnalyzer initialization (1min 15s -> 12.9s)                       | Apr 16, 2020 |
| pynori 0.1.2 | Added infl_decompound_mode                                                     | Apr 23, 2020 |
| pynori 0.1.3 | Added dynamic control of KoreanAnalyzer options                                | Apr 25, 2020 |
| pynori 0.2.0 | Added the synonym processing module (SynonymGraphFilter)                       | Jun 6, 2020  |
| pynori 0.2.1 | Added logic to mitigate long Unknown tokens                                    | Jul 19, 2020 |
| pynori 0.2.4 | Faster initialization with gc.disable (-> 5s) and parallel processing support  | Aug 18, 2021 |

License

  • Apache License 2.0

Reference

  1. (Github) Lucene-solr - Nori
  2. (Github) Mecab-ko-dic
  3. (Blog) 'Nori', the official Elasticsearch plugin for Korean analysis
  4. (Blog) Nori morphological analyzer Deep Dive