data augmentation tool for Korean


Keywords
text, augmentation, korean
License
Other
Install
pip install ktextaug==0.1.9rc8

Documentation

ktextaug

A data augmentation toolkit for Korean text. It currently provides transformative text augmentation methods only; generative text augmentation methods are planned (hopefully by mid-April).

The package is being developed with reference to the internals of the transformers package.

Current version: 0.1.9

  • TextAugmentation() processes bulk data, i.e. large amounts of data, with multiprocessing.
  • The code was modified so that it can run under multiprocessing.
  • Added a default subword tokenizer whose vocab includes noise, along with other tokenizers.

Schedule

  • End of April: add generative models (speed issues must be resolved first)
  • May: testing and first official release (tentative)

Installation

Prerequisites

  • Python >= 3.6

  • Beautifulsoup4>=4.6.0 # for synonym search

  • Googletrans==3.1.0a0 # for backtranslation

  • konlpy>=0.5.2 # for Mecab tokenizer

  • PyKomoran>=0.1.5 # for Komoran tokenizer

  • transformers>=2.6.0 # for subword tokenizer

Running the examples may additionally require pandas and parmap.

Install from the command line:

pip install ktextaug

Build from source (latest):

git clone https://github.com/jucho2725/ktextaug.git

cd ktextaug

python setup.py install
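
After installing, it can help to confirm that the prerequisites listed above are importable. The following is a minimal sketch, not part of ktextaug: the helper name check_prerequisites and the PREREQUISITES mapping are assumptions made for illustration, and the import names (for example bs4 for beautifulsoup4) are the usual ones for these packages.

```python
from importlib.util import find_spec

# Package name -> import name. Hypothetical mapping written for this check;
# beautifulsoup4 is imported as bs4, the others share their package name.
PREREQUISITES = {
    "beautifulsoup4": "bs4",
    "googletrans": "googletrans",
    "konlpy": "konlpy",
    "PyKomoran": "PyKomoran",
    "transformers": "transformers",
}


def check_prerequisites(modules=PREREQUISITES):
    """Return {package name: True/False} depending on whether it is importable."""
    return {pkg: find_spec(mod) is not None for pkg, mod in modules.items()}


if __name__ == "__main__":
    for pkg, ok in check_prerequisites().items():
        print(f"{pkg:15s} {'ok' if ok else 'MISSING'}")
```

This only checks that each module can be found. It does not verify versions, and it does not check the separately installed Mecab or Komoran engines.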

Getting Started

A simple example of using ktextaug.

From version 0.1.9 onward, using TextAugmentation() is the recommended default. It uses multiprocessing so that large amounts of data can be processed quickly.

from ktextaug import TextAugmentation

sample_text = '달리는 기차 위에 중립은 없다. 미국의 사회 운동가이자 역사학자인 하워드 진이 남긴 격언이다.'
sample_texts = ['프로그램 개발이 끝나고 서비스가 진행된다.', '도움말을 보고 싶다면 --help를 입력하면 된다.']
agent = TextAugmentation(tokenize_fn="mecab")
print(agent.generate(sample_text)) # default is back_translation
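
To illustrate the general idea of processing texts in bulk with multiprocessing, here is a minimal sketch built on the standard library. It is not how TextAugmentation() is implemented; augment_bulk and reverse_words are hypothetical names, and reverse_words is only a stand-in for a real augmentation function.

```python
from multiprocessing import Pool


def augment_bulk(texts, augment_fn, processes=1, chunksize=64):
    """Apply augment_fn to every text, optionally across worker processes.

    With processes=1 the work runs in the current process, which is handy
    for debugging. With processes > 1, augment_fn must be a picklable,
    module-level function.
    """
    if processes == 1:
        return [augment_fn(t) for t in texts]
    with Pool(processes=processes) as pool:
        # pool.map preserves the input order of the texts.
        return pool.map(augment_fn, texts, chunksize=chunksize)


def reverse_words(text):
    # Stand-in augmentation (hypothetical): reverses the word order.
    return " ".join(reversed(text.split()))
```

Calling augment_bulk(sample_texts, reverse_words, processes=4) from inside an `if __name__ == "__main__":` guard would spread the work over four worker processes. The guard matters on platforms that start workers by spawning a fresh interpreter.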

You can also import and call the augmentation functions directly.

from ktextaug import random_swap

text = "이 문장은 변형적 데이터 증강기법의 예시 문장입니다."
tokenizer = bring_it_your_own   # placeholder: any tokenizer will do
tokens = tokenizer.tokenize(text)
result = random_swap(tokens, 2) # swap the positions of two words in the token sequence (random swap), repeated 2 times
print(result)
>>> ['이', '문장', '은', '예시', '적', '데이터', '기법', '증강', '의', '문장', '변형', '입니다', '.']
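
The random swap operation described in the comment above can be sketched in a few lines of plain Python. This is an illustrative version written under assumptions, not the package's own implementation: the name random_swap_sketch and its parameters are made up for this example.

```python
from random import Random


def random_swap_sketch(tokens, n_swaps=1, rng=None):
    """Return a copy of tokens after n_swaps random pairwise position swaps."""
    rng = rng or Random()
    out = list(tokens)  # copy so the caller's list is left untouched
    if len(out) < 2:
        return out  # nothing to swap
    for _ in range(n_swaps):
        # Pick two distinct positions and exchange their tokens.
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out
```

Passing a seeded generator, for example random_swap_sketch(tokens, 2, Random(2021)), makes the result reproducible across runs. The output always contains the same tokens as the input, only in a different order.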

νŒ¨ν‚€μ§€μ—μ„œ μ œκ³΅ν•˜λŠ” ν˜•νƒœμ†Œ 뢄석기(ν† ν¬λ‚˜μ΄μ €) λͺ¨λ“ˆμ€ mecab λ˜λŠ” komoran을 λΆˆλŸ¬μ˜΅λ‹ˆλ‹€. 두 ν† ν¬λ‚˜μ΄μ € λͺ¨λ‘ λ³„λ„μ˜ μ„€μΉ˜κ³Όμ •μ΄ ν•„μš”ν•˜λ‹ˆ μ•„λž˜ 링크λ₯Ό μ°Έκ³ ν•΄μ£Όμ„Έμš”. μ›ν•˜λŠ” ν† ν¬λ‚˜μ΄μ €λ₯Ό μ‚¬μš©ν•  μˆ˜λ„ μžˆμŠ΅λ‹ˆλ‹€.

  • Mecab μ„€μΉ˜ 방법 [링크] - fabric 으둜 μ‰½κ²Œ μ„€μΉ˜
  • PyKomoran μ„€μΉ˜ 방법 [링크]
from ktextaug.tokenization_utils import get_tokenize_fn
from ktextaug import random_swap

# get_tokenize_fn ν•¨μˆ˜μ˜ μ‚¬μš©μ˜ˆμ‹œ
tokenize_fn = get_tokenize_fn("mecab")

# OR you can use your own tokenizer
result = random_swap(text_or_words=text,
                     tokenize_fn=lambda x: x.split(" "), # lambdaλ‘œλ„ μ‚¬μš© κ°€λŠ₯
                     rng=Random(seed=2021),
                     n_swap=2)

More examples

  • How_to_use covers basic usage and examples related to noise generation.

For more detailed usage, see the examples in the examples folder (tested on 0.1.8).

  • summarize.py : shows an example of each technique.
  • multiprocessing.py : takes a dataset in .csv format and produces an augmented dataset file. The time-consuming techniques are run with multiprocessing.

Test it with sample data (tested on 0.1.8)

So that you can check how well the augmentation techniques perform, a very small dataset is provided in examples/data/. It consists of 1000 examples randomly sampled from the training set of the nsmc dataset. (Source: https://github.com/e9t/nsmc)

Apply the augmentation techniques to this data and compare the results! (An example of handling .csv files can be found in multiprocessing.py.)
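
For readers who want to try this without opening multiprocessing.py, the following is a minimal sketch of augmenting one text column of a CSV file with the standard csv module. It is not the code in multiprocessing.py: augment_csv is a hypothetical helper, and the default column name "document" is an assumption that should be adjusted to match the header of the file in examples/data/.

```python
import csv


def augment_csv(src, dst, augment_fn, text_column="document"):
    """Copy rows from src to dst, adding one augmented row after each original.

    src and dst are open text file objects. Every column except text_column
    (including any label) is copied unchanged into the augmented row.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # keep the original example
        augmented = dict(row)
        augmented[text_column] = augment_fn(row[text_column])
        writer.writerow(augmented)  # add the augmented copy
```

When working with real files, open both with newline="" and encoding="utf-8", as the csv module documentation recommends, and pass an augmentation function of your choice as augment_fn.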

Things to know

  1. Noise generation follows the noise generation code by @hkjeon13 (전현규): https://github.com/hkjeon13/noising-korean

  2. The Korean stopword list is taken as-is from the file at the following link: https://github.com/stopwords-iso/stopwords-ko/blob/master/stopwords-ko.txt

Contribution

This package started as a project in ING-lab, the lab of Professor 정윤경 at Sungkyunkwan University. The people who took part at the time were:

  • 조진욱, 전현규, 박종혁, 이정훈, 정민수

As you can see, the package still has many shortcomings. If you would like to become a contributor, issues, PRs, and the like are welcome at any time :)

Contact: cju2725@gmail.com

TO DO

  1. Add generative models (end of April)
  2. Fix the problem in synonym search when no synonym is found
  3. Write documentation
  4. Address the pkg_resources performance overhead issue: https://docs.python.org/ko/3/library/importlib.html#module-importlib.resources

Acknowledgement

β€œμ΄ κΈ°μˆ μ€ κ³Όν•™κΈ°μˆ μ •λ³΄ν†΅μ‹ λΆ€ 및 μ •λ³΄ν†΅μ‹ κΈ°νšν‰κ°€μ›μ˜ 인곡지λŠ₯ν•΅μ‹¬μΈμž¬μ–‘μ„±μ‚¬μ—…(인곡지λŠ₯λŒ€ν•™μ›μ§€μ›(μ„±κ· κ΄€λŒ€ν•™κ΅), No.2019-0-00421)의 μ—°κ΅¬κ²°κ³Όλ‘œ κ°œλ°œν•œ κ²°κ³Όλ¬Όμž…λ‹ˆλ‹€.”