BABIP

Easy Tools for Improving Machine Learning Models



Documentation

A Python library for easily improving machine learning models

Features

  • ๊ฐœ๋ฐœ ์˜๋„: ์ธ๊ณต ์ง€๋Šฅ ๋ถ„์•ผ์˜ ๋†’์€ ์ง„์ž…์žฅ๋ฒฝ์„ ์—†์• ๊ณ ์ž ๊ฐœ๋ฐœ
  • ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ ์ตœ์ ํ™”: ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ์— ์–ด๋ ค์›€์„ ๊ฒช์ง€ ์•Š๋„๋ก ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ์— ์ตœ์ ํ™”
  • ์‰ฌ์šด ์‚ฌ์šฉ: ๋ชจ๋“  ๊ธฐ๋Šฅ์„ ์ฝ”๋“œ ํ•œ ์ค„๋กœ ๊ตฌํ˜„ ๊ฐ€๋Šฅ

Installation

Dependencies

BABIP requires:

  • scikit-learn

User Installation

pip install BABIP

Modules

rec


์ถ”์ฒœ ์‹œ์Šคํ…œ ๊ฐœ๋ฐœ์„ ์œ„ํ•œ ๋ชจ๋“ˆ

  • tfidf(): builds a TF-IDF matrix
rec.tfidf(data, col_contents)
  • cos_similarity(): computes cosine similarity
rec.cos_similarity(data, col_contents)
  • recommend(): returns recommendations
rec.recommend(title, data, col_title, col_contents)
์˜ˆ์‹œ ๋ฐ์ดํ„ฐ (movie_data.csv)
title story
ํด๋ž˜์‹ ๊ฐ™์€ ๋Œ€ํ•™์— ๋‹ค๋‹ˆ๋Š” ์ง€ํ˜œ(์†์˜ˆ์ง„ ๋ถ„)์™€ ์ˆ˜๊ฒฝ(์ด์ˆ˜์ธ ๋ถ„)์€ ์—ฐ๊ทน๋ฐ˜ ์„ ๋ฐฐ ์ƒ๋ฏผ(์กฐ์ธ์„ฑ ๋ถ„)์„ ์ข‹์•„ํ•œ๋‹ค. ํ•˜์ง€๋งŒ...
์ธํ„ฐ์Šคํ…”๋ผ ์„ธ๊ณ„ ๊ฐ๊ตญ์˜ ์ •๋ถ€์™€ ๊ฒฝ์ œ๊ฐ€ ์™„์ „ํžˆ ๋ถ•๊ดด๋œ ๋ฏธ๋ž˜๊ฐ€ ๋‹ค๊ฐ€์˜จ๋‹ค. ์ง€๋‚œ 20์„ธ๊ธฐ์— ๋ฒ”ํ•œ ์ž˜๋ชป์ด ์ „ ์„ธ๊ณ„์ ์ธ ์‹๋Ÿ‰ ๋ถ€์กฑ์„ ๋ถˆ๋Ÿฌ์™”๊ณ , NASA๋„ ํ•ด์ฒด๋˜์—ˆ๋‹ค. ์ด๋•Œ...
์ธ์…‰์…˜ ๋“œ๋ฆผ๋จธ์‹ ์ด๋ผ๋Š” ๊ธฐ๊ณ„๋กœ ํƒ€์ธ์˜ ๊ฟˆ๊ณผ ์ ‘์†ํ•ด ์ƒ๊ฐ์„ ๋นผ๋‚ผ ์ˆ˜ ์žˆ๋Š” ๋ฏธ๋ž˜์‚ฌํšŒ.โ€˜๋” ์ฝ”๋ธŒโ€™(๋ ˆ์˜ค๋‚˜๋ฅด๋„ ๋””์นดํ”„๋ฆฌ์˜ค)๋Š” ์ƒ๊ฐ์„ ์ง€ํ‚ค๋Š” ํŠน์ˆ˜๋ณด์•ˆ์š”์›์ด๋ฉด์„œ ๋˜ํ•œ ์ตœ๊ณ ์˜ ์‹ค๋ ฅ์œผ๋กœ ์ƒ๊ฐ์„ ํ›”์น˜๋Š” ๋„๋‘‘์ด๋‹ค...
์˜ˆ์‹œ ์ฝ”๋“œ
from babip import rec
import pandas as pd

# ๋ฐ์ดํ„ฐ ๋กœ๋“œ
data_path = './../movie_data.csv'
data = pd.read_csv(data_path, encoding='cp949')

# ์นผ๋Ÿผ ์ด๋ฆ„
col_title = 'title' # ์•„์ดํ…œ ์ด๋ฆ„ ์ •๋ณด๋ฅผ ๋‹ด๋Š” ์นผ๋Ÿผ์˜ ์ด๋ฆ„
col_contents = 'story' # ๋ฌธ์„œ ์œ ์‚ฌ๋„ ๋น„๊ต์— ์“ฐ์ผ ์ž์—ฐ์–ด ์ •๋ณด๋ฅผ ๋‹ด๋Š” ์นผ๋Ÿผ์˜ ์ด๋ฆ„

# TF-IDF ํ–‰๋ ฌ
matrix = rec.tfidf(data, col_contents)

# ์ฝ”์‚ฌ์ธ ์œ ์‚ฌ๋„
similarity = rec.cos_similarity(data, col_contents)

# ์ถ”์ฒœ ์‹œ์Šคํ…œ
title = '์ธํ„ฐ์Šคํ…”๋ผ' # col_title ์นผ๋Ÿผ์˜ ๊ฐ’ ์ค‘ ํ•˜๋‚˜ ์„ ํƒ
recommendations = rec.recommend(title, data, col_title, col_contents)
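BABIP's own source is not shown on this page, but the three functions above correspond to a standard content-based pipeline. The sketch below reproduces that pipeline directly with scikit-learn, under the assumption that this is what `rec` wraps; the toy DataFrame and variable names are illustrative only.

```python
# A minimal content-based recommender: the TF-IDF + cosine-similarity pipeline
# that tfidf(), cos_similarity(), and recommend() expose as one-liners.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

data = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'story': ['space travel wormhole', 'space station crew', 'college romance drama'],
})

# TF-IDF matrix over the contents column
matrix = TfidfVectorizer().fit_transform(data['story'])

# Pairwise cosine similarity between all documents
similarity = cosine_similarity(matrix)

# Recommend: rank the other titles by similarity to the query title
idx = data.index[data['title'] == 'A'][0]
order = similarity[idx].argsort()[::-1]
recommendations = [data['title'][i] for i in order if i != idx]
print(recommendations)  # most similar title first
```

Here 'A' and 'B' share the word "space" while 'C' shares nothing, so 'B' ranks first.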

koda


ํ•œ๊ตญ์–ด ๋ฐ์ดํ„ฐ ์ฆ๊ฐ•์„ ์œ„ํ•œ ๋ชจ๋“ˆ

  • synonym_replace(): ์œ ์˜์–ด ๋Œ€์ฒด
koda.synonym_replace(sentence)
  • random_insert(): ๋‹จ์–ด ๋žœ๋ค ์‚ฝ์ž…
koda.random_insert(sentence)
  • random_swap(): ๋‹จ์–ด ์ž๋ฆฌ ๋ฐ”๊พธ๊ธฐ
koda.random_swap(sentence)
  • random_delete(): ๋‹จ์–ด ๋žœ๋ค ์‚ญ์ œ
koda.random_delete(sentence)
์˜ˆ์‹œ ์ฝ”๋“œ
from babip import koda


# ์˜ˆ์‹œ ๋ฌธ์žฅ
sentence = '๋กฏ๋ฐ ์ž์ด์–ธ์ธ  ์œ ๋‹ˆํผ์„ 16๋…„์งธ ์ž…๊ณ  ์žˆ๋Š” ์ „์ค€์šฐ๋Š” ํฌ์ŠคํŠธ์‹œ์ฆŒ ์ง„์ถœ์ด ๋„ˆ๋ฌด๋‚˜๋„ ๊ฐ„์ ˆํ•˜๋‹ค. ๊ทธ๋Š” "์˜ฌํ•ด ์šฐ๋ฆฌ ํŒ€์ด ๊ฐ€์„์•ผ๊ตฌ์— ์ง„์ถœํ•˜์ง€ ๋ชปํ•˜๋ฉด ๋„ˆ๋ฌด ์•„์‰ฌ์šธ ๊ฒƒ ๊ฐ™๋‹ค"๊ณ  ๋งํ–ˆ๋‹ค.'

# ์œ ์˜์–ด ๋Œ€์ฒด
sr = koda.synonym_replace(sentence)

# ๋‹จ์–ด ๋žœ๋ค ์‚ฝ์ž…
ri = koda.random_insert(sentence)

# ๋‹จ์–ด ์ž๋ฆฌ ๋ฐ”๊พธ๊ธฐ
rs = koda.random_swap(sentence)

# ๋‹จ์–ด ๋žœ๋ค ์‚ญ์ œ
rd = koda.random_delete(sentence)
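These four operations are the classic EDA (Easy Data Augmentation) transforms. As an illustration of the idea behind one of them, here is a minimal word-swap sketch in plain Python; `random_swap_sketch` is a hypothetical name, not BABIP's actual implementation, which may tokenize Korean differently.

```python
# Sketch of the word-swap augmentation idea: split on whitespace,
# swap two randomly chosen positions, and rejoin.
import random

def random_swap_sketch(sentence, seed=None):
    rng = random.Random(seed)
    words = sentence.split()
    if len(words) < 2:
        return sentence  # nothing to swap
    i, j = rng.sample(range(len(words)), 2)  # two distinct positions
    words[i], words[j] = words[j], words[i]
    return ' '.join(words)

print(random_swap_sketch('롯데 자이언츠 유니폼을 16년째 입고 있는 전준우', seed=0))
```

The augmented sentence keeps the same words in a different order, which is the point of this transform: it produces label-preserving variants for training.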

overfit


๊ณผ์ ํ•ฉ ๋ฐฉ์ง€๋ฅผ ์œ„ํ•œ ๋ชจ๋“ˆ

  • is_overfit(): ๊ณผ์ ํ•ฉ ํƒ์ง€
overfit.is_overfit(model, X_train, y_train, X_test, y_test)
์˜ˆ์‹œ ์ฝ”๋“œ
from babip import overfit
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


# ๋ฐ์ดํ„ฐ ๋กœ๋“œ
data_path = './../data.csv'
data = pd.read_csv(data_path)

# ๋ฐ์ดํ„ฐ ๋ถ„ํ• 
X = data['data']
Y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, Y, stratify=Y, random_state=777)

# ๋ชจ๋ธ ํ›ˆ๋ จ
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# ๊ณผ์ ํ•ฉ ํƒ์ง€
overfit_result = overfit.is_overfit(model, X_train, y_train, X_test, y_test)
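The exact rule inside `is_overfit()` is not documented here, but a common heuristic is to flag a large gap between train and test accuracy. The sketch below shows that heuristic on a built-in dataset; the `gap=0.1` threshold and the name `is_overfit_sketch` are assumptions for illustration.

```python
# Overfitting heuristic sketch: a model that scores much better on the
# training set than on the held-out test set is likely overfit.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def is_overfit_sketch(model, X_train, y_train, X_test, y_test, gap=0.1):
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    return (train_acc - test_acc) > gap  # True = gap exceeds the threshold

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=777)

# An unpruned tree memorizes the training set (train accuracy 1.0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(is_overfit_sketch(model, X_train, y_train, X_test, y_test))
```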

hyperopt


ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ๋ชจ๋“ˆ

  • grid_search(): Grid Search๋ฅผ ํ†ตํ•ด ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™”
hyperopt.grid_search(model, parameters, X_train, y_train)
  • random_search(): Random Search๋ฅผ ํ†ตํ•ด ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™”
hyperopt.random_search(model, parameters, X_train, y_train)
์˜ˆ์‹œ ์ฝ”๋“œ
from babip import hyperopt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


# ๋ฐ์ดํ„ฐ ๋กœ๋“œ
data_path = './../data.csv'
data = pd.read_csv(data_path)

# ๋ฐ์ดํ„ฐ ๋ถ„ํ• 
X = data['data']
Y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, Y, stratify=Y, random_state=777)

# ๋ชจ๋ธ ํ›ˆ๋ จ
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํƒ์ƒ‰ ๊ตฌ๊ฐ„ ์„ค์ •
parameters = {
    'criterion': ['gini', 'entropy'], 
    'max_depth': [None, 2, 3, 4, 5, 6],
    'max_leaf_nodes': range(5, 101, 5)
}

# Grid Search๋ฅผ ํ†ตํ•ด ์ตœ์ ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ’ ํƒ์ƒ‰
best_parameters_gs = hyperopt.grid_search(model, parameters, X_train, y_train)

# Random Search๋ฅผ ํ†ตํ•ด ์ตœ์ ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ’ ํƒ์ƒ‰
best_parameters_rs = hyperopt.random_search(model, parameters, X_train, y_train)
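Grid search and random search are the two standard exhaustive/sampled strategies, and these one-liners presumably wrap scikit-learn's `GridSearchCV` and `RandomizedSearchCV`. The sketch below shows the same workflow with those estimators directly, on a built-in dataset and a reduced parameter grid so it runs quickly.

```python
# Grid search vs. random search with scikit-learn directly:
# both cross-validate candidate settings and expose best_params_.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=777)

parameters = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 3],
}

# Grid search: tries every combination (2 * 3 = 6 fits per CV fold)
gs = GridSearchCV(DecisionTreeClassifier(random_state=0), parameters, cv=3)
gs.fit(X_train, y_train)
best_parameters_gs = gs.best_params_

# Random search: samples a fixed number of combinations from the same space
rs = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), parameters,
                        n_iter=4, cv=3, random_state=0)
rs.fit(X_train, y_train)
best_parameters_rs = rs.best_params_

print(best_parameters_gs, best_parameters_rs)
```

Random search pays off when the grid is large: with `n_iter` fixed, its cost stays flat while grid search grows with the product of all parameter-list lengths.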