Text-to-Text Transformer
This repository contains a Transformer model for Korean QA tasks in the text-to-text format of Google's T5 (Text-To-Text Transfer Transformer). The overall architecture follows the basic Transformer model.
pip install transformer-korean
- 2019.12.25, version 0.0.3: fixed errors in load_data_txt and load_data_csv
- 2019.12.23, version 0.0.1: initial release
0. Pre-trained Models
- Text-to-Text Transformer-Base, Korean Model: 12-layer, 768-hidden, 12-heads (not released)
- Text-to-Text Transformer-Small, Korean Model: 6-layer, 512-hidden, 8-heads (not released)
From the T5 paper: "Base. This is our baseline model, whose hyperparameters are described in Section 3.1.1. It has roughly 220 million parameters. Small. We consider a smaller model, which scales the baseline down by using d_model = 512, d_ff = 2,048, 8-headed attention, and only 6 layers each in the encoder and decoder. This variant has about 60 million parameters."
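For reference, a rough mapping (an illustrative assumption, not the released configuration) of the Small settings above onto the Transformer constructor arguments used in the training examples below; vocab_size depends on the loaded vocabulary file:

small_config = dict(num_layers=6,   # 6 layers each in encoder and decoder
                    d_model=512,    # hidden size
                    dff=2048,       # feed-forward inner dimension
                    num_heads=8,    # attention heads
                    rate=0.1)       # dropout rate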
1. Pre-training
1.1 Unsupervised objective
Pre-training uses the BERT-style objective, which the T5 paper reports as the best-performing option. As in BERT, 15% of the input tokens are masked at random: 80% of the selected tokens are replaced with the <MASK> token, 10% are replaced with a random token from the vocabulary, and the remaining 10% keep the original word.
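For illustration, a minimal sketch of this corruption rule on a sequence of token IDs (the helper name, mask ID, and vocabulary size are placeholders; the repository's DataProcessor presumably applies the equivalent step internally when pre_train=True):

import numpy as np

def bert_style_mask(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=None):
    # Select ~15% of positions; of those, 80% become <MASK>, 10% become a
    # random vocabulary token, and 10% keep the original token.
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    corrupted = token_ids.copy()
    for i in np.where(rng.random(len(token_ids)) < mask_prob)[0]:
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_id                      # 80%: <MASK> token
        elif r < 0.9:
            corrupted[i] = rng.integers(0, vocab_size)  # 10%: random token
        # else: 10% keep the original token
    return corrupted, token_ids                         # (model input, target)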
1.2 Example sentences
Input sentence: In 1900, <MASK> <MASK> was 'variously' adapted into Puccini's opera Tosca. (BERT-style corrupted input; 'variously' is a random-token replacement)
Target sentence: In 1900, Sardou's play was newly adapted into Puccini's opera Tosca. (original text)
1.3 Unlabeled dataset
Pre-training uses the Korean Wikipedia (January 2019 dump, roughly 3.5 million sentences). Training sentences are constructed one per line, as in the examples below (a small file-preparation sketch follows the examples).
La Tosca is a play written in 1887 by the French playwright Sardou for the actress Sarah Bernhardt.
It was first staged in Paris in 1887.
In 1990 it was restaged in New York, USA, with Bernhardt in the leading role.
It is set in Rome, Italy, in mid-June 1800, and the story unfolds against the circumstances of that period.
In 1900, Sardou's play was newly adapted into Puccini's opera Tosca.
Verdi recommended revising the "abrupt ending" of Sardou's script, but Sardou refused.
Later, Puccini also proposed revising the "abrupt ending" of Sardou's script, but in the end he could not persuade Sardou.
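A minimal file-preparation sketch, assuming one UTF-8 sentence per line as shown above (the sentence splitter and helper name are placeholders, not part of this repository):

import re

def write_pretraining_txt(documents, out_path="ko-wiki_20190621.txt"):
    # Write one sentence per line; replace the naive regex split with a proper
    # Korean sentence splitter for real data.
    with open(out_path, "w", encoding="utf-8") as f:
        for doc in documents:
            for sent in re.split(r"(?<=[.!?])\s+", doc.strip()):
                if sent:
                    f.write(sent + "\n")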
1.4 Training example
from transformer_korean.run_training import Trainer
from transformer_korean.transformer import Transformer
from transformer_korean.preprocess import DataProcessor
from transformer_korean.custom_scheduler import CustomSchedule
import tensorflow as tf
path = "ko-wiki_20190621.txt"
# Data Processing
print('Loading Pre-training data')
data_preprocess = DataProcessor(txt_path=path,
                                batch_size=64,
                                pre_train=True,
                                max_length=128)
train = data_preprocess.load_data_txt()
print('Loading Vocab File')
vocab = data_preprocess.load_vocab_file(vocab_filename="vocab")
print('Create train dataset')
train_dataset = data_preprocess.preprocess(train)
EPOCHS = 100
num_layers = 6
d_model = 128
dff = 512
num_heads = 8
vocab_size = vocab.vocab_size
dropout_rate = 0.1
encoder_activation = 'gelu'
decoder_activation = 'relu'
# Custom Scheduler
learning_rate = CustomSchedule(d_model, warmup_steps=4000)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)
# Transformer
transformer = Transformer(d_model=d_model,
                          num_heads=num_heads,
                          num_layers=num_layers,
                          vocab_size=vocab_size,
                          dff=dff,
                          enc_activation=encoder_activation,
                          dec_activation=decoder_activation,
                          rate=dropout_rate)
# Trainer
trainer = Trainer(train_dataset=train_dataset,
                  learning_rate=learning_rate,
                  optimizer=optimizer,
                  transformer=transformer,
                  epochs=EPOCHS,
                  checkpoint_path='./checkpoints/',
                  load_checkpoints=False,
                  save_checkpoints_epochs=10)
trainer.train()
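CustomSchedule takes d_model and warmup_steps, so it presumably implements the warmup/inverse-square-root schedule from the original Transformer paper; the sketch below is an assumption for reference, not code taken from this repository:

import tensorflow as tf

class AssumedCustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    # lr(step) = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(
            tf.math.rsqrt(step), step * (self.warmup_steps ** -1.5))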
2. Fine-Tuning (QA Task)
2.1 Labeled dataset
For the QA task, fine-tuning uses KorQuAD 1.1, a Korean QA dataset. The data is organized as below: the question is the input and the answer is the target (a conversion sketch from the KorQuAD JSON to the CSV files used below follows the examples).
Q
After reading Goethe's Faust, what did Wagner want to write?
How far had Wagner gotten in composing the symphony before he stopped?
Which piece influenced Wagner when he wrote the Faust Overture?
What book did Wagner intend to use as the subject of a symphony in 1839?
Which Beethoven work influenced the D minor key of the Faust Overture?
A
A symphony
The first movement
Beethoven's Symphony No. 9
Faust
The Choral Symphony
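A minimal conversion sketch, assuming the standard KorQuAD 1.1 JSON layout and a one-column-per-row CSV format (the exact layout expected by load_data_csv is an assumption):

import csv
import json

def korquad_to_csv(json_path, q_path="KorQuAD_train_q.csv", a_path="KorQuAD_train_a.csv"):
    # Write one question per row and the first answer span per row, in matching order.
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)["data"]
    with open(q_path, "w", encoding="utf-8", newline="") as fq, \
         open(a_path, "w", encoding="utf-8", newline="") as fa:
        q_writer, a_writer = csv.writer(fq), csv.writer(fa)
        for article in data:
            for paragraph in article["paragraphs"]:
                for qa in paragraph["qas"]:
                    q_writer.writerow([qa["question"]])
                    a_writer.writerow([qa["answers"][0]["text"]])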
2.2 Training example
from transformer_korean.run_training import Trainer
from transformer_korean.transformer import Transformer
from transformer_korean.preprocess import DataProcessor
from transformer_korean.custom_scheduler import CustomSchedule
import tensorflow as tf
question = "KorQuAD_train_q.csv"
answer = "KorQuAD_train_a.csv"
# Data Processing
print('Loading fine-tuning data')
data_preprocess = DataProcessor(csv_path=[question, answer],
                                batch_size=64,
                                pre_train=False,
                                max_length=128)
train = data_preprocess.load_data_csv()
print('Loading Vocab File')
vocab = data_preprocess.load_vocab_file(vocab_filename="vocab")
print('Create train dataset')
train_dataset = data_preprocess.preprocess(train)
EPOCHS = 100
num_layers = 6
d_model = 128
dff = 512
num_heads = 8
vocab_size = vocab.vocab_size
dropout_rate = 0.1
encoder_activation = 'gelu'
decoder_activation = 'relu'
# Custom Scheduler
learning_rate = CustomSchedule(d_model, warmup_steps=4000)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)
# Transformer
transformer = Transformer(d_model=d_model,
                          num_heads=num_heads,
                          num_layers=num_layers,
                          vocab_size=vocab_size,
                          dff=dff,
                          enc_activation=encoder_activation,
                          dec_activation=decoder_activation,
                          rate=dropout_rate)
# Trainer
trainer = Trainer(train_dataset=train_dataset,
                  learning_rate=learning_rate,
                  optimizer=optimizer,
                  transformer=transformer,
                  epochs=EPOCHS,
                  checkpoint_path='./checkpoints/',
                  load_checkpoints=True,
                  save_checkpoints_epochs=10)
trainer.train()
3. Activation Function
In addition to the default relu activation function, four activation functions have been added (gelu, swish, swish_beta, mish), and the encoder and decoder blocks can each use a different activation function. The snippets below assume that tensorflow (tf) and numpy (np) have been imported.
- gelu
def gelu(x):
    cdf = 0.5 * (1.0 + tf.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))
    return x * cdf
- swish
def swish(x):
    return x * tf.nn.sigmoid(x)
- swish_beta
def swish_beta(x):
    # beta is a trainable scaling parameter; note that a new variable is created on each call
    beta = tf.Variable(initial_value=1.0, trainable=True, name='swish_beta')
    return x * tf.nn.sigmoid(beta * x)
- mish
def mish(x):
    return x * tf.math.tanh(tf.math.softplus(x))
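A quick numerical sanity check of the functions above (values are approximate; with beta initialised to 1.0, swish_beta matches swish):

x = tf.constant([-1.0, 0.0, 1.0])
print(gelu(x).numpy())   # approx [-0.159, 0.0, 0.841]
print(swish(x).numpy())  # approx [-0.269, 0.0, 0.731]
print(mish(x).numpy())   # approx [-0.303, 0.0, 0.865]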
4. Requirements
Python == 3.x
tensorflow >=2.0
tensorflow-datasets >= 1.3.2
pandas >= 0.24.2
numpy >= 1.16.3
six>=1.12.0
5. To-Do
- TPU and multi-GPU support (planned)
- Dropout fix (planned)
- Predict module (planned)