kobart-transformers

Kobart model on huggingface transformers


Keywords
kobart, huggingface, deep learning
Licenses
BSD-3-Clause
Install
pip install kobart-transformers==0.1.4

Documentation

KoBART-Transformers

  • KoBART, released by SKT, ported to transformers for convenient use.

Install (Optional)

  • BartModelκ³Ό PreTrainedTokenizerFastλ₯Ό μ΄μš©ν•˜λ©΄ μ„€μΉ˜ν•˜μ‹€ ν•„μš” μ—†μŠ΅λ‹ˆλ‹€.
pip install kobart-transformers

Tokenizer

  • Implemented with PreTrainedTokenizerFast.
  • Equivalent to PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart").
>>> from kobart_transformers import get_kobart_tokenizer
>>> # from transformers import PreTrainedTokenizerFast

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> # kobart_tokenizer = PreTrainedTokenizerFast.from_pretrained("hyunwoongko/kobart")

>>> kobart_tokenizer.tokenize("μ•ˆλ…•ν•˜μ„Έμš”. ν•œκ΅­μ–΄ BART μž…λ‹ˆλ‹€.🀣:)l^o")
['β–μ•ˆλ…•ν•˜', 'μ„Έμš”.', 'β–ν•œκ΅­μ–΄', '▁B', 'A', 'R', 'T', 'β–μž…', 'λ‹ˆλ‹€.', '🀣', ':)', 'l^o']

Model

  • Implemented with BartModel.
  • Equivalent to BartModel.from_pretrained("hyunwoongko/kobart").
>>> from kobart_transformers import get_kobart_model, get_kobart_tokenizer
>>> # from transformers import BartModel

>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = get_kobart_model()
>>> # model = BartModel.from_pretrained("hyunwoongko/kobart")

>>> inputs = kobart_tokenizer(['μ•ˆλ…•ν•˜μ„Έμš”.'], return_tensors='pt')
>>> model(inputs['input_ids'])
Seq2SeqModelOutput(last_hidden_state=tensor([[[-0.4488, -4.3651,  3.2349,  ...,  5.8916,  4.0497,  3.5468],
         [-0.4096, -4.6106,  2.7189,  ...,  6.1745,  2.9832,  3.0930]]],
       grad_fn=<TransposeBackward0>), past_key_values=None, decoder_hidden_states=None, decoder_attentions=None, cross_attentions=None, encoder_last_hidden_state=tensor([[[ 0.4624, -0.2475,  0.0902,  ...,  0.1127,  0.6529,  0.2203],
         [ 0.4538, -0.2948,  0.2556,  ..., -0.0442,  0.6858,  0.4372]]],
       grad_fn=<TransposeBackward0>), encoder_hidden_states=None, encoder_attentions=None)
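The returned Seq2SeqModelOutput can be accessed by field name in the usual transformers way. A minimal sketch of picking out the hidden states (shapes follow (batch, sequence length, hidden size)):

outputs = model(inputs['input_ids'])
decoder_hidden = outputs.last_hidden_state          # decoder-side representations
encoder_hidden = outputs.encoder_last_hidden_state  # encoder-side representations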

For Seq2Seq Training

  • For seq2seq training, use get_kobart_for_conditional_generation() as shown below.
  • Equivalent to BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart").
>>> from kobart_transformers import get_kobart_for_conditional_generation
>>> # from transformers import BartForConditionalGeneration

>>> model = get_kobart_for_conditional_generation()
>>> # model = BartForConditionalGeneration.from_pretrained("hyunwoongko/kobart")
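For an actual fine-tuning step, the model follows the standard transformers seq2seq interface, where passing labels makes the forward pass return a loss. The sketch below is only an illustration with a made-up source/target pair, not part of this package:

import torch
from kobart_transformers import get_kobart_for_conditional_generation, get_kobart_tokenizer

model = get_kobart_for_conditional_generation()
tokenizer = get_kobart_tokenizer()

# toy source/target pair for illustration only; use a real dataset in practice
batch = tokenizer(["ν•œκ΅­μ–΄ BART λͺ¨λΈμ„ μ†Œκ°œν•©λ‹ˆλ‹€."], return_tensors="pt", padding=True)
labels = tokenizer(["KoBART μ†Œκ°œ"], return_tensors="pt", padding=True)["input_ids"]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
outputs = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=labels)   # transformers computes the cross-entropy loss from labels
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()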

Update Notes

version 0.1

  • Fixed an error where the pad token was not set.
from kobart_transformers import get_kobart_tokenizer
kobart_tokenizer = get_kobart_tokenizer()
kobart_tokenizer(["ν•œκ΅­μ–΄", "BART λͺ¨λΈμ„", "μ†Œκ°œν•©λ‹ˆλ‹€."], truncation=True, padding=True)
{
'input_ids': [[28324, 3, 3, 3, 3], [15085, 264, 281, 283, 24224], [15630, 20357, 3, 3, 3]], 
'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 
'attention_mask': [[1, 0, 0, 0, 0], [1, 1, 1, 1, 1], [1, 1, 0, 0, 0]]
}

version 0.1.3

  • Registered get_kobart_for_conditional_generation() in __init__.py.

version 0.1.4

  • λˆ„λ½λ˜μ—ˆλ˜ special_tokens_map.json을 μΆ”κ°€ν•˜μ˜€μŠ΅λ‹ˆλ‹€.
  • 이제 pip install 없이 KoBARTλ₯Ό μ΄μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.
  • thanks to bernardscumm

Reference