thaixtransformers

ThaiXtransformers: Use pretrained RoBERTa-based Thai language models from the VISTEC-depa AI Research Institute of Thailand.


Keywords
thainlp, NLP, natural, language, processing, text, analytics, localization, computational, linguistics, Thai
License
Apache-2.0
Install
pip install thaixtransformers==0.1.0

Documentation

ThaiXtransformers


Use pretrained RoBERTa-based Thai language models from the VISTEC-depa AI Research Institute of Thailand.

Forked from vistec-AI/thai2transformers.

This project provides the tokenizer and data preprocessing for the RoBERTa models from the VISTEC-depa AI Research Institute of Thailand.

Paper: WangchanBERTa: Pretraining transformer-based Thai Language Models

Install

pip install thaixtransformers

Usage

Tokenizer

from thaixtransformers import Tokenizer

To use a model, load its tokenizer by the model name:

Tokenizer(model_name) -> Tokenizer

Example

from thaixtransformers import Tokenizer
from transformers import pipeline
from transformers import AutoModelForMaskedLM

tokenizer = Tokenizer("airesearch/wangchanberta-base-wiki-newmm")
model = AutoModelForMaskedLM.from_pretrained("airesearch/wangchanberta-base-wiki-newmm")

classifier = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(classifier("āļœāļĄāļŠāļ­āļš<mask>āļĄāļēāļ āđ†"))
# output:
# [{'score': 0.05261131376028061,
#   'token': 6052,
#   'token_str': 'āļ­āļīāļ™āđ€āļ—āļ­āļĢāđŒāđ€āļ™āđ‡āļ•',
#   'sequence': 'āļœāļĄāļŠāļ­āļšāļ­āļīāļ™āđ€āļ—āļ­āļĢāđŒāđ€āļ™āđ‡āļ•āļĄāļēāļāđ†'},
#  {'score': 0.03980604186654091,
#   'token': 11893,
#   'token_str': 'āļ­āđˆāļēāļ™āļŦāļ™āļąāļ‡āļŠāļ·āļ­',
#   'sequence': 'āļœāļĄāļŠāļ­āļšāļ­āđˆāļēāļ™āļŦāļ™āļąāļ‡āļŠāļ·āļ­āļĄāļēāļāđ†'},
#  ...]
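
Because the object returned by Tokenizer is passed straight to a transformers pipeline, it should follow the standard Hugging Face tokenizer interface, so it can also be used directly for encoding and decoding. A minimal sketch under that assumption (the example sentence is only illustrative):

from thaixtransformers import Tokenizer

tokenizer = Tokenizer("airesearch/wangchanberta-base-wiki-newmm")

# Encode a sentence into input ids and an attention mask (PyTorch tensors).
inputs = tokenizer("āļœāļĄāļŠāļ­āļšāļ­āđˆāļēāļ™āļŦāļ™āļąāļ‡āļŠāļ·āļ­", return_tensors="pt")
print(inputs["input_ids"])

# Decode the ids back into text.
print(tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True))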

Preprocess

If you want to preprocess data before training a model, you can use the preprocess module.

from thaixtransformers.preprocess import process_transformers

process_transformers(str) -> str

Example

from thaixtransformers.preprocess import process_transformers

print(process_transformers("āļŠāļ§āļąāļŠāļ”āļĩ   :D"))
# output: 'āļŠāļ§āļąāļŠāļ”āļĩ<_>:d'
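
In practice you would typically clean raw text with process_transformers before tokenizing it, for example when preparing a corpus for training or inference. A minimal sketch of that workflow (the sample sentences and variable names are illustrative, not part of the library):

from thaixtransformers import Tokenizer
from thaixtransformers.preprocess import process_transformers

tokenizer = Tokenizer("airesearch/wangchanberta-base-wiki-newmm")

raw_texts = ["āļŠāļ§āļąāļŠāļ”āļĩ   :D", "āļœāļĄāļŠāļ­āļšāļ­āđˆāļēāļ™āļŦāļ™āļąāļ‡āļŠāļ·āļ­āļĄāļēāļ āđ†"]

# Clean each sentence, then tokenize the batch with padding.
cleaned = [process_transformers(text) for text in raw_texts]
batch = tokenizer(cleaned, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)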

BibTeX entry and citation info

@misc{lowphansirikul2021wangchanberta,
      title={WangchanBERTa: Pretraining transformer-based Thai Language Models}, 
      author={Lalita Lowphansirikul and Charin Polpanumas and Nawat Jantrakulchai and Sarana Nutanong},
      year={2021},
      eprint={2101.09635},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}