Transformers Inference Toolkit
The goal of transformers-inference-toolkit is to get rid of boilerplate code and to simplify the automatic optimization and inference of Hugging Face Transformers models.
Installation
Using pip:
pip install transformers-inference-toolkit
Optimization
The original transformers library includes the transformers.onnx package, which can be used to convert PyTorch or TensorFlow models into ONNX format. This Toolkit extends that functionality by letting the user automatically optimize the ONNX model graph - this is similar to what
Aside from ONNX conversion, the Toolkit also supports resaving PyTorch models in half precision and setting up DeepSpeed Inference.
Prerequisite
The Toolkit expects your pretrained model (in PyTorch format) and tokenizer to be saved (using the save_pretrained() method) inside a common parent directory, in the model and tokenizer folders respectively. This is how the file structure of the toxic-bert model should look:
toxic-bert
├── model
│   ├── config.json
│   └── pytorch_model.bin
└── tokenizer
    ├── merges.txt
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.json
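One way to produce and check this layout is sketched below. The helper names save_for_toolkit and missing_subfolders are hypothetical and not part of the Toolkit; the sketch only assumes that the model and tokenizer objects expose a save_pretrained() method, as Hugging Face objects do.

```python
from pathlib import Path


def save_for_toolkit(model, tokenizer, parent_dir):
    """Save model and tokenizer into <parent_dir>/model and <parent_dir>/tokenizer.

    Hypothetical helper (not part of the Toolkit). Both objects only need
    a save_pretrained(path) method.
    """
    parent = Path(parent_dir)
    model_dir = parent / "model"
    tokenizer_dir = parent / "tokenizer"
    model_dir.mkdir(parents=True, exist_ok=True)
    tokenizer_dir.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(str(model_dir))
    tokenizer.save_pretrained(str(tokenizer_dir))
    return parent


def missing_subfolders(parent_dir):
    """Return the expected subfolders that are absent from parent_dir."""
    parent = Path(parent_dir)
    return [name for name in ("model", "tokenizer") if not (parent / name).is_dir()]
```

With transformers installed, the two arguments would typically come from AutoModelForSequenceClassification.from_pretrained(...) and AutoTokenizer.from_pretrained(...), called with the same checkpoint name.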
How to use
Most of the popular Transformer model architectures (like BERT and its variations) can be converted with a single command:
from transformers_inference_toolkit import (
    Feature,
    OnnxModelType,
    OnnxOptimizationLevel,
    optimizer,
)

optimizer.pack_onnx(
    input_path="toxic-bert",
    output_path="toxic-bert-optimized",
    feature=Feature.SEQUENCE_CLASSIFICATION,
    for_gpu=True,
    fp16=True,
    optimization_level=OnnxOptimizationLevel.FULL,
)
If your model architecture is not supported out of the box (described here), you can try writing a custom OnnxConfig class:
from collections import OrderedDict

from transformers.onnx import OnnxConfig

# optimizer and Feature are imported as in the previous example.


class MPNetOnnxConfig(OnnxConfig):
    @property
    def default_onnx_opset(self):
        return 14

    @property
    def inputs(self):
        # Both inputs have dynamic batch and sequence dimensions.
        dynamic_axis = {0: "batch", 1: "sequence"}
        return OrderedDict(
            [
                ("input_ids", dynamic_axis),
                ("attention_mask", dynamic_axis),
            ]
        )


optimizer.pack_onnx(
    input_path="all-mpnet-base-v2",
    output_path="all-mpnet-base-v2-optimized",
    feature=Feature.DEFAULT,
    custom_onnx_config_cls=MPNetOnnxConfig,
)
ONNX is not the only option. It is also possible to resave the model for future inference using plain PyTorch (the optimizer.pack_transformers() method, with the force_fp16 argument to save in half precision) or DeepSpeed Inference (the optimizer.pack_deepspeed() method):
optimizer.pack_deepspeed(
    input_path="gpt-neo",
    output_path="gpt-neo-optimized",
    feature=Feature.CAUSAL_LM,
    replace_with_kernel_inject=True,
    mp_size=1,
)
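The plain PyTorch route can be sketched in the same way. The method name pack_transformers() and the force_fp16 argument come from the text above; the remaining keyword arguments are assumed to mirror pack_onnx() and pack_deepspeed() and should be checked against the installed version. The wrapper name pack_half_precision is hypothetical.

```python
def pack_half_precision(optimizer, feature, input_path, output_path):
    """Resave a PyTorch model in half precision via the Toolkit's optimizer.

    `optimizer` is the object imported from transformers_inference_toolkit.
    Assumption: pack_transformers() accepts input_path, output_path and
    feature like the other pack_* methods shown in this README.
    """
    optimizer.pack_transformers(
        input_path=input_path,
        output_path=output_path,
        feature=feature,
        force_fp16=True,  # argument named in the text: save in half precision
    )
    return output_path


# Possible usage (not executed here):
# pack_half_precision(
#     optimizer, Feature.SEQUENCE_CLASSIFICATION, "toxic-bert", "toxic-bert-fp16"
# )
```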
After calling one of the optimizer methods, the model and tokenizer are saved at output_path. The output directory also contains a metadata.json file, which the Predictor object (described below) needs in order to load the model correctly:
toxic-bert-optimized
├── metadata.json
├── model
│   ├── config.json
│   └── model.onnx
└── tokenizer
    ├── special_tokens_map.json
    ├── tokenizer.json
    └── tokenizer_config.json
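Before handing a packaged directory to the Predictor, it can be useful to confirm that it has the layout shown above. The sketch below is not part of the Toolkit: inspect_package is a hypothetical helper, and because the schema of metadata.json is not described here, it only reports the top-level keys rather than assuming any field names.

```python
import json
from pathlib import Path


def inspect_package(output_path):
    """Check a packaged directory and return the top-level keys of metadata.json.

    Raises FileNotFoundError if an expected folder or file is missing.
    """
    root = Path(output_path)
    for name in ("model", "tokenizer"):
        if not (root / name).is_dir():
            raise FileNotFoundError(f"missing folder: {root / name}")
    metadata_file = root / "metadata.json"
    if not metadata_file.is_file():
        raise FileNotFoundError(f"missing file: {metadata_file}")
    with metadata_file.open(encoding="utf-8") as handle:
        metadata = json.load(handle)
    if not isinstance(metadata, dict):
        raise ValueError("metadata.json does not contain a JSON object")
    # No field names are assumed; only report what is present.
    return sorted(metadata)
```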
Prediction
After the model and tokenizer have been packaged using one of the optimizer methods, a Predictor object can be initialized:
>>> from transformers_inference_toolkit import Predictor
>>>
>>> predictor = Predictor("toxic-bert-optimized", cuda=True)
>>> print(predictor("I hate this!"))
{'logits': array([[ 0.02940369, -7.0195312 , -4.7890625 , -6.0664062 , -5.625 ,
        -6.09375 ]], dtype=float32)}
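The predictor returns raw logits, so a final activation is needed to read them as scores. A minimal sketch using only the standard library follows. It assumes a multi-label classifier whose labels are scored independently with a sigmoid; this is an assumption about the model, not something the Toolkit reports. For a single-label classifier, a softmax over each row would be the usual choice instead. The logits are copied from the output above.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(row):
    """Softmax over one row of logits (single-label alternative)."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits from the example output, as a plain nested list.
logits = [[0.02940369, -7.0195312, -4.7890625, -6.0664062, -5.625, -6.09375]]

# Assumption: multi-label model, so each logit is scored independently.
scores = [[sigmoid(v) for v in row] for row in logits]
print([round(s, 4) for s in scores[0]])
```

When the result is a NumPy array, as in the output above, predictor(...)["logits"].tolist() yields the nested list used here.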
The Predictor object can be called directly with tokenizer arguments (similar to pipelines; the return_tensors argument can be omitted, and padding and truncation are True by default). For text generation tasks, the Predictor.generate() method can be used with generation arguments:
>>> predictor = Predictor("gpt-neo-optimized", cuda=True)
>>> predictor.generate(
...     "Tommy: Hi Mark!",
...     do_sample=True,
...     top_p=0.9,
...     num_return_sequences=3,
...     max_new_tokens=5,
... )
['Tommy: Hi Mark!\nMadelyn: Hello', 'Tommy: Hi Mark! It’s so', 'Tommy: Hi Mark! How are you?\n']
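Returning to the classification case, calling the predictor with tokenizer arguments can be wrapped in a small helper. This is a sketch under two assumptions: that keyword arguments passed to the predictor are forwarded to the tokenizer and that a list of strings is accepted as a batch, and that the result is a mapping with a "logits" entry, as in the earlier output. The name score_texts is hypothetical, and max_length is a standard tokenizer argument used only as an illustration.

```python
def score_texts(predictor, texts, **tokenizer_kwargs):
    """Run a packaged classifier on a batch and pair each text with its logits.

    `predictor` is any callable that accepts a list of strings plus tokenizer
    keyword arguments and returns a mapping with a "logits" entry.
    """
    texts = list(texts)
    output = predictor(texts, **tokenizer_kwargs)
    rows = output["logits"]
    # NumPy arrays expose tolist(); plain nested lists are used as they are.
    if hasattr(rows, "tolist"):
        rows = rows.tolist()
    return list(zip(texts, rows))


# Possible usage (not executed here):
# predictor = Predictor("toxic-bert-optimized", cuda=True)
# score_texts(predictor, ["I hate this!", "Have a nice day"], max_length=128)
```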