AI Edge Quantizer

A quantizer for advanced developers to quantize converted LiteRT models. It aims to help advanced users achieve optimal performance on resource-demanding models (e.g., GenAI models).

Installation

Requirements and Dependencies

Python versions: 3.9, 3.10, 3.11
Operating system: Linux, macOS
TensorFlow:

Install

Nightly PyPI package:

pip install ai-edge-quantizer-nightly

API Usage

The quantizer requires two inputs:

An unquantized source LiteRT model (FP32 data type, FlatBuffers format, .tflite extension)
A quantization recipe (details below)

and outputs a quantized LiteRT model that's ready for deployment on edge devices.

Basic Usage

In a nutshell, the quantizer works according to the following steps:

Instantiate the Quantizer class with the path to the source model. This is the user-facing entry point to the quantizer's functionality.
Load a desired quantization recipe (details in the Quantization Recipe subsection).
Quantize (and save) the model. This is where most of the quantizer's internal logic runs.

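from ai_edge_quantizer import quantizer, recipe

# Instantiate, load a predefined recipe, then quantize and save.
qt = quantizer.Quantizer("path/to/input/tflite")
qt.load_quantization_recipe(recipe.dynamic_wi8_afp32())
qt.quantize().export_model("/path/to/output/tflite")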

See the getting started colab for a quick-start guide to these three steps, and the selective quantization colab for more details on advanced features.

LiteRT Model

Please refer to the LiteRT documentation for ways to generate LiteRT models from JAX, PyTorch, and TensorFlow. The input source model should be an FP32 (unquantized) model in the FlatBuffers format with a .tflite extension.

Quantization Recipe

The user needs to specify a quantization recipe using AI Edge Quantizer's API to apply to the source model. The quantization recipe encodes all information on how a model is to be quantized, such as number of bits, data type, symmetry, scope name, etc.

Essentially, a quantization recipe is a collection of commands of the following form:

"Apply Quantization Algorithm X on Operator Y under Scope Z with Config N".

For example:

"Uniformly quantize the FullyConnected op under scope 'dense1/' as INT8 symmetric with dynamic quantization".

All unspecified ops are kept as FP32 (unquantized). The scope of an operator in TFLite is defined as the output tensor name of the op, which preserves the hierarchical model information from the source model (e.g., scope in TF). The easiest way to find a scope name is to visualize the model with Model Explorer.
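The command form above can be modeled as plain data. The following is a minimal sketch, not the library's implementation: the `RecipeEntry` class, its field names, the `resolve` helper, and the example tensor names are hypothetical, and matching the scope with a regular expression search against the op's output tensor name is an assumption made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecipeEntry:
    """One command: apply `algorithm` to `operator` under `scope` with `config`."""
    algorithm: str  # hypothetical label for the quantization algorithm
    operator: str   # op type, e.g. "FULLY_CONNECTED"
    scope: str      # pattern searched for in the op's output tensor name
    config: dict    # e.g. number of bits, symmetry, quantization mode


# The example command from the text, written as data.
RECIPE = [
    RecipeEntry(
        algorithm="uniform",
        operator="FULLY_CONNECTED",
        scope="dense1/",
        config={"num_bits": 8, "symmetric": True, "mode": "dynamic"},
    ),
]


def resolve(recipe, operator: str, output_tensor_name: str) -> Optional[RecipeEntry]:
    """Return the first entry covering this op, or None if no entry applies."""
    for entry in recipe:
        if entry.operator == operator and re.search(entry.scope, output_tensor_name):
            return entry
    return None  # unspecified ops are kept as FP32 (unquantized)
```

With this recipe, `resolve(RECIPE, "FULLY_CONNECTED", "model/dense1/MatMul")` returns the entry, while a FullyConnected op under `dense2/` resolves to `None` and stays FP32.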

The simplest way to get started is to use one of the existing recipes from recipe.py. This is demonstrated in the getting started colab example.

Deployment

Please refer to the LiteRT deployment documentation for ways to deploy a quantized LiteRT model.

Advanced Recipes

There are many ways to configure and customize the quantization recipe beyond using a template in recipe.py. For example, the user can configure the recipe to enable the following:

Selective quantization (exclude selected ops from being quantized)
Flexible mixed scheme quantization (mixture of different precision, compute precision, scope, op, config, etc.)
4-bit weight quantization

The selective quantization colab shows some of these more advanced features.

For specifics of the recipe schema, please refer to the OpQuantizationRecipe in recipe_manager.py.

For advanced usage involving mixed quantization, the following APIs may be useful:

Use Quantizer.load_quantization_recipe() in quantizer.py to load a custom recipe.
Use Quantizer.update_quantization_recipe() in quantizer.py to extend or override specific parts of the recipe.
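One way to picture "extend or override" is as an update keyed on the scope and operator pair. The sketch below is a plain-Python model and does not call the library; the `update_recipe` function, the dictionary keys, and the rule that an entry with the same scope and operator is replaced are all assumptions for illustration.

```python
def update_recipe(recipe, scope, operator, config):
    """Return a new recipe with the (scope, operator) rule set to `config`.

    An existing entry with the same scope and operator is overridden in place;
    otherwise the new entry is appended, which extends the recipe.
    """
    new_entry = {"scope": scope, "operator": operator, "config": config}
    updated, replaced = [], False
    for entry in recipe:
        if entry["scope"] == scope and entry["operator"] == operator:
            updated.append(new_entry)  # override the matching rule
            replaced = True
        else:
            updated.append(entry)
    if not replaced:
        updated.append(new_entry)  # extend with a new rule
    return updated


base = [{"scope": ".*", "operator": "FULLY_CONNECTED", "config": {"num_bits": 8}}]
# Extend: add a rule for another operator.
mixed = update_recipe(base, ".*", "EMBEDDING_LOOKUP", {"num_bits": 4})
# Override: switch the FULLY_CONNECTED rule to 4-bit weights.
mixed = update_recipe(mixed, ".*", "FULLY_CONNECTED", {"num_bits": 4})
```

The function returns a new list and leaves `base` untouched, which keeps a template recipe reusable across experiments.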

Operator coverage

The table below outlines the allowed configurations for available recipes.


Config		DYNAMIC_WI8_AFP32	DYNAMIC_WI4_AFP32	STATIC_WI8_AI16	STATIC_WI4_AI16	STATIC_WI8_AI8	STATIC_WI4_AI8	WEIGHTONLY_WI8_AFP32	WEIGHTONLY_WI4_AFP32
activation	num_bits	None	None	16	16	8	8	None	None
	symmetric	None	None	TRUE	TRUE	[TRUE, FALSE]	[TRUE, FALSE]	None	None
	granularity	None	None	TENSORWISE	TENSORWISE	TENSORWISE	TENSORWISE	None	None
	dtype	None	None	INT	INT	INT	INT	None	None
weight	num_bits	8	4	8	4	8	4	8	4
	symmetric	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	[TRUE, FALSE]	[TRUE, FALSE]
	granularity	[CHANNELWISE, TENSORWISE]	[CHANNELWISE, TENSORWISE]	[CHANNELWISE, TENSORWISE]	[CHANNELWISE, TENSORWISE]	[CHANNELWISE, TENSORWISE]	[CHANNELWISE, TENSORWISE]	[CHANNELWISE, TENSORWISE]	[CHANNELWISE, TENSORWISE]
	dtype	INT	INT	INT	INT	INT	INT	INT	INT
explicit_dequantize		FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	TRUE
compute_precision		INTEGER	INTEGER	INTEGER	INTEGER	INTEGER	INTEGER	FLOAT	FLOAT
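The table can be read as a set of per-field constraints. As a rough sketch, the dictionary below transcribes two of its columns, and the hypothetical `check_config` helper reports which fields of a proposed config fall outside the allowed values. The dictionary layout and the helper are illustrative assumptions and are not part of the quantizer's API.

```python
# Weight settings shared by the two columns transcribed below.
_WEIGHT_8BIT = {
    "num_bits": [8],
    "symmetric": [True],
    "granularity": ["CHANNELWISE", "TENSORWISE"],
    "dtype": ["INT"],
}

# Allowed values copied from the table above. None mirrors the table:
# no activation settings apply to that recipe.
ALLOWED = {
    "DYNAMIC_WI8_AFP32": {
        "activation": None,
        "weight": _WEIGHT_8BIT,
        "explicit_dequantize": [False],
        "compute_precision": ["INTEGER"],
    },
    "STATIC_WI8_AI8": {
        "activation": {
            "num_bits": [8],
            "symmetric": [True, False],
            "granularity": ["TENSORWISE"],
            "dtype": ["INT"],
        },
        "weight": _WEIGHT_8BIT,
        "explicit_dequantize": [False],
        "compute_precision": ["INTEGER"],
    },
}


def check_config(recipe_name, config):
    """Return the names of fields in `config` that the table does not allow."""
    allowed = ALLOWED[recipe_name]
    errors = []
    for tensor in ("activation", "weight"):
        rules, given = allowed[tensor], config.get(tensor)
        if rules is None or given is None:
            if rules != given:  # one side is configured, the other is not
                errors.append(tensor)
            continue
        for field, choices in rules.items():
            if given.get(field) not in choices:
                errors.append(f"{tensor}.{field}")
    for field in ("explicit_dequantize", "compute_precision"):
        if config.get(field) not in allowed[field]:
            errors.append(field)
    return errors


candidate = {
    "activation": None,
    "weight": {"num_bits": 8, "symmetric": False,
               "granularity": "CHANNELWISE", "dtype": "INT"},
    "explicit_dequantize": False,
    "compute_precision": "INTEGER",
}
problems = check_config("DYNAMIC_WI8_AFP32", candidate)
```

Here `problems` contains only `"weight.symmetric"`, because the table allows only symmetric weights for DYNAMIC_WI8_AFP32.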

Operators Supporting Quantization


Config	DYNAMIC_WI8_AFP32	DYNAMIC_WI4_AFP32	STATIC_WI8_AI16	STATIC_WI4_AI16	STATIC_WI8_AI8	STATIC_WI4_AI8	WEIGHTONLY_WI8_AFP32	WEIGHTONLY_WI4_AFP32
FULLY_CONNECTED	✓	✓	✓	✓	✓	✓	✓	✓
CONV_2D	✓		✓	✓	✓	✓	✓
BATCH_MATMUL	✓		✓		✓		✓
EMBEDDING_LOOKUP	✓	✓				✓	✓
DEPTHWISE_CONV_2D	✓		✓		✓		✓
AVERAGE_POOL_2D			✓		✓
RESHAPE			✓		✓
SOFTMAX			✓		✓
TANH			✓		✓
TRANSPOSE			✓		✓
GELU			✓		✓
ADD			✓		✓
CONV_2D_TRANSPOSE	✓		✓		✓
SUB			✓		✓
MUL			✓		✓
MEAN			✓		✓
RSQRT			✓		✓
CONCATENATION			✓		✓
STRIDED_SLICE			✓		✓
SPLIT			✓		✓
LOGISTIC			✓		✓
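When planning selective quantization, it can help to split a model's ops into those a given config covers and those that would stay in float. The sketch below transcribes three rows of the table as written; the `SUPPORTED` mapping and the `split_by_support` helper are hypothetical and do not query the library.

```python
CONFIGS = (
    "DYNAMIC_WI8_AFP32", "DYNAMIC_WI4_AFP32",
    "STATIC_WI8_AI16", "STATIC_WI4_AI16",
    "STATIC_WI8_AI8", "STATIC_WI4_AI8",
    "WEIGHTONLY_WI8_AFP32", "WEIGHTONLY_WI4_AFP32",
)

# Three rows copied from the table above; other rows are omitted for brevity.
SUPPORTED = {
    "FULLY_CONNECTED": set(CONFIGS),
    "AVERAGE_POOL_2D": {"STATIC_WI8_AI16", "STATIC_WI8_AI8"},
    "SOFTMAX": {"STATIC_WI8_AI16", "STATIC_WI8_AI8"},
}


def split_by_support(op_types, config):
    """Split op types into (covered by `config`, not covered by `config`)."""
    covered = [op for op in op_types if config in SUPPORTED.get(op, set())]
    uncovered = [op for op in op_types if op not in covered]
    return covered, uncovered


covered, uncovered = split_by_support(
    ["FULLY_CONNECTED", "SOFTMAX"], "DYNAMIC_WI8_AFP32"
)
```

In this example `covered` holds FULLY_CONNECTED and `uncovered` holds SOFTMAX, so a dynamic INT8 recipe would leave the softmax in FP32.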