Quantization aware training

Maintained by TensorFlow Model Optimization

There are two forms of quantization: post-training quantization and quantization aware training. Start with post-training quantization since it's easier to use, though quantization aware training is often better for model accuracy.

This page provides an overview of quantization aware training to help you determine how it fits with your use case.

To dive right into an end-to-end example, see the quantization aware training example.
To quickly find the APIs you need for your use case, see the quantization aware training comprehensive guide.

Overview

Quantization aware training emulates inference-time quantization, creating a model that downstream tools will use to produce actually quantized models. The quantized models use lower precision (e.g. 8-bit integers instead of 32-bit floats), leading to benefits during deployment.
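As a rough illustration of what "emulating inference-time quantization" means, the following sketch rounds float values onto an 8-bit integer grid and maps them back to floats, so the values stay floats but carry the rounding error that real quantization would introduce. It is a minimal pure-Python sketch, not the library's implementation: the function name `fake_quantize`, the asymmetric min/max range choice, and the sample weights are assumptions made for this example.

```python
def fake_quantize(values, num_bits=8):
    """Quantize floats to num_bits integers, then map them back to floats."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if range is empty
    zero_point = round(qmin - lo / scale)
    out = []
    for v in values:
        q = round(v / scale) + zero_point      # float -> integer grid
        q = max(qmin, min(qmax, q))            # clamp to the representable range
        out.append((q - zero_point) * scale)   # integer -> float
    return out


# Hypothetical weights, chosen only for illustration.
weights = [-1.0, -0.26, 0.0, 0.3, 1.5]
emulated = fake_quantize(weights)
max_error = max(abs(w - e) for w, e in zip(weights, emulated))
```

Training against values like `emulated` instead of `weights` is what lets the model adapt to quantization error before deployment.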

Deploy with quantization

Quantization brings improvements via model compression and latency reduction. With the API defaults, the model size shrinks by 4x, and we typically see 1.5x to 4x improvements in CPU latency on the tested backends. Eventually, latency improvements can be seen on compatible machine learning accelerators, such as the EdgeTPU and NNAPI.
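The 4x figure follows from storing each parameter in 8 bits instead of 32. The arithmetic can be sketched as below; the parameter count is a hypothetical value, and the calculation ignores the small overhead of stored scales and zero points.

```python
def model_size_bytes(num_params, bits_per_param):
    """Approximate storage for the parameters alone, in bytes."""
    return num_params * bits_per_param // 8


num_params = 4_000_000  # hypothetical parameter count
float_size = model_size_bytes(num_params, 32)  # 32-bit float weights
int8_size = model_size_bytes(num_params, 8)    # 8-bit quantized weights
compression = float_size / int8_size
```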

The technique is used in production in speech, vision, text, and translation use cases. The code currently supports a subset of these models.

Experiment with quantization and associated hardware

Users can configure the quantization parameters (e.g. number of bits) and, to some degree, the underlying algorithms. Note that with these changes from the API defaults, there is currently no supported path for deployment to a backend. For instance, TFLite conversion and kernel implementations only support 8-bit quantization.
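To give a sense of what changing the number of bits does, the sketch below computes the spacing between adjacent representable values for a few bit widths. It is plain arithmetic rather than a call into the library; the function name `quantization_step` and the value range are assumptions made for this example.

```python
def quantization_step(value_range, num_bits):
    """Return the spacing between adjacent representable values."""
    levels = 2 ** num_bits
    return value_range / (levels - 1)


value_range = 2.0  # hypothetical: tensor values span [-1.0, 1.0]
steps = {bits: quantization_step(value_range, bits) for bits in (4, 8, 16)}
```

Fewer bits mean a coarser grid and therefore larger rounding error, which is the trade-off being explored when moving away from the 8-bit default.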

APIs specific to this configuration are experimental and not subject to backward compatibility.

API compatibility

Users can apply quantization with the following APIs:

Model building: Keras with only Sequential and Functional models.
TensorFlow versions: TF 2.x for tf-nightly.
- tf.compat.v1 with a TF 2.x package is not supported.
TensorFlow execution mode: eager execution.

It is on our roadmap to add support in the following areas:

Model building: clarify the extent to which Subclassed Models are supported (currently limited to no support).
Distributed training: tf.distribute

General support matrix

Support is available in the following areas:

Model coverage: models using allowlisted layers, BatchNormalization when it follows Conv2D and DepthwiseConv2D layers, and in limited cases, Concat.
Hardware acceleration: our API defaults are compatible with acceleration on EdgeTPU, NNAPI, and TFLite backends, amongst others. See the caveat in the roadmap.
Deploy with quantization: only per-axis quantization for convolutional layers, not per-tensor quantization, is currently supported.
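The difference between per-tensor and per-axis quantization can be sketched as follows. Per-tensor quantization shares one scale across the whole weight tensor, while per-axis quantization keeps a separate scale for each output channel. This is an illustrative pure-Python sketch that assumes symmetric quantization of weights; the helper `symmetric_scale` and the sample kernel are hypothetical and not part of the library.

```python
def symmetric_scale(values, num_bits=8):
    """Scale that maps the largest magnitude to the largest signed integer."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    return max(abs(v) for v in values) / qmax


# Hypothetical weights: one row per output channel.
kernel = [
    [0.02, -0.01, 0.03],  # channel 0: small magnitudes
    [1.50, -2.00, 0.75],  # channel 1: large magnitudes
]

# Per-tensor: a single scale shared by every channel.
per_tensor_scale = symmetric_scale([w for row in kernel for w in row])

# Per-axis: one scale per output channel.
per_axis_scales = [symmetric_scale(row) for row in kernel]
```

In this example the shared per-tensor scale is set by channel 1, so the small weights in channel 0 would fall on only a few integer levels. The per-axis scale for channel 0 is much finer, which is why per-axis quantization tends to preserve accuracy better for convolutional layers.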

It is on our roadmap to add support in the following areas:

Model coverage: extended to include RNN/LSTMs and general Concat support.
Hardware acceleration: ensure the TFLite converter can produce full-integer models. See this issue for details.
Experiment with quantization use cases:
- Experiment with quantization algorithms that span Keras layers or require the training step.
- Stabilize APIs.

Results

Image classification with tools

Model	Non-quantized Top-1 Accuracy	8-bit Quantized Accuracy
MobilenetV1 224	71.03%	71.06%
Resnet v1 50	76.3%	76.1%
MobilenetV2 224	70.77%	70.01%

The models were tested on Imagenet and evaluated in both TensorFlow and TFLite.

Image classification for technique

Model	Non-quantized Top-1 Accuracy	8-bit Quantized Accuracy
Nasnet-Mobile	74%	73%
Resnet-v2 50	75.6%	75%