Quantization aware training

Maintained by TensorFlow Model Optimization

There are two forms of quantization: post-training quantization and quantization aware training. Start with post-training quantization since it's easier to use, though quantization aware training is often better for model accuracy.

This page provides an overview of quantization aware training to help you determine how it fits with your use case.

Overview

Quantization aware training emulates inference-time quantization, creating a model that downstream tools will use to produce actually quantized models. The quantized models use lower precision (e.g. 8-bit instead of 32-bit float), leading to benefits during deployment.
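
As a minimal sketch of the workflow (assuming the tensorflow_model_optimization package and a small illustrative tf.keras Sequential model), quantization aware training is typically applied by wrapping an existing model with quantize_model and then fine-tuning it:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# A small example model; any supported Sequential or Functional model works.
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(28, 28)),
    tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10)
])

# Wrap the model so that training emulates inference-time quantization.
q_aware_model = tfmot.quantization.keras.quantize_model(model)

# Fine-tune as usual; the wrapped model learns weights that are robust to quantization.
q_aware_model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
# q_aware_model.fit(train_images, train_labels, epochs=1, validation_split=0.1)
```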

Deploy with quantization

Quantization brings improvements via model compression and latency reduction. With the API defaults, the model size shrinks by 4x, and we typically see 1.5x to 4x improvements in CPU latency in the tested backends. Over time, latency improvements can also be seen on compatible machine learning accelerators, such as the EdgeTPU and NNAPI.
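
As a sketch of the deployment step (assuming q_aware_model was produced by quantize_model as in the earlier snippet), the TFLite converter is the downstream tool that emits the actually quantized model:

```python
import tensorflow as tf

# Convert the quantization-aware model; the converter uses the fake-quantization
# information recorded during training to emit an 8-bit quantized model.
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()

# Save the quantized model for deployment.
with open('quantized_model.tflite', 'wb') as f:
    f.write(quantized_tflite_model)
```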

The technique is used in production in speech, vision, text, and translate use cases. The code currently supports a subset of these models.

Experiment with quantization and associated hardware

Users can configure the quantization parameters (e.g. the number of bits) and, to some degree, the underlying algorithms. Note that with these changes from the API defaults, there is currently no supported path for deployment to a backend. For instance, TFLite conversion and kernel implementations only support 8-bit quantization.

APIs specific to this configuration are experimental and not subject to backward compatibility.
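
As a hedged sketch of this experimental configuration path (the class name Custom4BitQuantizeConfig below is illustrative, not part of the library), a custom QuantizeConfig can change the number of bits used for a layer's weights and activations, and is attached with quantize_annotate_layer before quantize_apply:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer
quantize_annotate_model = tfmot.quantization.keras.quantize_annotate_model
quantize_apply = tfmot.quantization.keras.quantize_apply
quantize_scope = tfmot.quantization.keras.quantize_scope
LastValueQuantizer = tfmot.quantization.keras.quantizers.LastValueQuantizer
MovingAverageQuantizer = tfmot.quantization.keras.quantizers.MovingAverageQuantizer

class Custom4BitQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
  """Illustrative config: quantize a Dense layer's weights and activations to 4 bits."""

  def get_weights_and_quantizers(self, layer):
    return [(layer.kernel, LastValueQuantizer(
        num_bits=4, symmetric=True, narrow_range=False, per_axis=False))]

  def get_activations_and_quantizers(self, layer):
    return [(layer.activation, MovingAverageQuantizer(
        num_bits=4, symmetric=False, narrow_range=False, per_axis=False))]

  def set_quantize_weights(self, layer, quantize_weights):
    layer.kernel = quantize_weights[0]

  def set_quantize_activations(self, layer, quantize_activations):
    layer.activation = quantize_activations[0]

  def get_output_quantizers(self, layer):
    return []

  def get_config(self):
    return {}

# Annotate the layer with the custom config, then apply quantization.
annotated_model = quantize_annotate_model(tf.keras.Sequential([
    quantize_annotate_layer(tf.keras.layers.Dense(20, input_shape=(20,)),
                            Custom4BitQuantizeConfig()),
    tf.keras.layers.Flatten()
]))

# quantize_apply needs the custom config registered via quantize_scope.
with quantize_scope({'Custom4BitQuantizeConfig': Custom4BitQuantizeConfig}):
  quant_aware_model = quantize_apply(annotated_model)
```

As noted above, a model configured this way (e.g. 4-bit) has no supported deployment path to a backend today; it is intended for experimentation.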

API compatibility

Users can apply quantization with the following APIs:

  • Model building: tf.keras with only Sequential and Functional models.
  • TensorFlow versions: TF 2.x for tf-nightly.
  • TensorFlow execution mode: eager execution

It is on our roadmap to add support in the following areas:

  • Model building: clarify how Subclassed Models have limited to no support
  • Distributed training: tf.distribute

General support matrix

Support is available in the following areas:

  • Model coverage: models using allowlisted layers, BatchNormalization when it follows Conv2D and DepthwiseConv2D layers, and, in limited cases, Concat. (A sketch of annotating individual supported layers follows this list.)
  • Hardware acceleration: our API defaults are compatible with acceleration on EdgeTPU, NNAPI, and TFLite backends, amongst others. See the caveat in the roadmap.
  • Deploy with quantization: only per-axis quantization for convolutional layers, not per-tensor quantization, is currently supported.
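
As a sketch of scoping quantization to supported layers (assuming the default 8-bit configuration), quantize_annotate_layer can be applied to individual allowlisted layers so that quantize_apply only makes those layers quantization aware:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer

# Annotate only the layers to quantize; the Flatten/Dense head here stays in float.
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(28, 28, 1)),
    quantize_annotate_layer(tf.keras.layers.Conv2D(12, (3, 3), activation='relu')),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10)
])

# quantize_apply makes the annotated layers quantization aware and leaves the rest unchanged.
quant_aware_model = tfmot.quantization.keras.quantize_apply(model)
```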

It is on our roadmap to add support in the following areas:

  • Model coverage: extended to include RNN/LSTMs and general Concat support.
  • Hardware acceleration: ensure the TFLite converter can produce full-integer models. See this issue for details.
  • Experiment with quantization use cases:
    • Experiment with quantization algorithms that span Keras layers or require the training step.
    • Stabilize APIs.

Results

Image classification with tools

Model             Non-quantized Top-1 Accuracy   8-bit Quantized Accuracy
MobilenetV1 224   71.03%                         71.06%
Resnet v1 50      76.3%                          76.1%
MobilenetV2 224   70.77%                         70.01%

The models were tested on Imagenet and evaluated in both TensorFlow and TFLite.

Image classification for technique

Model           Non-quantized Top-1 Accuracy   8-bit Quantized Accuracy
Nasnet-Mobile   74%                            73%
Resnet-v2 50    75.6%                          75%

The models were tested on Imagenet and evaluated in both TensorFlow and TFLite.

Examples

In addition to the quantization aware training example, see the following examples:

  • CNN model on the MNIST handwritten digit classification task with quantization: code

For background on something similar, see the Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference paper. This paper introduces some concepts that this tool uses. The implementation is not exactly the same, and there are additional concepts used in this tool (e.g. per-axis quantization).