Post-training quantization is a conversion technique that can reduce model size while also improving CPU and hardware accelerator latency, with little degradation in model accuracy. You can perform these techniques using an already-trained float TensorFlow model when you convert it to TensorFlow Lite format using the TensorFlow Lite Converter.
There are several post-training quantization options to choose from. Here is a summary table of the choices and the benefits they provide:
| Technique | Benefits | Hardware |
|---|---|---|
| Dynamic range quantization | 4x smaller, 2-3x speedup | CPU |
| Full integer quantization | 4x smaller, 3x+ speedup | CPU, Edge TPU, Microcontrollers |
| Float16 quantization | 2x smaller, potential GPU acceleration | CPU, GPU |
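The size reductions in the table follow from the bytes needed to store each weight. The sketch below works through that arithmetic for a hypothetical model; the helper name and the parameter count are assumptions made for illustration, and the estimate ignores metadata, quantization parameters, and any tensors that stay in float32.

```python
# Bytes per weight for each storage format (float32 is the unquantized baseline).
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}

def estimated_weight_bytes(num_weights: int, dtype: str) -> int:
    """Approximate storage needed for the weights alone."""
    return num_weights * BYTES_PER_WEIGHT[dtype]

num_weights = 5_000_000  # hypothetical parameter count
baseline = estimated_weight_bytes(num_weights, "float32")
for dtype in ("float16", "int8"):
    size = estimated_weight_bytes(num_weights, dtype)
    print(f"{dtype}: {size / 1e6:.0f} MB, {baseline / size:.0f}x smaller")
```

Real models rarely hit these ratios exactly, because the file also holds the graph structure and per-tensor quantization parameters.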
This decision tree can help determine which post-training quantization method is best for your use case:
Dynamic range quantization
The simplest form of post-training quantization statically quantizes only the weights, from floating point to integers with 8 bits of precision:
```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
```
At inference, weights are converted from 8 bits of precision to floating point and computed using floating-point kernels. This conversion is done once and cached to reduce latency.
To further improve latency, "dynamic-range" operators dynamically quantize activations to 8 bits based on their range and perform computations with 8-bit weights and activations. This optimization provides latencies close to fully fixed-point inference. However, the outputs are still stored in floating point, so the speedup with dynamic-range ops is smaller than with a full fixed-point computation. Dynamic-range ops are available for the most compute-intensive operators in a network.
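To make the dynamic-range idea concrete, here is a simplified pure-Python illustration of a single dot product. The weights are quantized once ahead of time, the activations are quantized at run time from their observed range, and the result is rescaled to floating point. The function names are hypothetical and the symmetric scaling is an assumption chosen for brevity; this is not how TensorFlow Lite's kernels are implemented.

```python
def symmetric_scale(values):
    """Scale that maps the largest magnitude in `values` to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127 if max_abs > 0 else 1.0

def quantize(values, scale):
    """Round to the nearest step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dynamic_range_dot(weights_q, weight_scale, activations):
    # The activation scale is computed at run time from the observed range.
    act_scale = symmetric_scale(activations)
    acts_q = quantize(activations, act_scale)
    # Integer multiply-accumulate over 8-bit operands.
    acc = sum(w * a for w, a in zip(weights_q, acts_q))
    # The result is rescaled and returned as floating point.
    return acc * weight_scale * act_scale

weights = [0.5, -1.0, 0.25]
activations = [2.0, 1.0, -4.0]
w_scale = symmetric_scale(weights)        # done once, at conversion time
weights_q = quantize(weights, w_scale)

approx = dynamic_range_dot(weights_q, w_scale, activations)
exact = sum(w * a for w, a in zip(weights, activations))
print(approx, exact)
```

The two printed values should be close but not identical; the gap is the quantization error introduced by rounding both operands to 8 bits.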
Full integer quantization
You can get further latency improvements, reductions in peak memory usage, and access to integer-only hardware devices or accelerators by making sure all model math is integer quantized.
To do this, you need to measure the dynamic range of activations and inputs by supplying sample input data to the converter. Refer to the representative_dataset_gen() function used in the following code.
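The converter performs this calibration internally, but the underlying idea can be sketched in a few lines: track the smallest and largest values seen across the sample data, then derive a scale and zero-point that map that range onto int8. The function names and sample values below are hypothetical, and the min/max rule is a simplifying assumption rather than a description of the converter's actual calibration algorithm.

```python
def calibrate_range(sample_batches):
    """Track the smallest and largest value seen across all samples."""
    lo, hi = float("inf"), float("-inf")
    for batch in sample_batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi

def int8_params(lo, hi):
    """Derive (scale, zero_point) so that [lo, hi] maps onto [-128, 127]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    zero_point = round(-128 - lo / scale)
    return scale, max(-128, min(127, zero_point))

# Hypothetical activation samples gathered from representative inputs.
samples = [[0.0, 0.5, 1.5], [0.2, 3.0, 1.0], [0.1, 2.55, 0.7]]
lo, hi = calibrate_range(samples)
scale, zero_point = int8_params(lo, hi)
print(lo, hi, scale, zero_point)
```

This is why the sample data should resemble real inputs: a range that is too narrow clips values at inference, and one that is too wide wastes precision.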
Integer with float fallback (using default float input/output)
In order to fully integer quantize a model, but use float operators when they don't have an integer implementation (to ensure conversion occurs smoothly), use the following steps:
```python
import tensorflow as tf

# saved_model_dir and num_calibration_steps are placeholders you must define.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset_gen():
    for _ in range(num_calibration_steps):
        # Get sample input data as a numpy array in a method of your choosing.
        # `input` stands for that array; it is not Python's built-in input().
        yield [input]

converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()
```
Integer only
Additionally, to ensure compatibility with integer-only devices (such as 8-bit microcontrollers) and accelerators (such as the Coral Edge TPU), you can enforce full integer quantization for all ops, including the input and output, by using the following steps:
```python
import tensorflow as tf

# saved_model_dir and num_calibration_steps are placeholders you must define.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset_gen():
    for _ in range(num_calibration_steps):
        # Get sample input data as a numpy array in a method of your choosing.
        # `input` stands for that array; it is not Python's built-in input().
        yield [input]

converter.representative_dataset = representative_dataset_gen
# Restrict conversion to ops that have an int8 implementation.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8  # or tf.uint8
converter.inference_output_type = tf.int8  # or tf.uint8
tflite_quant_model = converter.convert()
```
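When the model's input and output are int8, the calling code has to quantize float inputs before inference and dequantize the outputs afterwards. The sketch below shows that step in plain Python. The helper names and the scale and zero-point values are hypothetical; with a real model these parameters are reported per tensor by the TensorFlow Lite interpreter (for example in the "quantization" entry returned by get_input_details()).

```python
def quantize_input(values, scale, zero_point):
    """float -> int8: q = round(x / scale) + zero_point, clamped to int8."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]

def dequantize_output(q_values, scale, zero_point):
    """int8 -> float: x = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in q_values]

# Hypothetical parameters; a real model reports its own values.
in_scale, in_zero_point = 1 / 255, -128     # e.g. inputs normalized to [0, 1]
out_scale, out_zero_point = 1 / 256, -128   # e.g. probability-like outputs

pixels = [0.0, 0.25, 1.0]
q_pixels = quantize_input(pixels, in_scale, in_zero_point)
print(q_pixels)

raw_output = [-128, 0, 127]  # stand-in for int8 values produced by the model
print(dequantize_output(raw_output, out_scale, out_zero_point))
```

Skipping this step and feeding raw float values cast to int8 is a common source of nonsensical predictions from integer-only models.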
Float16 quantization
You can reduce the size of a floating point model by quantizing the weights to float16, the IEEE standard for 16-bit floating point numbers. To enable float16 quantization of weights, use the following steps:
```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Store weights as 16-bit floats.
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
```
The advantages of this quantization are as follows:
- It reduces model size by up to half (since all weights are now half their original size).
- It causes minimal loss in accuracy.
- Some delegates (e.g. the GPU delegate) can operate directly on float16 data, which results in faster execution than float32 computation.
The disadvantages of this quantization are as follows:
- Not a good choice for maximum performance (a quantization to fixed point math would be better in that case).
- By default, a float16 quantized model will "dequantize" the weight values to float32 when run on the CPU. (Note that the GPU delegate will not perform this dequantization, since it can operate on float16 data.)
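The size and accuracy trade-off of float16 can be seen without any ML framework. This sketch uses Python's standard struct module, whose "e" format is IEEE 754 half precision, to round-trip a few hypothetical weight values and compare storage sizes. It only illustrates the number format and says nothing about how the converter itself stores weights.

```python
import struct

def to_float16_and_back(value: float) -> float:
    """Round-trip a value through IEEE 754 half precision ('e' format)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

weights = [0.1, -0.333, 1.0, 0.0009765625]  # hypothetical weight values
for w in weights:
    w16 = to_float16_and_back(w)
    print(f"{w!r} -> {w16!r} (abs error {abs(w - w16):.2e})")

# Storage per weight: 4 bytes as float32, 2 bytes as float16.
print(struct.calcsize("<f"), struct.calcsize("<e"))
```

Values that are exactly representable in half precision, such as 1.0, survive unchanged, while others pick up a small rounding error.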
Model accuracy
Since weights are quantized post training, there could be an accuracy loss, particularly for smaller networks. Pre-trained fully quantized models are provided for specific networks in the TensorFlow Lite model repository. It is important to check the accuracy of the quantized model to verify that any degradation in accuracy is within acceptable limits. There is a tool to evaluate TensorFlow Lite model accuracy.
Alternatively, if the accuracy drop is too high, consider using quantization-aware training. However, doing so requires modifications during model training to add fake quantization nodes, whereas the post-training quantization techniques on this page use an existing pre-trained model.
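One simple way to decide whether the accuracy drop is acceptable is to run the float and quantized models on the same labeled evaluation set and compare their top-1 accuracy against a budget. The helper names, the predictions, and the 5% threshold below are all hypothetical; this sketch is not the evaluation tool mentioned above.

```python
def top1_accuracy(predictions, labels):
    """Fraction of examples whose predicted class matches the label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def within_budget(float_preds, quant_preds, labels, max_drop=0.01):
    """Return (ok, drop), where ok is True if the accuracy drop fits the budget."""
    drop = top1_accuracy(float_preds, labels) - top1_accuracy(quant_preds, labels)
    return drop <= max_drop, drop

labels      = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
float_preds = [0, 1, 2, 1, 0, 2, 1, 0, 2, 0]   # 9 of 10 correct
quant_preds = [0, 1, 2, 1, 0, 2, 1, 0, 1, 0]   # 8 of 10 correct
ok, drop = within_budget(float_preds, quant_preds, labels, max_drop=0.05)
print(ok, round(drop, 3))
```

In this made-up example the drop exceeds the budget, which is the situation where quantization-aware training becomes worth considering.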
Representation for quantized tensors
8-bit quantization approximates floating point values using the following formula.
$$real\_value = (int8\_value - zero\_point) \times scale$$
The representation has two main parts:
- Per-axis (aka per-channel) or per-tensor weights represented by int8 two’s complement values in the range [-127, 127] with zero-point equal to 0.
- Per-tensor activations/inputs represented by int8 two’s complement values in the range [-128, 127], with a zero-point in range [-128, 127].
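The weight representation described above can be illustrated with a short pure-Python sketch of symmetric per-axis quantization: each output channel (here, each row) gets its own scale, values are clamped to [-127, 127], and the zero-point is fixed at 0. The function names and weight values are hypothetical, and choosing the scale from the largest magnitude in each row is an assumption for illustration; the quantization spec remains the authoritative description.

```python
def quantize_weights_per_axis(rows):
    """Symmetric int8 quantization with one scale per output channel (row).

    Returns (quantized_rows, scales); the zero-point is always 0.
    """
    quantized, scales = [], []
    for row in rows:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        quantized.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales

def dequantize(q, scale, zero_point=0):
    """Recover an approximate real value: (q - zero_point) * scale."""
    return (q - zero_point) * scale

weights = [[0.02, -0.012, 0.005],   # small-magnitude channel
           [1.2, -3.0, 0.75]]       # large-magnitude channel
q_rows, scales = quantize_weights_per_axis(weights)
print(q_rows)
print([dequantize(q, scales[0]) for q in q_rows[0]])
```

The benefit of per-axis scales shows up in the first row: with a single per-tensor scale set by the largest weight (3.0), those small values would all collapse to -1, 0, or 1.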
For a detailed view of our quantization scheme, please see our quantization spec. Hardware vendors who want to plug into TensorFlow Lite's delegate interface are encouraged to implement the quantization scheme described there.