Post-training quantization is a general technique to reduce model size while also providing up to 3x lower latency with little degradation in model accuracy. Post-training quantization quantizes weights from floating point to 8-bits of precision. This technique is enabled as an option in the TensorFlow Lite converter:
import tensorflow as tf converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir) converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE] tflite_quant_model = converter.convert()
At inference, weights are converted from 8-bits of precision to floating point and computed using floating-point kernels. This conversion is done once and cached to reduce latency.
To further improve latency, hybrid operators dynamically quantize activations to 8-bits and perform computations with 8-bit weights and activations. This optimization provides latencies close to fully fixed-point inference. However, the outputs are still stored using floating point, so that the speedup with hybrid ops is less than a full fixed-point computation. Hybrid ops are available for the most compute-intensive operators in a network:
- tf.nn.bidirectional_dynamic_rnn for BasicRNNCell type
- tf.nn.dynamic_rnn for LSTM and BasicRNN Cell types
Since weights are quantized post training, there could be an accuracy loss, particularly for smaller networks. Pre-trained fully quantized models are provided for specific networks in the TensorFlow Lite model repository. It is important to check the accuracy of the quantized model to verify that any degradation in accuracy is within acceptable limits. There is a tool to evaluate TensorFlow Lite model accuracy.
If the accuracy drop is too high, consider using quantization aware training.
Representation for quantized tensors
TensorFlow approaches the conversion of floating-point arrays of numbers into 8-bit representations as a compression problem. Since the weights and activation tensors in trained neural network models tend to have values that are distributed across comparatively small ranges (e.g. -15 to +15 for weights or -500 to 1000 for image model activations).
Since neural networks tend to be robust at handling noise, the error introduced by quantizing to a small set of values maintains the precision of the overall results within an acceptable threshold. A chosen representation must perform fast calculations, especially with large matrix multiplications that comprise the bulk of the computations while running a model.
This is represented with two floats that store the overall minimum and maximum values corresponding to the lowest and highest quantized value. Each entry in the quantized array represents a float value in that range, distributed linearly between the minimum and maximum.
With our post-training quantization tooling, we use symmetric quantization for our weights, meaning we expand the represented range and force the min and max to be the negative of each other.
For example, with an overall minimum of -10.0 and a maximum of 30.0f, we instead represent a minimum of -30.0 and maximum of 30.0f. In an 8-bit array, the quantized values would be represented as follows:
|-127||30.0 (this value does not ever show up)|
The advantages of this representation format are:
- It efficiently represents an arbitrary magnitude of ranges.
- The linear spread makes multiplications straightforward.
- A symmetric range for weights enables downstream hardware optimizations.