Inference efficiency is a critical issue when deploying machine learning models to mobile devices. Whereas the computational demand for training grows with the number of models trained on different architectures, the computational demand for inference grows in proportion to the number of users. The TensorFlow Model Optimization Toolkit minimizes the complexity of inference: the model size, latency, and power consumption.
Model optimization is useful for:
- Deploying models to edge devices with restrictions on processing, memory, or power consumption, such as mobile and Internet of Things (IoT) devices.
- Reducing the payload size for over-the-air model updates.
- Executing on hardware constrained to fixed-point operations.
- Optimizing models for special-purpose hardware accelerators.
Model optimization uses multiple techniques:
- Reduced parameter count, for example, pruning and structured pruning.
- Reduced representational precision, for example, quantization.
- Updated model topology, replacing the original with a more efficient one that has fewer parameters or faster execution, for example, tensor decomposition methods and distillation.
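To make the first technique in the list above concrete, here is a minimal sketch of magnitude-based pruning in plain Python. It is not the toolkit's API: the function name `magnitude_prune` and the `sparsity` parameter are hypothetical, and the sketch assumes the simplest criterion, zeroing the weights with the smallest absolute values.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero.

    `sparsity` is the fraction of weights to remove (0.0 keeps all, 1.0 removes all).
    This is an illustrative criterion, not the toolkit's implementation.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

The zeroed entries can then be stored in a sparse or compressed format, which is where the reduction in model size would come from.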
Quantization of deep neural networks uses techniques that allow reduced-precision representations of weights and, optionally, activations, for both storage and computation. Quantization provides several benefits:
- Support on existing CPU platforms.
- Quantizing activations reduces memory access costs for reading and storing intermediate activations.
- Many CPU and hardware accelerator implementations provide SIMD instruction capabilities, which are especially beneficial for quantization.
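As a rough illustration of what a reduced-precision representation means, the sketch below maps floating-point values to 8-bit unsigned integers with a scale and a zero point, then maps them back. It assumes a simple affine (asymmetric) scheme; the function names are hypothetical, and the scheme is not necessarily the one TensorFlow Lite uses internally.

```python
def choose_quantization_params(values, num_bits=8):
    """Pick a scale and zero point so that the range of `values` fits in num_bits."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1], clamping at the edges."""
    qmin, qmax = 0, 2 ** num_bits - 1
    return [min(qmax, max(qmin, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-0.62, -0.1, 0.0, 0.37, 1.25]
scale, zero_point = choose_quantization_params(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
print(q)
print(restored)
```

Each quantized value fits in one byte, compared with four bytes for a 32-bit float, and the restored values differ from the originals by at most about one quantization step (`scale`).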
TensorFlow Lite provides several levels of support for quantization.
Post-training quantization quantizes weights and activations after training and is very easy to use. Quantization-aware training allows networks to be trained so that they can be quantized with minimal accuracy drop; it is only available for a subset of convolutional neural network architectures.
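A minimal sketch of post-training quantization with the TensorFlow Lite converter might look like the following. It assumes a TensorFlow 2.x installation and a model exported in SavedModel format; the wrapper function name and the paths in the usage comment are hypothetical.

```python
def convert_with_post_training_quantization(saved_model_dir):
    """Return a quantized TensorFlow Lite flatbuffer (bytes) for a SavedModel."""
    # Imported here so that defining the function does not require TensorFlow.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    # Optimize.DEFAULT enables the converter's default post-training quantization.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    return converter.convert()


# Example usage (hypothetical paths):
# tflite_model = convert_with_post_training_quantization("path/to/saved_model")
# with open("model_quant.tflite", "wb") as f:
#     f.write(tflite_model)
```

No training data is passed to the converter in this sketch, which is consistent with the tool being easy to apply to an already-trained model.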
Latency and accuracy results
The table below reports the latency and accuracy of post-training quantization and quantization-aware training on a few models. All latency numbers are measured on Pixel 2 devices using a single big core. As the toolkit improves, so will the numbers here:
|Model|Top-1 Accuracy (Original)|Top-1 Accuracy (Post-Training Quantized)|Top-1 Accuracy (Quantization-Aware Training)|Latency (Original) (ms)|Latency (Post-Training Quantized) (ms)|Latency (Quantization-Aware Training) (ms)|Size (Original) (MB)|Size (Optimized) (MB)|
|---|---|---|---|---|---|---|---|---|
Choice of quantization tool
As a starting point, check whether the models in the TensorFlow Lite model repository can work for your application. If not, we recommend that users start with the post-training quantization tool, since it is broadly applicable and does not require training data. For cases where the accuracy and latency targets are not met, or where hardware accelerator support is important, quantization-aware training is the better option.