This document provides an overview of model pruning to help you determine how it fits your use case. To dive right into the code, see the Pruning with Keras tutorial and the API docs. For additional details on how to use the Keras API, a deep dive into pruning, and documentation on more advanced usage patterns, see the Train sparse models guide.
Magnitude-based weight pruning gradually zeroes out model weights during the training process to achieve model sparsity. Sparse models are easier to compress, and we can skip the zeroes during inference for latency improvements.
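To make the idea concrete, here is a minimal sketch of a single magnitude-based pruning step in plain Python. It is an illustration only, not the toolkit's implementation: the function name `prune_by_magnitude` and the flat list of weights are assumptions made for this example. During real training, a step like this is applied repeatedly while the target sparsity is raised gradually.

```python
def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries zeroed.

    `sparsity` is the fraction of entries to zero out, between 0.0 and 1.0.
    (Hypothetical helper for illustration; not part of any library.)
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered by ascending absolute value; the first ones are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_to_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.02, 0.6]
sparse_weights = prune_by_magnitude(weights, sparsity=0.5)
print(sparse_weights)
```

With a sparsity of 0.5, the four weights closest to zero are replaced by 0.0 and the four largest-magnitude weights are kept unchanged.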
This technique brings improvements via model compression. In the future, framework support for this technique will provide latency improvements. We've seen up to 6x improvements in model compression with minimal loss of accuracy.
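The compression benefit can be seen with a small experiment. The sketch below is an illustration under stated assumptions, not a benchmark: it uses random values in place of real model weights, packs them as 32-bit floats, and uses `zlib` as a stand-in for whatever compressor you apply to a saved model. The helper `compressed_size` is hypothetical.

```python
import random
import struct
import zlib


def compressed_size(values):
    """Size in bytes of `values` packed as float32 and compressed with zlib."""
    raw = struct.pack(f"{len(values)}f", *values)
    return len(zlib.compress(raw))


rng = random.Random(0)
# Random stand-in for a dense weight tensor (assumption: not real weights).
dense = [rng.uniform(-1.0, 1.0) for _ in range(10_000)]

# Zero out the 80% of values with the smallest magnitude.
threshold = sorted(abs(w) for w in dense)[int(0.8 * len(dense))]
sparse = [w if abs(w) >= threshold else 0.0 for w in dense]

dense_size = compressed_size(dense)
sparse_size = compressed_size(sparse)
print(f"dense: {dense_size} bytes, sparse: {sparse_size} bytes")
```

The long runs of zero bytes in the sparse version are what a general-purpose compressor exploits. The exact ratio depends on the weights, the sparsity level, and the compressor, so the numbers printed here should not be read as representative of a real model.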
The technique is being evaluated in various speech applications, such as speech recognition and text-to-speech, and has been tried on a range of vision and translation models.
API Compatibility Matrix
Users can apply pruning with the following APIs:
- Model building: tf.keras with only Sequential and Functional models
- TensorFlow versions: TF 1.x for versions 1.14+ and 2.x.
- TensorFlow execution mode: both graph and eager
- Distributed training: tf.distribute with only graph execution
It is on our roadmap to add support in the following areas:
| Model | Non-sparse Top-1 Accuracy | Sparse Accuracy | Sparsity |
The models were tested on ImageNet.
| Model | Non-sparse BLEU | Sparse BLEU | Sparsity |
The models use the WMT16 German and English dataset, with news-test2013 as the dev set and news-test2015 as the test set.
In addition to the Pruning with Keras tutorial, see the following examples:
- Train a CNN model on the MNIST handwritten digit classification task with pruning: code
- Train an LSTM on the IMDB sentiment classification task with pruning: code
- Start with a pre-trained model or weights if possible. If not, train a model without pruning first, then start pruning it.
- Do not prune too frequently, so that the model has time to recover between pruning steps. The toolkit provides a default frequency.
- Try running an experiment where you prune a pre-trained model to the final sparsity with begin step 0.
- Have a learning rate that's not too high or too low when the model is pruning. Consider the pruning schedule to be a hyperparameter.
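To see what treating the pruning schedule as a hyperparameter means in practice, the sketch below computes a target sparsity for each training step using a polynomial ramp. This is an illustrative assumption about the schedule shape, not the toolkit's code: the function `polynomial_sparsity`, its parameter names, and the cubic default are chosen for this example, and the toolkit's own schedule classes may differ in detail.

```python
def polynomial_sparsity(step, initial_sparsity, final_sparsity,
                        begin_step, end_step, power=3):
    """Target sparsity at `step` for a polynomial ramp (hypothetical helper).

    Sparsity rises quickly at first and flattens out as it approaches
    `final_sparsity`, which gives the model more time to recover late on.
    """
    if step <= begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return (final_sparsity
            + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power)


# Example: ramp from 0% to 80% sparsity over 1000 steps, starting at step 0.
schedule = [
    (step, polynomial_sparsity(step, 0.0, 0.8, begin_step=0, end_step=1000))
    for step in range(0, 1001, 250)
]
for step, sparsity in schedule:
    print(f"step {step:4d}: target sparsity {sparsity:.3f}")
```

The values of `begin_step`, `end_step`, `final_sparsity`, and the pruning frequency interact with the learning rate, so it is reasonable to tune them together rather than fixing them up front.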
For background, see To prune, or not to prune: exploring the efficacy of pruning for model compression [paper].