The Android Neural Networks API (NNAPI) is available on all Android devices running Android 8.1 (API level 27) or higher. It provides acceleration for TensorFlow Lite models on Android devices with supported hardware accelerators including:
- Graphics Processing Unit (GPU)
- Digital Signal Processor (DSP)
- Neural Processing Unit (NPU)
Performance varies depending on the specific hardware available on the device.
This page describes how to use the NNAPI delegate with the TensorFlow Lite Interpreter in Java and Kotlin. For the Android C APIs, refer to the Android Native Development Kit (NDK) documentation.
Trying the NNAPI delegate on your own model
Gradle import
The NNAPI delegate is part of the TensorFlow Lite Android interpreter, release 1.14.0 or higher. You can import it into your project by adding the following to your module's Gradle file:
dependencies {
    implementation 'org.tensorflow:tensorflow-lite:2.0.0'
}
Initializing the NNAPI delegate
Add the code to initialize the NNAPI delegate before you initialize the TensorFlow Lite interpreter.
import android.os.Build;
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.nnapi.NnApiDelegate;

Interpreter.Options options = new Interpreter.Options();
NnApiDelegate nnApiDelegate = null;
// Initialize interpreter with the NNAPI delegate on Android Pie (API level 28) or above.
if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.P) {
    nnApiDelegate = new NnApiDelegate();
    options.addDelegate(nnApiDelegate);
}

// Initialize the TFLite interpreter; loadModelFile() is assumed to return the model
// as a MappedByteBuffer loaded from the app's assets.
try {
    tfLite = new Interpreter(loadModelFile(assetManager, modelFilename), options);
} catch (Exception e) {
    throw new RuntimeException(e);
}

// Run inference
// ...

// Close the interpreter first, then the delegate.
tfLite.close();
if (nnApiDelegate != null) {
    nnApiDelegate.close();
}
Best practices
Test performance before deploying
Runtime performance can vary significantly depending on model architecture, size, operations, hardware availability, and runtime hardware utilization. For example, if an app heavily utilizes the GPU for rendering, NNAPI acceleration may not improve performance due to resource contention. We recommend running a simple performance test using the debug logger to measure inference time. Run the test on several phones with different chipsets (from different manufacturers, or different models from the same manufacturer) that are representative of your user base before enabling NNAPI in production.
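As an illustration, here is a minimal timing sketch (not the official benchmark tool), assuming tfLite is the Interpreter created above and that inputBuffer and outputBuffer are placeholders matching your model's input and output tensors:

import android.os.SystemClock;
import android.util.Log;

// Warm up once so one-time initialization cost is not included in the measurement.
tfLite.run(inputBuffer, outputBuffer);

// Time a fixed number of inferences and log the average latency.
final int runs = 50;
long startNs = SystemClock.elapsedRealtimeNanos();
for (int i = 0; i < runs; i++) {
    tfLite.run(inputBuffer, outputBuffer);
}
long elapsedNs = SystemClock.elapsedRealtimeNanos() - startNs;
Log.d("NnapiBenchmark", "Average inference time: " + (elapsedNs / runs / 1_000_000.0) + " ms");

Comparing the logged numbers with and without options.addDelegate(nnApiDelegate) shows whether NNAPI actually helps on a given device.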
For advanced developers, TensorFlow Lite also offers a model benchmark tool for Android.
Create a device exclusion list
In production, there may be cases where NNAPI does not perform as expected. We
recommend that developers maintain a list of devices that should not use NNAPI
acceleration in combination with particular models. You can create this list
based on the value of "ro.board.platform", which you can retrieve using the
following code snippet:
import android.util.Log;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

String boardPlatform = "";

try {
    Process sysProcess =
        new ProcessBuilder("/system/bin/getprop", "ro.board.platform")
            .redirectErrorStream(true).start();

    BufferedReader reader =
        new BufferedReader(new InputStreamReader(sysProcess.getInputStream()));
    String currentLine = null;

    // Keep the last non-empty line of getprop output as the platform name.
    while ((currentLine = reader.readLine()) != null) {
        boardPlatform = currentLine;
    }
    sysProcess.destroy();
} catch (IOException e) {
    // Leave boardPlatform empty if the property cannot be read.
}

Log.d("Board Platform", boardPlatform);
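As a sketch of how such a list might be applied, you can skip adding the delegate when the board platform is excluded. The NNAPI_EXCLUSION_LIST variable and the platform names below are hypothetical placeholders; in practice you would populate the list from your own configuration source:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical exclusion list; load it from remote configuration in a real app.
Set<String> NNAPI_EXCLUSION_LIST = new HashSet<>(Arrays.asList("platform_a", "platform_b"));

Interpreter.Options options = new Interpreter.Options();
NnApiDelegate nnApiDelegate = null;
if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.P
        && !NNAPI_EXCLUSION_LIST.contains(boardPlatform)) {
    nnApiDelegate = new NnApiDelegate();
    options.addDelegate(nnApiDelegate);
}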
For advanced developers, consider maintaining this list via a remote configuration system. The TensorFlow team is actively working on ways to simplify and automate discovering and applying the optimal NNAPI configuration.
Quantization
Quantization reduces model size by using 8-bit integers or 16-bit floats instead of 32-bit floats for computation. 8-bit integer models are a quarter of the size of their 32-bit float counterparts; 16-bit float models are half the size. Quantization can improve performance significantly, though the process may trade off some model accuracy.
There are multiple types of post-training quantization techniques available, but, for maximum support and acceleration on current hardware, we recommend full integer quantization. This approach converts both the weight and the operations into integers. This quantization process requires a representative dataset to work.
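Quantization itself is performed offline with the TensorFlow Lite converter, but you can check at runtime whether a model was fully integer-quantized by inspecting its tensor types. A minimal sketch, assuming tfLite is the Interpreter created above:

import org.tensorflow.lite.DataType;

// A fully integer-quantized model exposes integer (e.g. UINT8 or INT8) input and output tensors.
DataType inputType = tfLite.getInputTensor(0).dataType();
DataType outputType = tfLite.getOutputTensor(0).dataType();
Log.d("Quantization", "input=" + inputType + ", output=" + outputType);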
Use supported models and ops
If the NNAPI delegate does not support some of the ops or parameter combinations in a model, the framework only runs the supported parts of the graph on the accelerator. The remainder runs on the CPU, which results in split execution. Due to the high cost of CPU/accelerator synchronization, this may result in slower performance than executing the whole network on the CPU alone.
NNAPI performs best when models only use supported ops. The following models are known to be compatible with NNAPI:
- MobileNet v1 (224x224) image classification (float model download) (quantized model download): image classification model designed for mobile and embedded vision applications.
- MobileNet v2 SSD object detection (download): model that detects multiple objects with bounding boxes.
- MobileNet v1 (300x300) Single Shot Detector (SSD) object detection (download).
- PoseNet for pose estimation (download): vision model that estimates the poses of one or more people in an image or video.
NNAPI acceleration is also not supported when the model contains dynamically-sized outputs. In this case, you will get a warning like:
ERROR: Attempting to use a delegate that only supports static-sized tensors \
with a graph that has dynamic-sized tensors.
Enable NNAPI CPU implementation
A graph that can't be processed completely by an accelerator can fall back to
the NNAPI CPU implementation. However, since this is typically less performant
than the TensorFlow Lite interpreter, this option is disabled by default in the
NNAPI delegate for Android 10 (API level 29) or above. To override this
behavior, set setUseNnapiCpu to true in the NnApiDelegate.Options object.
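A minimal sketch of how this might look, assuming the same interpreter options setup as in the initialization example above:

import org.tensorflow.lite.nnapi.NnApiDelegate;

// Allow falling back to the NNAPI CPU implementation for parts of the graph
// that no accelerator can handle.
NnApiDelegate.Options nnApiOptions = new NnApiDelegate.Options();
nnApiOptions.setUseNnapiCpu(true);
NnApiDelegate nnApiDelegate = new NnApiDelegate(nnApiOptions);
options.addDelegate(nnApiDelegate);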