TensorFlow Lite NNAPI delegate

The Android Neural Networks API (NNAPI) is available on all Android devices running Android 8.1 (API level 27) or higher. It provides acceleration for TensorFlow Lite models on Android devices with supported hardware accelerators including:

  • Graphics Processing Unit (GPU)
  • Digital Signal Processor (DSP)
  • Neural Processing Unit (NPU)

Performance varies depending on the specific hardware available on the device.

This page describes how to use the NNAPI delegate with the TensorFlow Lite Interpreter in Java and Kotlin. For the Android C APIs, refer to the Android Native Development Kit (NDK) documentation.

Trying the NNAPI delegate on your own model

Gradle import

The NNAPI delegate is part of the TensorFlow Lite Android interpreter, release 1.14.0 or higher. You can import it into your project by adding the following to your module's Gradle file:

dependencies {
   implementation 'org.tensorflow:tensorflow-lite:+'
}

Initializing the NNAPI delegate

Add the code to initialize the NNAPI delegate before you initialize the TensorFlow Lite interpreter.

kotlin

import android.content.res.AssetManager
import android.os.Build
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.nnapi.NnApiDelegate
import java.io.FileInputStream
import java.io.IOException
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel
...

val options = Interpreter.Options()
var nnApiDelegate: NnApiDelegate? = null
// Initialize interpreter with NNAPI delegate for Android Pie or above
if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.P) {
    nnApiDelegate = NnApiDelegate()
    options.addDelegate(nnApiDelegate)
}
val assetManager = assets

// Initialize TFLite interpreter
val tfLite: Interpreter
try {
    tfLite = Interpreter(loadModelFile(assetManager, "model.tflite"), options)
} catch (e: Exception) {
    throw RuntimeException(e)
}

// Run inference
// ...

// Unload delegate
tfLite.close()
nnApiDelegate?.close()

...

@Throws(IOException::class)
private fun loadModelFile(assetManager: AssetManager, modelFilename: String): MappedByteBuffer {
    val fileDescriptor = assetManager.openFd(modelFilename)
    val inputStream = FileInputStream(fileDescriptor.fileDescriptor)
    val fileChannel = inputStream.channel
    val startOffset = fileDescriptor.startOffset
    val declaredLength = fileDescriptor.declaredLength
    return fileChannel.map(FileChannel.MapMode.READ_ONLY, startOffset, declaredLength)
}

...

java

import android.content.res.AssetFileDescriptor;
import android.content.res.AssetManager;
import android.os.Build;
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.nnapi.NnApiDelegate;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
...

Interpreter.Options options = new Interpreter.Options();
NnApiDelegate nnApiDelegate = null;
// Initialize interpreter with NNAPI delegate for Android Pie or above
if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.P) {
    nnApiDelegate = new NnApiDelegate();
    options.addDelegate(nnApiDelegate);
}

AssetManager assetManager = getAssets();
// Initialize TFLite interpreter
Interpreter tfLite;
try {
    tfLite = new Interpreter(loadModelFile(assetManager, "model.tflite"), options);
} catch (Exception e) {
    throw new RuntimeException(e);
}

// Run inference
// ...

// Unload delegate
tfLite.close();
if (nnApiDelegate != null) {
    nnApiDelegate.close();
}

...

private MappedByteBuffer loadModelFile(AssetManager assetManager, String modelFilename) throws IOException {
    AssetFileDescriptor fileDescriptor = assetManager.openFd(modelFilename);
    FileInputStream inputStream = new FileInputStream(fileDescriptor.getFileDescriptor());
    FileChannel fileChannel = inputStream.getChannel();
    long startOffset = fileDescriptor.getStartOffset();
    long declaredLength = fileDescriptor.getDeclaredLength();
    return fileChannel.map(FileChannel.MapMode.READ_ONLY, startOffset, declaredLength);
}

...

Best practices

Test performance before deploying

Runtime performance can vary significantly depending on model architecture, size, operations, hardware availability, and runtime hardware utilization. For example, if an app heavily utilizes the GPU for rendering, NNAPI acceleration may not improve performance due to resource contention. We recommend running a simple performance test using the debug logger to measure inference time. Before enabling NNAPI in production, run the test on several phones with different chipsets (from different manufacturers, or different models from the same manufacturer) that are representative of your user base.
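
As an illustration, a minimal timing sketch in Kotlin might look like the following. It assumes the tfLite interpreter from the setup code above; inputBuffer and outputBuffer are hypothetical objects shaped for your model's inputs and outputs.

import android.os.SystemClock
import android.util.Log

// Warm up first so one-time initialization cost does not skew the measurement,
// then average the inference latency over a number of timed runs.
val warmupRuns = 5
val timedRuns = 50

repeat(warmupRuns) { tfLite.run(inputBuffer, outputBuffer) }

var totalMs = 0L
repeat(timedRuns) {
    val start = SystemClock.uptimeMillis()
    tfLite.run(inputBuffer, outputBuffer)
    totalMs += SystemClock.uptimeMillis() - start
}
Log.d("TfLiteTiming", "Average inference time: ${totalMs / timedRuns} ms over $timedRuns runs")

Comparing these numbers with the NNAPI delegate enabled and disabled on each test device gives a quick signal of whether acceleration actually helps.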

For advanced developers, TensorFlow Lite also offers a model benchmark tool for Android.

Create a device exclusion list

In production, there may be cases where NNAPI does not perform as expected. We recommend developers maintain a list of devices that should not use NNAPI acceleration in combination with particular models. You can create this list based on the value of "ro.board.platform", which you can retrieve using the following code snippet:

String boardPlatform = "";

try {
    Process sysProcess =
        new ProcessBuilder("/system/bin/getprop", "ro.board.platform").
        redirectErrorStream(true).start();

    BufferedReader reader = new BufferedReader
        (new InputStreamReader(sysProcess.getInputStream()));
    String currentLine = null;

    while ((currentLine=reader.readLine()) != null){
        boardPlatform = line;
    }
    sysProcess.destroy();
} catch (IOException e) {}

Log.d("Board Platform", boardPlatform);

For advanced developers, consider maintaining this list via a remote configuration system. The TensorFlow team is actively working on ways to simplify and automate discovering and applying the optimal NNAPI configuration.

Quantization

Quantization reduces model size and computation by using 8-bit integers or 16-bit floats instead of 32-bit floats. An 8-bit integer model is a quarter of the size of its 32-bit float counterpart; a 16-bit float model is half the size. Quantization can improve performance significantly, though the process may trade off some model accuracy.

There are multiple post-training quantization techniques available, but for maximum support and acceleration on current hardware, we recommend full integer quantization. This approach converts both the weights and the operations to integers. The quantization process requires a representative dataset.

Use supported models and ops

If the NNAPI delegate does not support some of the ops or parameter combinations in a model, the framework only runs the supported parts of the graph on the accelerator. The remainder runs on the CPU, which results in split execution. Due to the high cost of CPU/accelerator synchronization, this may result in slower performance than executing the whole network on the CPU alone.

NNAPI performs best when models only use supported ops. The following models are known to be compatible with NNAPI:

NNAPI acceleration is also not supported when the model contains dynamically sized outputs. In that case, you will see a warning like:

ERROR: Attempting to use a delegate that only supports static-sized tensors \
with a graph that has dynamic-sized tensors.

Enable NNAPI CPU implementation

A graph that can't be processed completely by an accelerator can fall back to the NNAPI CPU implementation. However, since this is typically less performant than the TensorFlow Lite interpreter, this option is disabled by default in the NNAPI delegate for Android 10 (API level 29) or above. To override this behavior, call setUseNnapiCpu(true) on the NnApiDelegate.Options object.
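
For example, a minimal sketch in Kotlin, reusing the options object from the initialization code above:

// Allow the NNAPI CPU implementation as a fallback for graph parts
// that no accelerator can handle (disabled by default on Android 10+).
val nnApiOptions = NnApiDelegate.Options()
nnApiOptions.setUseNnapiCpu(true)

val nnApiDelegate = NnApiDelegate(nnApiOptions)
options.addDelegate(nnApiDelegate)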