Post-training float16 quantization


Overview

TensorFlow Lite now supports converting weights to 16-bit floating point values during model conversion from TensorFlow to TensorFlow Lite's flat buffer format. This results in a 2x reduction in model size. Some hardware, like GPUs, can compute natively in this reduced-precision arithmetic, realizing a speedup over traditional floating point execution; the TensorFlow Lite GPU delegate can be configured to run this way. However, a model converted to float16 weights can still run on the CPU without additional modification: the float16 weights are upsampled to float32 prior to the first inference. This permits a significant reduction in model size in exchange for a minimal impact on latency and accuracy.
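
The 2x size reduction follows directly from the per-weight storage cost: float32 takes 4 bytes, float16 takes 2. A back-of-the-envelope sketch (the weight count below is hypothetical, purely for illustration):

import numpy as np

num_weights = 1_000_000  # hypothetical model size, for illustration only
fp32_bytes = num_weights * np.dtype(np.float32).itemsize  # 4 bytes per weight
fp16_bytes = num_weights * np.dtype(np.float16).itemsize  # 2 bytes per weight
print(fp32_bytes / fp16_bytes)  # 2.0 -- the weights shrink by half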

In this tutorial, you train an MNIST model from scratch, check its accuracy in TensorFlow, and then convert the model into a TensorFlow Lite flatbuffer with float16 quantization. Finally, check the accuracy of the converted model and compare it to the original float32 model.

Build an MNIST model

Setup

import logging
logging.getLogger("tensorflow").setLevel(logging.DEBUG)

import tensorflow as tf
from tensorflow import keras
import numpy as np
import pathlib
tf.float16

Train and export the model

# Load MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the input image so that each pixel value is between 0 to 1.
train_images = train_images / 255.0
test_images = test_images / 255.0

# Define the model architecture
model = keras.Sequential([
  keras.layers.InputLayer(input_shape=(28, 28)),
  keras.layers.Reshape(target_shape=(28, 28, 1)),
  keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation=tf.nn.relu),
  keras.layers.MaxPooling2D(pool_size=(2, 2)),
  keras.layers.Flatten(),
  keras.layers.Dense(10)
])

# Train the digit classification model
model.compile(optimizer='adam',
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(
  train_images,
  train_labels,
  epochs=1,
  validation_data=(test_images, test_labels)
)
1875/1875 [==============================] - 6s 2ms/step - loss: 0.2932 - accuracy: 0.9164 - val_loss: 0.1466 - val_accuracy: 0.9590
<keras.callbacks.History at 0x7f745169b090>

For the example, you trained the model for just a single epoch, so it only trains to ~96% accuracy.
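
If you want to re-confirm the baseline accuracy in TensorFlow before converting, a minimal check is to re-run the evaluation pass (model.evaluate returns the loss followed by the compiled metrics):

_, baseline_accuracy = model.evaluate(test_images, test_labels, verbose=0)
print("Baseline float32 accuracy:", baseline_accuracy)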

Convert to a TensorFlow Lite model

Using the Python TFLiteConverter, you can now convert the trained model into a TensorFlow Lite model.

Now load the model using the TFLiteConverter:

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
INFO:tensorflow:Assets written to: /tmp/tmpzg3rlpvi/assets

Write it out to a .tflite file:

tflite_models_dir = pathlib.Path("/tmp/mnist_tflite_models/")
tflite_models_dir.mkdir(exist_ok=True, parents=True)
tflite_model_file = tflite_models_dir/"mnist_model.tflite"
tflite_model_file.write_bytes(tflite_model)
84500

To instead quantize the model to float16 on export, first set the optimizations flag to use the default optimizations. Then specify that float16 is the supported type for the target platform:

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

Finally, convert the model as usual. Note that by default the converted model will still use float inputs and outputs, for invocation convenience.

tflite_fp16_model = converter.convert()
tflite_model_fp16_file = tflite_models_dir/"mnist_model_quant_f16.tflite"
tflite_model_fp16_file.write_bytes(tflite_fp16_model)
INFO:tensorflow:Assets written to: /tmp/tmpn9j2mw5w/assets
44432

Note how the resulting file is approximately half the size.

ls -lh {tflite_models_dir}
total 152K
-rw-rw-r-- 1 kbuilder kbuilder 83K Aug 12 11:14 mnist_model.tflite
-rw-rw-r-- 1 kbuilder kbuilder 24K Aug 12 11:13 mnist_model_quant.tflite
-rw-rw-r-- 1 kbuilder kbuilder 44K Aug 12 11:14 mnist_model_quant_f16.tflite
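
You can also confirm the ratio programmatically with pathlib, which is already imported; a minimal sketch:

fp32_size = tflite_model_file.stat().st_size
fp16_size = tflite_model_fp16_file.stat().st_size
print("float16 model is %.2fx smaller" % (fp32_size / fp16_size))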

Run the TensorFlow Lite models

Run the TensorFlow Lite model using the Python TensorFlow Lite Interpreter.

Load the model into the interpreters

interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()
interpreter_fp16 = tf.lite.Interpreter(model_path=str(tflite_model_fp16_file))
interpreter_fp16.allocate_tensors()
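
As noted above, the float16-quantized model still exposes float32 inputs and outputs; only the stored weights are in half precision. You can verify this from the interpreter's tensor details:

print(interpreter_fp16.get_input_details()[0]["dtype"])   # <class 'numpy.float32'>
print(interpreter_fp16.get_output_details()[0]["dtype"])  # <class 'numpy.float32'>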

Test the models on one image

test_image = np.expand_dims(test_images[0], axis=0).astype(np.float32)

input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

interpreter.set_tensor(input_index, test_image)
interpreter.invoke()
predictions = interpreter.get_tensor(output_index)
import matplotlib.pylab as plt

plt.imshow(test_images[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true= str(test_labels[0]),
                              predict=str(np.argmax(predictions[0]))))
plt.grid(False)

[Figure: the first test image, titled with its true and predicted digit]

test_image = np.expand_dims(test_images[0], axis=0).astype(np.float32)

input_index = interpreter_fp16.get_input_details()[0]["index"]
output_index = interpreter_fp16.get_output_details()[0]["index"]

interpreter_fp16.set_tensor(input_index, test_image)
interpreter_fp16.invoke()
predictions = interpreter_fp16.get_tensor(output_index)
plt.imshow(test_images[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true= str(test_labels[0]),
                              predict=str(np.argmax(predictions[0]))))
plt.grid(False)

[Figure: the first test image, titled with its true and predicted digit]
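
Beyond eyeballing the two plots, you can compare the raw outputs of both interpreters numerically. A minimal sketch, reusing the variables defined above:

# Run both interpreters on the same image and compare their logits.
interpreter.set_tensor(interpreter.get_input_details()[0]["index"], test_image)
interpreter.invoke()
fp32_logits = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])

interpreter_fp16.set_tensor(input_index, test_image)
interpreter_fp16.invoke()
fp16_logits = interpreter_fp16.get_tensor(output_index)

# The difference should be small, since only the weights were rounded to float16.
print(np.max(np.abs(fp32_logits - fp16_logits)))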

Evaluate the models

# A helper function to evaluate the TF Lite model using "test" dataset.
def evaluate_model(interpreter):
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]

  # Run predictions on every image in the "test" dataset.
  prediction_digits = []
  for test_image in test_images:
    # Pre-processing: add batch dimension and convert to float32 to match with
    # the model's input data format.
    test_image = np.expand_dims(test_image, axis=0).astype(np.float32)
    interpreter.set_tensor(input_index, test_image)

    # Run inference.
    interpreter.invoke()

    # Post-processing: remove batch dimension and find the digit with highest
    # probability.
    output = interpreter.tensor(output_index)
    digit = np.argmax(output()[0])
    prediction_digits.append(digit)

  # Compare prediction results with ground truth labels to calculate accuracy.
  accurate_count = 0
  for index in range(len(prediction_digits)):
    if prediction_digits[index] == test_labels[index]:
      accurate_count += 1
  accuracy = accurate_count * 1.0 / len(prediction_digits)

  return accuracy
print(evaluate_model(interpreter))
0.959

Repeat the evaluation on the float16 quantized model to obtain:

# NOTE: Colab runs on server CPUs. At the time of writing this, TensorFlow Lite
# doesn't have super optimized server CPU kernels. For this reason this may be
# slower than the above float interpreter. But for mobile CPUs, considerable
# speedup can be observed.
print(evaluate_model(interpreter_fp16))
0.959

In this example, you have quantized a model to float16 with no difference in the accuracy.

It's also possible to evaluate the fp16 quantized model on the GPU. To perform all arithmetic with the reduced precision values, be sure to create the TfLiteGpuDelegateOptions struct in your app and set precision_loss_allowed to 1, like this:

//Prepare GPU delegate.
const TfLiteGpuDelegateOptions options = {
  .metadata = NULL,
  .compile_options = {
    .precision_loss_allowed = 1,  // FP16
    .preferred_gl_object_type = TFLITE_GL_OBJECT_TYPE_FASTEST,
    .dynamic_batch_enabled = 0,   // Not fully functional yet
  },
};
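
If you are driving TensorFlow Lite from Python rather than C, a GPU delegate can also be attached when constructing the interpreter via tf.lite.experimental.load_delegate. This is a sketch under assumptions, not a definitive recipe: the delegate library name below is platform-specific (it differs across Android, Linux, and macOS builds of the delegate), so substitute the binary for your platform.

# Hypothetical library name -- use the GPU delegate binary built for your platform.
gpu_delegate = tf.lite.experimental.load_delegate('libtensorflowlite_gpu_delegate.so')
interpreter_gpu = tf.lite.Interpreter(
    model_path=str(tflite_model_fp16_file),
    experimental_delegates=[gpu_delegate])
interpreter_gpu.allocate_tensors()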

Detailed documentation on the TFLite GPU delegate and how to use it in your application can be found here.