Post-training float16 quantization


Overview

TensorFlow Lite now supports converting weights to 16-bit floating point values during model conversion from TensorFlow to TensorFlow Lite's flat buffer format. This results in a 2x reduction in model size. Some hardware, like GPUs, can compute natively in this reduced precision arithmetic, realizing a speedup over traditional floating point execution. The TensorFlow Lite GPU delegate can be configured to run this way. However, a model converted to float16 weights can still run on the CPU without additional modification: the float16 weights are upsampled to float32 prior to the first inference. This permits a significant reduction in model size in exchange for a minimal impact to latency and accuracy.
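As a rough illustration of where the 2x size reduction comes from, here is a minimal numpy sketch (not part of the original notebook): casting a float32 weight array to float16 halves its byte size, and casting it back to float32, as the CPU path does before the first inference, recovers the values with only a small rounding error.

import numpy as np

# Hypothetical weight tensor, purely for illustration.
weights_fp32 = np.random.rand(1000).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)   # stored form: half the bytes
restored = weights_fp16.astype(np.float32)       # what the CPU path computes with

print(weights_fp32.nbytes, weights_fp16.nbytes)  # 4000 vs. 2000 bytes
print(np.max(np.abs(weights_fp32 - restored)))   # small rounding error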

In this tutorial, you train an MNIST model from scratch, check its accuracy in TensorFlow, and then convert the model into a TensorFlow Lite flatbuffer with float16 quantization. Finally, check the accuracy of the converted model and compare it to the original float32 model.

Build an MNIST model

Setup

import logging
logging.getLogger("tensorflow").setLevel(logging.DEBUG)

import tensorflow as tf
from tensorflow import keras
import numpy as np
import pathlib

Train and export the model

# Load MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the input image so that each pixel value is between 0 to 1.
train_images = train_images / 255.0
test_images = test_images / 255.0

# Define the model architecture
model = keras.Sequential([
  keras.layers.InputLayer(input_shape=(28, 28)),
  keras.layers.Reshape(target_shape=(28, 28, 1)),
  keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation=tf.nn.relu),
  keras.layers.MaxPooling2D(pool_size=(2, 2)),
  keras.layers.Flatten(),
  keras.layers.Dense(10)
])

# Train the digit classification model
model.compile(optimizer='adam',
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(
  train_images,
  train_labels,
  epochs=1,
  validation_data=(test_images, test_labels)
)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
11501568/11490434 [==============================] - 0s 0us/step
1875/1875 [==============================] - 13s 2ms/step - loss: 0.2655 - accuracy: 0.9244 - val_loss: 0.1237 - val_accuracy: 0.9654
<keras.callbacks.History at 0x7f3f8428e6d0>

For the example, you trained the model for just a single epoch, so it only trains to ~96% accuracy.
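As a reference point for the TensorFlow Lite comparisons below, you can also read the Keras model's test accuracy directly (a quick sketch; model.evaluate returns the loss plus the metrics configured in compile):

# Baseline accuracy of the float32 Keras model on the test set.
_, keras_accuracy = model.evaluate(test_images, test_labels, verbose=0)
print(keras_accuracy)  # should match the val_accuracy reported above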

Convert to a TensorFlow Lite model

Using the Python TFLiteConverter, you can now convert the trained model into a TensorFlow Lite model.

Now load the model using the TFLiteConverter:

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
2021-12-14 12:18:07.073783: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
INFO:tensorflow:Assets written to: /tmp/tmpm1s3vkrd/assets
2021-12-14 12:18:07.876066: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:363] Ignored output_format.
2021-12-14 12:18:07.876112: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:366] Ignored drop_control_dependency.
WARNING:absl:Buffer deduplication procedure will be skipped when flatbuffer library is not properly loaded

Write it out to a .tflite file:

tflite_models_dir = pathlib.Path("/tmp/mnist_tflite_models/")
tflite_models_dir.mkdir(exist_ok=True, parents=True)
tflite_model_file = tflite_models_dir/"mnist_model.tflite"
tflite_model_file.write_bytes(tflite_model)
84540

To instead quantize the model to float16 on export, first set the optimizations flag to use default optimizations. Then specify that float16 is the supported type on the target platform:

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

Finally, convert the model like usual. Note, by default the converted model will still use float inputs and outputs for invocation convenience.

tflite_fp16_model = converter.convert()
tflite_model_fp16_file = tflite_models_dir/"mnist_model_quant_f16.tflite"
tflite_model_fp16_file.write_bytes(tflite_fp16_model)
INFO:tensorflow:Assets written to: /tmp/tmpvjt9l68i/assets
2021-12-14 12:18:08.810262: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:363] Ignored output_format.
2021-12-14 12:18:08.810303: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:366] Ignored drop_control_dependency.
WARNING:absl:Buffer deduplication procedure will be skipped when flatbuffer library is not properly loaded
44384

Note how the resulting file is approximately 1/2 the size.

ls -lh {tflite_models_dir}
total 128K
-rw-rw-r-- 1 kbuilder kbuilder 83K Dec 14 12:18 mnist_model.tflite
-rw-rw-r-- 1 kbuilder kbuilder 44K Dec 14 12:18 mnist_model_quant_f16.tflite
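If you prefer to check the ratio programmatically, here is a small sketch using the paths defined above:

# Compare the on-disk sizes of the float32 and float16 models.
fp32_size = tflite_model_file.stat().st_size
fp16_size = tflite_model_fp16_file.stat().st_size
print("float16 model is {:.2f}x smaller".format(fp32_size / fp16_size))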

Run the TensorFlow Lite models

Run the TensorFlow Lite model using the Python TensorFlow Lite Interpreter.

Load the model into the interpreters

interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()
interpreter_fp16 = tf.lite.Interpreter(model_path=str(tflite_model_fp16_file))
interpreter_fp16.allocate_tensors()
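Note that only the weights are stored as float16; both interpreters still expose float32 input and output tensors, which you can confirm from their tensor details (a quick check, not in the original notebook):

# Quantization affects stored weights only; I/O tensors remain float32.
print(interpreter_fp16.get_input_details()[0]["dtype"])   # <class 'numpy.float32'>
print(interpreter_fp16.get_output_details()[0]["dtype"])  # <class 'numpy.float32'>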

Test the models on one image

test_image = np.expand_dims(test_images[0], axis=0).astype(np.float32)

input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

interpreter.set_tensor(input_index, test_image)
interpreter.invoke()
predictions = interpreter.get_tensor(output_index)
import matplotlib.pylab as plt

plt.imshow(test_images[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true= str(test_labels[0]),
                              predict=str(np.argmax(predictions[0]))))
plt.grid(False)

[Image: the first test digit rendered with its true and predicted labels]

test_image = np.expand_dims(test_images[0], axis=0).astype(np.float32)

input_index = interpreter_fp16.get_input_details()[0]["index"]
output_index = interpreter_fp16.get_output_details()[0]["index"]

interpreter_fp16.set_tensor(input_index, test_image)
interpreter_fp16.invoke()
predictions = interpreter_fp16.get_tensor(output_index)
plt.imshow(test_images[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true= str(test_labels[0]),
                              predict=str(np.argmax(predictions[0]))))
plt.grid(False)

[Image: the first test digit rendered with its true and predicted labels]

Evaluate the models

# A helper function to evaluate the TF Lite model using "test" dataset.
def evaluate_model(interpreter):
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]

  # Run predictions on every image in the "test" dataset.
  prediction_digits = []
  for test_image in test_images:
    # Pre-processing: add batch dimension and convert to float32 to match with
    # the model's input data format.
    test_image = np.expand_dims(test_image, axis=0).astype(np.float32)
    interpreter.set_tensor(input_index, test_image)

    # Run inference.
    interpreter.invoke()

    # Post-processing: remove batch dimension and find the digit with highest
    # probability.
    output = interpreter.tensor(output_index)
    digit = np.argmax(output()[0])
    prediction_digits.append(digit)

  # Compare prediction results with ground truth labels to calculate accuracy.
  accurate_count = 0
  for index in range(len(prediction_digits)):
    if prediction_digits[index] == test_labels[index]:
      accurate_count += 1
  accuracy = accurate_count * 1.0 / len(prediction_digits)

  return accuracy
print(evaluate_model(interpreter))
0.9654

Repeat the evaluation on the float16 quantized model to obtain:

# NOTE: Colab runs on server CPUs. At the time of writing this, TensorFlow Lite
# doesn't have super optimized server CPU kernels. For this reason this may be
# slower than the above float interpreter. But for mobile CPUs, considerable
# speedup can be observed.
print(evaluate_model(interpreter_fp16))
0.9654

In this example, you have quantized a model to float16 with no difference in the accuracy.
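Accuracy only compares the argmax of the outputs. To see how close the raw outputs themselves are, you can compare the logits of the two interpreters on a single image (a sketch reusing the interpreters loaded above):

# Run the same image through both interpreters and compare raw logits.
test_image = np.expand_dims(test_images[0], axis=0).astype(np.float32)

interpreter.set_tensor(interpreter.get_input_details()[0]["index"], test_image)
interpreter.invoke()
logits_fp32 = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])

interpreter_fp16.set_tensor(interpreter_fp16.get_input_details()[0]["index"], test_image)
interpreter_fp16.invoke()
logits_fp16 = interpreter_fp16.get_tensor(interpreter_fp16.get_output_details()[0]["index"])

# Any difference comes from float16 weight rounding, so it should be tiny.
print(np.max(np.abs(logits_fp32 - logits_fp16)))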

It is also possible to evaluate the fp16 quantized model on the GPU. To perform all arithmetic with the reduced precision values, be sure to create the TfLiteGpuDelegateOptions struct in your app and set precision_loss_allowed to 1, like this:

//Prepare GPU delegate.
const TfLiteGpuDelegateOptions options = {
  .metadata = NULL,
  .compile_options = {
    .precision_loss_allowed = 1,  // FP16
    .preferred_gl_object_type = TFLITE_GL_OBJECT_TYPE_FASTEST,
    .dynamic_batch_enabled = 0,   // Not fully functional yet
  },
};
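From Python, one way to attach a GPU delegate is tf.lite.experimental.load_delegate (a sketch; the shared-library name below is an assumption that depends on your platform and build, and precision options are configured per the delegate's own documentation rather than the C struct above):

# Hypothetical library name; build or obtain the GPU delegate for your platform.
gpu_delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")
interpreter_gpu = tf.lite.Interpreter(
    model_path=str(tflite_model_fp16_file),
    experimental_delegates=[gpu_delegate])
interpreter_gpu.allocate_tensors()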

Detailed documentation on the TFLite GPU delegate and how to use it in your application can be found here.