훈련 후 float16 양자화

TensorFlow.org에서 보기 Google Colab에서 실행 GitHub에서 소스보기 노트북 다운로드

개요

TensorFlow Lite는 이제 TensorFlow에서 TensorFlow Lite의 flat buffer 형식으로 모델을 변환하는 동안 가중치를 16bit 부동 소수점 값으로 변환하는 것을 지원합니다. 그 결과 모델 크기가 2배 감소합니다. GPU와 같은 일부 하드웨어는 감소한 정밀도 산술로 기본적으로 계산할 수 있으므로 기존 부동 소수점 실행보다 속도가 향상됩니다. Tensorflow Lite GPU 대리자는 이러한 방식으로 실행되도록 구성될 수 있습니다. 그러나 float16 가중치로 변환된 모델은 추가 수정없이도 CPU에서 계속 실행될 수 있습니다. float16 가중치는 첫 번째 추론 이전에 float32로 업 샘플링됩니다. 이를 통해 지연 시간과 정확성에 미치는 영향을 최소화하는 대신 모델 크기를 크게 줄일 수 있습니다.

이 가이드에서는 MNIST 모델을 처음부터 훈련하고 TensorFlow에서 정확성을 확인한 다음 모델을 float16 양자화를 사용하여 Tensorflow Lite flatbuffer로 변환합니다. 마지막으로 변환된 모델의 정확성을 확인하고 원래 float32 모델과 비교합니다.

MNIST 모델 빌드하기

설정

import logging
logging.getLogger("tensorflow").setLevel(logging.DEBUG)

import tensorflow as tf
from tensorflow import keras
import numpy as np
import pathlib
tf.float16
tf.float16

모델 훈련 및 내보내기

# Load MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the input image so that each pixel value is between 0 to 1.
train_images = train_images / 255.0
test_images = test_images / 255.0

# Define the model architecture
model = keras.Sequential([
  keras.layers.InputLayer(input_shape=(28, 28)),
  keras.layers.Reshape(target_shape=(28, 28, 1)),
  keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation=tf.nn.relu),
  keras.layers.MaxPooling2D(pool_size=(2, 2)),
  keras.layers.Flatten(),
  keras.layers.Dense(10)
])

# Train the digit classification model
model.compile(optimizer='adam',
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(
  train_images,
  train_labels,
  epochs=1,
  validation_data=(test_images, test_labels)
)
2021-08-14 00:41:52.696609: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:41:52.705083: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:41:52.706048: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:41:52.708114: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-14 00:41:52.708701: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:41:52.709751: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:41:52.710762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:41:53.314825: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:41:53.315759: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:41:53.316608: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:41:53.317481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0
2021-08-14 00:41:54.233004: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-08-14 00:41:54.986445: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8100
2021-08-14 00:41:55.559248: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
1875/1875 [==============================] - 6s 2ms/step - loss: 0.3084 - accuracy: 0.9117 - val_loss: 0.1557 - val_accuracy: 0.9563
<keras.callbacks.History at 0x7f9a454117d0>

예를 들어, 단일 epoch에 대해서만 모델을 훈련시켰으므로 ~96% 정확성으로만 훈련합니다.

TensorFlow Lite 모델로 변환하기

이제 Python TFLiteConverter를 사용하여 훈련된 모델을 TensorFlow Lite 모델로 변환할 수 있습니다.

TFLiteConverter를 사용하여 모델을 로드합니다.

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
2021-08-14 00:42:00.376596: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
INFO:tensorflow:Assets written to: /tmp/tmp4j5dtmrx/assets
2021-08-14 00:42:00.770077: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:42:00.770442: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 1
2021-08-14 00:42:00.770539: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2021-08-14 00:42:00.770925: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:42:00.771290: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:42:00.771582: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:42:00.771945: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:42:00.772252: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:42:00.772497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0
2021-08-14 00:42:00.773941: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:1137] Optimization results for grappler item: graph_to_optimize
  function_optimizer: function_optimizer did nothing. time = 0.007ms.
  function_optimizer: function_optimizer did nothing. time = 0.001ms.

2021-08-14 00:42:00.806211: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:351] Ignored output_format.
2021-08-14 00:42:00.806243: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:354] Ignored drop_control_dependency.
2021-08-14 00:42:00.809423: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.

.tflite 파일에 작성합니다.

tflite_models_dir = pathlib.Path("/tmp/mnist_tflite_models/")
tflite_models_dir.mkdir(exist_ok=True, parents=True)
tflite_model_file = tflite_models_dir/"mnist_model.tflite"
tflite_model_file.write_bytes(tflite_model)
84500

대신 모델을 내보낼 때 float16으로 양자화하려면 먼저 기본 최적화를 사용하도록 optimizations 플래그를 지정합니다. 그런 다음 float16이 대상 플랫폼에서 지원되는 유형임을 지정합니다.

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

마지막으로 평소와 같이 모델을 변환합니다. 기본적으로 변환된 모델은 호출 편의를 위해 여전히 float 입력 및 출력을 사용합니다.

tflite_fp16_model = converter.convert()
tflite_model_fp16_file = tflite_models_dir/"mnist_model_quant_f16.tflite"
tflite_model_fp16_file.write_bytes(tflite_fp16_model)
INFO:tensorflow:Assets written to: /tmp/tmp7a_xe_qh/assets
INFO:tensorflow:Assets written to: /tmp/tmp7a_xe_qh/assets
2021-08-14 00:42:01.374135: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:42:01.374511: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 1
2021-08-14 00:42:01.374625: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2021-08-14 00:42:01.374962: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:42:01.375281: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:42:01.375559: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:42:01.375889: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:42:01.376162: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-08-14 00:42:01.376409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14648 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:05.0, compute capability: 7.0
2021-08-14 00:42:01.377933: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:1137] Optimization results for grappler item: graph_to_optimize
  function_optimizer: function_optimizer did nothing. time = 0.007ms.
  function_optimizer: function_optimizer did nothing. time = 0.001ms.

2021-08-14 00:42:01.410805: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:351] Ignored output_format.
2021-08-14 00:42:01.410837: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:354] Ignored drop_control_dependency.
44432

결과 파일이 약 1/2 크기인지 확인하세요.

ls -lh {tflite_models_dir}
total 128K
-rw-rw-r-- 1 kbuilder kbuilder 83K Aug 14 00:42 mnist_model.tflite
-rw-rw-r-- 1 kbuilder kbuilder 44K Aug 14 00:42 mnist_model_quant_f16.tflite

TensorFlow Lite 모델 실행하기

Python TensorFlow Lite 인터프리터를 사용하여 TensorFlow Lite 모델을 실행합니다.

인터프리터에 모델 로드하기

interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()
interpreter_fp16 = tf.lite.Interpreter(model_path=str(tflite_model_fp16_file))
interpreter_fp16.allocate_tensors()

하나의 이미지에서 모델 테스트하기

test_image = np.expand_dims(test_images[0], axis=0).astype(np.float32)

input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

interpreter.set_tensor(input_index, test_image)
interpreter.invoke()
predictions = interpreter.get_tensor(output_index)
import matplotlib.pylab as plt

plt.imshow(test_images[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true= str(test_labels[0]),
                              predict=str(np.argmax(predictions[0]))))
plt.grid(False)

png

test_image = np.expand_dims(test_images[0], axis=0).astype(np.float32)

input_index = interpreter_fp16.get_input_details()[0]["index"]
output_index = interpreter_fp16.get_output_details()[0]["index"]

interpreter_fp16.set_tensor(input_index, test_image)
interpreter_fp16.invoke()
predictions = interpreter_fp16.get_tensor(output_index)
plt.imshow(test_images[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true= str(test_labels[0]),
                              predict=str(np.argmax(predictions[0]))))
plt.grid(False)

png

모델 평가하기

# A helper function to evaluate the TF Lite model using "test" dataset.
def evaluate_model(interpreter):
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]

  # Run predictions on every image in the "test" dataset.
  prediction_digits = []
  for test_image in test_images:
    # Pre-processing: add batch dimension and convert to float32 to match with
    # the model's input data format.
    test_image = np.expand_dims(test_image, axis=0).astype(np.float32)
    interpreter.set_tensor(input_index, test_image)

    # Run inference.
    interpreter.invoke()

    # Post-processing: remove batch dimension and find the digit with highest
    # probability.
    output = interpreter.tensor(output_index)
    digit = np.argmax(output()[0])
    prediction_digits.append(digit)

  # Compare prediction results with ground truth labels to calculate accuracy.
  accurate_count = 0
  for index in range(len(prediction_digits)):
    if prediction_digits[index] == test_labels[index]:
      accurate_count += 1
  accuracy = accurate_count * 1.0 / len(prediction_digits)

  return accuracy
print(evaluate_model(interpreter))
0.9563

float16 양자화된 모델에 대한 평가를 반복하여 다음을 얻습니다.

# NOTE: Colab runs on server CPUs. At the time of writing this, TensorFlow Lite
# doesn't have super optimized server CPU kernels. For this reason this may be
# slower than the above float interpreter. But for mobile CPUs, considerable
# speedup can be observed.
print(evaluate_model(interpreter_fp16))
0.9563

이 예에서는 정확성 차이가 없이 모델을 float16으로 양자화했습니다.

GPU에서 fp16 양자화 모델을 평가하는 것도 가능합니다. 감소된 정밀도 값으로 모든 산술을 수행하려면 다음과 같이 앱에서 TfLiteGPUDelegateOptions 구조체를 만들고 precision_loss_allowed1로 설정해야 합니다.

//Prepare GPU delegate. const TfLiteGpuDelegateOptions options = {   .metadata = NULL,   .compile_options = {     .precision_loss_allowed = 1,  // FP16     .preferred_gl_object_type = TFLITE_GL_OBJECT_TYPE_FASTEST,     .dynamic_batch_enabled = 0,   // Not fully functional yet   }, };

TFLite GPU 대리자 및 애플리케이션에서 사용하는 방법에 대한 자세한 설명서는 여기에서 찾을 수 있습니다