Pruning for on-device inference with XNNPACK

Welcome to the guide on Keras weights pruning for improving the latency of on-device inference via XNNPACK.

This guide presents the usage of the newly introduced tfmot.sparsity.keras.PruningPolicy API and demonstrates how it can be used to accelerate mostly convolutional models on modern CPUs using XNNPACK sparse inference.

The guide covers the following steps of the model creation process:

  • Build and train the dense baseline
  • Fine-tune the model with pruning
  • Convert to TFLite
  • Benchmark on device

This guide does not cover best practices for fine-tuning with pruning. For more detailed information on that topic, please see our comprehensive guide.

Setup

 pip install -q tensorflow
 pip install -q tensorflow-model-optimization
import tempfile

import tensorflow as tf
import numpy as np

from tensorflow import keras
import tensorflow_datasets as tfds
import tensorflow_model_optimization as tfmot

%load_ext tensorboard

Build and train the dense model

We build and train a simple baseline CNN for a classification task on the CIFAR10 dataset.

# Load CIFAR10 dataset.
(ds_train, ds_val, ds_test), ds_info = tfds.load(
    'cifar10',
    split=['train[:90%]', 'train[90%:]', 'test'],
    as_supervised=True,
    with_info=True,
)

# Normalize the input image so that each pixel value is between 0 and 1.
def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.image.convert_image_dtype(image, tf.float32), label

# Load the data in batches of 128 images.
batch_size = 128
def prepare_dataset(ds, buffer_size=None):
  ds = ds.map(normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
  ds = ds.cache()
  if buffer_size:
    ds = ds.shuffle(buffer_size)
  ds = ds.batch(batch_size)
  ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
  return ds

ds_train = prepare_dataset(ds_train,
                           buffer_size=ds_info.splits['train'].num_examples)
ds_val = prepare_dataset(ds_val)
ds_test = prepare_dataset(ds_test)

# Build the dense baseline model.
dense_model = keras.Sequential([
    keras.layers.InputLayer(input_shape=(32, 32, 3)),
    keras.layers.ZeroPadding2D(padding=1),
    keras.layers.Conv2D(
        filters=8,
        kernel_size=(3, 3),
        strides=(2, 2),
        padding='valid'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.DepthwiseConv2D(kernel_size=(3, 3), padding='same'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.Conv2D(filters=16, kernel_size=(1, 1)),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.ZeroPadding2D(padding=1),
    keras.layers.DepthwiseConv2D(
        kernel_size=(3, 3), strides=(2, 2), padding='valid'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.Conv2D(filters=32, kernel_size=(1, 1)),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10)
])

# Compile and train the dense model for 10 epochs.
dense_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

dense_model.fit(
  ds_train,
  epochs=10,
  validation_data=ds_val)

# Evaluate the dense model.
_, dense_model_accuracy = dense_model.evaluate(ds_test, verbose=0)
Epoch 1/10
352/352 [==============================] - 12s 21ms/step - loss: 1.9929 - accuracy: 0.2651 - val_loss: 2.5594 - val_accuracy: 0.1466
Epoch 2/10
352/352 [==============================] - 7s 19ms/step - loss: 1.7293 - accuracy: 0.3582 - val_loss: 1.7533 - val_accuracy: 0.3414
Epoch 3/10
352/352 [==============================] - 7s 19ms/step - loss: 1.6531 - accuracy: 0.3849 - val_loss: 1.6463 - val_accuracy: 0.3886
Epoch 4/10
352/352 [==============================] - 7s 19ms/step - loss: 1.6073 - accuracy: 0.4024 - val_loss: 1.6127 - val_accuracy: 0.3980
Epoch 5/10
352/352 [==============================] - 7s 19ms/step - loss: 1.5692 - accuracy: 0.4200 - val_loss: 1.5552 - val_accuracy: 0.4228
Epoch 6/10
352/352 [==============================] - 7s 19ms/step - loss: 1.5358 - accuracy: 0.4344 - val_loss: 1.6375 - val_accuracy: 0.4030
Epoch 7/10
352/352 [==============================] - 7s 19ms/step - loss: 1.5074 - accuracy: 0.4475 - val_loss: 1.5514 - val_accuracy: 0.4258
Epoch 8/10
352/352 [==============================] - 7s 19ms/step - loss: 1.4810 - accuracy: 0.4598 - val_loss: 1.7087 - val_accuracy: 0.3866
Epoch 9/10
352/352 [==============================] - 7s 19ms/step - loss: 1.4610 - accuracy: 0.4669 - val_loss: 1.5219 - val_accuracy: 0.4492
Epoch 10/10
352/352 [==============================] - 7s 19ms/step - loss: 1.4445 - accuracy: 0.4748 - val_loss: 1.5329 - val_accuracy: 0.4302

Build the sparse model

Using the instructions from the comprehensive guide, we apply the tfmot.sparsity.keras.prune_low_magnitude function with parameters that target on-device acceleration via pruning, i.e. the tfmot.sparsity.keras.PruneForLatencyOnXNNPack policy.

prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

# Compute end step to finish pruning after 5 epochs.
end_epoch = 5

num_iterations_per_epoch = len(ds_train)
end_step = num_iterations_per_epoch * end_epoch

# Define parameters for pruning.
pruning_params = {
      'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.25,
                                                               final_sparsity=0.75,
                                                               begin_step=0,
                                                               end_step=end_step),
      'pruning_policy': tfmot.sparsity.keras.PruneForLatencyOnXNNPack()
}

# Try to apply pruning wrapper with pruning policy parameter.
try:
  model_for_pruning = prune_low_magnitude(dense_model, **pruning_params)
except ValueError as e:
  print(e)
Could not find a `GlobalAveragePooling2D` layer with `keepdims = True` in all output branches

The call to prune_low_magnitude raises a ValueError with the message Could not find a `GlobalAveragePooling2D` layer with `keepdims = True` in all output branches. The message indicates that the model is not supported for pruning with the tfmot.sparsity.keras.PruneForLatencyOnXNNPack policy; specifically, the GlobalAveragePooling2D layer requires the parameter keepdims = True. Let's fix that and reapply the prune_low_magnitude function.

fixed_dense_model = keras.Sequential([
    keras.layers.InputLayer(input_shape=(32, 32, 3)),
    keras.layers.ZeroPadding2D(padding=1),
    keras.layers.Conv2D(
        filters=8,
        kernel_size=(3, 3),
        strides=(2, 2),
        padding='valid'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.DepthwiseConv2D(kernel_size=(3, 3), padding='same'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.Conv2D(filters=16, kernel_size=(1, 1)),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.ZeroPadding2D(padding=1),
    keras.layers.DepthwiseConv2D(
        kernel_size=(3, 3), strides=(2, 2), padding='valid'),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.Conv2D(filters=32, kernel_size=(1, 1)),
    keras.layers.BatchNormalization(),
    keras.layers.ReLU(),
    keras.layers.GlobalAveragePooling2D(keepdims=True),
    keras.layers.Flatten(),
    keras.layers.Dense(10)
])

# Use the pretrained model for pruning instead of training from scratch.
fixed_dense_model.set_weights(dense_model.get_weights())

# Try to reapply pruning wrapper.
model_for_pruning = prune_low_magnitude(fixed_dense_model, **pruning_params)
/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/keras/engine/base_layer.py:2223: UserWarning: `layer.add_variable` is deprecated and will be removed in a future version. Please use `layer.add_weight` method instead.
  warnings.warn('`layer.add_variable` is deprecated and '

The invocation of prune_low_magnitude finished without errors, meaning that the model is fully supported by the tfmot.sparsity.keras.PruneForLatencyOnXNNPack policy and can be accelerated using XNNPACK sparse inference.
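
As an optional sanity check (a sketch, not part of the original flow), we can look at which layers ended up wrapped for pruning. With a PruningPolicy, only the layers the policy allows are expected to carry the PruneLowMagnitude wrapper; here we simply test the wrapper class by name.

# Optional: list which layers were wrapped for pruning by the policy.
for layer in model_for_pruning.layers:
  wrapped = layer.__class__.__name__ == 'PruneLowMagnitude'
  print(layer.name, '-> pruned' if wrapped else '-> kept dense')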

Fine-tune the sparse model

Following the pruning example, we fine-tune the sparse model using the weights of the dense model. We start fine-tuning at 25% sparsity (25% of the weights are set to zero) and end at 75% sparsity.

logdir = tempfile.mkdtemp()

callbacks = [
  tfmot.sparsity.keras.UpdatePruningStep(),
  tfmot.sparsity.keras.PruningSummaries(log_dir=logdir),
]

model_for_pruning.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

model_for_pruning.fit(
  ds_train,
  epochs=15,
  validation_data=ds_val,
  callbacks=callbacks)

# Evaluate the pruned model.
_, pruned_model_accuracy = model_for_pruning.evaluate(ds_test, verbose=0)

print('Dense model test accuracy:', dense_model_accuracy)
print('Pruned model test accuracy:', pruned_model_accuracy)
Epoch 1/15
352/352 [==============================] - 9s 20ms/step - loss: 1.4474 - accuracy: 0.4732 - val_loss: 1.5224 - val_accuracy: 0.4368
Epoch 2/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4763 - accuracy: 0.4601 - val_loss: 1.9179 - val_accuracy: 0.3514
Epoch 3/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4861 - accuracy: 0.4602 - val_loss: 1.5849 - val_accuracy: 0.4100
Epoch 4/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4838 - accuracy: 0.4614 - val_loss: 1.5123 - val_accuracy: 0.4412
Epoch 5/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4669 - accuracy: 0.4696 - val_loss: 1.7005 - val_accuracy: 0.3620
Epoch 6/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4497 - accuracy: 0.4772 - val_loss: 1.4644 - val_accuracy: 0.4576
Epoch 7/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4397 - accuracy: 0.4799 - val_loss: 1.4532 - val_accuracy: 0.4710
Epoch 8/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4307 - accuracy: 0.4844 - val_loss: 2.0308 - val_accuracy: 0.3674
Epoch 9/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4254 - accuracy: 0.4849 - val_loss: 1.6031 - val_accuracy: 0.4180
Epoch 10/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4200 - accuracy: 0.4834 - val_loss: 1.8140 - val_accuracy: 0.3768
Epoch 11/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4132 - accuracy: 0.4892 - val_loss: 1.4289 - val_accuracy: 0.4810
Epoch 12/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4075 - accuracy: 0.4915 - val_loss: 1.4257 - val_accuracy: 0.4734
Epoch 13/15
352/352 [==============================] - 7s 19ms/step - loss: 1.4032 - accuracy: 0.4922 - val_loss: 1.4693 - val_accuracy: 0.4620
Epoch 14/15
352/352 [==============================] - 7s 19ms/step - loss: 1.3992 - accuracy: 0.4950 - val_loss: 1.3901 - val_accuracy: 0.4860
Epoch 15/15
352/352 [==============================] - 7s 19ms/step - loss: 1.3957 - accuracy: 0.4952 - val_loss: 1.4754 - val_accuracy: 0.4620
Dense model test accuracy: 0.43209999799728394
Pruned model test accuracy: 0.4596000015735626

The logs show the progression of sparsity on a per-layer basis.

%tensorboard --logdir={logdir}
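
Besides TensorBoard, we can verify the final per-layer sparsity directly from the weights. The minimal sketch below (not part of the original notebook) strips the pruning wrappers and counts zero entries in each kernel; the layers selected by the policy should approach the 75% final sparsity, while the other layers stay dense.

# Optional: measure the achieved sparsity of each kernel after fine-tuning.
stripped_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
for layer in stripped_model.layers:
  for weight in layer.weights:
    if 'kernel' in weight.name:
      values = weight.numpy()
      sparsity = 1.0 - np.count_nonzero(values) / values.size
      print(f'{weight.name}: {sparsity:.2%} zeros')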

After fine-tuning with pruning, the test accuracy shows a modest improvement (43% to 46%) compared to the dense model. Let's compare the on-device latency using the TFLite benchmark tool.

Model conversion and benchmarking

To convert the pruned model into TFLite, we need to replace the PruneLowMagnitude wrappers with the original layers via the strip_pruning function. Also, since the weights of the pruned model (model_for_pruning) are mostly zeros, we may apply the tf.lite.Optimize.EXPERIMENTAL_SPARSITY optimization to store the resulting TFLite model efficiently. This optimization flag is not required for the dense model.

converter = tf.lite.TFLiteConverter.from_keras_model(dense_model)
dense_tflite_model = converter.convert()

_, dense_tflite_file = tempfile.mkstemp('.tflite')
with open(dense_tflite_file, 'wb') as f:
  f.write(dense_tflite_model)

model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

converter = tf.lite.TFLiteConverter.from_keras_model(model_for_export)
converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
pruned_tflite_model = converter.convert()

_, pruned_tflite_file = tempfile.mkstemp('.tflite')
with open(pruned_tflite_file, 'wb') as f:
  f.write(pruned_tflite_model)
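
As a quick side check (an optional sketch, not required by the workflow), we can compare the size of the two files on disk; with EXPERIMENTAL_SPARSITY the pruned model's mostly-zero weights are stored in a compact sparse format, so it is expected to be noticeably smaller. The exact numbers depend on the achieved sparsity.

import os

print('Dense TFLite model size: %.2f KB' % (os.path.getsize(dense_tflite_file) / 1024))
print('Pruned TFLite model size: %.2f KB' % (os.path.getsize(pruned_tflite_file) / 1024))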

Following the instructions from the TFLite Model Benchmarking Tool, we build the tool, upload it to an Android device together with the dense and pruned TFLite models, and benchmark both models on the device.

! adb shell /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/dense_model.tflite \
    --use_xnnpack=true \
    --num_runs=100 \
    --num_threads=1
/bin/bash: adb: command not found
! adb shell /data/local/tmp/benchmark_model \
    --graph=/data/local/tmp/pruned_model.tflite \
    --use_xnnpack=true \
    --num_runs=100 \
    --num_threads=1
/bin/bash: adb: command not found
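
If no Android device is attached (as in this environment, where adb is unavailable), a rough host-side comparison can be made with the TFLite Python interpreter. This is only a sketch under the assumption that host-CPU timings are informative; the Python interpreter may not use the same XNNPACK configuration (or its sparse kernels) as the benchmark tool above, which remains the reference measurement.

import time

def measure_latency(tflite_file, num_runs=100):
  # Host-CPU approximation only, with a single thread and a dummy input.
  interpreter = tf.lite.Interpreter(model_path=tflite_file, num_threads=1)
  interpreter.allocate_tensors()
  input_details = interpreter.get_input_details()[0]
  dummy = np.random.rand(*input_details['shape']).astype(np.float32)
  interpreter.set_tensor(input_details['index'], dummy)
  start = time.perf_counter()
  for _ in range(num_runs):
    interpreter.invoke()
  return (time.perf_counter() - start) / num_runs * 1e6  # microseconds

print('Dense model (approx.): %.1f us/inference' % measure_latency(dense_tflite_file))
print('Pruned model (approx.): %.1f us/inference' % measure_latency(pruned_tflite_file))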

Benchmarking on a Pixel 4 resulted in an average inference time of 17us for the dense model and 12us for the pruned model. The on-device benchmark demonstrates a clear 5us, or 30%, improvement in latency even for such a small model. In our experience, larger models based on MobileNetV3 or EfficientNet-lite show similar performance improvements. The speed-up varies based on the relative contribution of 1x1 convolutions to the overall model.
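
To get a feel for that contribution in the toy model above, one rough proxy (parameter count rather than FLOPs) is the share of weights held by the 1x1 convolutions. The snippet below is a hypothetical sketch along those lines, not part of the benchmark flow.

# Rough proxy for the potential speed-up: share of weights in 1x1 Conv2D
# layers, the layers that XNNPACK sparse inference accelerates.
total_params = dense_model.count_params()
conv1x1_params = sum(
    layer.count_params() for layer in dense_model.layers
    if isinstance(layer, keras.layers.Conv2D)
    and getattr(layer, 'kernel_size', None) == (1, 1))
print('Weights in 1x1 convolutions: %.1f%%' % (100.0 * conv1x1_params / total_params))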

Conclusions

In this tutorial, we showed how to create sparse models for faster on-device performance using the new functionality introduced by the TF MOT API and XNNPACK. These sparse models are smaller and faster than their dense counterparts while retaining or even surpassing their quality.

We encourage you to try this new capability, which can be particularly important for deploying your models on device.