Join us at TensorFlow World, Oct 28-31. Use code TF20 for 20% off select passes. Register now

TensorBoard Profile: Profiling basic training metrics in Keras

View on TensorFlow.org Run in Google Colab View source on GitHub

Overview

Performance is critical for machine learning. TensorFlow has a built-in profiler that allows you to record runtime of each ops with very little effort. Then you can visualize the profile result in TensorBoard's Profile Plugin. This tutorial focuses on GPU but the Profile Plugin can also be used with TPUs by following the Cloud TPU Tools.

This tutorial presents very basic examples to help you learn how to enable profiler when developing your Keras model. You will learn how to use the Keras TensorBoard callback to visualize profile result. Profiler APIs and Profiler Server mentioned in “Other ways for profiling” allow you to profile non-Keras TensorFlow job.

Prerequisites

  • Install latest TensorBoard on your local machine.

  • Select “GPU” in the Accelerator drop-down in Notebook Settings (Assuming you run this notebook on Colab).

Notebook Settings

Setup

# Ensure latest TensorFlow is installed.
!pip install -q tf-nightly-gpu-2.0-preview
# Load the TensorBoard notebook extension.
%load_ext tensorboard
    100% |████████████████████████████████| 345.7MB 61kB/s 
    100% |████████████████████████████████| 3.1MB 6.6MB/s 
    100% |████████████████████████████████| 430kB 10.2MB/s 
    100% |████████████████████████████████| 61kB 29.0MB/s 
[?25h  Building wheel for wrapt (setup.py) ... [?25ldone
thinc 6.12.1 has requirement wrapt<1.11.0,>=1.10.0, but you'll have wrapt 1.11.1 which is incompatible.
[?25h
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from datetime import datetime
from packaging import version

import functools
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.python.keras import backend
from tensorflow.python.keras import layers

import numpy as np

print("TensorFlow version: ", tf.__version__)
TensorFlow version:  2.0.0-dev20190424

Confirm TensorFlow can see the GPU.

device_name = tf.test.gpu_device_name()
if not tf.test.is_gpu_available():
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
Found GPU at: /device:GPU:0

Run a simple model with TensorBoard callback

You're now going to use Keras to build a simple model for classifying CIFAR-10 images using ResNet56 (Reference: Deep Residual Learning for Image Recognition).

Following ResNet model code is copied from TensorFlow models garden.

BATCH_NORM_DECAY = 0.997
BATCH_NORM_EPSILON = 1e-5
L2_WEIGHT_DECAY = 2e-4


def identity_building_block(input_tensor,
                            kernel_size,
                            filters,
                            stage,
                            block,
                            training=None):
  """The identity block is the block that has no conv layer at shortcut.

  Arguments:
    input_tensor: input tensor
    kernel_size: default 3, the kernel size of
        middle conv layer at main path
    filters: list of integers, the filters of 3 conv layer at main path
    stage: integer, current stage label, used for generating layer names
    block: current block label, used for generating layer names
    training: Only used if training keras model with Estimator.  In other
      scenarios it is handled automatically.

  Returns:
    Output tensor for the block.
  """
  filters1, filters2 = filters
  if tf.keras.backend.image_data_format() == 'channels_last':
    bn_axis = 3
  else:
    bn_axis = 1
  conv_name_base = 'res' + str(stage) + block + '_branch'
  bn_name_base = 'bn' + str(stage) + block + '_branch'

  x = tf.keras.layers.Conv2D(filters1, kernel_size,
                             padding='same',
                             kernel_initializer='he_normal',
                             kernel_regularizer=
                             tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                             bias_regularizer=
                             tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                             name=conv_name_base + '2a')(input_tensor)
  x = tf.keras.layers.BatchNormalization(axis=bn_axis,
                                         name=bn_name_base + '2a',
                                         momentum=BATCH_NORM_DECAY,
                                         epsilon=BATCH_NORM_EPSILON)(
                                             x, training=training)
  x = tf.keras.layers.Activation('relu')(x)

  x = tf.keras.layers.Conv2D(filters2, kernel_size,
                             padding='same',
                             kernel_initializer='he_normal',
                             kernel_regularizer=
                             tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                             bias_regularizer=
                             tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                             name=conv_name_base + '2b')(x)
  x = tf.keras.layers.BatchNormalization(axis=bn_axis,
                                         name=bn_name_base + '2b',
                                         momentum=BATCH_NORM_DECAY,
                                         epsilon=BATCH_NORM_EPSILON)(
                                             x, training=training)

  x = tf.keras.layers.add([x, input_tensor])
  x = tf.keras.layers.Activation('relu')(x)
  return x


def conv_building_block(input_tensor,
                        kernel_size,
                        filters,
                        stage,
                        block,
                        strides=(2, 2),
                        training=None):
  """A block that has a conv layer at shortcut.

  Arguments:
    input_tensor: input tensor
    kernel_size: default 3, the kernel size of
        middle conv layer at main path
    filters: list of integers, the filters of 3 conv layer at main path
    stage: integer, current stage label, used for generating layer names
    block: current block label, used for generating layer names
    strides: Strides for the first conv layer in the block.
    training: Only used if training keras model with Estimator.  In other
      scenarios it is handled automatically.

  Returns:
    Output tensor for the block.

  Note that from stage 3,
  the first conv layer at main path is with strides=(2, 2)
  And the shortcut should have strides=(2, 2) as well
  """
  filters1, filters2 = filters
  if tf.keras.backend.image_data_format() == 'channels_last':
    bn_axis = 3
  else:
    bn_axis = 1
  conv_name_base = 'res' + str(stage) + block + '_branch'
  bn_name_base = 'bn' + str(stage) + block + '_branch'

  x = tf.keras.layers.Conv2D(filters1, kernel_size, strides=strides,
                             padding='same',
                             kernel_initializer='he_normal',
                             kernel_regularizer=
                             tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                             bias_regularizer=
                             tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                             name=conv_name_base + '2a')(input_tensor)
  x = tf.keras.layers.BatchNormalization(axis=bn_axis,
                                         name=bn_name_base + '2a',
                                         momentum=BATCH_NORM_DECAY,
                                         epsilon=BATCH_NORM_EPSILON)(
                                             x, training=training)
  x = tf.keras.layers.Activation('relu')(x)

  x = tf.keras.layers.Conv2D(filters2, kernel_size, padding='same',
                             kernel_initializer='he_normal',
                             kernel_regularizer=
                             tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                             bias_regularizer=
                             tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                             name=conv_name_base + '2b')(x)
  x = tf.keras.layers.BatchNormalization(axis=bn_axis,
                                         name=bn_name_base + '2b',
                                         momentum=BATCH_NORM_DECAY,
                                         epsilon=BATCH_NORM_EPSILON)(
                                             x, training=training)

  shortcut = tf.keras.layers.Conv2D(filters2, (1, 1), strides=strides,
                                    kernel_initializer='he_normal',
                                    kernel_regularizer=
                                    tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                                    bias_regularizer=
                                    tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                                    name=conv_name_base + '1')(input_tensor)
  shortcut = tf.keras.layers.BatchNormalization(
      axis=bn_axis, name=bn_name_base + '1',
      momentum=BATCH_NORM_DECAY, epsilon=BATCH_NORM_EPSILON)(
          shortcut, training=training)

  x = tf.keras.layers.add([x, shortcut])
  x = tf.keras.layers.Activation('relu')(x)
  return x


def resnet_block(input_tensor,
                 size,
                 kernel_size,
                 filters,
                 stage,
                 conv_strides=(2, 2),
                 training=None):
  """A block which applies conv followed by multiple identity blocks.

  Arguments:
    input_tensor: input tensor
    size: integer, number of constituent conv/identity building blocks.
    A conv block is applied once, followed by (size - 1) identity blocks.
    kernel_size: default 3, the kernel size of
        middle conv layer at main path
    filters: list of integers, the filters of 3 conv layer at main path
    stage: integer, current stage label, used for generating layer names
    conv_strides: Strides for the first conv layer in the block.
    training: Only used if training keras model with Estimator.  In other
      scenarios it is handled automatically.

  Returns:
    Output tensor after applying conv and identity blocks.
  """

  x = conv_building_block(input_tensor, kernel_size, filters, stage=stage,
                          strides=conv_strides, block='block_0',
                          training=training)
  for i in range(size - 1):
    x = identity_building_block(x, kernel_size, filters, stage=stage,
                                block='block_%d' % (i + 1), training=training)
  return x

def resnet(num_blocks, classes=10, training=None):
  """Instantiates the ResNet architecture.

  Arguments:
    num_blocks: integer, the number of conv/identity blocks in each block.
      The ResNet contains 3 blocks with each block containing one conv block
      followed by (layers_per_block - 1) number of idenity blocks. Each
      conv/idenity block has 2 convolutional layers. With the input
      convolutional layer and the pooling layer towards the end, this brings
      the total size of the network to (6*num_blocks + 2)
    classes: optional number of classes to classify images into
    training: Only used if training keras model with Estimator.  In other
    scenarios it is handled automatically.

  Returns:
    A Keras model instance.
  """

  input_shape = (32, 32, 3)
  img_input = layers.Input(shape=input_shape)

  if backend.image_data_format() == 'channels_first':
    x = layers.Lambda(lambda x: backend.permute_dimensions(x, (0, 3, 1, 2)),
                      name='transpose')(img_input)
    bn_axis = 1
  else:  # channel_last
    x = img_input
    bn_axis = 3

  x = tf.keras.layers.ZeroPadding2D(padding=(1, 1), name='conv1_pad')(x)
  x = tf.keras.layers.Conv2D(16, (3, 3),
                             strides=(1, 1),
                             padding='valid',
                             kernel_initializer='he_normal',
                             kernel_regularizer=
                             tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                             bias_regularizer=
                             tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                             name='conv1')(x)
  x = tf.keras.layers.BatchNormalization(axis=bn_axis, name='bn_conv1',
                                         momentum=BATCH_NORM_DECAY,
                                         epsilon=BATCH_NORM_EPSILON)(
                                             x, training=training)
  x = tf.keras.layers.Activation('relu')(x)

  x = resnet_block(x, size=num_blocks, kernel_size=3, filters=[16, 16],
                   stage=2, conv_strides=(1, 1), training=training)

  x = resnet_block(x, size=num_blocks, kernel_size=3, filters=[32, 32],
                   stage=3, conv_strides=(2, 2), training=training)

  x = resnet_block(x, size=num_blocks, kernel_size=3, filters=[64, 64],
                   stage=4, conv_strides=(2, 2), training=training)

  x = tf.keras.layers.GlobalAveragePooling2D(name='avg_pool')(x)
  x = tf.keras.layers.Dense(classes, activation='softmax',
                            kernel_initializer='he_normal',
                            kernel_regularizer=
                            tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                            bias_regularizer=
                            tf.keras.regularizers.l2(L2_WEIGHT_DECAY),
                            name='fc10')(x)

  inputs = img_input
  # Create model.
  model = tf.keras.models.Model(inputs, x, name='resnet56')

  return model


resnet20 = functools.partial(resnet, num_blocks=3)
resnet32 = functools.partial(resnet, num_blocks=5)
resnet56 = functools.partial(resnet, num_blocks=9)
resnet10 = functools.partial(resnet, num_blocks=110)

Download CIFAR-10 data from TensorFlow Datasets.

cifar_builder = tfds.builder('cifar10')
cifar_builder.download_and_prepare()

Build data input pipeline and compile ResNet56 model.

HEIGHT = 32
WIDTH = 32
NUM_CHANNELS = 3
NUM_CLASSES = 10
BATCH_SIZE = 128

def preprocess_data(record):
  image = record['image']
  label = record['label']
  
  # Resize the image to add four extra pixels on each side.
  image = tf.image.resize_with_crop_or_pad(
      image, HEIGHT + 8, WIDTH + 8)

  # Randomly crop a [HEIGHT, WIDTH] section of the image.
  image = tf.image.random_crop(image, [HEIGHT, WIDTH, NUM_CHANNELS])

  # Randomly flip the image horizontally.
  image = tf.image.random_flip_left_right(image)

  # Subtract off the mean and divide by the variance of the pixels.
  image = tf.image.per_image_standardization(image)
  
  label = tf.compat.v1.sparse_to_dense(label, (NUM_CLASSES,), 1)
  return image, label

train_data = cifar_builder.as_dataset(split=tfds.Split.TRAIN)
train_data = train_data.repeat()
train_data = train_data.map(
    lambda value: preprocess_data(value))
train_data = train_data.shuffle(1024)

train_data = train_data.batch(BATCH_SIZE)

model = resnet56(classes=NUM_CLASSES)

model.compile(optimizer='SGD',
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy'])

When creating TensorBoard callback, you can specify the batch num you want to profile. By default, TensorFlow will profile the second batch, because many one time graph optimizations run on the first batch. You can modify it by setting profile_batch. You can also turn off profiling by setting it to 0.

This time, you will profile on the third batch.

log_dir="logs/profile/" + datetime.now().strftime("%Y%m%d-%H%M%S")

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1, profile_batch = 3)

Start training by calling Model.fit().

model.fit(train_data,
          steps_per_epoch=20,
          epochs=5, 
          callbacks=[tensorboard_callback])
Epoch 1/5
 1/20 [>.............................] - ETA: 14:27 - loss: 5.4251 - categorical_accuracy: 0.0859
W0425 21:14:50.396199 140078590396288 callbacks.py:238] Method (on_train_batch_end) is slow compared to the batch update (0.317050). Check your callbacks.

 2/20 [==>...........................] - ETA: 6:58 - loss: 5.5955 - categorical_accuracy: 0.0781 
W0425 21:14:50.954807 140078590396288 callbacks.py:238] Method (on_train_batch_end) is slow compared to the batch update (0.268180). Check your callbacks.

 3/20 [===>..........................] - ETA: 4:26 - loss: 5.7003 - categorical_accuracy: 0.0911
W0425 21:14:51.180765 140078590396288 callbacks.py:238] Method (on_train_batch_end) is slow compared to the batch update (0.134130). Check your callbacks.

20/20 [==============================] - 51s 3s/step - loss: 5.3766 - categorical_accuracy: 0.1004
Epoch 2/5
20/20 [==============================] - 5s 227ms/step - loss: 4.8007 - categorical_accuracy: 0.0988
Epoch 3/5
20/20 [==============================] - 5s 242ms/step - loss: 4.3439 - categorical_accuracy: 0.0980
Epoch 4/5
20/20 [==============================] - 5s 247ms/step - loss: 3.9405 - categorical_accuracy: 0.1074
Epoch 5/5
20/20 [==============================] - 5s 225ms/step - loss: 3.6195 - categorical_accuracy: 0.1176

<tensorflow.python.keras.callbacks.History at 0x7f65f8758908>

Visualizing profile result using TensorBoard

Unfortunately, due to #1913, you cannot use TensorBoard in Colab to visualize profile result. You are going to download the logdir and start TensorBoard on your local machine.

Compress logdir:

!tar -zcvf logs.tar.gz logs/profile/

Download logdir.tar.gz by right-clicking it in “Files” tab.

Download

Please make sure you have the latest TensorBoard installed on you local machine as well. Execute following commands on your local machine:

> cd download/directory
> tar -zxvf logs.tar.gz
> tensorboard --logdir=logs/ --port=6006

Open a new tab in your Chrome browser and navigate to localhost:6006 and then click “Profile” tab. You may see the profile result like this:

Trace View

Trace Viewer

Once you click the profile tab, you will see Trace Viewer. The page displays a timeline of different events that happened on the CPU and the accelerator during the collection period.

The Trace Viewer shows multiple event groups on the vertical axis. Each event group has multiple horizontal tracks, filled with trace events. The track is basically an event timeline for events executed on a thread or a GPU stream. Individual events are the colored, rectangular blocks on the timeline tracks. Time moves from left to right.

You can navigate through the result using w (zoom in), s (zoom out), a (scroll left), d (scroll right).

A single rectangle represents a trace event: when it began, and when it ended. To study an individual rectangle, you can click on it after selecting the mouse cursor icon in the floating tool bar. This will display information about the rectangle, such as its Start time and Duration.

In addition to clicking, you can drag the mouse to select a rectangle covering a group of trace events. This will give you a list of events that intersect that rectangle and summarize them for you. The m key can be used to measure the time duration of the selected events.

List of Events

The trace events are collected from three sources:

  • CPU: CPU events are under event group named /host:CPU. Each track represents a thread on CPU. E.g. input pipeline events, GPU op scheduling events, CPU ops execution events, etc.
  • GPU: GPU events are under event groups prefixed by /device:GPU:. Except stream:all, each event group represents one stream on GPU. stream::all aggregates all events on one GPU. E.g. Memory copy events, Kernel execution events, etc.
  • TensorFlow Runtime: Runtime events are under event groups prefixed by /job:. Runtime events represent the TensorFlow ops invoked by python program. E.g. tf.function execution events, etc.

Debug Performance

Now, you're going to use the Trace Viewer to improve your model's performance.

Let's go back to the profile result you have just captured.

GPU kernel

GPU events show that GPU has nothing to do at all in the first harf of the step.

CPU events

CPU events show that CPU is occupied by data input pipeline in the beginning of this step.

Runtime

In TensorFlow runtime, there is a big block named Iterator::GetNextSync, which is a blocking call to get the next batch from data input pipeline. And it blocks the training step. So if you could prepare the input data for step s in s-1 step, you can probably train this model faster.

You can achieve it by using tf.data.prefetch.

train_data = cifar_builder.as_dataset(split=tfds.Split.TRAIN)
train_data = train_data.repeat()
train_data = train_data.map(
    lambda value: preprocess_data(value))
train_data = train_data.shuffle(1024)
train_data = train_data.batch(BATCH_SIZE)

# It will prefetch the data in (s-1) step
train_data = train_data.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

Re-run the model.

log_dir="logs/profile/" + datetime.now().strftime("%Y%m%d-%H%M%S")

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1, profile_batch = 3)

model.fit(train_data,
          steps_per_epoch=20,
          epochs=5, 
          callbacks=[tensorboard_callback])
Epoch 1/5
20/20 [==============================] - 5s 265ms/step - loss: 3.4081 - categorical_accuracy: 0.1055
Epoch 2/5
20/20 [==============================] - 4s 205ms/step - loss: 3.3122 - categorical_accuracy: 0.1141
Epoch 3/5
20/20 [==============================] - 4s 200ms/step - loss: 3.2795 - categorical_accuracy: 0.1199
Epoch 4/5
20/20 [==============================] - 4s 204ms/step - loss: 3.2237 - categorical_accuracy: 0.1469
Epoch 5/5
20/20 [==============================] - 4s 201ms/step - loss: 3.1888 - categorical_accuracy: 0.1465

<tensorflow.python.keras.callbacks.History at 0x7f65fbf87898>

Woohoo! You have just improvd training performance from ~235ms/step to ~200ms/step.

!tar -zcvf logs.tar.gz logs/profile/

Download logs directory again to see the new profile result in TensorBoard.

TF Runtime

The big Iterator::GetNextSync block is not there anymore.

Good job!

Apparently, this is still not the best performance yet. Please try by yourself to see if you can have further improvements.

Some useful references for performance tuning:

Other ways for profiling

In addition to TensorBoard callback, TensorFlow also provides two additional way to trigger profiler manually: Profiler APIs and Profiler Service.

Profiler APIs

# Context manager APIs
with tf.python.eager.profiler.Profiler('logdir_path'):
  # do your training here
  pass


# Function APIs
tf.python.eager.profiler.start()
# do your training here
profiler_result = tf.python.eager.profiler.stop()
tf.python.eager.profiler.save('logdir_path', profiler_result)

Profiler Service

# This API will start a gRPC server with your TensorFlow job which can receive
# on-demand profiling request.
tf.python.eager.profiler.start_profiler_server(6009)

# Your TensorFlow program here

Then you can send profiling request to profiler server to perform on-demand profiling on TensorBoard by clicking the “Capture Profile” button:

CAPTURE PROFILE

A message will show up after successfully captured. Then you can refresh TensorBoard to visualize the result.