# Mixed precision

Stay organized with collections Save and categorize content based on your preferences.

## Overview

Mixed precision is the use of both 16-bit and 32-bit floating-point types in a model during training to make it run faster and use less memory. By keeping certain parts of the model in the 32-bit types for numeric stability, the model will have a lower step time and train equally as well in terms of the evaluation metrics such as accuracy. This guide describes how to use the Keras mixed precision API to speed up your models. Using this API can improve performance by more than 3 times on modern GPUs and 60% on TPUs.

Today, most models use the float32 dtype, which takes 32 bits of memory. However, there are two lower-precision dtypes, float16 and bfloat16, each which take 16 bits of memory instead. Modern accelerators can run operations faster in the 16-bit dtypes, as they have specialized hardware to run 16-bit computations and 16-bit dtypes can be read from memory faster.

NVIDIA GPUs can run operations in float16 faster than in float32, and TPUs can run operations in bfloat16 faster than float32. Therefore, these lower-precision dtypes should be used whenever possible on those devices. However, variables and a few computations should still be in float32 for numeric reasons so that the model trains to the same quality. The Keras mixed precision API allows you to use a mix of either float16 or bfloat16 with float32, to get the performance benefits from float16/bfloat16 and the numeric stability benefits from float32.

## Setup

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import mixed_precision
2022-12-14 03:56:07.992471: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2022-12-14 03:56:07.992568: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2022-12-14 03:56:07.992577: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

## Supported hardware

While mixed precision will run on most hardware, it will only speed up models on recent NVIDIA GPUs and Cloud TPUs. NVIDIA GPUs support using a mix of float16 and float32, while TPUs support a mix of bfloat16 and float32.

Among NVIDIA GPUs, those with compute capability 7.0 or higher will see the greatest performance benefit from mixed precision because they have special hardware units, called Tensor Cores, to accelerate float16 matrix multiplications and convolutions. Older GPUs offer no math performance benefit for using mixed precision, however memory and bandwidth savings can enable some speedups. You can look up the compute capability for your GPU at NVIDIA's CUDA GPU web page. Examples of GPUs that will benefit most from mixed precision include RTX GPUs, the V100, and the A100.

You can check your GPU type with the following. The command only exists if the NVIDIA drivers are installed, so the following will raise an error otherwise.

nvidia-smi -L
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-4085a382-7ba9-6217-eec0-69c9d972f8b8)
GPU 1: Tesla P100-PCIE-16GB (UUID: GPU-74a03c5f-68af-a74f-7fd8-9d41f665cb2c)
GPU 2: Tesla P100-PCIE-16GB (UUID: GPU-c821432d-533b-aa21-8a59-6c213fe0e44d)
GPU 3: Tesla P100-PCIE-16GB (UUID: GPU-50cba888-7483-234a-525f-5feaa2f49edd)

All Cloud TPUs support bfloat16.

Even on CPUs and older GPUs, where no speedup is expected, mixed precision APIs can still be used for unit testing, debugging, or just to try out the API. On CPUs, mixed precision will run significantly slower, however.

## Setting the dtype policy

To use mixed precision in Keras, you need to create a tf.keras.mixed_precision.Policy, typically referred to as a dtype policy. Dtype policies specify the dtypes layers will run in. In this guide, you will construct a policy from the string 'mixed_float16' and set it as the global policy. This will cause subsequently created layers to use mixed precision with a mix of float16 and float32.

policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)
WARNING:tensorflow:Mixed precision compatibility check (mixed_float16): WARNING
Your GPUs may run slowly with dtype policy mixed_float16 because they do not have compute capability of at least 7.0. Your GPUs:
Tesla P100-PCIE-16GB, compute capability 6.0 (x4)
See https://developer.nvidia.com/cuda-gpus for a list of GPUs and their compute capabilities.
If you will use compatible GPU(s) not attached to this host, e.g. by running a multi-worker model, you can ignore this warning. This message will only be logged once

For short, you can directly pass a string to set_global_policy, which is typically done in practice.

# Equivalent to the two lines above
mixed_precision.set_global_policy('mixed_float16')

The policy specifies two important aspects of a layer: the dtype the layer's computations are done in, and the dtype of a layer's variables. Above, you created a mixed_float16 policy (i.e., a mixed_precision.Policy created by passing the string 'mixed_float16' to its constructor). With this policy, layers use float16 computations and float32 variables. Computations are done in float16 for performance, but variables must be kept in float32 for numeric stability. You can directly query these properties of the policy.

print('Compute dtype: %s' % policy.compute_dtype)
print('Variable dtype: %s' % policy.variable_dtype)
Compute dtype: float16
Variable dtype: float32

As mentioned before, the mixed_float16 policy will most significantly improve performance on NVIDIA GPUs with compute capability of at least 7.0. The policy will run on other GPUs and CPUs but may not improve performance. For TPUs, the mixed_bfloat16 policy should be used instead.

## Building the model

Next, let's start building a simple model. Very small toy models typically do not benefit from mixed precision, because overhead from the TensorFlow runtime typically dominates the execution time, making any performance improvement on the GPU negligible. Therefore, let's build two large Dense layers with 4096 units each if a GPU is used.

inputs = keras.Input(shape=(784,), name='digits')
if tf.config.list_physical_devices('GPU'):
print('The model will run with 4096 units on a GPU')
num_units = 4096
else:
# Use fewer units on CPUs so the model finishes in a reasonable amount of time
print('The model will run with 64 units on a CPU')
num_units = 64
dense1 = layers.Dense(num_units, activation='relu', name='dense_1')
x = dense1(inputs)
dense2 = layers.Dense(num_units, activation='relu', name='dense_2')
x = dense2(x)
The model will run with 4096 units on a GPU

Each layer has a policy and uses the global policy by default. Each of the Dense layers therefore have the mixed_float16 policy because you set the global policy to mixed_float16 previously. This will cause the dense layers to do float16 computations and have float32 variables. They cast their inputs to float16 in order to do float16 computations, which causes their outputs to be float16 as a result. Their variables are float32 and will be cast to float16 when the layers are called to avoid errors from dtype mismatches.

print(dense1.dtype_policy)
print('x.dtype: %s' % x.dtype.name)
# 'kernel' is dense1's variable
print('dense1.kernel.dtype: %s' % dense1.kernel.dtype.name)
<Policy "mixed_float16">
x.dtype: float16
dense1.kernel.dtype: float32

Next, create the output predictions. Normally, you can create the output predictions as follows, but this is not always numerically stable with float16.

# INCORRECT: softmax and model output will be float16, when it should be float32
outputs = layers.Dense(10, activation='softmax', name='predictions')(x)
print('Outputs dtype: %s' % outputs.dtype.name)
Outputs dtype: float16

A softmax activation at the end of the model should be float32. Because the dtype policy is mixed_float16, the softmax activation would normally have a float16 compute dtype and output float16 tensors.

This can be fixed by separating the Dense and softmax layers, and by passing dtype='float32' to the softmax layer:

# CORRECT: softmax and model output are float32
x = layers.Dense(10, name='dense_logits')(x)
outputs = layers.Activation('softmax', dtype='float32', name='predictions')(x)
print('Outputs dtype: %s' % outputs.dtype.name)
Outputs dtype: float32

Passing dtype='float32' to the softmax layer constructor overrides the layer's dtype policy to be the float32 policy, which does computations and keeps variables in float32. Equivalently, you could have instead passed dtype=mixed_precision.Policy('float32'); layers always convert the dtype argument to a policy. Because the Activation layer has no variables, the policy's variable dtype is ignored, but the policy's compute dtype of float32 causes softmax and the model output to be float32.

Adding a float16 softmax in the middle of a model is fine, but a softmax at the end of the model should be in float32. The reason is that if the intermediate tensor flowing from the softmax to the loss is float16 or bfloat16, numeric issues may occur.

You can override the dtype of any layer to be float32 by passing dtype='float32' if you think it will not be numerically stable with float16 computations. But typically, this is only necessary on the last layer of the model, as most layers have sufficient precision with mixed_float16 and mixed_bfloat16.

Even if the model does not end in a softmax, the outputs should still be float32. While unnecessary for this specific model, the model outputs can be cast to float32 with the following:

# The linear activation is an identity function. So this simply casts 'outputs'
# to float32. In this particular case, 'outputs' is already float32 so this is a
# no-op.
outputs = layers.Activation('linear', dtype='float32')(outputs)

Next, finish and compile the model, and generate input data:

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(loss='sparse_categorical_crossentropy',
optimizer=keras.optimizers.RMSprop(),
metrics=['accuracy'])

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255

This example casts the input data from int8 to float32. You don't cast to float16 since the division by 255 is on the CPU, which runs float16 operations slower than float32 operations. In this case, the performance difference is negligible, but in general you should run input processing math in float32 if it runs on the CPU. The first layer of the model will cast the inputs to float16, as each layer casts floating-point inputs to its compute dtype.

The initial weights of the model are retrieved. This will allow training from scratch again by loading the weights.

initial_weights = model.get_weights()

## Training the model with Model.fit

Next, train the model:

history = model.fit(x_train, y_train,
batch_size=8192,
epochs=5,
validation_split=0.2)
test_scores = model.evaluate(x_test, y_test, verbose=2)
print('Test loss:', test_scores[0])
print('Test accuracy:', test_scores[1])
Epoch 1/5
6/6 [==============================] - 2s 202ms/step - loss: 2.6197 - accuracy: 0.3875 - val_loss: 0.7951 - val_accuracy: 0.8290
Epoch 2/5
6/6 [==============================] - 1s 156ms/step - loss: 0.9418 - accuracy: 0.7210 - val_loss: 0.4762 - val_accuracy: 0.8877
Epoch 3/5
6/6 [==============================] - 1s 155ms/step - loss: 0.4998 - accuracy: 0.8549 - val_loss: 0.7045 - val_accuracy: 0.7447
Epoch 4/5
6/6 [==============================] - 1s 156ms/step - loss: 0.4647 - accuracy: 0.8544 - val_loss: 0.2608 - val_accuracy: 0.9250
Epoch 5/5
6/6 [==============================] - 1s 156ms/step - loss: 0.3179 - accuracy: 0.9022 - val_loss: 0.3698 - val_accuracy: 0.8745
313/313 - 1s - loss: 0.3833 - accuracy: 0.8649 - 644ms/epoch - 2ms/step
Test loss: 0.38325420022010803
Test accuracy: 0.8648999929428101

Notice the model prints the time per step in the logs: for example, "25ms/step". The first epoch may be slower as TensorFlow spends some time optimizing the model, but afterwards the time per step should stabilize.

If you are running this guide in Colab, you can compare the performance of mixed precision with float32. To do so, change the policy from mixed_float16 to float32 in the "Setting the dtype policy" section, then rerun all the cells up to this point. On GPUs with compute capability 7.X, you should see the time per step significantly increase, indicating mixed precision sped up the model. Make sure to change the policy back to mixed_float16 and rerun the cells before continuing with the guide.

On GPUs with compute capability of at least 8.0 (Ampere GPUs and above), you likely will see no performance improvement in the toy model in this guide when using mixed precision compared to float32. This is due to the use of TensorFloat-32, which automatically uses lower precision math in certain float32 ops such as tf.linalg.matmul. TensorFloat-32 gives some of the performance advantages of mixed precision when using float32. However, in real-world models, you will still typically experience significant performance improvements from mixed precision due to memory bandwidth savings and ops which TensorFloat-32 does not support.

If running mixed precision on a TPU, you will not see as much of a performance gain compared to running mixed precision on GPUs, especially pre-Ampere GPUs. This is because TPUs do certain ops in bfloat16 under the hood even with the default dtype policy of float32. This is similar to how Ampere GPUs use TensorFloat-32 by default. Compared to Ampere GPUs, TPUs typically see less performance gains with mixed precision on real-world models.

For many real-world models, mixed precision also allows you to double the batch size without running out of memory, as float16 tensors take half the memory. This does not apply however to this toy model, as you can likely run the model in any dtype where each batch consists of the entire MNIST dataset of 60,000 images.

## Loss scaling

Loss scaling is a technique which tf.keras.Model.fit automatically performs with the mixed_float16 policy to avoid numeric underflow. This section describes what loss scaling is and the next section describes how to use it with a custom training loop.

### Underflow and Overflow

The float16 data type has a narrow dynamic range compared to float32. This means values above $$65504$$ will overflow to infinity and values below $$6.0 \times 10^{-8}$$ will underflow to zero. float32 and bfloat16 have a much higher dynamic range so that overflow and underflow are not a problem.

For example:

x = tf.constant(256, dtype='float16')
(x ** 2).numpy()  # Overflow
inf
x = tf.constant(1e-5, dtype='float16')
(x ** 2).numpy()  # Underflow
0.0

In practice, overflow with float16 rarely occurs. Additionally, underflow also rarely occurs during the forward pass. However, during the backward pass, gradients can underflow to zero. Loss scaling is a technique to prevent this underflow.

### Loss scaling overview

The basic concept of loss scaling is simple: simply multiply the loss by some large number, say $$1024$$, and you get the loss scale value. This will cause the gradients to scale by $$1024$$ as well, greatly reducing the chance of underflow. Once the final gradients are computed, divide them by $$1024$$ to bring them back to their correct values.

The pseudocode for this process is:

loss_scale = 1024
loss = model(inputs)
loss *= loss_scale
# Assume grads are float32. You do not want to divide float16 gradients.

Choosing a loss scale can be tricky. If the loss scale is too low, gradients may still underflow to zero. If too high, the opposite the problem occurs: the gradients may overflow to infinity.

To solve this, TensorFlow dynamically determines the loss scale so you do not have to choose one manually. If you use tf.keras.Model.fit, loss scaling is done for you so you do not have to do any extra work. If you use a custom training loop, you must explicitly use the special optimizer wrapper tf.keras.mixed_precision.LossScaleOptimizer in order to use loss scaling. This is described in the next section.

## Training the model with a custom training loop

So far, you have trained a Keras model with mixed precision using tf.keras.Model.fit. Next, you will use mixed precision with a custom training loop. If you do not already know what a custom training loop is, please read the Custom training guide first.

Running a custom training loop with mixed precision requires two changes over running it in float32:

1. Build the model with mixed precision (you already did this)
2. Explicitly use loss scaling if mixed_float16 is used.

For step (2), you will use the tf.keras.mixed_precision.LossScaleOptimizer class, which wraps an optimizer and applies loss scaling. By default, it dynamically determines the loss scale so you do not have to choose one. Construct a LossScaleOptimizer as follows.

optimizer = keras.optimizers.RMSprop()
optimizer = mixed_precision.LossScaleOptimizer(optimizer)

If you want, it is possible choose an explicit loss scale or otherwise customize the loss scaling behavior, but it is highly recommended to keep the default loss scaling behavior, as it has been found to work well on all known models. See the tf.keras.mixed_precision.LossScaleOptimizer documentation if you want to customize the loss scaling behavior.

Next, define the loss object and the tf.data.Datasets:

loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
train_dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
.shuffle(10000).batch(8192))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(8192)

Next, define the training step function. You will use two new methods from the loss scale optimizer to scale the loss and unscale the gradients:

• get_scaled_loss(loss): Multiplies the loss by the loss scale
• get_unscaled_gradients(gradients): Takes in a list of scaled gradients as inputs, and divides each one by the loss scale to unscale them

These functions must be used in order to prevent underflow in the gradients. LossScaleOptimizer.apply_gradients will then apply gradients if none of them have Infs or NaNs. It will also update the loss scale, halving it if the gradients had Infs or NaNs and potentially increasing it otherwise.

@tf.function
def train_step(x, y):
predictions = model(x)
loss = loss_object(y, predictions)
scaled_loss = optimizer.get_scaled_loss(loss)
return loss

The LossScaleOptimizer will likely skip the first few steps at the start of training. The loss scale starts out high so that the optimal loss scale can quickly be determined. After a few steps, the loss scale will stabilize and very few steps will be skipped. This process happens automatically and does not affect training quality.

Now, define the test step:

@tf.function
def test_step(x):
return model(x, training=False)

Load the initial weights of the model, so you can retrain from scratch:

model.set_weights(initial_weights)

Finally, run the custom training loop:

for epoch in range(5):
epoch_loss_avg = tf.keras.metrics.Mean()
test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
name='test_accuracy')
for x, y in train_dataset:
loss = train_step(x, y)
epoch_loss_avg(loss)
for x, y in test_dataset:
predictions = test_step(x)
test_accuracy.update_state(y, predictions)
print('Epoch {}: loss={}, test accuracy={}'.format(epoch, epoch_loss_avg.result(), test_accuracy.result()))
Epoch 0: loss=2.1030306816101074, test accuracy=0.5697000026702881
Epoch 1: loss=0.7294090390205383, test accuracy=0.7651000022888184
Epoch 2: loss=0.46808382868766785, test accuracy=0.9204000234603882
Epoch 3: loss=0.3774358034133911, test accuracy=0.8847000002861023
Epoch 4: loss=0.2915167212486267, test accuracy=0.9106000065803528

## GPU performance tips

Here are some performance tips when using mixed precision on GPUs.

If it doesn't affect model quality, try running with double the batch size when using mixed precision. As float16 tensors use half the memory, this often allows you to double your batch size without running out of memory. Increasing batch size typically increases training throughput, i.e. the training elements per second your model can run on.

### Ensuring GPU Tensor Cores are used

As mentioned previously, modern NVIDIA GPUs use a special hardware unit called Tensor Cores that can multiply float16 matrices very quickly. However, Tensor Cores requires certain dimensions of tensors to be a multiple of 8. In the examples below, an argument is bold if and only if it needs to be a multiple of 8 for Tensor Cores to be used.

• tf.keras.layers.Dense(units=64)
• tf.keras.layers.Conv2d(filters=48, kernel_size=7, stride=3)
• And similarly for other convolutional layers, such as tf.keras.layers.Conv3d
• tf.keras.layers.LSTM(units=64)
• And similar for other RNNs, such as tf.keras.layers.GRU
• tf.keras.Model.fit(epochs=2, batch_size=128)

You should try to use Tensor Cores when possible. If you want to learn more, NVIDIA deep learning performance guide describes the exact requirements for using Tensor Cores as well as other Tensor Core-related performance information.

### XLA

XLA is a compiler that can further increase mixed precision performance, as well as float32 performance to a lesser extent. Refer to the XLA guide for details.

## Cloud TPU performance tips

As with GPUs, you should try doubling your batch size when using Cloud TPUs because bfloat16 tensors use half the memory. Doubling batch size may increase training throughput.

TPUs do not require any other mixed precision-specific tuning to get optimal performance. They already require the use of XLA. TPUs benefit from having certain dimensions being multiples of $$128$$, but this applies equally to the float32 type as it does for mixed precision. Check the Cloud TPU performance guide for general TPU performance tips, which apply to mixed precision as well as float32 tensors.

## Summary

• You should use mixed precision if you use TPUs or NVIDIA GPUs with at least compute capability 7.0, as it will improve performance by up to 3x.
• You can use mixed precision with the following lines:

# On TPUs, use 'mixed_bfloat16' instead
mixed_precision.set_global_policy('mixed_float16')

• If your model ends in softmax, make sure it is float32. And regardless of what your model ends in, make sure the output is float32.

• If you use a custom training loop with mixed_float16, in addition to the above lines, you need to wrap your optimizer with a tf.keras.mixed_precision.LossScaleOptimizer. Then call optimizer.get_scaled_loss to scale the loss, and optimizer.get_unscaled_gradients to unscale the gradients.

• Double the training batch size if it does not reduce evaluation accuracy

• On GPUs, ensure most tensor dimensions are a multiple of $$8$$ to maximize performance

For an example of mixed precision using the tf.keras.mixed_precision API, check functions and classes related to training performance. Check out the official models, such as Transformer, for details.

[{ "type": "thumb-down", "id": "missingTheInformationINeed", "label":"Missing the information I need" },{ "type": "thumb-down", "id": "tooComplicatedTooManySteps", "label":"Too complicated / too many steps" },{ "type": "thumb-down", "id": "outOfDate", "label":"Out of date" },{ "type": "thumb-down", "id": "samplesCodeIssue", "label":"Samples / code issue" },{ "type": "thumb-down", "id": "otherDown", "label":"Other" }]
[{ "type": "thumb-up", "id": "easyToUnderstand", "label":"Easy to understand" },{ "type": "thumb-up", "id": "solvedMyProblem", "label":"Solved my problem" },{ "type": "thumb-up", "id": "otherUp", "label":"Other" }]