Se usó la API de Cloud Translation para traducir esta página.
Switch to English

Usar TPU

View on TensorFlow.org Run in Google Colab View source on GitHub Download notebook

Experimental support for Cloud TPUs is currently available for Keras and Google Colab. Before you run this Colab notebooks, ensure that your hardware accelerator is a TPU by checking your notebook settings: Runtime > Change runtime type > Hardware accelerator > TPU.

Setup

import tensorflow as tf

import os
import tensorflow_datasets as tfds

TPU Initialization

TPUs are usually on Cloud TPU workers which are different from the local process running the user python program. Thus some initialization work needs to be done to connect to the remote cluster and initialize TPUs. Note that the tpu argument to TPUClusterResolver is a special address just for Colab. In the case that you are running on Google Compute Engine (GCE), you should instead pass in the name of your CloudTPU.

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
INFO:tensorflow:Initializing the TPU system: grpc://10.240.1.2:8470

INFO:tensorflow:Initializing the TPU system: grpc://10.240.1.2:8470

INFO:tensorflow:Clearing out eager caches

INFO:tensorflow:Clearing out eager caches

INFO:tensorflow:Finished initializing TPU system.

INFO:tensorflow:Finished initializing TPU system.

All devices:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU')]

Manual device placement

After the TPU is initialized, you can use manual device placement to place the computation on a single TPU device.

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
with tf.device('/TPU:0'):
  c = tf.matmul(a, b)
print("c device: ", c.device)
print(c)
c device:  /job:worker/replica:0/task:0/device:TPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

Distribution strategies

Most times users want to run the model on multiple TPUs in a data parallel way. A distribution strategy is an abstraction that can be used to drive models on CPU, GPUs or TPUs. Simply swap out the distribution strategy and the model will run on the given device. See the distribution strategy guide for more information.

First, creates the TPUStrategy object.

strategy = tf.distribute.TPUStrategy(resolver)
INFO:tensorflow:Found TPU system:

INFO:tensorflow:Found TPU system:

INFO:tensorflow:*** Num TPU Cores: 8

INFO:tensorflow:*** Num TPU Cores: 8

INFO:tensorflow:*** Num TPU Workers: 1

INFO:tensorflow:*** Num TPU Workers: 1

INFO:tensorflow:*** Num TPU Cores Per Worker: 8

INFO:tensorflow:*** Num TPU Cores Per Worker: 8

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)

INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)

To replicate a computation so it can run in all TPU cores, you can simply pass it to strategy.run API. Below is an example that all the cores will obtain the same inputs (a, b), and do the matmul on each core independently. The outputs will be the values from all the replicas.

@tf.function
def matmul_fn(x, y):
  z = tf.matmul(x, y)
  return z

z = strategy.run(matmul_fn, args=(a, b))
print(z)
PerReplica:{
  0: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  1: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  2: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  3: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  4: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  5: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  6: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32),
  7: tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
}

Classification on TPUs

As we have learned the basic concepts, it is time to look at a more concrete example. This guide demonstrates how to use the distribution strategy tf.distribute.experimental.TPUStrategy to drive a Cloud TPU and train a Keras model.

Define a Keras model

Below is the definition of MNIST model using Keras, unchanged from what you would use on CPU or GPU. Note that Keras model creation needs to be inside strategy.scope, so the variables can be created on each TPU device. Other parts of the code is not necessary to be inside the strategy scope.

def create_model():
  return tf.keras.Sequential(
      [tf.keras.layers.Conv2D(256, 3, activation='relu', input_shape=(28, 28, 1)),
       tf.keras.layers.Conv2D(256, 3, activation='relu'),
       tf.keras.layers.Flatten(),
       tf.keras.layers.Dense(256, activation='relu'),
       tf.keras.layers.Dense(128, activation='relu'),
       tf.keras.layers.Dense(10)])

Input datasets

Efficient use of the tf.data.Dataset API is critical when using a Cloud TPU, as it is impossible to use the Cloud TPUs unless you can feed them data quickly enough. See Input Pipeline Performance Guide for details on dataset performance.

For all but the simplest experimentation (using tf.data.Dataset.from_tensor_slices or other in-graph data) you will need to store all data files read by the Dataset in Google Cloud Storage (GCS) buckets.

For most use-cases, it is recommended to convert your data into TFRecord format and use a tf.data.TFRecordDataset to read it. See TFRecord and tf.Example tutorial for details on how to do this. This, however, is not a hard requirement and you can use other dataset readers (FixedLengthRecordDataset or TextLineDataset) if you prefer.

Small datasets can be loaded entirely into memory using tf.data.Dataset.cache.

Regardless of the data format used, it is strongly recommended that you use large files, on the order of 100MB. This is especially important in this networked setting as the overhead of opening a file is significantly higher.

Here you should use the tensorflow_datasets module to get a copy of the MNIST training data. Note that try_gcs is specified to use a copy that is available in a public GCS bucket. If you don't specify this, the TPU will not be able to access the data that is downloaded.

def get_dataset(batch_size, is_training=True):
  split = 'train' if is_training else 'test'
  dataset, info = tfds.load(name='mnist', split=split, with_info=True,
                            as_supervised=True, try_gcs=True)

  def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255.0

    return image, label

  dataset = dataset.map(scale)

  # Only shuffle and repeat the dataset in training. The advantage to have a
  # infinite dataset for training is to avoid the potential last partial batch
  # in each epoch, so users don't need to think about scaling the gradients
  # based on the actual batch size.
  if is_training:
    dataset = dataset.shuffle(10000)
    dataset = dataset.repeat()

  dataset = dataset.batch(batch_size)

  return dataset

Train a model using Keras high level APIs

You can train a model simply with Keras fit/compile APIs. Nothing here is TPU specific, you would write the same code below if you had mutliple GPUs and where using a MirroredStrategy rather than a TPUStrategy. To learn more, check out the Distributed training with Keras tutorial.

with strategy.scope():
  model = create_model()
  model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['sparse_categorical_accuracy'])

batch_size = 200
steps_per_epoch = 60000 // batch_size
validation_steps = 10000 // batch_size

train_dataset = get_dataset(batch_size, is_training=True)
test_dataset = get_dataset(batch_size, is_training=False)

model.fit(train_dataset,
          epochs=5,
          steps_per_epoch=steps_per_epoch,
          validation_data=test_dataset, 
          validation_steps=validation_steps)
Epoch 1/5
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py:601: get_next_as_optional (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.

Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py:601: get_next_as_optional (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Iterator.get_next_as_optional()` instead.

  1/300 [..............................] - ETA: 20:25 - loss: 2.3062 - sparse_categorical_accuracy: 0.0500WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0015s vs `on_train_batch_end` time: 0.0237s). Check your callbacks.

Warning:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0015s vs `on_train_batch_end` time: 0.0237s). Check your callbacks.

300/300 [==============================] - ETA: 0s - loss: 0.1450 - sparse_categorical_accuracy: 0.9561WARNING:tensorflow:Callbacks method `on_test_batch_end` is slow compared to the batch time (batch time: 0.0013s vs `on_test_batch_end` time: 0.0097s). Check your callbacks.

Warning:tensorflow:Callbacks method `on_test_batch_end` is slow compared to the batch time (batch time: 0.0013s vs `on_test_batch_end` time: 0.0097s). Check your callbacks.

300/300 [==============================] - 15s 50ms/step - loss: 0.1450 - sparse_categorical_accuracy: 0.9561 - val_loss: 0.0456 - val_sparse_categorical_accuracy: 0.9852
Epoch 2/5
300/300 [==============================] - 8s 28ms/step - loss: 0.0357 - sparse_categorical_accuracy: 0.9890 - val_loss: 0.0372 - val_sparse_categorical_accuracy: 0.9884
Epoch 3/5
300/300 [==============================] - 9s 28ms/step - loss: 0.0193 - sparse_categorical_accuracy: 0.9940 - val_loss: 0.0557 - val_sparse_categorical_accuracy: 0.9835
Epoch 4/5
300/300 [==============================] - 9s 29ms/step - loss: 0.0141 - sparse_categorical_accuracy: 0.9954 - val_loss: 0.0405 - val_sparse_categorical_accuracy: 0.9883
Epoch 5/5
300/300 [==============================] - 9s 29ms/step - loss: 0.0092 - sparse_categorical_accuracy: 0.9967 - val_loss: 0.0428 - val_sparse_categorical_accuracy: 0.9887

<tensorflow.python.keras.callbacks.History at 0x7f480c793240>

To reduce python overhead, and maximize the performance of your TPU, try out the experimental experimental_steps_per_execution argument to Model.compile. Here it increases throughput by about 50%:

with strategy.scope():
  model = create_model()
  model.compile(optimizer='adam',
                # Anything between 2 and `steps_per_epoch` could help here.
                experimental_steps_per_execution = 50,
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['sparse_categorical_accuracy'])

model.fit(train_dataset,
          epochs=5,
          steps_per_epoch=steps_per_epoch,
          validation_data=test_dataset,
          validation_steps=validation_steps)
Epoch 1/5
300/300 [==============================] - 16s 52ms/step - loss: 0.1337 - sparse_categorical_accuracy: 0.9583 - val_loss: 0.0521 - val_sparse_categorical_accuracy: 0.9840
Epoch 2/5
300/300 [==============================] - 5s 16ms/step - loss: 0.0331 - sparse_categorical_accuracy: 0.9898 - val_loss: 0.0360 - val_sparse_categorical_accuracy: 0.9884
Epoch 3/5
300/300 [==============================] - 5s 16ms/step - loss: 0.0188 - sparse_categorical_accuracy: 0.9939 - val_loss: 0.0405 - val_sparse_categorical_accuracy: 0.9887
Epoch 4/5
300/300 [==============================] - 5s 16ms/step - loss: 0.0117 - sparse_categorical_accuracy: 0.9961 - val_loss: 0.0808 - val_sparse_categorical_accuracy: 0.9820
Epoch 5/5
300/300 [==============================] - 5s 16ms/step - loss: 0.0119 - sparse_categorical_accuracy: 0.9962 - val_loss: 0.0488 - val_sparse_categorical_accuracy: 0.9862

<tensorflow.python.keras.callbacks.History at 0x7f47ac791240>

Train a model using custom training loop.

You can also create and train your models using tf.function and tf.distribute APIs directly. strategy.experimental_distribute_datasets_from_function API is used to distribute the dataset given a dataset function. Note that the batch size passed into the dataset will be per replica batch size instead of global batch size in this case. To learn more, check out the Custom training with tf.distribute.Strategy tutorial.

First, create the model, datasets and tf.functions.

# Create the model, optimizer and metrics inside strategy scope, so that the
# variables can be mirrored on each device.
with strategy.scope():
  model = create_model()
  optimizer = tf.keras.optimizers.Adam()
  training_loss = tf.keras.metrics.Mean('training_loss', dtype=tf.float32)
  training_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      'training_accuracy', dtype=tf.float32)

# Calculate per replica batch size, and distribute the datasets on each TPU
# worker.
per_replica_batch_size = batch_size // strategy.num_replicas_in_sync

train_dataset = strategy.experimental_distribute_datasets_from_function(
    lambda _: get_dataset(per_replica_batch_size, is_training=True))

@tf.function
def train_step(iterator):
  """The step function for one training step"""

  def step_fn(inputs):
    """The computation to run on each TPU device."""
    images, labels = inputs
    with tf.GradientTape() as tape:
      logits = model(images, training=True)
      loss = tf.keras.losses.sparse_categorical_crossentropy(
          labels, logits, from_logits=True)
      loss = tf.nn.compute_average_loss(loss, global_batch_size=batch_size)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(list(zip(grads, model.trainable_variables)))
    training_loss.update_state(loss * strategy.num_replicas_in_sync)
    training_accuracy.update_state(labels, logits)

  strategy.run(step_fn, args=(next(iterator),))

Then run the training loop.

steps_per_eval = 10000 // batch_size

train_iterator = iter(train_dataset)
for epoch in range(5):
  print('Epoch: {}/5'.format(epoch))

  for step in range(steps_per_epoch):
    train_step(train_iterator)
  print('Current step: {}, training loss: {}, accuracy: {}%'.format(
      optimizer.iterations.numpy(),
      round(float(training_loss.result()), 4),
      round(float(training_accuracy.result()) * 100, 2)))
  training_loss.reset_states()
  training_accuracy.reset_states()
Epoch: 0/5
Current step: 300, training loss: 0.1335, accuracy: 95.8%
Epoch: 1/5
Current step: 600, training loss: 0.0344, accuracy: 98.93%
Epoch: 2/5
Current step: 900, training loss: 0.0196, accuracy: 99.35%
Epoch: 3/5
Current step: 1200, training loss: 0.0119, accuracy: 99.61%
Epoch: 4/5
Current step: 1500, training loss: 0.0102, accuracy: 99.65%

Improving performance by multiple steps within tf.function

The performance can be improved by running multiple steps within a tf.function. This is achieved by wrapping the strategy.run call with a tf.range inside tf.function, AutoGraph will convert it to a tf.while_loop on the TPU worker.

Although with better performance, there are tradeoffs comparing with a single step inside tf.function. Running multiple steps in a tf.function is less flexible, you cannot run things eagerly or arbitrary python code within the steps.

@tf.function
def train_multiple_steps(iterator, steps):
  """The step function for one training step"""

  def step_fn(inputs):
    """The computation to run on each TPU device."""
    images, labels = inputs
    with tf.GradientTape() as tape:
      logits = model(images, training=True)
      loss = tf.keras.losses.sparse_categorical_crossentropy(
          labels, logits, from_logits=True)
      loss = tf.nn.compute_average_loss(loss, global_batch_size=batch_size)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(list(zip(grads, model.trainable_variables)))
    training_loss.update_state(loss * strategy.num_replicas_in_sync)
    training_accuracy.update_state(labels, logits)

  for _ in tf.range(steps):
    strategy.run(step_fn, args=(next(iterator),))

# Convert `steps_per_epoch` to `tf.Tensor` so the `tf.function` won't get 
# retraced if the value changes.
train_multiple_steps(train_iterator, tf.convert_to_tensor(steps_per_epoch))

print('Current step: {}, training loss: {}, accuracy: {}%'.format(
      optimizer.iterations.numpy(),
      round(float(training_loss.result()), 4),
      round(float(training_accuracy.result()) * 100, 2)))
Current step: 1800, training loss: 0.0087, accuracy: 99.72%

Next steps