
Training Keras models with TensorFlow Cloud


Introduction

TensorFlow Cloud is a Python package that provides APIs for a seamless transition from local debugging to distributed training in Google Cloud. It simplifies the process of training TensorFlow models on the cloud into a single, simple function call, requiring minimal setup and no changes to your model. TensorFlow Cloud automatically handles cloud-specific tasks such as creating VM instances and choosing distribution strategies for your models. This guide demonstrates how to interface with Google Cloud through TensorFlow Cloud and the wide range of functionality the package provides, starting with the simplest use case.

Setup

We'll get started by installing TensorFlow Cloud, and importing the packages we will need in this guide.

pip install -q tensorflow_cloud
import tensorflow as tf
import tensorflow_cloud as tfc

from tensorflow import keras
from tensorflow.keras import layers

API overview: a first end-to-end example

Let's begin with a Keras model training script, such as the following CNN:

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

model = keras.Sequential(
    [
        keras.Input(shape=(28, 28)),
        # Use a Rescaling layer to make sure input values are in the [0, 1] range.
        layers.experimental.preprocessing.Rescaling(1.0 / 255),
        # The original images have shape (28, 28), so we reshape them to (28, 28, 1)
        layers.Reshape(target_shape=(28, 28, 1)),
        # Follow-up with a classic small convnet
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10),
    ]
)

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=keras.metrics.SparseCategoricalAccuracy(),
)

model.fit(x_train, y_train, epochs=20, batch_size=128, validation_split=0.1)

To train this model on Google Cloud, we just need to add a call to run() at the beginning of the script, right after the imports:

tfc.run()

You don't need to worry about cloud-specific tasks such as creating VM instances and distribution strategies when using TensorFlow Cloud. The API includes intelligent defaults for all the parameters -- everything is configurable, but many models can rely on these defaults.

Upon calling run(), TensorFlow Cloud will:

  • Make your Python script or notebook distribution-ready.
  • Convert it into a Docker image with required dependencies.
  • Run the training job on a GCP GPU-powered VM.
  • Stream relevant logs and job information.

The default VM configuration is 1 chief and 0 workers with 8 CPU cores and 1 Tesla T4 GPU.
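
These defaults (and everything else about the machine shape) can be overridden through run()'s machine-configuration parameters. As a minimal sketch, the default shape roughly corresponds to the predefined "T4_1X" entry in tfc.COMMON_MACHINE_CONFIGS:

# Roughly equivalent to the defaults described above: a single chief
# machine with 8 CPU cores and 1 Tesla T4 GPU, and no separate workers.
tfc.run(
    chief_config=tfc.COMMON_MACHINE_CONFIGS["T4_1X"],
    worker_count=0,
)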

Google Cloud configuration

Before you can train on Google Cloud, you will need to do some first-time setup. If you're a new Google Cloud user, there are a few preliminary steps you will need to take:

  1. Create a GCP Project;
  2. Enable AI Platform Services;
  3. Create a Service Account;
  4. Download an authorization key;
  5. Create a Cloud Storage bucket.

Detailed first-time setup instructions can be found in the TensorFlow Cloud README, and an additional setup example is shown on the TensorFlow Blog.
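
Once the service account key is downloaded, TensorFlow Cloud picks up your credentials through the standard GOOGLE_APPLICATION_CREDENTIALS environment variable. A minimal sketch, assuming a hypothetical key location:

import os

# Hypothetical path -- point this at the key file you downloaded in step 4.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/your-service-account-key.json"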

Common workflows and Cloud storage

In most cases, you'll want to retrieve your model after training on Google Cloud. For this, it's crucial to redirect saving and loading to Cloud Storage while training remotely. We can direct TensorFlow Cloud to our Cloud Storage bucket for a variety of tasks. The storage bucket can be used to save and load large training datasets, store callback logs or model weights, and save trained model files. To begin, let's configure fit() to save the model to Cloud Storage, and set up TensorBoard monitoring to track training progress.

def create_model():
    model = keras.Sequential(
        [
            keras.Input(shape=(28, 28)),
            layers.experimental.preprocessing.Rescaling(1.0 / 255),
            layers.Reshape(target_shape=(28, 28, 1)),
            layers.Conv2D(32, 3, activation="relu"),
            layers.MaxPooling2D(2),
            layers.Conv2D(32, 3, activation="relu"),
            layers.MaxPooling2D(2),
            layers.Conv2D(32, 3, activation="relu"),
            layers.Flatten(),
            layers.Dense(128, activation="relu"),
            layers.Dense(10),
        ]
    )

    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=keras.metrics.SparseCategoricalAccuracy(),
    )
    return model

Let's save the TensorBoard logs and model checkpoints generated during training in our cloud storage bucket.

import datetime
import os

# Note: Please change the gcp_bucket to your bucket name.
gcp_bucket = "keras-examples"

checkpoint_path = os.path.join("gs://", gcp_bucket, "mnist_example", "save_at_{epoch}")

tensorboard_path = os.path.join(  # Timestamp included to enable timeseries graphs
    "gs://", gcp_bucket, "logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
)

callbacks = [
    # TensorBoard will store logs for each epoch and graph performance for us.
    keras.callbacks.TensorBoard(log_dir=tensorboard_path, histogram_freq=1),
    # ModelCheckpoint will save models after each epoch for retrieval later.
    keras.callbacks.ModelCheckpoint(checkpoint_path),
    # EarlyStopping will terminate training when val_loss ceases to improve.
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
]

model = create_model()

Here, we will load our data from Keras directly. In general, it's best practice to store your dataset in your Cloud Storage bucket; however, TensorFlow Cloud can also accommodate datasets stored locally. That's covered in the Multi-file projects section of this guide.

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
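
If your dataset does live in your Cloud Storage bucket instead, you can stream it directly with tf.data. The sketch below is purely illustrative: it assumes hypothetical TFRecord shards under your bucket and a parse_example function of your own.

# Hypothetical layout: TFRecord shards under gs://<gcp_bucket>/mnist_tfrecords/.
train_files = tf.io.gfile.glob(f"gs://{gcp_bucket}/mnist_tfrecords/train-*.tfrecord")
train_ds = (
    tf.data.TFRecordDataset(train_files)
    .map(parse_example)  # your own tf.train.Example parsing function
    .batch(128)
)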

The TensorFlow Cloud API provides the remote() function to determine whether code is being executed locally or on the cloud. This lets you designate separate fit() parameters for local and remote execution, and provides a means for easy debugging without overloading your local machine.

if tfc.remote():
    epochs = 100
    callbacks = callbacks
    batch_size = 128
else:
    epochs = 5
    batch_size = 64
    callbacks = None

model.fit(x_train, y_train, epochs=epochs, callbacks=callbacks, batch_size=batch_size)
Epoch 1/5
938/938 [==============================] - 12s 3ms/step - loss: 0.2065 - sparse_categorical_accuracy: 0.9374
Epoch 2/5
938/938 [==============================] - 3s 3ms/step - loss: 0.0577 - sparse_categorical_accuracy: 0.9822
Epoch 3/5
938/938 [==============================] - 3s 3ms/step - loss: 0.0415 - sparse_categorical_accuracy: 0.9868
Epoch 4/5
938/938 [==============================] - 3s 3ms/step - loss: 0.0332 - sparse_categorical_accuracy: 0.9893
Epoch 5/5
938/938 [==============================] - 3s 3ms/step - loss: 0.0275 - sparse_categorical_accuracy: 0.9915
<tensorflow.python.keras.callbacks.History at 0x7f7b0c66a390>

Let's save the model to GCS after training is complete.

save_path = os.path.join("gs://", gcp_bucket, "mnist_example")

if tfc.remote():
    model.save(save_path)

We can also use this storage bucket for building the Docker image, instead of your local Docker instance. For this, just pass your bucket name to the docker_image_bucket_name parameter.

# docs_infra: no_execute
tfc.run(docker_image_bucket_name=gcp_bucket)

After training the model, we can load the saved model and view our TensorBoard logs to monitor performance.

# docs_infra: no_execute
model = keras.models.load_model(save_path)
# docs_infra: no_execute
tensorboard dev upload --logdir "gs://keras-examples-jonah/logs/fit" --name "Guide MNIST"

Large-scale projects

In many cases, your project containing a Keras model may encompass more than one Python script, or may involve external data or specific dependencies. TensorFlow Cloud is entirely flexible for large-scale deployment, and provides a number of intelligent functionalities to aid your projects.

Entry points: support for Python scripts and Jupyter notebooks

Your call to the run() API won't always be contained inside the same Python script as your model training code. For this purpose, we provide an entry_point parameter. The entry_point parameter can be used to specify the Python script or notebook in which your model training code lives. When calling run() from the same script as your model, use the entry_point default of None.
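
For instance, a small launcher script kept separate from the training code could look like the following (the filenames and bucket name here are hypothetical):

# launch.py -- hypothetical launcher script.
import tensorflow_cloud as tfc

tfc.run(
    entry_point="mnist_trainer.py",          # hypothetical training script
    docker_image_bucket_name="your-bucket",  # replace with your bucket name
)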

pip dependencies

If your project requires additional pip dependencies, you can specify them by including a requirements.txt file. In this file, simply list all the required dependencies and TensorFlow Cloud will handle integrating them into your cloud build.
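
For example, a requirements.txt for a project like this one might contain nothing more than a plain list of package names (the entries below are illustrative):

tensorflow-datasets
pandas
matplotlib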

Python notebooks

TensorFlow Cloud is also runnable from Python notebooks. Additionally, your specified entry_point can be a notebook if needed. There are two key differences to keep in mind between using TensorFlow Cloud from a notebook and from a script:

  • When calling run() from within a notebook, a Cloud Storage bucket must be specified for building and storing your Docker image.
  • GCloud authentication happens entirely through your authentication key, without project specification.

An example workflow using TensorFlow Cloud from a notebook is provided in the "Putting it all together" section of this guide.

Multi-file projects

If your model depends on additional files, you only need to ensure that these files live in the same directory as (or in a subdirectory of) the specified entry point. Every file stored in the same directory as the specified entry_point will be included in the Docker image, as well as any files stored in subdirectories adjacent to the entry_point. This is also true for dependencies you may need that can't be acquired through pip.

For an example of a custom entry-point and multi-file project with additional pip dependencies, take a look at this multi-file example on the TensorFlow Cloud Repository. For brevity, we'll just include the example's run() call:

tfc.run(
    docker_image_bucket_name=gcp_bucket,
    entry_point="train_model.py",
    requirements_txt="requirements.txt"
)

Machine configuration and distributed training

Model training may require a wide range of different resources, depending on the size of the model or the dataset. When accounting for configurations with multiple GPUs, it becomes critical to choose a fitting distribution strategy. Here, we outline a few possible configurations:

Multi-worker distribution

Here, we can use COMMON_MACHINE_CONFIGS to designate a CPU-only chief and 2 workers, each with 4 T4 GPUs.

tfc.run(
    docker_image_bucket_name=gcp_bucket,
    chief_config=tfc.COMMON_MACHINE_CONFIGS['CPU'],
    worker_count=2,
    worker_config=tfc.COMMON_MACHINE_CONFIGS['T4_4X']
)

By default, TensorFlow Cloud chooses the best distribution strategy for your machine configuration with a simple formula using the chief_config, worker_config and worker_count parameters provided.
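
The exact selection logic lives inside TensorFlow Cloud, but conceptually it resembles the following sketch (an illustration of the idea, not the library's actual code):

# Illustrative only -- not the actual tensorflow_cloud implementation.
def pick_strategy(chief_gpu_count, worker_count, workers_are_tpu):
    if workers_are_tpu:
        return "TPUStrategy"
    if worker_count > 0:
        return "MultiWorkerMirroredStrategy"
    if chief_gpu_count > 1:
        return "MirroredStrategy"
    return "OneDeviceStrategy (or no strategy)"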

TPU distribution

Let's train the same model on TPU, as shown:

tfc.run(
    docker_image_bucket_name=gcp_bucket,
    chief_config=tfc.COMMON_MACHINE_CONFIGS["CPU"],
    worker_count=1,
    worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"]
)

Custom distribution strategy

To specify a custom distribution strategy, format your code normally as you would according to the distributed training guide and set distribution_strategy to None. Below, we'll specify our own distribution strategy for the same MNIST model.

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = create_model()

if tfc.remote():
    epochs = 100
    batch_size = 128
else:
    epochs = 10
    batch_size = 64
    callbacks = None

model.fit(
    x_train, y_train, epochs=epochs, callbacks=callbacks, batch_size=batch_size
)

tfc.run(
    docker_image_bucket_name=gcp_bucket,
    chief_config=tfc.COMMON_MACHINE_CONFIGS['CPU'],
    worker_count=2,
    worker_config=tfc.COMMON_MACHINE_CONFIGS['T4_4X'],
    distribution_strategy=None
)

Custom Docker images

By default, TensorFlow Cloud uses a Docker base image supplied by Google and corresponding to your current TensorFlow version. However, you can also specify a custom Docker image to fit your build requirements, if necessary. For this example, we will specify the Docker image from an older version of TensorFlow:

tfc.run(
    docker_image_bucket_name=gcp_bucket,
    base_docker_image="tensorflow/tensorflow:2.1.0-gpu"
)

Additional metrics

You may find it useful to tag your Cloud jobs with specific labels, or to stream your model's logs during Cloud training. It's good practice to maintain proper labeling on all Cloud jobs for record-keeping. For this purpose, run() accepts a dictionary of up to 64 label key-value pairs, which are visible from the Cloud build logs. Logs such as epoch performance and model saving internals can be accessed using the link provided when executing tfc.run(), or printed to your local terminal using the stream_logs flag.

job_labels = {"job": "mnist-example", "team": "keras-io", "user": "jonah"}

tfc.run(
    docker_image_bucket_name=gcp_bucket,
    job_labels=job_labels,
    stream_logs=True
)

Putting it all together

For an in-depth Colab which uses many of the features described in this guide, follow along with this example to train a state-of-the-art model to recognize dog breeds from photos using feature extraction.