Introduction
TensorFlow Cloud is a Python package that provides APIs for a seamless transition from local debugging to distributed training in Google Cloud. It simplifies the process of training TensorFlow models on the cloud into a single, simple function call, requiring minimal setup and no changes to your model. TensorFlow Cloud handles cloud-specific tasks such as creating VM instances and distribution strategies for your models automatically. This guide will demonstrate how to interface with Google Cloud through TensorFlow Cloud, and the wide range of functionality it provides. We'll start with the simplest use case.
Setup
We'll get started by installing TensorFlow Cloud, and importing the packages we will need in this guide.
pip install -q tensorflow_cloud
import tensorflow as tf
import tensorflow_cloud as tfc
from tensorflow import keras
from tensorflow.keras import layers
API overview: a first end-to-end example
Let's begin with a Keras model training script, such as the following CNN:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

model = keras.Sequential(
    [
        keras.Input(shape=(28, 28)),
        # Use a Rescaling layer to make sure input values are in the [0, 1] range.
        layers.experimental.preprocessing.Rescaling(1.0 / 255),
        # The original images have shape (28, 28), so we reshape them to (28, 28, 1)
        layers.Reshape(target_shape=(28, 28, 1)),
        # Follow up with a classic small convnet
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10),
    ]
)

model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

model.fit(x_train, y_train, epochs=20, batch_size=128, validation_split=0.1)
To train this model on Google Cloud we just need to add a call to `run()` at the beginning of the script, before the imports:
tfc.run()
You don't need to worry about cloud-specific tasks such as creating VM instances and distribution strategies when using TensorFlow Cloud. The API includes intelligent defaults for all the parameters; everything is configurable, but many models can rely on these defaults.
Upon calling `run()`, TensorFlow Cloud will:
- Make your Python script or notebook distribution-ready.
- Convert it into a Docker image with required dependencies.
- Run the training job on a GCP GPU-powered VM.
- Stream relevant logs and job information.
The default VM configuration is 1 chief and 0 workers with 8 CPU cores and 1 Tesla T4 GPU.
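Spelled out explicitly, that default corresponds to something like the following sketch. Our assumption here is that `COMMON_MACHINE_CONFIGS["T4_1X"]` describes the 8-core, single-T4 machine mentioned above; check the constants in your installed version.

# A sketch making the default machine configuration explicit.
# Assumption: COMMON_MACHINE_CONFIGS["T4_1X"] is the 8-vCPU, 1x Tesla T4 machine.
tfc.run(
    chief_config=tfc.COMMON_MACHINE_CONFIGS["T4_1X"],
    worker_count=0,
)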
Google Cloud configuration
Before you can train on Google Cloud, you'll need to complete some first-time setup. If you're a new Google Cloud user, there are a few preliminary steps you will need to take:
- Create a GCP Project;
- Enable AI Platform Services;
- Create a Service Account;
- Download an authorization key;
- Create a Cloud Storage bucket.
Detailed first-time setup instructions can be found in the TensorFlow Cloud README, and an additional setup example is shown on the TensorFlow Blog.
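As a quick illustration, one common way to make your downloaded authorization key visible to TensorFlow Cloud from Python is the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable. This is a minimal sketch; the key path is a placeholder for wherever you saved your service account key.

import os

# Point Google Cloud client libraries (and TensorFlow Cloud) at your key.
# "path/to/service_account_key.json" is a placeholder, not a real path.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/service_account_key.json"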
Common workflows and Cloud storage
In most cases, you'll want to retrieve your model after training on Google Cloud.
For this, it's crucial to redirect saving and loading to Cloud Storage while
training remotely. We can direct TensorFlow Cloud to our Cloud Storage bucket for
a variety of tasks. The storage bucket can be used to save and load large training
datasets, store callback logs or model weights, and save trained model files.
To begin, let's configure `fit()` to save the model to Cloud Storage, and set up TensorBoard monitoring to track training progress.
def create_model():
    model = keras.Sequential(
        [
            keras.Input(shape=(28, 28)),
            layers.experimental.preprocessing.Rescaling(1.0 / 255),
            layers.Reshape(target_shape=(28, 28, 1)),
            layers.Conv2D(32, 3, activation="relu"),
            layers.MaxPooling2D(2),
            layers.Conv2D(32, 3, activation="relu"),
            layers.MaxPooling2D(2),
            layers.Conv2D(32, 3, activation="relu"),
            layers.Flatten(),
            layers.Dense(128, activation="relu"),
            layers.Dense(10),
        ]
    )
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    return model
Let's save the TensorBoard logs and model checkpoints generated during training in our cloud storage bucket.
import datetime
import os

# Note: Please change the gcp_bucket to your bucket name.
gcp_bucket = "keras-examples"

checkpoint_path = os.path.join("gs://", gcp_bucket, "mnist_example", "save_at_{epoch}")

tensorboard_path = os.path.join(  # Timestamp included to enable timeseries graphs
    "gs://", gcp_bucket, "logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
)

callbacks = [
    # TensorBoard will store logs for each epoch and graph performance for us.
    keras.callbacks.TensorBoard(log_dir=tensorboard_path, histogram_freq=1),
    # ModelCheckpoint will save models after each epoch for retrieval later.
    keras.callbacks.ModelCheckpoint(checkpoint_path),
    # EarlyStopping will terminate training when `val_loss` ceases to improve.
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
]

model = create_model()
(TensorFlow device-setup logs omitted: the runtime located one Tesla V100-SXM2-16GB GPU and created the corresponding TensorFlow device.)
WARNING:tensorflow:Please add `keras.layers.InputLayer` instead of `keras.Input` to Sequential model. `keras.Input` is intended to be used by Functional model.
Here, we will load our data from Keras directly. In general, it's best practice to store your dataset in your Cloud Storage bucket; however, TensorFlow Cloud can also accommodate datasets stored locally. That's covered in the Multi-file section of this guide.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
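In this guide we use the Keras-hosted copy; as a sketch, pulling the same arrays from your bucket instead might look like the following (assuming you had previously saved them to a hypothetical object `datasets/mnist.npz` in the bucket; `tf.io.gfile` reads `gs://` paths directly):

import numpy as np

# A sketch of loading training data from your Cloud Storage bucket.
# "datasets/mnist.npz" is a hypothetical object saved by an earlier job.
with tf.io.gfile.GFile("gs://" + gcp_bucket + "/datasets/mnist.npz", "rb") as f:
    data = np.load(f)
    x_train, y_train = data["x_train"], data["y_train"]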
The TensorFlow Cloud API provides the `remote()` function to determine whether code is being executed locally or on the cloud. This allows for the separate designation of `fit()` parameters for local and remote execution, and provides a means for easy debugging without overloading your local machine.
if tfc.remote():
    epochs = 100
    callbacks = callbacks
    batch_size = 128
else:
    epochs = 5
    batch_size = 64
    callbacks = None

model.fit(x_train, y_train, epochs=epochs, callbacks=callbacks, batch_size=batch_size)
(TensorFlow library-loading logs omitted.)
Epoch 1/5
938/938 [==============================] - 12s 3ms/step - loss: 0.2065 - sparse_categorical_accuracy: 0.9374
Epoch 2/5
938/938 [==============================] - 3s 3ms/step - loss: 0.0577 - sparse_categorical_accuracy: 0.9822
Epoch 3/5
938/938 [==============================] - 3s 3ms/step - loss: 0.0415 - sparse_categorical_accuracy: 0.9868
Epoch 4/5
938/938 [==============================] - 3s 3ms/step - loss: 0.0332 - sparse_categorical_accuracy: 0.9893
Epoch 5/5
938/938 [==============================] - 3s 3ms/step - loss: 0.0275 - sparse_categorical_accuracy: 0.9915

<tensorflow.python.keras.callbacks.History at 0x7f7b0c66a390>
Let's save the model in GCS after the training is complete.
save_path = os.path.join("gs://", gcp_bucket, "mnist_example")

if tfc.remote():
    model.save(save_path)
We can also use this storage bucket for Docker image building, instead of your local Docker instance. For this, just add your bucket to the `docker_image_bucket_name` parameter.
# docs_infra: no_execute
tfc.run(docker_image_bucket_name=gcp_bucket)
After training the model, we can load the saved model and view our TensorBoard logs to monitor performance.
# docs_infra: no_execute
model = keras.models.load_model(save_path)
# docs_infra: no_execute
tensorboard dev upload --logdir "gs://keras-examples-jonah/logs/fit" --name "Guide MNIST"
Large-scale projects
In many cases, your project containing a Keras model may encompass more than one Python script, or may involve external data or specific dependencies. TensorFlow Cloud is entirely flexible for large-scale deployment, and provides a number of intelligent functionalities to aid your projects.
Entry points: support for Python scripts and Jupyter notebooks
Your call to the `run()` API won't always be contained inside the same Python script as your model training code. For this purpose, we provide an `entry_point` parameter. The `entry_point` parameter can be used to specify the Python script or notebook in which your model training code lives. When calling `run()` from the same script as your model, use the `entry_point` default of `None`.
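For instance, a short launcher could point at training code kept in a separate file; a sketch, where the filename is hypothetical:

# A sketch of launching training code that lives in a sibling file.
# "mnist_example.ipynb" is a hypothetical notebook in the same directory;
# a .py script works the same way.
tfc.run(
    entry_point="mnist_example.ipynb",
    docker_image_bucket_name=gcp_bucket,
)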
pip dependencies
If your project calls on additional `pip` dependencies, it's possible to specify the additional required libraries by including a `requirements.txt` file. In this file, simply put a list of all the required dependencies and TensorFlow Cloud will handle integrating these into your cloud build.
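As a sketch, with a `requirements.txt` sitting next to your entry point (the listed packages are hypothetical examples), the call only needs to name the file:

# requirements.txt (example contents; hypothetical extras):
#   pandas
#   matplotlib
tfc.run(
    docker_image_bucket_name=gcp_bucket,
    requirements_txt="requirements.txt",
)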
Python notebooks
TensorFlow Cloud is also runnable from Python notebooks. Additionally, your specified `entry_point` can be a notebook if needed. There are two key differences to keep in mind between TensorFlow Cloud on notebooks compared to scripts:

- When calling `run()` from within a notebook, a Cloud Storage bucket must be specified for building and storing your Docker image.
- GCloud authentication happens entirely through your authentication key, without project specification.

An example workflow using TensorFlow Cloud from a notebook is provided in the "Putting it all together" section of this guide.
Multi-file projects
If your model depends on additional files, you only need to ensure that these files live in the same directory (or a subdirectory) as the specified entry point. Every file that is stored in the same directory as the specified `entry_point` will be included in the Docker image, as well as any files stored in subdirectories adjacent to the `entry_point`. This is also true for dependencies you may need which can't be acquired through `pip`.

For an example of a custom entry point and multi-file project with additional `pip` dependencies, take a look at this multi-file example on the TensorFlow Cloud Repository.
For brevity, we'll just include the example's `run()` call:
tfc.run(
    docker_image_bucket_name=gcp_bucket,
    entry_point="train_model.py",
    requirements_txt="requirements.txt"
)
Machine configuration and distributed training
Model training may require a wide range of different resources, depending on the size of the model or the dataset. When accounting for configurations with multiple GPUs, it becomes critical to choose a fitting distribution strategy. Here, we outline a few possible configurations:
Multi-worker distribution
Here, we can use `COMMON_MACHINE_CONFIGS` to designate 1 chief CPU and 2 workers with 4 GPUs each.
tfc.run(
    docker_image_bucket_name=gcp_bucket,
    chief_config=tfc.COMMON_MACHINE_CONFIGS['CPU'],
    worker_count=2,
    worker_config=tfc.COMMON_MACHINE_CONFIGS['T4_4X']
)
By default, TensorFlow Cloud chooses the best distribution strategy for your machine configuration with a simple formula using the `chief_config`, `worker_config` and `worker_count` parameters provided.
- If the number of GPUs specified is greater than zero, `tf.distribute.MirroredStrategy` will be chosen (see the sketch after this list).
- If the number of workers is greater than zero, `tf.distribute.experimental.MultiWorkerMirroredStrategy` or `tf.distribute.experimental.TPUStrategy` will be chosen based on the accelerator type.
- Otherwise, `tf.distribute.OneDeviceStrategy` will be chosen.
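For example, under these rules a single multi-GPU chief with no workers lands on `tf.distribute.MirroredStrategy`; a sketch (the machine choice is just an example):

# A sketch: one chief with 4 GPUs and zero workers, so by the rules above
# TensorFlow Cloud should select tf.distribute.MirroredStrategy.
tfc.run(
    docker_image_bucket_name=gcp_bucket,
    chief_config=tfc.COMMON_MACHINE_CONFIGS["V100_4X"],
    worker_count=0,
)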
TPU distribution
Let's train the same model on TPU, as shown:
tfc.run(
    docker_image_bucket_name=gcp_bucket,
    chief_config=tfc.COMMON_MACHINE_CONFIGS["CPU"],
    worker_count=1,
    worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"]
)
Custom distribution strategy
To specify a custom distribution strategy, format your code normally as you would according to the distributed training guide and set `distribution_strategy` to `None`. Below, we'll specify our own distribution strategy for the same MNIST model.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = create_model()

if tfc.remote():
    epochs = 100
    batch_size = 128
else:
    epochs = 10
    batch_size = 64
    callbacks = None

model.fit(
    x_train, y_train, epochs=epochs, callbacks=callbacks, batch_size=batch_size
)
tfc.run(
    docker_image_bucket_name=gcp_bucket,
    chief_config=tfc.COMMON_MACHINE_CONFIGS['CPU'],
    worker_count=2,
    worker_config=tfc.COMMON_MACHINE_CONFIGS['T4_4X'],
    distribution_strategy=None
)
Custom Docker images
By default, TensorFlow Cloud uses a Docker base image supplied by Google and corresponding to your current TensorFlow version. However, you can also specify a custom Docker image to fit your build requirements, if necessary. For this example, we will specify the Docker image from an older version of TensorFlow:
tfc.run(
    docker_image_bucket_name=gcp_bucket,
    base_docker_image="tensorflow/tensorflow:2.1.0-gpu"
)
Additional metrics
You may find it useful to tag your Cloud jobs with specific labels, or to stream your model's logs during Cloud training.

It's good practice to maintain proper labeling on all Cloud jobs, for record-keeping. For this purpose, `run()` accepts a dictionary of up to 64 key-value pairs as labels, which are visible from the Cloud build logs. Logs such as epoch performance and model saving internals can be accessed using the link provided by executing `tfc.run` or printed to your local terminal using the `stream_logs` flag.
job_labels = {"job": "mnist-example", "team": "keras-io", "user": "jonah"}

tfc.run(
    docker_image_bucket_name=gcp_bucket,
    job_labels=job_labels,
    stream_logs=True
)
Putting it all together
For an in-depth Colab that uses many of the features described in this guide, follow along with this example to train a state-of-the-art model to recognize dog breeds from photos using feature extraction.