View on TensorFlow.org | Run in Google Colab | View source on GitHub | Download notebook |

## Overview

`tf.distribute.Strategy`

is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs. Using this API, you can distribute your existing models and training code with minimal code changes.

`tf.distribute.Strategy`

has been designed with these key goals in mind:

- Easy to use and support multiple user segments, including researchers, machine learning engineers, etc.
- Provide good performance out of the box.
- Easy switching between strategies.

You can distribute training using `tf.distribute.Strategy`

with a high-level API like Keras `Model.fit`

, as well as custom training loops (and, in general, any computation using TensorFlow).

In TensorFlow 2.x, you can execute your programs eagerly, or in a graph using `tf.function`

. `tf.distribute.Strategy`

intends to support both these modes of execution, but works best with `tf.function`

. Eager mode is only recommended for debugging purposes and not supported for `tf.distribute.TPUStrategy`

. Although training is the focus of this guide, this API can also be used for distributing evaluation and prediction on different platforms.

You can use `tf.distribute.Strategy`

with very few changes to your code, because the underlying components of TensorFlow have been changed to become strategy-aware. This includes variables, layers, models, optimizers, metrics, summaries, and checkpoints.

In this guide, you will learn about various types of strategies and how you can use them in different situations. To learn how to debug performance issues, check out the Optimize TensorFlow GPU performance guide.

## Set up TensorFlow

```
import tensorflow as tf
```

2024-07-19 02:37:20.035597: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2024-07-19 02:37:20.056573: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2024-07-19 02:37:20.062960: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered

## Types of strategies

`tf.distribute.Strategy`

intends to cover a number of use cases along different axes. Some of these combinations are currently supported and others will be added in the future. Some of these axes are:

*Synchronous vs asynchronous training:*These are two common ways of distributing training with data parallelism. In sync training, all workers train over different slices of input data in sync, and aggregating gradients at each step. In async training, all workers are independently training over the input data and updating variables asynchronously. Typically sync training is supported via all-reduce and async through parameter server architecture.*Hardware platform:*You may want to scale your training onto multiple GPUs on one machine, or multiple machines in a network (with 0 or more GPUs each), or on Cloud TPUs.

In order to support these use cases, TensorFlow has `MirroredStrategy`

, `TPUStrategy`

, `MultiWorkerMirroredStrategy`

, `ParameterServerStrategy`

, `CentralStorageStrategy`

, as well as other strategies available. The next section explains which of these are supported in which scenarios in TensorFlow. Here is a quick overview:

Training API | `MirroredStrategy` |
`TPUStrategy` |
`MultiWorkerMirroredStrategy` |
`CentralStorageStrategy` |
`ParameterServerStrategy` |
---|---|---|---|---|---|

Keras `Model.fit` |
Supported | Supported | Supported | Experimental support | Experimental support |

Custom training loop |
Supported | Supported | Supported | Experimental support | Experimental support |

Estimator API |
Limited Support | Not supported | Limited Support | Limited Support | Limited Support |

### MirroredStrategy

`tf.distribute.MirroredStrategy`

supports synchronous distributed training on multiple GPUs on one machine. It creates one replica per GPU device. Each variable in the model is mirrored across all the replicas. Together, these variables form a single conceptual variable called `MirroredVariable`

. These variables are kept in sync with each other by applying identical updates.

Efficient all-reduce algorithms are used to communicate the variable updates across the devices. All-reduce aggregates tensors across all the devices by adding them up, and makes them available on each device. It’s a fused algorithm that is very efficient and can reduce the overhead of synchronization significantly. There are many all-reduce algorithms and implementations available, depending on the type of communication available between devices. By default, it uses the NVIDIA Collective Communication Library (NCCL) as the all-reduce implementation. You can choose from a few other options or write your own.

Here is the simplest way of creating `MirroredStrategy`

:

```
mirrored_strategy = tf.distribute.MirroredStrategy()
```

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3') WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1721356642.657353 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356642.661130 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356642.664453 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356642.667694 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356642.678867 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356642.682371 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356642.685198 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356642.688116 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356642.691081 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356642.694438 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356642.697439 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356642.700370 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356643.939825 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356643.941902 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356643.943946 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356643.946041 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356643.948116 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356643.950046 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356643.951993 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356643.953949 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356643.955925 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356643.957811 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356643.959810 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356643.961790 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356644.000054 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356644.002033 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356644.004033 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356644.006043 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356644.008046 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356644.009959 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356644.011912 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356644.013867 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356644.015890 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356644.018340 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356644.020704 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355 I0000 00:00:1721356644.023046 119446 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

This will create a `MirroredStrategy`

instance, which will use all the GPUs that are visible to TensorFlow, and NCCL—as the cross-device communication.

If you wish to use only some of the GPUs on your machine, you can do so like this:

```
mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
```

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')

If you wish to override the cross device communication, you can do so using the `cross_device_ops`

argument by supplying an instance of `tf.distribute.CrossDeviceOps`

. Currently, `tf.distribute.HierarchicalCopyAllReduce`

and `tf.distribute.ReductionToOneDevice`

are two options other than `tf.distribute.NcclAllReduce`

, which is the default.

```
mirrored_strategy = tf.distribute.MirroredStrategy(
cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
```

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')

### TPUStrategy

`tf.distribute.TPUStrategy`

lets you run your TensorFlow training on Tensor Processing Units (TPUs). TPUs are Google's specialized ASICs designed to dramatically accelerate machine learning workloads. They are available on Google Colab, the TPU Research Cloud, and Cloud TPU.

In terms of distributed training architecture, `TPUStrategy`

is the same `MirroredStrategy`

—it implements synchronous distributed training. TPUs provide their own implementation of efficient all-reduce and other collective operations across multiple TPU cores, which are used in `TPUStrategy`

.

Here is how you would instantiate `TPUStrategy`

:

```
cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
tpu=tpu_address)
tf.config.experimental_connect_to_cluster(cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
tpu_strategy = tf.distribute.TPUStrategy(cluster_resolver)
```

The `TPUClusterResolver`

instance helps locate the TPUs. In Colab, you don't need to specify any arguments to it.

If you want to use this for Cloud TPUs:

- You must specify the name of your TPU resource in the
`tpu`

argument. - You must initialize the TPU system explicitly at the
*start*of the program. This is required before TPUs can be used for computation. Initializing the TPU system also wipes out the TPU memory, so it's important to complete this step first in order to avoid losing state.

### MultiWorkerMirroredStrategy

`tf.distribute.MultiWorkerMirroredStrategy`

is very similar to `MirroredStrategy`

. It implements synchronous distributed training across multiple workers, each with potentially multiple GPUs. Similar to `tf.distribute.MirroredStrategy`

, it creates copies of all variables in the model on each device across all workers.

Here is the simplest way of creating `MultiWorkerMirroredStrategy`

:

```
strategy = tf.distribute.MultiWorkerMirroredStrategy()
```

WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled. INFO:tensorflow:Using MirroredStrategy with devices ('/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3') INFO:tensorflow:Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3'), communication = CommunicationImplementation.AUTO

`MultiWorkerMirroredStrategy`

has two implementations for cross-device communications. `CommunicationImplementation.RING`

is RPC-based and supports both CPUs and GPUs. `CommunicationImplementation.NCCL`

uses NCCL and provides state-of-art performance on GPUs but it doesn't support CPUs. `CollectiveCommunication.AUTO`

defers the choice to Tensorflow. You can specify them in the following way:

```
communication_options = tf.distribute.experimental.CommunicationOptions(
implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)
strategy = tf.distribute.MultiWorkerMirroredStrategy(
communication_options=communication_options)
```

WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled. INFO:tensorflow:Using MirroredStrategy with devices ('/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3') INFO:tensorflow:Single-worker MultiWorkerMirroredStrategy with local_devices = ('/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3'), communication = CommunicationImplementation.NCCL

One of the key differences to get multi worker training going, as compared to multi-GPU training, is the multi-worker setup. The `'TF_CONFIG'`

environment variable is the standard way in TensorFlow to specify the cluster configuration to each worker that is part of the cluster. Learn more in the setting up TF_CONFIG section of this document.

For more details about `MultiWorkerMirroredStrategy`

, consider the following tutorials:

### ParameterServerStrategy

Parameter server training is a common data-parallel method to scale up model training on multiple machines. A parameter server training cluster consists of workers and parameter servers. Variables are created on parameter servers and they are read and updated by workers in each step. Check out the Parameter server training tutorial for details.

In TensorFlow 2, parameter server training uses a central coordinator-based architecture via the `tf.distribute.experimental.coordinator.ClusterCoordinator`

class.

In this implementation, the `worker`

and `parameter server`

tasks run `tf.distribute.Server`

s that listen for tasks from the coordinator. The coordinator creates resources, dispatches training tasks, writes checkpoints, and deals with task failures.

In the programming running on the coordinator, you will use a `ParameterServerStrategy`

object to define a training step and use a `ClusterCoordinator`

to dispatch training steps to remote workers. Here is the simplest way to create them:

```
strategy = tf.distribute.experimental.ParameterServerStrategy(
tf.distribute.cluster_resolver.TFConfigClusterResolver(),
variable_partitioner=variable_partitioner)
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(
strategy)
```

To learn more about `ParameterServerStrategy`

, check out the Parameter server training with Keras Model.fit and a custom training loop tutorial.

In TensorFlow 1, `ParameterServerStrategy`

is available only with an Estimator via `tf.compat.v1.distribute.experimental.ParameterServerStrategy`

symbol.

### CentralStorageStrategy

`tf.distribute.experimental.CentralStorageStrategy`

does synchronous training as well. Variables are not mirrored, instead they are placed on the CPU and operations are replicated across all local GPUs. If there is only one GPU, all variables and operations will be placed on that GPU.

Create an instance of `CentralStorageStrategy`

by:

```
central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()
```

INFO:tensorflow:ParameterServerStrategy (CentralStorageStrategy if you are using a single machine) with compute_devices = ['/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3'], variable_device = '/device:CPU:0'

This will create a `CentralStorageStrategy`

instance which will use all visible GPUs and CPU. Update to variables on replicas will be aggregated before being applied to variables.

### Other strategies

In addition to the above strategies, there are two other strategies which might be useful for prototyping and debugging when using `tf.distribute`

APIs.

#### Default Strategy

The Default Strategy is a distribution strategy which is present when no explicit distribution strategy is in scope. It implements the `tf.distribute.Strategy`

interface but is a pass-through and provides no actual distribution. For instance, `Strategy.run(fn)`

will simply call `fn`

. Code written using this strategy should behave exactly as code written without any strategy. You can think of it as a "no-op" strategy.

The Default Strategy is a singleton—and one cannot create more instances of it. It can be obtained using `tf.distribute.get_strategy`

outside any explicit strategy's scope (the same API that can be used to get the current strategy inside an explicit strategy's scope).

```
default_strategy = tf.distribute.get_strategy()
```

This strategy serves two main purposes:

- It allows writing distribution-aware library code unconditionally. For example, in
`tf.keras.optimizers`

you can use`tf.distribute.get_strategy`

and use that strategy for reducing gradients—it will always return a strategy object on which you can call the`Strategy.reduce`

API.

```
# In optimizer or other library code
# Get currently active strategy
strategy = tf.distribute.get_strategy()
strategy.reduce("SUM", 1., axis=None) # reduce some values
```

1.0

- Similar to library code, it can be used to write end users' programs to work with and without distribution strategy, without requiring conditional logic. Here's a sample code snippet illustrating this:

```
if tf.config.list_physical_devices('GPU'):
strategy = tf.distribute.MirroredStrategy()
else: # Use the Default Strategy
strategy = tf.distribute.get_strategy()
with strategy.scope():
# Do something interesting
print(tf.Variable(1.))
```

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3') MirroredVariable:{ 0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>, 1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=1.0>, 2: <tf.Variable 'Variable/replica_2:0' shape=() dtype=float32, numpy=1.0>, 3: <tf.Variable 'Variable/replica_3:0' shape=() dtype=float32, numpy=1.0> }

#### OneDeviceStrategy

`tf.distribute.OneDeviceStrategy`

is a strategy to place all variables and computation on a single specified device.

```
strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
```

This strategy is distinct from the Default Strategy in a number of ways. In the Default Strategy, the variable placement logic remains unchanged when compared to running TensorFlow without any distribution strategy. But when using `OneDeviceStrategy`

, all variables created in its scope are explicitly placed on the specified device. Moreover, any functions called via `OneDeviceStrategy.run`

will also be placed on the specified device.

Input distributed through this strategy will be prefetched to the specified device. In the Default Strategy, there is no input distribution.

Similar to the Default Strategy, this strategy could also be used to test your code before switching to other strategies which actually distribute to multiple devices/machines. This will exercise the distribution strategy machinery somewhat more than the Default Strategy, but not to the full extent of using, for example, `MirroredStrategy`

or `TPUStrategy`

. If you want code that behaves as if there is no strategy, then use the Default Strategy.

So far you've learned about different strategies and how you can instantiate them. The next few sections show the different ways in which you can use them to distribute your training.

## Use tf.distribute.Strategy with Keras Model.fit

`tf.distribute.Strategy`

is integrated into `tf.keras`

, which is TensorFlow's implementation of the Keras API specification. `tf.keras`

is a high-level API to build and train models. By integrating into the `tf.keras`

backend, it's seamless for you to distribute your training written in the Keras training framework using Model.fit.

Here's what you need to change in your code:

- Create an instance of the appropriate
`tf.distribute.Strategy`

. - Move the creation of Keras model, optimizer and metrics inside
`strategy.scope`

. Thus the code in the model's`call()`

,`train_step()`

, and`test_step()`

methods will all be distributed and executed on the accelerator(s).

TensorFlow distribution strategies support all types of Keras models—Sequential, Functional, and subclassed

Here is a snippet of code to do this for a very simple Keras model with one `Dense`

layer:

```
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
model = tf.keras.Sequential([
tf.keras.layers.Dense(1, input_shape=(1,),
kernel_regularizer=tf.keras.regularizers.L2(1e-4))])
model.compile(loss='mse', optimizer='sgd')
```

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3') /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/keras/src/layers/core/dense.py:87: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(activity_regularizer=activity_regularizer, **kwargs)

This example uses `MirroredStrategy`

, so you can run this on a machine with multiple GPUs. `strategy.scope()`

indicates to Keras which strategy to use to distribute the training. Creating models/optimizers/metrics inside this scope allows you to create distributed variables instead of regular variables. Once this is set up, you can fit your model like you would normally. `MirroredStrategy`

takes care of replicating the model's training on the available GPUs, aggregating gradients, and more.

```
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
model.fit(dataset, epochs=2)
model.evaluate(dataset)
```

Epoch 1/2 10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - loss: 0.0015 Epoch 2/2 10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 7.0200e-04 10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 4.2109e-04 0.00042108571506105363

Here a `tf.data.Dataset`

provides the training and eval input. You can also use NumPy arrays:

```
import numpy as np
inputs, targets = np.ones((100, 1)), np.ones((100, 1))
model.fit(inputs, targets, epochs=2, batch_size=10)
```

Epoch 1/2 10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 3.6834e-04 Epoch 2/2 10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 2.2087e-04 <keras.src.callbacks.history.History at 0x7f09c83f1670>

In both cases—with `Dataset`

or NumPy—each batch of the given input is divided equally among the multiple replicas. For instance, if you are using the `MirroredStrategy`

with 2 GPUs, each batch of size 10 will be divided among the 2 GPUs, with each receiving 5 input examples in each step. Each epoch will then train faster as you add more GPUs. Typically, you would want to increase your batch size as you add more accelerators, so as to make effective use of the extra computing power. You will also need to re-tune your learning rate, depending on the model. You can use `strategy.num_replicas_in_sync`

to get the number of replicas.

```
mirrored_strategy.num_replicas_in_sync
```

4

```
# Compute a global batch size using a number of replicas.
BATCH_SIZE_PER_REPLICA = 5
global_batch_size = (BATCH_SIZE_PER_REPLICA *
mirrored_strategy.num_replicas_in_sync)
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100)
dataset = dataset.batch(global_batch_size)
LEARNING_RATES_BY_BATCH_SIZE = {5: 0.1, 10: 0.15, 20:0.175}
learning_rate = LEARNING_RATES_BY_BATCH_SIZE[global_batch_size]
```

### What's supported now?

Training API | `MirroredStrategy` |
`TPUStrategy` |
`MultiWorkerMirroredStrategy` |
`ParameterServerStrategy` |
`CentralStorageStrategy` |
---|---|---|---|---|---|

Keras `Model.fit` |
Supported | Supported | Supported | Experimental support | Experimental support |

### Examples and tutorials

Here is a list of tutorials and examples that illustrate the above integration end-to-end with Keras `Model.fit`

:

- Tutorial: Training with
`Model.fit`

and`MirroredStrategy`

. - Tutorial: Training with
`Model.fit`

and`MultiWorkerMirroredStrategy`

. - Guide: Contains an example of using
`Model.fit`

and`TPUStrategy`

. - Tutorial: Parameter server training with
`Model.fit`

and`ParameterServerStrategy`

. - Tutorial: Fine-tuning BERT for many tasks from the GLUE benchmark with
`Model.fit`

and`TPUStrategy`

. - TensorFlow Model Garden repository containing collections of state-of-the-art models implemented using various strategies.

## Use tf.distribute.Strategy with custom training loops

As demonstrated above, using `tf.distribute.Strategy`

with Keras `Model.fit`

requires changing only a couple lines of your code. With a little more effort, you can also use `tf.distribute.Strategy`

with custom training loops.

If you need more flexibility and control over your training loops than is possible with Estimator or Keras, you can write custom training loops. For instance, when using a GAN, you may want to take a different number of generator or discriminator steps each round. Similarly, the high level frameworks are not very suitable for Reinforcement Learning training.

The `tf.distribute.Strategy`

classes provide a core set of methods to support custom training loops. Using these may require minor restructuring of the code initially, but once that is done, you should be able to switch between GPUs, TPUs, and multiple machines simply by changing the strategy instance.

Below is a brief snippet illustrating this use case for a simple training example using the same Keras model as before.

First, create the model and optimizer inside the strategy's scope. This ensures that any variables created with the model and optimizer are mirrored variables.

```
with mirrored_strategy.scope():
model = tf.keras.Sequential([
tf.keras.layers.Dense(1, input_shape=(1,),
kernel_regularizer=tf.keras.regularizers.L2(1e-4))])
optimizer = tf.keras.optimizers.SGD()
```

Next, create the input dataset and call `tf.distribute.Strategy.experimental_distribute_dataset`

to distribute the dataset based on the strategy.

```
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(1000).batch(
global_batch_size)
dist_dataset = mirrored_strategy.experimental_distribute_dataset(dataset)
```

Then, define one step of the training. Use `tf.GradientTape`

to compute gradients and optimizer to apply those gradients to update your model's variables. To distribute this training step, put it in a function `train_step`

and pass it to `tf.distribute.Strategy.run`

along with the dataset inputs you got from the `dist_dataset`

created before:

```
# Sets `reduction=NONE` to leave it to tf.nn.compute_average_loss() below.
loss_object = tf.keras.losses.BinaryCrossentropy(
from_logits=True,
reduction=tf.keras.losses.Reduction.NONE)
def train_step(inputs):
features, labels = inputs
with tf.GradientTape() as tape:
predictions = model(features, training=True)
per_example_loss = loss_object(labels, predictions)
loss = tf.nn.compute_average_loss(per_example_loss)
model_losses = model.losses
if model_losses:
loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
return loss
@tf.function
def distributed_train_step(dist_inputs):
per_replica_losses = mirrored_strategy.run(train_step, args=(dist_inputs,))
return mirrored_strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
axis=None)
```

A few other things to note in the code above:

You used

`tf.nn.compute_average_loss`

to reduce the per-example prediction losses to a scalar.`tf.nn.compute_average_loss`

sums the per example loss and divides the sum by the global batch size. This is important because later after the gradients are calculated on each replica, they are aggregated across the replicas by**summing**them.By default, the global batch size is taken to be

`tf.get_strategy().num_replicas_in_sync * tf.shape(per_example_loss)[0]`

. It can also be specified explicitly as a keyword argument`global_batch_size=`

. Without short batches, the default is equivalent to`tf.nn.compute_average_loss(..., global_batch_size=global_batch_size)`

with the`global_batch_size`

defined above. (For more on short batches and how to avoid or handle them, see the Custom Training tutorial.)You used

`tf.nn.scale_regularization_loss`

to scale regularization losses registered with the`Model`

object, if any, by`1/num_replicas_in_sync`

as well. For those regularization losses that are input-dependent, it falls on the modeling code, not the custom training loop, to perform the averaging over the per-replica(!) batch size; that way the modeling code can remain agnostic of replication while the training loop remains agnostic of how regularization losses are computed.When you call

`apply_gradients`

within a distribution strategy scope, its behavior is modified. Specifically, before applying gradients on each parallel instance during synchronous training, it performs a sum-over-all-replicas of the gradients.You also used the

`tf.distribute.Strategy.reduce`

API to aggregate the results returned by`tf.distribute.Strategy.run`

for reporting.`tf.distribute.Strategy.run`

returns results from each local replica in the strategy, and there are multiple ways to consume this result. You can`reduce`

them to get an aggregated value. You can also do`tf.distribute.Strategy.experimental_local_results`

to get the list of values contained in the result, one per local replica.

Finally, once you have defined the training step, you can iterate over `dist_dataset`

and run the training in a loop:

```
for dist_inputs in dist_dataset:
print(distributed_train_step(dist_inputs))
```

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',). tf.Tensor(0.29947755, shape=(), dtype=float32) tf.Tensor(0.2991433, shape=(), dtype=float32) tf.Tensor(0.29880974, shape=(), dtype=float32) tf.Tensor(0.29847682, shape=(), dtype=float32) tf.Tensor(0.29814446, shape=(), dtype=float32) tf.Tensor(0.2978128, shape=(), dtype=float32) tf.Tensor(0.29748178, shape=(), dtype=float32) tf.Tensor(0.2971514, shape=(), dtype=float32) tf.Tensor(0.29682156, shape=(), dtype=float32) tf.Tensor(0.29649243, shape=(), dtype=float32) tf.Tensor(0.2961639, shape=(), dtype=float32) tf.Tensor(0.295836, shape=(), dtype=float32) tf.Tensor(0.29550874, shape=(), dtype=float32) tf.Tensor(0.29518208, shape=(), dtype=float32) tf.Tensor(0.29485607, shape=(), dtype=float32) tf.Tensor(0.2945307, shape=(), dtype=float32) tf.Tensor(0.2942059, shape=(), dtype=float32) tf.Tensor(0.2938817, shape=(), dtype=float32) tf.Tensor(0.29355815, shape=(), dtype=float32) tf.Tensor(0.2932352, shape=(), dtype=float32) tf.Tensor(0.29291287, shape=(), dtype=float32) tf.Tensor(0.29259112, shape=(), dtype=float32) tf.Tensor(0.29226997, shape=(), dtype=float32) tf.Tensor(0.29194948, shape=(), dtype=float32) tf.Tensor(0.29162955, shape=(), dtype=float32) tf.Tensor(0.29131028, shape=(), dtype=float32) tf.Tensor(0.29099157, shape=(), dtype=float32) tf.Tensor(0.2906735, shape=(), dtype=float32) tf.Tensor(0.29035598, shape=(), dtype=float32) tf.Tensor(0.29003906, shape=(), dtype=float32) tf.Tensor(0.2897228, shape=(), dtype=float32) tf.Tensor(0.28940707, shape=(), dtype=float32) tf.Tensor(0.2890919, shape=(), dtype=float32) tf.Tensor(0.2887774, shape=(), dtype=float32) tf.Tensor(0.28846347, shape=(), dtype=float32) tf.Tensor(0.28815013, shape=(), dtype=float32) tf.Tensor(0.28783742, shape=(), dtype=float32) tf.Tensor(0.28752527, shape=(), dtype=float32) tf.Tensor(0.28721365, shape=(), dtype=float32) tf.Tensor(0.28690264, shape=(), dtype=float32) tf.Tensor(0.28659227, shape=(), dtype=float32) tf.Tensor(0.28628245, shape=(), dtype=float32) tf.Tensor(0.28597325, shape=(), dtype=float32) tf.Tensor(0.28566453, shape=(), dtype=float32) tf.Tensor(0.28535643, shape=(), dtype=float32) tf.Tensor(0.28504893, shape=(), dtype=float32) tf.Tensor(0.28474197, shape=(), dtype=float32) tf.Tensor(0.28443557, shape=(), dtype=float32) tf.Tensor(0.28412983, shape=(), dtype=float32) tf.Tensor(0.28382456, shape=(), dtype=float32)

In the example above, you iterated over the `dist_dataset`

to provide input to your training. You are also provided with the `tf.distribute.Strategy.make_experimental_numpy_dataset`

to support NumPy inputs. You can use this API to create a dataset before calling `tf.distribute.Strategy.experimental_distribute_dataset`

.

Another way of iterating over your data is to explicitly use iterators. You may want to do this when you want to run for a given number of steps as opposed to iterating over the entire dataset. The above iteration would now be modified to first create an iterator and then explicitly call `next`

on it to get the input data.

```
iterator = iter(dist_dataset)
for _ in range(10):
print(distributed_train_step(next(iterator)))
```

tf.Tensor(0.2835199, shape=(), dtype=float32) tf.Tensor(0.2832158, shape=(), dtype=float32) tf.Tensor(0.28291225, shape=(), dtype=float32) tf.Tensor(0.28260925, shape=(), dtype=float32) tf.Tensor(0.28230688, shape=(), dtype=float32) tf.Tensor(0.28200498, shape=(), dtype=float32) tf.Tensor(0.28170374, shape=(), dtype=float32) tf.Tensor(0.281403, shape=(), dtype=float32) tf.Tensor(0.2811028, shape=(), dtype=float32) tf.Tensor(0.2808032, shape=(), dtype=float32)

This covers the simplest case of using `tf.distribute.Strategy`

API to distribute custom training loops.

### What's supported now?

Training API | `MirroredStrategy` |
`TPUStrategy` |
`MultiWorkerMirroredStrategy` |
`ParameterServerStrategy` |
`CentralStorageStrategy` |
---|---|---|---|---|---|

Custom training loop | Supported | Supported | Supported | Experimental support | Experimental support |

### Examples and tutorials

Here are some examples for using distribution strategies with custom training loops:

- Tutorial: Training with a custom training loop and
`MirroredStrategy`

. - Tutorial: Training with a custom training loop and
`MultiWorkerMirroredStrategy`

. - Guide: Contains an example of a custom training loop with
`TPUStrategy`

. - Tutorial: Parameter server training with a custom training loop and
`ParameterServerStrategy`

. - TensorFlow Model Garden repository containing collections of state-of-the-art models implemented using various strategies.

## Other topics

This section covers some topics that are relevant to multiple use cases.

### Setting up the TF_CONFIG environment variable

For multi-worker training, as mentioned before, you need to set up the `'TF_CONFIG'`

environment variable for each binary running in your cluster. The `'TF_CONFIG'`

environment variable is a JSON string which specifies what tasks constitute a cluster, their addresses and each task's role in the cluster. The `tensorflow/ecosystem`

repo provides a Kubernetes template, which sets up `'TF_CONFIG'`

for your training tasks.

There are two components of `'TF_CONFIG'`

: a cluster and a task.

- A cluster provides information about the training cluster, which is a dict consisting of different types of jobs such as workers. In multi-worker training, there is usually one worker that takes on a little more responsibility like saving checkpoint and writing summary file for TensorBoard in addition to what a regular worker does. Such worker is referred to as the "chief" worker, and it is customary that the worker with index
`0`

is appointed as the chief worker (in fact this is how`tf.distribute.Strategy`

is implemented). - A task on the other hand provides information about the current task. The first component cluster is the same for all workers, and the second component task is different on each worker and specifies the type and index of that worker.

One example of `'TF_CONFIG'`

is:

```
os.environ["TF_CONFIG"] = json.dumps({
"cluster": {
"worker": ["host1:port", "host2:port", "host3:port"],
"ps": ["host4:port", "host5:port"]
},
"task": {"type": "worker", "index": 1}
})
```

This `'TF_CONFIG'`

specifies that there are three workers and two `"ps"`

tasks in the `"cluster"`

along with their hosts and ports. The `"task"`

part specifies the role of the current task in the `"cluster"`

—worker `1`

(the second worker). Valid roles in a cluster are `"chief"`

, `"worker"`

, `"ps"`

, and `"evaluator"`

. There should be no `"ps"`

job except when using `tf.distribute.experimental.ParameterServerStrategy`

.

## What's next?

`tf.distribute.Strategy`

is actively under development. Try it out and provide your feedback using GitHub issues.