Module: tf.distribute

Library for running a computation across multiple devices.

The intent of this library is that you can write an algorithm in a stylized way and it will be usable with a variety of different tf.distribute.Strategy implementations. Each descendant implements a different strategy for distributing the algorithm across multiple devices or machines. The strategy-specific changes can be hidden inside the layers and other library classes that need special treatment to run in a distributed setting, so that most users' model definition code can run unchanged. The tf.distribute.Strategy API works the same way with eager and graph execution.
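As a rough illustration of the idea in the paragraph above, the following toy sketch writes a training step once and runs it under two interchangeable stand-in strategies. It is plain Python and does not use TensorFlow: `SingleDeviceToy`, `DataParallelToy`, `grad_fn`, and `train_step` are hypothetical names invented for this example and are not part of the tf.distribute API. The sketch assumes a one-parameter model y = w * x with a mean squared error loss and equally sized input slices, and it runs the "replicas" sequentially rather than on real devices.

```python
# Toy illustration only: these classes are hypothetical stand-ins,
# not tf.distribute.Strategy implementations.

def grad_fn(w, batch):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


class SingleDeviceToy:
    """Runs the computation on the whole batch in one place."""

    num_replicas = 1

    def run(self, fn, w, batch):
        return fn(w, batch)


class DataParallelToy:
    """Splits the batch into equal slices, one per replica, and averages."""

    def __init__(self, num_replicas):
        self.num_replicas = num_replicas

    def run(self, fn, w, batch):
        if len(batch) % self.num_replicas != 0:
            raise ValueError("batch size must be divisible by num_replicas")
        size = len(batch) // self.num_replicas
        slices = [batch[i * size:(i + 1) * size]
                  for i in range(self.num_replicas)]
        # Each "replica" sees one slice; here they run one after another.
        per_replica = [fn(w, s) for s in slices]
        # Averaging equal-sized slices reproduces the full-batch gradient.
        return sum(per_replica) / self.num_replicas


def train_step(strategy, w, batch, lr=0.1):
    """The algorithm, written once and independent of the strategy used."""
    return w - lr * strategy.run(grad_fn, w, batch)


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_single = train_step(SingleDeviceToy(), 0.0, batch)
w_parallel = train_step(DataParallelToy(2), 0.0, batch)
```

In this sketch `train_step` never inspects which strategy it was given, and both calls produce the same updated weight. That is the property the library aims for, although the real implementations also handle variable placement, communication between devices, and input distribution, none of which is modeled here.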

Glossary

  • Data parallelism is where we run multiple copies of the model on different slices of the input data. This is in contrast to model parallelism where we divide up a single copy of a model across multiple devices. Note: we only support data parallelism for now, but hope to add support for model parallelism in the future.
  • A device is a CPU or accelerator (e.g. GPUs, TPUs) on some machine that TensorFlow can run operations on (see e.g. tf.device). You may have multiple devices on a single machine, or be connected to devices on multiple machines. Devices used to run computations are called worker devices. Devices used to store variables are parameter devices. For some strategies, such as tf.distribute.MirroredStrategy, the worker and parameter devices will be the same (see mirrored variables below). For others they will be different. For example, tf.distribute.experimental.CentralStorageStrategy puts the variables on a single device (which may be a worker device or may be the CPU), and tf.distribute.experimental.ParameterServerStrategy puts the variables on separate machines called parameter servers (see below).
  • A replica is one copy of the model, running on one slice of the input data. Right now each replica is executed on i