tf.distribute.experimental.coordinator.ClusterCoordinator

An object to schedule and coordinate remote function execution.

This class is used to create fault-tolerant resources and dispatch functions to remote TensorFlow servers.

Currently, this class cannot be used in a standalone manner. It should be used in conjunction with a tf.distribute strategy that is designed to work with it. The ClusterCoordinator class currently only works with tf.distribute.experimental.ParameterServerStrategy.
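
For example, a coordinator is typically created by wrapping an existing strategy, along these lines (a minimal sketch; TFConfigClusterResolver assumes the cluster is described by the TF_CONFIG environment variable):

```python
import tensorflow as tf

# Resolve the cluster spec from the TF_CONFIG environment variable.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

# The coordinator only works with a strategy designed for it,
# currently ParameterServerStrategy.
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)
```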

The schedule/join APIs

The most important APIs provided by this class are the schedule/join pair. The schedule API is non-blocking in that it queues a tf.function and returns a RemoteValue immediately. The queued functions will be dispatched to remote workers in background threads and their RemoteValues will be filled asynchronously. Since schedule doesn't require worker assignment, the tf.function passed in can be executed on any available worker. If the worker it is executed on becomes unavailable before its completion, it will be migrated to another available worker.
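
The following sketch illustrates the schedule/join pair, assuming `strategy` and `coordinator` were created as in the example above:

```python
# A variable created under the strategy's scope is placed on the
# parameter servers and is visible to all workers.
with strategy.scope():
    v = tf.Variable(initial_value=0.0)

@tf.function
def worker_fn():
    v.assign_add(1.0)
    return v.read_value()

# `schedule` queues the function and returns a RemoteValue immediately;
# the result is filled in asynchronously once some worker executes it.
remote_value = coordinator.schedule(worker_fn)

# `join` blocks until all scheduled functions have completed.
coordinator.join()
print(remote_value.fetch())  # e.g. 1.0
```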