tf.distribute.experimental.coordinator.ClusterCoordinator

An object to schedule and coordinate remote function execution.


This class is used to create fault-tolerant resources and dispatch functions to remote TensorFlow servers.

Currently, this class cannot be used in a standalone manner. It should be used in conjunction with a tf.distribute strategy that is designed to work with it. The ClusterCoordinator class currently only works with tf.distribute.experimental.ParameterServerStrategy.

The schedule/join APIs

The most important APIs provided by this class are the schedule/join pair. The schedule API is non-blocking: it queues a tf.function and returns a RemoteValue immediately. The queued functions are dispatched to remote workers in background threads, and their RemoteValues are filled asynchronously. Since schedule does not require worker assignment, the tf.function passed in can be executed on any available worker. If the worker it is executed on becomes unavailable before completion, the function will be migrated to another worker. Because of this, and because function execution is not atomic, a function may be executed more than once.
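A minimal sketch of the schedule/join flow is shown below. It assumes a parameter server cluster has already been started and is described by the TF_CONFIG environment variable; the variable and the scheduled function are illustrative.

import tensorflow as tf

# Illustrative assumption: the cluster is described by TF_CONFIG.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)
coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)

# Variables should be created under the strategy's scope so they are
# placed on the parameter servers.
with strategy.scope():
  v = tf.Variable(initial_value=0.0)

@tf.function
def worker_fn():
  v.assign_add(1.0)
  return v.read_value()

# `schedule` is non-blocking: it queues `worker_fn` and returns a RemoteValue.
remote_value = coordinator.schedule(worker_fn)
# `join` blocks until all scheduled functions have finished executing.
coordinator.join()
print(remote_value.fetch())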

Handling Task Failure

When used with tf.distribute.experimental.ParameterServerStrategy, this class comes with built-in fault tolerance for worker failures. That is, when some workers become unreachable from the coordinator for any reason, training continues with the remaining workers. Upon recovery of a failed worker, it is added back for function execution once the datasets created by create_per_worker_dataset have been re-built on it, as sketched below.
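The following sketch shows a per-worker dataset used with schedule, reusing the strategy and coordinator from the example above; the dataset contents and batch size are illustrative assumptions.

# A per-worker dataset is built from a dataset function so that it can be
# re-created on a worker after that worker recovers from a failure.
@tf.function
def per_worker_dataset_fn():
  return tf.data.Dataset.from_tensor_slices(tf.range(10.)).repeat().batch(2)

per_worker_dataset = coordinator.create_per_worker_dataset(per_worker_dataset_fn)
per_worker_iterator = iter(per_worker_dataset)

@tf.function
def train_step(iterator):
  # Each scheduled execution pulls the next batch from the worker-local
  # iterator; if the worker fails mid-step, the step is rescheduled elsewhere.
  batch = next(iterator)
  return tf.reduce_sum(batch)

result = coordinator.schedule(train_step, args=(per_worker_iterator,))
coordinator.join()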

When a parameter server fails, a tf.errors.UnavailableError is raised by schedule, join or done. In this case, in addition to bringing back the failed parameter server, users should restart the coordinator so that it reconnects to workers and parameter servers, re-creates the variables, and loads checkpoints. If the coordinator fails, after the user brings it back, the program will automatically connect to workers and parameter servers, and resume progress from a checkpoint.

It is thus essential that the user's program saves a checkpoint file periodically and restores it at the start of the program. If a tf.keras.optimizers.Optimizer is checkpointed, then after restoring from a checkpoint, its iterations property roughly indicates the number of steps that have been run. This can be used to decide how many epochs and steps remain before training completes.
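A minimal checkpointing sketch is shown below, reusing strategy, coordinator, train_step and per_worker_iterator from the examples above. The directory, step counts and optimizer are illustrative assumptions; in a real training step the optimizer would apply gradients so that its iterations counter actually advances.

with strategy.scope():
  optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)  # illustrative

checkpoint = tf.train.Checkpoint(optimizer=optimizer)
checkpoint_manager = tf.train.CheckpointManager(
    checkpoint, directory="/tmp/ckpt", max_to_keep=3)

# Restore at program start; after a coordinator restart this resumes progress.
if checkpoint_manager.latest_checkpoint:
  checkpoint.restore(checkpoint_manager.latest_checkpoint)

steps_per_epoch = 100  # illustrative
total_epochs = 10      # illustrative
# `optimizer.iterations` roughly indicates how many steps have already run,
# so skip the epochs that completed before the restart.
starting_epoch = optimizer.iterations.numpy() // steps_per_epoch

for _ in range(starting_epoch, total_epochs):
  for _ in range(steps_per_epoch):
    coordinator.schedule(train_step, args=(per_worker_iterator,))
  coordinator.join()
  checkpoint_manager.save()  # save periodically, e.g. once per epoch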

See tf.distribute.experimental.ParameterServerStrategy docstring for an example usage of this API.

This is currently under development, and the API as well as implementation are subject to changes.

Args

strategy: a supported tf.distribute.Strategy object. Currently, only tf.distribute.experimental.ParameterServerStrategy is supported.