Callback to back up and restore the training state.

Inherits From: Callback

Used in the notebooks

Used in the tutorials

BackupAndRestore callback is intended to recover from interruptions that happened in the middle of a execution by backing up the training states in a temporary checkpoint file (based on TF CheckpointManager) at the end of each epoch. If training restarted before completion, the training state and model are restored to the most recently saved state at the beginning of a new run. Note that user is responsible to bring jobs back up. This callback is important for the backup and restore mechanism for fault tolerance purpose. And the model to be restored from an previous checkpoint is expected to be the same as the one used to back up. If user changes arguments passed to compile or fit, the checkpoint saved for fault tolerance can become invalid.


  1. This callback is not compatible with disabling eager execution.
  2. A checkpoint is saved at the end of each epoch, when restoring we'll redo any partial work from an unfinished epoch in which the training got restarted (so the work done before a interruption doesn't affect the final model state).
  3. This works for both single worker and multi-worker mode, only MirroredStrategy and MultiWorkerMirroredStrategy are supported for now.


class InterruptingCallback(tf.keras.callbacks.Callback):
  def on_epoch_begin(self, epoch, logs=None):