orbit.actions.SaveCheckpointIfPreempted

Action that saves on-demand checkpoints after a preemption.

Args

cluster_resolver: A tf.distribute.cluster_resolver.ClusterResolver object.
checkpoint_manager: A tf.train.CheckpointManager object.
checkpoint_number: A tf.Variable that holds the checkpoint number passed to the checkpoint manager; usually the global step.
keep_running_after_save: Whether to keep the job running after the on-demand checkpoint is saved following a preemption. Set to True only when in-process preemption recovery with tf.distribute.experimental.PreemptionWatcher is enabled.
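
The arguments above can be wired together roughly as follows. This is a minimal sketch, not a definitive setup: the helper name `build_preemption_action`, the `max_to_keep` value, and the choice of `TFConfigClusterResolver` are assumptions made for illustration, and the imports are deferred so that defining the helper does not itself require a configured cluster.

```python
def build_preemption_action(checkpoint_dir, keep_running_after_save=False):
    """Hypothetical helper: builds a SaveCheckpointIfPreempted action."""
    # Deferred imports: TensorFlow and Orbit are only needed when called.
    import tensorflow as tf
    import orbit

    # The global step doubles as the checkpoint number (the usual choice).
    global_step = tf.Variable(
        0, dtype=tf.int64, trainable=False, name="global_step")

    # Track the step in a checkpoint; a real job would also track the
    # model and optimizer here.
    checkpoint = tf.train.Checkpoint(global_step=global_step)
    checkpoint_manager = tf.train.CheckpointManager(
        checkpoint, directory=checkpoint_dir, max_to_keep=3)

    # Assumption: the cluster is described by the TF_CONFIG environment
    # variable. Other ClusterResolver subclasses may fit other platforms.
    cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

    action = orbit.actions.SaveCheckpointIfPreempted(
        cluster_resolver=cluster_resolver,
        checkpoint_manager=checkpoint_manager,
        checkpoint_number=global_step,
        keep_running_after_save=keep_running_after_save,
    )
    return action, global_step
```

The returned action would typically be passed to the training loop alongside any other actions, for example through the `train_actions` argument of `orbit.Controller`.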

Methods

__call__

Call self as a function.
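
Because the action is invoked as a plain callable, its behavior can be illustrated without TensorFlow. The sketch below is a stand-in and not the Orbit implementation: `FakePreemptionHandler` and `SaveIfPreemptedSketch` are hypothetical names, and it assumes the action receives the loop output as its single argument and ignores it.

```python
class FakePreemptionHandler:
    """Hypothetical stand-in for a preemption-aware checkpoint handler."""

    def __init__(self):
        self.preempted = False  # flipped by the test harness, not by a signal
        self.saved = 0          # number of on-demand checkpoints "written"

    def save_checkpoint_if_preempted(self):
        # Save only when a preemption has been observed.
        if self.preempted:
            self.saved += 1


class SaveIfPreemptedSketch:
    """Hypothetical action: a callable that delegates to the handler."""

    def __init__(self, handler):
        self._handler = handler

    def __call__(self, loop_output):
        # The loop output is ignored; the call is just a periodic hook.
        self._handler.save_checkpoint_if_preempted()


handler = FakePreemptionHandler()
action = SaveIfPreemptedSketch(handler)

action({"loss": 0.5})     # no preemption yet, so nothing is saved
handler.preempted = True  # simulate a preemption notice
action({"loss": 0.4})     # this call triggers the on-demand save
```

Calling the action after every loop iteration keeps the check cheap in the common case and saves only once a preemption has been signalled.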