TensorFlow is back at Google I/O on May 14! Register now

tf.train.CheckpointManager

{ }

View source on GitHub

Manages multiple checkpoints by keeping some and deleting unneeded ones.

View aliases

Compat aliases for migration

See Migration guide for more details.

tf.compat.v1.train.CheckpointManager

tf.train.CheckpointManager(
    checkpoint,
    directory,
    max_to_keep,
    keep_checkpoint_every_n_hours=None,
    checkpoint_name='ckpt',
    step_counter=None,
    checkpoint_interval=None,
    init_fn=None
)

Used in the notebooks

Used in the guide	Used in the tutorials
Training checkpoints tf.data: Build TensorFlow input pipelines Migrate the fault tolerance mechanism	Distributed training with DTensors Custom training loop with Keras and MultiWorkerMirroredStrategy Multi-worker training with Keras CycleGAN Training with Orbit

Example usage:

import tensorflow as tf
checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
manager = tf.train.CheckpointManager(
    checkpoint, directory="/tmp/model", max_to_keep=5)
status = checkpoint.restore(manager.latest_checkpoint)
while True:
  # train
  manager.save()

CheckpointManager preserves its own state across instantiations (see the __init__ documentation for details). Only one should be active in a particular directory at a time.

Args
`checkpoint`	The `tf.train.Checkpoint` instance to save and manage checkpoints for.
`directory`	The path to a directory in which to write checkpoints. A special file named "checkpoint" is also written to this directory (in a human-readable text format) which contains the state of the `CheckpointManager`.
`max_to_keep`	An integer, the number of checkpoints to keep. Unless preserved by `keep_checkpoint_every_n_hours`, checkpoints will be deleted from the active set, oldest first, until only `max_to_keep` checkpoints remain. If `None`, no checkpoints are deleted and everything stays in the active set. Note that `max_to_keep=None` will keep all checkpoint paths in memory and in the checkpoint state protocol buffer on disk.
`keep_checkpoint_every_n_hours`	Upon removal from the active set, a checkpoint will be preserved if it has been at least `keep_checkpoint_every_n_hours` since the last preserved checkpoint. The default setting of `None` does not preserve any checkpoints in this way.
`checkpoint_name`	Custom name for the checkpoint file.
`step_counter`	A `tf.Variable` instance for checking the current step counter value, in case users want to save checkpoints every N steps.
`checkpoint_interval`	An integer, indicates the minimum step interval between two checkpoints.
`init_fn`	Callable. A function to do customized intialization if no checkpoints are in the directory.

Raises
`ValueError`	If `max_to_keep` is not a positive integer.

Attributes
`checkpoint`	Returns the `tf.train.Checkpoint` object.
`checkpoint_interval`
`checkpoints`	A list of managed checkpoints. Note that checkpoints saved due to `keep_checkpoint_every_n_hours` will not show up in this list (to avoid ever-growing filename lists).
`directory`
`latest_checkpoint`	The prefix of the most recent checkpoint in `directory`. Equivalent to `tf.train.latest_checkpoint(directory)` where `directory` is the constructor argument to `CheckpointManager`. Suitable for passing to `tf.train.Checkpoint.restore` to resume training.

Methods

`restore_or_initialize`

View source

restore_or_initialize()

Restore items in checkpoint from the latest checkpoint file.

This method will first try to restore from the most recent checkpoint in directory. If no checkpoints exist in directory, and init_fn is specified, this method will call init_fn to do customized initialization. This can be used to support initialization from pretrained models.

Note that unlike tf.train.Checkpoint.restore(), this method doesn't return a load status object that users can run assertions on (e.g. assert_consumed()). Thus to run assertions, users should directly use tf.train.Checkpoint.restore() method.

Returns
The restored checkpoint path if the lastest checkpoint is found and restored. Otherwise None.

`save`

View source

save(
    checkpoint_number=None, check_interval=True, options=None
)

Creates a new checkpoint and manages it.

Args
`checkpoint_number`	An optional integer, or an integer-dtype `Variable` or `Tensor`, used to number the checkpoint. If `None` (default), checkpoints are numbered using `checkpoint.save_counter`. Even if `checkpoint_number` is provided, `save_counter` is still incremented. A user-provided `checkpoint_number` is not incremented even if it is a `Variable`.
`check_interval`	An optional boolean. The argument is only effective when `checkpoint_interval` is passed into the manager. If `True`, the manager will only save the checkpoint if the interval between checkpoints is larger than `checkpoint_interval`. Otherwise it will always save the checkpoint unless a checkpoint has already been saved for the current step.
`options`	Optional `tf.train.CheckpointOptions` object. This argument only works with TF2 checkpoint objects. For example, options = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')

Returns
The path to the new checkpoint. It is also recorded in the `checkpoints` and `latest_checkpoint` properties. `None` if no checkpoint is saved.

`sync`

View source

sync()

Wait for any outstanding save or restore operations.

tf.train.CheckpointManager

View aliases

Used in the notebooks

Example usage:

Args

Raises

Attributes

Methods

restore_or_initialize

save

sync

`restore_or_initialize`

`save`

`sync`