tf.train.CheckpointManager

Manages checkpoints for a tf.train.Checkpoint, deleting old ones as new ones are saved.

tf.train.CheckpointManager(
    checkpoint, directory, max_to_keep, keep_checkpoint_every_n_hours=None,
    checkpoint_name='ckpt'
)

Example usage:

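import tensorflow as tf

# `optimizer` and `model` are assumed to be defined elsewhere.
checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
manager = tf.train.CheckpointManager(
    checkpoint, directory="/tmp/model", max_to_keep=5)
# If no checkpoint exists yet, latest_checkpoint is None and nothing is restored.
status = checkpoint.restore(manager.latest_checkpoint)
while True:
  # train
  manager.save()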

CheckpointManager preserves its own state across instantiations (see the __init__ documentation for details). Only one CheckpointManager should be active in a particular directory at a time.

Args:

  • checkpoint: The tf.train.Checkpoint instance to save and manage checkpoints for.
  • directory: The path to a directory in which to write checkpoints. A special file named "checkpoint" is also written to this directory (in a human-readable text format) which contains the state of the CheckpointManager.
  • max_to_keep: An integer, the number of checkpoints to keep. Unless preserved by keep_checkpoint_every_n_hours, checkpoints will be deleted from the active set, oldest first, until only max_to_keep checkpoints remain. If None, no checkpoints are deleted and everything stays in the active set. Note that max_to_keep=None will keep all checkpoint paths in memory and in the checkpoint state protocol buffer on disk.
  • keep_checkpoint_every_n_hours: Upon removal from the active set, a checkpoint will be preserved if at least keep_checkpoint_every_n_hours hours have passed since the last preserved checkpoint. The default setting of None does not preserve any checkpoints in this way.
  • checkpoint_name: Custom name for the checkpoint file, used as the prefix of each checkpoint's file name (for example "ckpt-1" with the default 'ckpt').
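
The interaction between max_to_keep and keep_checkpoint_every_n_hours can be easier to follow as code. The following is a simplified pure-Python model of the rule described above, not the library's implementation. The function name sweep, its arguments, and the starting "last preserved" timestamp of 0.0 are assumptions made for illustration.

```python
from collections import OrderedDict


def sweep(active, max_to_keep, keep_every_n_hours, last_preserved_ts):
    """Illustrative model of the retention rule (hypothetical helper).

    active: OrderedDict of checkpoint name -> save time in seconds, oldest first.
    Returns (deleted, preserved, last_preserved_ts) and mutates `active`.
    """
    deleted, preserved = [], []
    if not max_to_keep:
        # max_to_keep=None: nothing ever leaves the active set.
        return deleted, preserved, last_preserved_ts
    while len(active) > max_to_keep:
        name, ts = active.popitem(last=False)  # remove the oldest entry first
        if (keep_every_n_hours is not None
                and ts - keep_every_n_hours * 3600.0 >= last_preserved_ts):
            # Enough time has passed: keep the files, but stop tracking them.
            last_preserved_ts = ts
            preserved.append(name)
            continue
        deleted.append(name)
    return deleted, preserved, last_preserved_ts


# Six checkpoints saved 30 minutes apart, keeping 2 active and 1 per hour.
active = OrderedDict((f"ckpt-{i}", i * 1800.0) for i in range(1, 7))
deleted, preserved, last = sweep(
    active, max_to_keep=2, keep_every_n_hours=1, last_preserved_ts=0.0)
print(deleted)       # ['ckpt-1', 'ckpt-3']
print(preserved)     # ['ckpt-2', 'ckpt-4']
print(list(active))  # ['ckpt-5', 'ckpt-6']
```

In this model the preserved checkpoints are kept on disk but dropped from the active set, which matches the note that such checkpoints do not appear in the checkpoints attribute.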

Attributes:

  • checkpoints: A list of managed checkpoints.

    Note that checkpoints saved due to keep_checkpoint_every_n_hours will not show up in this list (to avoid ever-growing filename lists).

  • latest_checkpoint: The prefix of the most recent checkpoint in directory.

    Equivalent to tf.train.latest_checkpoint(directory) where directory is the constructor argument to CheckpointManager.

    Suitable for passing to tf.train.Checkpoint.restore to resume training.
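
As a minimal sketch of how latest_checkpoint supports resuming, the snippet below saves once, then builds a second CheckpointManager on the same directory to stand in for a restarted program. It assumes TensorFlow 2 with eager execution; the variable names and the temporary directory are illustrative only.

```python
import tempfile

import tensorflow as tf

directory = tempfile.mkdtemp()  # illustrative location for checkpoints

# First "run": the directory is empty, so there is nothing to restore.
step = tf.Variable(0, dtype=tf.int64)
checkpoint = tf.train.Checkpoint(step=step)
manager = tf.train.CheckpointManager(checkpoint, directory, max_to_keep=3)
initial_latest = manager.latest_checkpoint  # None before the first save
step.assign(42)
saved_path = manager.save()

# Second "run": a fresh manager on the same directory recovers the state.
# The first manager is no longer used, so only one is active at a time.
restored_step = tf.Variable(0, dtype=tf.int64)
new_checkpoint = tf.train.Checkpoint(step=restored_step)
new_manager = tf.train.CheckpointManager(
    new_checkpoint, directory, max_to_keep=3)
new_checkpoint.restore(new_manager.latest_checkpoint)
print(int(restored_step.numpy()))  # 42
```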

Raises:

  • ValueError: If max_to_keep is neither None nor a positive integer.

Methods

save

save(
    checkpoint_number=None
)

Creates a new checkpoint and manages it.

Args:

  • checkpoint_number: An optional integer, or an integer-dtype Variable or Tensor, used to number the checkpoint. If None (default), checkpoints are numbered using checkpoint.save_counter. Even if checkpoint_number is provided, save_counter is still incremented. A user-provided checkpoint_number is not incremented even if it is a Variable.

Returns:

The path to the new checkpoint. It is also recorded in the checkpoints and latest_checkpoint properties.
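
The following sketch illustrates numbering checkpoints with a user-provided Variable and inspecting the returned paths. It assumes TensorFlow 2 with eager execution; the step variable, the increment of 10, and the temporary directory are illustrative choices rather than anything required by the API.

```python
import os
import tempfile

import tensorflow as tf

step = tf.Variable(0, dtype=tf.int64)  # user-managed training step
checkpoint = tf.train.Checkpoint(step=step)
manager = tf.train.CheckpointManager(
    checkpoint, tempfile.mkdtemp(), max_to_keep=2)

paths = []
for _ in range(3):
    step.assign_add(10)  # the caller advances the step; save() does not
    paths.append(manager.save(checkpoint_number=step))

names = [os.path.basename(p) for p in paths]
print(names)  # ['ckpt-10', 'ckpt-20', 'ckpt-30']
# Only the two most recent checkpoints remain in the active set.
print(manager.checkpoints == paths[1:])           # True
print(manager.latest_checkpoint == paths[-1])     # True
# save_counter still counts every save, independent of checkpoint_number.
print(int(checkpoint.save_counter.numpy()))       # 3
```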