The phrase "Saving a TensorFlow model" typically means one of two things:
- Checkpoints, OR
- SavedModel.
Checkpoints capture the exact value of all parameters (tf.Variable objects) used by a model. Checkpoints do not contain any description of the computation defined by the model and thus are typically only useful when source code that will use the saved parameter values is available.
The SavedModel format on the other hand includes a serialized description of the computation defined by the model in addition to the parameter values (checkpoint). Models in this format are independent of the source code that created the model. They are thus suitable for deployment via TensorFlow Serving, TensorFlow Lite, TensorFlow.js, or programs in other programming languages (the C, C++, Java, Go, Rust, C# etc. TensorFlow APIs).
This guide covers APIs for writing and reading checkpoints.
Setup
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

class Net(tf.keras.Model):
  """A simple linear model."""

  def __init__(self):
    super(Net, self).__init__()
    self.l1 = tf.keras.layers.Dense(5)

  def call(self, x):
    return self.l1(x)

net = Net()
Saving from tf.keras training APIs
See the tf.keras guide on saving and restoring.
tf.keras.Model.save_weights saves a TensorFlow checkpoint.
net.save_weights('easy_checkpoint')
Writing checkpoints
The persistent state of a TensorFlow model is stored in tf.Variable objects. These can be constructed directly, but are often created through high-level APIs like tf.keras.layers or tf.keras.Model.
The easiest way to manage variables is by attaching them to Python objects, then referencing those objects.
Subclasses of tf.train.Checkpoint, tf.keras.layers.Layer, and tf.keras.Model automatically track variables assigned to their attributes. The following example constructs a simple linear model, then writes checkpoints which contain values for all of the model's variables.
You can easily save a model checkpoint with Model.save_weights.
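As a small sketch of the same idea outside of the Keras convenience wrapper, the model can be attached to a tf.train.Checkpoint and written directly (the attribute name model and the output prefix below are arbitrary choices for illustration):
# Sketch: writing the model's variables through a Checkpoint object instead of
# Model.save_weights. The attribute name `model` is arbitrary; it becomes part
# of the checkpoint keys (e.g. 'model/l1/kernel/...').
sketch_ckpt = tf.train.Checkpoint(model=net)
sketch_path = sketch_ckpt.write('./sketch_checkpoint')  # writes one checkpoint, no save counter
print(sketch_path)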
Manual checkpointing
Setup
To help demonstrate all the features of tf.train.Checkpoint, define a toy dataset and optimization step:
def toy_dataset():
  inputs = tf.range(10.)[:, None]
  labels = inputs * 5. + tf.range(5.)[None, :]
  return tf.data.Dataset.from_tensor_slices(
    dict(x=inputs, y=labels)).repeat(10).batch(2)

def train_step(net, example, optimizer):
  """Trains `net` on `example` using `optimizer`."""
  with tf.GradientTape() as tape:
    output = net(example['x'])
    loss = tf.reduce_mean(tf.abs(output - example['y']))
  variables = net.trainable_variables
  gradients = tape.gradient(loss, variables)
  optimizer.apply_gradients(zip(gradients, variables))
  return loss
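As a quick, illustrative check of these helpers (not part of the guide's actual training loop), a single batch can be pulled from the dataset and run through one optimization step on a throwaway model and optimizer:
# Sketch: exercise toy_dataset() and train_step() once, on throwaway objects.
sketch_net = Net()
sketch_opt = tf.keras.optimizers.Adam(0.1)     # arbitrary learning rate
example_batch = next(iter(toy_dataset()))      # a dict with 'x' and 'y' tensors
print(train_step(sketch_net, example_batch, sketch_opt).numpy())  # the batch loss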
Create the checkpoint objects
To manually make a checkpoint you will need a tf.train.Checkpoint object, with the objects you want to checkpoint set as attributes on it.
A tf.train.CheckpointManager can also be helpful for managing multiple checkpoints.
opt = tf.keras.optimizers.Adam(0.1)
ckpt = tf.train.Checkpoint(step=tf.Variable(1), optimizer=opt, net=net)
manager = tf.train.CheckpointManager(ckpt, './tf_ckpts', max_to_keep=3)
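For comparison, tf.train.Checkpoint can also save and restore on its own; the manager mainly adds numbered paths, the bookkeeping in the checkpoint file, and deletion of old checkpoints. A standalone sketch (the variable, attribute name, and prefix are made up for illustration):
# Sketch: a Checkpoint used without a CheckpointManager.
sketch_var = tf.Variable(3.)
standalone_ckpt = tf.train.Checkpoint(v=sketch_var)
path = standalone_ckpt.save('./tf_ckpts_standalone/ckpt')  # e.g. './tf_ckpts_standalone/ckpt-1'
sketch_var.assign(0.)
standalone_ckpt.restore(path)
print(sketch_var.numpy())  # back to 3.0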
Train and checkpoint the model
The model and optimizer created above have already been gathered into a tf.train.Checkpoint object. The following training loop restores from the latest checkpoint if one exists, calls the training step on each batch of data, and periodically writes checkpoints to disk.
def train_and_checkpoint(net, manager):
  ckpt.restore(manager.latest_checkpoint)
  if manager.latest_checkpoint:
    print("Restored from {}".format(manager.latest_checkpoint))
  else:
    print("Initializing from scratch.")

  for example in toy_dataset():
    loss = train_step(net, example, opt)
    ckpt.step.assign_add(1)
    if int(ckpt.step) % 10 == 0:
      save_path = manager.save()
      print("Saved checkpoint for step {}: {}".format(int(ckpt.step), save_path))
      print("loss {:1.2f}".format(loss.numpy()))
train_and_checkpoint(net, manager)
Initializing from scratch.
Saved checkpoint for step 10: ./tf_ckpts/ckpt-1
loss 29.86
Saved checkpoint for step 20: ./tf_ckpts/ckpt-2
loss 23.28
Saved checkpoint for step 30: ./tf_ckpts/ckpt-3
loss 16.73
Saved checkpoint for step 40: ./tf_ckpts/ckpt-4
loss 10.33
Saved checkpoint for step 50: ./tf_ckpts/ckpt-5
loss 6.01
Restore and continue training
After the first run you can pass a new model and manager, and pick up training exactly where you left off:
opt = tf.keras.optimizers.Adam(0.1)
net = Net()
ckpt = tf.train.Checkpoint(step=tf.Variable(1), optimizer=opt, net=net)
manager = tf.train.CheckpointManager(ckpt, './tf_ckpts', max_to_keep=3)
train_and_checkpoint(net, manager)
Restored from ./tf_ckpts/ckpt-5
Saved checkpoint for step 60: ./tf_ckpts/ckpt-6
loss 3.24
Saved checkpoint for step 70: ./tf_ckpts/ckpt-7
loss 1.09
Saved checkpoint for step 80: ./tf_ckpts/ckpt-8
loss 0.79
Saved checkpoint for step 90: ./tf_ckpts/ckpt-9
loss 1.09
Saved checkpoint for step 100: ./tf_ckpts/ckpt-10
loss 0.55
The tf.train.CheckpointManager object deletes old checkpoints. Above it's configured to keep only the three most recent checkpoints.
print(manager.checkpoints) # List the three remaining checkpoints
['./tf_ckpts/ckpt-8', './tf_ckpts/ckpt-9', './tf_ckpts/ckpt-10']
These paths, e.g. './tf_ckpts/ckpt-10', are not files on disk. Instead they are prefixes for an index file and one or more data files which contain the variable values. These prefixes are grouped together in a single checkpoint file ('./tf_ckpts/checkpoint') where the CheckpointManager saves its state.
!ls ./tf_ckpts
checkpoint
ckpt-10.data-00000-of-00002
ckpt-10.data-00001-of-00002
ckpt-10.index
ckpt-8.data-00000-of-00002
ckpt-8.data-00001-of-00002
ckpt-8.index
ckpt-9.data-00000-of-00002
ckpt-9.data-00001-of-00002
ckpt-9.index
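The checkpoint file is what tf.train.latest_checkpoint consults to resolve the newest prefix in a directory; a short sketch reusing the manager from above:
# Sketch: resolve the newest checkpoint prefix from the 'checkpoint' file.
latest = tf.train.latest_checkpoint('./tf_ckpts')
print(latest)  # e.g. './tf_ckpts/ckpt-10'
assert latest == manager.latest_checkpoint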
Loading mechanics
TensorFlow matches variables to checkpointed values by traversing a directed graph with named edges, starting from the object being loaded. Edge names typically come from attribute names in objects, for example the "l1" in self.l1 = tf.keras.layers.Dense(5). tf.train.Checkpoint uses its keyword argument names, as in the "step" in tf.train.Checkpoint(step=...).
In the dependency graph from the example above, the optimizer is red, regular variables are blue, and optimizer slot variables are orange. The other nodes, for example the one representing the tf.train.Checkpoint, are black.
Slot variables are part of the optimizer's state, but are created for a specific variable. For example the 'm' edges correspond to momentum, which the Adam optimizer tracks for each variable. Slot variables are only saved in a checkpoint if the variable and the optimizer would both be saved, thus the dashed edges.
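Once the optimizer has applied at least one update, these slot variables can also be inspected directly; a small sketch reusing the opt and net from the training loop above ('m' and 'v' are the names Adam uses for its first- and second-moment slots):
# Sketch: peek at Adam's slot variables for the layer's kernel.
m = opt.get_slot(net.l1.kernel, 'm')  # first-moment (momentum) slot
v = opt.get_slot(net.l1.kernel, 'v')  # second-moment slot
print(m.shape, v.shape)               # both match the kernel's shape, (1, 5)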
Calling restore() on a tf.train.Checkpoint object queues the requested restorations, restoring variable values as soon as there's a matching path from the Checkpoint object. For example, you can load just the bias from the model defined above by reconstructing one path to it through the network and the layer.
to_restore = tf.Variable(tf.zeros([5]))
print(to_restore.numpy()) # All zeros
fake_layer = tf.train.Checkpoint(bias=to_restore)
fake_net = tf.train.Checkpoint(l1=fake_layer)
new_root = tf.train.Checkpoint(net=fake_net)
status = new_root.restore(tf.train.latest_checkpoint('./tf_ckpts/'))
print(to_restore.numpy()) # We get the restored value now
[0. 0. 0. 0. 0.]
[1.3864772 3.9660616 2.4380682 4.362023 5.2540216]
The dependency graph for these new objects is a much smaller subgraph of the larger checkpoint we wrote above. It includes only the bias and a save counter that tf.train.Checkpoint uses to number checkpoints.
restore() returns a status object, which has optional assertions. All of the objects we've created in our new Checkpoint have been restored, so status.assert_existing_objects_matched() passes.
status.assert_existing_objects_matched()
<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f8fe901d2e8>
There are many objects in the checkpoint which haven't matched, including the layer's kernel and the optimizer's variables. status.assert_consumed() only passes if the checkpoint and the program match exactly, and would throw an exception here.
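To see the mismatch explicitly, the failing assertion can be caught; a small sketch reusing the status object from above:
# Sketch: assert_consumed() raises because the checkpoint contains values
# (the kernel, the optimizer's variables and slots) with no matching object here.
try:
  status.assert_consumed()
except AssertionError as error:
  print("Checkpoint not fully consumed:", error)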
Delayed restorations
Layer objects in TensorFlow may delay the creation of variables to their first call, when input shapes are available. For example the shape of a Dense layer's kernel depends on both the layer's input and output shapes, and so the output shape required as a constructor argument is not enough information to create the variable on its own. Since calling a Layer also reads the variable's value, a restore must happen between the variable's creation and its first use.
To support this idiom, tf.train.Checkpoint queues restores which don't yet have a matching variable.
delayed_restore = tf.Variable(tf.zeros([1, 5]))
print(delayed_restore.numpy()) # Not restored; still zeros
fake_layer.kernel = delayed_restore
print(delayed_restore.numpy()) # Restored
[[0. 0. 0. 0. 0.]]
[[4.787564 4.5514402 4.9668665 4.778242 4.8766174]]
Manually inspecting checkpoints
tf.train.list_variables lists the checkpoint keys and shapes of variables in a checkpoint. Checkpoint keys are paths in the dependency graph described above.
tf.train.list_variables(tf.train.latest_checkpoint('./tf_ckpts/'))
[('_CHECKPOINTABLE_OBJECT_GRAPH', []),
 ('net/l1/bias/.ATTRIBUTES/VARIABLE_VALUE', [5]),
 ('net/l1/bias/.OPTIMIZER_SLOT/optimizer/m/.ATTRIBUTES/VARIABLE_VALUE', [5]),
 ('net/l1/bias/.OPTIMIZER_SLOT/optimizer/v/.ATTRIBUTES/VARIABLE_VALUE', [5]),
 ('net/l1/kernel/.ATTRIBUTES/VARIABLE_VALUE', [1, 5]),
 ('net/l1/kernel/.OPTIMIZER_SLOT/optimizer/m/.ATTRIBUTES/VARIABLE_VALUE', [1, 5]),
 ('net/l1/kernel/.OPTIMIZER_SLOT/optimizer/v/.ATTRIBUTES/VARIABLE_VALUE', [1, 5]),
 ('optimizer/beta_1/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('optimizer/beta_2/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('optimizer/decay/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('optimizer/iter/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('optimizer/learning_rate/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('save_counter/.ATTRIBUTES/VARIABLE_VALUE', []),
 ('step/.ATTRIBUTES/VARIABLE_VALUE', [])]
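Beyond listing keys, tf.train.load_checkpoint returns a reader that fetches individual saved values by key; a short sketch using one of the keys listed above:
# Sketch: read a single saved tensor directly from the checkpoint files.
reader = tf.train.load_checkpoint('./tf_ckpts/')
print(reader.get_tensor('net/l1/bias/.ATTRIBUTES/VARIABLE_VALUE'))  # the saved bias, shape [5]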
List and dictionary tracking
As with direct attribute assignments like self.l1 = tf.keras.layers.Dense(5), assigning lists and dictionaries to attributes will track their contents.
save = tf.train.Checkpoint()
save.listed = [tf.Variable(1.)]
save.listed.append(tf.Variable(2.))
save.mapped = {'one': save.listed[0]}
save.mapped['two'] = save.listed[1]
save_path = save.save('./tf_list_example')
restore = tf.train.Checkpoint()
v2 = tf.Variable(0.)
assert 0. == v2.numpy() # Not restored yet
restore.mapped = {'two': v2}
restore.restore(save_path)
assert 2. == v2.numpy()
You may notice wrapper objects for lists and dictionaries. These wrappers are checkpointable versions of the underlying data structures. Just like attribute-based loading, these wrappers restore a variable's value as soon as it's added to the container.
restore.listed = []
print(restore.listed) # ListWrapper([])
v1 = tf.Variable(0.)
restore.listed.append(v1) # Restores v1, from restore() in the previous cell
assert 1. == v1.numpy()
ListWrapper([])
The same tracking is automatically applied to subclasses of tf.keras.Model, and may be used for example to track lists of layers.
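For example, here is a sketch of a model that keeps its layers in a Python list (the class and variable names are made up for illustration); the variables of every layer in the list are tracked and end up in the checkpoint:
class Stack(tf.keras.Model):
  """Sketch: a model whose layers live in a tracked Python list."""

  def __init__(self):
    super(Stack, self).__init__()
    self.blocks = [tf.keras.layers.Dense(4), tf.keras.layers.Dense(1)]

  def call(self, x):
    for block in self.blocks:
      x = block(x)
    return x

stack = Stack()
stack(tf.ones([1, 3]))        # build the layers so their variables exist
print(len(stack.variables))   # 4: a kernel and a bias per Dense layer
tf.train.Checkpoint(model=stack).write('./tf_list_of_layers_example')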
Saving object-based checkpoints with Estimator
See the Estimator guide.
Estimators by default save checkpoints with variable names rather than the object graph described in the previous sections. tf.train.Checkpoint will accept name-based checkpoints, but variable names may change when moving parts of a model outside of the Estimator's model_fn. Saving object-based checkpoints makes it easier to train a model inside an Estimator and then use it outside of one.
import tensorflow.compat.v1 as tf_compat

def model_fn(features, labels, mode):
  net = Net()
  opt = tf.keras.optimizers.Adam(0.1)
  ckpt = tf.train.Checkpoint(step=tf_compat.train.get_global_step(),
                             optimizer=opt, net=net)
  with tf.GradientTape() as tape:
    output = net(features['x'])
    loss = tf.reduce_mean(tf.abs(output - features['y']))
  variables = net.trainable_variables
  gradients = tape.gradient(loss, variables)
  return tf.estimator.EstimatorSpec(
    mode,
    loss=loss,
    train_op=tf.group(opt.apply_gradients(zip(gradients, variables)),
                      ckpt.step.assign_add(1)),
    # Tell the Estimator to save "ckpt" in an object-based format.
    scaffold=tf_compat.train.Scaffold(saver=ckpt))

tf.keras.backend.clear_session()
est = tf.estimator.Estimator(model_fn, './tf_estimator_example/')
est.train(toy_dataset, steps=10)
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': './tf_estimator_example/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } }, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f8fa20e9358>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./tf_estimator_example/model.ckpt.
INFO:tensorflow:loss = 4.6529474, step = 0
INFO:tensorflow:Saving checkpoints for 10 into ./tf_estimator_example/model.ckpt.
INFO:tensorflow:Loss for final step: 39.479168.
<tensorflow_estimator.python.estimator.estimator.EstimatorV2 at 0x7f8fa20f3f60>
tf.train.Checkpoint can then load the Estimator's checkpoints from its model_dir.
opt = tf.keras.optimizers.Adam(0.1)
net = Net()
ckpt = tf.train.Checkpoint(
  step=tf.Variable(1, dtype=tf.int64), optimizer=opt, net=net)
ckpt.restore(tf.train.latest_checkpoint('./tf_estimator_example/'))
ckpt.step.numpy() # From est.train(..., steps=10)
10
Summary
TensorFlow objects provide an easy automatic mechanism for saving and restoring the values of variables they use.