Configures TensorFlow ops to run deterministically.
tf.config.experimental.enable_op_determinism()
When op determinism is enabled, TensorFlow ops will be deterministic. This means that if an op is run multiple times with the same inputs on the same hardware, it will have the exact same outputs each time. This is useful for debugging models. Note that determinism in general comes at the expense of lower performance and so your model may run slower when op determinism is enabled.
If you want your TensorFlow program to run deterministically, put the following code near the start of your program.
tf.keras.utils.set_random_seed(1)
tf.config.experimental.enable_op_determinism()
Calling tf.keras.utils.set_random_seed
sets the Python seed, the NumPy seed,
and the TensorFlow seed. Setting these seeds is necessary to ensure any random
numbers your program generates are also deterministic.
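As a minimal sketch of what setting the seeds buys you, the snippet below re-seeds and draws one number from each of the three generators twice. The helper name `draw_once` is hypothetical and exists only for this illustration; the expectation is that both draws match.

```python
import random

import numpy as np
import tensorflow as tf


def draw_once(seed):
  """Hypothetical helper: reseed everything, then draw one number per PRNG."""
  # Sets the Python, NumPy, and TensorFlow seeds in one call.
  tf.keras.utils.set_random_seed(seed)
  return (
      random.random(),                      # Python PRNG
      float(np.random.rand()),              # NumPy global PRNG
      float(tf.random.normal(()).numpy()),  # TensorFlow PRNG
  )


first = draw_once(1)
second = draw_once(1)
print(first == second)
```

Without the re-seeding call inside the helper, the second draw would continue each generator's sequence and would not be expected to match the first.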
By default, op determinism is not enabled, so ops might return different results when run with the same inputs. These differences are often caused by the use of asynchronous threads within the op nondeterministically changing the order in which floating-point numbers are added. Most of these cases of nondeterminism occur on GPUs, which have thousands of hardware threads that are used to run ops. Enabling determinism directs such ops to use a different algorithm, one that does not use threads in a nondeterministic way.
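The effect of addition order can be seen without TensorFlow at all. The following sketch uses plain Python floats to show that floating-point addition is not associative, which is why a parallel reduction that adds values in a varying order can produce slightly different results from run to run.

```python
# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two sums differ in the last bit because each intermediate
# result is rounded to the nearest representable float.
print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)
```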
Another potential source of nondeterminism is tf.data-based data processing. Typically, this introduces nondeterminism because parallelism in methods such as Dataset.map produces inputs or runs stateful ops in a nondeterministic order. Enabling determinism removes such sources of nondeterminism.
Enabling determinism will likely make your model or your tf.data-based data processing slower. For example, Dataset.map can become several orders of magnitude slower when the map function has random ops or other stateful ops. See the “Determinism and tf.data” section below for more details. In future TensorFlow releases, we plan on improving the performance of determinism, especially for common scenarios such as Dataset.map.
Certain ops will raise an UnimplementedError because they do not yet have a deterministic implementation. Additionally, due to bugs, some ops might be nondeterministic and not raise an UnimplementedError. If you encounter such ops, please file an issue.
An example of enabling determinism follows. The
tf.nn.softmax_cross_entropy_with_logits
op is run multiple times and the
output is shown to be the same each time. This example would likely fail when
run on a GPU if determinism were not enabled, because
tf.nn.softmax_cross_entropy_with_logits
uses a nondeterministic algorithm on
GPUs by default.
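import tensorflow as tf

# Seed the PRNGs and enable determinism before running any ops.
tf.keras.utils.set_random_seed(1)
tf.config.experimental.enable_op_determinism()

labels = tf.random.normal((1, 10000))
logits = tf.random.normal((1, 10000))
output = tf.nn.softmax_cross_entropy_with_logits(labels=labels,
                                                 logits=logits)
# Re-run the op on the same inputs and check the output is unchanged.
for _ in range(5):
  output2 = tf.nn.softmax_cross_entropy_with_logits(labels=labels,
                                                    logits=logits)
  tf.debugging.assert_equal(output, output2)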
Writing deterministic models
You can make your models deterministic by enabling op determinism. This means that you can train a model and finish each run with exactly the same trainable variables. This also means that the inferences of your previously-trained model will be exactly the same on each run. Typically, models can be made deterministic by simply setting the seeds and enabling op determinism, as in the example above. However, to guarantee that your model operates deterministically, you must meet all the following requirements:
- Call tf.config.experimental.enable_op_determinism(), as mentioned above.
- Reproducibly reset any pseudorandom number generators (PRNGs) you’re using, such as by setting the seeds for the default PRNGs in TensorFlow, Python, and NumPy, as mentioned above. Note that certain newer NumPy APIs like numpy.random.default_rng ignore the global NumPy seed, so a seed must be explicitly passed to them, if used.
- Use the same hardware configuration in every run.
- Use the same software environment in every run (OS, checkpoints, version of CUDA and TensorFlow, environment variables, etc.). Note that determinism is not guaranteed across different versions of TensorFlow.
- Do not use constructs outside TensorFlow that are nondeterministic, such as reading from /dev/random or using multiple threads/processes in ways that influence TensorFlow’s behavior.
- Ensure your input pipeline is deterministic. If you use tf.data, this is done automatically (at the expense of performance). See "Determinism and tf.data" below for more information.
- Do not use tf.compat.v1.Session or tf.distribute.experimental.ParameterServerStrategy, which can introduce nondeterminism. Besides ops (including tf.data ops), these are the only known potential sources of nondeterminism within TensorFlow (if you find more, please file an issue). Note that tf.compat.v1.Session is required to use the TF1 API, so determinism cannot be guaranteed when using the TF1 API.
- Do not use nondeterministic custom ops.
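The point about numpy.random.default_rng in the list above can be sketched with NumPy alone. The variable names are illustrative only. The legacy global functions follow np.random.seed, whereas a generator created by default_rng without an argument draws its seed from the operating system, so it must be seeded explicitly.

```python
import numpy as np

# Legacy global PRNG: controlled by the global seed.
np.random.seed(0)
legacy_a = np.random.rand(3)
np.random.seed(0)
legacy_b = np.random.rand(3)

# default_rng() without a seed ignores np.random.seed and uses OS entropy.
np.random.seed(0)
unseeded_a = np.random.default_rng().random(3)
np.random.seed(0)
unseeded_b = np.random.default_rng().random(3)

# Passing the seed explicitly makes the generator reproducible.
seeded_a = np.random.default_rng(123).random(3)
seeded_b = np.random.default_rng(123).random(3)

print(np.array_equal(legacy_a, legacy_b))      # reproducible
print(np.array_equal(unseeded_a, unseeded_b))  # not reproducible
print(np.array_equal(seeded_a, seeded_b))      # reproducible
```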
Additional details on determinism
For stateful ops to be deterministic, the state of the system must be the same every time the op is run. For example, the output of tf.Variable.sparse_read depends on both the variable value and the indices function parameter. When determinism is enabled, the side effects of stateful ops are deterministic.
TensorFlow’s random ops, such as tf.random.normal, will raise a RuntimeError if determinism is enabled and a seed has not been set. However, attempting to generate nondeterministic random numbers using Python or NumPy will not raise such errors. Make sure you remember to set the Python and NumPy seeds. Calling tf.keras.utils.set_random_seed is an easy way to set all three seeds.
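The behavior described above can be sketched as follows, assuming a fresh process in which no seed has been set yet. The flag name `raised` is illustrative only.

```python
import tensorflow as tf

tf.config.experimental.enable_op_determinism()

# No seed has been set yet, so the random op is expected to fail.
try:
  tf.random.normal((2,))
  raised = False
except RuntimeError as err:
  raised = True
  print("Random op rejected:", err)

# After setting the seeds, the same op runs normally.
tf.keras.utils.set_random_seed(1)
sample = tf.random.normal((2,))
print(sample.shape)
```

Note that the Python and NumPy generators give no such warning, so forgetting to seed them fails silently.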
Note that latency, memory consumption, throughput, and other performance characteristics are not made deterministic by enabling op determinism. Only op outputs and side effects are made deterministic. Additionally, because memory consumption is nondeterministic, a model may nondeterministically raise a tf.errors.ResourceExhaustedError when it runs out of memory.
Determinism and tf.data
Enabling deterministic ops makes tf.data
deterministic in several ways:
1. For dataset methods with a deterministic argument, such as Dataset.map and Dataset.batch, the deterministic argument is overridden to be True irrespective of its setting.
2. The tf.data.Options.experimental_deterministic option is overridden to be True irrespective of its setting.
3. In Dataset.map and Dataset.interleave, if the map or interleave function has stateful random ops or other stateful ops, the function will run serially instead of in parallel. This means the num_parallel_calls argument to map and interleave is effectively ignored.
4. Prefetching with Dataset.prefetch will be disabled if any function run as part of the input pipeline has certain stateful ops. Similarly, any dataset method with a num_parallel_calls argument will be made to run serially if any function in the input pipeline has such stateful ops. Legacy random ops such as tf.random.normal will not cause such datasets to be changed, but most other stateful ops will.
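As a small illustration of the first point in the list above, the sketch below asks Dataset.map for nondeterministic ordering while op determinism is enabled. Under the behavior described here, the request is overridden and elements should come out in input order. The dataset size and the doubling function are arbitrary choices for the example.

```python
import tensorflow as tf

tf.keras.utils.set_random_seed(1)
tf.config.experimental.enable_op_determinism()

# deterministic=False would normally allow out-of-order results, but with
# op determinism enabled it is expected to be treated as True.
dataset = tf.data.Dataset.range(100).map(
    lambda x: x * 2,
    num_parallel_calls=tf.data.AUTOTUNE,
    deterministic=False,
)

result = [int(x) for x in dataset.as_numpy_iterator()]
print(result[:5])
```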
Unfortunately, due to (3), performance can be greatly reduced when stateful ops are used in Dataset.map, because the map function no longer runs in parallel. A common example of stateful ops used in Dataset.map is random ops, such as tf.random.normal, which are typically used for distortions. One way to work around this is to use stateless random ops instead. Alternatively, you can hoist all random ops into their own separate Dataset.map call, making the original Dataset.map call stateless and thus avoiding the need to serialize its execution.
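One way the stateless workaround might look is sketched below. It is an illustration under stated assumptions, not the only approach: `BASE_SEED`, `add_noise`, and `make_dataset` are hypothetical names, and the per-element seed is derived from the element index supplied by Dataset.enumerate so that each element gets different but reproducible noise.

```python
import tensorflow as tf

tf.keras.utils.set_random_seed(1)
tf.config.experimental.enable_op_determinism()

BASE_SEED = 42  # Hypothetical constant chosen for this example.


def add_noise(index, value):
  """Adds noise using a stateless op seeded by the element index."""
  seed = tf.stack([tf.constant(BASE_SEED, dtype=tf.int64), index])
  noise = tf.random.stateless_normal(shape=[], seed=seed)
  return value + noise


def make_dataset():
  values = tf.data.Dataset.range(8).map(lambda x: tf.cast(x, tf.float32))
  # The map function has no stateful ops, so it does not need to be
  # serialized for determinism.
  return values.enumerate().map(
      add_noise, num_parallel_calls=tf.data.AUTOTUNE)


first = [float(x) for x in make_dataset().as_numpy_iterator()]
second = [float(x) for x in make_dataset().as_numpy_iterator()]
print(first == second)
```

Because the noise depends only on the seed passed to the op, rebuilding and re-running the pipeline is expected to yield the same values.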
(4) can also cause performance to be reduced, but occurs less frequently than (3) because legacy random ops do not cause (4) to take effect. However, unlike (3), when there are non-random stateful ops in a user-defined function, every map and interleave dataset is affected, instead of just the map or interleave dataset with the function that has stateful ops. Additionally, prefetch datasets and any dataset with the num_parallel_calls argument are also affected.