TensorFlow debugger (tfdbg) is a specialized debugger for TensorFlow. It provides visibility into the internal structure and states of running TensorFlow graphs. The insight gained from this visibility should facilitate debugging of various types of model bugs during training and inference.
This tutorial showcases the features of tfdbg
command-line interface (CLI), by focusing on how to debug a
type of frequently-encountered bug in TensorFlow model development:
bad numerical values (
infs) causing training to fail.
To observe such an issue, run the following code without the debugger:
python -m tensorflow.python.debug.examples.debug_mnist
This code trains a simple NN for MNIST digit image recognition. Notice that the accuracy increases slightly after the first training step, but then gets stuck at a low (near-chance) level:
Scratching your head, you suspect that certain nodes in the training graph
generated bad numeric values such as
nans. The computation-graph
paradigm of TensorFlow makes it non-trivial to debug such model-internal states
with general-purpose debuggers such as Python's
tfdbg specializes in diagnosing these types of issues and pinpointing the
exact node where the problem first surfaced.
Wrapping TensorFlow Sessions with tfdbg
To add support for tfdbg in our example, we just need to add the following
three lines of code, which wrap the Session object with a debugger wrapper when
--debug flag is provided:
# Let your BUILD target depend on "//tensorflow/python/debug:debug_py" # (You don't need to worry about the BUILD dependency if you are using a pip # install of open-source TensorFlow.) from tensorflow.python import debug as tf_debug sess = tf_debug.LocalCLIDebugWrapperSession(sess) sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
This wrapper has the same interface as Session, so enabling debugging requires no other changes to the code. But the wrapper provides additional features including:
- Bringing up a terminal-based user interface (UI) before and after each
run()call, to let you control the execution and inspect the graph's internal state.
- Allowing you to register special "filters" for tensor values, to facilitate the diagnosis of issues.
In this example, we are registering a tensor filter called
which simply determines if there are any
inf values in any
intermediate tensor of the graph. (This filter is a common enough use case that
we ship it with the
def has_inf_or_nan(datum, tensor): return np.any(np.isnan(tensor)) or np.any(np.isinf(tensor))
TIP: You can also write your own custom filters. See
the API documentation
DebugDumpDir.find() for additional information.
Debugging Model Training with tfdbg
Let's try training the model again with debugging enabled. Execute the command
from above, this time with the
--debug flag added:
python -m tensorflow.python.debug.examples.debug_mnist --debug
The debug wrapper session will prompt you when it is about to execute the first
run() call, with information regarding the fetched tensor and feed
dictionaries displayed on the screen.
This is what we refer to as the run-start UI. If the screen size is too small to display the content of the message in its entirety, you can resize it or use the PageUp / PageDown / Home / End keys to navigate the screen output.
As the screen output indicates, the first
run() call calculates the accuracy
using a test data set—i.e., a forward pass on the graph. You can enter the
run (or its shorthand
r) to launch the
run() call. On terminals
that support mouse events, you can simply click the underlined
run on the top
left corner of the screen to proceed.
This will bring up another screen
right after the
run() call has ended, which will display all dumped
intermediate tensors from the run. (These tensors can also be obtained by
running the command
lt after you executed
run.) This is called the
tfdbg CLI Frequently-Used Commands
Try the following commands at the
tfdbg> prompt (referencing the code at
||Print the value of the tensor
||Print a subarray of the tensor
||For a large tensor like the one here, print its value in its entirety—i.e., without using any ellipsis. May take a long time for large tensors.|
||Navigate to indices [10, 0] in the tensor being displayed.|
||Search the screen output with the regex
||Scroll to the next line with matches to the searched regex (if any).|
||Display information about the node
||Display the stack trace of node
||List the inputs to the node
||List the recipients of the output of the node
||List all dumped tensors whose names match the regular-expression pattern
||List all dumped tensors whose node type is
||List all Python source files responsible for constructing the nodes (and tensors) in the current graph.|
||List Python source files responsible for constructing the nodes whose names match the pattern
||Print the Python source file source.py, with the lines annotated with the ops created at each of them, respectively.|
||Same as the command above, but perform annotation using dumped Tensors, instead of ops.|
||Annotate source.py beginning at line 30.|
||Display information about the current run, including fetches and feeds.|
||Print general help information listing all available tfdbg commands and their flags.|
||Print the help information for the
In this first
run() call, there happen to be no problematic numerical values.
You can move on to the next run by using the command
run or its shorthand
TIP: If you enter
rrepeatedly, you will be able to move through the
run()calls in a sequential manner.
You can also use the
-tflag to move ahead a number of
run()calls at a time, for example:
tfdbg> run -t 10
Instead of entering
run repeatedly and manually searching for
infs in the run-end UI after every
run() call, you can use the following
command to let the debugger repeatedly execute
run() calls without stopping at
the run-start or run-end prompt, until the first
inf value shows up
in the graph. This is analogous to conditional breakpoints in some
tfdbg> run -f has_inf_or_nan
NOTE: This works because we have previously registered a filter for
has_inf_or_nan(as explained previously). If you have registered any other filters, you can let tfdbg run till any tensors pass that filter as well, e.g.,
# In python code: sess.add_tensor_filter('my_filter', my_filter_callable) # Run at tfdbg run-start prompt: tfdbg> run -f my_filter
After you enter
run -f has_inf_or_nan, you will see the following
screen with a red-colored title line indicating tfdbg stopped immediately
run() call generated intermediate tensors that passed the specified
As the screen display indicates, the
has_inf_or_nan filter is first passed
during the fourth
run() call: an Adam optimizer
forward-backward training pass on the graph. In this run, 36 (out of the total
95) intermediate tensors contain
inf values. These tensors are listed
in chronological order, with their timestamps displayed on the left. At the top
of the list, you can see the first tensor in which the bad numerical values
To view the value of the tensor, click the underlined tensor name
cross_entropy/Log:0 or enter the equivalent command:
tfdbg> pt cross_entropy/Log:0
Scroll down a little and you will notice some scattered
inf values. If the
nan are difficult to spot by eye, you can use the
following command to perform a regex search and highlight the output:
Why did these infinities appear? To further debug, display more information
about the node
cross_entropy/Log by clicking the underlined
item on the top or entering the equivalent command:
tfdbg> ni cross_entropy/Log
You can see that this node has the op type
and that its input is the node
softmax/Softmax. Run the following command to
take a closer look at the input tensor:
tfdbg> pt softmax/Softmax:0
Examine the values in the input tensor, and search to see if there are any zeros:
Indeed, there are zeros. Now it is clear that the origin of the bad numerical
values is the node
cross_entropy/Log taking logs of zeros. To find out the
culprit line in the Python source code, use the
-t flag of the
to show the traceback of the node's construction:
tfdbg> ni -t cross_entropy/Log
-t flag is used by default, if you use the clickable "node_info" menu item
at the top of the screen.
From the traceback, you can see that the op is constructed around lines 105-106
diff = y_ * tf.log(y)
*tfdbg has a feature that makes it ease to trace Tensors and ops back to
lines in Python source files. It can annotate lines of a Python file with
the ops or Tensors created by them. To use this feature,
simply click the underlined line numbers in the stack trace output of the
ni -t <op_name> commands, or use the
print_source) command such as:
ps /path/to/source.py. See the screenshot below for an example of
Apply a value clipping on the input to
to resolve this problem:
diff = y_ * tf.log(tf.clip_by_value(y, 1e-8, 1.0))
Now, try training again with
python -m tensorflow.python.debug.examples.debug_mnist --debug
run -f has_inf_or_nan at the
tfdbg> prompt and confirm that no tensors
are flagged as containing
inf values, and accuracy no longer gets
Debugging tf-learn Estimators
For documentation on tfdbg to debug
Experiments, please see
How to Use TensorFlow Debugger (tfdbg) with tf.contrib.learn.
Offline Debugging of Remotely-Running Sessions
Oftentimes, your model is running in a remote machine or process that you don't
have terminal access to. To perform model debugging in such cases, you can use
tfdbg. It operates on dumped data directories.
If the process you are running is written in Python, you can
RunOptions proto that you call your
with, by using the method
This will cause the intermediate tensors and runtime graphs to be dumped to a
shared storage location of your choice when the
Session.run() call occurs.
from tensorflow.python.debug import debug_utils # ... Code where your session and graph are set up... run_options = tf.RunOptions() debug_utils.watch_graph( run_options, session.graph, debug_urls=["file:///shared/storage/location/tfdbg_dumps_1"]) # Be sure to use different directories for different run() calls. session.run(fetches, feed_dict=feeds, options=run_options)
Later, in an environment that you have terminal access to, you can load and
inspect the data in the dump directory on the shared storage by using the
tfdbg. For example:
python -m tensorflow.python.debug.cli.offline_analyzer \ --dump_dir=/shared/storage/location/tfdbg_dumps_1
DumpingDebugWrapperSession offers an easier and more
flexible way to generate dumps on filesystem that can be analyzed offline.
To use it, simply do:
# Let your BUILD target depend on "//tensorflow/python/debug:debug_py # (You don't need to worry about the BUILD dependency if you are using a pip # install of open-source TensorFlow.) from tensorflow.python.debug import debug_utils sess = tf_debug.DumpingDebugWrapperSession( sess, "/shared/storage/location/tfdbg_dumps_1/", watch_fn=my_watch_fn)
watch_fn=my_watch_fn is a
Callable that allows you to configure what
Tensors to watch on different
Session.run() calls, as a function of the
feed_dict to the
run() call and other states. See
the API doc of DumpingDebugWrapperSession
for more details.
If you model code is written in C++ or other languages, you can also
debug_options field of
RunOptions to generate debug dumps that
can be inspected offline. See
the proto definition
for more details.
Other Features of the tfdbg CLI
- Navigation through command history using the Up and Down arrow keys. Prefix-based navigation is also supported.
- Navigation through history of screen outputs using the
nextcommands or by clicking the underlined
-->links near the top of the screen.
- Tab completion of commands and some command arguments.
- Write screen output to file by using bash-style redirection. For example:
tfdbg> pt cross_entropy/Log:0[:, 0:10] > /tmp/xent_value_slices.txt
Frequently Asked Questions
Q: Do the timestamps on the left side of the
lt output reflect actual
performance in a non-debugging session?
A: No. The debugger inserts additional special-purpose debug nodes to the graph to record the values of intermediate tensors. These nodes certainly slow down the graph execution. If you are interested in profiling your model, check out tfprof and other profiling tools for TensorFlow.
Q: How do I link tfdbg against my
Session in Bazel? Why do I see an
error such as "ImportError: cannot import name debug"?
A: In your BUILD rule, declare dependencies:
The first is the dependency that you include to use TensorFlow even
without debugger support; the second enables the debugger.
Then, In your Python file, add:
from tensorflow.python import debug as tf_debug # Then wrap your TensorFlow Session with the local-CLI wrapper. sess = tf_debug.LocalCLIDebugWrapperSession(sess)
Q: Does tfdbg help debugging runtime errors such as shape mismatches?
A: Yes. tfdbg intercepts errors generated by ops during runtime and presents the errors with some debug instructions to the user in the CLI. See examples:
# Debugging shape mismatch during matrix multiplication. python -m tensorflow.python.debug.examples.debug_errors \ --error shape_mismatch --debug # Debugging uninitialized variable. python -m tensorflow.python.debug.examples.debug_errors \ --error uninitialized_variable --debug
Q: How can I let my tfdbg-wrapped Sessions or Hooks run the debug mode only from the main thread?
This is a common use case, in which the
Session object is used from multiple
threads concurrently. Typically, the child threads take care of background tasks
such as running enqueue operations. Oftentimes, you want to debug only the main
thread (or less frequently, only one of the child threads). You can use the
thread_name_filter keyword argument of
achieve this type of thread-selective debugging. For example, if you would like
to debug from only the main thread, you can do:
sess = tf_debug.LocalCLIDebugWrapperSession(sess, thread_name_filter="MainThread$")
The above example relies on the fact that main threads in Python have the
Q: The model I am debugging is very large. The data dumped by tfdbg fills up the free space of my disk. What can I do?
For large models, i.e., models with many intermediate tensors, large sizes in
individual intermediate tensors and/or many iterations in any
that the graph contains, this kind of disk space issue can happen.
There are three possible workarounds or solutions:
- The constructors of
LocalCLIDebugHookprovide a keyword argument,
dump_root, with which you can specify the path to which tfdbg dumps the debug data. For example:
``` python # For LocalCLIDebugWrapperSession sess = tf_debug.LocalCLIDebugWrapperSession(dump_root="/with/lots/of/space")
# For LocalCLIDebugHook
hooks = [tf_debug.LocalCLIDebugHook(dump_root="/with/lots/of/space")]
Make sure that the directory pointed to by dump_root is empty or nonexistent.
**tfdbg** cleans up the dump directories before exiting.
2. Reduce the batch size used during the runs.
3. Use the filtering options of **tfdbg**'srun` command to watch only specific
nodes in the graph. For example:
tfdbg> run --node_name_filter .*hidden.*
tfdbg> run --op_type_filter Variable.*
tfdbg> run --tensor_dtype_filter int.*
Q: Why can't I select text in the tfdbg CLI?
A: This is because the tfdbg CLI enables mouse events in the terminal by
default. This mouse-mask mode
overrides default terminal interactions, including text selection. You
can re-enable text selection by using the command
mouse off or
Q: What are the platform-specific system requirements of tfdbg CLI in open-source TensorFlow?
A: On Mac OS X, the
ncurses library is required. It can be installed with
brew install homebrew/dupes/ncurses. On Windows, the
pyreadline library is
required. If you are using Anaconda3, you can install it with a command such as
"C:\Program Files\Anaconda3\Scripts\pip.exe" install pyreadline.