Using JIT Compilation

Why use just-in-time (JIT) compilation?

The TensorFlow/XLA JIT compiler compiles and runs parts of TensorFlow graphs via XLA. The benefit of this over the standard TensorFlow implementation is that XLA can fuse multiple operators (kernel fusion) into a small number of compiled kernels. Fusing operators can reduce memory bandwidth requirements and improve performance compared to executing operators one-at-a-time, as the TensorFlow executor does.

Running TensorFlow graphs via XLA

There are two ways to run TensorFlow computations via XLA, either by JIT-compiling operators placed on a CPU or GPU device, or by placing operators on the XLA_CPU or XLA_GPU TensorFlow devices. Placing operators directly on a TensorFlow XLA device forces the operator to run on that device and is mainly used for testing.

Turning on JIT compilation

JIT compilation can be turned on at the session level or manually for select operations. Both of these approaches are zero-copy --- data does not need to be copied when passing data between a compiled XLA kernel and a TensorFlow operator placed on the same device.

Session

Turning on JIT compilation at the session level will result in all possible operators being greedily compiled into XLA computations. Each XLA computation will be compiled into one or more kernels for the underlying device.

Subject to a few constraints, if there are two adjacent operators in the graph that both have XLA implementations, then they will be compiled into a single XLA computation.

JIT compilation is turned on at the session level by setting the global_jit_level config to tf.OptimizerOptions.ON_1 and passing the config during session initialization.

# Config to turn on JIT compilation
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

sess = tf.Session(config=config)

Manual with experimental_jit_scope()

JIT compilation can also be turned on manually for one or more operators. This is done by tagging the operators to compile with the attribute _XlaCompile=true. The simplest way to do this is via the tf.contrib.compiler.jit.experimental_jit_scope() scope defined in tensorflow/contrib/compiler/jit.py. Example usage:

    jit_scope = tf.contrib.compiler.jit.experimental_jit_scope

    x = tf.placeholder(np.float32)
    with jit_scope():
      y = tf.add(x, x)  # The "add" will be compiled with XLA.

The _XlaCompile attribute is currently supported on a best-effort basis. If an operator cannot be compiled, TensorFlow will silently fall back to the normal implementation.

Manual with xla.compile()

Unlike experimental_jit_scope() which silently falls back to normal Tensorflow on uncompilable operator, xla.compile() returns an explicit error. This is useful if you want more predictable behaviors from XLA compilation.

Please see xla.compile() tutorial Colab for how to use it.

Placing operators on XLA devices

Another way to run computations via XLA is to place an operator on a specific XLA device. This method is normally only used for testing. Valid targets are XLA_CPU or XLA_GPU.

with tf.device("/job:localhost/replica:0/task:0/device:XLA_GPU:0"):
  output = tf.add(input1, input2)

Unlike JIT compilation on the standard CPU and GPU devices, these devices make a copy of data when it is transferred on and off the device. The extra copy makes it expensive to mix XLA and TensorFlow operators in the same graph.

Tutorial

This tutorial covers training a simple version of MNIST softmax with JIT turned on. Currently JIT at the session level, which is what is used for the tutorial, only supports GPU.

Before starting the tutorial verify that the LD_LIBRARY environment variable or ldconfig contains $CUDA_ROOT/extras/CUPTI/lib64, which contains libraries for the CUDA Profiling Tools Interface (CUPTI). TensorFlow uses CUPTI to pull tracing information from the GPU.

Step #1: Prepare sample script

Download or move mnist_softmax_xla.py into a folder outside of the TensorFlow source tree.

Step #2: Run without XLA

Execute the python script to train the model without XLA.

python mnist_softmax_xla.py --xla=''

Using the Chrome Trace Event Profiler (browse to chrome://tracing), open the timeline file created when the script finishes: timeline.ctf.json. The rendered timeline should look similar to the picture below with multiple green boxes labeled MatMul, possibly across multiple CPUs.

Step #3 Run with XLA

Execute the python script to train the model with XLA and turn on a debugging feature of XLA via an environmental variable that outputs the XLA graph.

XLA_FLAGS="--xla_hlo_profile --xla_dump_to=/tmp/foo --xla_dump_hlo_as_text"
python mnist_softmax_xla.py

Open the timeline file created (timeline.ctf.json). The rendered timeline should look similar to the picture below with one long bar labeled XlaLaunch.

To understand what is happening in XlaLaunch, look at the console output. Each XLA cluster that's launched will have a corresponding profile (from --xla_hlo_profile) showing how long each HLO took to run.

/tmp/foo will contain the HLO before and after optimizations for each HLO module that's run. You can read this as-is, or you can visualize it using tensorflow/compiler/xla/tools:interactive_graphviz.