Why did we build XLA?
We had several objectives for XLA to work with TensorFlow:
Improve execution speed. Compile subgraphs to cut the execution time of short-lived Ops by eliminating TensorFlow runtime overhead, fuse pipelined operations to reduce memory overhead, and specialize to known tensor shapes to allow more aggressive constant propagation.
Improve memory usage. Analyze and schedule memory usage, in principle eliminating many intermediate storage buffers.
Reduce reliance on custom Ops. Remove the need for many custom Ops by improving the performance of automatically fused low-level Ops to match the performance of custom Ops that were fused by hand.
Reduce mobile footprint. Eliminate the TensorFlow runtime by ahead-of-time compiling the subgraph and emitting an object/header file pair that can be linked directly into another application. The results can reduce the footprint for mobile inference by several orders of magnitude.
Improve portability. Make it relatively easy to write a new backend for novel hardware, at which point a large fraction of TensorFlow programs will run unmodified on that hardware. This is in contrast with the approach of specializing individual monolithic Ops for new hardware, which requires TensorFlow programs to be rewritten to make use of those Ops.
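The fusion and buffer-elimination goals above can be illustrated with a toy sketch in plain Python. It is not XLA code, and the function names `unfused` and `fused` are hypothetical. The unfused version materializes one intermediate list per operation, roughly as a per-Op runtime would, while the fused version computes the same result in a single pass.

```python
# Toy illustration of operation fusion (not XLA code; names are hypothetical).

def unfused(x, y, z):
    """Run three elementwise ops one at a time, as separate 'kernels'."""
    t0 = [a * b for a, b in zip(x, y)]     # intermediate buffer 1
    t1 = [a + c for a, c in zip(t0, z)]    # intermediate buffer 2
    out = [max(a, 0.0) for a in t1]        # final result (ReLU)
    return out, 2                          # 2 intermediate buffers were allocated


def fused(x, y, z):
    """Compute the same result in one loop, with no intermediate buffers."""
    out = [max(a * b + c, 0.0) for a, b, c in zip(x, y, z)]
    return out, 0                          # 0 intermediate buffers were allocated


x, y, z = [1.0, -2.0, 3.0], [2.0, 2.0, 2.0], [0.5, 0.5, 0.5]
unfused_out, unfused_buffers = unfused(x, y, z)
fused_out, fused_buffers = fused(x, y, z)
```

Both functions return the same values; only the number of temporary buffers differs, which is the saving that automatic fusion aims for.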
How does XLA work?
The input language to XLA is called "HLO IR", or just HLO (High Level Operations). The semantics of HLO are described on the Operation Semantics page. It is most convenient to think of HLO as a compiler IR.
XLA takes graphs ("computations") defined in HLO and compiles them into machine instructions for various architectures. XLA is modular in the sense that it is easy to slot in an alternative backend to target a novel hardware architecture. The CPU backend for x64 and ARM64, as well as the NVIDIA GPU backend, are in the TensorFlow source tree.
The following diagram shows the compilation process in XLA:
XLA comes with several target-independent optimization and analysis passes, such as common subexpression elimination (CSE), target-independent operation fusion, and buffer analysis for allocating runtime memory for the computation.
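As a rough illustration of what a target-independent pass such as CSE does, here is a minimal sketch over a toy instruction list. It is not XLA's implementation: the `(name, op, operands)` tuple format and the function name are assumptions made for this example.

```python
# Toy common subexpression elimination over a tiny SSA-style IR.
# Not XLA's implementation; the instruction format is an assumption.

def eliminate_common_subexpressions(instructions):
    """instructions: list of (name, op, operands) in definition order.

    Returns a new list in which an instruction that repeats an earlier
    (op, operands) pair is dropped and its uses point at the earlier one.
    """
    seen = {}      # (op, operands) -> name of the first instruction computing it
    replaced = {}  # dropped name -> surviving name
    result = []
    for name, op, operands in instructions:
        if op == "parameter":
            # Parameters are distinct inputs and are never merged.
            result.append((name, op, operands))
            continue
        # Rewrite operands that referred to an instruction already dropped.
        operands = tuple(replaced.get(o, o) for o in operands)
        key = (op, operands)
        if key in seen:
            replaced[name] = seen[key]
            continue
        seen[key] = name
        result.append((name, op, operands))
    return result


program = [
    ("x", "parameter", ()),
    ("y", "parameter", ()),
    ("a", "multiply", ("x", "y")),
    ("b", "multiply", ("x", "y")),  # duplicate of a
    ("c", "add", ("a", "b")),       # becomes add(a, a) once b is replaced
    ("d", "add", ("a", "a")),       # duplicate of the rewritten c
]
optimized = eliminate_common_subexpressions(program)
```

The sketch compares operands in order, so it does not treat `add(a, b)` and `add(b, a)` as equal, and it does not remap the computation's output if that output is a dropped instruction; a production pass would handle both.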
After the target-independent step, XLA sends the HLO computation to a backend. The backend can perform further HLO-level optimizations, this time with target-specific information and needs in mind. For example, the XLA GPU backend may perform operation fusion that is beneficial specifically for the GPU programming model and determine how to partition the computation into streams. At this stage, backends may also pattern-match certain operations or combinations thereof to optimized library calls.
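The pattern-matching step described above can be sketched as a small rewrite over a toy instruction list. This is not XLA code: the tuple-based IR, the function name `rewrite_dot_add`, and the library target `lib_gemm_bias` are all hypothetical, chosen only to show the shape of such a rewrite.

```python
# Toy backend rewrite: replace a dot followed by an add with one library call.
# Not XLA code; the IR format and the "lib_gemm_bias" target are hypothetical.

def rewrite_dot_add(instructions):
    """instructions: list of (name, op, operands) in definition order."""
    defs = {name: (op, operands) for name, op, operands in instructions}

    # Count uses so a dot is only absorbed when the add is its sole consumer.
    use_count = {}
    for _, _, operands in instructions:
        for o in operands:
            use_count[o] = use_count.get(o, 0) + 1

    absorbed = set()  # dot instructions folded into a library call
    rewritten = {}    # add name -> replacement instruction
    for name, op, operands in instructions:
        if op != "add" or len(operands) != 2:
            continue
        for i, operand in enumerate(operands):
            producer_op, producer_args = defs[operand]
            if producer_op == "dot" and use_count[operand] == 1:
                bias = operands[1 - i]
                rewritten[name] = (
                    name, "call:lib_gemm_bias", producer_args + (bias,))
                absorbed.add(operand)
                break

    return [rewritten.get(name, (name, op, operands))
            for name, op, operands in instructions
            if name not in absorbed]


program = [
    ("lhs", "parameter", ()),
    ("rhs", "parameter", ()),
    ("bias", "parameter", ()),
    ("d", "dot", ("lhs", "rhs")),
    ("out", "add", ("d", "bias")),
]
lowered = rewrite_dot_add(program)
```

The single-use check matters: if another instruction also read the `dot` result, folding it into the library call would remove a value that is still needed.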
The next step is target-specific code generation. The CPU and GPU backends included with XLA use LLVM for low-level IR, optimization, and code generation. These backends emit the LLVM IR needed to represent the XLA HLO computation efficiently, and then invoke LLVM to emit native code from that IR.
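To make the code-generation step concrete, the sketch below lowers a tiny scalar computation to LLVM IR text. It is a deliberately simplified stand-in, not XLA's emitter: the tuple-based IR, the `emit_llvm_ir` function, and the restriction to scalar `float` values with binary ops are assumptions made for this example.

```python
# Toy code generator: lower a scalar float computation to LLVM IR text.
# Not XLA's emitter; the IR format and function names are hypothetical.

LLVM_OPCODES = {"add": "fadd", "multiply": "fmul", "subtract": "fsub"}


def emit_llvm_ir(func_name, instructions, root):
    """instructions: list of (name, op, operands); root: name to return."""
    params = [name for name, op, _ in instructions if op == "parameter"]
    signature = ", ".join(f"float %{p}" for p in params)
    lines = [f"define float @{func_name}({signature}) {{", "entry:"]
    for name, op, operands in instructions:
        if op == "parameter":
            continue  # parameters are already bound by the signature
        lhs, rhs = operands
        lines.append(f"  %{name} = {LLVM_OPCODES[op]} float %{lhs}, %{rhs}")
    lines.append(f"  ret float %{root}")
    lines.append("}")
    return "\n".join(lines)


program = [
    ("x", "parameter", ()),
    ("y", "parameter", ()),
    ("t0", "multiply", ("x", "y")),
    ("t1", "add", ("t0", "x")),
]
ir_text = emit_llvm_ir("fused_computation", program, root="t1")
print(ir_text)
```

A real backend emits loops over tensor elements and builds IR through LLVM's APIs instead of formatting strings, but the division of labor is the same: the backend produces LLVM IR, and LLVM turns it into native code.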
The GPU backend currently supports NVIDIA GPUs via the LLVM NVPTX backend; the CPU backend supports multiple CPU ISAs.