tf.custom_gradient

Decorator to define a function with a custom gradient.

This decorator allows fine grained control over the gradients of a sequence for operations. This may be useful for multiple reasons, including providing a more efficient or numerically stable gradient for a sequence of operations.

For example, consider the following function that commonly occurs in the computation of cross entropy and log likelihoods:

def log1pexp(x):
  return tf.math.log(1 + tf.exp(x))

Due to numerical instability, the gradient of this function evaluated at x=100 is NaN. For example:

x = tf.constant(100.)
y = log1pexp(x)
dy = tf.gradients(y, x) # Will be NaN when evaluated.

The gradient expression can be analytically simplified to provide numerical stability:

@tf.custom_gradient
def log1pexp(x):
  e = tf.exp(x)
  def grad(dy):
    return dy * (1 - 1 / (1 + e))
  return tf.math.log(1 + e), grad

With this definition, the gradient at x=100 will be correctly evaluated as 1.0.

The variable dy is defined as the upstream gradient. i.e. the gradient from all the layers or functions originating from this layer.

By chain rule we know that dy/dx = dy/x_0 * dx_0/dx_1 * ... * dx_i/dx_i+1 * ... * dx_n/dx

In this case the gradient of our current function defined as dx_i/dx_i+1 = (1 - 1 / (1 + e)). The upstream gradient dy would be dx_i+1/dx_i+2 * dx_i+2/dx_i+3 * ... * dx_n/dx. The upstream gradient multiplied by the current gradient is then passed downstream.

In case the function takes multiple variables as input, the grad function must also return the same number of variables. We take the function z = x * y as an example.

@tf.custom_gradient
def bar(x, y):
  def grad(upstream):
    dz_dx = y
    dz_dy = x
    return upstream * dz_dx, upstream * dz_dy
  z = x * y
  return z, grad
x = tf.constant(2.0, dtype=tf.float32)
y = tf.constant(3.0, dtype=tf.float32)
with tf.GradientTape(persistent=True) as tape:
  tape.watch(x)
  tape.watch(y)
  z = bar(x, y)
z
<tf.Tensor: shape=(), dtype=float32, numpy=6.0>
tape.gradient(z, x)
<tf.Tensor: shape=(), dtype=float32, numpy=3.0>
tape.gradient(z, y)
<tf.Tensor: shape=(), dtype=float32, numpy=2.0>

Nesting custom gradients can lead to unintuitive results. The default behavior does not correspond to n-th order derivatives. For example

@tf.custom_gradient
def op(x):
  y = op1(x)
  @tf.custom_gradient
  def grad_fn(dy):
    gdy = op2(x, y, dy)
    def grad_grad_fn(ddy):  # Not the 2nd order gradient of op w.r.t. x.
      return op3(x, y, dy, ddy)
    return gdy, grad_grad_fn
  return y, grad_fn

The function grad_grad_fn will be calculating the first order gradient of grad_fn with respect to dy, which is used to generate forward-mode gradient graphs from backward-mode gradient graphs, but is not the same as the second order gradient of op with respect to x.

Instead, wrap nested @tf.custom_gradients in another function:

@tf.custom_gradient
def op_with_fused_backprop(x):
  y, x_grad = fused_op(x)
  def first_order_gradient(dy):
    @tf.custom_gradient
    def first_order_custom(unused_x):
      def second_order_and_transpose(ddy):
        return second_order_for_x(...), gradient_wrt_dy(...)
      return x_grad, second_order_and_transpose
    return dy * first_order_custom(x)
  return y, first_order_gradient

Additional arguments to the inner @tf.custom_gradient-decorated function control the expected return values of the innermost function.

See also tf.RegisterGradient which registers a gradient function for a primitive TensorFlow operation. tf.custom_gradient on the other hand allows for fine grained control over the gradient computation of a sequence of operations.

Note that if the decorated function uses Variables, the enclosing variable scope must be using ResourceVariables.

f function f(*x) that returns a tuple (y, grad_fn) where:

  • x is a sequence of (nested structures of) Tensor inputs to the function.
  • y is a (nested structure of) Tensor outputs of applying TensorFlow operations in f to x.
  • grad_fn is a function with the signature g(*grad_ys) which returns a list of Tensors the same size as (flattened) x - the derivatives of Tensors in y with respect to the Tensors in x. grad_ys is a sequence of Tensors the same size as (flattened) y holding the initial value gradients for each Tensor in y.

In a pure mathematical sense, a vector-argument vector-valued function f's derivatives should be its Jacobian matrix J. Here we are expressing the Jacobian J as a function grad_fn which defines how J will transform a vector grad_ys when left-multiplied with it (grad_ys * J, the vector-Jacobian product, or VJP). This functional representation of a matrix is convenient to use for chain-rule calculation (in e.g. the back-propagation algorithm).

If f uses Variables (that are not part of the inputs), i.e. through get_variable, then grad_fn should have signature g(*grad_ys, variables=None), where variables is a list of the Variables, and return a 2-tuple (grad_xs, grad_vars), where grad_xs is the same as above, and grad_vars is a list<Tensor> with the derivatives of Tensors in y with respect to the variables (that is, grad_vars has one Tensor per variable in variables).

A function h(x) which returns the same value as f(x)[0] and whose gradient (as calculated by tf.gradients) is determined by f(x)[1].