tf.autodiff.ForwardAccumulator

Computes Jacobian-vector products ("JVP"s) using forward-mode autodiff.

Compare with tf.GradientTape, which computes vector-Jacobian products ("VJP"s) using reverse-mode autodiff (backprop). Reverse mode is more attractive when computing gradients of a scalar-valued function with respect to many inputs (e.g. a neural network with many parameters and a scalar loss). Forward mode works best on functions with many outputs and few inputs. Since it does not hold on to intermediate activations, it is much more memory-efficient than backprop where it is applicable.
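
As a quick illustration (a minimal sketch, not part of the example below; the names x and y and the toy function are introduced here), a single forward-mode pass recovers the derivative of every output of a one-input, three-output function:

import tensorflow as tf

x = tf.constant(2.0)
with tf.autodiff.ForwardAccumulator(
    primals=x,
    tangents=tf.constant(1.0)) as acc:
  # One input, three outputs: all three derivatives come from one pass.
  y = tf.stack([x ** 2., tf.sin(x), tf.exp(x)])
acc.jvp(y)
<tf.Tensor: shape=(3,), dtype=float32, numpy=...>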

Consider a simple linear regression:

x = tf.constant([[2.0, 3.0], [1.0, 4.0]])
dense = tf.keras.layers.Dense(1)
dense.build([None, 2])
with tf.autodiff.ForwardAccumulator(
    primals=dense.kernel,
    tangents=tf.constant([[1.], [0.]])) as acc:
  loss = tf.reduce_sum((dense(x) - tf.constant([1., -1.])) ** 2.)
acc.jvp(loss)
<tf.Tensor: shape=(), dtype=float32, numpy=...>
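
Because the accumulator pushes tangents forward through every operation it sees, JVPs are also available for intermediate tensors computed while it was active, not just the final loss. The following hedged variation of the example above (with y introduced here to name the layer output) fetches the JVP of the dense layer's output as well:

with tf.autodiff.ForwardAccumulator(
    primals=dense.kernel,
    tangents=tf.constant([[1.], [0.]])) as acc:
  y = dense(x)  # intermediate activation, shape [2, 1]
  loss = tf.reduce_sum((y - tf.constant([1., -1.])) ** 2.)
acc.jvp(y)
<tf.Tensor: shape=(2, 1), dtype=float32, numpy=...>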

The example has two variables containing trainable parameters: dense.kernel (2 parameters) and dense.bias (1 parameter). Treating the training data x as a constant, the Jacobian matrix of the function mapping from parameters to loss has one row and three columns.
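
To make that shape concrete, here is a hedged sketch (not part of the original example; kernel_grad and bias_grad are names introduced here) that materializes the single Jacobian row with reverse mode and flattens it into its three columns:

with tf.GradientTape() as tape:
  loss = tf.reduce_sum((dense(x) - tf.constant([1., -1.])) ** 2.)
kernel_grad, bias_grad = tape.gradient(loss, [dense.kernel, dense.bias])
# One row, three columns: two kernel entries followed by the bias entry.
tf.concat([tf.reshape(kernel_grad, [-1]), bias_grad], axis=0)
<tf.Tensor: shape=(3,), dtype=float32, numpy=...>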

With forwardprop, we specify a length-three vector in advance which the Jacobian multiplies. The primals constructor argument is the parameter (a tf.Tensor or tf.Variable) we're specifying a vector for, and the tangents argument is the "vector" in the Jacobian-vector product. (In the example above only the dense.kernel portion of that vector is specified, so the dense.bias component is implicitly zero.) If our goal is to compute the entire Jacobian matrix, forwardprop computes one column at a time while backprop computes one row at a time. Since the Jacobian in the linear regression example has only one row, backprop requires fewer invocations (one, versus three for forwardprop):

x = tf.constant([[2.0, 3.0], [1.0, 4.0]])
dense = tf.keras.layers.Dense(1)
dense.build([None, 2])
loss_fn = lambda: tf.reduce_sum((dense(x) - tf.constant([1., -1.])) ** 2.)
kernel_fprop = []