
tf.compat.v1.train.MomentumOptimizer

Optimizer that implements the Momentum algorithm.

Inherits From: Optimizer

Migrate to TF2

tf.compat.v1.train.MomentumOptimizer is compatible with eager mode and tf.function. When eager execution is enabled, learning_rate and momentum can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions.

To switch to native TF2 style, please directly use tf.keras.optimizers.SGD with the momentum argument.

Structural mapping to native TF2

Before:

optimizer = tf.compat.v1.train.MomentumOptimizer(
  learning_rate=learning_rate,
  momentum=momentum,
  use_nesterov=use_nesterov)

After:

optimizer = tf.keras.optimizers.SGD(
  learning_rate=learning_rate,
  momentum=momentum,
  nesterov=use_nesterov)

How to map arguments

TF1 Arg Name | TF2 Arg Name | Note
learning_rate | learning_rate | Be careful when setting learning_rate to a tensor value computed from the global step. In TF1 this usually implied a dynamic learning rate that was recomputed on each step. In TF2 (eager + tf.function) it is treated as a scalar value that is computed only once, not as a symbolic placeholder evaluated each time; use a callable or a learning-rate schedule for dynamic rates.
momentum | momentum | -
use_locking | - | Not applicable in TF2.
use_nesterov | nesterov | -
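The learning_rate caveat above can be illustrated without TensorFlow at all. A plain-Python sketch (names are illustrative, not TF API) of the difference between a value computed once and a callable re-evaluated on each use:

```python
# Plain-Python sketch (no TensorFlow): why a precomputed scalar behaves
# differently from a callable under eager execution. Names are illustrative.

state = {"step": 0}

# A callable is re-evaluated every time the optimizer asks for the rate,
# so it sees the current step.
def lr_fn():
    return 1.0 / (1 + state["step"])

# A plain value is computed once, here while step == 0, and never changes.
lr_scalar = 1.0 / (1 + state["step"])

rates_from_callable = []
rates_from_scalar = []
for s in range(3):
    state["step"] = s
    rates_from_callable.append(lr_fn())   # 1.0, 0.5, 0.333...
    rates_from_scalar.append(lr_scalar)   # 1.0, 1.0, 1.0
```

Passing lr_fn (or a tf.keras.optimizers.schedules.LearningRateSchedule) to the TF2 optimizer restores the dynamic behavior that a global-step tensor gave in TF1.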

Before & after usage example

Before:

x = tf.Variable([1,2,3], dtype=tf.float32)
grad = tf.constant([0.1, 0.2, 0.3])
optimizer = tf.compat.v1.train.MomentumOptimizer(
  learning_rate=0.001,
  momentum=0.9,
  use_nesterov=False)
optimizer.apply_gradients(zip([grad], [x]))

After:

x = tf.Variable([1,2,3], dtype=tf.float32)
grad = tf.constant([0.1, 0.2, 0.3])
optimizer = tf.keras.optimizers.SGD(
  learning_rate=0.001,
  momentum=0.9,
  nesterov=False)
optimizer.apply_gradients(zip([grad], [x]))

Description

Computes (if use_nesterov = False):

accumulation = momentum * accumulation + gradient
variable -= learning_rate * accumulation

Note that in the dense version of this algorithm, accumulation is updated and applied regardless of a gradient's value, whereas the sparse version (when the gradient is an IndexedSlices, typically because of tf.gather or an embedding) only updates variable slices and corresponding accumulation terms when that part of the variable was used in the forward pass.
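As a sanity check, the dense update above can be traced in plain Python (no TensorFlow), reusing the values from the usage example (learning_rate=0.001, momentum=0.9):

```python
# Plain-Python trace of the dense momentum update (use_nesterov=False):
#   accumulation = momentum * accumulation + gradient
#   variable    -= learning_rate * accumulation

lr, mom = 0.001, 0.9
var = [1.0, 2.0, 3.0]
grad = [0.1, 0.2, 0.3]
acc = [0.0, 0.0, 0.0]

def momentum_step(var, acc, grad, lr, mom):
    for i in range(len(var)):
        acc[i] = mom * acc[i] + grad[i]
        var[i] -= lr * acc[i]

momentum_step(var, acc, grad, lr, mom)   # first step: acc equals grad
# var[0] is now 1 - 0.001 * 0.1 = 0.9999
momentum_step(var, acc, grad, lr, mom)   # second step: acc[0] = 0.9*0.1 + 0.1 = 0.19
# var[0] is now 0.9999 - 0.001 * 0.19 = 0.99971
```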

Args

learning_rate A Tensor or a floating point value. The learning rate.
momentum A Tensor or a floating point value. The momentum.
use_locking If True use locks for update operations.
name Optional name prefix for the operations created when applying gradients. Defaults to "Momentum".
use_nesterov If True use Nesterov Momentum. See (Sutskever et al., 2013). This implementation always computes gradients at the value of the variable(s) passed to the optimizer. Using Nesterov Momentum makes the variable(s) track the values called theta_t + mu*v_t in the paper. This implementation is an approximation of the original formula, valid for high values of momentum. It will compute the "adjusted gradient" in NAG by assuming that the new gradient will be estimated by the current average gradient plus the product of momentum and the change in the average gradient.
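For comparison, the TF2 replacement tf.keras.optimizers.SGD documents its nesterov=True update as velocity = momentum * velocity - learning_rate * gradient followed by w = w + momentum * velocity - learning_rate * gradient. Note the sign convention differs from the accumulation form above: velocity already folds in the learning rate. A plain-Python sketch of one such step, with illustrative values:

```python
# Plain-Python sketch of the Nesterov update form documented for
# tf.keras.optimizers.SGD (nesterov=True). Values are illustrative.
lr, mom = 0.001, 0.9
w, g, velocity = 1.0, 0.1, 0.0

velocity = mom * velocity - lr * g   # -0.0001
w = w + mom * velocity - lr * g      # 1 - 0.9*1e-4 - 1e-4 = 0.99981
```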

Methods

apply_gradients

Apply gradients to variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

Args
grads_and_vars List of (gradient, variable) pairs as returned by compute_gradients().
global_step Optional Variable to increment by one after the variables have been updated.
name Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.

Returns
An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.

Raises
TypeError If grads_and_vars is malformed.
ValueError If none of the variables have gradients.
RuntimeError If you should use _distributed_apply() instead.

compute_gradients

Compute gradients of loss for the variables in var_list.

This is the first part of minimize(). It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.