TensorFlow 1 version View source on GitHub

Gradient descent (with momentum) optimizer.

Inherits From: Optimizer

Update rule for parameter w with gradient g when momentum is 0:

w = w - learning_rate * g

Update rule when momentum is larger than 0:

velocity = momentum * velocity - learning_rate * g
w = w * velocity

When nesterov=False, this rule becomes:

velocity = momentum * velocity - learning_rate * g
w = w + momentum * velocity - learning_rate * g

learning_rate A Tensor, floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule, or a callable that takes no arguments and returns the actual value to use. The learning rate. Defaults to 0.01.
momentum float hyperparameter >= 0 that accelerates gradient descent in the relevant direction and dampens oscillations. Defaults to 0, i.e., vanilla gradient descent.
nesterov boolean. Whether to apply Nesterov momentum. Defaults to False.
name Optional name prefix for the operations created when applying gradients. Defaults to "SGD".
**kwargs Keyword arguments. Allowed to be one of "clipnorm" or "clipvalue". "clipnorm" (float) clips gradients by norm; "clipvalue" (float) clips gradients by value.


opt = tf.keras.optimizers.SGD(learning_rate=0.1)
var = tf.Variable(1.0)
loss = lambda: (var ** 2)/2.0         # d(loss)/d(var1) = var1
step_count = opt.minimize(loss, [var]).numpy()
# Step is `- learning_rate * grad`
opt = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
var = tf.Variable(1.0)
val0 = var.value()
loss = lambda: (var ** 2)/2.0         # d(loss)/d(var1) = var1
# First step is `- learning_rate * grad`
step_count = opt.minimize(loss, [var]).numpy()
val1 = var.value()
(val0 - val1).numpy()
# On later steps, step-size increases because of momentum
step_count = opt.minimize(loss, [var]).numpy()
val2 = var.value()
(val1 - val2).numpy()


name A non-empty string. The name to use for accumulators created for the optimizer.
**kwargs keyword arguments. Allowed to be {clipnorm, clipvalue, lr, decay}. clipnorm is clip gradients by norm; clipvalue is clip gradients by value, decay is included for backward compatibility to allow time inverse decay of learning rate. lr is included for backward compatibility, recommended to use learning_rate instead.

ValueError If name is malformed.

iterations Variable. The number of training steps this Optimizer has run.
weights Returns variables of this Optimizer based on the order created.



View source

Add a new slot variable for var.


View source


View source

Apply gradients to variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

The method sums gradients from all replicas in the presence of tf.distribute.Strategy by default. You can aggregate gradients yourself by passing experimental_aggregate_gradients=False.