An optimizer that applies loss scaling to prevent numeric underflow.
tf.keras.mixed_precision.LossScaleOptimizer(
    inner_optimizer, dynamic=True, initial_scale=None, dynamic_growth_steps=None
)
Loss scaling is a technique to prevent numeric underflow in intermediate gradients when float16 is used. To prevent underflow, the loss is multiplied (or "scaled") by a certain factor called the "loss scale", which causes intermediate gradients to be scaled by the loss scale as well. The final gradients are divided (or "unscaled") by the loss scale to bring them back to their original value.
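For illustration, here is a minimal sketch (with illustrative values, not part of this API) showing how an unscaled float16 value underflows while a scaled one survives:

import tensorflow as tf

# Values below float16's smallest subnormal (about 6e-8) underflow to zero.
tiny_grad = tf.constant(1e-8)
tf.cast(tiny_grad, tf.float16)  # 0.0: the value is lost

# Scaling first keeps the value representable; unscaling recovers it.
scale = 2.0 ** 15
scaled = tf.cast(tiny_grad * scale, tf.float16)  # ~3.28e-4, representable
tf.cast(scaled, tf.float32) / scale              # ~1e-8, recovered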
LossScaleOptimizer wraps another optimizer and applies loss scaling to it. By default, the loss scale is dynamically updated over time so you do not have to choose the loss scale. The minimize method automatically scales the loss, unscales the gradients, and updates the loss scale, so all you have to do is wrap your optimizer with a LossScaleOptimizer if you use minimize. For example:
opt = tf.keras.optimizers.SGD(0.25)
opt = tf.keras.mixed_precision.LossScaleOptimizer(opt)
var = tf.Variable(1.)
loss_fn = lambda: var ** 2
# 'minimize' applies loss scaling and updates the loss scale.
opt.minimize(loss_fn, var_list=[var])
If a tf.GradientTape is used to compute gradients instead of minimize, you must scale the loss and gradients manually. This can be done with the LossScaleOptimizer.get_scaled_loss and LossScaleOptimizer.get_unscaled_gradients methods. For example:
with tf.GradientTape() as tape:
  loss = loss_fn()
  scaled_loss = opt.get_scaled_loss(loss)
scaled_grad = tape.gradient(scaled_loss, var)
(grad,) = opt.get_unscaled_gradients([scaled_grad])
opt.apply_gradients([(grad, var)])  # Loss scale is updated here
When mixed precision with float16 is used, there is typically no risk of underflow affecting model quality if loss scaling is properly used. See the mixed precision guide for more information on how to use mixed precision.
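In the typical Keras workflow this wrapping is done for you: with the global policy set to 'mixed_float16', Model.compile wraps the optimizer in a LossScaleOptimizer automatically, as described in the mixed precision guide. A minimal sketch:

tf.keras.mixed_precision.set_global_policy('mixed_float16')
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer=tf.keras.optimizers.SGD(0.25), loss='mse')
isinstance(model.optimizer, tf.keras.mixed_precision.LossScaleOptimizer)  # True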
Args
inner_optimizer: The tf.keras.optimizers.Optimizer instance to wrap.
dynamic: Bool indicating whether dynamic loss scaling is used. Defaults to True. If True, the loss scale will be dynamically updated over time using an algorithm that keeps the loss scale at approximately its optimal value. If False, a single fixed loss scale is used and initial_scale must be specified, which is used as the loss scale.
initial_scale: The initial loss scale. If dynamic is True, this defaults to 2 ** 15, and the loss scale is updated over time. If dynamic is False, this must be specified and acts as the sole, fixed loss scale.
dynamic_growth_steps: With dynamic loss scaling, every dynamic_growth_steps steps with finite gradients, the loss scale is doubled. Defaults to 2000. If a nonfinite gradient is encountered, the count is reset back to zero, gradients are skipped that step, and the loss scale is halved. Can only be specified if dynamic is True.
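For example, to use a fixed loss scale instead of dynamic loss scaling, pass dynamic=False and supply the scale via initial_scale:

opt = tf.keras.mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.SGD(0.25), dynamic=False, initial_scale=1024)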
LossScaleOptimizer will occasionally skip applying gradients to the variables, in which case the trainable variables will not change that step. This is done because the dynamic loss scale will sometimes be raised too high, causing overflow in the gradients. Typically, the first 2 to 15 steps of the model are skipped as the initial loss scale is very high, but afterwards steps will only be skipped on average 0.05% of the time (the fraction of steps skipped is 1 / dynamic_growth_steps).
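The update rule described above can be summarized with the following sketch (plain Python for illustration; not the actual implementation):

def update_loss_scale(loss_scale, counter, grads_are_finite, dynamic_growth_steps):
  if grads_are_finite:
    counter += 1
    if counter == dynamic_growth_steps:
      loss_scale *= 2  # Enough consecutive finite steps: double the scale.
      counter = 0
  else:
    # Overflow: skip applying gradients this step, halve the scale,
    # and reset the count of finite steps.
    loss_scale /= 2
    counter = 0
  return loss_scale, counter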
LossScaleOptimizer delegates all public Optimizer methods to the inner optimizer. Additionally, in methods minimize and get_gradients, it scales the loss and unscales the gradients. In methods minimize and apply_gradients, it additionally updates the loss scale and skips applying gradients if any gradient has a nonfinite value.
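For example, hyperparameters set on the wrapper are forwarded to the inner optimizer, which remains accessible through the inner_optimizer property (a small sketch):

opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.SGD(0.25))
opt.learning_rate                  # Delegated to the wrapped SGD optimizer: 0.25
opt.learning_rate = 0.1            # Also updates the wrapped optimizer
opt.inner_optimizer.learning_rate  # 0.1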