# tf.contrib.opt.ShampooOptimizer

## Class `ShampooOptimizer`

The Shampoo Optimizer

Inherits From: `Optimizer`

Variant of Adagrad using one preconditioner matrix per variable dimension. For details, see https://arxiv.org/abs/1802.09568

gbar is the time-weighted accumulated gradient: gbar[t] = gbar_decay[t] * gbar[t-1] + gbar_weight[t] * g[t]

mat_gbar_j is the time-weighted accumulated gradient square along dimension j: mat_gbar_j[t] = mat_gbar_decay[t] * mat_gbar_j[t-1] + mat_gbar_weight[t] * gg_j[t], where gg_j[t] contracts g[t] with itself over every dimension except j. For example, if g[t] = g_abcd then gg_a[t] = g_abcd g_a'bcd (Einstein notation, summing over b, c and d).
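The contraction that defines gg_j can be sketched in NumPy. This is only an illustration of the formula above, not the optimizer's own code; the helper name `gradient_second_moments` is hypothetical.

```python
import numpy as np

def gradient_second_moments(g):
    """Return [gg_0, ..., gg_{n-1}] for a rank-n gradient g.

    gg_j contracts g with itself over every axis except j, so gg_j is a
    square matrix of shape (g.shape[j], g.shape[j]).
    """
    n = g.ndim
    moments = []
    for j in range(n):
        other_axes = [a for a in range(n) if a != j]
        moments.append(np.tensordot(g, g, axes=(other_axes, other_axes)))
    return moments

g = np.arange(6, dtype=np.float64).reshape(2, 3)
gg = gradient_second_moments(g)
# For a matrix gradient, gg[0] is g @ g.T and gg[1] is g.T @ g.
```

For a rank-2 variable this reduces to the familiar left and right statistics g gᵀ and gᵀ g.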

#### Update rule:

w[t+1] = w[t] - learning_rate[t] * Prod_j mat_gbar_j[t]^(-alpha/n) gbar[t]

Here mat_gbar_j[t]^(-alpha/n) gbar[t] is a tensor contraction along the j'th dimension of gbar[t] with the first dimension of mat_gbar_j[t]^(-alpha/n), where alpha is a hyperparameter and n is the rank of the variable. Prod_j represents doing this contraction for all j in 0..n-1.

Typically learning_rate is constant, but it can be made time-dependent by passing a lambda function that depends on the step.
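A minimal NumPy sketch of the update rule above, for a variable of any rank. It is an illustration under stated assumptions, not the optimizer's implementation: the helper names `inverse_root` and `precondition` are hypothetical, and the matrix power is computed with a symmetric eigendecomposition (`np.linalg.eigh`) rather than the SVD or iterative root method the optimizer offers.

```python
import numpy as np

def inverse_root(mat, power, epsilon=1e-4):
    """Return (mat + epsilon * I)^(-power) for a symmetric PSD matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat + epsilon * np.eye(mat.shape[0]))
    # Scale each eigenvector column by its eigenvalue raised to -power.
    return (eigvecs * eigvals ** (-power)) @ eigvecs.T

def precondition(gbar, mat_gbars, alpha=0.5, epsilon=1e-4):
    """Apply Prod_j mat_gbar_j^(-alpha/n) to gbar, one axis at a time."""
    n = gbar.ndim
    out = gbar
    for j in range(n):
        p = inverse_root(mat_gbars[j], alpha / n, epsilon)
        # Contract axis j of `out` with the first axis of p. tensordot puts
        # the new axis last, so move it back to position j.
        out = np.moveaxis(np.tensordot(out, p, axes=([j], [0])), -1, j)
    return out

# One step on a 2x2 variable with a diagonal gradient.
g = np.diag([2.0, 3.0])
update = precondition(g, [g @ g.T, g.T @ g], alpha=0.5, epsilon=0.0)
w = np.zeros((2, 2))
learning_rate = 1.0
w_next = w - learning_rate * update
```

With alpha = 0.5 and n = 2, each preconditioner is raised to the power -1/4, so for this diagonal gradient the preconditioned update is the identity matrix.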

## `__init__`

```python
__init__(
    global_step=0,
    max_matrix_size=768,
    gbar_decay=0.0,
    gbar_weight=1.0,
    mat_gbar_decay=1.0,
    mat_gbar_weight=1.0,
    learning_rate=1.0,
    svd_interval=1,
    precond_update_interval=1,
    epsilon=0.0001,
    alpha=0.5,
    use_iterative_root=False,
    use_locking=False,
    name='Shampoo'
)
```

Default values of the various hyper-parameters.

gbar_decay, gbar_weight etc. can each be a float or a time-varying parameter. For time-varying parameters use e.g. "lambda T: T / (T + 1.0)", where the expression in the lambda is a tensorflow expression.
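To see what such a schedule does, here is a plain-Python sketch of the gbar recurrence with a time-varying decay. Plain floats stand in for tensors, the `resolve` helper is hypothetical, and the matching `gbar_weight` schedule is an assumption chosen so that gbar becomes a running mean.

```python
def resolve(param, step):
    """Return param(step) for a callable schedule, else the constant itself."""
    return param(step) if callable(param) else param

# Decay schedule from the text, plus a hypothetical matching weight so that
# gbar becomes the running mean of the gradients seen so far.
gbar_decay = lambda T: T / (T + 1.0)
gbar_weight = lambda T: 1.0 / (T + 1.0)

gbar = 0.0
for step, g in enumerate([4.0, 2.0, 0.0]):
    gbar = resolve(gbar_decay, step) * gbar + resolve(gbar_weight, step) * g
# gbar now equals the mean of the three gradients (2.0, up to rounding).
```

A constant such as `gbar_decay=0.0` passes through `resolve` unchanged, so the same recurrence covers both the float and the lambda case.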

#### Args:

• `global_step`: tensorflow variable indicating the step.
• `max_matrix_size`: We do not perform SVD for matrices larger than this.
• `gbar_decay`, `gbar_weight`: Used to update gbar: gbar[t] = gbar_decay[t] * gbar[t-1] + gbar_weight[t] * g[t]
• `mat_gbar_decay`, `mat_gbar_weight`: Used to update mat_gbar: mat_gbar_j[t] = mat_gbar_decay[t] * mat_gbar_j[t-1] + mat_gbar_weight[t] * gg_j[t]
• `learning_rate`: Similar to SGD
• `svd_interval`: We should do SVD after this many steps. Default = 1, i.e. every step. Usually 20 leads to no loss of accuracy, and 50 or 100 is also OK. You may also want SVD more often early and less often later; set this in the caller, for example: "svd_interval = lambda T: tf.cond(T < 2000, lambda: 20.0, lambda: 1000.0)"
• `precond_update_interval`: We should update the preconditioners after this many steps. Default = 1. Usually less than svd_interval.
• `epsilon`: epsilon * I_n is added to each mat_gbar_j for stability for non-diagonal version of shampoo.
• `alpha`: total power of the preconditioners.
• `use_iterative_root`: If True, use the iterative root method (for TPU) instead of SVD (faster) to find the roots of PSD matrices.
• `use_locking`: If True, use locks for update operations.
• `name`: name of optimizer.
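The effect of a step-dependent `svd_interval` can be sketched in plain Python. This assumes the roots are recomputed whenever the step is a multiple of the current interval; that trigger is an assumption for illustration, and `should_recompute` is a hypothetical helper, not part of the optimizer's API.

```python
def svd_interval(step):
    """Schedule from the text: every 20 steps early on, every 1000 later."""
    return 20 if step < 2000 else 1000

def should_recompute(step):
    # Assumed trigger: recompute when the step is a multiple of the interval.
    return step % svd_interval(step) == 0

recompute_steps = [t for t in range(4001) if should_recompute(t)]
```

Under this assumption, the first 2000 steps trigger 100 recomputations and the next 2001 steps trigger only 3, which is where the cost saving comes from.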

## Methods

### `apply_gradients`

```python
apply_gradients(
    grads_and_vars,
    global_step=None,
    name=None
)
```

This is the second part of `minimize()`. It returns an `Operation` that applies gradients.

#### Args:

• `grads_and_vars`: List of (gradient, variable) pairs as returned by `compute_gradients()`.
• `global_step`: Optional `Variable` to increment by one after the variables have been updated.
• `name`: Optional name for the returned operation. Defaults to the name passed to the `Optimizer` constructor.

#### Returns:

An `Operation` that applies the specified gradients. If `global_step` was not None, that operation also increments `global_step`.

#### Raises:

• `TypeError`: If `grads_and_vars` is malformed.
• `ValueError`: If none of the variables have gradients.
• `RuntimeError`: If you should use `_distributed_apply()` instead.

### `compute_gradients`

```python
compute_gradients(
    loss,
    var_list=None,
    gate_gradients=GATE_OP,
    aggregation_method=None,
    colocate_gradients_with_ops=False,
    grad_loss=None
)
```

Compute gradients of `loss` for the variables in `var_list`.

This is the first part of `minimize()`. It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a `Tensor`, an `IndexedSlices`, or `None` if there is no gradient for the given variable.

#### Args:

• `loss`: A Tensor containing the value to minimize or a callable taking no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.
• `var_list`: Optional list or tuple of `tf.Variable` to update to minimize `loss`. Defaults to the list of variables collected in the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.
• `gate_gradients`: How to gate the computation of gradients. Can be `GATE_NONE`, `GATE_OP`, or `GATE_GRAPH`.
• `aggregation_method`: Specifies the method used to combine gradient terms. Valid values are defined in the class `AggregationMethod`.
• `colocate_gradients_with_ops`: If True, try colocating gradients with the corresponding op.
• `grad_loss`: Optional. A `Tensor` holding the gradient computed for `loss`.

#### Returns:

A list of (gradient, variable) pairs. Variable is always present, but gradient can be `None`.

#### Raises:

• `TypeError`: If `var_list` contains anything other than `Variable` objects.
• `ValueError`: If some arguments are invalid.
• `RuntimeError`: If called with eager execution enabled and `loss` is not callable.

#### Eager Compatibility

When eager execution is enabled, `gate_gradients`, `aggregation_method`, and `colocate_gradients_with_ops` are ignored.

### `get_name`

```python
get_name()
```

### `get_slot`

```python
get_slot(
    var,
    name
)
```

Return a slot named `name` created for `var` by the Optimizer.

Some `Optimizer` subclasses use additional variables. For example `Momentum` and `Adagrad` use variables to accumulate updates. This method gives access to these `Variable` objects if for some reason you need them.

Use `get_slot_names()` to get the list of slot names created by the `Optimizer`.

#### Args:

• `var`: A variable passed to `minimize()` or `apply_gradients()`.
• `name`: A string.

#### Returns:

The `Variable` for the slot if it was created, `None` otherwise.

### `get_slot_names`

```python
get_slot_names()
```

Return a list of the names of slots created by the `Optimizer`.

See `get_slot()`.

#### Returns:

A list of strings.

### `minimize`

```python
minimize(
    loss,
    global_step=None,
    var_list=None,
    gate_gradients=GATE_OP,
    aggregation_method=None,
    colocate_gradients_with_ops=False,
    name=None,
    grad_loss=None
)
```

Add operations to minimize `loss` by updating `var_list`.

This method simply combines calls to `compute_gradients()` and `apply_gradients()`. If you want to process the gradients before applying them, call `compute_gradients()` and `apply_gradients()` explicitly instead of using this function.

#### Args:

• `loss`: A `Tensor` containing the value to minimize.
• `global_step`: Optional `Variable` to increment by one after the variables have been updated.
• `var_list`: Optional list or tuple of `Variable` objects to update to minimize `loss`. Defaults to the list of variables collected in the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.
• `gate_gradients`: How to gate the computation of gradients. Can be `GATE_NONE`, `GATE_OP`, or `GATE_GRAPH`.
• `aggregation_method`: Specifies the method used to combine gradient terms. Valid values are defined in the class `AggregationMethod`.
• `colocate_gradients_with_ops`: If True, try colocating gradients with the corresponding op.
• `name`: Optional name for the returned operation.
• `grad_loss`: Optional. A `Tensor` holding the gradient computed for `loss`.

#### Returns:

An Operation that updates the variables in `var_list`. If `global_step` was not `None`, that operation also increments `global_step`.

#### Raises:

• `ValueError`: If some of the variables are not `Variable` objects.

#### Eager Compatibility

When eager execution is enabled, `loss` should be a Python function that takes no arguments and computes the value to be minimized. Minimization (and gradient computation) is done with respect to the elements of `var_list` if not None, else with respect to any trainable variables created during the execution of the `loss` function. `gate_gradients`, `aggregation_method`, `colocate_gradients_with_ops` and `grad_loss` are ignored when eager execution is enabled.

### `variables`

```python
variables()
```

A list of variables which encode the current state of `Optimizer`.

Includes slot variables and additional global variables created by the optimizer in the current default graph.

#### Returns:

A list of variables.

## Class Members

• `GATE_GRAPH = 2`
• `GATE_NONE = 0`
• `GATE_OP = 1`