# tf.contrib.opt.ShampooOptimizer

The Shampoo Optimizer

Inherits From: `Optimizer`

Variant of Adagrad using one preconditioner matrix per variable dimension. For details, see https://arxiv.org/abs/1802.09568

`gbar` is the time-weighted accumulated gradient:

`gbar[t] = gbar_decay[t] * gbar[t-1] + gbar_weight[t] * g[t]`

`mat_gbar_j` is the time-weighted accumulated squared gradient for dimension j:

`mat_gbar_j[t] = mat_gbar_decay[t] * mat_gbar_j[t-1] + mat_gbar_weight[t] * gg_j[t]`

where, if `g[t] = g_abcd`, then `gg_a[t] = g_abcd g_a'bcd` (Einstein notation).
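As a rough illustration of the contraction above, the following NumPy sketch computes `gg_j` for every dimension j of a gradient tensor. The helper name `gradient_statistics` is hypothetical and is not part of TensorFlow; it only mirrors the index notation used here.

```python
import numpy as np

def gradient_statistics(g):
    """Return [gg_0, ..., gg_{n-1}] for a rank-n gradient tensor g.

    gg_j contracts g with itself over every axis except j, so it is a
    square matrix of size g.shape[j]. Hypothetical helper, illustration only.
    """
    stats = []
    for j in range(g.ndim):
        other_axes = [a for a in range(g.ndim) if a != j]
        stats.append(np.tensordot(g, g, axes=(other_axes, other_axes)))
    return stats

# A rank-3 gradient yields one square matrix per dimension.
g = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
gg = gradient_statistics(g)
```

For a rank-2 gradient `G`, this reduces to `gg_0 = G G^T` and `gg_1 = G^T G`.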

#### Update rule:

`w[t+1] = w[t] - learning_rate[t] * Prod_j mat_gbar_j[t]^(-alpha/n) gbar[t]`

Here `mat_gbar_j[t]^(-alpha/n) gbar[t]` is a tensor contraction along the j'th dimension of `gbar[t]` with the first dimension of `mat_gbar_j[t]^(-alpha/n)`, where `alpha` is a hyperparameter and n is the rank of the variable. `Prod_j` means performing this contraction for all j in 0..n-1.
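The update rule can be sketched in NumPy as follows. This is a minimal illustration, not the optimizer's implementation: the function names `psd_power` and `shampoo_step` are hypothetical, the matrix power uses a symmetric eigendecomposition in place of the SVD, and the `epsilon` default is an assumption.

```python
import numpy as np

def psd_power(mat, power, epsilon=1e-4):
    """Raise a symmetric PSD matrix to a real power via eigendecomposition.

    epsilon * I is added first for stability (the default is an assumption).
    """
    eigvals, eigvecs = np.linalg.eigh(mat + epsilon * np.eye(mat.shape[0]))
    return (eigvecs * eigvals ** power) @ eigvecs.T

def shampoo_step(w, gbar, mat_gbars, learning_rate=1.0, alpha=0.5):
    """Compute w - learning_rate * Prod_j mat_gbar_j^(-alpha/n) gbar."""
    n = w.ndim
    update = gbar
    for j, mat_gbar in enumerate(mat_gbars):
        precond = psd_power(mat_gbar, -alpha / n)
        # Contract axis j of the update with the first axis of the
        # preconditioner, then move the resulting axis back to position j.
        contracted = np.tensordot(update, precond, axes=([j], [0]))
        update = np.moveaxis(contracted, -1, j)
    return w - learning_rate * update

# Usage on a 3x2 variable with statistics built from a single gradient.
rng = np.random.default_rng(0)
grad = rng.normal(size=(3, 2))
weights = np.zeros((3, 2))
stats = [grad @ grad.T, grad.T @ grad]
new_weights = shampoo_step(weights, grad, stats, learning_rate=0.1)
```

For a rank-2 variable with `alpha = 0.5`, each preconditioner is raised to the power -1/4, so the step is `L^(-1/4) G R^(-1/4)` scaled by the learning rate.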

Typically `learning_rate` is constant, but it can be made time dependent by passing a lambda function that depends on the step.

Args
`global_step` TensorFlow variable indicating the step.
`max_matrix_size` We do not perform SVD for matrices larger than this.
`gbar_decay`, `gbar_weight` Used to update gbar: `gbar[t] = gbar_decay[t] * gbar[t-1] + gbar_weight[t] * g[t]`
`mat_gbar_decay`, `mat_gbar_weight` Used to update mat_gbar: `mat_gbar_j[t] = mat_gbar_decay[t] * mat_gbar_j[t-1] + mat_gbar_weight[t] * gg_j[t]`
`learning_rate` Similar to SGD
`svd_interval` Number of steps between SVD computations. Default = 1, i.e. every step. An interval of 20 usually causes no loss of accuracy, and 50 or 100 is also OK. You may want the SVD computed more often early in training and less often later; set this in the caller, for example: `svd_interval = lambda T: tf.cond(T < 2000, lambda: 20.0, lambda: 1000.0)`
`precond_update_interval` Number of steps between preconditioner updates. Default = 1. Usually less than `svd_interval`.
`epsilon` `epsilon * I_n` is added to each `mat_gbar_j` for stability in the non-diagonal version of Shampoo.
`alpha` Total power of the preconditioners.
`use_iterative_root` Whether to find the roots of PSD matrices with the iterative root method (for TPU) instead of SVD (faster).
`use_locking`

`name` name of optimizer.
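To get a feel for the `svd_interval` schedule suggested above, the plain-Python sketch below mirrors it without TensorFlow and lists the steps at which an SVD would be computed. The function names are hypothetical, and the trigger condition (the step being a multiple of the current interval) is an assumption made for illustration; the optimizer's actual trigger may differ.

```python
def svd_interval(step):
    """Hypothetical schedule: every 20 steps early on, every 1000 later."""
    return 20 if step < 2000 else 1000

def is_svd_step(step):
    # Assumption: the SVD is recomputed whenever the step is a multiple
    # of the interval that is in effect at that step.
    return step % svd_interval(step) == 0

# Steps in the first 4000 at which this schedule would recompute the SVD.
svd_steps = [s for s in range(1, 4001) if is_svd_step(s)]
```

Under these assumptions the SVD count in the first 4000 steps is `len(svd_steps)`, compared with 4000 for the default interval of 1.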

## Methods

### `apply_gradients`

This is the second part of `minimize()`. It returns an `Operation` that applies gradients.

Args
`grads_and_vars` List of (gradient, variable) pairs as returned by `compute_gradients()`.
`global_step` Optional `Variable` to increment by one after the variables have been updated.
`name` Optional name for the returned operation. Defaults to the name passed to the `Optimizer` constructor.

Returns
An `Operation` that applies the specified gradients. If `global_step` was not None, that operation also increments `global_step`.

Raises
`TypeError` If `grads_and_vars` is malformed.
`ValueError` If none of the variables have gradients.
`RuntimeError` If you should use `_distributed_apply()` instead.

### `compute_gradients`

Compute gradients of `loss` for the variables in `var_list`.

This is the first part of `minimize()`. It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a `Tensor`, an `IndexedSlices`, or `None` if there is no gradient for the given variable.

Args
`loss` A Tensor containing the value to minimize or a callable taking no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.
`var_list` Optional list or tuple of `tf.Variable` to update to minimize `loss`. Defaults to the list of variables collected in the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.
`gate_gradients` How to gate the computation of gradients. Can be `GATE_NONE`, `GATE_OP`, or `GATE_GRAPH`.
`aggregation_method` Specifies the method used to combine gradient terms. Valid values are defined in the class `AggregationMethod`.
`colocate_gradients_with_ops` If True, try colocating gradients with the corresponding op.
`grad_loss` Optional. A `Tensor` holding the gradient computed for `loss`.

Returns
A list of (gradient, variable) pairs. Variable is always present, but gradient can be `None`.

Raises
`TypeError` If `var_list` contains anything other than `Variable` objects.
`ValueError` If some arguments are invalid.
`RuntimeError` If called with eager execution enabled and `loss` is not callable.

#### Eager Compatibility

When eager execution is enabled, `gate_gradients`, `aggregation_method`, and `colocate_gradients_with_ops` are ignored.

### `get_slot`

Return a slot named `name` created for `var` by the Optimizer.

Some `Optimizer` subclasses use additional variables. For example `Momentum` and `Adagrad` use variables to accumulate updates. This method gives access to these `Variable` objects if for some reason you need them.

Use `get_slot_names()` to get the list of slot names created by the `Optimizer`.

Args
`var` A variable passed to `minimize()` or `apply_gradients()`.
`name` A string.

Returns
The `Variable` for the slot if it was created, `None` otherwise.

### `get_slot_names`

Return a list of the names of slots created by the `Optimizer`.

See `get_slot()`.

Returns
A list of strings.

### `minimize`

Add operations to minimize `loss` by updating `var_list`.

This method simply combines calls to `compute_gradients()` and `apply_gradients()`. If you want to process the gradients before applying them, call `compute_gradients()` and `apply_gradients()` explicitly instead of using this function.
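The processing step between the two calls can be sketched as below. Plain floats and strings stand in for the gradient tensors and variables, and `clip_gradients` is a hypothetical helper; in a real graph the clipping would be done with a TensorFlow op such as `tf.clip_by_value`, applied to the pairs returned by `compute_gradients()` before passing them to `apply_gradients()`.

```python
def clip_gradients(grads_and_vars, limit):
    """Clip each gradient to [-limit, limit], passing None gradients through.

    Hypothetical helper: gradients here are plain floats, not tensors.
    """
    processed = []
    for grad, var in grads_and_vars:
        if grad is not None:
            grad = max(-limit, min(limit, grad))
        # The variable is always kept, even when it has no gradient.
        processed.append((grad, var))
    return processed

# Stand-ins for the (gradient, variable) pairs from compute_gradients().
pairs = [(3.5, "w"), (None, "frozen"), (-0.2, "b")]
clipped = clip_gradients(pairs, limit=1.0)
```

Note that the `None` gradient is preserved, since `compute_gradients()` may return `None` for a variable that does not affect the loss.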

Args
`loss` A `Tensor` containing the value to minimize.
`global_step` Optional `Variable` to increment by one after the variables have been updated.
`var_list` Optional list or tuple of `Variable` objects to update to minimize `loss`. Defaults to the list of variables collected in the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.
`gate_gradients` How to gate the computation of gradients. Can be `GATE_NONE`, `GATE_OP`, or `GATE_GRAPH`.
`aggregation_method` Specifies the method used to combine gradient terms. Valid values are defined in the class `AggregationMethod`.
`colocate_gradients_with_ops` If True, try colocating gradients with the corresponding op.
`name` Optional name for the returned operation.
`grad_loss` Optional. A `Tensor` holding the gradient computed for `loss`.

Returns
An Operation that updates the variables in `var_list`. If `global_step` was not `None`, that operation also increments `global_step`.

Raises
`ValueError` If some of the variables are not `Variable` objects.

#### Eager Compatibility

When eager execution is enabled, `loss` should be a Python function that takes no arguments and computes the value to be minimized. Minimization (and gradient computation) is done with respect to the elements of `var_list` if not None, else with respect to any trainable variables created during the execution of the `loss` function. `gate_gradients`, `aggregation_method`, `colocate_gradients_with_ops` and `grad_loss` are ignored when eager execution is enabled.

### `variables`

A list of variables which encode the current state of `Optimizer`.

Includes slot variables and additional global variables created by the optimizer in the current default graph.

Returns
A list of variables.

## Class Variables

• `GATE_GRAPH = 2`
• `GATE_NONE = 0`
• `GATE_OP = 1`