# tf.contrib.opt.ShampooOptimizer

## Class ShampooOptimizer

The Shampoo Optimizer

Inherits From: Optimizer

Variant of Adagrad using one preconditioner matrix per variable dimension. For details, see https://arxiv.org/abs/1802.09568

gbar is the time-weighted accumulated gradient:

gbar[t] = gbar_decay[t] * gbar[t-1] + gbar_weight[t] * g[t]

mat_gbar is the time-weighted accumulated gradient square, with one matrix per dimension j:

mat_gbar_j[t] = mat_gbar_decay[t] * mat_gbar_j[t-1] + mat_gbar_weight[t] * gg_j[t]

where, if g[t] = g_abcd, then gg_a[t] = g_abcd g_a'bcd (Einstein notation, so the repeated indices b, c, d are summed).
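As a rough illustration of the accumulator for mat_gbar, the following NumPy sketch computes gg_j by contracting the gradient with itself over every axis except j. The helper names `mode_statistics` and `update_mat_gbar` are hypothetical and constant decay and weight values are assumed; this is not the library's implementation.

```python
import numpy as np

def mode_statistics(g, j):
    """Contract g with itself over all axes except j, giving a (d_j, d_j) matrix."""
    other_axes = [axis for axis in range(g.ndim) if axis != j]
    return np.tensordot(g, g, axes=(other_axes, other_axes))

def update_mat_gbar(mat_gbar_j, g, j, decay=1.0, weight=1.0):
    """One step of mat_gbar_j[t] = decay * mat_gbar_j[t-1] + weight * gg_j[t]."""
    return decay * mat_gbar_j + weight * mode_statistics(g, j)

# A rank-3 gradient gets three accumulators, one per dimension.
rng = np.random.default_rng(0)
g = rng.standard_normal((3, 4, 5))
stats = [np.zeros((d, d)) for d in g.shape]
stats = [update_mat_gbar(m, g, j) for j, m in enumerate(stats)]
```

For a rank-2 gradient G this reduces to G Gᵀ for j = 0 and Gᵀ G for j = 1.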

#### Update rule:

w[t+1] = w[t] - learning_rate[t] * Prod_j mat_gbar_j[t]^(-alpha/n) gbar[t]

Here mat_gbar_j[t]^(-alpha/n) gbar[t] is a tensor contraction along the j'th dimension of gbar[t] with the first dimension of mat_gbar_j[t]^(-alpha/n), where alpha is a hyperparameter and n is the rank of the variable. Prod_j represents doing this contraction for all j in 0..n-1.
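The update rule can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the optimizer's actual code: `inverse_power` and `shampoo_step` are hypothetical names, the matrix power is taken with an eigendecomposition (`np.linalg.eigh`) as a stand-in for SVD on a symmetric matrix, and epsilon * I is added before taking the power.

```python
import numpy as np

def inverse_power(mat, power, epsilon=1e-4):
    """Return (mat + epsilon * I)^(-power) for a symmetric PSD matrix."""
    dim = mat.shape[0]
    eigvals, eigvecs = np.linalg.eigh(mat + epsilon * np.eye(dim))
    # Scale each eigenvector column by its eigenvalue raised to -power.
    return (eigvecs * eigvals ** (-power)) @ eigvecs.T

def shampoo_step(w, gbar, mat_gbars, learning_rate=1.0, alpha=0.5, epsilon=1e-4):
    """w - learning_rate * Prod_j mat_gbar_j^(-alpha/n) gbar."""
    n = gbar.ndim
    direction = gbar
    for j, mat in enumerate(mat_gbars):
        precond = inverse_power(mat, alpha / n, epsilon)
        # Contract axis j of the direction with the first axis of the preconditioner.
        contracted = np.tensordot(direction, precond, axes=(j, 0))
        # tensordot puts the new axis last, so move it back to position j.
        direction = np.moveaxis(contracted, -1, j)
    return w - learning_rate * direction

rng = np.random.default_rng(0)
g = rng.standard_normal((3, 4))
w = np.zeros((3, 4))
mat_gbars = [g @ g.T, g.T @ g]
w_next = shampoo_step(w, g, mat_gbars, learning_rate=0.1)
```

With the default alpha = 0.5 and a rank-2 variable, each preconditioner is raised to the power -1/4, so the step direction is L^(-1/4) G R^(-1/4).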

Typically learning_rate is constant, but it can be made time-dependent by passing a lambda function of the step.

## __init__

__init__(
global_step=0,
max_matrix_size=768,
gbar_decay=0.0,
gbar_weight=1.0,
mat_gbar_decay=1.0,
mat_gbar_weight=1.0,
learning_rate=1.0,
svd_interval=1,
precond_update_interval=1,
epsilon=0.0001,
alpha=0.5,
use_iterative_root=False,
use_locking=False,
name='Shampoo'
)

Default values of the various hyper-parameters.

gbar_decay, gbar_weight, etc. can each be a float or a time-varying parameter. For time-varying parameters use e.g. "lambda T: T / (T + 1.0)", where the expression in the lambda is a TensorFlow expression.
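To see what a time-varying parameter does, here is a plain-Python sketch that uses floats instead of TensorFlow tensors. The `resolve` helper is hypothetical, and pairing the example decay with `gbar_weight = lambda T: 1.0 / (T + 1.0)` is an assumption chosen for illustration; with that pairing gbar becomes the running mean of the gradients.

```python
def resolve(param, step):
    """Return param(step) for a callable parameter, or the constant itself."""
    return param(step) if callable(param) else param

gbar_decay = lambda T: T / (T + 1.0)    # example schedule from the text above
gbar_weight = lambda T: 1.0 / (T + 1.0)  # assumed companion weight

gradients = [1.0, 2.0, 3.0]
gbar = 0.0
for step, g in enumerate(gradients):
    # gbar[t] = gbar_decay[t] * gbar[t-1] + gbar_weight[t] * g[t]
    gbar = resolve(gbar_decay, step) * gbar + resolve(gbar_weight, step) * g
```

The same `resolve` call also accepts a plain float such as the default `gbar_decay=0.0`.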

#### Args:

• global_step: tensorflow variable indicating the step.
• max_matrix_size: We do not perform SVD for matrices larger than this.
• gbar_decay, gbar_weight: Used to update gbar: gbar[t] = gbar_decay[t] * gbar[t-1] + gbar_weight[t] * g[t]
• mat_gbar_decay, mat_gbar_weight: Used to update mat_gbar: mat_gbar_j[t] = mat_gbar_decay[t] * mat_gbar_j[t-1] + mat_gbar_weight[t] * gg_j[t]
• learning_rate: Similar to SGD
• svd_interval: We should do SVD after this many steps. Default = 1, i.e. every step. Usually 20 leads to no loss of accuracy, and 50 or 100 is also OK. You may also want to do it more often early in training and less often later; set this in the caller, for example: "svd_interval = lambda T: tf.cond(T < 2000, lambda: 20.0, lambda: 1000.0)"
• precond_update_interval: We should update the preconditioners after this many steps. Default = 1. Usually less than svd_interval.
• epsilon: epsilon * I_n is added to each mat_gbar_j for stability for non-diagonal version of shampoo.
• alpha: total power of the preconditioners.
• use_iterative_root: Whether the optimizer should use the iterative root method (for TPU) instead of SVD (faster) to find the roots of PSD matrices.
• use_locking: If True, use locks for update operations.
• name: name of optimizer.
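The staged svd_interval schedule can be sketched in plain Python, with an ordinary conditional standing in for tf.cond. The `should_run_svd` helper and its modulo test are assumptions made for illustration; the source does not state how the optimizer decides that the interval has elapsed.

```python
def svd_interval(T):
    """Every 20 steps for the first 2000 steps, every 1000 steps afterwards."""
    return 20.0 if T < 2000 else 1000.0

def should_run_svd(T, interval):
    """Assumed trigger: recompute when the step is a multiple of the interval."""
    return T % interval(T) == 0

# Steps in the first 100 at which the SVD would be recomputed under this assumption.
early_steps = [T for T in range(100) if should_run_svd(T, svd_interval)]
```

Under this assumption the decomposition runs five times in the first 100 steps, then far less often once the step passes 2000.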

## Methods

### apply_gradients

apply_gradients(
grads_and_vars,
global_step=None,
name=None
)

This is the second part of minimize(). It returns an Operation that applies gradients.

#### Args:

• grads_and_vars: List of (gradient, variable) pairs as returned by compute_gradients().
• global_step: Optional Variable to increment by one after the variables have been updated.
• name: Optional name for the returned operation. Default to the name passed to the Optimizer constructor.

#### Returns:

An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.

#### Raises:

• TypeError: If grads_and_vars is malformed.
• ValueError: If none of the variables have gradients.
• RuntimeError: If you should use _distributed_apply() instead.

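### compute_gradients

compute_gradients(
loss,
var_list=None,
gate_gradients=GATE_OP,
aggregation_method=None,
colocate_gradients_with_ops=False,
grad_loss=None
)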

Compute gradients of loss for the variables in var_list.

This is the first part of minimize(). It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.

#### Args:

• loss: A Tensor containing the value to minimize or a callable taking no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.
• var_list: Optional list or tuple of tf.Variable to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
• gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.
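• aggregation_method: Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.
• colocate_gradients_with_ops: If True, try colocating gradients with the corresponding op.
• grad_loss: Optional. A Tensor holding the gradient computed for loss.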

#### Returns:

A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.

#### Raises:

• TypeError: If var_list contains anything else than Variable objects.
• ValueError: If some arguments are invalid.
• RuntimeError: If called with eager execution enabled and loss is not callable.
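Because compute_gradients() returns (gradient, variable) pairs in which the gradient may be None, any processing done before apply_gradients() has to skip those entries. The sketch below illustrates the pattern with NumPy arrays and strings as stand-ins for tensors and variables. The `clip_gradients` helper is hypothetical, and real code would use the TensorFlow equivalents of these operations.

```python
import numpy as np

def clip_gradients(grads_and_vars, clip_value):
    """Clip each gradient to [-clip_value, clip_value], dropping None gradients."""
    clipped = []
    for grad, var in grads_and_vars:
        if grad is None:
            # No gradient for this variable, so there is nothing to process.
            continue
        clipped.append((np.clip(grad, -clip_value, clip_value), var))
    return clipped

# Stand-in pairs shaped like the output of compute_gradients().
grads_and_vars = [(np.array([2.0, -3.0, 0.5]), "weights"), (None, "unused_bias")]
processed = clip_gradients(grads_and_vars, clip_value=1.0)
```

The resulting list keeps the (gradient, variable) pairing, which is the form apply_gradients() expects.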

### get_name

get_name()

### get_slot

get_slot(
var,
name
)

Return a slot named name created for var by the Optimizer.

Use get_slot_names() to get the list of slot names created by the Optimizer.

#### Args:

• var: A variable passed to minimize() or apply_gradients().
• name: A string.

#### Returns:

The Variable for the slot if it was created, None otherwise.

### get_slot_names

get_slot_names()

Return a list of the names of slots created by the Optimizer.

See get_slot().

#### Returns:

A list of strings.

### minimize

minimize(
loss,
global_step=None,
var_list=None,
gate_gradients=GATE_OP,
aggregation_method=None,
colocate_gradients_with_ops=False,
name=None,
grad_loss=None
)

Add operations to minimize loss by updating var_list.

#### Args:

• loss: A Tensor containing the value to minimize.
• global_step: Optional Variable to increment by one after the variables have been updated.
• var_list: Optional list or tuple of Variable objects to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
• gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.
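• aggregation_method: Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.
• colocate_gradients_with_ops: If True, try colocating gradients with the corresponding op.
• name: Optional name for the returned operation.
• grad_loss: Optional. A Tensor holding the gradient computed for loss.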

#### Returns:

An Operation that updates the variables in var_list. If global_step was not None, that operation also increments global_step.

#### Raises:

• ValueError: If some of the variables are not Variable objects.

#### Eager Compatibility

When eager execution is enabled, loss should be a Python function that takes no arguments and computes the value to be minimized. Minimization (and gradient computation) is done with respect to the elements of var_list if not None, else with respect to any trainable variables created during the execution of the loss function. gate_gradients, aggregation_method, colocate_gradients_with_ops and grad_loss are ignored when eager execution is enabled.

### variables

variables()

A list of variables which encode the current state of Optimizer.

Includes slot variables and additional global variables created by the optimizer in the current default graph.

#### Returns:

A list of variables.

## Class Members

• GATE_GRAPH = 2
• GATE_NONE = 0
• GATE_OP = 1