tf.contrib.kfac.optimizer.KfacOptimizer

Class KfacOptimizer

Inherits From: GradientDescentOptimizer

Defined in tensorflow/contrib/kfac/python/ops/optimizer.py.

The KFAC Optimizer (https://arxiv.org/abs/1503.05671).

Properties

cov_update_op

cov_update_ops

cov_update_thunks

damping

damping_adaptation_interval

inv_update_op

inv_update_ops

inv_update_thunks

variables

A list of variables which encode the current state of Optimizer.

Includes slot variables and additional global variables created by the optimizer in the current default graph.

Returns:

A list of variables.

Methods

__init__

__init__(
    learning_rate,
    cov_ema_decay,
    damping,
    layer_collection,
    var_list=None,
    momentum=0.9,
    momentum_type='regular',
    norm_constraint=None,
    name='KFAC',
    estimation_mode='gradients',
    colocate_gradients_with_ops=True,
    batch_size=None,
    placement_strategy=None,
    **kwargs
)

Initializes the KFAC optimizer with the given settings.

Args:

  • learning_rate: The base learning rate for the optimizer. Should probably be set to 1.0 when using momentum_type = 'qmodel', but can still be set lower if desired (effectively lowering the trust in the quadratic model).
  • cov_ema_decay: The decay factor used when calculating the covariance estimate moving averages.
  • damping: The damping factor used to stabilize training due to errors in the local approximation with the Fisher information matrix, and to regularize the update direction by making it closer to the gradient. If damping is adapted during training, this value is used to initialize the damping variable. (Higher damping means the update looks more like a standard gradient update; see Tikhonov regularization.)
  • layer_collection: The layer collection object, which holds the Fisher blocks, Kronecker factors, and losses associated with the graph. The layer_collection cannot be modified after KfacOptimizer's initialization.
  • var_list: Optional list or tuple of variables to train. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
  • momentum: The momentum decay constant to use. Only applies when momentum_type is 'regular' or 'adam'. (Default: 0.9)
  • momentum_type: The type of momentum to use in this optimizer, one of 'regular', 'adam', or 'qmodel'. (Default: 'regular')
  • norm_constraint: float or Tensor. If specified, the update is scaled down so that its approximate squared Fisher norm v^T F v is at most the specified value. May only be used with momentum type 'regular'. (Default: None)
  • name: The name for this optimizer. (Default: 'KFAC')
  • estimation_mode: The type of estimator to use for the Fishers. Can be 'gradients', 'empirical', 'curvature_propagation', or 'exact'. (Default: 'gradients') See the docstring for FisherEstimator for a more detailed description of these options.
  • colocate_gradients_with_ops: Whether we should request gradients we compute in the estimator be colocated with their respective ops. (Default: True)
  • batch_size: The size of the mini-batch. Only needed when momentum_type == 'qmodel' or when automatic adjustment is used. (Default: None)
  • placement_strategy: string, Device placement strategy used when creating covariance variables, covariance ops, and inverse ops. (Default: None)
  • **kwargs: Arguments to be passed to the specific placement strategy mixin. See placement.RoundRobinPlacementMixin for an example.
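
The effect of the damping argument can be illustrated with a minimal sketch. It assumes a diagonal stand-in for the Fisher matrix rather than the Kronecker-factored blocks KFAC actually uses, and damped_update and cosine are hypothetical helpers, not library functions. The point is only that as damping grows, the damped direction (F + damping * I)^-1 g becomes nearly parallel to the plain gradient g.

```python
import math

def damped_update(fisher_diag, grad, damping):
    """Solve (F + damping * I) v = grad for a diagonal F (illustrative only)."""
    return [g / (f + damping) for f, g in zip(fisher_diag, grad)]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

fisher_diag = [100.0, 1.0]   # hypothetical, badly conditioned curvature
grad = [1.0, 1.0]

low = damped_update(fisher_diag, grad, damping=1e-3)
high = damped_update(fisher_diag, grad, damping=1e3)

# Small damping: the update points in a clearly different direction from grad.
# Large damping: the update is nearly parallel to grad (cosine close to 1).
cos_low = cosine(low, grad)
cos_high = cosine(high, grad)
```

Comparing cos_low and cos_high shows the trade-off: low damping trusts the curvature estimate, high damping falls back towards a scaled gradient step.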

Raises:

  • ValueError: If the momentum type is unsupported.
  • ValueError: If clipping is used with a momentum type other than 'regular'.
  • ValueError: If no losses have been registered with layer_collection.
  • ValueError: If momentum is non-zero and momentum_type is not 'regular' or 'adam'.
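
The clipping referred to above is the norm_constraint scaling, which keeps the approximate squared Fisher norm v^T F v of the update within a bound. The following is a minimal sketch of that kind of scaling, assuming a diagonal stand-in for F and a hypothetical helper clip_to_fisher_norm; the library's own computation may differ in detail.

```python
import math

def clip_to_fisher_norm(update, fisher_diag, norm_constraint):
    """Scale `update` down so that v^T F v <= norm_constraint (diagonal F)."""
    sq_norm = sum(f * v * v for f, v in zip(fisher_diag, update))
    if sq_norm <= norm_constraint:
        return list(update)  # already within the constraint, leave unchanged
    # v^T F v is quadratic in v, so scale by the square root of the ratio.
    coeff = math.sqrt(norm_constraint / sq_norm)
    return [coeff * v for v in update]

fisher_diag = [4.0, 1.0]     # hypothetical curvature values
update = [1.0, 2.0]          # v^T F v = 4*1 + 1*4 = 8
clipped = clip_to_fisher_norm(update, fisher_diag, norm_constraint=2.0)
clipped_sq_norm = sum(f * v * v for f, v in zip(fisher_diag, clipped))
```

The update is only ever scaled down; an update already inside the constraint is returned unchanged.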

apply_gradients

apply_gradients(
    grads_and_vars,
    *args,
    **kwargs
)

Applies gradients to variables.

Args:

  • grads_and_vars: List of (gradient, variable) pairs.
  • *args: Additional arguments for super.apply_gradients.
  • **kwargs: Additional keyword arguments for super.apply_gradients.

Returns:

An Operation that applies the specified gradients.

compute_gradients

compute_gradients(
    *args,
    **kwargs
)

Compute gradients of loss for the variables in var_list.

This is the first part of minimize(). It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.

Args:

  • loss: A Tensor containing the value to minimize or a callable taking no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.
  • var_list: Optional list or tuple of tf.Variable to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
  • gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.
  • aggregation_method: Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.
  • colocate_gradients_with_ops: If True, try colocating gradients with the corresponding op.
  • grad_loss: Optional. A Tensor holding the gradient computed for loss.

Returns:

A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.

Raises:

  • TypeError: If var_list contains anything other than Variable objects.
  • ValueError: If some arguments are invalid.
  • RuntimeError: If called with eager execution enabled and loss is not callable.

Eager Compatibility

When eager execution is enabled, gate_gradients, aggregation_method, and colocate_gradients_with_ops are ignored.

create_ops_and_vars_thunks

create_ops_and_vars_thunks()

Create thunks that make the ops and vars on demand.

This function returns 4 lists of thunks: cov_variable_thunks, cov_update_thunks, inv_variable_thunks, and inv_update_thunks.

The length of each list is the number of factors and the i-th element of each list corresponds to the i-th factor (given by the "factors" property).

Note that the execution of these thunks must happen in a certain partial order. The i-th element of cov_variable_thunks must execute before the i-th element of cov_update_thunks (and also the i-th element of inv_update_thunks). Similarly, the i-th element of inv_variable_thunks must execute before the i-th element of inv_update_thunks.

TL;DR (oversimplified): Execute the thunks according to the order that they are returned.

Returns:

  • cov_variable_thunks: A list of thunks that make the cov variables.
  • cov_update_thunks: A list of thunks that make the cov update ops.
  • inv_variable_thunks: A list of thunks that make the inv variables.
  • inv_update_thunks: A list of thunks that make the inv update ops.
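
The ordering constraint can be sketched in plain Python, with thunks modelled as zero-argument callables. The factory make_thunks and the execution log are hypothetical stand-ins, not part of the library. The sketch only shows that running the four lists in the order they are returned satisfies the partial order described above.

```python
def make_thunks(num_factors, log):
    """Build four lists of zero-argument callables that record their execution."""
    def recorder(kind, i):
        return lambda: log.append((kind, i))
    kinds = ["cov_variable", "cov_update", "inv_variable", "inv_update"]
    return tuple([recorder(kind, i) for i in range(num_factors)] for kind in kinds)

log = []
thunk_lists = make_thunks(num_factors=2, log=log)

# Execute list by list, in the order the lists are returned.
for thunks in thunk_lists:
    for thunk in thunks:
        thunk()

def ran_before(first, second):
    """True if `first` was executed earlier than `second`."""
    return log.index(first) < log.index(second)

# Check the three ordering requirements for every factor index.
order_ok = all(
    ran_before(("cov_variable", i), ("cov_update", i))
    and ran_before(("cov_variable", i), ("inv_update", i))
    and ran_before(("inv_variable", i), ("inv_update", i))
    for i in range(2)
)
```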

get_name

get_name()

get_slot

get_slot(
    var,
    name
)

Return a slot named name created for var by the Optimizer.

Some Optimizer subclasses use additional variables. For example Momentum and Adagrad use variables to accumulate updates. This method gives access to these Variable objects if for some reason you need them.

Use get_slot_names() to get the list of slot names created by the Optimizer.

Args:

  • var: A variable passed to minimize() or apply_gradients().
  • name: A string.

Returns:

The Variable for the slot if it was created, None otherwise.

get_slot_names

get_slot_names()

Return a list of the names of slots created by the Optimizer.

See get_slot().

Returns:

A list of strings.

make_ops_and_vars

make_ops_and_vars()

Make ops and vars with device placement self._placement_strategy.

See FisherEstimator.make_ops_and_vars for details.

Returns:

  • cov_update_ops: List of ops that compute the cov updates. Corresponds one-to-one with the list of factors given by the "factors" property.
  • cov_update_op: cov_update_ops grouped into a single op.
  • inv_update_ops: List of ops that compute the inv updates. Corresponds one-to-one with the list of factors given by the "factors" property.
  • inv_update_op: inv_update_ops grouped into a single op.

make_vars_and_create_op_thunks

make_vars_and_create_op_thunks()

Make vars and create op thunks.

Returns:

  • cov_update_thunks: List of cov update thunks. Corresponds one-to-one with the list of factors given by the "factors" property.
  • inv_update_thunks: List of inv update thunks. Corresponds one-to-one with the list of factors given by the "factors" property.

minimize

minimize(
    *args,
    **kwargs
)

Add operations to minimize loss by updating var_list.

This method simply combines calls to compute_gradients() and apply_gradients(). If you want to process the gradients before applying them, call compute_gradients() and apply_gradients() explicitly instead of using this function.
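
The two-step pattern can be sketched without TensorFlow. ToySGD below is a hypothetical stand-in written for this illustration, not a TensorFlow class, and its method signatures are simplified assumptions. It shows gradients being computed, clipped, and then applied, which is the kind of processing a single minimize() call does not allow.

```python
class ToySGD:
    """Hypothetical stand-in optimizer, not a TensorFlow class."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def compute_gradients(self, grad_fn, variables):
        # Returns (gradient, variable_name) pairs, mirroring the two-step API.
        grads = grad_fn(variables)
        return [(grads[name], name) for name in variables]

    def apply_gradients(self, grads_and_vars, variables):
        # Plain gradient descent step on each named variable.
        for grad, name in grads_and_vars:
            variables[name] -= self.learning_rate * grad

def grad_fn(variables):
    # Gradient of loss = w**2 + b**2 with respect to each variable.
    return {name: 2.0 * value for name, value in variables.items()}

variables = {"w": 10.0, "b": 0.5}
opt = ToySGD(learning_rate=0.1)

grads_and_vars = opt.compute_gradients(grad_fn, variables)
# Process the gradients between the two calls: clip each one to [-1, 1].
clipped = [(max(-1.0, min(1.0, g)), name) for g, name in grads_and_vars]
opt.apply_gradients(clipped, variables)
```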

Args:

  • loss: A Tensor containing the value to minimize.
  • global_step: Optional Variable to increment by one after the variables have been updated.
  • var_list: Optional list or tuple of Variable objects to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
  • gate_gradients: How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.
  • aggregation_method: Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.
  • colocate_gradients_with_ops: If True, try colocating gradients with the corresponding op.
  • name: Optional name for the returned operation.
  • grad_loss: Optional. A Tensor holding the gradient computed for loss.

Returns:

An Operation that updates the variables in var_list. If global_step was not None, that operation also increments global_step.

Raises:

  • ValueError: If some of the variables are not Variable objects.

Eager Compatibility

When eager execution is enabled, loss should be a Python function that takes elements of var_list as arguments and computes the value to be minimized. If var_list is None, loss should take no arguments. Minimization (and gradient computation) is done with respect to the elements of var_list if not None, else with respect to any trainable variables created during the execution of the loss function. gate_gradients, aggregation_method, colocate_gradients_with_ops and grad_loss are ignored when eager execution is enabled.

set_damping_adaptation_params

set_damping_adaptation_params(
    is_chief,
    prev_train_batch,
    loss_fn,
    min_damping=1e-05,
    damping_adaptation_decay=0.99,
    damping_adaptation_interval=5
)

Sets parameters required to adapt damping during training.

When called, enables damping adaptation according to the Levenberg-Marquardt style rule described in Section 6.5 of "Optimizing Neural Networks with Kronecker-factored Approximate Curvature".

Note that this function creates Tensorflow variables which store a few scalars and are accessed by the ops which update the damping (as part of the training op returned by the minimize() method).

Args:

  • is_chief: Boolean, True if the worker is chief.
  • prev_train_batch: Training data used to minimize loss in the previous step. This will be used to evaluate loss by calling loss_fn(prev_train_batch).
  • loss_fn: function that takes as input training data tensor and returns a scalar loss.
  • min_damping: float(Optional), Minimum value the damping parameter can take. Default value 1e-5.
  • damping_adaptation_decay: float(Optional), The damping parameter is multiplied by damping_adaptation_decay every damping_adaptation_interval iterations. Default value 0.99.
  • damping_adaptation_interval: int(Optional), Number of steps in between updating the damping parameter. Default value 5.

Raises:

  • ValueError: If set_damping_adaptation_params has already been called and adapt_damping is True.
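
A Levenberg-Marquardt style adjustment can be sketched as follows. The helper adapt_damping is hypothetical, and the thresholds 0.25 and 0.75 are assumptions taken from the rule as described in the KFAC paper; the ops the optimizer actually builds may differ. Here rho is the reduction ratio, the observed change in loss divided by the change predicted by the quadratic model, and factor is a shrink factor in (0, 1).

```python
def adapt_damping(damping, rho, factor, min_damping):
    """One Levenberg-Marquardt style adjustment (illustrative sketch).

    rho: observed loss change divided by the change the quadratic model predicted.
    factor: shrink factor in (0, 1) applied once per adaptation step.
    """
    if rho > 0.75:
        damping = damping * factor   # model is trustworthy: reduce damping
    elif rho < 0.25:
        damping = damping / factor   # model is poor: increase damping
    # Never let damping fall below the configured floor.
    return max(damping, min_damping)

shrunk = adapt_damping(1.0, rho=0.9, factor=0.5, min_damping=1e-5)
grown = adapt_damping(1.0, rho=0.1, factor=0.5, min_damping=1e-5)
unchanged = adapt_damping(1.0, rho=0.5, factor=0.5, min_damping=1e-5)
floored = adapt_damping(1e-5, rho=0.9, factor=0.5, min_damping=1e-5)
```

The floor corresponds to the role of min_damping: without it, a long run of good predictions could drive damping towards zero.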

Class Members

GATE_GRAPH

GATE_NONE

GATE_OP