
Convergence criterion based on inner products between successive gradients.

Inherits From: `ConvergenceCriterion`

```
tfp.optimizer.convergence_criteria.SuccessiveGradientsAreUncorrelated(
    window_size=10, min_num_steps=20, name=None
)
```
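
For example, the criterion can be passed to `tfp.math.minimize` to stop a stochastic optimization early. Below is a minimal sketch, assuming a TFP version in which `tfp.math.minimize` accepts a `convergence_criterion` argument; the noisy quadratic loss is purely illustrative:

```python
import tensorflow as tf
import tensorflow_probability as tfp

x = tf.Variable([0., 0., 0.])
target = tf.constant([1., -2., 3.])

# Gaussian noise stands in for minibatch gradient noise, so constant-step-size
# SGD reaches a stationary distribution at which the criterion can fire.
losses = tfp.math.minimize(
    loss_fn=lambda: tf.reduce_sum(
        (x - target + 0.1 * tf.random.normal([3]))**2),
    num_steps=1000,  # Upper bound; optimization may stop earlier.
    optimizer=tf.optimizers.SGD(learning_rate=0.05),
    convergence_criterion=(
        tfp.optimizer.convergence_criteria.SuccessiveGradientsAreUncorrelated(
            window_size=10, min_num_steps=20)))
```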

Let `g[t]` be the gradient vector at step `t`, and `g[t-1]` the previous
gradient. Their inner product:

```
grad_inner_product[t] = sum_i(g[t, i] * g[t - 1, i])
```

measures correlation between successive optimization steps. We expect this to be positive if the optimization is making progress; conversely, it can be shown to be negative in expectation at the stationary distribution of constant-step-size SGD [(Pflug, 1990)][2].
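
Concretely, with gradients stored as a list of per-variable `Tensor`s, the statistic could be computed as follows (an illustrative helper, not part of the API; note the sum runs over all variables, which is also why `has_converged` is scalar, as discussed below):

```python
import tensorflow as tf

def grad_inner_product(grads_t, grads_tm1):
  # Elementwise product summed over every element of every variable's
  # gradient, yielding a single scalar per step.
  return sum(tf.reduce_sum(g * g_prev)
             for g, g_prev in zip(grads_t, grads_tm1))
```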

This criterion detects convergence when an exponentially-weighted moving
average of `grad_inner_product` becomes negative; intuitively, when there has
been no consistent direction to the most recent `window_size` steps.

Theoretical analysis shows that with no decay (`window_size=np.inf`), this
rule stops in finite time almost surely, for constant-step-size SGD under
standard assumptions ([Pflug, 1990][2]; [Chee and Toulis, 2017][1]). In
practice, it is often more efficient to use a decaying moving average.
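
To make the rule concrete, here is a small NumPy re-implementation of the decayed statistic described above (an assumption based on the formula in the Args section, not the actual TFP internals):

```python
import numpy as np

def stopping_step(inner_products, window_size=10, min_num_steps=20):
  """Returns the first step at which the rule fires, else None."""
  decay = 1. - 1. / window_size  # window_size=np.inf gives decay=1 (no decay).
  moving_average = 0.
  for t, ip in enumerate(inner_products, start=1):
    moving_average = ip + decay * moving_average
    if t >= min_num_steps and moving_average < 0.:
      return t
  return None
```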

**Batch semantics**: because this criterion does not depend on the loss,
vector-valued losses will not produce vector-valued convergence indicators.
Instead, the returned `has_converged` is always scalar, and is computed from
the inner product summed across gradients from all variables being
optimized.

TODO(b/150151919): If per-batch convergence indicators are important to you,
please contact `tfprobability@tensorflow.org`.

#### References

[1] Jerry Chee and Panos Toulis. Convergence diagnostics for stochastic gradient descent with constant step size. _arXiv preprint arXiv:1710.06382_, 2017. https://arxiv.org/abs/1710.06382

[2] Georg Ch. Pflug. Non-asymptotic confidence bounds for stochastic approximation algorithms with constant step size. _Monatshefte für Mathematik_, 110(3-4), pp. 297-314, 1990.

#### Args:

* `window_size`: int `Tensor` effective window size for the moving average. The
  moving average inner product is computed as `moving_average[t] =
  grad_inner_product[t] + decay * moving_average[t - 1]`, where `decay = 1. -
  1. / window_size`. The non-decaying (`decay = 1.`) setting can therefore be
  recovered by passing `window_size=np.inf`. Default value: `10`.
* `min_num_steps`: int `Tensor` minimum number of steps before convergence. The
  criterion will not return `has_converged=True` until `step >= min_num_steps`.
  This should generally be a larger value than `window_size`. Default
  value: `20`.
* `name`: optional Python `str` name prefixed to ops created by this class.

#### Attributes:

* `min_num_steps`
* `name`
* `window_size`

## Methods

### `bootstrap`

```
bootstrap(
    loss, grads, parameters
)
```

Returns a structure of `Tensor`s for the rule's state at step 0.

The shape of the `Tensor`s specifying `loss`, `grads`, and `parameters` may
optionally be prefixed by one or more batch dimension(s).

#### Args:

* `loss`: float `Tensor` initial value of loss being optimized.
* `grads`: list of float `Tensor` gradients of `loss` wrt `parameters`.
* `parameters`: list of float `Tensor` initial values of parameters being
  optimized.

#### Returns:

* `initial_auxiliary_state`: (Structure of) `Tensor`(s) representing the
  initial auxiliary state carried forward by this criterion.
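
For instance, under eager execution the initial state might be obtained as follows (an illustrative sketch; the variable and loss are hypothetical):

```python
import tensorflow as tf
import tensorflow_probability as tfp

criterion = tfp.optimizer.convergence_criteria.SuccessiveGradientsAreUncorrelated()
x = tf.Variable([0., 0.])
with tf.GradientTape() as tape:
  loss = tf.reduce_sum((x - 1.)**2)
grads = tape.gradient(loss, [x])
auxiliary_state = criterion.bootstrap(loss, grads, [x])
```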

### `one_step`

```
one_step(
    step, loss, grads, parameters, auxiliary_state
)
```

Updates tracked quantities for a new step, and determines if converged.

The shape of the `Tensor`s specifying `loss`, `grads`, and `parameters` may
optionally be prefixed by one or more batch dimension(s). In this case,
the returned value `has_converged` will have shape equal to the broadcast
batch shape of whichever of those quantities is used by this convergence
criterion, and the quantities defining the convergence criterion
(`min_num_steps`, etc.).

#### Args:

* `step`: integer `Tensor` index of the current step, where `step >= 1` (on
  step `0`, `bootstrap` should be called instead).
* `loss`: float `Tensor` value of loss at the current step.
* `grads`: list of float `Tensor` gradients of `loss` wrt `parameters`.
* `parameters`: list of float `Tensor` current values of parameters being
  optimized.
* `auxiliary_state`: the (structure of) `Tensor`(s) containing state carried
  forward from the previous step.

#### Returns:

* `has_converged`: boolean `Tensor` indicating whether the optimization has
  converged.
* `updated_auxiliary_state`: (Structure of) `Tensor`(s) representing updated
  quantities tracked by the convergence criterion. This should match the
  structure of the value returned by `bootstrap`.
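
Putting `bootstrap` and `one_step` together, a hand-rolled training loop might look like the following sketch (eager mode; the noisy quadratic loss and step budget are illustrative assumptions):

```python
import tensorflow as tf
import tensorflow_probability as tfp

criterion = tfp.optimizer.convergence_criteria.SuccessiveGradientsAreUncorrelated(
    window_size=10, min_num_steps=20)
x = tf.Variable(0.)
optimizer = tf.optimizers.SGD(learning_rate=0.1)

def loss_and_grads():
  with tf.GradientTape() as tape:
    # Noise stands in for minibatch gradients, so that constant-step-size
    # SGD reaches a stationary distribution and the criterion can fire.
    loss = (x - 3. + 0.1 * tf.random.normal([]))**2
  return loss, tape.gradient(loss, [x])

loss, grads = loss_and_grads()
auxiliary_state = criterion.bootstrap(loss, grads, [x])
for step in range(1, 1001):
  optimizer.apply_gradients(zip(grads, [x]))
  loss, grads = loss_and_grads()
  has_converged, auxiliary_state = criterion.one_step(
      step, loss, grads, [x], auxiliary_state)
  if has_converged:
    break
```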