Adapts the inner kernel's `step_size` based on `log_accept_prob`.

Inherits From: `TransitionKernel`
```python
tfp.substrates.jax.mcmc.DualAveragingStepSizeAdaptation(
    inner_kernel,
    num_adaptation_steps,
    target_accept_prob=0.75,
    exploration_shrinkage=0.05,
    shrinkage_target=None,
    step_count_smoothing=10,
    decay_rate=0.75,
    step_size_setter_fn=hmc_like_step_size_setter_fn,
    step_size_getter_fn=hmc_like_step_size_getter_fn,
    log_accept_prob_getter_fn=hmc_like_log_accept_prob_getter_fn,
    reduce_fn=tfp.substrates.jax.math.reduce_logmeanexp,
    experimental_reduce_chain_axis_names=None,
    validate_args=False,
    name=None
)
```
The dual averaging policy uses a noisy step size for exploration, while
averaging over tuning steps to provide a smoothed estimate of an optimal
value. It is based on [section 3.2 of Hoffman and Gelman (2013)][1], which
modifies the [stochastic convex optimization scheme of Nesterov (2009)][2].
The modified algorithm applies extra weight to recent iterations while
keeping the convergence guarantees of Robbins-Monro, and takes care not
to make the step size too small too quickly when maintaining a constant
trajectory length, to avoid expensive early iterations. A good target
acceptance probability depends on the inner kernel. If this kernel is
`HamiltonianMonteCarlo`, then 0.6-0.9 is a good range to aim for. For
`RandomWalkMetropolis` this should be closer to 0.25. See the individual
kernels' docstrings for guidance.

In general, adaptation prevents the chain from reaching a stationary
distribution, so obtaining consistent samples requires `num_adaptation_steps`
to be set to a value somewhat smaller than the number of burnin steps.
However, it may sometimes be helpful to set `num_adaptation_steps` to a larger
value during development in order to inspect the behavior of the chain during
adaptation.
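The recursion from section 3.2 of [1] can be sketched in plain Python. The mapping of this kernel's arguments onto Hoffman & Gelman's symbols (`exploration_shrinkage` ≈ γ, `step_count_smoothing` ≈ t₀, `decay_rate` ≈ κ, and a `shrinkage_target` of `log(10 * init_step_size)` when left as `None`) is an informal reading of the defaults, not the exact TFP implementation:

```python
import math

def dual_averaging_step_sizes(accept_probs, init_step_size=1.0,
                              target_accept_prob=0.75,
                              exploration_shrinkage=0.05,   # gamma in [1]
                              step_count_smoothing=10,      # t0 in [1]
                              decay_rate=0.75):             # kappa in [1]
    """Sketch of the dual averaging policy of Hoffman & Gelman, sec. 3.2.

    `accept_probs` stands in for the acceptance probabilities the chain
    would actually produce at each noisy step size.
    """
    mu = math.log(10.0 * init_step_size)  # shrinkage target
    h_bar = 0.0
    log_avg_step = math.log(init_step_size)
    noisy, averaged = [], []
    for t, alpha in enumerate(accept_probs, start=1):
        # Running average of the acceptance-probability error.
        eta = 1.0 / (t + step_count_smoothing)
        h_bar = (1.0 - eta) * h_bar + eta * (target_accept_prob - alpha)
        # Noisy step size used for exploration, shrunk toward mu.
        log_step = mu - math.sqrt(t) / exploration_shrinkage * h_bar
        # Polynomially decaying average gives the smoothed estimate.
        w = t ** (-decay_rate)
        log_avg_step = w * log_step + (1.0 - w) * log_avg_step
        noisy.append(math.exp(log_step))
        averaged.append(math.exp(log_avg_step))
    return noisy, averaged
```

Feeding persistently low acceptance probabilities drives the averaged step size down, and persistently high ones drive it up, which is the adaptation behavior described above.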
The step size is assumed to broadcast with the chain state, potentially having
leading dimensions corresponding to multiple chains. When there are fewer of
those leading dimensions than there are chain dimensions, the corresponding
dimensions in the `log_accept_prob` are averaged (in the direct space, rather
than the log space) before being used to adjust the step size. This means that
this kernel can do either cross-chain adaptation or per-chain step size
adaptation, depending on the shape of the step size.
For example, if your problem has a state with shape `[S]`, your chain state
has shape `[C0, C1, S]` (meaning that there are `C0 * C1` total chains) and
`log_accept_prob` has shape `[C0, C1]` (one acceptance probability per chain),
then depending on the shape of the step size, the following will happen:
- Step size has shape `[]`, `[S]` or `[1]`: the `log_accept_prob` will be
  averaged across its `C0` and `C1` dimensions. This means that you will learn
  a shared step size based on the mean acceptance probability across all
  chains. This can be useful if you don't have a lot of steps to adapt and
  want to average away the noise.
- Step size has shape `[C1, 1]` or `[C1, S]`: the `log_accept_prob` will be
  averaged across its `C0` dimension. This means that you will learn a shared
  step size based on the mean acceptance probability across chains that share
  the coordinate across the `C1` dimension. This can be useful when the `C1`
  dimension indexes different distributions, while `C0` indexes replicas of a
  single distribution, all sampled in parallel.
- Step size has shape `[C0, C1, 1]` or `[C0, C1, S]`: no averaging will
  happen. This means that each chain will learn its own step size. This can be
  useful when all chains are sampling from different distributions. Even when
  all chains are for the same distribution, this can help during the initial
  warmup period.
- Step size has shape `[C0, 1, 1]` or `[C0, 1, S]`: the `log_accept_prob`
  will be averaged across its `C1` dimension. This means that you will learn a
  shared step size based on the mean acceptance probability across chains that
  share the coordinate across the `C0` dimension. This can be useful when the
  `C0` dimension indexes different distributions, while `C1` indexes replicas
  of a single distribution, all sampled in parallel.
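The shape rules above can be sketched in numpy. This is an illustrative reimplementation, not the kernel's actual code (which delegates the reduction to `reduce_fn`); `average_accept_prob` is a hypothetical helper, and it assumes the non-broadcasting chain dimensions `C0` and `C1` are larger than 1:

```python
import numpy as np

def average_accept_prob(log_accept_prob, step_size_shape):
    """Sketch: average log_accept_prob over broadcast chain dimensions.

    `log_accept_prob` has one entry per chain (shape [C0, C1]). Any chain
    dimension that the step size does not resolve (because the step size
    is missing there, or has size 1) is averaged over in direct
    (probability) space, matching reduce_logmeanexp in log space.
    """
    chain_shape = log_accept_prob.shape
    # Drop the trailing state dimension of the step size, then left-pad
    # its chain dimensions with broadcasting 1s.
    step_chain_shape = step_size_shape[:-1] if step_size_shape else ()
    pad = (1,) * (len(chain_shape) - len(step_chain_shape))
    step_chain_shape = pad + step_chain_shape
    # Average over every chain axis where the step size broadcasts.
    reduce_axes = tuple(i for i, n in enumerate(step_chain_shape) if n == 1)
    probs = np.exp(log_accept_prob)
    return np.log(probs.mean(axis=reduce_axes, keepdims=True))
```

With `log_accept_prob` of shape `[3, 4]`, a scalar step size averages over both chain axes, a `[4, 1]` step size averages over `C0` only, and a `[3, 4, 1]` step size averages over nothing.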
By default, the averaging function used above is the arithmetic mean, which is
not robust to stuck chains (e.g. the average of one chain with `p_accept = 0`
and three chains with `p_accept = 1` will result in an average
`p_accept = 0.75`, which will cause this kernel to keep the step size roughly
the same rather than reducing it to unstick the stuck chain). A more robust
choice would be to set the `reduce_fn` argument to
`tfp.math.reduce_log_harmonic_mean_exp` [4]. Note, however, that the harmonic
mean of a set of numbers is usually smaller than the arithmetic mean, so its
use will typically produce smaller than optimal step sizes even for
well-behaved target distributions.
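The difference is easy to see numerically. This sketch uses plain numpy rather than the TFP reducers, but computes the same two quantities:

```python
import numpy as np

# Acceptance probabilities: one nearly stuck chain among three healthy ones.
p_accept = np.array([1e-3, 1.0, 1.0, 1.0])
log_p = np.log(p_accept)

# Arithmetic mean in direct space (what reduce_logmeanexp computes).
arith = np.exp(np.logaddexp.reduce(log_p) - np.log(len(log_p)))
# Harmonic mean: reciprocal of the mean of reciprocals.
harm = 1.0 / np.mean(1.0 / p_accept)

print(arith)  # ~0.75: the stuck chain is invisible to the average
print(harm)   # ~0.004: the stuck chain dominates, forcing the step size down
```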
Examples

```python
from tensorflow_probability.python.internal.backend import jax as tf
import tensorflow_probability as tfp; tfp = tfp.substrates.jax
tfd = tfp.distributions

target_log_prob_fn = tfd.Normal(loc=0., scale=1.).log_prob

num_burnin_steps = 500
num_results = 500
num_chains = 64
step_size = 0.1
# Or, if you want per-chain step size:
# step_size = tf.fill([num_chains], step_size)

kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_log_prob_fn,
    num_leapfrog_steps=2,
    step_size=step_size)
kernel = tfp.mcmc.DualAveragingStepSizeAdaptation(
    inner_kernel=kernel, num_adaptation_steps=int(num_burnin_steps * 0.8))

# The chain will be stepped for num_results + num_burnin_steps, adapting for
# the first num_adaptation_steps.
samples, [step_size, log_accept_ratio] = tfp.mcmc.sample_chain(
    num_results=num_results,
    num_burnin_steps=num_burnin_steps,
    current_state=tf.zeros(num_chains),
    kernel=kernel,
    trace_fn=lambda _, pkr: [pkr.inner_results.accepted_results.step_size,
                             pkr.inner_results.log_accept_ratio])

# ~0.75
p_accept = tf.math.exp(tfp.math.reduce_logmeanexp(tf.minimum(
    log_accept_ratio, 0.)))
```
References
[1]: Matthew D. Hoffman, Andrew Gelman. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. In Journal of Machine Learning Research, 15(1):1593-1623, 2014. http://jmlr.org/papers/volume15/hoffman14a/hoffman14a.pdf

[2]: Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009. https://link.springer.com/article/10.1007/s10107-007-0149-x

[4]: Hoffman, M., Radul, A., & Sountsov, P. An Adaptive MCMC Scheme for Setting Trajectory Lengths in Hamiltonian Monte Carlo, 2020. In preparation.
| Attributes | |
|---|---|
| `experimental_reduce_chain_axis_names` | |
| `experimental_shard_axis_names` | The shard axis names for members of the state. |
| `inner_kernel` | |
| `is_calibrated` | Returns `True` if Markov chain converges to specified distribution. |
| `name` | |
| `num_adaptation_steps` | |
| `parameters` | Return `dict` of `__init__` arguments and their values. |
Methods
bootstrap_results

```python
bootstrap_results(
    init_state
)
```

Returns an object with the same type as returned by `one_step(...)[1]`.

| Args | |
|---|---|
| `init_state` | `Tensor` or Python `list` of `Tensor`s representing the initial state(s) of the Markov chain(s). |

| Returns | |
|---|---|
| `kernel_results` | A (possibly nested) `tuple`, `namedtuple` or `list` of `Tensor`s representing internal calculations made within this function. |
copy

```python
copy(
    **override_parameter_kwargs
)
```

Non-destructively creates a deep copy of the kernel.

| Args | |
|---|---|
| `**override_parameter_kwargs` | Python String/value dictionary of initialization arguments to override with new values. |

| Returns | |
|---|---|
| `new_kernel` | `TransitionKernel` object of same type as `self`, initialized with the union of `self.parameters` and `override_parameter_kwargs`, with any shared keys overridden by the value of `override_parameter_kwargs`, i.e., `dict(self.parameters, **override_parameter_kwargs)`. |
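The override semantics described above reduce to a plain `dict` merge, which can be checked without constructing a kernel:

```python
# copy()'s merge rule is dict(self.parameters, **overrides): shared keys
# take the override's value, everything else is preserved.  The parameter
# values here are arbitrary illustrative numbers.
base_parameters = {'num_adaptation_steps': 400, 'target_accept_prob': 0.75}
overrides = {'num_adaptation_steps': 100}
merged = dict(base_parameters, **overrides)
assert merged == {'num_adaptation_steps': 100, 'target_accept_prob': 0.75}
```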
experimental_with_shard_axes

```python
experimental_with_shard_axes(
    shard_axis_names
)
```

Returns a copy of the kernel with the provided shard axis names.

| Args | |
|---|---|
| `shard_axis_names` | A structure of strings indicating the shard axis names for each component of this kernel's state. |

| Returns | |
|---|---|
| A copy of the current kernel with the shard axis information. | |
log_accept_prob_getter_fn

```python
log_accept_prob_getter_fn(
    kernel_results
)
```
one_step

```python
one_step(
    current_state, previous_kernel_results, seed=None
)
```

Takes one step of the TransitionKernel.

Must be overridden by subclasses.

| Args | |
|---|---|
| `current_state` | `Tensor` or Python `list` of `Tensor`s representing the current state(s) of the Markov chain(s). |
| `previous_kernel_results` | A (possibly nested) `tuple`, `namedtuple` or `list` of `Tensor`s representing internal calculations made within the previous call to this function (or as returned by `bootstrap_results`). |
| `seed` | PRNG seed; see `tfp.random.sanitize_seed` for details. |

| Returns | |
|---|---|
| `next_state` | `Tensor` or Python `list` of `Tensor`s representing the next state(s) of the Markov chain(s). |
| `kernel_results` | A (possibly nested) `tuple`, `namedtuple` or `list` of `Tensor`s representing internal calculations made within this function. |
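The `bootstrap_results`/`one_step` calling convention can be exercised without TFP by a toy object that follows the same protocol; `ToyKernel` below is a hypothetical stand-in (a Gaussian random walk with a step counter), not a real `TransitionKernel`:

```python
import random

class ToyKernel:
    """Toy object following the TransitionKernel protocol."""

    def bootstrap_results(self, init_state):
        # Initial kernel results, same structure one_step will return.
        return {'num_steps': 0}

    def one_step(self, current_state, previous_kernel_results, seed=None):
        # Propose the next state and thread the results through.
        rng = random.Random(seed)
        next_state = current_state + rng.gauss(0.0, 1.0)
        results = {'num_steps': previous_kernel_results['num_steps'] + 1}
        return next_state, results

# The standard sampling loop: bootstrap once, then iterate one_step.
kernel = ToyKernel()
state = 0.0
results = kernel.bootstrap_results(state)
for step in range(5):
    state, results = kernel.one_step(state, results, seed=step)
# results['num_steps'] == 5
```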
reduce_fn

```python
reduce_fn(
    input_tensor, axis, keepdims, experimental_named_axis=None
)
```

step_size_getter_fn

```python
step_size_getter_fn(
    kernel_results
)
```

step_size_setter_fn

```python
step_size_setter_fn(
    kernel_results, new_step_size
)
```