tf_agents.bandits.policies.categorical_policy.CategoricalPolicy

View source on GitHub

Policy that chooses an action based on a categorical distribution.

Inherits From: Base

tf_agents.bandits.policies.categorical_policy.CategoricalPolicy(
    weights, time_step_spec, action_spec, inverse_temperature=1.0,
    emit_log_probability=True, name=None
)

The distribution is specified by a set of weights, one per action, and an inverse temperature. The unnormalized probability of an action is given by exp(weight * inverse_temperature). Weights and the inverse temperature are typically maintained as Variables and are updated by an Agent.

Note that this policy does not make use of time_step.observation at all. That is, it is a non-contextual bandit policy.
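
A minimal construction sketch, assuming a three-armed bandit; the placeholder observation spec and the specific weight values are illustrative, not part of the API:

import tensorflow as tf
from tf_agents.bandits.policies import categorical_policy
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# Unnormalized log-probabilities, one weight per arm. Kept as a Variable
# so an agent can update them in place.
weights = tf.Variable([1.0, 2.0, 3.0])

# The policy ignores observations, so a trivial observation spec suffices.
observation_spec = tensor_spec.TensorSpec(shape=(1,), dtype=tf.float32)
time_step_spec = ts.time_step_spec(observation_spec)
action_spec = tensor_spec.BoundedTensorSpec(
    shape=(), dtype=tf.int32, minimum=0, maximum=2)

policy = categorical_policy.CategoricalPolicy(
    weights=weights,
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    inverse_temperature=1.0)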

Args:

  • weights: a vector of weights, corresponding to the unscaled log probabilities of a categorical distribution.
  • time_step_spec: A TimeStep spec of the expected time_steps.
  • action_spec: A tensor_spec of action specification.
  • inverse_temperature: a float value used to scale weights. Lower values induce a more uniform distribution over actions; higher values result in a sharper distribution (see the sketch after this list).
  • emit_log_probability: Whether to emit log probabilities or not.
  • name: The name of this policy.
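
The effect of inverse_temperature can be checked directly, since the normalized probabilities are just a softmax of the scaled weights (a standalone sketch; the weight values are illustrative):

import tensorflow as tf

weights = tf.constant([1.0, 2.0, 3.0])
# Probabilities are proportional to exp(weights * inverse_temperature),
# i.e. softmax(weights * inverse_temperature) after normalization.
for inv_temp in (0.1, 1.0, 10.0):
  probs = tf.nn.softmax(weights * inv_temp)
  print(inv_temp, probs.numpy())
# inv_temp=0.1 gives a near-uniform distribution; inv_temp=10.0
# concentrates almost all mass on the highest-weight arm.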

Attributes:

  • action_spec: Describes the TensorSpecs of the Tensors expected by step(action).

    action can be a single Tensor, or a nested dict, list or tuple of Tensors.

  • collect_data_spec: Describes the Tensors written when using this policy with an environment.

  • emit_log_probability: Whether this policy instance emits log probabilities or not.

  • info_spec: Describes the Tensors emitted as info by action and distribution.

    info can be an empty tuple, a single Tensor, or a nested dict, list or tuple of Tensors.

  • name: Returns the name of this module as passed to or determined in the constructor.

    NOTE: This is not the same as the self.name_scope.name which includes parent module names.

  • name_scope: Returns a tf.name_scope instance for this class.

  • observation_and_action_constraint_splitter

  • policy_state_spec: Describes the Tensors expected by step(_, policy_state).

    policy_state can be an empty tuple, a single Tensor, or a nested dict, list or tuple of Tensors.

  • policy_step_spec: Describes the output of action().

  • submodules: Sequence of all sub-modules.

    Submodules are modules which are properties of this module, or found as properties of modules which are properties of this module (and so on).

a = tf.Module()
b = tf.Module()
c = tf.Module()
a.b = b
b.c = c
assert list(a.submodules) == [b, c]
assert list(b.submodules) == [c]
assert list(c.submodules) == []
  • time_step_spec: Describes the TimeStep tensors returned by step().

  • trainable_variables: Sequence of trainable variables owned by this module and its submodules.

  • trajectory_spec: Describes the Tensors written when using this policy with an environment.

Raises:

  • ValueError: If the number of actions specified by the action_spec does not match the dimension of weights.

Methods

action

View source

action(
    time_step, policy_state=(), seed=None
)

Generates next action given the time_step and policy_state.

Args:

  • time_step: A TimeStep tuple corresponding to time_step_spec().
  • policy_state: A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state.
  • seed: Seed to use if action performs sampling (optional).

Returns:

A PolicyStep named tuple containing:

  • action: An action Tensor matching the action_spec().
  • state: A policy state tensor to be fed into the next call to action.
  • info: Optional side information such as action log probabilities.

Raises:

  • RuntimeError: If a subclass's __init__ didn't call super().__init__().
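
Continuing the construction sketch above (policy, the observation spec, and the batch size are assumptions of that sketch, not part of the API):

import tensorflow as tf
from tf_agents.trajectories import time_step as ts

# A batched dummy observation matching the observation spec; the policy
# ignores it, but the TimeStep must still conform to time_step_spec.
observation = tf.zeros([2, 1])
time_step = ts.restart(observation, batch_size=2)

step = policy.action(time_step, seed=42)
print(step.action)  # e.g. a tf.Tensor of shape [2] with sampled arm indices.
print(step.info)    # Holds log probabilities when emit_log_probability=True.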

distribution

View source

distribution(
    time_step, policy_state=()
)

Generates the distribution over next actions given the time_step.

Args:

  • time_step: A TimeStep tuple corresponding to time_step_spec().
  • policy_state: A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state.

Returns:

A PolicyStep named tuple containing:

  • action: A tf.distribution capturing the distribution of next actions.
  • state: A policy state tensor for the next call to distribution.
  • info: Optional side information such as action log probabilities.
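
A sketch reusing policy and time_step from the earlier examples; that the returned distribution is categorical is an assumption of the sketch, consistent with this policy's definition:

dist_step = policy.distribution(time_step)
dist = dist_step.action      # A categorical distribution over the arms.
print(dist.sample())         # Draws one action per batch entry.
print(dist.log_prob(0))      # Log probability of pulling arm 0.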

get_initial_state

View source

get_initial_state(
    batch_size
)

Returns an initial state usable by the policy.

Args:

  • batch_size: Tensor or constant: size of the batch dimension. Can be None, in which case no batch dimension is added.

Returns:

A nested object of type policy_state containing properly initialized Tensors.

update

View source

update(
    policy, tau=1.0, tau_non_trainable=None, sort_variables_by_name=False
)

Update the current policy with another policy.

This would include copying the variables from the other policy.

Args:

  • policy: Another policy to update from.
  • tau: A float scalar in [0, 1]. When tau is 1.0 (the default), we do a hard update; smaller values perform a soft, incremental update. This is used for trainable variables.
  • tau_non_trainable: A float scalar in [0, 1] for non-trainable variables. If None, the value of tau is used.
  • sort_variables_by_name: A bool; when True, the variables are sorted by name before doing the update.

Returns:

A TF op to perform the update.
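
A usage sketch, assuming other_policy is a compatible policy with identically-shaped variables (the name is illustrative):

# Hard update: copy the other policy's variables outright (tau=1.0 default).
update_op = policy.update(other_policy)

# Soft update: blend the variables, roughly
# new_var = tau * other_var + (1 - tau) * current_var.
soft_update_op = policy.update(other_policy, tau=0.05)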

variables

View source

variables()

Returns the list of Variables that belong to the policy.

with_name_scope

@classmethod
with_name_scope(
    cls, method
)

Decorator to automatically enter the module name scope.

class MyModule(tf.Module):
  @tf.Module.with_name_scope
  def __call__(self, x):
    if not hasattr(self, 'w'):
      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))
    return tf.matmul(x, self.w)

Using the above module would produce tf.Variables and tf.Tensors whose names include the module name:

mod = MyModule()
mod(tf.ones([8, 32]))
# ==> <tf.Tensor: ...>
mod.w
# ==> <tf.Variable ...'my_module/w:0'>

Args:

  • method: The method to wrap.

Returns:

The original method wrapped such that it enters the module's name scope.