
tf_agents.policies.actor_policy.ActorPolicy

View source on GitHub

Class to build Actor Policies.

Inherits From: Base

tf_agents.policies.actor_policy.ActorPolicy(
    *args, **kwargs
)

Args:

  • time_step_spec: A TimeStep spec of the expected time_steps.
  • action_spec: A nest of BoundedTensorSpec representing the actions.
  • actor_network: An instance of a tf_agents.networks.network.Network to be used by the policy. The network will be called with call(observation, step_type, policy_state) and should return (actions_or_distributions, new_state).
  • info_spec: A nest of TensorSpec representing the policy info.
  • observation_normalizer: An object to use for observation normalization.
  • clip: Whether to clip actions to spec before returning them. Default True. Most policy-based algorithms (PCL, PPO, REINFORCE) use unclipped continuous actions for training.
  • training: Whether the network should be called in training mode.
  • observation_and_action_constraint_splitter: A function used to process observations with action constraints. These constraints can indicate, for example, a mask of valid/invalid actions for a given state of the environment. The function takes in a full observation and returns a tuple consisting of 1) the part of the observation intended as input to the network and 2) the constraint. An example observation_and_action_constraint_splitter could be as simple as:
def observation_and_action_constraint_splitter(observation):
  return observation['network_input'], observation['constraint']

Note: when using observation_and_action_constraint_splitter, make sure the provided actor_network is compatible with the network-specific half of the output of the observation_and_action_constraint_splitter. In particular, observation_and_action_constraint_splitter will be called on the observation before passing to the network. If observation_and_action_constraint_splitter is None, action constraints are not applied.

  • name: The name of this policy. All variables in this module will fall under that name. Defaults to the class name.
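
For orientation, here is a minimal construction sketch. The specs, layer sizes, and names below are hypothetical and not taken from this page; it simply wires an ActorDistributionNetwork into an ActorPolicy.

import tensorflow as tf

from tf_agents.networks import actor_distribution_network
from tf_agents.policies import actor_policy
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# Hypothetical specs for a small continuous-control problem.
observation_spec = tf.TensorSpec([4], tf.float32, name='observation')
action_spec = tensor_spec.BoundedTensorSpec(
    [2], tf.float32, minimum=-1.0, maximum=1.0, name='action')
time_step_spec = ts.time_step_spec(observation_spec)

# An actor network that returns an action distribution, as the policy expects.
actor_net = actor_distribution_network.ActorDistributionNetwork(
    observation_spec, action_spec, fc_layer_params=(64, 64))

policy = actor_policy.ActorPolicy(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    actor_network=actor_net,
    clip=True)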

Attributes:

  • action_spec: Describes the TensorSpecs of the Tensors expected by step(action).

    action can be a single Tensor, or a nested dict, list or tuple of Tensors.

  • emit_log_probability: Whether this policy instance emits log probabilities or not.

  • info_spec: Describes the Tensors emitted as info by action and distribution.

    info can be an empty tuple, a single Tensor, or a nested dict, list or tuple of Tensors.

  • name: Returns the name of this module as passed or determined in the ctor.

    NOTE: This is not the same as the self.name_scope.name which includes parent module names.

  • name_scope: Returns a tf.name_scope instance for this class.

  • observation_and_action_constraint_splitter

  • observation_normalizer

  • policy_state_spec: Describes the Tensors expected by step(_, policy_state).

    policy_state can be an empty tuple, a single Tensor, or a nested dict, list or tuple of Tensors.

  • policy_step_spec: Describes the output of action().

  • submodules: Sequence of all sub-modules.

    Submodules are modules which are properties of this module, or found as properties of modules which are properties of this module (and so on).

a = tf.Module()
b = tf.Module()
c = tf.Module()
a.b = b
b.c = c
assert list(a.submodules) == [b, c]
assert list(b.submodules) == [c]
assert list(c.submodules) == []
  • time_step_spec: Describes the TimeStep tensors returned by step().

  • trainable_variables: Sequence of trainable variables owned by this module and its submodules.

  • trajectory_spec: Describes the Tensors written when using this policy with an environment.

Raises:

  • ValueError: if actor_network is not of type network.Network.
  • NotImplementedError: if observation_and_action_constraint_splitter is not None but action_spec is not discrete.

Methods

action

View source

action(
    time_step, policy_state=(), seed=None
)

Generates next action given the time_step and policy_state.

Args:

  • time_step: A TimeStep tuple corresponding to time_step_spec().
  • policy_state: A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state.
  • seed: Seed to use if action performs sampling (optional).

Returns:

A PolicyStep named tuple containing:

  • action: An action Tensor matching the action_spec().
  • state: A policy state tensor to be fed into the next call to action.
  • info: Optional side information such as action log probabilities.

Raises:

  • RuntimeError: If subclass __init__ didn't call super().__init__.
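
Continuing the construction sketch above, a hedged example of stepping the policy with a hypothetical single-element batch of observations:

# A hypothetical batch of one observation matching the spec above.
observation = tf.constant([[0.1, -0.2, 0.0, 0.3]], dtype=tf.float32)
time_step = ts.restart(observation, batch_size=1)
policy_state = policy.get_initial_state(batch_size=1)

policy_step = policy.action(time_step, policy_state)
action = policy_step.action       # Tensor matching action_spec()
next_state = policy_step.state    # feed into the next action() call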

distribution

View source

distribution(
    time_step, policy_state=()
)

Generates the distribution over next actions given the time_step.

Args:

  • time_step: A TimeStep tuple corresponding to time_step_spec().
  • policy_state: A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state.

Returns:

A PolicyStep named tuple containing:

  • action: A tf.distribution capturing the distribution of next actions.
  • state: A policy state tensor for the next call to distribution.
  • info: Optional side information such as action log probabilities.
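
Reusing the time_step and policy_state from the action() sketch above, a hedged example of querying the distribution directly:

dist_step = policy.distribution(time_step, policy_state)
action_distribution = dist_step.action             # e.g. a tfp.distributions.Distribution
sample = action_distribution.sample()              # draw an action
log_prob = action_distribution.log_prob(sample)    # its log-probability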

get_initial_state

View source

get_initial_state(
    batch_size
)

Returns an initial state usable by the policy.

Args:

  • batch_size: Tensor or constant: size of the batch dimension. Can be None, in which case no batch dimension is added.

Returns:

A nested object of type policy_state containing properly initialized Tensors.
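
A small illustrative note, continuing the sketch above:

# For a feed-forward actor network the policy_state_spec is (), so this returns ().
# With an RNN-based actor network it returns zero-filled state Tensors with the
# requested batch dimension.
initial_state = policy.get_initial_state(batch_size=2)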

update

View source

update(
    policy, tau=1.0, tau_non_trainable=None, sort_variables_by_name=False
)

Update the current policy with another policy.

This would include copying the variables from the other policy.

Args:

  • policy: Another policy it can update from.
  • tau: A float scalar in [0, 1]. When tau is 1.0 (the default), we do a hard update. This is used for trainable variables.
  • tau_non_trainable: A float scalar in [0, 1] for non_trainable variables. If None, will copy from tau.
  • sort_variables_by_name: A bool, when True would sort the variables by name before doing the update.

Returns:

A TF op to do the update.
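
A hedged sketch of a soft (Polyak-style) update; target_policy here is a hypothetical second ActorPolicy built over a network with the same variable structure as policy:

# Move the target policy's variables a small step towards the current policy's.
update_op = target_policy.update(policy, tau=0.005)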

variables

View source

variables()

Returns the list of Variables that belong to the policy.
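
Continuing the sketch above, for example:

policy_vars = policy.variables()  # typically the variables of the actor network used by the policy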

with_name_scope

@classmethod
with_name_scope(
    cls, method
)

Decorator to automatically enter the module name scope.

class MyModule(tf.Module):
  @tf.Module.with_name_scope
  def __call__(self, x):
    if not hasattr(self, 'w'):
      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))
    return tf.matmul(x, self.w)

Using the above module would produce tf.Variables and tf.Tensors whose names included the module name:

mod = MyModule()
mod(tf.ones([8, 32]))
# ==> <tf.Tensor: ...>
mod.w
# ==> <tf.Variable ...'my_module/w:0'>

Args:

  • method: The method to wrap.

Returns:

The original method wrapped such that it enters the module's name scope.