tf_agents.agents.ppo.ppo_policy.PPOPolicy

An ActorPolicy that also returns policy_info needed for PPO training.

Inherits From: ActorPolicy

This policy requires two networks: the usual actor_network and the additional value_network. The value network can be executed with the apply_value_network() method.

When the networks have state (RNNs, LSTMs), you must pass the actor network's state to action() and the value network's state to apply_value_network(). Use get_initial_value_state() to obtain the initial state of the value network.

Args
time_step_spec A TimeStep spec of the expected time_steps.
action_spec A nest of BoundedTensorSpec representing the actions.
actor_network An instance of a tf_agents.networks.network.Network, with call(observation, step_type, network_state). Network should return one of the following: 1. a nested tuple of tfp.distributions objects matching action_spec, or 2. a nested tuple of tf.Tensors representing actions.
value_network An instance of a tf_agents.networks.network.Network, with call(observation, step_type, network_state). Network should return value predictions for the input state.
observation_normalizer An object to use for observation normalization.
clip Whether to clip actions to spec before returning them. Default True. Most policy-based algorithms (PCL, PPO, REINFORCE) use unclipped continuous actions for training.
collect If True, creates ops for actions_log_prob, value_preds, and action_distribution_params. (default True)
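
As a rough illustration of what clipping an action to its spec means, the sketch below clamps each component of a continuous action to per-dimension bounds. It uses plain Python lists rather than tensors; `clip_to_bounds` and the bound values are hypothetical and are not part of TF-Agents.

```python
def clip_to_bounds(action, minimum, maximum):
    """Clamp each action component to the range [minimum[i], maximum[i]]."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, minimum, maximum)]


# Hypothetical 2-D action with bounds [-1, 1] and [0, 2].
raw_action = [1.7, -0.3]
clipped = clip_to_bounds(raw_action, minimum=[-1.0, 0.0], maximum=[1.0, 2.0])
print(clipped)
```

An action already inside the bounds passes through unchanged, so clipping only alters samples that fall outside the spec.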

Raises
ValueError If actor_network or value_network is not of type tf_agents.networks.network.Network.

Attributes
action_spec Describes the TensorSpecs of the Tensors expected by step(action).

action can be a single Tensor, or a nested dict, list or tuple of Tensors.

collect_data_spec Describes the Tensors written when using this policy with an environment.
emit_log_probability Whether this policy instance emits log probabilities or not.
info_spec Describes the Tensors emitted as info by action and distribution.

info can be an empty tuple, a single Tensor, or a nested dict, list or tuple of Tensors.

name Returns the name of this module as passed or determined in the ctor.

name_scope Returns a tf.name_scope instance for this class.
observation_and_action_constraint_splitter

observation_normalizer

policy_state_spec Describes the Tensors expected by step(_, policy_state).

policy_state can be an empty tuple, a single Tensor, or a nested dict, list or tuple of Tensors.

policy_step_spec Describes the output of action().
submodules Sequence of all sub-modules.

Submodules are modules which are properties of this module, or found as properties of modules which are properties of this module (and so on).

>>> import tensorflow as tf
>>> a = tf.Module()
>>> b = tf.Module()
>>> c = tf.Module()
>>> a.b = b
>>> b.c = c
>>> list(a.submodules) == [b, c]
True
>>> list(b.submodules) == [c]
True
>>> list(c.submodules) == []
True

time_step_spec Describes the TimeStep tensors returned by step().
trainable_variables Sequence of trainable variables owned by this module and its submodules.

trajectory_spec Describes the Tensors written when using this policy with an environment.

Methods

action

Generates next action given the time_step and policy_state.

Args
time_step A TimeStep tuple corresponding to time_step_spec().
policy_state A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state.
seed Seed to use if action performs sampling (optional).

Returns
A PolicyStep named tuple containing:

  • action: An action Tensor matching the action_spec().
  • state: A policy state tensor to be fed into the next call to action.
  • info: Optional side information such as action log probabilities.

Raises
RuntimeError If subclass __init__ didn't call super().__init__.

apply_value_network

Apply value network to time_step, potentially a sequence.

If observation_normalizer is not None, applies observation normalization.

Args
observations A (possibly nested) observation tensor with outer_dims either (batch_size,) or (batch_size, time_index). If observations is a time series and the network is an RNN, the RNN is stepped over the time series.
step_types A (possibly nested) step_types tensor with same outer_dims as observations.
value_state Optional. Initial state for the value_network. If not provided the behavior depends on the value network itself.
training Whether the output value is going to be used for training.

Returns
The output of value_net, which is a tuple of:

  • value_preds with same outer_dims as time_step
  • value_state at the end of the time series

distribution

Generates the distribution over next actions given the time_step.

Args
time_step A TimeStep tuple corresponding to time_step_spec().
policy_state A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state.

Returns
A PolicyStep named tuple containing:

  • action: A tfp.distributions object capturing the distribution of next actions.
  • state: A policy state tensor for the next call to distribution.
  • info: Optional side information such as action log probabilities.

get_initial_state

Returns an initial state usable by the policy.

Args
batch_size Tensor or constant: size of the batch dimension. Can be None, in which case no batch dimension is added.

Returns
A nested object of type policy_state containing properly initialized Tensors.

get_initial_value_state

Returns the initial state of the value network.

Args
batch_size A constant or Tensor holding the batch size. Can be None, in which case the state will not have a batch dimension added.

Returns
A nest of zero tensors matching the spec of the value network state.
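
The effect of batch_size on the returned state can be sketched with a toy stand-in. Below, a state spec is modeled as a dict of shape tuples, and `zero_state` is a hypothetical helper that builds nested lists of zeros. It is not the TF-Agents implementation, only an illustration of when a batch dimension is prepended.

```python
def zeros(shape):
    """Build a nested list of zeros with the given shape tuple."""
    if not shape:
        return 0.0
    return [zeros(shape[1:]) for _ in range(shape[0])]


def zero_state(spec, batch_size=None):
    """Map a dict of shape tuples to zeros, prepending batch_size if given."""
    result = {}
    for name, shape in spec.items():
        full_shape = shape if batch_size is None else (batch_size,) + shape
        result[name] = zeros(full_shape)
    return result


# Hypothetical LSTM-style state spec: hidden and cell state of size 3.
lstm_state_spec = {"h": (3,), "c": (3,)}

unbatched = zero_state(lstm_state_spec)               # no batch dimension
batched = zero_state(lstm_state_spec, batch_size=2)   # leading dimension of 2
print(unbatched["h"], batched["h"])
```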

update

Update the current policy with another policy.

This would include copying the variables from the other policy.

Args
policy Another policy it can update from.
tau A float scalar in [0, 1], applied to trainable variables. When tau is 1.0 (the default), a hard update is performed.
tau_non_trainable A float scalar in [0, 1], applied to non-trainable variables. If None, the value of tau is used.
sort_variables_by_name A bool. When True, the variables are sorted by name before the update.

Returns
A TF op to do the update.
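
The role of tau can be illustrated with plain numbers. The sketch below assumes the common soft-update (Polyak averaging) rule, target = (1 - tau) * target + tau * source. `soft_update` is a hypothetical helper operating on Python lists; it is not the TF-Agents implementation.

```python
def soft_update(target, source, tau=1.0):
    """Blend source values into target: (1 - tau) * target + tau * source."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must be in [0, 1]")
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]


target_vars = [0.0, 2.0]
source_vars = [1.0, 4.0]

hard = soft_update(target_vars, source_vars)           # tau=1.0 copies source
soft = soft_update(target_vars, source_vars, tau=0.5)  # halfway blend
print(hard, soft)
```

Under this assumed rule, tau=1.0 reproduces the source values exactly (a hard update), while smaller values move the target only part of the way toward the source.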

variables

Returns the list of Variables that belong to the policy.

with_name_scope

Decorator to automatically enter the module name scope.

import tensorflow as tf

class MyModule(tf.Module):
  @tf.Module.with_name_scope
  def __call__(self, x):
    # Create the weight lazily, inside the module's name scope.
    if not hasattr(self, 'w'):
      self.w = tf.Variable(tf.random.normal([x.shape[1], 3]))
    return tf.matmul(x, self.w)

Using the above module would produce tf.Variables and tf.Tensors whose names included the module name:

>>> mod = MyModule()
>>> mod(tf.ones([1, 2]))
<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array(..., dtype=float32)>
>>> mod.w
<tf.Variable 'my_module/Variable:0' shape=(2, 3) dtype=float32,
numpy=array(..., dtype=float32)>

Args
method The method to wrap.

Returns
The original method wrapped such that it enters the module's name scope.