tf_agents.agents.ppo.ppo_clip_agent.PPOClipAgent

View source on GitHub

A PPO Agent implementing the clipped probability ratios.

Inherits From: PPOAgent
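
The "clipped" part refers to the clipped surrogate objective of PPO: the probability ratio between the current policy and the data-collection policy is clipped to [1 - epsilon, 1 + epsilon], where epsilon is the importance_ratio_clipping argument below. A minimal sketch of that per-timestep objective follows; the function and tensor names are illustrative placeholders, not this class's internals:

import tensorflow as tf

def clipped_surrogate_loss(new_log_prob, old_log_prob, advantages, epsilon=0.2):
  # Probability ratio between the current policy and the data-collection policy.
  ratio = tf.exp(new_log_prob - old_log_prob)
  clipped_ratio = tf.clip_by_value(ratio, 1.0 - epsilon, 1.0 + epsilon)
  # Take the pessimistic (minimum) surrogate and negate it so it can be minimized.
  return -tf.reduce_mean(tf.minimum(ratio * advantages, clipped_ratio * advantages))

Taking the minimum of the clipped and unclipped terms removes the incentive to move the policy far outside the clipping range in a single update.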

tf_agents.agents.ppo.ppo_clip_agent.PPOClipAgent(
    *args, **kwargs
)

Args:

  • time_step_spec: A TimeStep spec of the expected time_steps.
  • action_spec: A nest of BoundedTensorSpec representing the actions.
  • optimizer: Optimizer to use for the agent.
  • actor_net: A function actor_net(observations, action_spec) that returns a tensor of action distribution parameters for each observation. Takes a nested observation and returns a nested action.
  • value_net: A function value_net(time_steps) that returns a value tensor of neural-net value predictions for each observation. Takes a nested observation and returns a batch of value_preds.
  • importance_ratio_clipping: Epsilon in clipped, surrogate PPO objective. For more detail, see explanation at the top of the doc.
  • lambda_value: Lambda parameter for TD-lambda computation.
  • discount_factor: Discount factor for return computation.
  • entropy_regularization: Coefficient for entropy regularization loss term.
  • policy_l2_reg: Coefficient for l2 regularization of unshared policy weights.
  • value_function_l2_reg: Coefficient for l2 regularization of unshared value function weights.
  • shared_vars_l2_reg: Coefficient for l2 regularization of weights shared between the policy and value functions.
  • value_pred_loss_coef: Multiplier for value prediction loss to balance with policy gradient loss.
  • num_epochs: Number of epochs for computing policy updates.
  • use_gae: If True (default False), uses generalized advantage estimation for computing per-timestep advantage. Else, just subtracts value predictions from empirical return.
  • use_td_lambda_return: If True (default False), uses td_lambda_return for training value function. (td_lambda_return = gae_advantage + value_predictions)
  • normalize_rewards: If true, keeps moving variance of rewards and normalizes incoming rewards.
  • reward_norm_clipping: Value at which to clip the normalized reward (applied above and below, i.e. clipping to +/- this value).
  • normalize_observations: If true, keeps moving mean and variance of observations and normalizes incoming observations.
  • log_prob_clipping: +/- value for clipping log probs to prevent inf / NaN values. Default: no clipping.
  • gradient_clipping: Norm length to clip gradients. Default: no clipping.
  • check_numerics: If true, adds tf.debugging.check_numerics to help find NaN / Inf values. For debugging only.
  • debug_summaries: A bool to gather debug summaries.
  • summarize_grads_and_vars: If true, gradient summaries will be written.
  • train_step_counter: An optional counter to increment every time the train op is run. Defaults to the global_step.
  • name: The name of this agent. All variables in this module will fall under that name. Defaults to the class name.
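
As a rough orientation, a minimal construction sketch is shown below. The environment, networks, and hyperparameter values are illustrative assumptions using the standard tf_agents helpers, not a recommended configuration:

import tensorflow as tf
from tf_agents.agents.ppo import ppo_clip_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import actor_distribution_network, value_network

# Wrap a Gym environment to obtain the specs the agent needs.
env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))
actor_net = actor_distribution_network.ActorDistributionNetwork(
    env.observation_spec(), env.action_spec(), fc_layer_params=(64, 64))
value_net = value_network.ValueNetwork(
    env.observation_spec(), fc_layer_params=(64, 64))

agent = ppo_clip_agent.PPOClipAgent(
    env.time_step_spec(),
    env.action_spec(),
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3),
    actor_net=actor_net,
    value_net=value_net,
    importance_ratio_clipping=0.2,
    num_epochs=10,
    train_step_counter=tf.Variable(0, dtype=tf.int64))
agent.initialize()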

Attributes:

  • action_spec: TensorSpec describing the action produced by the agent.

  • actor_net: Returns actor_net TensorFlow template function.

  • collect_data_spec: Returns a Trajectory spec, as expected by the collect_policy.

  • collect_policy: Return a policy that can be used to collect data from the environment.

  • debug_summaries

  • name: Returns the name of this module as passed or determined in the ctor.

    NOTE: This is not the same as the self.name_scope.name which includes parent module names.

  • name_scope: Returns a tf.name_scope instance for this class.

  • policy: Return the current policy held by the agent.

  • submodules: Sequence of all sub-modules.

    Submodules are modules which are properties of this module, or found as properties of modules which are properties of this module (and so on).

a = tf.Module()
b = tf.Module()
c = tf.Module()
a.b = b
b.c = c
assert list(a.submodules) == [b, c]
assert list(b.submodules) == [c]
assert list(c.submodules) == []
  • summaries_enabled
  • summarize_grads_and_vars
  • time_step_spec: Describes the TimeStep tensors expected by the agent.

  • train_argspec: TensorSpec describing extra supported kwargs to train().

  • train_sequence_length: The number of time steps needed in experience tensors passed to train.

    Train requires experience to be a Trajectory containing tensors shaped [B, T, ...]. This argument describes the value of T required.

    For example, for non-RNN DQN training, T=2 because DQN requires single transitions.

    If this value is None, then train can handle an unknown T (it can be determined at runtime from the data). Most RNN-based agents fall into this category.

  • train_step_counter

  • trainable_variables: Sequence of trainable variables owned by this module and its submodules.

  • variables: Sequence of variables owned by this module and its submodules.

Raises:

  • ValueError: If the actor_net is not a DistributionNetwork.

Methods

adaptive_kl_loss

View source

adaptive_kl_loss(
    kl_divergence, debug_summaries=False
)

compute_advantages

View source

compute_advantages(
    rewards, returns, discounts, value_preds
)

Compute advantages, optionally using GAE.

Based on the OpenAI Baselines ppo1 implementation. The final timestep is removed, since it is only needed for the next-step value prediction in the TD error computation.

Args:

  • rewards: Tensor of per-timestep rewards.
  • returns: Tensor of per-timestep returns.
  • discounts: Tensor of per-timestep discounts. Zero for terminal timesteps.
  • value_preds: Cached value estimates from the data-collection policy.

Returns:

  • advantages: Tensor of length (len(rewards) - 1), because the final timestep is just used for next-step value prediction.
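
For orientation, a self-contained NumPy sketch of generalized advantage estimation over a single trajectory is shown below. It is illustrative only; the agent's own implementation operates on batched tensors with the shapes described above:

import numpy as np

def gae_advantages(rewards, discounts, value_preds, lambda_value=0.95):
  # value_preds has one more entry than rewards (the bootstrap value of the
  # final state); discounts are zero at terminal timesteps.
  deltas = rewards + discounts * value_preds[1:] - value_preds[:-1]
  advantages = np.zeros_like(deltas)
  accumulator = 0.0
  for t in reversed(range(len(deltas))):
    # Discounted, lambda-weighted sum of TD errors, accumulated backwards in time.
    accumulator = deltas[t] + discounts[t] * lambda_value * accumulator
    advantages[t] = accumulator
  return advantages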

compute_return_and_advantage

View source

compute_return_and_advantage(
    next_time_steps, value_preds
)

Compute the Monte Carlo return and advantage.

Normalization will be applied to the computed returns and advantages if it is enabled.

Args:

  • next_time_steps: batched tensor of TimeStep tuples after action is taken.
  • value_preds: Batched value prediction tensor. Should have one more entry in time index than time_steps, with the final value corresponding to the value prediction of the final state.

Returns:

A tuple of (return, normalized_advantage); both are batched tensors.

entropy_regularization_loss

View source

entropy_regularization_loss(
    time_steps, current_policy_distribution, weights, debug_summaries=False
)

Create regularization loss tensor based on agent parameters.

get_epoch_loss

View source

get_epoch_loss(
    time_steps, actions, act_log_probs, returns, normalized_advantages,
    action_distribution_parameters, weights, train_step, debug_summaries,
    training=False
)

Compute the loss and create optimization op for one training epoch.

All tensors should have a single batch dimension.

Args:

  • time_steps: A minibatch of TimeStep tuples.
  • actions: A minibatch of actions.
  • act_log_probs: A minibatch of action log probabilities (log probability of each action under the sampling policy).
  • returns: A minibatch of per-timestep returns.
  • normalized_advantages: A minibatch of normalized per-timestep advantages.
  • action_distribution_parameters: Parameters of data-collecting action distribution. Needed for KL computation.
  • weights: Optional scalar or element-wise (per-batch-entry) importance weights. Includes a mask for invalid timesteps.
  • train_step: A train_step variable to increment for each train step. Typically the global_step.
  • debug_summaries: True if debug summaries should be created.
  • training: Whether this loss is being used for training.

Returns:

A tf_agent.LossInfo named tuple with the total_loss and all intermediate losses in the extra field contained in a PPOLossInfo named tuple.

initialize

View source

initialize()

Initializes the agent.

Returns:

An operation that can be used to initialize the agent.

Raises:

  • RuntimeError: If the class was not initialized properly (super().__init__ was not called).

kl_cutoff_loss

View source

kl_cutoff_loss(
    kl_divergence, debug_summaries=False
)

kl_penalty_loss

View source

kl_penalty_loss(
    time_steps, action_distribution_parameters, current_policy_distribution,
    weights, debug_summaries=False
)

Compute a loss that penalizes policy steps with high KL.

Based on KL divergence from old (data-collection) policy to new (updated) policy.

All tensors should have a single batch dimension.

Args:

  • time_steps: TimeStep tuples with observations for each timestep. Used for computing new action distributions.
  • action_distribution_parameters: Action distribution params of the data-collection policy, used for reconstructing the old action distributions.
  • current_policy_distribution: The policy distribution, evaluated on all time_steps.
  • weights: Optional scalar or element-wise (per-batch-entry) importance weights. Includes a mask for invalid timesteps.
  • debug_summaries: True if debug summaries should be created.

Returns:

  • kl_penalty_loss: The sum of a squared penalty for KL over a constant threshold, plus an adaptive penalty that encourages updates toward a target KL divergence.
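
A hedged sketch of that penalty structure is shown below; the names kl_cutoff_factor, kl_cutoff_coef, adaptive_kl_target, and adaptive_kl_beta are illustrative placeholders, not necessarily this class's attribute names:

import tensorflow as tf

def kl_penalty(kl_divergence, kl_cutoff_factor, kl_cutoff_coef,
               adaptive_kl_target, adaptive_kl_beta):
  # Squared hinge penalty for KL divergence above a fixed cutoff...
  cutoff = kl_cutoff_factor * adaptive_kl_target
  kl_cutoff_loss = kl_cutoff_coef * tf.reduce_mean(
      tf.square(tf.maximum(kl_divergence - cutoff, 0.0)))
  # ...plus an adaptive penalty whose coefficient is tuned toward the target KL
  # (see update_adaptive_kl_beta below).
  adaptive_kl_loss = adaptive_kl_beta * tf.reduce_mean(kl_divergence)
  return kl_cutoff_loss + adaptive_kl_loss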

l2_regularization_loss

View source

l2_regularization_loss(
    debug_summaries=False
)

policy_gradient_loss

View source

policy_gradient_loss(
    time_steps, actions, sample_action_log_probs, advantages,
    current_policy_distribution, weights, debug_summaries=False
)

Create tensor for policy gradient loss.

All tensors should have a single batch dimension.

Args:

  • time_steps: TimeSteps with observations for each timestep.
  • actions: Tensor of actions for timesteps, aligned on index.
  • sample_action_log_probs: Tensor of the log probability of each sampled action (under the data-collection policy).
  • advantages: Tensor of advantage estimate for each timestep, aligned on index. Works better when advantage estimates are normalized.
  • current_policy_distribution: The policy distribution, evaluated on all time_steps.
  • weights: Optional scalar or element-wise (per-batch-entry) importance weights. Includes a mask for invalid timesteps.
  • debug_summaries: True if debug summaries should be created.

Returns:

  • policy_gradient_loss: A tensor that will contain policy gradient loss for the on-policy experience.

train

View source

train(
    experience, weights=None, **kwargs
)

Trains the agent.

Args:

  • experience: A batch of experience data in the form of a Trajectory. The structure of experience must match that of self.collect_data_spec. All tensors in experience must be shaped [batch, time, ...], where time must be equal to self.train_sequence_length if that property is not None.
  • weights: (optional). A Tensor, either 0-D or shaped [batch], containing weights to be used when calculating the total train loss. Weights are typically multiplied elementwise against the per-batch loss, but the implementation is up to the Agent.
  • **kwargs: Any additional data as declared by self.train_argspec.

Returns:

A LossInfo loss tuple containing loss and info tensors.

  • In eager mode, the loss values are first calculated, then a train step is performed before they are returned.
  • In graph mode, executing any or all of the loss tensors will first calculate the loss value(s), then perform a train step, and return the pre-train-step LossInfo.

Raises:

  • TypeError: If experience is not type Trajectory. Or if experience does not match self.collect_data_spec structure types.
  • ValueError: If experience tensors' time axes are not compatible with self.train_sequence_length. Or if experience does not match self.collect_data_spec structure.
  • ValueError: If the user does not pass **kwargs matching self.train_argspec.
  • RuntimeError: If the class was not initialized properly (super().__init__ was not called).
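
A hedged usage sketch in eager mode is shown below. Collecting experience (for example with a driver and a replay buffer) is assumed and not shown; experience must match agent.collect_data_spec and be shaped [batch, time, ...]:

# `agent` constructed and initialized as in the example near the top of this page.
loss_info = agent.train(experience)
print('train step:', agent.train_step_counter.numpy(),
      'loss:', loss_info.loss.numpy())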

update_adaptive_kl_beta

View source

update_adaptive_kl_beta(
    kl_divergence
)

Create update op for adaptive KL penalty coefficient.

Args:

  • kl_divergence: KL divergence of old policy to new policy for all timesteps.

Returns:

  • update_op: An op which runs the update for the adaptive kl penalty term.
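
For intuition, the adaptive rule from the PPO paper is sketched below in plain Python; the thresholds and names here are illustrative, not this class's exact implementation:

def update_beta(mean_kl, beta, target_kl):
  # Grow the penalty coefficient when the policy moved too far, shrink it when
  # it moved too little, nudging future updates toward the target KL.
  if mean_kl > 1.5 * target_kl:
    beta *= 2.0
  elif mean_kl < target_kl / 1.5:
    beta /= 2.0
  return beta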

value_estimation_loss

View source

value_estimation_loss(
    time_steps, returns, weights, debug_summaries=False, training=False
)

Computes the value estimation loss for actor-critic training.

All tensors should have a single batch dimension.

Args:

  • time_steps: A batch of timesteps.
  • returns: Per-timestep returns for value function to predict. (Should come from TD-lambda computation.)
  • weights: Optional scalar or element-wise (per-batch-entry) importance weights. Includes a mask for invalid timesteps.
  • debug_summaries: True if debug summaries should be created.
  • training: Whether this loss is going to be used for training.

Returns:

  • value_estimation_loss: A scalar value_estimation_loss loss.
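
Conceptually this is a weighted squared error between the value network's predictions on time_steps and the target returns. A minimal sketch (illustrative, not the class's exact code):

import tensorflow as tf

def value_loss(value_preds, returns, weights):
  # Per-timestep squared error, masked/weighted, reduced to a scalar.
  return tf.reduce_mean(tf.square(returns - value_preds) * weights)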

with_name_scope

@classmethod
with_name_scope(
    cls, method
)

Decorator to automatically enter the module name scope.

class MyModule(tf.Module):
  @tf.Module.with_name_scope
  def __call__(self, x):
    if not hasattr(self, 'w'):
      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))
    return tf.matmul(x, self.w)

Using the above module would produce tf.Variables and tf.Tensors whose names included the module name:

mod = MyModule()
mod(tf.ones([8, 32]))
# ==> <tf.Tensor: ...>
mod.w
# ==> <tf.Variable ...'my_module/w:0'>

Args:

  • method: The method to wrap.

Returns:

The original method wrapped such that it enters the module's name scope.