tf_agents.agents.ReinforceAgent

A REINFORCE Agent.

Inherits From: TFAgent

tf_agents.agents.ReinforceAgent(
    *args, **kwargs
)

Implements:

  • The REINFORCE algorithm from "Simple statistical gradient-following algorithms for connectionist reinforcement learning", Williams, R.J., 1992. http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf
  • REINFORCE with a state-value baseline, where state-values are estimated with function approximation, from "Reinforcement learning: An introduction" (Sec. 13.4), Sutton, R.S. and Barto, A.G., 2018. http://incompleteideas.net/book/the-book-2nd.html

The REINFORCE agent can be optionally provided with:

  • value_network: A tf_agents.network.Network which parameterizes state-value estimation as a neural network. The network will be called with call(observation, step_type) and should return a floating point tensor of state-values.
  • value_estimation_loss_coef: Weight on the value prediction loss.

If value_network and value_estimation_loss_coef are provided, advantages are computed as

  advantages = (discounted accumulated rewards) - (estimated state-values)

and the overall learning objective becomes

  (total loss) = (policy gradient loss) + value_estimation_loss_coef * (squared error of estimated state-values)
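
This relationship can be sketched in a few lines of plain TensorFlow. The returns, value_preds, and loss values below are hypothetical placeholders standing in for the agent's internal computations:

import tensorflow as tf

# Hypothetical per-timestep discounted returns and estimated state-values.
returns = tf.constant([1.0, 0.9, 0.5])
value_preds = tf.constant([0.8, 0.7, 0.6])
value_estimation_loss_coef = 0.2  # illustrative weight, not a recommendation

# Advantage: discounted accumulated rewards minus estimated state-values.
advantages = returns - value_preds

# Overall objective: policy gradient loss plus weighted squared error of the values.
policy_gradient_loss = tf.constant(0.1)  # stands in for the REINFORCE loss term
value_loss = tf.reduce_mean(tf.square(returns - value_preds))
total_loss = policy_gradient_loss + value_estimation_loss_coef * value_loss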

Args:

  • time_step_spec: A TimeStep spec of the expected time_steps.
  • action_spec: A nest of BoundedTensorSpec representing the actions.
  • actor_network: A tf_agents.network.Network to be used by the agent. The network will be called with call(observation, step_type).
  • optimizer: Optimizer for the actor network (see the construction sketch after this list).
  • value_network: (Optional) A tf_agents.network.Network to be used by the agent. The network will be called with call(observation, step_type) and should return a floating point tensor of values.
  • value_estimation_loss_coef: (Optional) Multiplier for value prediction loss to balance with policy gradient loss.
  • advantage_fn: A function A(returns, value_preds) that takes returns and value function predictions as input and returns advantages. The default is A(returns, value_preds) = returns - value_preds if a value network is specified and use_advantage_loss=True, otherwise A(returns, value_preds) = returns.
  • use_advantage_loss: Whether to subtract value function predictions from returns when computing the policy gradient loss (i.e., whether to train on advantages rather than raw returns). use_advantage_loss=False is equivalent to setting advantage_fn=lambda returns, value_preds: returns.
  • gamma: A discount factor for future rewards.
  • normalize_returns: Whether to normalize returns across episodes when computing the loss.
  • gradient_clipping: Norm length to clip gradients.
  • debug_summaries: A bool to gather debug summaries.
  • summarize_grads_and_vars: If True, gradient and network variable summaries will be written during training.
  • entropy_regularization: Coefficient for entropy regularization loss term.
  • train_step_counter: An optional counter to increment every time the train op is run. Defaults to the global_step.
  • name: The name of this agent. All variables in this module will fall under that name. Defaults to the class name.
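
For illustration, here is a minimal construction sketch in the spirit of the TF-Agents REINFORCE tutorial. The environment, layer sizes, learning rate, and choice of optimizer are placeholder assumptions, not recommendations:

import tensorflow as tf
from tf_agents.agents.reinforce import reinforce_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import actor_distribution_network

# Placeholder environment; any TF environment with matching specs works.
env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

actor_net = actor_distribution_network.ActorDistributionNetwork(
    env.observation_spec(),
    env.action_spec(),
    fc_layer_params=(100,))  # assumed hidden layer sizes

agent = reinforce_agent.ReinforceAgent(
    env.time_step_spec(),
    env.action_spec(),
    actor_network=actor_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    normalize_returns=True,
    train_step_counter=tf.Variable(0))
agent.initialize()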

Attributes:

  • action_spec: TensorSpec describing the action produced by the agent.

  • collect_data_spec: Returns a Trajectory spec, as expected by the collect_policy.

  • collect_policy: Return a policy that can be used to collect data from the environment.

  • debug_summaries

  • name: Returns the name of this module as passed or determined in the ctor.

    NOTE: This is not the same as the self.name_scope.name which includes parent module names.

  • name_scope: Returns a tf.name_scope instance for this class.

  • policy: Return the current policy held by the agent.

  • submodules: Sequence of all sub-modules.

    Submodules are modules which are properties of this module, or found as properties of modules which are properties of this module (and so on).

a = tf.Module()
b = tf.Module()
c = tf.Module()
a.b = b
b.c = c
assert list(a.submodules) == [b, c]
assert list(b.submodules) == [c]
assert list(c.submodules) == []

  • summaries_enabled

  • summarize_grads_and_vars

  • time_step_spec: Describes the TimeStep tensors expected by the agent.

  • train_sequence_length: The number of time steps needed in experience tensors passed to train.

    Train requires experience to be a Trajectory containing tensors shaped [B, T, ...]. This argument describes the value of T required.

    For example, for non-RNN DQN training, T=2 because DQN trains on single transitions, and a single transition spans two consecutive time steps.

    If this value is None, then train can handle an unknown T (it can be determined at runtime from the data). Most RNN-based agents fall into this category.

  • train_step_counter

  • trainable_variables: Sequence of trainable variables owned by this module and its submodules.

  • variables: Sequence of variables owned by this module and its submodules.

Methods

entropy_regularization_loss

entropy_regularization_loss(
    actions_distribution, weights=None
)

Computes the optional entropy regularization loss.

Extending REINFORCE by entropy regularization was originally proposed in "Function optimization using connectionist reinforcement learning algorithms." (Williams and Peng, 1991).

Args:

  • actions_distribution: A possibly batched tuple of action distributions.
  • weights: Optional scalar or element-wise (per-batch-entry) importance weights. May include a mask for invalid timesteps.

Returns:

  • entropy_regularization_loss: A tensor with the entropy regularization loss.
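
Conceptually, the regularizer rewards higher-entropy (more exploratory) action distributions. The following is an illustrative sketch of that idea, not the agent's exact reduction; the distribution, coefficient, and mask are hypothetical:

import tensorflow as tf
import tensorflow_probability as tfp

entropy_regularization = 0.01  # illustrative coefficient

# A batched action distribution and a mask of valid timesteps (both hypothetical).
actions_distribution = tfp.distributions.Categorical(logits=tf.random.normal([4, 2]))
weights = tf.constant([1.0, 1.0, 1.0, 0.0])

entropy = actions_distribution.entropy() * weights
# Maximizing entropy is implemented as minimizing its negative, scaled by the coefficient.
entropy_regularization_loss = -entropy_regularization * tf.reduce_mean(entropy)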

initialize

initialize()

Initializes the agent.

Returns:

An operation that can be used to initialize the agent.

Raises:

  • RuntimeError: If the class was not initialized properly (super().__init__ was not called).

policy_gradient_loss

policy_gradient_loss(
    actions_distribution, actions, is_boundary, returns, num_episodes, weights=None
)

Computes the policy gradient loss.

Args:

  • actions_distribution: A possibly batched tuple of action distributions.
  • actions: Tensor with a batch of actions.
  • is_boundary: Tensor of booleans that indicate if the corresponding action was in a boundary trajectory and should be ignored.
  • returns: Tensor with a return from each timestep, aligned on index. Works better when returns are normalized.
  • num_episodes: Number of episodes contained in the training data.
  • weights: Optional scalar or element-wise (per-batch-entry) importance weights. May include a mask for invalid timesteps.

Returns:

  • policy_gradient_loss: A tensor that will contain policy gradient loss for the on-policy experience.
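
For intuition, here is a sketch of the classic REINFORCE estimator this loss is based on: the negative log-probability of each taken action, weighted by its return, masked at episode boundaries, and normalized by the number of episodes. The agent's exact masking and normalization may differ, and all tensors below are hypothetical:

import tensorflow as tf
import tensorflow_probability as tfp

# Hypothetical data: one episode of four steps over two discrete actions.
actions_distribution = tfp.distributions.Categorical(logits=tf.random.normal([4, 2]))
actions = tf.constant([0, 1, 1, 0])
returns = tf.constant([1.0, 0.9, 0.5, 0.0])
is_boundary = tf.constant([False, False, False, True])
num_episodes = 1

log_prob = actions_distribution.log_prob(actions)
mask = 1.0 - tf.cast(is_boundary, tf.float32)  # ignore boundary transitions
policy_gradient_loss = -tf.reduce_sum(log_prob * returns * mask) / num_episodes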

total_loss

total_loss(
    experience, returns, weights, training=False
)

train

train(
    experience, weights=None
)

Trains the agent.

Args:

  • experience: A batch of experience data in the form of a Trajectory. The structure of experience must match that of self.collect_data_spec. All tensors in experience must be shaped [batch, time, ...] where time must be equal to self.train_sequence_length if that property is not None.
  • weights: (optional). A Tensor, either 0-D or shaped [batch], containing weights to be used when calculating the total train loss. Weights are typically multiplied elementwise against the per-batch loss, but the implementation is up to the Agent.

Returns:

A LossInfo loss tuple containing loss and info tensors.

  • In eager mode, the loss values are first calculated, then a train step is performed before they are returned.
  • In graph mode, executing any or all of the loss tensors will first calculate the loss value(s), then perform a train step, and return the pre-train-step LossInfo.

Raises:

  • TypeError: If experience is not type Trajectory. Or if experience does not match self.collect_data_spec structure types.
  • ValueError: If experience tensors' time axes are not compatible with self.train_sequence_length. Or if experience does not match self.collect_data_spec structure.
  • RuntimeError: If the class was not initialized properly (super().__init__ was not called).
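
A usage sketch, continuing the construction example above and assuming experience is collected into a TFUniformReplayBuffer whose data_spec matches agent.collect_data_spec:

from tf_agents.replay_buffers import tf_uniform_replay_buffer

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=env.batch_size,
    max_length=2000)  # assumed capacity

# ... drive the environment with agent.collect_policy and add full episodes
# to the buffer via replay_buffer.add_batch(trajectory) ...

experience = replay_buffer.gather_all()  # Trajectory shaped [batch, time, ...]
loss_info = agent.train(experience)
replay_buffer.clear()
print(loss_info.loss)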

value_estimation_loss

value_estimation_loss(
    value_preds, returns, num_episodes, weights=None
)

Computes the value estimation loss.

Args:

  • value_preds: Per-timestep estimated values.
  • returns: Per-timestep returns for value function to predict.
  • num_episodes: Number of episodes contained in the training data.
  • weights: Optional scalar or element-wise (per-batch-entry) importance weights. May include a mask for invalid timesteps.

Returns:

  • value_estimation_loss: A scalar tensor containing the value estimation loss.
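
One plausible form, consistent with the squared-error objective described above, is sketched below; the agent's actual masking and reduction may differ, and all tensors are hypothetical:

import tensorflow as tf

# Hypothetical per-timestep predictions, targets, and validity weights.
value_preds = tf.constant([0.8, 0.7, 0.6, 0.0])
returns = tf.constant([1.0, 0.9, 0.5, 0.0])
weights = tf.constant([1.0, 1.0, 1.0, 0.0])
num_episodes = 1

squared_error = tf.math.squared_difference(value_preds, returns)
value_estimation_loss = tf.reduce_sum(squared_error * weights) / num_episodes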

with_name_scope

@classmethod
with_name_scope(
    cls, method
)

Decorator to automatically enter the module name scope.

class MyModule(tf.Module):
  @tf.Module.with_name_scope
  def __call__(self, x):
    if not hasattr(self, 'w'):
      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))
    return tf.matmul(x, self.w)

Using the above module would produce tf.Variables and tf.Tensors whose names include the module name:

mod = MyModule()
mod(tf.ones([8, 32]))
# ==> <tf.Tensor: ...>
mod.w
# ==> <tf.Variable ...'my_module/w:0'>

Args:

  • method: The method to wrap.

Returns:

The original method wrapped such that it enters the module's name scope.