


Abstract base class for TF RL agents.

    time_step_spec, action_spec, policy, collect_policy, train_sequence_length,
    num_outer_dims=2, debug_summaries=False, summarize_grads_and_vars=False,
    enable_summaries=True, train_step_counter=None


Args:

  • time_step_spec: A nest of tf.TypeSpec representing the time_steps. Provided by the user.
  • action_spec: A nest of BoundedTensorSpec representing the actions. Provided by the user.
  • policy: An instance of tf_policy.Base representing the Agent's current policy.
  • collect_policy: An instance of tf_policy.Base representing the Agent's current data collection policy (used to set self.step_spec).
  • train_sequence_length: A python integer or None, signifying the number of time steps required from tensors in experience as passed to train(). All tensors in experience will be shaped [B, T, ...] but for certain agents, T should be fixed. For example, DQN requires transitions in the form of 2 time steps, so for a non-RNN DQN Agent, set this value to 2. For agents that don't care, or which can handle T unknown at graph build time (i.e. most RNN-based agents), set this argument to None.
  • num_outer_dims: The number of outer dimensions for the agent. Must be either 1 or 2. If 2, training will require both a batch_size and time dimension on every Tensor; if 1, training will require only a batch_size outer dimension.
  • debug_summaries: A bool; if true, subclasses should gather debug summaries.
  • summarize_grads_and_vars: A bool; if true, subclasses should additionally collect gradient and variable summaries.
  • enable_summaries: A bool; if false, subclasses should not gather any summaries (debug or otherwise); subclasses should gate all summaries using either summaries_enabled, debug_summaries, or summarize_grads_and_vars properties.
  • train_step_counter: An optional counter to increment every time the train op is run. Defaults to the global_step.
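The num_outer_dims argument above can be illustrated with a small sketch. This is a hypothetical, TensorFlow-free helper (check_outer_dims is not part of the tf_agents API); shapes are plain tuples standing in for tensor shapes:

```python
# Hypothetical sketch of the outer-dimension rule described above.
# num_outer_dims == 2 -> tensors need both a batch and a time dimension,
#                        i.e. shapes of the form [B, T, ...].
# num_outer_dims == 1 -> tensors need only a batch dimension, [B, ...].

def check_outer_dims(shape, num_outer_dims):
    """Return True if `shape` has at least the required outer dimensions."""
    if num_outer_dims not in (1, 2):
        raise ValueError("num_outer_dims must be 1 or 2, got %r" % (num_outer_dims,))
    return len(shape) >= num_outer_dims

# A [B, T, obs_dim] tensor satisfies the two-outer-dims training requirement.
assert check_outer_dims((32, 2, 84), num_outer_dims=2)
# A [B] tensor only satisfies the single-outer-dim requirement.
assert check_outer_dims((32,), num_outer_dims=1)
assert not check_outer_dims((32,), num_outer_dims=2)
```

The real agent performs this validation against nests of tf.TensorSpec rather than bare shape tuples.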


Attributes:

  • action_spec: TensorSpec describing the action produced by the agent.

  • collect_data_spec: Returns a Trajectory spec, as expected by the collect_policy.

  • collect_policy: Return a policy that can be used to collect data from the environment.

  • debug_summaries

  • name: Returns the name of this module as passed or determined in the ctor.

    NOTE: This is not the same as the self.name_scope.name which includes parent module names.

  • name_scope: Returns a tf.name_scope instance for this class.

  • policy: Return the current policy held by the agent.

  • submodules: Sequence of all sub-modules.

    Submodules are modules which are properties of this module, or found as properties of modules which are properties of this module (and so on).

a = tf.Module()
b = tf.Module()
c = tf.Module()
a.b = b
b.c = c
assert list(a.submodules) == [b, c]
assert list(b.submodules) == [c]
assert list(c.submodules) == []
  • summaries_enabled
  • summarize_grads_and_vars
  • time_step_spec: Describes the TimeStep tensors expected by the agent.

  • train_sequence_length: The number of time steps needed in experience tensors passed to train.

    Train requires experience to be a Trajectory containing tensors shaped [B, T, ...]. This argument describes the value of T required.

    For example, for non-RNN DQN training T=2, because DQN trains on single transitions, each of which spans two consecutive time steps.

    If this value is None, then train can handle an unknown T (it can be determined at runtime from the data). Most RNN-based agents fall into this category.

  • train_step_counter

  • trainable_variables: Sequence of trainable variables owned by this module and its submodules.

  • variables: Sequence of variables owned by this module and its submodules.
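The train_sequence_length compatibility rule described above can be sketched as a simple predicate. This is a hypothetical illustration (time_axis_compatible is not a tf_agents function):

```python
# Hypothetical sketch of the time-axis rule for train_sequence_length:
# if it is None, any T is accepted (determined at runtime from the data);
# otherwise the time axis T of the experience must match it exactly.

def time_axis_compatible(t, train_sequence_length):
    """Return True if a time axis of length `t` satisfies the agent's requirement."""
    return train_sequence_length is None or t == train_sequence_length

# Non-RNN DQN: transitions span exactly 2 time steps.
assert time_axis_compatible(2, train_sequence_length=2)
assert not time_axis_compatible(3, train_sequence_length=2)
# Most RNN-based agents accept an unknown T.
assert time_axis_compatible(7, train_sequence_length=None)
```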


Raises:

  • ValueError: If time_step_spec is not an instance of ts.TimeStep.
  • ValueError: If num_outer_dims is not in [1, 2].



initialize

Initializes the agent.


Returns:

An operation that can be used to initialize the agent.


Raises:

  • RuntimeError: If the class was not initialized properly (super().__init__ was not called).


train

    experience, weights=None

Trains the agent.


Args:

  • experience: A batch of experience data in the form of a Trajectory. The structure of experience must match that of self.collect_data_spec. All tensors in experience must be shaped [batch, time, ...], where time must equal self.train_sequence_length if that property is not None.
  • weights: (optional). A Tensor, either 0-D or shaped [batch], containing weights to be used when calculating the total train loss. Weights are typically multiplied elementwise against the per-batch loss, but the implementation is up to the Agent.
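As noted above, how weights enter the total loss is agent-specific; a common pattern is an elementwise multiply followed by a mean reduction. The sketch below mirrors that pattern in plain Python using a LossInfo-style named tuple (the reduction shown is an assumption, not the definitive implementation of any particular agent):

```python
# Hypothetical sketch: per-batch weights multiplied elementwise against
# the per-batch loss, then averaged into a scalar total loss.
import collections

# Mirrors the (loss, extra) shape of a LossInfo tuple.
LossInfo = collections.namedtuple("LossInfo", ("loss", "extra"))

def weighted_loss(per_batch_loss, weights=None):
    """Reduce a list of per-batch losses to a LossInfo, applying optional weights."""
    if weights is not None:
        per_batch_loss = [l * w for l, w in zip(per_batch_loss, weights)]
    total = sum(per_batch_loss) / len(per_batch_loss)
    return LossInfo(loss=total, extra=None)

# Zero weight masks out the second batch element entirely.
info = weighted_loss([1.0, 3.0], weights=[1.0, 0.0])
assert info.loss == 0.5
```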


Returns:

A LossInfo loss tuple containing loss and info tensors.

  • In eager mode, the loss values are first calculated, then a train step is performed before they are returned.
  • In graph mode, executing any or all of the loss tensors will first calculate the loss value(s), then perform a train step, and return the pre-train-step LossInfo.


Raises:

  • TypeError: If experience is not of type Trajectory, or if experience does not match the structure types of self.collect_data_spec.
  • ValueError: If experience tensors' time axes are not compatible with self.train_sequence_length, or if experience does not match the self.collect_data_spec structure.
  • RuntimeError: If the class was not initialized properly (super().__init__ was not called).


with_name_scope

    cls, method

Decorator to automatically enter the module name scope.

class MyModule(tf.Module):
  def __call__(self, x):
    if not hasattr(self, 'w'):
      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))
    return tf.matmul(x, self.w)

Using the above module would produce tf.Variables and tf.Tensors whose names included the module name:

mod = MyModule()
mod(tf.ones([8, 32]))
# ==> <tf.Tensor: ...>
# ==> <tf.Variable ...'my_module/w:0'>


Args:

  • method: The method to wrap.


Returns:

The original method wrapped such that it enters the module's name scope.