If False, time_steps will be unbatched before being passed to
py_policy.action(), and a batch dimension will be added to the returned
action. This only works with time_steps that have a batch dimension of 1.
If True, the time_step (input) and action (output) are passed to and from
the py_policy exactly as is.
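A minimal sketch (no TensorFlow) of the behavior described above: when the flag is False, the wrapper strips a batch dimension of 1 from the time_step before calling py_policy.action(), then re-adds it to the returned action; when True, everything passes through unchanged. The helper names (`unbatch`, `batch`, `wrapped_action`) are illustrative stand-ins, not part of the real API.

```python
def unbatch(nested):
    """Strip a leading batch dimension of 1 from every leaf of a nested dict."""
    if isinstance(nested, dict):
        return {k: unbatch(v) for k, v in nested.items()}
    assert len(nested) == 1, "only a batch dimension of 1 is supported"
    return nested[0]

def batch(nested):
    """Re-add a leading batch dimension of 1."""
    if isinstance(nested, dict):
        return {k: batch(v) for k, v in nested.items()}
    return [nested]

def wrapped_action(py_policy_action, time_step, batched=True):
    if batched:
        # Pass the time_step (input) and action (output) exactly as is.
        return py_policy_action(time_step)
    # Unbatch (batch dim must be 1), call, then re-add the batch dimension.
    return batch(py_policy_action(unbatch(time_step)))
```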
The name of this policy. All variables created by this module will be
scoped under that name. Defaults to the class name.
If a non-Python policy is passed to the constructor.
Describes the TensorSpecs of the Tensors expected by step(action).
action can be a single Tensor, or a nested dict, list or tuple of Tensors.
Describes the Tensors written when using this policy with an environment.
Whether this policy instance emits log probabilities or not.
Describes the Tensors emitted as info by action and distribution.
info can be an empty tuple, a single Tensor, or a nested dict,
list or tuple of Tensors.
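A minimal sketch (no TensorFlow) of what "an empty tuple, a single Tensor, or a nested dict, list or tuple" means structurally: a recursive check that an `info` value mirrors the nested shape of its spec. The specs here are plain Python stand-ins, not real TensorSpecs, and `same_structure` is a hypothetical helper.

```python
def same_structure(spec, value):
    """Return True if `value` has the same nested structure as `spec`."""
    if isinstance(spec, dict):
        return (isinstance(value, dict)
                and spec.keys() == value.keys()
                and all(same_structure(spec[k], value[k]) for k in spec))
    if isinstance(spec, (list, tuple)):
        return (type(value) is type(spec)
                and len(value) == len(spec)
                and all(same_structure(s, v) for s, v in zip(spec, value)))
    return not isinstance(value, (dict, list, tuple))  # leaf value

# info may be an empty tuple, a single leaf, or a nested structure:
assert same_structure((), ())
assert same_structure(0.0, 0.5)
assert same_structure({'log_prob': 0.0}, {'log_prob': -1.2})
```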
Returns the name of this module as passed or determined in the constructor.
Generates the next action given the time_step and policy_state.
A TimeStep tuple corresponding to time_step_spec().
A Tensor, or a nested dict, list or tuple of Tensors
representing the previous policy_state.
Seed to use if action performs sampling (optional).
A PolicyStep named tuple containing:
action: An action Tensor matching the action_spec().
state: A policy state tensor to be fed into the next call to action.
info: Optional side information such as action log probabilities.
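A minimal sketch (no TensorFlow) of the calling convention above: a PolicyStep-like named tuple and a toy stateful action() whose returned state is fed into the next call. All names here are illustrative stand-ins for the real policy API.

```python
from collections import namedtuple

# Mirrors the documented fields: action, state, info.
PolicyStep = namedtuple('PolicyStep', ['action', 'state', 'info'])

def action(time_step, policy_state=(), seed=None):
    """Toy policy: echoes the observation and counts calls in its state."""
    count = policy_state[0] if policy_state else 0
    return PolicyStep(action=time_step['observation'],
                      state=(count + 1,),  # fed into the next call
                      info=())             # e.g. action log probabilities

step = action({'observation': 1.0})
step = action({'observation': 2.0}, policy_state=step.state)
# step.state is now (2,): the state was threaded through both calls.
```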