
Class to build GreedyMultiObjectiveNeuralPolicy objects.

Inherits From: TFPolicy

time_step_spec A TimeStep spec of the expected time_steps.
action_spec A nest of BoundedTensorSpec representing the actions.
scalarizer A tf_agents.bandits.multi_objective.multi_objective_scalarizer.Scalarizer object that implements scalarization of multiple objectives into a single scalar reward.
objective_networks A Sequence of network objects to be used by the policy. Each network is called with call(observation, step_type) and is expected to return a prediction of one specific objective for all actions.
observation_and_action_constraint_splitter A function used for masking valid/invalid actions with each state of the environment. The function takes in a full observation and returns a tuple consisting of 1) the part of the observation intended as input to the network and 2) the mask. The mask should be a 0-1 Tensor of shape [batch_size, num_actions]. This function should also work with a TensorSpec as input, and should output TensorSpec objects for the observation and mask.
accepts_per_arm_features (bool) Whether the policy accepts per-arm features.
emit_policy_info (tuple of strings) The side information to include in the policy info. Allowed values can be found in policy_utilities.PolicyInfo.
name The name of this policy. All variables in this module will fall under that name. Defaults to the class name.
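As a rough illustration of the observation_and_action_constraint_splitter argument described above, the following is a minimal sketch. It assumes the full observation is a dict with hypothetical keys "features" and "mask"; those key names are not part of the library. Plain Python lists stand in for tensors.

```python
# Hypothetical observation layout: a dict with "features" and "mask" entries.
# The key names are assumptions made for this sketch, not library conventions.
def observation_and_action_constraint_splitter(observation):
    """Return (network_input, action_mask) from a full observation."""
    # Plain indexing works the same way on a dict of tensors and on a dict of
    # specs, so one function can serve both kinds of input.
    return observation["features"], observation["mask"]


# Toy batch of two observations with three actions each.
full_observation = {
    "features": [[0.1, 0.2], [0.3, 0.4]],
    "mask": [[1, 0, 1], [0, 1, 1]],  # 0-1 mask, shape [batch_size, num_actions]
}
network_input, mask = observation_and_action_constraint_splitter(full_observation)
```

Because the function only indexes into its input, it needs no special handling to accept a nest of specs instead of a nest of tensors.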

NotImplementedError If action_spec contains more than one BoundedTensorSpec or the BoundedTensorSpec is not valid.
NotImplementedError If action_spec is not a BoundedTensorSpec of type int32 and shape ().
ValueError If objective_networks has fewer than two networks.
ValueError If accepts_per_arm_features is true but time_step_spec is None.
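
To make the roles of scalarizer and objective_networks more concrete, here is a pure-Python sketch of greedy selection over scalarized predictions. It is not the library's implementation: the linear weighting rule, the helper names linear_scalarizer and greedy_action, and the use of lists instead of tensors are all assumptions for illustration. It also mirrors the check that at least two objective networks are supplied.

```python
def linear_scalarizer(objective_values, weights):
    """Collapse per-objective predictions into one scalar per action.

    objective_values: list of [num_actions] lists, one list per objective.
    weights: one weight per objective (hypothetical; a real Scalarizer may
      apply a different, possibly non-linear, rule).
    """
    num_actions = len(objective_values[0])
    return [
        sum(w * values[a] for w, values in zip(weights, objective_values))
        for a in range(num_actions)
    ]


def greedy_action(objective_networks, observation, mask, weights):
    """Pick the allowed action with the highest scalarized prediction."""
    if len(objective_networks) < 2:
        raise ValueError("Expected at least two objective networks.")
    # One prediction vector (over all actions) per objective.
    predictions = [network(observation) for network in objective_networks]
    scalarized = linear_scalarizer(predictions, weights)
    # Only actions whose mask entry is 1 are eligible.
    allowed = [a for a, m in enumerate(mask) if m == 1]
    return max(allowed, key=lambda a: scalarized[a])


# Two toy "networks", each predicting one objective for three actions.
networks = [
    lambda obs: [1.0, 5.0, 2.0],  # objective 0
    lambda obs: [6.0, 0.0, 4.0],  # objective 1
]
unmasked = greedy_action(networks, None, mask=[1, 1, 1], weights=[0.5, 0.5])
masked = greedy_action(networks, None, mask=[0, 1, 1], weights=[0.5, 0.5])
```

With equal weights the scalarized values are 3.5, 2.5 and 3.0, so action 0 wins when every action is allowed and action 2 wins once action 0 is masked out.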


action_spec Describes the TensorSpecs of the Tensors expected by step(action).

action can be a single Tensor, or a nested dict, list or tuple of Tensors.

collect_data_spec Describes the Tensors written when using this policy with an environment.
emit_log_probability Whether this policy instance emits log probabilities or not.
info_spec Describes the Tensors emitted as info by action and distribution.

info can be an empty tuple, a single Tensor, or a nested dict, list or tuple of Tensors.


policy_state_spec Describes the Tensors expected by step(_, policy_state).

policy_state can be an empty tuple, a single Tensor, or a nested dict, list or tuple of Tensors.

policy_step_spec Describes the output of action().

time_step_spec Describes the TimeStep tensors returned by step().
trajectory_spec Describes the Tensors written when using this policy with an environment.
validate_args Whether action & distribution validate input and output args.



Generates next action given the time_step and policy_state.

time_step A TimeStep tuple corresponding to time_step_spec().
policy_state A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state.
seed Seed to use if action performs sampling (optional).

A PolicyStep named tuple containing:

action: An action Tensor matching the action_spec.
state: A policy state tensor to be fed into the next call to action.
info: Optional side information such as action log probabilities.

RuntimeError If subclass __init__ didn't call super().__init__.
ValueError or TypeError: If validate_args is True and inputs or outputs do not match time_step_spec, policy_state_spec, or policy_step_spec.


Generates the distribution over next actions given the time_step.

time_step A TimeStep tuple corresponding to time_step_spec().
policy_state A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state.

A PolicyStep named tuple containing:

action: A tf.distribution capturing the distribution of next actions.
state: A policy state tensor for the next call to distribution.
info: Optional side information such as action log probabilities.

ValueError or TypeError: If validate_args is True and inputs or outputs do not match time_step_spec, policy_state_spec, or policy_step_spec.


Returns an initial state usable by the policy.

batch_size Tensor or constant: size of the batch dimension. Can be None, in which case no batch dimension is added.

A nested object of type policy_state containing properly initialized Tensors.
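
The way the initial state and the state field of the returned PolicyStep fit together can be sketched with a stand-in policy. ToyPolicy below is hypothetical and does not derive from TFPolicy. It only mimics the call pattern: obtain an initial state, then feed each returned state into the next call. The local PolicyStep is a plain namedtuple with the same three field names, and the strings are placeholders for real TimeStep tuples.

```python
import collections

# Stand-in for the PolicyStep named tuple (same three field names).
PolicyStep = collections.namedtuple("PolicyStep", ["action", "state", "info"])


class ToyPolicy:
    """Hypothetical policy that counts calls in its state; not a TFPolicy."""

    def get_initial_state(self, batch_size):
        # One call counter per batch entry.
        return [0] * batch_size

    def action(self, time_step, policy_state, seed=None):
        next_state = [count + 1 for count in policy_state]
        # Always choose action 0; a real policy would inspect time_step.
        return PolicyStep(action=[0] * len(policy_state), state=next_state, info=())


policy = ToyPolicy()
policy_state = policy.get_initial_state(batch_size=2)
for time_step in ["t0", "t1", "t2"]:  # placeholders for real TimeStep tuples
    step = policy.action(time_step, policy_state)
    policy_state = step.state  # feed the returned state into the next call
```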


Update the current policy with another policy.

This would include copying the variables from the other policy.

policy Another policy it can update from.
tau A float scalar in [0, 1]. When tau is 1.0 (the default), we do a hard update. This is used for trainable variables.
tau_non_trainable A float scalar in [0, 1] for non_trainable variables. If None, will copy from tau.
sort_variables_by_name A bool. When True, the variables are sorted by name before the update.

A TF op to do the update.
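
The effect of tau can be sketched in plain Python. The snippet assumes the usual interpolation rule, new_target = (1 - tau) * target + tau * source, under which tau = 1.0 reduces to a hard copy. The helper name soft_update and the use of float lists in place of variables are assumptions for illustration, not the library's code.

```python
def soft_update(target, source, tau=1.0):
    """Blend source values into target values; tau=1.0 is a hard copy."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must be in [0, 1].")
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]


target_vars = [0.0, 10.0]
source_vars = [1.0, 20.0]
hard = soft_update(target_vars, source_vars)           # tau defaults to 1.0
soft = soft_update(target_vars, source_vars, tau=0.5)  # halfway blend
```

Under this rule, a separate tau_non_trainable would simply apply the same blend with a different coefficient to the non-trainable variables.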