
tf_agents.bandits.policies.linear_bandit_policy.LinearBanditPolicy

View source on GitHub

Linear Bandit Policy to be used by LinUCB, LinTS and possibly others.

Inherits From: Base

tf_agents.bandits.policies.linear_bandit_policy.LinearBanditPolicy(
    action_spec, cov_matrix, data_vector, num_samples, time_step_spec=None,
    exploration_strategy=tf_agents.bandits.policies.linear_bandit_policy.ExplorationStrategy.optimistic,
    alpha=1.0, eig_vals=(), eig_matrix=(), tikhonov_weight=1.0,
    add_bias=False, emit_policy_info=(), emit_log_probability=False,
    observation_and_action_constraint_splitter=None, name=None
)

Args:

  • action_spec: TensorSpec containing action specification.
  • cov_matrix: list of the covariance matrices A in the paper. There exists one A matrix per arm.
  • data_vector: list of the b vectors in the paper. The b vector is a weighted sum of the observations, where the weight is the corresponding reward. Each arm has its own vector b.
  • num_samples: list of number of samples per arm.
  • time_step_spec: A TimeStep spec of the expected time_steps.
  • exploration_strategy: An Enum of type ExplorationStrategy. The strategy used to choose actions that incorporate exploration. Currently supported strategies are optimistic and sampling.
  • alpha: a float value used to scale the confidence intervals.
  • eig_vals: list of eigenvalues for each covariance matrix (one per arm).
  • eig_matrix: list of eigenvectors for each covariance matrix (one per arm).
  • tikhonov_weight: (float) Tikhonov regularization term.
  • add_bias: If True, a bias term will be added to the linear reward estimation.
  • emit_policy_info: (tuple of strings) what side information we want to get as part of the policy info. Allowed values can be found in policy_utilities.PolicyInfo.
  • emit_log_probability: Whether to emit log probabilities.
  • observation_and_action_constraint_splitter: A function used for masking valid/invalid actions with each state of the environment. The function takes in a full observation and returns a tuple consisting of 1) the part of the observation intended as input to the bandit policy and 2) the mask. The mask should be a 0-1 Tensor of shape [batch_size, num_actions]. This function should also work with a TensorSpec as input, and should output TensorSpec objects for the observation and mask.
  • name: The name of this policy.
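
The following is a minimal construction sketch, not taken from the API reference: the context dimension, the number of arms, and the identity/zero initialization of the per-arm statistics are illustrative assumptions.

import tensorflow as tf
from tf_agents.bandits.policies import linear_bandit_policy
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# Illustrative sizes; not prescribed by the API.
context_dim = 4
num_arms = 3

observation_spec = tf.TensorSpec([context_dim], tf.float32)
time_step_spec = ts.time_step_spec(observation_spec)
action_spec = tensor_spec.BoundedTensorSpec(
    shape=(), dtype=tf.int32, minimum=0, maximum=num_arms - 1)

# One covariance matrix A, data vector b, and sample count per arm,
# initialized here as if no data had been observed yet (assumption).
cov_matrix = [tf.eye(context_dim) for _ in range(num_arms)]
data_vector = [tf.zeros([context_dim]) for _ in range(num_arms)]
num_samples = [tf.constant(0, dtype=tf.int64) for _ in range(num_arms)]

policy = linear_bandit_policy.LinearBanditPolicy(
    action_spec=action_spec,
    cov_matrix=cov_matrix,
    data_vector=data_vector,
    num_samples=num_samples,
    time_step_spec=time_step_spec,
    alpha=1.0)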

Attributes:

  • action_spec: Describes the TensorSpecs of the Tensors expected by step(action).

    action can be a single Tensor, or a nested dict, list or tuple of Tensors.

  • collect_data_spec: Describes the Tensors written when using this policy with an environment.

  • emit_log_probability: Whether this policy instance emits log probabilities or not.

  • info_spec: Describes the Tensors emitted as info by action and distribution.

    info can be an empty tuple, a single Tensor, or a nested dict, list or tuple of Tensors.

  • name: Returns the name of this module as passed or determined in the ctor.

    NOTE: This is not the same as the self.name_scope.name which includes parent module names.

  • name_scope: Returns a tf.name_scope instance for this class.

  • observation_and_action_constraint_splitter

  • policy_state_spec: Describes the Tensors expected by step(_, policy_state).

    policy_state can be an empty tuple, a single Tensor, or a nested dict, list or tuple of Tensors.

  • policy_step_spec: Describes the output of action().

  • submodules: Sequence of all sub-modules.

    Submodules are modules which are properties of this module, or found as properties of modules which are properties of this module (and so on).

import tensorflow as tf

a = tf.Module()
b = tf.Module()
c = tf.Module()
a.b = b
b.c = c
assert list(a.submodules) == [b, c]
assert list(b.submodules) == [c]
assert list(c.submodules) == []

  • time_step_spec: Describes the TimeStep tensors returned by step().

  • trainable_variables: Sequence of trainable variables owned by this module and its submodules.

  • trajectory_spec: Describes the Tensors written when using this policy with an environment.

Methods

action

View source

action(
    time_step, policy_state=(), seed=None
)

Generates next action given the time_step and policy_state.

Args:

  • time_step: A TimeStep tuple corresponding to time_step_spec().
  • policy_state: A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state.
  • seed: Seed to use if action performs sampling (optional).

Returns:

A PolicyStep named tuple containing:

  • action: An action Tensor matching the action_spec().
  • state: A policy state tensor to be fed into the next call to action.
  • info: Optional side information such as action log probabilities.

Raises:

  • RuntimeError: If subclass __init__ didn't call super().__init__().
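
As a hedged usage sketch, continuing the hypothetical policy constructed in the constructor example above:

# Two hypothetical context vectors of the assumed context_dim = 4.
observations = tf.ones([2, 4], dtype=tf.float32)
time_step = ts.restart(observations, batch_size=2)

action_step = policy.action(time_step)
# action_step.action: int32 Tensor of shape [2] with the chosen arm per context.
# action_step.state: () for this stateless policy.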

distribution

View source

distribution(
    time_step, policy_state=()
)

Generates the distribution over next actions given the time_step.

Args:

  • time_step: A TimeStep tuple corresponding to time_step_spec().
  • policy_state: A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state.

Returns:

A PolicyStep named tuple containing:

  • action: A tf.distribution capturing the distribution of next actions.
  • state: A policy state tensor for the next call to distribution.
  • info: Optional side information such as action log probabilities.
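
A brief, hedged sketch contrasting this with action(), reusing the hypothetical time_step from the action() example above:

dist_step = policy.distribution(time_step)
# dist_step.action is expected to be a tfp.distributions.Distribution over
# arm indices (deterministic given the data for the optimistic strategy);
# sampling it yields actions consistent with policy.action().
sampled_actions = dist_step.action.sample()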

get_initial_state

View source

get_initial_state(
    batch_size
)

Returns an initial state usable by the policy.

Args:

  • batch_size: Tensor or constant: size of the batch dimension. Can be None, in which case no batch dimension is added.

Returns:

A nested object of type policy_state containing properly initialized Tensors.
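
For a stateless policy such as this one the policy_state_spec is empty, so (under that assumption) the initial state is simply an empty tuple:

initial_state = policy.get_initial_state(batch_size=2)
# initial_state == (): the policy keeps no recurrent state between steps.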

update

View source

update(
    policy, tau=1.0, tau_non_trainable=None, sort_variables_by_name=False
)

Update the current policy with another policy.

This would include copying the variables from the other policy.

Args:

  • policy: Another policy that this policy can update from.
  • tau: A float scalar in [0, 1]. When tau is 1.0 (the default), we do a hard update. This is used for trainable variables.
  • tau_non_trainable: A float scalar in [0, 1] for non_trainable variables. If None, will copy from tau.
  • sort_variables_by_name: A bool; when True, the variables will be sorted by name before doing the update.

Returns:

A TF op that performs the update.
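
A hedged usage sketch, assuming a second policy other_policy with an identical variable structure has been constructed elsewhere:

# tau=1.0 (the default) copies variables outright; a smaller tau blends them.
update_op = policy.update(other_policy, tau=0.05)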

variables

View source

variables()

Returns the list of Variables that belong to the policy.

with_name_scope

@classmethod
with_name_scope(
    cls, method
)

Decorator to automatically enter the module name scope.

import tensorflow as tf

class MyModule(tf.Module):
  @tf.Module.with_name_scope
  def __call__(self, x):
    if not hasattr(self, 'w'):
      self.w = tf.Variable(tf.random.normal([x.shape[1], 64]))
    return tf.matmul(x, self.w)

Using the above module would produce tf.Variables and tf.Tensors whose names included the module name:

mod = MyModule()
mod(tf.ones([8, 32]))
# ==> <tf.Tensor: ...>
mod.w
# ==> <tf.Variable ...'my_module/w:0'>

Args:

  • method: The method to wrap.

Returns:

The original method wrapped such that it enters the module's name scope.