tf_agents.bandits.environments.drifting_linear_environment.DriftingLinearDynamics

Dynamics of a drifting linear environment.

Inherits From: EnvironmentDynamics

This is a drifting linear environment that computes rewards as:

rewards(t) = observation(t) * observation_to_reward(t) + additive_reward(t)

where t is the environment time. observation_to_reward slowly rotates over time. The environment time is incremented in the base class after the reward is computed. The parameters observation_to_reward and additive_reward are updated at each time step. To preserve the norm of observation_to_reward (and hence the range of reward values), the drift is applied in the form of rotations, i.e.,

observation_to_reward(t) = R(theta(t)) * observation_to_reward(t - 1)

where theta(t) is the rotation angle at time t, sampled at each step from the provided drift_distribution.
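
For intuition, here is a minimal NumPy sketch of one such drift step in two dimensions. This is an illustration, not the library's implementation: it only shows why applying the drift as a rotation preserves the norm of observation_to_reward.

import numpy as np

def drift_step(observation_to_reward, theta):
  """Rotates each column of observation_to_reward by the angle theta."""
  c, s = np.cos(theta), np.sin(theta)
  rotation = np.array([[c, -s],
                       [s, c]])  # R(theta), an orthogonal matrix.
  return rotation @ observation_to_reward

observation_to_reward = np.array([[1.0], [0.0]])  # observation_dim=2, num_actions=1.
theta = np.random.normal(loc=0.0, scale=0.1)      # Sampled drift angle.
drifted = drift_step(observation_to_reward, theta)

# Rotations are norm-preserving, so the scale of the rewards stays stable.
assert np.isclose(np.linalg.norm(drifted), np.linalg.norm(observation_to_reward))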

Args

observation_distribution A distribution from tfp.distributions with shape [batch_size, observation_dim]. Note that the values of batch_size and observation_dim are deduced from the distribution.
observation_to_reward_distribution A distribution from tfp.distributions with shape [observation_dim, num_actions]. The value of observation_dim must match the second dimension of observation_distribution.
drift_distribution A scalar distribution from tfp.distributions of type tf.float32. It represents the angle of rotation.
additive_reward_distribution A distribution from tfp.distributions with shape [num_actions]. This models the non-contextual behavior of the bandit.
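
As a construction sketch: the sizes and distribution choices below are illustrative assumptions, not requirements of the API.

import tensorflow as tf
import tensorflow_probability as tfp
from tf_agents.bandits.environments import drifting_linear_environment as dle

tfd = tfp.distributions

# Illustrative sizes; batch_size and observation_dim are deduced from
# observation_distribution.
batch_size, observation_dim, num_actions = 8, 4, 3

dynamics = dle.DriftingLinearDynamics(
    observation_distribution=tfd.Normal(
        loc=tf.zeros([batch_size, observation_dim]),
        scale=tf.ones([batch_size, observation_dim])),
    observation_to_reward_distribution=tfd.Normal(
        loc=tf.zeros([observation_dim, num_actions]),
        scale=tf.ones([observation_dim, num_actions])),
    # A scalar distribution of small angles yields a slow drift.
    drift_distribution=tfd.Normal(loc=0.0, scale=0.01),
    additive_reward_distribution=tfd.Normal(
        loc=tf.zeros([num_actions]),
        scale=tf.ones([num_actions])))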

Attributes

action_spec Specification of the actions.
batch_size Returns the batch size used for observations and rewards.
name Returns the name of this module as passed or determined in the constructor.
name_scope Returns a tf.name_scope instance for this class.
observation_spec Specification of the observations.
submodules Sequence of all sub-modules.

Submodules are modules which are properties of this module, or found as properties of modules which are properties of this module (and so on).

>>> a = tf.Module()
>>> b = tf.Module()
>>> c = tf.Module()
>>> a.b = b
>>> b.c = c
>>> list(a.submodules) == [b, c]
True
>>> list(b.submodules) == [c]
True
>>> list(c.submodules) == []
True

trainable_variables Sequence of trainable variables owned by this module and its submodules.

variables Sequence of variables owned by this module and its submodules.

Methods

compute_optimal_action

compute_optimal_reward

observation

Returns an observation batch for the given time.

Args
env_time The scalar int64 tensor of the environment time step. This is incremented by the environment after the reward is computed.

Returns
The observation batch with spec according to observation_spec.
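
For illustration, assuming dynamics is the instance from the construction sketch above (in practice the enclosing environment manages env_time):

env_time = tf.constant(0, dtype=tf.int64)
observation = dynamics.observation(env_time)
# observation has shape [batch_size, observation_dim], here [8, 4].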

reward

Reward for the given observation and time step.

Args
observation A batch of observations with spec according to observation_spec.
env_time The scalar int64 tensor of the environment time step. This is incremented by the environment after the reward is computed.

Returns
A batch of rewards with spec shape [batch_size, num_actions] containing rewards for all arms.
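
Continuing the same sketch, with the dynamics instance and env_time from above:

rewards = dynamics.reward(observation, env_time)
# rewards has shape [batch_size, num_actions], here [8, 3]: one reward per arm
# for each observation in the batch.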

with_name_scope

Decorator to automatically enter the module name scope.

>>> class MyModule(tf.Module):
...   @tf.Module.with_name_scope
...   def __call__(self, x):
...     if not hasattr(self, 'w'):
...       self.w = tf.Variable(tf.random.normal([x.shape[1], 3]))
...     return tf.matmul(x, self.w)

Using the above module would produce tf.Variables and tf.Tensors whose names include the module name:

>>> mod = MyModule()
>>> mod(tf.ones([1, 2]))
<tf.Tensor: shape=(1, 3), dtype=float32, numpy=..., dtype=float32)>
>>> mod.w
<tf.Variable 'my_module/Variable:0' shape=(2, 3) dtype=float32,
numpy=..., dtype=float32)>

Args
method The method to wrap.

Returns
The original method wrapped such that it enters the module's name scope.