Neural LinUCB Policy.
Inherits From: TFPolicy
tf_agents.bandits.policies.neural_linucb_policy.NeuralLinUCBPolicy(
    encoding_network: tf_agents.typing.types.Network,
    encoding_dim: int,
    reward_layer: tf.keras.layers.Dense,
    epsilon_greedy: float,
    actions_from_reward_layer: tf_agents.typing.types.Bool,
    cov_matrix: Sequence[tf_agents.typing.types.Float],
    data_vector: Sequence[tf_agents.typing.types.Float],
    num_samples: Sequence[tf_agents.typing.types.Int],
    time_step_spec: tf_agents.typing.types.TimeStep,
    alpha: float = 1.0,
    emit_policy_info: Sequence[Text] = (),
    emit_log_probability: bool = False,
    accepts_per_arm_features: bool = False,
    distributed_use_reward_layer: bool = False,
    observation_and_action_constraint_splitter: Optional[types.Splitter] = None,
    name: Optional[Text] = None
)
Applies LinUCB on top of an encoding network. Since LinUCB is a linear method, the encoding network is used to capture the non-linear relationship between the context features and the expected rewards. The policy starts with epsilon-greedy exploration and then switches to LinUCB for more efficient exploration.
This policy supports both the global-only observation model and the global-and-per-arm model:
-- In the global-only case, there is a single observation per time step, and every arm has its own reward estimation function.
-- In the per-arm case, all arms receive individual observations, and the reward estimation function is identical for all arms.
Reference:
Carlos Riquelme, George Tucker, Jasper Snoek, "Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling", ICLR 2018.
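As a minimal, hedged sketch (not taken from this page), the snippet below constructs the policy for a global-only problem. The sizes `context_dim` and `num_actions`, the encoder layout, and the `epsilon_greedy` value are illustrative assumptions; the LinUCB statistics are plain per-arm `tf.Variable` lists, which in practice are usually owned and updated by the corresponding agent.

```python
import tensorflow as tf
from tf_agents.bandits.policies import neural_linucb_policy
from tf_agents.networks import encoding_network
from tf_agents.trajectories import time_step as ts

# Assumed problem sizes (illustrative only).
context_dim = 10   # raw observation dimension
encoding_dim = 8   # output dimension of the encoding network
num_actions = 3

observation_spec = tf.TensorSpec([context_dim], tf.float32)
time_step_spec = ts.time_step_spec(observation_spec)

# Encoder mapping raw contexts to `encoding_dim` features; the last
# fully connected layer must have `encoding_dim` units.
encoder = encoding_network.EncodingNetwork(
    input_tensor_spec=observation_spec,
    fc_layer_params=(16, encoding_dim))
encoder.create_variables()

# Final layer that predicts one expected reward per arm.
reward_layer = tf.keras.layers.Dense(num_actions)

# LinUCB statistics: one covariance matrix, data vector and sample
# counter per arm (these would be single-element lists in the
# per-arm-features case).
cov_matrix = [tf.Variable(tf.eye(encoding_dim)) for _ in range(num_actions)]
data_vector = [tf.Variable(tf.zeros(encoding_dim)) for _ in range(num_actions)]
num_samples = [tf.Variable(0, dtype=tf.int64) for _ in range(num_actions)]

policy = neural_linucb_policy.NeuralLinUCBPolicy(
    encoding_network=encoder,
    encoding_dim=encoding_dim,
    reward_layer=reward_layer,
    epsilon_greedy=0.1,
    actions_from_reward_layer=tf.constant(True),
    cov_matrix=cov_matrix,
    data_vector=data_vector,
    num_samples=num_samples,
    time_step_spec=time_step_spec,
    alpha=1.0)
```

In the per-arm-features case, cov_matrix, data_vector and num_samples would each be single-element lists and reward_layer would have to output a scalar, as described in the Args table below.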
Args | |
---|---|
encoding_network | network that encodes the observations. |
encoding_dim | (int) dimension of the encoded observations. |
reward_layer | final layer that predicts the expected reward per arm. In case the policy accepts per-arm features, the output of this layer has to be a scalar. This is because in the per-arm case, all encoded observations have to go through the same computation to get the reward estimates. The num_actions dimension of the encoded observation is treated as a batch dimension in the reward layer. |
epsilon_greedy | (float) representing the probability of choosing a random action instead of the greedy action. |
actions_from_reward_layer | (boolean variable) whether to get actions from the reward layer or from LinUCB. |
cov_matrix | list of the covariance matrices. There exists one covariance matrix per arm, unless the policy accepts per-arm features, in which case this list must have a single element. |
data_vector | list of the data vectors. A data vector is a weighted sum of the observations, where the weight is the corresponding reward. Each arm has its own data vector, unless the policy accepts per-arm features, in which case this list must have a single element. |
num_samples | list of number of samples per arm. If the policy accepts per-arm features, this is a single-element list counting the number of steps. |
time_step_spec | A TimeStep spec of the expected time_steps. |
alpha | (float) non-negative weight multiplying the confidence intervals. |
emit_policy_info | (tuple of strings) what side information we want to get as part of the policy info. Allowed values can be found in policy_utilities.PolicyInfo. |
emit_log_probability | (bool) whether to emit log probabilities. |
accepts_per_arm_features | (bool) Whether the policy accepts per-arm features. |
distributed_use_reward_layer | (bool) Whether to pick the actions using the network or to use LinUCB. This applies only in a distributed training setting and has a similar role to the actions_from_reward_layer mentioned above. |
observation_and_action_constraint_splitter | A function used for masking valid/invalid actions with each state of the environment. The function takes in a full observation and returns a tuple consisting of 1) the part of the observation intended as input to the bandit policy and 2) the mask. The mask should be a 0-1 Tensor of shape [batch_size, num_actions]. This function should also work with a TensorSpec as input, and should output TensorSpec objects for the observation and mask. A sketch of such a splitter is shown right after this table. |
name | The name of this policy. |
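As referenced in the observation_and_action_constraint_splitter row above, a splitter for an environment whose observation is a dict is sketched below; the keys 'context' and 'mask' are assumptions for illustration, not part of the API.

```python
def observation_and_action_constraint_splitter(obs):
  # Assumes the environment emits a dict observation with a 'context'
  # entry (the bandit input) and a 'mask' entry of shape
  # [batch_size, num_actions] with 0/1 entries; the key names are
  # illustrative. Plain dict indexing also works when `obs` is a nest
  # of TensorSpecs, as required by the constructor.
  return obs['context'], obs['mask']
```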
Attributes | |
---|---|
action_spec | Describes the TensorSpecs of the Tensors expected by step(action). |
collect_data_spec | Describes the Tensors written when using this policy with an environment. |
emit_log_probability | Whether this policy instance emits log probabilities or not. |
info_spec | Describes the Tensors emitted as info by action and distribution. |
observation_and_action_constraint_splitter | |
policy_state_spec | Describes the Tensors expected by step(_, policy_state). |
policy_step_spec | Describes the output of action(). |
time_step_spec | Describes the TimeStep tensors returned by step(). |
trajectory_spec | Describes the Tensors written when using this policy with an environment. |
validate_args | Whether action & distribution validate input and output args. |
Methods
action
action(
    time_step: tf_agents.trajectories.time_step.TimeStep,
    policy_state: tf_agents.typing.types.NestedTensor = (),
    seed: Optional[types.Seed] = None
) -> tf_agents.trajectories.policy_step.PolicyStep
Generates next action given the time_step and policy_state.
Args | |
---|---|
time_step | A TimeStep tuple corresponding to time_step_spec(). |
policy_state | A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state. |
seed | Seed to use if action performs sampling (optional). |
Returns | |
---|---|
A PolicyStep named tuple containing: action, an action Tensor matching the action_spec; state, a policy state tensor to be fed into the next call to action; and info, optional side information such as action log probabilities. |
Raises | |
---|---|
RuntimeError | If subclass __init__ didn't call super().__init__. |
ValueError or TypeError | If validate_args is True and inputs or outputs do not match time_step_spec, policy_state_spec, or policy_step_spec. |
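A hedged usage sketch, reusing `policy` and `context_dim` from the construction example near the top of this page and fabricating a batch of two random contexts:

```python
import tensorflow as tf
from tf_agents.trajectories import time_step as ts

observations = tf.random.uniform([2, context_dim])   # batch of 2 contexts
time_step = ts.restart(observations, batch_size=2)   # dummy first time step
policy_step = policy.action(time_step)
print(policy_step.action)                             # chosen arm per batch entry
```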
distribution
distribution(
    time_step: tf_agents.trajectories.time_step.TimeStep,
    policy_state: tf_agents.typing.types.NestedTensor = ()
) -> tf_agents.trajectories.policy_step.PolicyStep
Generates the distribution over next actions given the time_step.
Args | |
---|---|
time_step | A TimeStep tuple corresponding to time_step_spec(). |
policy_state | A Tensor, or a nested dict, list or tuple of Tensors representing the previous policy_state. |
Returns | |
---|---|
A PolicyStep named tuple containing: action, a tf.distribution capturing the distribution of next actions; state, a policy state tensor for the next call to distribution; and info, optional side information such as action log probabilities. |
Raises | |
---|---|
ValueError or TypeError | If validate_args is True and inputs or outputs do not match time_step_spec, policy_state_spec, or policy_step_spec. |
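Analogously to action, a short hedged sketch, reusing `policy` and `time_step` from the previous example:

```python
# The `action` field of the returned PolicyStep is a distribution over arms.
dist_step = policy.distribution(time_step)
action_distribution = dist_step.action
```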
get_initial_state
get_initial_state(
    batch_size: Optional[types.Int]
) -> tf_agents.typing.types.NestedTensor
Returns an initial state usable by the policy.
Args | |
---|---|
batch_size | Tensor or constant: size of the batch dimension. Can be None, in which case no dimensions get added. |
Returns | |
---|---|
A nested object of type policy_state containing properly initialized Tensors. |
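A trivial sketch, reusing `policy` from the construction example above; this bandit policy keeps no recurrent state, so the returned structure is expected to be empty.

```python
# Empty policy state for a batch of 2, matching policy_state_spec.
initial_state = policy.get_initial_state(batch_size=2)
```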
update
update(
    policy,
    tau: float = 1.0,
    tau_non_trainable: Optional[float] = None,
    sort_variables_by_name: bool = False
) -> tf.Operation
Update the current policy with another policy.
This includes copying the variables from the other policy.
Args | |
---|---|
policy | Another policy it can update from. |
tau | A float scalar in [0, 1]. When tau is 1.0 (the default), we do a hard update. This is used for trainable variables. |
tau_non_trainable | A float scalar in [0, 1] for non-trainable variables. If None, will copy from tau. |
sort_variables_by_name | A bool; when True, the variables are sorted by name before doing the update. |
Returns | |
---|---|
A TF op to do the update. |
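A hedged sketch of how update might be called, assuming a second NeuralLinUCBPolicy instance named `other_policy` (a hypothetical name) built with the same constructor arguments as `policy` above:

```python
# `other_policy` is assumed to be a second, identically constructed
# NeuralLinUCBPolicy; the name is illustrative.
hard_update_op = policy.update(other_policy, tau=1.0)   # copy all variables
soft_update_op = policy.update(other_policy, tau=0.05)  # Polyak-style soft update
```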