Train a Deep Q Network with TF-Agents

Introduction

This example shows how to train a DQN (Deep Q-Network) agent on the CartPole environment using the TF-Agents library.

It walks you through all the components of a Reinforcement Learning (RL) pipeline for training, evaluation, and data collection.

Setup

If you haven't installed the following dependencies, run:

sudo apt-get install -y xvfb ffmpeg
pip install -q gym
pip install -q 'imageio==2.4.0'
pip install -q PILLOW
pip install -q pyglet
pip install -q pyvirtualdisplay
pip install -q tf-agents



from __future__ import absolute_import, division, print_function

import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import pyvirtualdisplay

import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import q_network
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.utils import common
tf.compat.v1.enable_v2_behavior()

# Set up a virtual display for rendering OpenAI gym environments.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()
tf.version.VERSION
'2.3.1'

Hyperparameters

num_iterations = 20000 # @param {type:"integer"}

initial_collect_steps = 100  # @param {type:"integer"} 
collect_steps_per_iteration = 1  # @param {type:"integer"}
replay_buffer_max_length = 100000  # @param {type:"integer"}

batch_size = 64  # @param {type:"integer"}
learning_rate = 1e-3  # @param {type:"number"}
log_interval = 200  # @param {type:"integer"}

num_eval_episodes = 10  # @param {type:"integer"}
eval_interval = 1000  # @param {type:"integer"}

Environment

In Reinforcement Learning (RL), an environment represents the task or problem to be solved. Standard environments can be created in TF-Agents using tf_agents.environments suites. TF-Agents has suites for loading environments from sources such as the OpenAI Gym, Atari, and DM Control.

Load the CartPole environment from the OpenAI Gym suite.

env_name = 'CartPole-v0'
env = suite_gym.load(env_name)

You can render this environment to see how it looks. A free-swinging pole is attached to a cart. The goal is to move the cart right or left in order to keep the pole pointing up.

env.reset()
PIL.Image.fromarray(env.render())

The environment.step method takes an action in the environment and returns a TimeStep tuple containing the next observation of the environment and the reward for the action.

The time_step_spec() method returns the specification for the TimeStep tuple. Its observation attribute shows the shape of observations, the data types, and the ranges of allowed values. The reward attribute shows the same details for the reward.

print('Observation Spec:')
print(env.time_step_spec().observation)
Observation Spec:
BoundedArraySpec(shape=(4,), dtype=dtype('float32'), name='observation', minimum=[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], maximum=[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38])
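
The bounds printed above are easier to read once converted. The following sketch uses only the Python standard library and checks two things that can be verified from the printed values: the velocity bounds equal the largest finite float32 value (so those components are effectively unbounded), and the pole-angle bound of 0.41887903 radians is about 24 degrees. The variable names are illustrative and not part of TF-Agents.

```python
import math

# Values copied from the printed observation spec above.
velocity_bound = 3.4028235e+38
angle_bound_rad = 4.1887903e-01

# Largest finite IEEE 754 single-precision (float32) value.
float32_max = (2 - 2**-23) * 2.0**127

# The velocity bounds match float32 max, i.e. velocity is effectively unbounded.
velocity_is_unbounded = math.isclose(velocity_bound, float32_max, rel_tol=1e-7)

# Convert the pole-angle bound from radians to degrees.
angle_bound_deg = math.degrees(angle_bound_rad)

print(velocity_is_unbounded)
print(round(angle_bound_deg, 1))
```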

print('Reward Spec:')
print(env.time_step_spec().reward)
Reward Spec:
ArraySpec(shape=(), dtype=dtype('float32'), name='reward')

The action_spec() method returns the shape, data types, and allowed values of valid actions.

print('Action Spec:')
print(env.action_spec())
Action Spec:
BoundedArraySpec(shape=(), dtype=dtype('int64'), name='action', minimum=0, maximum=1)

In the Cartpole environment:

  • observation is an array of 4 floats:
    • the position and velocity of the cart
    • the angular position and velocity of the pole
  • reward is a scalar float value
  • action is a scalar integer with only two possible values:
    • 0 — "move left"
    • 1 — "move right"

time_step = env.reset()
print('Time step:')
print(time_step)

action = np.array(1, dtype=np.int32)

next_time_step = env.step(action)
print('Next time step:')
print(next_time_step)
Time step:
TimeStep(step_type=array(0, dtype=int32), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([-0.03813788,  0.01544253, -0.04858649,  0.02486702], dtype=float32))
Next time step:
TimeStep(step_type=array(1, dtype=int32), reward=array(1., dtype=float32), discount=array(1., dtype=float32), observation=array([-0.03782903,  0.21122636, -0.04808915, -0.28274098], dtype=float32))
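
In the output above, step_type changes from 0 to 1 and the reward from 0 to 1. A minimal sketch of that structure follows, using a plain namedtuple rather than the TF-Agents class and assuming the step-type encoding 0 = first, 1 = mid, 2 = last. ToyTimeStep and is_last are hypothetical names for this sketch.

```python
from collections import namedtuple

# Hypothetical stand-in for the TimeStep tuple; not the TF-Agents class.
ToyTimeStep = namedtuple(
    "ToyTimeStep", ["step_type", "reward", "discount", "observation"])

FIRST, MID, LAST = 0, 1, 2  # step-type encoding assumed for this sketch


def is_last(ts):
  """True when the time step ends an episode."""
  return ts.step_type == LAST


# Values copied (rounded) from the printed time steps above.
first = ToyTimeStep(FIRST, 0.0, 1.0, [-0.038, 0.015, -0.049, 0.025])
second = ToyTimeStep(MID, 1.0, 1.0, [-0.038, 0.211, -0.048, -0.283])

# Return accumulated so far: the reset step contributes 0, the next step +1.
partial_return = first.reward + second.reward
print(partial_return, is_last(second))
```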

Usually two environments are instantiated: one for training and one for evaluation.

train_py_env = suite_gym.load(env_name)
eval_py_env = suite_gym.load(env_name)

The CartPole environment, like most environments, is written in pure Python. It is converted to a TensorFlow environment using the TFPyEnvironment wrapper.

The original environment's API uses NumPy arrays. The TFPyEnvironment wrapper converts these to Tensors so that the environment is compatible with TensorFlow agents and policies.

train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)

Agent

The algorithm used to solve an RL problem is represented by an Agent. TF-Agents provides standard implementations of a variety of Agents, including DQN, which this tutorial uses.

The DQN agent can be used in any environment which has a discrete action space.

At the heart of a DQN Agent is a QNetwork, a neural network model that can learn to predict Q-values (expected returns) for all actions, given an observation from the environment.

Use tf_agents.networks.q_network to create a QNetwork, passing in the observation_spec, action_spec, and a tuple describing the number and size of the model's hidden layers.

fc_layer_params = (100,)

q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=fc_layer_params)
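
To make the shape of this computation concrete, the sketch below runs a forward pass through a network of the same size in plain Python: 4 inputs, one hidden layer of 100 units, and one output per action. It is not the QNetwork implementation; the random weights, the ReLU activation, and the helper names (make_layer, dense, q_values) are assumptions made for illustration.

```python
import random

random.seed(0)

NUM_OBS, NUM_HIDDEN, NUM_ACTIONS = 4, 100, 2  # sizes taken from the tutorial


def make_layer(n_in, n_out):
  """Random weights and zero biases for a dense layer (illustrative only)."""
  weights = [[random.uniform(-0.1, 0.1) for _ in range(n_in)]
             for _ in range(n_out)]
  biases = [0.0] * n_out
  return weights, biases


def dense(inputs, layer, relu):
  """Computes weights @ inputs + biases, optionally followed by ReLU."""
  weights, biases = layer
  outputs = [sum(w * x for w, x in zip(row, inputs)) + b
             for row, b in zip(weights, biases)]
  return [max(0.0, o) for o in outputs] if relu else outputs


hidden_layer = make_layer(NUM_OBS, NUM_HIDDEN)
output_layer = make_layer(NUM_HIDDEN, NUM_ACTIONS)


def q_values(observation):
  """Returns one estimated Q-value per action for a single observation."""
  hidden = dense(observation, hidden_layer, relu=True)
  return dense(hidden, output_layer, relu=False)


observation = [-0.038, 0.015, -0.049, 0.025]
qs = q_values(observation)

# Acting greedily means choosing the action with the largest Q-value.
greedy_action = max(range(NUM_ACTIONS), key=lambda a: qs[a])
print(len(qs), greedy_action)
```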

Now use tf_agents.agents.dqn.dqn_agent to instantiate a DqnAgent. In addition to the time_step_spec, action_spec and the QNetwork, the agent constructor also requires an optimizer (in this case, AdamOptimizer), a loss function, and an integer step counter.

optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)

train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()
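
The loss function passed above is applied to the temporal-difference (TD) error. As a rough sketch of the idea (standard one-step Q-learning, not the DqnAgent source), the target for a transition is the reward plus the discounted best Q-value of the next state, and the loss is the squared difference between that target and the current estimate. The function names and the gamma value are assumptions; a DQN agent typically computes the next-state values with a separate target network, which this sketch omits.

```python
def td_target(reward, discount, next_q_values, gamma=1.0):
  """One-step Q-learning target: r + gamma * discount * max_a' Q(s', a').

  `discount` is 0.0 on terminal transitions, which removes the bootstrap term.
  `gamma` is an assumed value for this sketch.
  """
  return reward + gamma * discount * max(next_q_values)


def squared_td_loss(q_value, target):
  """Element-wise squared error between the estimate and the TD target."""
  return (target - q_value) ** 2


# Non-terminal transition: bootstrap from the best next-state value.
target = td_target(reward=1.0, discount=1.0, next_q_values=[2.0, 3.5], gamma=0.99)
loss = squared_td_loss(q_value=4.0, target=target)

# Terminal transition: the target is just the reward.
terminal_target = td_target(reward=1.0, discount=0.0, next_q_values=[2.0, 3.5])
print(target, loss, terminal_target)
```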

Policies

A policy defines the way an agent acts in an environment. Typically, the goal of reinforcement learning is to train the underlying model until the policy produces the desired outcome.

In this tutorial:

  • The desired outcome is keeping the pole balanced upright over the cart.
  • The policy returns an action (left or right) for each time_step observation.

Agents contain two policies:

  • agent.policy — The main policy that is used for evaluation and deployment.
  • agent.collect_policy — A second policy that is used for data collection.

eval_policy = agent.policy
collect_policy = agent.collect_policy
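
The difference between the two policies can be sketched in plain Python. The sketch assumes the usual DQN arrangement, in which the evaluation policy is greedy with respect to the Q-values and the collect policy is epsilon-greedy; the function names, the Q-values, and the epsilon value are illustrative and not taken from the TF-Agents source.

```python
import random


def greedy_action(q_values):
  """Picks the action with the highest estimated Q-value."""
  return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng):
  """With probability epsilon explores uniformly, otherwise acts greedily."""
  if rng.random() < epsilon:
    return rng.randrange(len(q_values))
  return greedy_action(q_values)


rng = random.Random(0)
q_values = [0.2, 0.7]  # made-up Q-values for actions 0 (left) and 1 (right)

eval_choice = greedy_action(q_values)
collect_choices = [epsilon_greedy_action(q_values, 0.1, rng)
                   for _ in range(1000)]

# Fraction of collect steps that deviate from the greedy action.
explore_fraction = (
    sum(c != eval_choice for c in collect_choices) / len(collect_choices))
print(eval_choice, explore_fraction)
```

The occasional random action is what lets the collect policy gather data about actions the current Q-values rate poorly.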

Policies can be created independently of agents. For example, use tf_agents.policies.random_tf_policy to create a policy which will randomly select an action for each time_step.

random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(),
                                                train_env.action_spec())

To get an action from a policy, call the policy.action(time_step) method. The time_step contains the observation from the environment. This method returns a PolicyStep, which is a named tuple with three components:

  • action — the action to be taken (in this case, 0 or 1)
  • state — used for stateful (that is, RNN-based) policies
  • info — auxiliary data, such as log probabilities of actions

example_environment = tf_py_environment.TFPyEnvironment(
    suite_gym.load('CartPole-v0'))
time_step = example_environment.reset()
random_policy.action(time_step)
PolicyStep(action=<tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>, state=(), info=())

Metrics and Evaluation

The most common metric used to evaluate a policy is the average return. The return is the sum of rewards obtained while running a policy in an environment for one episode. The average return is the mean of the returns over several episodes.

The following function computes the average return of a policy, given the policy, environment, and a number of episodes.

def compute_avg_return(environment, policy, num_episodes=10):

  total_return = 0.0
  for _ in range(num_episodes):

    time_step = environment.reset()
    episode_return = 0.0

    while not time_step.is_last():
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return

  avg_return = total_return / num_episodes
  return avg_return.numpy()[0]


# See also the metrics module for standard implementations of different metrics.
# https://github.com/tensorflow/agents/tree/master/tf_agents/metrics

Running this computation on the random_policy shows a baseline performance in the environment.

compute_avg_return(eval_env, random_policy, num_eval_episodes)
21.2

Replay Buffer

The replay buffer keeps track of data collected from the environment. This tutorial uses tf_agents.replay_buffers.tf_uniform_replay_buffer.TFUniformReplayBuffer, as it is the most common.

The constructor requires the specs for the data it will collect. These are available from the agent through its collect_data_spec attribute. The batch size and maximum buffer length are also required.

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_max_length)
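
Conceptually, a uniform replay buffer is a fixed-capacity store that discards the oldest items once it is full and samples stored items with equal probability. The class below is a simplified, pure-Python illustration of that idea, not the TFUniformReplayBuffer implementation; ToyReplayBuffer and its methods are hypothetical names.

```python
import random
from collections import deque


class ToyReplayBuffer:
  """Fixed-capacity buffer with uniform sampling (illustrative only)."""

  def __init__(self, max_length, seed=None):
    # A deque with maxlen drops the oldest item when a new one is appended.
    self._items = deque(maxlen=max_length)
    self._rng = random.Random(seed)

  def add(self, item):
    self._items.append(item)

  def sample(self, batch_size):
    """Draws batch_size items uniformly at random, with replacement."""
    return [self._rng.choice(self._items) for _ in range(batch_size)]

  def __len__(self):
    return len(self._items)


buffer = ToyReplayBuffer(max_length=5, seed=0)
for step in range(8):
  buffer.add(step)  # steps 0..2 are evicted once the buffer is full

batch = buffer.sample(batch_size=4)
print(len(buffer), list(buffer._items), batch)
```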

For most agents, collect_data_spec is a named tuple called Trajectory, containing the specs for observations, actions, rewards, and other items.

agent.collect_data_spec
Trajectory(step_type=TensorSpec(shape=(), dtype=tf.int32, name='step_type'), observation=BoundedTensorSpec(shape=(4,), dtype=tf.float32, name='observation', minimum=array([-4.8000002e+00, -3.4028235e+38, -4.1887903e-01, -3.4028235e+38],
      dtype=float32), maximum=array([4.8000002e+00, 3.4028235e+38, 4.1887903e-01, 3.4028235e+38],
      dtype=float32)), action=BoundedTensorSpec(shape=(), dtype=tf.int64, name='action', minimum=array(0), maximum=array(1)), policy_info=(), next_step_type=TensorSpec(shape=(), dtype=tf.int32, name='step_type'), reward=TensorSpec(shape=(), dtype=tf.float32, name='reward'), discount=BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)))
agent.collect_data_spec._fields
('step_type',
 'observation',
 'action',
 'policy_info',
 'next_step_type',
 'reward',
 'discount')

Data Collection

Now execute the random policy in the environment for a few steps, recording the data in the replay buffer.

def collect_step(environment, policy, buffer):
  time_step = environment.current_time_step()
  action_step = policy.action(time_step)
  next_time_step = environment.step(action_step.action)
  traj = trajectory.from_transition(time_step, action_step, next_time_step)

  # Add trajectory to the replay buffer
  buffer.add_batch(traj)

def collect_data(env, policy, buffer, steps):
  for _ in range(steps):
    collect_step(env, policy, buffer)

collect_data(train_env, random_policy, replay_buffer, initial_collect_steps)

# This loop is so common in RL that we provide standard implementations.
# For more details see the drivers module.
# https://www.tensorflow.org/agents/api_docs/python/tf_agents/drivers

The replay buffer is now a collection of Trajectories.

# For the curious:
# Uncomment to peel one of these off and inspect it.
# next(iter(replay_buffer.as_dataset()))

The agent needs access to the replay buffer. This is provided by creating an iterable tf.data.Dataset pipeline which will feed data to the agent.

Each row of the replay buffer only stores a single observation step. But since the DQN Agent needs both the current and next observation to compute the loss, the dataset pipeline will sample two adjacent rows for each item in the batch (num_steps=2).

This dataset is also optimized by running parallel calls and prefetching data.

# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3, 
    sample_batch_size=batch_size, 
    num_steps=2).prefetch(3)
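
The effect of num_steps=2 can be illustrated without TensorFlow. In the sketch below, each sampled item is a pair of adjacent rows, so a batch has shape [B, 2]; the first row of a pair supplies the current observation and the second the next one. The function name and the integer rows are made up for illustration, and the sketch ignores details such as episode boundaries.

```python
import random


def sample_adjacent_pairs(rows, batch_size, rng):
  """Samples batch_size windows of two consecutive rows."""
  batch = []
  for _ in range(batch_size):
    start = rng.randrange(len(rows) - 1)  # last valid start is len(rows) - 2
    batch.append([rows[start], rows[start + 1]])
  return batch


rng = random.Random(0)
rows = list(range(100))  # stand-ins for 100 stored steps, in collection order

batch = sample_adjacent_pairs(rows, batch_size=64, rng=rng)
print(len(batch), len(batch[0]))
```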


dataset
<PrefetchDataset shapes: (Trajectory(step_type=(64, 2), observation=(64, 2, 4), action=(64, 2), policy_info=(), next_step_type=(64, 2), reward=(64, 2), discount=(64, 2)), BufferInfo(ids=(64, 2), probabilities=(64,))), types: (Trajectory(step_type=tf.int32, observation=tf.float32, action=tf.int64, policy_info=(), next_step_type=tf.int32, reward=tf.float32, discount=tf.float32), BufferInfo(ids=tf.int64, probabilities=tf.float32))>
iterator = iter(dataset)

print(iterator)
<tensorflow.python.data.ops.iterator_ops.OwnedIterator object at 0x7fa591133cf8>

# For the curious:
# Uncomment to see what the dataset iterator is feeding to the agent.
# Compare this representation of replay data 
# to the collection of individual trajectories shown earlier.

# next(iterator)

Training the agent

Two things must happen during the training loop:

  • collect data from the environment
  • use that data to train the agent's neural network(s)

This example also periodically evaluates the policy and prints the current score.

The following will take ~5 minutes to run.

# In a notebook, put %%time on the first line of this cell to time the run.

# (Optional) Optimize by wrapping some of the code in a graph using TF function.
agent.train = common.function(agent.train)

# Reset the train step
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):

  # Collect a few steps using collect_policy and save to the replay buffer.
  collect_data(train_env, agent.collect_policy, replay_buffer, collect_steps_per_iteration)

  # Sample a batch of data from the buffer and update the agent's network.
  experience, unused_info = next(iterator)
  train_loss = agent.train(experience).loss

  step = agent.train_step_counter.numpy()

  if step % log_interval == 0:
    print('step = {0}: loss = {1}'.format(step, train_loss))

  if step % eval_interval == 0:
    avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
    print('step = {0}: Average Return = {1}'.format(step, avg_return))
    returns.append(avg_return)
step = 200: loss = 13.816577911376953
step = 400: loss = 20.659786224365234
step = 600: loss = 53.401100158691406
step = 800: loss = 102.27363586425781
step = 1000: loss = 77.74481201171875
step = 1000: Average Return = 36.900001525878906
step = 1200: loss = 179.69566345214844
step = 1400: loss = 249.6456298828125
step = 1600: loss = 31.39779281616211
step = 1800: loss = 6.143128871917725
step = 2000: loss = 131.95741271972656
step = 2000: Average Return = 44.0
step = 2200: loss = 73.97447204589844
step = 2400: loss = 39.61648178100586
step = 2600: loss = 216.9718017578125
step = 2800: loss = 16.76689910888672
step = 3000: loss = 49.87702941894531
step = 3000: Average Return = 66.4000015258789
step = 3200: loss = 7.337827205657959
step = 3400: loss = 47.15563201904297
step = 3600: loss = 166.861328125
step = 3800: loss = 99.28225708007812
step = 4000: loss = 91.99877166748047
step = 4000: Average Return = 76.30000305175781
step = 4200: loss = 52.10997009277344
step = 4400: loss = 100.31083679199219
step = 4600: loss = 7.679548263549805
step = 4800: loss = 54.49493408203125
step = 5000: loss = 139.12669372558594
step = 5000: Average Return = 47.599998474121094
step = 5200: loss = 8.69581413269043
step = 5400: loss = 54.693870544433594
step = 5600: loss = 5.535274505615234
step = 5800: loss = 61.92593765258789
step = 6000: loss = 89.6006088256836
step = 6000: Average Return = 150.8000030517578
step = 6200: loss = 130.12876892089844
step = 6400: loss = 101.63938903808594
step = 6600: loss = 78.91459655761719
step = 6800: loss = 109.41963958740234
step = 7000: loss = 7.373339653015137
step = 7000: Average Return = 160.60000610351562
step = 7200: loss = 59.23984146118164
step = 7400: loss = 63.5741081237793
step = 7600: loss = 89.69602966308594
step = 7800: loss = 4.7177839279174805
step = 8000: loss = 123.99877166748047
step = 8000: Average Return = 200.0
step = 8200: loss = 13.875931739807129
step = 8400: loss = 39.6651496887207
step = 8600: loss = 429.539794921875
step = 8800: loss = 208.2021942138672
step = 9000: loss = 9.99809455871582
step = 9000: Average Return = 200.0
step = 9200: loss = 150.1873779296875
step = 9400: loss = 214.15692138671875
step = 9600: loss = 13.286837577819824
step = 9800: loss = 323.9583740234375
step = 10000: loss = 17.99197769165039
step = 10000: Average Return = 200.0
step = 10200: loss = 15.2559175491333
step = 10400: loss = 718.7476196289062
step = 10600: loss = 272.7427673339844
step = 10800: loss = 520.397705078125
step = 11000: loss = 434.8645324707031
step = 11000: Average Return = 200.0
step = 11200: loss = 365.37994384765625
step = 11400: loss = 104.24132537841797
step = 11600: loss = 414.3825988769531
step = 11800: loss = 119.94808197021484
step = 12000: loss = 21.695907592773438
step = 12000: Average Return = 200.0
step = 12200: loss = 20.359397888183594
step = 12400: loss = 593.4542236328125
step = 12600: loss = 121.58563232421875
step = 12800: loss = 1297.17041015625
step = 13000: loss = 654.53662109375
step = 13000: Average Return = 200.0
step = 13200: loss = 1112.3994140625
step = 13400: loss = 2061.524658203125
step = 13600: loss = 24.198551177978516
step = 13800: loss = 382.2828369140625
step = 14000: loss = 23.511058807373047
step = 14000: Average Return = 200.0
step = 14200: loss = 34.40627670288086
step = 14400: loss = 35.6805534362793
step = 14600: loss = 39.26678466796875
step = 14800: loss = 30.779754638671875
step = 15000: loss = 58.43966293334961
step = 15000: Average Return = 200.0
step = 15200: loss = 36.629425048828125
step = 15400: loss = 28.734050750732422
step = 15600: loss = 44.572998046875
step = 15800: loss = 1168.993896484375
step = 16000: loss = 55.67586135864258
step = 16000: Average Return = 200.0
step = 16200: loss = 64.40382385253906
step = 16400: loss = 36.047767639160156
step = 16600: loss = 32.93952560424805
step = 16800: loss = 51.61769104003906
step = 17000: loss = 1676.9178466796875
step = 17000: Average Return = 200.0
step = 17200: loss = 41.030418395996094
step = 17400: loss = 90.57270050048828
step = 17600: loss = 43.120521545410156
step = 17800: loss = 1565.3387451171875
step = 18000: loss = 57.31215286254883
step = 18000: Average Return = 200.0
step = 18200: loss = 1056.1688232421875
step = 18400: loss = 57.372440338134766
step = 18600: loss = 54.60246276855469
step = 18800: loss = 1685.3369140625
step = 19000: loss = 688.1283569335938
step = 19000: Average Return = 200.0
step = 19200: loss = 1617.381591796875
step = 19400: loss = 81.62117004394531
step = 19600: loss = 40.715694427490234
step = 19800: loss = 613.3157958984375
step = 20000: loss = 546.9913940429688
step = 20000: Average Return = 200.0

Visualization

Plots

Use matplotlib.pyplot to chart how the policy improved during training.

One episode of CartPole-v0 lasts at most 200 time steps. The environment gives a reward of +1 for each step the pole stays up, so the maximum return for one episode is 200. The chart shows the return increasing towards that maximum as the policy is evaluated during training. (It may be a little unstable and not increase monotonically each time.)

iterations = range(0, num_iterations + 1, eval_interval)
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')
plt.ylim(top=250)
(-0.34000020027160716, 250.0)
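
The x-axis values line up with the recorded returns because the training loop evaluates once before training and then once every eval_interval steps. A quick check of that arithmetic, using the hyperparameter values defined earlier (eval_steps and log_steps are illustrative names):

```python
num_iterations = 20000
log_interval = 200
eval_interval = 1000

# One evaluation before training (step 0) plus one every eval_interval steps.
eval_steps = list(range(0, num_iterations + 1, eval_interval))

# Steps at which the training loss is printed.
log_steps = [s for s in range(1, num_iterations + 1) if s % log_interval == 0]

print(len(eval_steps), len(log_steps))
```

With these values there are 21 evaluation points and 100 loss lines, which matches the training log above.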

Videos

Charts are nice. But more exciting is seeing an agent actually performing a task in an environment.

First, create a function to embed videos in the notebook.

def embed_mp4(filename):
  """Embeds an mp4 file in the notebook."""
  with open(filename, 'rb') as f:
    video = f.read()
  b64 = base64.b64encode(video)
  tag = '''
  <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{0}" type="video/mp4">
  Your browser does not support the video tag.
  </video>'''.format(b64.decode())

  return IPython.display.HTML(tag)
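
The embedding works because a data: URI can carry the file contents inline as base64 text. The sketch below shows the round trip on a few stand-in bytes, without IPython; fake_video is made-up data, not a real MP4 file.

```python
import base64

fake_video = b"\x00\x00\x00\x18ftypmp42"  # stand-in bytes, not a playable file

# Encode to base64 text so the bytes can be placed inside an HTML attribute.
b64_text = base64.b64encode(fake_video).decode()
src = "data:video/mp4;base64,{0}".format(b64_text)

# Decoding the payload recovers the original bytes exactly.
payload = src.split(",", 1)[1]
recovered = base64.b64decode(payload)
print(src.startswith("data:video/mp4;base64,"), recovered == fake_video)
```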

Now iterate through a few episodes of the Cartpole game with the agent. The underlying Python environment (the one "inside" the TensorFlow environment wrapper) provides a render() method, which outputs an image of the environment state. These can be collected into a video.

def create_policy_eval_video(policy, filename, num_episodes=5, fps=30):
  filename = filename + ".mp4"
  with imageio.get_writer(filename, fps=fps) as video:
    for _ in range(num_episodes):
      time_step = eval_env.reset()
      video.append_data(eval_py_env.render())
      while not time_step.is_last():
        action_step = policy.action(time_step)
        time_step = eval_env.step(action_step.action)
        video.append_data(eval_py_env.render())
  return embed_mp4(filename)




create_policy_eval_video(agent.policy, "trained-agent")

For fun, compare the trained agent (above) to an agent moving randomly. (It does not do as well.)

create_policy_eval_video(random_policy, "random-agent")