
SAC minitaur

Copyright 2018 The TF-Agents Authors.


Introduction

This example shows how to train a Soft Actor Critic agent on the Minitaur environment using the TF-Agents library.

If you have worked through the DQN Colab this should feel very familiar. Notable changes include:

  • Changing the agent from DQN to SAC.
  • Training on Minitaur, which is a much more complex environment than CartPole. The Minitaur environment aims to train a quadruped robot to move forward.
  • Not using a random policy to perform the initial data collection.

If you haven't installed the following dependencies, run:

sudo apt-get install -y xvfb ffmpeg
pip install -q 'gym==0.10.11'
pip install -q 'imageio==2.4.0'
pip install -q matplotlib
pip install -q PILLOW
pip install -q --pre tf-agents[reverb]
pip install -q 'pybullet==2.4.2'



ffmpeg is already the newest version (7:3.4.8-0ubuntu0.2).
xvfb is already the newest version (2:1.19.6-1ubuntu4.4).
The following packages were automatically installed and are no longer required:
  dconf-gsettings-backend dconf-service dkms freeglut3 freeglut3-dev
  glib-networking glib-networking-common glib-networking-services
  gsettings-desktop-schemas libcairo-gobject2 libcolord2 libdconf1
  libegl1-mesa libepoxy0 libglu1-mesa libglu1-mesa-dev libgtk-3-0
  libgtk-3-common libice-dev libjansson4 libjson-glib-1.0-0
  libjson-glib-1.0-common libproxy1v5 librest-0.7-0 libsm-dev
  libsoup-gnome2.4-1 libsoup2.4-1 libxi-dev libxmu-dev libxmu-headers
  libxnvctrl0 libxt-dev linux-gcp-headers-5.0.0-1026
  linux-headers-5.0.0-1026-gcp linux-image-5.0.0-1026-gcp
  linux-modules-5.0.0-1026-gcp pkg-config policykit-1-gnome python3-xkit
  screen-resolution-extra xserver-xorg-core-hwe-18.04
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 90 not upgraded.
WARNING: You are using pip version 20.1.1; however, version 20.2 is available.
You should consider upgrading via the '/tmpfs/src/tf_docs_env/bin/python -m pip install --upgrade pip' command.

Setup

First we will import the various tools that we need and make sure TF-V2 behavior is enabled, since it is easier to iterate in eager mode throughout the colab.

 from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import PIL.Image

import tensorflow as tf
tf.compat.v1.enable_v2_behavior()

from tf_agents.agents.ddpg import critic_network
from tf_agents.agents.sac import sac_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_pybullet
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import actor_distribution_network
from tf_agents.networks import normal_projection_network
from tf_agents.policies import greedy_policy
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.utils import common

 

Hyperparameters

 env_name = "MinitaurBulletEnv-v0" # @param {type:"string"}

# use "num_iterations = 1e6" for better results,
# 1e5 is just so this doesn't take too long. 
num_iterations = 100000 # @param {type:"integer"}

initial_collect_steps = 10000 # @param {type:"integer"} 
collect_steps_per_iteration = 1 # @param {type:"integer"}
replay_buffer_capacity = 1000000 # @param {type:"integer"}

batch_size = 256 # @param {type:"integer"}

critic_learning_rate = 3e-4 # @param {type:"number"}
actor_learning_rate = 3e-4 # @param {type:"number"}
alpha_learning_rate = 3e-4 # @param {type:"number"}
target_update_tau = 0.005 # @param {type:"number"}
target_update_period = 1 # @param {type:"number"}
gamma = 0.99 # @param {type:"number"}
reward_scale_factor = 1.0 # @param {type:"number"}
gradient_clipping = None # @param

actor_fc_layer_params = (256, 256)
critic_joint_fc_layer_params = (256, 256)

log_interval = 5000 # @param {type:"integer"}

num_eval_episodes = 30 # @param {type:"integer"}
eval_interval = 10000 # @param {type:"integer"}
 

Environment

Environments in RL represent the task or problem that we are trying to solve. Standard environments can be created easily in TF-Agents using suites. Given a string environment name, we have different suites for loading environments from sources such as OpenAI Gym, Atari, DM Control, etc.

Now let's load the Minitaur environment from the Pybullet suite.

 env = suite_pybullet.load(env_name)
env.reset()
PIL.Image.fromarray(env.render())
 
current_dir=/tmpfs/src/tf_docs_env/lib/python3.6/site-packages/pybullet_envs/bullet
urdf_root=/tmpfs/src/tf_docs_env/lib/python3.6/site-packages/pybullet_data
options= 

/tmpfs/src/tf_docs_env/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))

(Rendered frame of the Minitaur environment.)

In this environment the goal for the agent is to train a policy that will control the Minitaur robot and have it move forward as fast as possible. Episodes last 1000 steps and the return is the sum of rewards over the episode.

Let's look at the information the environment provides as an observation, which the policy will use to generate actions.

 print('Observation Spec:')
print(env.time_step_spec().observation)
print('Action Spec:')
print(env.action_spec())
 
Observation Spec:
BoundedArraySpec(shape=(28,), dtype=dtype('float32'), name='observation', minimum=[  -3.1515927   -3.1515927   -3.1515927   -3.1515927   -3.1515927
   -3.1515927   -3.1515927   -3.1515927 -167.72488   -167.72488
 -167.72488   -167.72488   -167.72488   -167.72488   -167.72488
 -167.72488     -5.71        -5.71        -5.71        -5.71
   -5.71        -5.71        -5.71        -5.71        -1.01
   -1.01        -1.01        -1.01     ], maximum=[  3.1515927   3.1515927   3.1515927   3.1515927   3.1515927   3.1515927
   3.1515927   3.1515927 167.72488   167.72488   167.72488   167.72488
 167.72488   167.72488   167.72488   167.72488     5.71        5.71
   5.71        5.71        5.71        5.71        5.71        5.71
   1.01        1.01        1.01        1.01     ])
Action Spec:
BoundedArraySpec(shape=(8,), dtype=dtype('float32'), name='action', minimum=-1.0, maximum=1.0)

As we can see, the observation is fairly complex. We receive 28 values representing the angles, velocities, and torques for all the motors. In return, the environment expects 8 values for the action, in the range [-1, 1]. These are the desired motor angles.
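To make these specs concrete, here is a small illustration (an addition, not part of the original notebook) that steps the Python environment once with a random action drawn uniformly from the action bounds:

 import numpy as np

# Draw a random action within the spec's bounds (8 desired motor angles in [-1, 1])
# and apply it for a single step; the returned TimeStep carries the next
# observation and the reward for that step.
random_action = np.random.uniform(low=-1.0, high=1.0, size=(8,)).astype(np.float32)
time_step = env.step(random_action)
print('reward:', time_step.reward)
print('observation shape:', time_step.observation.shape)
 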

Usually we create two environments: one for training and one for evaluation. Most environments are written in pure Python, but they can be easily converted to TensorFlow using the TFPyEnvironment wrapper. The original environment's API uses numpy arrays; the TFPyEnvironment converts these to/from Tensors so you can more easily interact with TensorFlow policies and agents.

 train_py_env = suite_pybullet.load(env_name)
eval_py_env = suite_pybullet.load(env_name)

train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)
 
urdf_root=/tmpfs/src/tf_docs_env/lib/python3.6/site-packages/pybullet_data
options= 
urdf_root=/tmpfs/src/tf_docs_env/lib/python3.6/site-packages/pybullet_data
options= 

Agent

To create an SAC Agent, we first need to create the networks that it will train. SAC is an actor-critic agent, so we will need two networks.

The critic will give us value estimates for Q(s,a). That is, it will receive as input an observation and an action, and it will give us an estimate of how good that action was for the given state.

 observation_spec = train_env.observation_spec()
action_spec = train_env.action_spec()
critic_net = critic_network.CriticNetwork(
    (observation_spec, action_spec),
    observation_fc_layer_params=None,
    action_fc_layer_params=None,
    joint_fc_layer_params=critic_joint_fc_layer_params)
 
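As a quick sanity check (this snippet is an addition and is not part of the original notebook), the freshly built critic network can be called directly: given a batched observation and a batched action it returns one (still untrained) Q-value estimate per batch element.

 # Illustrative only: query critic_net for Q(s, a) using a dummy all-zeros action.
time_step = train_env.reset()                                 # observation shape (1, 28)
dummy_actions = tf.zeros([1] + action_spec.shape.as_list())   # action shape (1, 8)
q_values, _ = critic_net((time_step.observation, dummy_actions))
print(q_values)  # shape (1,)
 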

We will use this critic to train an actor network which will allow us to generate actions given an observation.

The ActorNetwork will predict parameters of a normal distribution. This distribution will then be sampled, conditioned on the current observation, whenever we need to generate actions.

 def normal_projection_net(action_spec,init_means_output_factor=0.1):
  return normal_projection_network.NormalProjectionNetwork(
      action_spec,
      mean_transform=None,
      state_dependent_std=True,
      init_means_output_factor=init_means_output_factor,
      std_transform=sac_agent.std_clip_transform,
      scale_distribution=True)


actor_net = actor_distribution_network.ActorDistributionNetwork(
    observation_spec,
    action_spec,
    fc_layer_params=actor_fc_layer_params,
    continuous_projection_net=normal_projection_net)
 
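To illustrate the distribution-based output described above (this snippet is an addition and is not part of the original notebook), the actor network maps a batched observation to an action distribution that can then be sampled:

 # Illustrative only: sample an action from the (still untrained) actor network.
time_step = train_env.reset()
action_distribution, _ = actor_net(time_step.observation, time_step.step_type)
sampled_action = action_distribution.sample()
print(sampled_action)  # shape (1, 8), squashed into the [-1, 1] action bounds
 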

With these networks at hand we can now instantiate the agent.

 global_step = tf.compat.v1.train.get_or_create_global_step()
tf_agent = sac_agent.SacAgent(
    train_env.time_step_spec(),
    action_spec,
    actor_network=actor_net,
    critic_network=critic_net,
    actor_optimizer=tf.compat.v1.train.AdamOptimizer(
        learning_rate=actor_learning_rate),
    critic_optimizer=tf.compat.v1.train.AdamOptimizer(
        learning_rate=critic_learning_rate),
    alpha_optimizer=tf.compat.v1.train.AdamOptimizer(
        learning_rate=alpha_learning_rate),
    target_update_tau=target_update_tau,
    target_update_period=target_update_period,
    td_errors_loss_fn=tf.compat.v1.losses.mean_squared_error,
    gamma=gamma,
    reward_scale_factor=reward_scale_factor,
    gradient_clipping=gradient_clipping,
    train_step_counter=global_step)
tf_agent.initialize()
 

Policies

In TF-Agents, policies represent the standard notion of policies in RL: given a time_step, produce an action or a distribution over actions. The main method is policy_step = policy.action(time_step), where policy_step is a named tuple PolicyStep(action, state, info). policy_step.action is the action to be applied to the environment, state represents the state for stateful (RNN) policies, and info may contain auxiliary information such as log probabilities of the actions.

Agents contain two policies: the main policy (agent.policy) and the behavior policy used for data collection (agent.collect_policy). For evaluation/deployment, we take the mean action by wrapping the main policy with GreedyPolicy().

 eval_policy = greedy_policy.GreedyPolicy(tf_agent.policy)
collect_policy = tf_agent.collect_policy
 
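As a quick illustration of the PolicyStep tuple described above (this snippet is an addition and is not part of the original notebook), either policy can be queried with a TimeStep from the TF environment:

 # Illustrative only: get a single action from the greedy evaluation policy.
time_step = eval_env.reset()
policy_step = eval_policy.action(time_step)   # PolicyStep(action, state, info)
print(policy_step.action)                     # action tensor of shape (1, 8)
 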

Metrics and Evaluation

The most common metric used to evaluate a policy is the average return. The return is the sum of rewards obtained while running a policy in an environment for an episode, and we usually average this over a few episodes. We can compute the average return as follows.

 def compute_avg_return(environment, policy, num_episodes=5):

  total_return = 0.0
  for _ in range(num_episodes):

    time_step = environment.reset()
    episode_return = 0.0

    while not time_step.is_last():
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return

  avg_return = total_return / num_episodes
  return avg_return.numpy()[0]


compute_avg_return(eval_env, eval_policy, num_eval_episodes)

# Please also see the metrics module for standard implementations of different
# metrics.
 
-0.022013525

Replay Buffer

In order to keep track of the data collected from the environment, we will use TFUniformReplayBuffer. This replay buffer is constructed using specs describing the tensors that are to be stored, which can be obtained from the agent using tf_agent.collect_data_spec.

 replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=tf_agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_capacity)
 

For most agents, collect_data_spec is a named tuple called Trajectory, containing the observations, actions, rewards, etc.
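To see exactly which fields that Trajectory contains (this snippet is an addition and is not part of the original notebook), the spec can simply be printed:

 print(tf_agent.collect_data_spec)
 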

Data Collection

Now we will create a driver to collect experience to seed the replay buffer with. Drivers provide an easy way to collect n steps or episodes in an environment using a specific policy.

 initial_collect_driver = dynamic_step_driver.DynamicStepDriver(
        train_env,
        collect_policy,
        observers=[replay_buffer.add_batch],
        num_steps=initial_collect_steps)
initial_collect_driver.run()
 
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tf_agents/drivers/dynamic_step_driver.py:203: calling while_loop_v2 (from tensorflow.python.ops.control_flow_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.while_loop(c, b, vars, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.while_loop(c, b, vars))

(TimeStep(step_type=<tf.Tensor: shape=(1,), dtype=int32, numpy=array([1], dtype=int32)>, reward=<tf.Tensor: shape=(1,), dtype=float32, numpy=array([0.00101085], dtype=float32)>, discount=<tf.Tensor: shape=(1,), dtype=float32, numpy=array([1.], dtype=float32)>, observation=<tf.Tensor: shape=(1, 28), dtype=float32, numpy=
 array([[  1.308478  ,   2.166422  ,   1.5081352 ,   2.0260656 ,
           2.1123457 ,   1.114552  ,   1.5866141 ,   1.524472  ,
           6.9441314 ,   6.6945276 ,  -7.403659  , -20.185253  ,
          -4.8489103 ,  -1.2003611 , -19.449749  , -16.223652  ,
           4.2634044 ,   0.371617  ,  -0.92654324,  -3.8810008 ,
          -5.7       ,   3.10348   ,  -2.9569836 ,   3.916052  ,
           0.0551226 ,   0.10631521,  -0.09753982,   0.9880003 ]],
       dtype=float32)>),
 ())

In order to sample data from the replay buffer, we will create a tf.data pipeline which we can feed to the agent for training later. We can specify sample_batch_size to configure the number of items sampled from the replay buffer. We can also optimize the data pipeline using parallel calls and prefetching.

To save space, we only store the current observation in each row of the replay buffer. But since the SAC Agent needs both the current and the next observation to compute the loss, we always sample two adjacent rows for each item in the batch by setting num_steps=2.

 # Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3, sample_batch_size=batch_size, num_steps=2).prefetch(3)

iterator = iter(dataset)
 
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/autograph/operators/control_flow.py:1004: ReplayBuffer.get_next (from tf_agents.replay_buffers.replay_buffer) is deprecated and will be removed in a future version.
Instructions for updating:
Use `as_dataset(..., single_deterministic_pass=False) instead.
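To verify the [B x 2 x ...] shape mentioned in the comment above (this snippet is an addition and is not part of the original notebook), we can pull a single batch from the iterator:

 # Illustrative only: `experience` is a Trajectory whose tensors have a time
# dimension of 2 because of num_steps=2 above.
experience, unused_info = next(iterator)
print(experience.observation.shape)  # (256, 2, 28) = (batch_size, num_steps, obs_dim)
 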

Training the agent

The training loop involves both collecting data from the environment and optimizing the agent's networks. Along the way, we will occasionally evaluate the agent's policy to see how we are doing.

 collect_driver = dynamic_step_driver.DynamicStepDriver(
    train_env,
    collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=collect_steps_per_iteration)
 
 
try:
  %%time
except:
  pass

# (Optional) Optimize by wrapping some of the code in a graph using TF function.
tf_agent.train = common.function(tf_agent.train)
collect_driver.run = common.function(collect_driver.run)

# Reset the train step
tf_agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, eval_policy, num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):

  # Collect a few steps using collect_policy and save to the replay buffer.
  collect_driver.run()

  # Sample a batch of data from the buffer and update the agent's network.
  experience, unused_info = next(iterator)
  train_loss = tf_agent.train(experience)

  step = tf_agent.train_step_counter.numpy()

  if step % log_interval == 0:
    print('step = {0}: loss = {1}'.format(step, train_loss.loss))

  if step % eval_interval == 0:
    avg_return = compute_avg_return(eval_env, eval_policy, num_eval_episodes)
    print('step = {0}: Average Return = {1}'.format(step, avg_return))
    returns.append(avg_return)
 
WARNING:absl:Need to use a loss function that computes losses per sample, ex: replace losses.mean_squared_error with tf.math.squared_difference. Invalid value passed for `per_example_loss`. Expected a tensor tensor with at least rank 1, received: Tensor("critic_loss/add_1:0", shape=(), dtype=float32)
WARNING:absl:Need to use a loss function that computes losses per sample, ex: replace losses.mean_squared_error with tf.math.squared_difference. Invalid value passed for `per_example_loss`. Expected a tensor tensor with at least rank 1, received: Tensor("critic_loss/add_1:0", shape=(), dtype=float32)

step = 5000: loss = -63.16588592529297
step = 10000: loss = -61.471351623535156
step = 10000: Average Return = 0.07441557198762894
step = 15000: loss = -31.185678482055664
step = 20000: loss = -18.064279556274414
step = 20000: Average Return = -0.12959735095500946
step = 25000: loss = -15.05502986907959
step = 30000: loss = -12.023421287536621
step = 30000: Average Return = -1.4209648370742798
step = 35000: loss = -5.994253635406494
step = 40000: loss = -3.944823741912842
step = 40000: Average Return = -0.6664859652519226
step = 45000: loss = 0.3637888431549072
step = 50000: loss = -3.2982077598571777
step = 50000: Average Return = 0.0521695651113987
step = 55000: loss = -2.7744715213775635
step = 60000: loss = 1.7074693441390991
step = 60000: Average Return = -0.3222312033176422
step = 65000: loss = -1.8334136009216309
step = 70000: loss = -1.4784929752349854
step = 70000: Average Return = 0.6373701095581055
step = 75000: loss = 0.48983949422836304
step = 80000: loss = 1.5974589586257935
step = 80000: Average Return = 0.1859637051820755
step = 85000: loss = -5.309885501861572
step = 90000: loss = 0.42465153336524963
step = 90000: Average Return = 0.8508636951446533
step = 95000: loss = -6.7512335777282715
step = 100000: loss = 1.8088481426239014
step = 100000: Average Return = 0.24124357104301453

Visualization

Plots

We can plot average return versus global steps to see the performance of our agent. In Minitaur, the reward function is based on how far the Minitaur walks in 1000 steps, and it penalizes energy expenditure.

 

steps = range(0, num_iterations + 1, eval_interval)
plt.plot(steps, returns)
plt.ylabel('Average Return')
plt.xlabel('Step')
plt.ylim()
 
(-1.5345562636852264, 0.9644551217556)

(Plot of average return versus training step.)

Videos

It is helpful to visualize the performance of an agent by rendering the environment at each step. Before we do that, let us first create a function to embed videos in this colab.

 def embed_mp4(filename):
  """Embeds an mp4 file in the notebook."""
  video = open(filename,'rb').read()
  b64 = base64.b64encode(video)
  tag = '''
  <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{0}" type="video/mp4">
  Your browser does not support the video tag.
  </video>'''.format(b64.decode())

  return IPython.display.HTML(tag)
 

The following code visualizes the agent's policy for a few episodes:

 num_episodes = 3
video_filename = 'sac_minitaur.mp4'
with imageio.get_writer(video_filename, fps=60) as video:
  for _ in range(num_episodes):
    time_step = eval_env.reset()
    video.append_data(eval_py_env.render())
    while not time_step.is_last():
      action_step = tf_agent.policy.action(time_step)
      time_step = eval_env.step(action_step.action)
      video.append_data(eval_py_env.render())

embed_mp4(video_filename)