TF-Agents 0.4 Tutorials: Train a Deep Q Network with TF-Agents (translation and commentary)

Translation: ClassCat Co., Ltd. Sales Information
Created: 04/18/2020 (0.4)

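* This page is a translation of the following TF-Agents document, with supplementary explanations added where appropriate:

Train a Deep Q Network with TF-Agents

* The sample code has been verified to run, with additions and modifications where necessary.
* Feel free to link to this page; we would appreciate a short note to sales-info@classcat.com.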

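Train a Deep Q Network with TF-Agents

Introduction

This example shows how to train a DQN (Deep Q Network) agent on the CartPole environment using the TF-Agents library.

It walks you step by step through all the components of a reinforcement learning (RL) pipeline for training, evaluation, and data collection.

Setup

If you have not installed the following dependencies, run: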

!sudo apt-get install -y xvfb ffmpeg
!pip install 'gym==0.10.11'
!pip install 'imageio==2.4.0'
!pip install PILLOW
!pip install 'pyglet==1.3.2'
!pip install pyvirtualdisplay
!pip install --upgrade tensorflow-probability
!pip install tf-agents

from __future__ import absolute_import, division, print_function

import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import pyvirtualdisplay

import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import q_network
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import trajectory
from tf_agents.utils import common

tf.compat.v1.enable_v2_behavior()

# Set up a virtual display for rendering OpenAI gym environments.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()

tf.version.VERSION

Hyperparameters

num_iterations = 20000 # @param {type:"integer"}

initial_collect_steps = 1000  # @param {type:"integer"} 
collect_steps_per_iteration = 1  # @param {type:"integer"}
replay_buffer_max_length = 100000  # @param {type:"integer"}

batch_size = 64  # @param {type:"integer"}
learning_rate = 1e-3  # @param {type:"number"}
log_interval = 200  # @param {type:"integer"}

num_eval_episodes = 10  # @param {type:"integer"}
eval_interval = 1000  # @param {type:"integer"}

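Environment

In reinforcement learning (RL), an environment represents the task or problem to be solved. In TF-Agents, standard environments can be created using the tf_agents.environments suites. TF-Agents provides suites for loading environments from sources such as OpenAI Gym, Atari, and DM Control.

Load the CartPole environment from the OpenAI Gym suite.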

env_name = 'CartPole-v0'
env = suite_gym.load(env_name)

You can render this environment to see what it looks like. A free-swinging pole is attached to a cart. The goal is to move the cart right or left so that the pole stays upright.

#@test {"skip": true}
env.reset()
PIL.Image.fromarray(env.render())

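The environment.step method takes an action in the environment and returns a TimeStep tuple containing the next observation of the environment and the reward for the action.

The time_step_spec() method returns the specification for the TimeStep tuple. Its observation attribute shows the shape of the observations, their data types, and the ranges of allowed values. The reward attribute shows the same details for the reward.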

print('Observation Spec:')
print(env.time_step_spec().observation)

print('Reward Spec:')
print(env.time_step_spec().reward)
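
To make the TimeStep structure concrete, here is a minimal pure-Python stand-in. The field names (step_type, reward, discount, observation) mirror the TF-Agents TimeStep, but the ToyTimeStep class, the integer step-type constants, and the sample values are illustrative assumptions rather than the library's implementation.

```python
from collections import namedtuple

# Illustrative stand-in only; TF-Agents defines its own TimeStep and StepType.
FIRST, MID, LAST = 0, 1, 2

ToyTimeStep = namedtuple(
    "ToyTimeStep", ["step_type", "reward", "discount", "observation"])


def is_last(time_step):
    """True when the time step ends an episode."""
    return time_step.step_type == LAST


# Hypothetical values: a CartPole observation is an array of 4 floats.
first = ToyTimeStep(FIRST, 0.0, 1.0, [0.01, -0.02, 0.03, 0.04])
final = ToyTimeStep(LAST, 1.0, 0.0, [0.5, 1.2, 0.25, 1.9])
print(is_last(first), is_last(final))
```

The evaluation and video loops later in this tutorial rely on the same idea: they keep stepping until is_last() reports the end of the episode.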

The action_spec() method returns the shape, data types, and allowed values of valid actions.

print('Action Spec:')
print(env.action_spec())

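In the CartPole environment:

The observation is an array of 4 floats:
- the position and velocity of the cart
- the angular position and angular velocity of the pole
The reward is a scalar float value
The action is a scalar integer with only two possible values:
- 0: "move left"
- 1: "move right"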

time_step = env.reset()
print('Time step:')
print(time_step)

action = np.array(1, dtype=np.int32)

next_time_step = env.step(action)
print('Next time step:')
print(next_time_step)

Usually two environments are instantiated: one for training and one for evaluation.

train_py_env = suite_gym.load(env_name)
eval_py_env = suite_gym.load(env_name)

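The CartPole environment, like most environments, is written in pure Python. It is converted to TensorFlow using the TFPyEnvironment wrapper.

The original environment's API uses NumPy arrays. TFPyEnvironment converts these to Tensors to make the environment compatible with TensorFlow agents and policies.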

train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)

Agent

The algorithm used to solve an RL problem is represented by an Agent. TF-Agents provides standard implementations of a variety of agents, including:

DQN (used in this tutorial)
REINFORCE
DDPG
TD3
PPO
SAC

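The DQN agent can be used in any environment that has a discrete action space.

At the heart of a DQN agent is a QNetwork, a neural network model that can learn to predict QValues (expected returns) for all actions, given an observation from the environment.

Use tf_agents.networks.q_network to create a QNetwork, passing in the observation_spec, the action_spec, and a tuple describing the number and size of the model's hidden layers.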

fc_layer_params = (100,)

q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=fc_layer_params)
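
As a rough illustration of how Q-values translate into behavior, the sketch below picks the action with the largest predicted value. The greedy_action helper and the sample Q-values are hypothetical and stand in for the output of a trained network; this is not TF-Agents code.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical network output for one observation: [Q(s, left), Q(s, right)].
q_values = [12.5, 14.0]
print(greedy_action(q_values))  # 1, i.e. "move right"
```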

Now use tf_agents.agents.dqn.dqn_agent to instantiate a DqnAgent. In addition to the time_step_spec, the action_spec, and the QNetwork, the agent constructor also requires an optimizer (in this case, AdamOptimizer), a loss function, and an integer step counter.

optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)

train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()
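
The loss function passed as td_errors_loss_fn is applied to the temporal-difference (TD) error. The following is a minimal sketch of that quantity for a single transition, assuming the standard one-step Q-learning target r + gamma * max Q(s', a'). The function names, the gamma value, and the Q-values are illustrative assumptions; the library's implementation works on batches and differs in detail.

```python
def td_target(reward, discount, next_q_values, gamma):
    """One-step Q-learning target (illustrative, single transition)."""
    # discount is 0.0 at a terminal step, which removes the bootstrap term.
    return reward + gamma * discount * max(next_q_values)


def squared_td_loss(q_value, target):
    """Squared TD error for the action that was actually taken."""
    return (target - q_value) ** 2


target = td_target(reward=1.0, discount=1.0, next_q_values=[2.0, 3.0], gamma=0.5)
loss = squared_td_loss(q_value=2.0, target=target)
print(target, loss)  # 2.5 0.25
```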

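Policies

A policy defines the way an agent acts in an environment. Typically, the goal of reinforcement learning is to train the underlying model until the policy produces the desired outcome.

In this tutorial:

The desired outcome is keeping the pole balanced upright over the cart.
The policy returns an action (left or right) for each time_step observation.

Agents contain two policies:

agent.policy: the main policy, used for evaluation and deployment.
agent.collect_policy: a second policy, used for data collection.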

eval_policy = agent.policy
collect_policy = agent.collect_policy
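
One common way the two policies differ in DQN is that the evaluation policy acts greedily on the Q-values, while the collection policy adds exploration, for example epsilon-greedy. The sketch below assumes that scheme; the helper names, the epsilon value of 0.1, and the Q-values are hypothetical and not taken from the library.

```python
import random


def greedy(q_values):
    """Exploit: pick the action with the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy(q_values, epsilon, rng):
    """Explore uniformly with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy(q_values)


rng = random.Random(0)
q_values = [0.2, 0.9]  # hypothetical Q-values for "left" and "right"
actions = [epsilon_greedy(q_values, 0.1, rng) for _ in range(1000)]
# Fraction of steps on which the greedy action (1) was chosen.
print(sum(a == 1 for a in actions) / len(actions))
```

With epsilon set to 0 the two functions agree, which is why a greedy policy is the natural choice once training is finished.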

Policies can be created independently of agents. For example, use tf_agents.policies.random_tf_policy to create a policy that randomly selects an action for each time_step.

random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(),
                                                train_env.action_spec())

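To get an action from a policy, call the policy.action(time_step) method. The time_step contains the observation from the environment. This method returns a PolicyStep, a named tuple with three components:

action: the action to be taken (in this case, 0 or 1)
state: used for stateful (that is, RNN-based) policies
info: auxiliary data, such as the log probabilities of actions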

example_environment = tf_py_environment.TFPyEnvironment(
    suite_gym.load('CartPole-v0'))

time_step = example_environment.reset()

random_policy.action(time_step)

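Metrics and Evaluation

The most common metric used to evaluate a policy is the average return. The return is the sum of the rewards obtained while running a policy in an environment for one episode. Several episodes are run, and their returns are averaged.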
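
The definition reduces to simple arithmetic. In this small sketch the average_return helper and the three episode lengths are made up for illustration; each step contributes a reward of +1, as in CartPole.

```python
def average_return(episode_rewards):
    """Average of per-episode returns, where a return is the sum of rewards."""
    returns = [sum(rewards) for rewards in episode_rewards]
    return sum(returns) / len(returns)


# Three hypothetical episodes lasting 10, 20 and 30 steps (+1 reward per step).
episodes = [[1.0] * 10, [1.0] * 20, [1.0] * 30]
print(average_return(episodes))  # 20.0
```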

The following function computes the average return of a policy, given the policy, the environment, and a number of episodes.

#@test {"skip": true}
def compute_avg_return(environment, policy, num_episodes=10):

  total_return = 0.0
  for _ in range(num_episodes):

    time_step = environment.reset()
    episode_return = 0.0

    while not time_step.is_last():
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return

  avg_return = total_return / num_episodes
  return avg_return.numpy()[0]


# See also the metrics module for standard implementations of different metrics.
# https://github.com/tensorflow/agents/tree/master/tf_agents/metrics

Running this computation on the random_policy shows the baseline performance in the environment.

compute_avg_return(eval_env, random_policy, num_eval_episodes)

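Replay Buffer

The replay buffer keeps track of the data collected from the environment. This tutorial uses tf_agents.replay_buffers.tf_uniform_replay_buffer.TFUniformReplayBuffer, as it is the most common.

The constructor requires the specs for the data it will be collecting. These are available from the agent through its collect_data_spec attribute. The batch size and the maximum buffer length are also required.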

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_max_length)
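
For intuition, a bounded buffer with uniform sampling can be sketched in a few lines of plain Python. ToyReplayBuffer is a hypothetical class for illustration only; it samples with replacement and stores arbitrary Python objects, whereas the TF-Agents buffer stores tensors matching the data spec.

```python
import random
from collections import deque


class ToyReplayBuffer:
    """Fixed-capacity buffer that samples stored items uniformly at random."""

    def __init__(self, max_length):
        # A deque with maxlen drops the oldest item once it is full.
        self._items = deque(maxlen=max_length)

    def add(self, item):
        self._items.append(item)

    def sample(self, batch_size, rng):
        # Uniform sampling with replacement (an assumption of this sketch).
        return [rng.choice(self._items) for _ in range(batch_size)]

    def __len__(self):
        return len(self._items)


buffer = ToyReplayBuffer(max_length=3)
for step in range(5):
    buffer.add(step)  # steps 0 and 1 are evicted, leaving 2, 3 and 4

batch = buffer.sample(batch_size=4, rng=random.Random(0))
print(len(buffer), batch)
```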

For most agents, collect_data_spec is a named tuple called Trajectory, containing the specs for observations, actions, rewards, and other items.

agent.collect_data_spec

agent.collect_data_spec._fields

Data Collection

Now execute the random policy in the environment for a few steps, recording the data in the replay buffer.

#@test {"skip": true}
def collect_step(environment, policy, buffer):
  time_step = environment.current_time_step()
  action_step = policy.action(time_step)
  next_time_step = environment.step(action_step.action)
  traj = trajectory.from_transition(time_step, action_step, next_time_step)

  # Add trajectory to the replay buffer
  buffer.add_batch(traj)

def collect_data(env, policy, buffer, steps):
  for _ in range(steps):
    collect_step(env, policy, buffer)

collect_data(train_env, random_policy, replay_buffer, steps=100)

# This loop is so common in RL, that we provide standard implementations. 
# For more details see the drivers module.
# https://github.com/tensorflow/agents/blob/master/tf_agents/docs/python/tf_agents/drivers.md

The replay buffer is now a collection of Trajectories.

# For the curious:
# Uncomment to peel one of these off and inspect it.
# next(iter(replay_buffer.as_dataset()))

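The agent needs access to the replay buffer. This is provided by creating an iterable tf.data.Dataset pipeline that feeds data to the agent.

Each row of the replay buffer stores only a single observation step. But since the DQN agent needs both the current and the next observation to compute the loss, the dataset pipeline samples two adjacent rows for each item in the batch (num_steps=2).

The dataset is also optimized by running parallel calls and prefetching data.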

# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3, 
    sample_batch_size=batch_size, 
    num_steps=2).prefetch(3)


dataset
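
The idea behind num_steps=2 can be sketched without TensorFlow: every sampled item is a pair of neighboring rows, so the current and the next observation arrive together. The sample_adjacent_pairs helper and the row labels are hypothetical, and the sketch ignores details such as episode boundaries.

```python
import random


def sample_adjacent_pairs(rows, batch_size, rng):
    """Sample batch_size pairs of neighboring rows (row i, row i + 1)."""
    # The start index stops one short of the end so that i + 1 is valid.
    starts = [rng.randrange(len(rows) - 1) for _ in range(batch_size)]
    return [(rows[i], rows[i + 1]) for i in starts]


rows = ["obs0", "obs1", "obs2", "obs3", "obs4"]
pairs = sample_adjacent_pairs(rows, batch_size=3, rng=random.Random(0))
print(pairs)
```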

iterator = iter(dataset)

print(iterator)

# For the curious:
# Uncomment to see what the dataset iterator is feeding to the agent.
# Compare this representation of replay data 
# to the collection of individual trajectories shown earlier.

# next(iterator)

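Training the agent

Two things must happen during the training loop:

collect data from the environment
use that data to train the agent's neural network(s)

This example also periodically evaluates the policy and prints the current score.

The following takes about 5 minutes to run.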

#@test {"skip": true}
try:
  %%time
except:
  pass

# (Optional) Optimize by wrapping some of the code in a graph using TF function.
agent.train = common.function(agent.train)

# Reset the train step
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
returns = [avg_return]

for _ in range(num_iterations):

  # Collect a few steps using collect_policy and save to the replay buffer.
  for _ in range(collect_steps_per_iteration):
    collect_step(train_env, agent.collect_policy, replay_buffer)

  # Sample a batch of data from the buffer and update the agent's network.
  experience, unused_info = next(iterator)
  train_loss = agent.train(experience).loss

  step = agent.train_step_counter.numpy()

  if step % log_interval == 0:
    print('step = {0}: loss = {1}'.format(step, train_loss))

  if step % eval_interval == 0:
    avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
    print('step = {0}: Average Return = {1}'.format(step, avg_return))
    returns.append(avg_return)

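Visualization

Plots

Use matplotlib.pyplot to chart how the policy improved during training.

One episode of CartPole-v0 lasts at most 200 time steps. The environment gives a reward of +1 for each step the pole stays up, so the maximum return for one episode is 200. The chart shows the return increasing toward that maximum each time it is evaluated during training. (It may be a little unstable and may not increase monotonically every time.)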

#@test {"skip": true}

iterations = range(0, num_iterations + 1, eval_interval)
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')
plt.ylim(top=250)
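
The x-axis values line up with the recorded returns because the training loop evaluates once before training and then once every eval_interval steps. This quick check uses the hyperparameter values from this tutorial; the eval_steps list is introduced here only for illustration.

```python
num_iterations = 20000
eval_interval = 1000

# One evaluation at step 0, then one whenever the step count is a
# multiple of eval_interval.
eval_steps = [0] + [
    step for step in range(1, num_iterations + 1) if step % eval_interval == 0
]
iterations = list(range(0, num_iterations + 1, eval_interval))
print(len(eval_steps), eval_steps == iterations)  # 21 True
```

If num_iterations or eval_interval is changed so that one does not divide the other, the same range expression still matches the evaluation steps, since both stop at the last multiple of eval_interval.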

Videos

Charts are nice, but it is more exciting to watch an agent actually performing a task in an environment.

First, create a function to embed videos in the notebook.

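def embed_mp4(filename):
  """Embeds an mp4 file in the notebook."""
  # Read the video bytes and base64-encode them for an inline data URI.
  with open(filename, 'rb') as f:
    video = f.read()
  b64 = base64.b64encode(video)
  # Assumed markup: a standard HTML5 <video> element with an inline base64 source.
  tag = '''
  <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{0}" type="video/mp4">
  Your browser does not support the video tag.
  </video>'''.format(b64.decode())

  return IPython.display.HTML(tag)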

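Now iterate through a few episodes of the CartPole game with the agent. The underlying Python environment (the one "inside" the TensorFlow environment wrapper) provides a render() method, which outputs an image of the environment state. These images can be collected into a video.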

def create_policy_eval_video(policy, filename, num_episodes=5, fps=30):
  filename = filename + ".mp4"
  with imageio.get_writer(filename, fps=fps) as video:
    for _ in range(num_episodes):
      time_step = eval_env.reset()
      video.append_data(eval_py_env.render())
      while not time_step.is_last():
        action_step = policy.action(time_step)
        time_step = eval_env.step(action_step.action)
        video.append_data(eval_py_env.render())
  return embed_mp4(filename)




create_policy_eval_video(agent.policy, "trained-agent")

For fun, compare the trained agent (above) to an agent moving randomly. (It does not do as well.)

create_policy_eval_video(random_policy, "random-agent")
