TF-Agents 0.4 Tutorials: Policies (Translation/Commentary)

Translation: ClassCat Co., Ltd. Sales Information
Created: 04/19/2020 (0.4)

* This page is a translation of the following TF Agents document, with supplementary explanations added where appropriate:

Policies

* The sample code has been tested, and has been extended or modified where necessary.
* Feel free to link to this page; a short note to sales-info@classcat.com would be appreciated.

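Policies

Introduction

In reinforcement learning terminology, a policy maps an observation from the environment to an action or to a distribution over actions. In TF-Agents, observations from the environment are contained in a named tuple TimeStep('step_type', 'discount', 'reward', 'observation'), and policies map time steps to actions or to distributions over actions. Most policies use timestep.observation, and some policies use timestep.step_type (e.g. to reset the state at the beginning of an episode in stateful policies), but timestep.discount and timestep.reward are usually ignored.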
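
The mapping from a time step to an action can be sketched in plain Python, without TF-Agents. SketchTimeStep and sign_policy below are hypothetical names used only for illustration; the real TimeStep is defined in tf_agents.trajectories.time_step.

```python
import collections

# Hypothetical stand-in for the TF-Agents TimeStep named tuple (illustration only).
SketchTimeStep = collections.namedtuple(
    'SketchTimeStep', ['step_type', 'discount', 'reward', 'observation'])


def sign_policy(time_step):
  """Maps a time step to an action using only the observation.

  Returns +1 if the observation entries sum to a non-negative value,
  otherwise -1. The discount and reward fields are ignored.
  """
  return 1 if sum(time_step.observation) >= 0 else -1


time_step = SketchTimeStep(
    step_type=0, discount=1.0, reward=0.0, observation=[0.5, -0.2])
print(sign_policy(time_step))
```

Like most policies described above, this toy policy reads only the observation field.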

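Policies relate to other components in TF-Agents in the following ways. Most policies have a neural network that computes actions and/or distributions over actions from TimeSteps. An agent can contain one or more policies for different purposes, e.g. a main policy that is trained for deployment and a noisy policy for data collection. Policies can be saved and restored, and can be used independently of the agent for data collection, evaluation, and so on.

Some policies are easier to write in TensorFlow (e.g. those with a neural network), whereas others are easier to write in Python (e.g. those that follow a script of actions). TF-Agents therefore allows both Python and TensorFlow policies. Moreover, a policy written in TensorFlow may have to be used in a Python environment, or vice versa, e.g. a TensorFlow policy is used for training but later deployed in a production Python environment. To make this easier, wrappers are provided for converting between Python and TensorFlow policies.

Another interesting class of policies is the policy wrapper, which modifies a given policy in a certain way, e.g. by adding a particular type of noise, by making a greedy or epsilon-greedy version of a stochastic policy, or by randomly mixing multiple policies.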
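
As a rough illustration of the wrapper idea, the following plain-Python sketch turns a greedy action function into an epsilon-greedy one. It does not use the TF-Agents wrapper classes; make_epsilon_greedy and always_zero are hypothetical names introduced here.

```python
import random


def make_epsilon_greedy(greedy_action_fn, num_actions, epsilon, rng=None):
  """Wraps a greedy action function into an epsilon-greedy one.

  With probability epsilon a uniformly random action index is returned;
  otherwise the wrapped function's action is returned.
  """
  rng = rng or random.Random()

  def wrapped(observation):
    if rng.random() < epsilon:
      return rng.randrange(num_actions)
    return greedy_action_fn(observation)

  return wrapped


def always_zero(observation):
  # Hypothetical base policy: ignores the observation and picks action 0.
  del observation
  return 0


policy = make_epsilon_greedy(
    always_zero, num_actions=3, epsilon=0.2, rng=random.Random(0))
actions = [policy(None) for _ in range(1000)]
# Fraction of calls on which exploration picked something other than action 0.
print(sum(a != 0 for a in actions) / len(actions))
```

The wrapped function has the same call signature as the base function, which is what lets a wrapper be used anywhere the original policy could be used.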

Setup

If you have not installed tf-agents yet, run:

!pip install --upgrade tensorflow-probability
!pip install tf-agents

from __future__ import division
from __future__ import print_function

import abc
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np

from tf_agents.specs import array_spec
from tf_agents.specs import tensor_spec
from tf_agents.networks import network

from tf_agents.policies import py_policy
from tf_agents.policies import random_py_policy
from tf_agents.policies import scripted_py_policy

from tf_agents.policies import tf_policy
from tf_agents.policies import random_tf_policy
from tf_agents.policies import actor_policy
from tf_agents.policies import q_policy
from tf_agents.policies import greedy_policy

from tf_agents.trajectories import time_step as ts

tf.compat.v1.enable_v2_behavior()

Python Policies

The interface for Python policies is defined in policies/py_policy.Base. The main methods are:

class Base(object):

  @abc.abstractmethod
  def __init__(self, time_step_spec, action_spec, policy_state_spec=()):
    self._time_step_spec = time_step_spec
    self._action_spec = action_spec
    self._policy_state_spec = policy_state_spec

  @abc.abstractmethod
  def reset(self, policy_state=()):
    # return initial_policy_state.
    pass

  @abc.abstractmethod
  def action(self, time_step, policy_state=()):
    # return a PolicyStep(action, state, info) named tuple.
    pass

  @abc.abstractmethod
  def distribution(self, time_step, policy_state=()):
    # Not implemented in python, only for TF policies.
    pass

  @abc.abstractmethod
  def update(self, policy):
    # update self to be similar to the input `policy`.
    pass

  @abc.abstractmethod
  def copy(self):
    # return a copy of self.
    pass

  @property
  def time_step_spec(self):
    return self._time_step_spec

  @property
  def action_spec(self):
    return self._action_spec

  @property
  def policy_state_spec(self):
    return self._policy_state_spec

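The most important method is action(time_step), which maps a time_step containing an observation from the environment to a PolicyStep named tuple with the following attributes:

action: The action to be applied to the environment.
state: The state of the policy (e.g. RNN state) to be fed into the next call to action.
info: Optional side information such as action log probabilities.

time_step_spec and action_spec are the specifications for the input time step and the output action. Policies also have a reset function, which is typically used to reset the state of stateful policies. The copy function returns a copy of self, and the update(new_policy) function updates self towards new_policy.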
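
How the state attribute is threaded from one call to the next can be sketched with a toy stateful policy. This is plain Python, not the TF-Agents base class; SketchPolicyStep and CountingPolicy are hypothetical names used only for illustration.

```python
import collections

# Hypothetical stand-in for the PolicyStep named tuple (illustration only).
SketchPolicyStep = collections.namedtuple(
    'SketchPolicyStep', ['action', 'state', 'info'])


class CountingPolicy(object):
  """Toy stateful policy: the action is the number of earlier calls."""

  def get_initial_state(self):
    # The policy state is just an integer step counter.
    return 0

  def action(self, time_step, policy_state=0):
    del time_step  # This toy policy ignores the observation.
    next_state = policy_state + 1
    return SketchPolicyStep(action=policy_state, state=next_state, info=())


policy = CountingPolicy()
state = policy.get_initial_state()
actions = []
for _ in range(3):
  step = policy.action(None, state)
  actions.append(step.action)
  state = step.state  # Feed the returned state into the next call.
print(actions)  # [0, 1, 2]
```

Starting again from get_initial_state() restarts the count, which mirrors what resetting a stateful policy is meant to achieve.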

Now, let us look at a couple of examples of Python policies.

Example 1: Random Python Policy

A simple example of a PyPolicy is RandomPyPolicy, which generates random actions for a given discrete or continuous action_spec. The input time_step is ignored.

action_spec = array_spec.BoundedArraySpec((2,), np.int32, -10, 10)
my_random_py_policy = random_py_policy.RandomPyPolicy(time_step_spec=None,
    action_spec=action_spec)
time_step = None
action_step = my_random_py_policy.action(time_step)
print(action_step)
action_step = my_random_py_policy.action(time_step)
print(action_step)

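Example 2: Scripted Python Policy

A scripted policy plays back a script of actions represented as a list of (num_repeats, action) tuples. Every time the action function is called, it returns the current action from the list until the specified number of repeats has been reached, and then moves on to the next action in the list. The reset method can be called to start executing from the beginning of the list.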

action_spec = array_spec.BoundedArraySpec((2,), np.int32, -10, 10)
action_script = [(1, np.array([5, 2], dtype=np.int32)),
                 (0, np.array([0, 0], dtype=np.int32)), # Setting `num_repeats` to 0 will skip this action.
                 (2, np.array([1, 2], dtype=np.int32)),
                 (1, np.array([3, 4], dtype=np.int32))]

my_scripted_py_policy = scripted_py_policy.ScriptedPyPolicy(
    time_step_spec=None, action_spec=action_spec, action_script=action_script)

policy_state = my_scripted_py_policy.get_initial_state()
time_step = None
print('Executing scripted policy...')
action_step = my_scripted_py_policy.action(time_step, policy_state)
print(action_step)
action_step = my_scripted_py_policy.action(time_step, action_step.state)
print(action_step)
action_step = my_scripted_py_policy.action(time_step, action_step.state)
print(action_step)

print('Resetting my_scripted_py_policy...')
policy_state = my_scripted_py_policy.get_initial_state()
action_step = my_scripted_py_policy.action(time_step, policy_state)
print(action_step)
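
The playback rule for a list of (num_repeats, action) tuples can also be written out as a small plain-Python function. This is a sketch of the logic only, not the ScriptedPyPolicy implementation; scripted_action is a hypothetical helper, and returning None once the script is exhausted is a choice made for this sketch.

```python
def scripted_action(action_script, step_index):
  """Returns the action played at call number step_index (0-based).

  action_script is a list of (num_repeats, action) tuples. Entries with
  num_repeats == 0 are skipped. Returns None once the script is exhausted.
  """
  remaining = step_index
  for num_repeats, action in action_script:
    if remaining < num_repeats:
      return action
    remaining -= num_repeats
  return None


script = [(1, [5, 2]), (0, [0, 0]), (2, [1, 2]), (1, [3, 4])]
print([scripted_action(script, i) for i in range(5)])
# [[5, 2], [1, 2], [1, 2], [3, 4], None]
```

The entry with num_repeats set to 0 never appears in the output, and the entry with num_repeats set to 2 appears twice.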

TensorFlow Policies

TensorFlow policies follow the same interface as Python policies. Let us look at a few examples.

Example 1: Random TF Policy

A RandomTFPolicy can be used to generate random actions according to a given discrete or continuous action_spec. The input time_step is ignored.

action_spec = tensor_spec.BoundedTensorSpec(
    (2,), tf.float32, minimum=-1, maximum=3)
input_tensor_spec = tensor_spec.TensorSpec((2,), tf.float32)
time_step_spec = ts.time_step_spec(input_tensor_spec)

my_random_tf_policy = random_tf_policy.RandomTFPolicy(
    action_spec=action_spec, time_step_spec=time_step_spec)
observation = tf.ones(time_step_spec.observation.shape)
time_step = ts.restart(observation)
action_step = my_random_tf_policy.action(time_step)

print('Action:')
print(action_step.action)

Example 2: Actor Policy

An actor policy can be created using either a network that maps time_steps to actions or a network that maps time_steps to distributions over actions.

Using an action network

Let us define a network as follows:


class ActionNet(network.Network):

  def __init__(self, input_tensor_spec, output_tensor_spec):
    super(ActionNet, self).__init__(
        input_tensor_spec=input_tensor_spec,
        state_spec=(),
        name='ActionNet')
    self._output_tensor_spec = output_tensor_spec
    self._sub_layers = [
        tf.keras.layers.Dense(
            output_tensor_spec.shape.num_elements(), activation=tf.nn.tanh),
    ]

  def call(self, observations, step_type, network_state):
    del step_type

    output = tf.cast(observations, dtype=tf.float32)
    for layer in self._sub_layers:
      output = layer(output)
    actions = tf.reshape(output, [-1] + self._output_tensor_spec.shape.as_list())

    # Scale and shift actions to the correct range if necessary.
    return actions, network_state

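In TensorFlow most network layers are designed for batch operations, so we expect the input time_steps to be batched, and the output of the network is batched as well. The network is also responsible for producing actions in the correct range of the given action_spec. This is conventionally done by using, e.g., a tanh activation in the final layer to produce actions in [-1, 1] and then scaling and shifting them to the range given by the input action_spec (see e.g. tf_agents/agents/ddpg/networks.actor_network()).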
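
The scale-and-shift step can be illustrated on a single scalar in plain Python. The helper scale_to_spec is a hypothetical name, and the sketch only shows the arithmetic; it is not the code used in the DDPG actor network.

```python
import math


def scale_to_spec(raw_output, minimum, maximum):
  """Squashes raw_output with tanh and rescales it to [minimum, maximum]."""
  squashed = math.tanh(raw_output)  # Lies in [-1, 1].
  mean = (maximum + minimum) / 2.0
  half_range = (maximum - minimum) / 2.0
  return mean + half_range * squashed


# Example bounds matching a spec with minimum=-1 and maximum=3.
for raw in (-10.0, 0.0, 10.0):
  print(scale_to_spec(raw, minimum=-1.0, maximum=3.0))
```

A raw output of 0 lands on the midpoint of the range, while large negative or positive raw outputs approach the lower and upper bounds.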

We can now create an actor policy using the ActionNet network defined above.

input_tensor_spec = tensor_spec.TensorSpec((4,), tf.float32)
time_step_spec = ts.time_step_spec(input_tensor_spec)
action_spec = tensor_spec.BoundedTensorSpec((3,),
                                            tf.float32,
                                            minimum=-1,
                                            maximum=1)

action_net = ActionNet(input_tensor_spec, action_spec)

my_actor_policy = actor_policy.ActorPolicy(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    actor_network=action_net)

We can apply it to any batch of time_steps that follows time_step_spec:

batch_size = 2
observations = tf.ones([batch_size] + time_step_spec.observation.shape.as_list())

time_step = ts.restart(observations, batch_size)

action_step = my_actor_policy.action(time_step)
print('Action:')
print(action_step.action)

distribution_step = my_actor_policy.distribution(time_step)
print('Action distribution:')
print(distribution_step.action)

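In the above example, we created the policy using an action network that produces an action tensor. In this case, policy.distribution(time_step) is a deterministic (delta) distribution around the output of policy.action(time_step). One way to produce a stochastic policy is to wrap the actor policy in a policy wrapper that adds noise to the actions. Another way is to create the actor policy using an action distribution network instead of an action network, as shown below.

Using an action distribution network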

class ActionDistributionNet(ActionNet):

  def call(self, observations, step_type, network_state):
    action_means, network_state = super(ActionDistributionNet, self).call(
        observations, step_type, network_state)

    action_std = tf.ones_like(action_means)
    return tfp.distributions.Normal(action_means, action_std), network_state


action_distribution_net = ActionDistributionNet(input_tensor_spec, action_spec)

my_actor_policy = actor_policy.ActorPolicy(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    actor_network=action_distribution_net)

action_step = my_actor_policy.action(time_step)
print('Action:')
print(action_step.action)
distribution_step = my_actor_policy.distribution(time_step)
print('Action distribution:')
print(distribution_step.action)

Note that in the above, actions are clipped to the range of the given action spec, [-1, 1]. This is because the ActorPolicy constructor argument clip is True by default. Setting it to False returns the unclipped actions produced by the network.
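
The effect of clipping on samples from a normal distribution can be sketched in plain Python. sample_clipped_action is a hypothetical helper, and the mean of 0.5 is an arbitrary value chosen for illustration; the sketch is not the ActorPolicy code.

```python
import random


def sample_clipped_action(mean, std, minimum, maximum, rng):
  """Samples from Normal(mean, std) and clips the sample to [minimum, maximum]."""
  sample = rng.gauss(mean, std)
  return max(minimum, min(maximum, sample))


rng = random.Random(0)
samples = [
    sample_clipped_action(0.5, 1.0, -1.0, 1.0, rng) for _ in range(1000)
]
# Every sample lies inside the action bounds after clipping.
print(min(samples), max(samples))
```

With a standard deviation of 1 and bounds of [-1, 1], a noticeable share of samples falls outside the bounds and is moved onto the boundary, so clipped actions pile up at -1 and 1.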

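Stochastic policies can be converted to deterministic policies using, for example, a GreedyPolicy wrapper, which chooses stochastic_policy.distribution().mode() as its action and a deterministic (delta) distribution around this greedy action as its distribution().

Example 3: Q Policy

A Q policy is used in agents such as DQN and is based on a Q network that predicts a Q value for each discrete action. For a given time step, the action distribution of the Q policy is a categorical distribution created using the Q values as logits.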

input_tensor_spec = tensor_spec.TensorSpec((4,), tf.float32)
time_step_spec = ts.time_step_spec(input_tensor_spec)
action_spec = tensor_spec.BoundedTensorSpec((1,),
                                            tf.int32,
                                            minimum=-1,
                                            maximum=1)
num_actions = action_spec.maximum - action_spec.minimum + 1


class QNetwork(network.Network):

  def __init__(self, input_tensor_spec, action_spec, num_actions=num_actions, name=None):
    super(QNetwork, self).__init__(
        input_tensor_spec=input_tensor_spec,
        state_spec=(),
        name=name)
    self._sub_layers = [
        tf.keras.layers.Dense(num_actions),
    ]

  def call(self, inputs, step_type=None, network_state=()):
    del step_type
    inputs = tf.cast(inputs, tf.float32)
    for layer in self._sub_layers:
      inputs = layer(inputs)
    return inputs, network_state


batch_size = 2
observation = tf.ones([batch_size] + time_step_spec.observation.shape.as_list())
time_steps = ts.restart(observation, batch_size=batch_size)

my_q_network = QNetwork(
    input_tensor_spec=input_tensor_spec,
    action_spec=action_spec)
my_q_policy = q_policy.QPolicy(
    time_step_spec, action_spec, q_network=my_q_network)
action_step = my_q_policy.action(time_steps)
distribution_step = my_q_policy.distribution(time_steps)

print('Action:')
print(action_step.action)

print('Action distribution:')
print(distribution_step.action)
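
What "using the Q values as logits" means can be shown with a softmax over a small list of numbers. This is a plain-Python sketch, not the QPolicy implementation; categorical_probs is a hypothetical helper and the Q values are made up for illustration.

```python
import math


def categorical_probs(q_values):
  """Converts Q values, used as logits, into categorical probabilities."""
  max_q = max(q_values)  # Subtract the max for numerical stability.
  exps = [math.exp(q - max_q) for q in q_values]
  total = sum(exps)
  return [e / total for e in exps]


q_values = [1.0, 2.0, 0.5]  # Hypothetical Q values for three actions.
probs = categorical_probs(q_values)
print(probs)
```

Actions with larger Q values receive larger probabilities, but every action keeps a non-zero probability, which is why sampling from this distribution is stochastic.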

Policy Wrappers

A policy wrapper can be used to wrap and modify a given policy, e.g. to add noise. Policy wrappers are subclasses of Policy (Python/TensorFlow) and can therefore be used just like any other policy.

Example: Greedy Policy

A greedy wrapper can be used to wrap any TensorFlow policy that implements distribution(). GreedyPolicy.action() returns wrapped_policy.distribution().mode(), and GreedyPolicy.distribution() is a deterministic (delta) distribution around GreedyPolicy.action():


my_greedy_policy = greedy_policy.GreedyPolicy(my_q_policy)

action_step = my_greedy_policy.action(time_steps)
print('Action:')
print(action_step.action)

distribution_step = my_greedy_policy.distribution(time_steps)
print('Action distribution:')
print(distribution_step.action)
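
For a categorical distribution, the mode is the index of the largest logit, and the delta distribution puts all probability on that index. The plain-Python sketch below illustrates this; greedy_action and delta_probs are hypothetical helpers, not part of GreedyPolicy.

```python
def greedy_action(logits):
  """Returns the index of the largest logit (the categorical mode)."""
  return max(range(len(logits)), key=lambda i: logits[i])


def delta_probs(logits):
  """Returns probabilities that put all mass on the greedy action."""
  best = greedy_action(logits)
  return [1.0 if i == best else 0.0 for i in range(len(logits))]


logits = [1.0, 2.0, 0.5]  # Hypothetical Q values used as logits.
print(greedy_action(logits))  # 1
print(delta_probs(logits))    # [0.0, 1.0, 0.0]
```

Unlike sampling from the categorical distribution, this choice is the same on every call for the same logits.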

End
