Keras 2 : examples : Reinforcement Learning – Deep Deterministic Policy Gradient (DDPG) (translation / commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 07/28/2022 (keras 2.9.0)

* This page is a translation of the following Keras document, with supplementary explanations added where appropriate:

Code examples : Reinforcement Learning – Deep Deterministic Policy Gradient (DDPG) (Author: amifunny)

* The sample code has been checked for operation, and has been extended or modified where necessary.

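Description : Implementing the DDPG algorithm on the Inverted Pendulum problem.

Introduction

Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy algorithm for learning continuous actions.

It combines ideas from DPG (Deterministic Policy Gradient) and DQN (Deep Q-Network). It uses Experience Replay and slow-learning target networks from DQN, and it is based on DPG, which can operate over continuous action spaces.

This tutorial closely follows this paper: Continuous control with deep reinforcement learning

Problem

We are trying to solve the classic Inverted Pendulum control problem. In this setting, we can take only two actions: swing left or swing right.

What makes this problem challenging for Q-learning algorithms is that actions are continuous instead of being discrete. That is, instead of using two discrete actions like -1 or +1, we have to select from infinitely many actions ranging from -2 to +2.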

Quick theory

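Just like the Actor-Critic method, we have two networks:

Actor – It proposes an action given a state.
Critic – It predicts whether the action is good (positive value) or bad (negative value) given a state and an action.

DDPG uses two more techniques not present in the original DQN:

◇ First, it uses two target networks.

Why? Because it adds stability to training. In short, we are learning from estimated targets, and the target networks are updated slowly, which keeps the estimated targets stable.

Conceptually, this is like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better", as opposed to saying "I'm going to re-learn how to play this entire game after every move". See this StackOverflow answer.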
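
The slow target update can be sketched in plain NumPy. This is only a minimal illustration, assuming the soft update rule target = tau * online + (1 - tau) * target; the function name soft_update and the toy weight values are hypothetical and are not part of the original example.

```python
import numpy as np


def soft_update(target_weights, online_weights, tau):
    """Return target weights moved a fraction `tau` toward the online weights."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_weights, online_weights)]


# Toy example: a single "layer" of weights (values chosen for illustration only).
online = [np.array([1.0, 1.0])]
target = [np.array([0.0, 0.0])]
tau = 0.005

for step in range(100):
    target = soft_update(target, online, tau)

# After 100 updates the target has closed only part of the gap:
# each weight equals 1 - (1 - tau) ** 100.
print(target[0])
```

With a small tau the target weights lag well behind the online weights, which is what keeps the bootstrapped targets from shifting abruptly between updates.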

◇ Second, it uses Experience Replay.

We store a list of tuples (state, action, reward, next_state), and instead of learning only from recent experience, we learn by sampling from all of the experience accumulated so far.
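
As a rough sketch of the idea, a bounded replay memory can be written with the standard library alone. The transitions stored here are hypothetical placeholders, and the tutorial's own Buffer class uses NumPy arrays rather than a deque.

```python
import random
from collections import deque

# Bounded buffer: once `maxlen` is reached, the oldest transitions are dropped.
replay = deque(maxlen=5)

# Store hypothetical (state, action, reward, next_state) tuples.
for t in range(8):
    replay.append((t, 0.0, -1.0, t + 1))

# Sample a mini-batch uniformly at random from everything currently stored.
random.seed(0)
batch = random.sample(list(replay), k=3)

print(len(replay))  # 5: only the most recent transitions are kept
print([state for state, _, _, _ in batch])
```

Sampling uniformly from the whole memory mixes old and new transitions in each mini-batch instead of training only on the latest step.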

Now, let's see how it is implemented.

import gym
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt

We use OpenAI Gym to create the environment. We will use the upper_bound parameter to scale our actions later.

problem = "Pendulum-v0"
env = gym.make(problem)

num_states = env.observation_space.shape[0]
print("Size of State Space ->  {}".format(num_states))
num_actions = env.action_space.shape[0]
print("Size of Action Space ->  {}".format(num_actions))

upper_bound = env.action_space.high[0]
lower_bound = env.action_space.low[0]

print("Max Value of Action ->  {}".format(upper_bound))
print("Min Value of Action ->  {}".format(lower_bound))

Size of State Space ->  3
Size of Action Space ->  1
Max Value of Action ->  2.0
Min Value of Action ->  -2.0

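To implement better exploration by the Actor network, we use noisy perturbations, specifically an Ornstein-Uhlenbeck process for generating noise, as described in the paper. It samples noise from a correlated normal distribution.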

class OUActionNoise:
    def __init__(self, mean, std_deviation, theta=0.15, dt=1e-2, x_initial=None):
        self.theta = theta
        self.mean = mean
        self.std_dev = std_deviation
        self.dt = dt
        self.x_initial = x_initial
        self.reset()

    def __call__(self):
        # Formula taken from https://www.wikipedia.org/wiki/Ornstein-Uhlenbeck_process.
        x = (
            self.x_prev
            + self.theta * (self.mean - self.x_prev) * self.dt
            + self.std_dev * np.sqrt(self.dt) * np.random.normal(size=self.mean.shape)
        )
        # Store x into x_prev
        # Makes next noise dependent on current one
        self.x_prev = x
        return x

    def reset(self):
        if self.x_initial is not None:
            self.x_prev = self.x_initial
        else:
            self.x_prev = np.zeros_like(self.mean)
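
To see what "correlated" means here, the following sketch simulates a one-dimensional path with the same update rule as the class above and compares it with independent Gaussian noise. The helper names ou_path and lag1_corr, the seed, and the path length are assumptions made for illustration.

```python
import numpy as np


def ou_path(n_steps, theta=0.15, sigma=0.2, dt=1e-2, mean=0.0, seed=0):
    """Simulate a 1-D Ornstein-Uhlenbeck path, starting from zero."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_steps)
    for t in range(1, n_steps):
        x[t] = (
            x[t - 1]
            + theta * (mean - x[t - 1]) * dt
            + sigma * np.sqrt(dt) * rng.normal()
        )
    return x


def lag1_corr(x):
    """Correlation between consecutive samples."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]


ou = ou_path(1000)
white = np.random.default_rng(0).normal(scale=0.2, size=1000)

print("OU lag-1 correlation:   ", lag1_corr(ou))
print("White lag-1 correlation:", lag1_corr(white))
```

Consecutive OU samples stay close to each other, so the perturbation pushes the action in a similar direction for several steps, whereas independent noise changes direction at every step.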

The Buffer class implements Experience Replay.

[Figure: Algorithm (pseudocode)]

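Critic loss : Mean squared error of y - Q(s, a), where y is the expected return as seen by the target network and Q(s, a) is the action value predicted by the Critic network. y is a moving target that the critic model tries to achieve; we make this target stable by updating the target model slowly.

Actor loss : This is computed using the mean of the value given by the Critic network for the actions taken by the Actor network. We seek to maximize this quantity.

Hence we update the Actor network so that it produces actions that get the maximum predicted value as seen by the Critic, for a given state.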
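
The two losses can be worked through by hand on a tiny batch. This is a sketch with made-up numbers standing in for network outputs; the array names are hypothetical, and only gamma = 0.99 matches the value used later in this tutorial.

```python
import numpy as np

gamma = 0.99

# Hypothetical mini-batch of two transitions (numbers chosen for illustration).
reward = np.array([-1.0, -0.5])
target_q_next = np.array([-10.0, -8.0])  # target critic at (s', target_actor(s'))
q_pred = np.array([-10.0, -9.0])         # critic at (s, a) taken from the buffer
q_actor = np.array([-9.5, -8.5])         # critic at (s, actor(s))

y = reward + gamma * target_q_next        # bootstrapped targets
critic_loss = np.mean((y - q_pred) ** 2)  # mean squared error
actor_loss = -np.mean(q_actor)            # minimizing this maximizes the critic value

print(y, critic_loss, actor_loss)
```

The minus sign in the actor loss is what turns "maximize the critic's value" into a quantity a gradient-descent optimizer can minimize.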

class Buffer:
    def __init__(self, buffer_capacity=100000, batch_size=64):
        # Number of "experiences" to store at max
        self.buffer_capacity = buffer_capacity
        # Num of tuples to train on.
        self.batch_size = batch_size

        # It tells us the number of times record() was called.
        self.buffer_counter = 0

        # Instead of a list of tuples, as the experience replay concept suggests,
        # we use a separate np.array for each tuple element
        self.state_buffer = np.zeros((self.buffer_capacity, num_states))
        self.action_buffer = np.zeros((self.buffer_capacity, num_actions))
        self.reward_buffer = np.zeros((self.buffer_capacity, 1))
        self.next_state_buffer = np.zeros((self.buffer_capacity, num_states))

    # Takes an (s, a, r, s') observation tuple as input
    def record(self, obs_tuple):
        # Set index to zero if buffer_capacity is exceeded,
        # replacing old records
        index = self.buffer_counter % self.buffer_capacity

        self.state_buffer[index] = obs_tuple[0]
        self.action_buffer[index] = obs_tuple[1]
        self.reward_buffer[index] = obs_tuple[2]
        self.next_state_buffer[index] = obs_tuple[3]

        self.buffer_counter += 1

    # Eager execution is turned on by default in TensorFlow 2. Decorating with tf.function allows
    # TensorFlow to build a static graph out of the logic and computations in our function.
    # This provides a large speed up for blocks of code that contain many small TensorFlow operations such as this one.
    @tf.function
    def update(
        self, state_batch, action_batch, reward_batch, next_state_batch,
    ):
        # Training and updating Actor & Critic networks.
        # See Pseudo Code.
        with tf.GradientTape() as tape:
            target_actions = target_actor(next_state_batch, training=True)
            y = reward_batch + gamma * target_critic(
                [next_state_batch, target_actions], training=True
            )
            critic_value = critic_model([state_batch, action_batch], training=True)
            critic_loss = tf.math.reduce_mean(tf.math.square(y - critic_value))

        critic_grad = tape.gradient(critic_loss, critic_model.trainable_variables)
        critic_optimizer.apply_gradients(
            zip(critic_grad, critic_model.trainable_variables)
        )

        with tf.GradientTape() as tape:
            actions = actor_model(state_batch, training=True)
            critic_value = critic_model([state_batch, actions], training=True)
            # Used `-value` as we want to maximize the value given
            # by the critic for our actions
            actor_loss = -tf.math.reduce_mean(critic_value)

        actor_grad = tape.gradient(actor_loss, actor_model.trainable_variables)
        actor_optimizer.apply_gradients(
            zip(actor_grad, actor_model.trainable_variables)
        )

    # We compute the loss and update parameters
    def learn(self):
        # Get sampling range
        record_range = min(self.buffer_counter, self.buffer_capacity)
        # Randomly sample indices
        batch_indices = np.random.choice(record_range, self.batch_size)

        # Convert to tensors
        state_batch = tf.convert_to_tensor(self.state_buffer[batch_indices])
        action_batch = tf.convert_to_tensor(self.action_buffer[batch_indices])
        reward_batch = tf.convert_to_tensor(self.reward_buffer[batch_indices])
        reward_batch = tf.cast(reward_batch, dtype=tf.float32)
        next_state_batch = tf.convert_to_tensor(self.next_state_buffer[batch_indices])

        self.update(state_batch, action_batch, reward_batch, next_state_batch)


# This updates the target parameters slowly,
# based on rate `tau`, which is much less than one.
@tf.function
def update_target(target_weights, weights, tau):
    for (a, b) in zip(target_weights, weights):
        a.assign(b * tau + a * (1 - tau))

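Here we define the Actor and Critic networks. These are basic Dense models with ReLU activation.

Note : The last layer of the Actor needs to be initialized with weights between -0.003 and 0.003. This prevents the output from starting near 1 or -1, where the tanh activation would squash the gradients toward zero.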
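
The saturation effect mentioned in the note can be checked numerically. This is a small sketch using the identity d/dz tanh(z) = 1 - tanh(z)^2; the two pre-activation values are hypothetical and only stand in for "small" and "large" outputs of the final layer.

```python
import numpy as np


def tanh_grad(z):
    """Derivative of tanh: d/dz tanh(z) = 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2


# Hypothetical pre-activations of the output unit.
small = 0.003  # close to zero, as a tiny initialization encourages
large = 5.0    # far from zero, as a large initialization could produce

print(np.tanh(small), tanh_grad(small))  # output near 0, gradient near 1
print(np.tanh(large), tanh_grad(large))  # output near 1, gradient near 0
```

When the output unit starts in the flat region of tanh, almost no gradient flows back through it, so the Actor would learn very slowly at first.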

def get_actor():
    # Initialize weights between -3e-3 and 3e-3
    last_init = tf.random_uniform_initializer(minval=-0.003, maxval=0.003)

    inputs = layers.Input(shape=(num_states,))
    out = layers.Dense(256, activation="relu")(inputs)
    out = layers.Dense(256, activation="relu")(out)
    outputs = layers.Dense(1, activation="tanh", kernel_initializer=last_init)(out)

    # Our upper bound is 2.0 for Pendulum.
    outputs = outputs * upper_bound
    model = tf.keras.Model(inputs, outputs)
    return model


def get_critic():
    # State as input
    state_input = layers.Input(shape=(num_states,))
    state_out = layers.Dense(16, activation="relu")(state_input)
    state_out = layers.Dense(32, activation="relu")(state_out)

    # Action as input
    action_input = layers.Input(shape=(num_actions,))
    action_out = layers.Dense(32, activation="relu")(action_input)

    # Both are passed through separate layers before concatenating
    concat = layers.Concatenate()([state_out, action_out])

    out = layers.Dense(256, activation="relu")(concat)
    out = layers.Dense(256, activation="relu")(out)
    outputs = layers.Dense(1)(out)

    # Outputs a single value for a given state-action pair
    model = tf.keras.Model([state_input, action_input], outputs)

    return model

policy() returns an action sampled from our Actor network plus some noise for exploration.

def policy(state, noise_object):
    sampled_actions = tf.squeeze(actor_model(state))
    noise = noise_object()
    # Adding noise to action
    sampled_actions = sampled_actions.numpy() + noise

    # We make sure action is within bounds
    legal_action = np.clip(sampled_actions, lower_bound, upper_bound)

    return [np.squeeze(legal_action)]
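
The effect of the final clipping step can be seen with concrete numbers. In this sketch the proposed action and the noise value are hypothetical; the bounds are the Pendulum limits printed earlier.

```python
import numpy as np

lower_bound, upper_bound = -2.0, 2.0  # Pendulum action limits printed earlier

# Hypothetical numbers: the actor proposes 1.9 and the noise process adds 0.3.
proposed_action = 1.9
noise = 0.3

noisy_action = proposed_action + noise
legal_action = np.clip(noisy_action, lower_bound, upper_bound)

print(noisy_action, legal_action)
```

Without the clip, exploration noise could push the action outside the range the environment accepts.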

Training hyperparameters

std_dev = 0.2
ou_noise = OUActionNoise(mean=np.zeros(1), std_deviation=float(std_dev) * np.ones(1))

actor_model = get_actor()
critic_model = get_critic()

target_actor = get_actor()
target_critic = get_critic()

# Making the weights equal initially
target_actor.set_weights(actor_model.get_weights())
target_critic.set_weights(critic_model.get_weights())

# Learning rate for actor-critic models
critic_lr = 0.002
actor_lr = 0.001

critic_optimizer = tf.keras.optimizers.Adam(critic_lr)
actor_optimizer = tf.keras.optimizers.Adam(actor_lr)

total_episodes = 100
# Discount factor for future rewards
gamma = 0.99
# Used to update target networks
tau = 0.005

buffer = Buffer(50000, 64)

Now we implement our main training loop and iterate over episodes. At each time step we sample an action using policy(), train with learn(), and update the target networks at rate tau.

# To store reward history of each episode
ep_reward_list = []
# To store average reward history of last few episodes
avg_reward_list = []

# Takes about 4 min to train
for ep in range(total_episodes):

    prev_state = env.reset()
    episodic_reward = 0

    while True:
        # Uncomment this to see the Actor in action
        # But not in a python notebook.
        # env.render()

        tf_prev_state = tf.expand_dims(tf.convert_to_tensor(prev_state), 0)

        action = policy(tf_prev_state, ou_noise)
        # Receive state and reward from environment.
        state, reward, done, info = env.step(action)

        buffer.record((prev_state, action, reward, state))
        episodic_reward += reward

        buffer.learn()
        update_target(target_actor.variables, actor_model.variables, tau)
        update_target(target_critic.variables, critic_model.variables, tau)

        # End this episode when `done` is True
        if done:
            break

        prev_state = state

    ep_reward_list.append(episodic_reward)

    # Mean of last 40 episodes
    avg_reward = np.mean(ep_reward_list[-40:])
    print("Episode * {} * Avg Reward is ==> {}".format(ep, avg_reward))
    avg_reward_list.append(avg_reward)

# Plotting graph
# Episodes versus Avg. Rewards
plt.plot(avg_reward_list)
plt.xlabel("Episode")
plt.ylabel("Avg. Episodic Reward")
plt.show()

Episode * 0 * Avg Reward is ==> -1269.3278950595395
Episode * 1 * Avg Reward is ==> -1528.3008939716287
Episode * 2 * Avg Reward is ==> -1511.1737868279706
Episode * 3 * Avg Reward is ==> -1512.8568141261057
Episode * 4 * Avg Reward is ==> -1386.054573343386
Episode * 5 * Avg Reward is ==> -1411.4818856846339
Episode * 6 * Avg Reward is ==> -1431.6790621961388
Episode * 7 * Avg Reward is ==> -1427.9515009474867
Episode * 8 * Avg Reward is ==> -1392.9313930075857
Episode * 9 * Avg Reward is ==> -1346.6839043846012
Episode * 10 * Avg Reward is ==> -1325.5818224096574
Episode * 11 * Avg Reward is ==> -1271.778361283553
Episode * 12 * Avg Reward is ==> -1194.0784354001732
Episode * 13 * Avg Reward is ==> -1137.1096928093427
Episode * 14 * Avg Reward is ==> -1087.2426176918214
Episode * 15 * Avg Reward is ==> -1043.5265287176114
Episode * 16 * Avg Reward is ==> -990.0857409180443
Episode * 17 * Avg Reward is ==> -949.0661362879348
Episode * 18 * Avg Reward is ==> -906.1744575963231
Episode * 19 * Avg Reward is ==> -914.0098344966382
Episode * 20 * Avg Reward is ==> -886.8905055354011
Episode * 21 * Avg Reward is ==> -859.3416389004793
Episode * 22 * Avg Reward is ==> -827.5405203616622
Episode * 23 * Avg Reward is ==> -798.3875178404127
Episode * 24 * Avg Reward is ==> -771.289491103158
Episode * 25 * Avg Reward is ==> -741.6622445749622
Episode * 26 * Avg Reward is ==> -727.7080867854874
Episode * 27 * Avg Reward is ==> -710.485046117201
Episode * 28 * Avg Reward is ==> -690.3850022530833
Episode * 29 * Avg Reward is ==> -671.3205042911178
Episode * 30 * Avg Reward is ==> -653.4475135842247
Episode * 31 * Avg Reward is ==> -637.0057392119055
Episode * 32 * Avg Reward is ==> -629.2474166794424
Episode * 33 * Avg Reward is ==> -614.4655398230501
Episode * 34 * Avg Reward is ==> -603.3854873345723
Episode * 35 * Avg Reward is ==> -589.86534490467
Episode * 36 * Avg Reward is ==> -577.1806480684269
Episode * 37 * Avg Reward is ==> -565.1365286280546
Episode * 38 * Avg Reward is ==> -550.6647028563134
Episode * 39 * Avg Reward is ==> -540.0095147571197
Episode * 40 * Avg Reward is ==> -517.3861294233157
Episode * 41 * Avg Reward is ==> -478.705352005952
Episode * 42 * Avg Reward is ==> -444.8350788756713
Episode * 43 * Avg Reward is ==> -409.85293165991334
Episode * 44 * Avg Reward is ==> -390.83984710631546
Episode * 45 * Avg Reward is ==> -360.88156865913675
Episode * 46 * Avg Reward is ==> -325.26685315168595
Episode * 47 * Avg Reward is ==> -290.2315644399411
Episode * 48 * Avg Reward is ==> -268.0351126010609
Episode * 49 * Avg Reward is ==> -247.8952699063706
Episode * 50 * Avg Reward is ==> -222.99123461788048
Episode * 51 * Avg Reward is ==> -209.0830401020491
Episode * 52 * Avg Reward is ==> -205.65143423678765
Episode * 53 * Avg Reward is ==> -201.8910585767988
Episode * 54 * Avg Reward is ==> -192.18560466037357
Episode * 55 * Avg Reward is ==> -189.43475813660137
Episode * 56 * Avg Reward is ==> -191.92700535454787
Episode * 57 * Avg Reward is ==> -188.5196218645745
Episode * 58 * Avg Reward is ==> -188.17872234729674
Episode * 59 * Avg Reward is ==> -167.33043921566485
Episode * 60 * Avg Reward is ==> -165.01361185173954
Episode * 61 * Avg Reward is ==> -164.5316658073024
Episode * 62 * Avg Reward is ==> -164.4025677076815
Episode * 63 * Avg Reward is ==> -167.27842005634784
Episode * 64 * Avg Reward is ==> -167.12049955654845
Episode * 65 * Avg Reward is ==> -170.02761731078783
Episode * 66 * Avg Reward is ==> -167.56039601863873
Episode * 67 * Avg Reward is ==> -164.60482495249738
Episode * 68 * Avg Reward is ==> -167.45278232469394
Episode * 69 * Avg Reward is ==> -167.42407364484592
Episode * 70 * Avg Reward is ==> -167.57794933965346
Episode * 71 * Avg Reward is ==> -170.6408611483338
Episode * 72 * Avg Reward is ==> -163.96954092530822
Episode * 73 * Avg Reward is ==> -160.82007525469245
Episode * 74 * Avg Reward is ==> -158.38239222565778
Episode * 75 * Avg Reward is ==> -158.3554729720654
Episode * 76 * Avg Reward is ==> -158.51036948298994
Episode * 77 * Avg Reward is ==> -158.68906473090686
Episode * 78 * Avg Reward is ==> -164.60260866654318
Episode * 79 * Avg Reward is ==> -161.5493472156026
Episode * 80 * Avg Reward is ==> -152.48077012719403
Episode * 81 * Avg Reward is ==> -149.52532010375975
Episode * 82 * Avg Reward is ==> -149.61942419730423
Episode * 83 * Avg Reward is ==> -149.82443455067468
Episode * 84 * Avg Reward is ==> -149.80009937226978
Episode * 85 * Avg Reward is ==> -144.51659331262107
Episode * 86 * Avg Reward is ==> -150.7545561142967
Episode * 87 * Avg Reward is ==> -153.84772667131307
Episode * 88 * Avg Reward is ==> -151.35200443047225
Episode * 89 * Avg Reward is ==> -148.30392250041828
Episode * 90 * Avg Reward is ==> -151.33886235855053
Episode * 91 * Avg Reward is ==> -151.153096135589
Episode * 92 * Avg Reward is ==> -151.19626034791332
Episode * 93 * Avg Reward is ==> -151.15870791946685
Episode * 94 * Avg Reward is ==> -154.2673372216281
Episode * 95 * Avg Reward is ==> -150.40737651480134
Episode * 96 * Avg Reward is ==> -147.7969116731913
Episode * 97 * Avg Reward is ==> -147.88640802454557
Episode * 98 * Avg Reward is ==> -144.88997165191319
Episode * 99 * Avg Reward is ==> -142.22158276699662

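If training proceeds correctly, the average episodic reward will increase with time.

Feel free to try different learning rates, tau values, and architectures for the Actor and Critic networks.

The Inverted Pendulum problem has low complexity, but DDPG works well on many other problems.

Another great environment to try this on is LunarLanderContinuous-v2, but it will take more episodes to obtain good results.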

# Save the weights
actor_model.save_weights("pendulum_actor.h5")
critic_model.save_weights("pendulum_critic.h5")

target_actor.save_weights("pendulum_target_actor.h5")
target_critic.save_weights("pendulum_target_critic.h5")

Before Training:

After 100 episodes:
