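Keras 2 : examples : Reinforcement Learning – Actor Critic Method (translation / commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 07/27/2022 (keras 2.9.0)

* This page is a translation of the following Keras documentation, with supplementary explanations added where appropriate:

Code examples : Reinforcement Learning – Actor Critic Method (Author: Apoorv Nandan)

* The sample code has been checked to run, but it has been extended or modified where necessary.
* You are welcome to link to this page freely, but we would appreciate a note at sales-info@classcat.com.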

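Keras 2 : examples : Reinforcement Learning – Actor Critic Method

Description : Implement the Actor Critic method in the CartPole environment.

Introduction

This script shows an implementation of the Actor Critic method on the CartPole-V0 environment.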

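Actor Critic Method

As an agent takes actions and moves through an environment, it learns to map the observed state of the environment to two possible outputs:

Recommended action : a probability value for each action in the action space. The part of the agent responsible for this output is called the actor.
Estimated future rewards : the sum of all rewards the agent expects to receive in the future. The part of the agent responsible for this output is called the critic.

The actor and the critic learn to perform their tasks such that the recommended actions from the actor maximize the rewards.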
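
To make the two outputs concrete, here is a minimal sketch in plain Python (no Keras). The probabilities and the value estimate are made-up numbers chosen for illustration, not outputs of the model on this page, and `sample_action` is a hypothetical helper name.

```python
import math
import random

def sample_action(action_probs, rng):
    """Draw an action index according to the actor's probabilities."""
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]

# Hypothetical outputs for one observed state (illustrative values only).
action_probs = [0.7, 0.3]   # actor: one probability per action, sums to 1
critic_value = 12.5         # critic: estimated sum of future rewards

rng = random.Random(0)
action = sample_action(action_probs, rng)

# The log-probability of the chosen action is what the actor's loss uses.
log_prob = math.log(action_probs[action])

print(action, log_prob, critic_value)
```

The actor is sampled from rather than always taking the most probable action, which keeps some exploration in the agent's behaviour.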

CartPole-V0

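A pole is attached to a cart placed on a frictionless track. The agent has to apply force to move the cart. It is rewarded for every time step the pole remains upright. The agent must therefore learn to keep the pole from falling over.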
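
Since a reward is handed out on every step, the episode reward is in effect a count of how long the agent survived. The sketch below illustrates this with a hypothetical list of `(reward, done)` pairs standing in for what an environment step would return; it does not use the real gym environment, and `episode_reward` is an illustrative helper.

```python
def episode_reward(transitions):
    """Sum rewards until the first transition with done=True (inclusive)."""
    total = 0.0
    for reward, done in transitions:
        total += reward
        if done:
            break
    return total

# Hypothetical transitions: reward 1.0 per step, episode ends at the 4th step.
transitions = [(1.0, False), (1.0, False), (1.0, False), (1.0, True), (1.0, False)]
print(episode_reward(transitions))  # 4.0
```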

Setup

import gym
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Configuration parameters for the whole setup
seed = 42
gamma = 0.99  # Discount factor for past rewards
max_steps_per_episode = 10000
env = gym.make("CartPole-v0")  # Create the environment
env.seed(seed)
eps = np.finfo(np.float32).eps.item()  # Smallest number such that 1.0 + eps != 1.0
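
The `eps` constant is used when the training loop normalizes the returns: it is added to the standard deviation so that the division stays defined even if every return in an episode is identical. The following is a plain-Python sketch of that idea; `normalize` is a hypothetical helper, and `EPS` is written out as 2^-23, the float32 machine epsilon.

```python
import math

# float32 machine epsilon, the same value that
# np.finfo(np.float32).eps.item() returns.
EPS = 2.0 ** -23

def normalize(values, eps=EPS):
    """Shift to zero mean and scale to (roughly) unit standard deviation."""
    mean = sum(values) / len(values)
    # Population standard deviation, matching np.std's default.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]

print(normalize([1.0, 1.0, 1.0]))  # std is 0, yet no ZeroDivisionError
print(normalize([1.0, 2.0, 3.0]))
```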

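Implementing the Actor Critic network

This network learns two functions:

Actor : takes the state of the environment as input and returns a probability value for each action in the action space.
Critic : takes the state of the environment as input and returns an estimate of the total future rewards.

In our implementation, they share the initial layer.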
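
As a rough illustration of what "sharing the initial layer" means, the sketch below runs a forward pass in plain Python: the hidden activations are computed once and then fed to two separate heads. The weights are random, the hidden size is shrunk to 8 for readability, and all helper names are hypothetical; this is not the Keras model itself.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
num_inputs, num_hidden, num_actions = 4, 8, 2

# One shared hidden layer, then two separate heads.
w_common, b_common = random_matrix(num_hidden, num_inputs, rng), [0.0] * num_hidden
w_actor, b_actor = random_matrix(num_actions, num_hidden, rng), [0.0] * num_actions
w_critic, b_critic = random_matrix(1, num_hidden, rng), [0.0]

def forward(state):
    common = relu(dense(state, w_common, b_common))          # computed once
    action_probs = softmax(dense(common, w_actor, b_actor))  # actor head
    critic_value = dense(common, w_critic, b_critic)[0]      # critic head
    return action_probs, critic_value

probs, value = forward([0.01, -0.02, 0.03, 0.04])
print(probs, value)
```

Because both heads read the same hidden features, gradients from the actor loss and the critic loss both update the shared layer.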

num_inputs = 4
num_actions = 2
num_hidden = 128

inputs = layers.Input(shape=(num_inputs,))
common = layers.Dense(num_hidden, activation="relu")(inputs)
action = layers.Dense(num_actions, activation="softmax")(common)
critic = layers.Dense(1)(common)

model = keras.Model(inputs=inputs, outputs=[action, critic])

Training

optimizer = keras.optimizers.Adam(learning_rate=0.01)
huber_loss = keras.losses.Huber()
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0

while True:  # Run until solved
    state = env.reset()
    episode_reward = 0
    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            # env.render(); Adding this line would show the attempts
            # of the agent in a pop up window.

            state = tf.convert_to_tensor(state)
            state = tf.expand_dims(state, 0)

            # Predict action probabilities and estimated future rewards
            # from environment state
            action_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sample action from action probability distribution
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))
            action_probs_history.append(tf.math.log(action_probs[0, action]))

            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action)
            rewards_history.append(reward)
            episode_reward += reward

            if done:
                break

        # Update running reward to check condition for solving
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate expected value from rewards
        # - At each timestep what was the total reward received after that timestep
        # - Rewards in the past are discounted by multiplying them with gamma
        # - These are the labels for our critic
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # At this point in history, the critic estimated that we would get a
            # total reward = `value` in the future. We took an action with log probability
            # of `log_prob` and ended up receiving a total reward = `ret`.
            # The actor must be updated so that it predicts an action that leads to
            # high rewards (compared to critic's estimate) with high probability.
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss

            # The critic must be updated so that it predicts a better estimate of
            # the future rewards.
            critic_losses.append(
                huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
            )

        # Backpropagation
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Clear the loss and reward history
        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break
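
The arithmetic inside the loop can be easier to follow in isolation. Below is a plain-Python sketch of the discounted return computation and of the two per-step loss terms. The helper names `discounted_returns` and `huber` are hypothetical, the numbers are illustrative rather than taken from a real run, and `huber` assumes the threshold `delta=1.0`, which is the default of `keras.losses.Huber`.

```python
import math

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every timestep t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()  # back to chronological order
    return returns

def huber(error, delta=1.0):
    """Huber loss for a single error: quadratic near 0, linear beyond delta."""
    a = abs(error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]

# One hypothetical timestep.
log_prob = math.log(0.5)   # log-probability of the action that was taken
value = 0.2                # critic's estimate at that step
ret = 1.0                  # (normalized) return actually observed

advantage = ret - value              # positive: better than the critic expected
actor_loss = -log_prob * advantage   # pushes up the probability of good actions
critic_loss = huber(value - ret)     # pulls the estimate toward the return
print(actor_loss, critic_loss)
```

Appending and reversing once is used here instead of `insert(0, ...)`; the result is the same, but it avoids shifting the list on every step.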

running reward: 8.82 at episode 10
running reward: 23.04 at episode 20
running reward: 28.41 at episode 30
running reward: 53.59 at episode 40
running reward: 53.71 at episode 50
running reward: 77.35 at episode 60
running reward: 74.76 at episode 70
running reward: 57.89 at episode 80
running reward: 46.59 at episode 90
running reward: 43.48 at episode 100
running reward: 63.77 at episode 110
running reward: 111.13 at episode 120
running reward: 142.77 at episode 130
running reward: 127.96 at episode 140
running reward: 113.92 at episode 150
running reward: 128.57 at episode 160
running reward: 139.95 at episode 170
running reward: 154.95 at episode 180
running reward: 171.45 at episode 190
running reward: 171.33 at episode 200
running reward: 177.74 at episode 210
running reward: 184.76 at episode 220
running reward: 190.88 at episode 230
running reward: 154.78 at episode 240
running reward: 114.38 at episode 250
running reward: 107.51 at episode 260
running reward: 128.99 at episode 270
running reward: 157.48 at episode 280
running reward: 174.54 at episode 290
running reward: 184.76 at episode 300
running reward: 190.87 at episode 310
running reward: 194.54 at episode 320
Solved at episode 322!
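
Note that the logged value is not the reward of a single episode but an exponential moving average with weight 0.05, so it lags behind the agent's actual performance. The sketch below estimates that lag under a simplifying assumption: every episode reaches 200 steps, the episode cap of CartPole-v0, and the average starts from 0. `episodes_to_solve` is a hypothetical helper.

```python
def episodes_to_solve(episode_reward, threshold=195.0, alpha=0.05):
    """Count episodes until the exponential moving average exceeds threshold."""
    if episode_reward <= threshold:
        raise ValueError("the average can never exceed the threshold")
    running_reward = 0.0
    episodes = 0
    while running_reward <= threshold:
        running_reward = alpha * episode_reward + (1 - alpha) * running_reward
        episodes += 1
    return episodes

# Assumes every episode reaches 200 steps, the cap of CartPole-v0.
print(episodes_to_solve(200.0))  # 72
```

In other words, even a perfect agent needs dozens of consecutive full-length episodes before the stopping condition `running_reward > 195` is met.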

Visualizations

In early stages of training:

In later stages of training:
