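Keras 2 : examples : Reinforcement Learning – Proximal Policy Optimization (PPO) (translation / commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 08/01/2022 (keras 2.9.0)

* This page is a translation of the following Keras document, with supplementary explanations added where appropriate:

Code examples : Reinforcement Learning – Proximal Policy Optimization (Author: Ilias Chrysovergis)

* The sample code has been verified to run, with additions and modifications made where necessary.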


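Description : Implementation of a Proximal Policy Optimization agent for the CartPole-v0 environment.

Introduction

This code example solves the CartPole-v0 environment using a Proximal Policy Optimization (PPO) agent.

CartPole-v0

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center. The episode also ends after 200 steps, so the highest return that can be obtained is 200.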
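The termination rules described above can be sketched as a small predicate. This is an illustrative helper, not part of the Keras example or of gym: the function name `episode_done` and its arguments are hypothetical, and the thresholds are simply the ones quoted in the text (the constants inside an installed gym environment may differ).

```python
import math

# Thresholds as quoted in the text above (assumed, not read from gym).
MAX_ANGLE_RAD = math.radians(15.0)
MAX_POSITION = 2.4
MAX_STEPS = 200


def episode_done(cart_position, pole_angle_rad, step_count):
    """Return True when any of the three termination rules in the text applies."""
    pole_fell = abs(pole_angle_rad) > MAX_ANGLE_RAD
    cart_left_track = abs(cart_position) > MAX_POSITION
    time_limit_hit = step_count >= MAX_STEPS
    return pole_fell or cart_left_track or time_limit_hit


# With a reward of +1 per step, the return of an episode equals its length,
# so the time limit caps the return at MAX_STEPS.
print(episode_done(0.0, 0.0, 10))                 # upright, centered, early
print(episode_done(0.0, math.radians(20.0), 10))  # pole tilted too far
print(episode_done(0.0, 0.0, 200))                # time limit reached
```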


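Proximal Policy Optimization

PPO is a policy gradient method that can be used for environments with either discrete or continuous action spaces. It trains a stochastic policy in an on-policy way and uses the actor-critic method: the actor maps an observation to an action, and the critic gives the expected reward of the agent for a given observation. First, a set of trajectories is collected for each epoch by sampling from the latest version of the stochastic policy. Then the rewards-to-go and the advantage estimates are computed in order to update the policy and fit the value function. The policy is updated with a stochastic gradient ascent optimizer, while the value function is fitted with a gradient descent algorithm. This procedure is repeated for many epochs until the environment is solved.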
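For reference, the clipped surrogate objective maximized by the PPO-Clip variant is usually written as follows. The symbols are not defined in the text above and are introduced here as assumptions for illustration: $r_t(\theta)$ is the probability ratio between the new and the old policy, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clip ratio (`clip_ratio` in the code).

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},
\qquad
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[
  \min\!\left( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right)
\right]
```

The clipping removes the incentive to move the ratio outside $[1-\epsilon,\, 1+\epsilon]$, which keeps each policy update close to the policy that collected the data.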

Algorithm

Note
This code example uses Keras and Tensorflow v2. It is based on the original PPO paper, the OpenAI Spinning Up docs for PPO, and the OpenAI Spinning Up implementation of PPO using Tensorflow v1.

OpenAI Spinning Up Github – PPO

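Libraries

For this example the following libraries are used:

numpy for n-dimensional arrays
tensorflow and keras for building the deep RL PPO agent
gym for getting everything we need about the environment
scipy.signal for calculating the discounted cumulative sums of vectors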
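The last item refers to a discounted cumulative sum. As a minimal dependency-free sketch of that quantity (the helper name `discounted_cumsum` is hypothetical; the example itself relies on `scipy.signal.lfilter`), each output element is the current value plus `discount` times the next output element:

```python
def discounted_cumsum(values, discount):
    """out[t] = values[t] + discount * values[t+1] + discount**2 * values[t+2] + ..."""
    out = [0.0] * len(values)
    running = 0.0
    # Walk backwards so each step reuses the sum already computed for t + 1.
    for t in reversed(range(len(values))):
        running = values[t] + discount * running
        out[t] = running
    return out


print(discounted_cumsum([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```

Applied to rewards with discount `gamma`, this gives the rewards-to-go.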

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import gym
import scipy.signal
import time

Functions and classes

def discounted_cumulative_sums(x, discount):
    # Discounted cumulative sums of vectors for computing rewards-to-go and advantage estimates
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]


class Buffer:
    # Buffer for storing trajectories
    def __init__(self, observation_dimensions, size, gamma=0.99, lam=0.95):
        # Buffer initialization
        self.observation_buffer = np.zeros(
            (size, observation_dimensions), dtype=np.float32
        )
        self.action_buffer = np.zeros(size, dtype=np.int32)
        self.advantage_buffer = np.zeros(size, dtype=np.float32)
        self.reward_buffer = np.zeros(size, dtype=np.float32)
        self.return_buffer = np.zeros(size, dtype=np.float32)
        self.value_buffer = np.zeros(size, dtype=np.float32)
        self.logprobability_buffer = np.zeros(size, dtype=np.float32)
        self.gamma, self.lam = gamma, lam
        self.pointer, self.trajectory_start_index = 0, 0

    def store(self, observation, action, reward, value, logprobability):
        # Append one step of agent-environment interaction
        self.observation_buffer[self.pointer] = observation
        self.action_buffer[self.pointer] = action
        self.reward_buffer[self.pointer] = reward
        self.value_buffer[self.pointer] = value
        self.logprobability_buffer[self.pointer] = logprobability
        self.pointer += 1

    def finish_trajectory(self, last_value=0):
        # Finish the trajectory by computing advantage estimates and rewards-to-go
        path_slice = slice(self.trajectory_start_index, self.pointer)
        rewards = np.append(self.reward_buffer[path_slice], last_value)
        values = np.append(self.value_buffer[path_slice], last_value)

        deltas = rewards[:-1] + self.gamma * values[1:] - values[:-1]

        self.advantage_buffer[path_slice] = discounted_cumulative_sums(
            deltas, self.gamma * self.lam
        )
        self.return_buffer[path_slice] = discounted_cumulative_sums(
            rewards, self.gamma
        )[:-1]

        self.trajectory_start_index = self.pointer

    def get(self):
        # Get all data of the buffer and normalize the advantages
        self.pointer, self.trajectory_start_index = 0, 0
        advantage_mean, advantage_std = (
            np.mean(self.advantage_buffer),
            np.std(self.advantage_buffer),
        )
        self.advantage_buffer = (self.advantage_buffer - advantage_mean) / advantage_std
        return (
            self.observation_buffer,
            self.action_buffer,
            self.advantage_buffer,
            self.return_buffer,
            self.logprobability_buffer,
        )


def mlp(x, sizes, activation=tf.tanh, output_activation=None):
    # Build a feedforward neural network
    for size in sizes[:-1]:
        x = layers.Dense(units=size, activation=activation)(x)
    return layers.Dense(units=sizes[-1], activation=output_activation)(x)


def logprobabilities(logits, a):
    # Compute the log-probabilities of taking actions a by using the logits (i.e. the output of the actor)
    logprobabilities_all = tf.nn.log_softmax(logits)
    logprobability = tf.reduce_sum(
        tf.one_hot(a, num_actions) * logprobabilities_all, axis=1
    )
    return logprobability


# Sample action from actor
@tf.function
def sample_action(observation):
    logits = actor(observation)
    action = tf.squeeze(tf.random.categorical(logits, 1), axis=1)
    return logits, action


# Train the policy by maximizing the PPO-Clip objective
@tf.function
def train_policy(
    observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
):

    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        ratio = tf.exp(
            logprobabilities(actor(observation_buffer), action_buffer)
            - logprobability_buffer
        )
        min_advantage = tf.where(
            advantage_buffer > 0,
            (1 + clip_ratio) * advantage_buffer,
            (1 - clip_ratio) * advantage_buffer,
        )

        policy_loss = -tf.reduce_mean(
            tf.minimum(ratio * advantage_buffer, min_advantage)
        )
    policy_grads = tape.gradient(policy_loss, actor.trainable_variables)
    policy_optimizer.apply_gradients(zip(policy_grads, actor.trainable_variables))

    kl = tf.reduce_mean(
        logprobability_buffer
        - logprobabilities(actor(observation_buffer), action_buffer)
    )
    kl = tf.reduce_sum(kl)
    return kl


# Train the value function by regression on mean-squared error
@tf.function
def train_value_function(observation_buffer, return_buffer):
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        value_loss = tf.reduce_mean((return_buffer - critic(observation_buffer)) ** 2)
    value_grads = tape.gradient(value_loss, critic.trainable_variables)
    value_optimizer.apply_gradients(zip(value_grads, critic.trainable_variables))
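
`train_policy` builds the clipped term with `tf.where` on the sign of the advantage rather than with an explicit clip of the ratio. The following plain-Python sketch, with hypothetical helper names and no TensorFlow dependency, checks on a few sample values that this form gives the same per-sample objective as the more common `min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)`:

```python
def surrogate_where(ratio, advantage, eps):
    # Form used in train_policy: choose the bound by the sign of the advantage.
    bound = (1 + eps) * advantage if advantage > 0 else (1 - eps) * advantage
    return min(ratio * advantage, bound)


def surrogate_clip(ratio, advantage, eps):
    # Common textbook form: clip the ratio, then take the minimum.
    clipped_ratio = max(1 - eps, min(ratio, 1 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


eps = 0.2
# (ratio, advantage) pairs chosen to cover below, inside and above the clip range.
samples = [
    (0.5, 1.0), (1.0, 1.0), (1.5, 1.0),
    (0.5, -1.0), (1.0, -1.0), (1.5, -1.0),
    (1.3, 0.0),
]
for ratio, advantage in samples:
    a = surrogate_where(ratio, advantage, eps)
    b = surrogate_clip(ratio, advantage, eps)
    assert abs(a - b) < 1e-12
    print(ratio, advantage, a)
```

This is a spot check on sample values, not a proof; the two forms also agree analytically when the cases are worked through by the sign of the advantage.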

Hyperparameters

# Hyperparameters of the PPO algorithm
steps_per_epoch = 4000
epochs = 30
gamma = 0.99
clip_ratio = 0.2
policy_learning_rate = 3e-4
value_function_learning_rate = 1e-3
train_policy_iterations = 80
train_value_iterations = 80
lam = 0.97
target_kl = 0.01
hidden_sizes = (64, 64)

# True if you want to render the environment
render = False
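
Of these hyperparameters, `gamma` and `lam` are the two discount factors used by the advantage estimate in `Buffer.finish_trajectory`: one-step TD residuals are formed from rewards and value estimates and then summed with discount `gamma * lam`. A minimal plain-Python sketch of that computation, using a hypothetical function name and made-up rewards and values:

```python
def gae_advantages(rewards, values, last_value, gamma, lam):
    """Advantage estimates from TD residuals, discounted by gamma * lam."""
    extended = list(values) + [last_value]  # bootstrap value for the final state
    deltas = [
        rewards[t] + gamma * extended[t + 1] - extended[t]
        for t in range(len(rewards))
    ]
    advantages = [0.0] * len(deltas)
    running = 0.0
    # Backward recursion: A[t] = delta[t] + gamma * lam * A[t + 1].
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


# Illustrative numbers only; they are not taken from a real run.
adv = gae_advantages([1.0, 1.0], [0.5, 0.5], last_value=0.0, gamma=1.0, lam=1.0)
print(adv)  # [1.5, 0.5]
```

With `gamma = lam = 1` the estimate reduces to the return minus the value estimate, which is what the printed numbers show (returns 2 and 1, values 0.5 each). Smaller `lam` leans more on the critic and less on the sampled rewards.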

Initializations

# Initialize the environment and get the dimensionality of the
# observation space and the number of possible actions
env = gym.make("CartPole-v0")
observation_dimensions = env.observation_space.shape[0]
num_actions = env.action_space.n

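# Initialize the buffer, passing gamma and lam so the hyperparameters above
# are used instead of the Buffer defaults
buffer = Buffer(observation_dimensions, steps_per_epoch, gamma, lam)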

# Initialize the actor and the critic as keras models
observation_input = keras.Input(shape=(observation_dimensions,), dtype=tf.float32)
logits = mlp(observation_input, list(hidden_sizes) + [num_actions], tf.tanh, None)
actor = keras.Model(inputs=observation_input, outputs=logits)
value = tf.squeeze(
    mlp(observation_input, list(hidden_sizes) + [1], tf.tanh, None), axis=1
)
critic = keras.Model(inputs=observation_input, outputs=value)

# Initialize the policy and the value function optimizers
policy_optimizer = keras.optimizers.Adam(learning_rate=policy_learning_rate)
value_optimizer = keras.optimizers.Adam(learning_rate=value_function_learning_rate)

# Initialize the observation, episode return and episode length
observation, episode_return, episode_length = env.reset(), 0, 0
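
The actor defined above outputs unnormalized logits, one per action, and `logprobabilities` turns them into the log-probability of the chosen action with a log-softmax followed by a one-hot selection. A plain-Python sketch of the same computation for a single observation (hypothetical function name, illustrative logits):

```python
import math


def log_prob_of_action(logits, action):
    """Log-softmax of the logits, evaluated at the chosen action index."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return logits[action] - log_norm


# Two actions with equal logits: each has probability close to 0.5.
logits = [0.0, 0.0]
print(math.exp(log_prob_of_action(logits, 0)))
```

The ratio used in the policy update is then the exponential of the difference between the new and the stored log-probability of the same action.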

Train

# Iterate over the number of epochs
for epoch in range(epochs):
    # Initialize the sum of the returns, lengths and number of episodes for each epoch
    sum_return = 0
    sum_length = 0
    num_episodes = 0

    # Iterate over the steps of each epoch
    for t in range(steps_per_epoch):
        if render:
            env.render()

        # Get the logits, action, and take one step in the environment
        observation = observation.reshape(1, -1)
        logits, action = sample_action(observation)
        observation_new, reward, done, _ = env.step(action[0].numpy())
        episode_return += reward
        episode_length += 1

        # Get the value and log-probability of the action
        value_t = critic(observation)
        logprobability_t = logprobabilities(logits, action)

        # Store obs, act, rew, v_t, logp_pi_t
        buffer.store(observation, action, reward, value_t, logprobability_t)

        # Update the observation
        observation = observation_new

        # Finish trajectory if reached to a terminal state
        terminal = done
        if terminal or (t == steps_per_epoch - 1):
            last_value = 0 if done else critic(observation.reshape(1, -1))
            buffer.finish_trajectory(last_value)
            sum_return += episode_return
            sum_length += episode_length
            num_episodes += 1
            observation, episode_return, episode_length = env.reset(), 0, 0

    # Get values from the buffer
    (
        observation_buffer,
        action_buffer,
        advantage_buffer,
        return_buffer,
        logprobability_buffer,
    ) = buffer.get()

    # Update the policy and implement early stopping using KL divergence
    for _ in range(train_policy_iterations):
        kl = train_policy(
            observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
        )
        if kl > 1.5 * target_kl:
            # Early Stopping
            break

    # Update the value function
    for _ in range(train_value_iterations):
        train_value_function(observation_buffer, return_buffer)

    # Print mean return and length for each epoch
    print(
        f" Epoch: {epoch + 1}. Mean Return: {sum_return / num_episodes}. Mean Length: {sum_length / num_episodes}"
    )

 Epoch: 1. Mean Return: 18.01801801801802. Mean Length: 18.01801801801802
 Epoch: 2. Mean Return: 21.978021978021978. Mean Length: 21.978021978021978
 Epoch: 3. Mean Return: 27.397260273972602. Mean Length: 27.397260273972602
 Epoch: 4. Mean Return: 36.69724770642202. Mean Length: 36.69724770642202
 Epoch: 5. Mean Return: 48.19277108433735. Mean Length: 48.19277108433735
 Epoch: 6. Mean Return: 66.66666666666667. Mean Length: 66.66666666666667
 Epoch: 7. Mean Return: 133.33333333333334. Mean Length: 133.33333333333334
 Epoch: 8. Mean Return: 166.66666666666666. Mean Length: 166.66666666666666
 Epoch: 9. Mean Return: 181.8181818181818. Mean Length: 181.8181818181818
 Epoch: 10. Mean Return: 190.47619047619048. Mean Length: 190.47619047619048
 Epoch: 11. Mean Return: 200.0. Mean Length: 200.0
 Epoch: 12. Mean Return: 190.47619047619048. Mean Length: 190.47619047619048
 Epoch: 13. Mean Return: 190.47619047619048. Mean Length: 190.47619047619048
 Epoch: 14. Mean Return: 181.8181818181818. Mean Length: 181.8181818181818
 Epoch: 15. Mean Return: 181.8181818181818. Mean Length: 181.8181818181818
 Epoch: 16. Mean Return: 190.47619047619048. Mean Length: 190.47619047619048
 Epoch: 17. Mean Return: 190.47619047619048. Mean Length: 190.47619047619048
 Epoch: 18. Mean Return: 190.47619047619048. Mean Length: 190.47619047619048
 Epoch: 19. Mean Return: 200.0. Mean Length: 200.0
 Epoch: 20. Mean Return: 200.0. Mean Length: 200.0
 Epoch: 21. Mean Return: 200.0. Mean Length: 200.0
 Epoch: 22. Mean Return: 200.0. Mean Length: 200.0
 Epoch: 23. Mean Return: 190.47619047619048. Mean Length: 190.47619047619048
 Epoch: 24. Mean Return: 190.47619047619048. Mean Length: 190.47619047619048
 Epoch: 25. Mean Return: 200.0. Mean Length: 200.0
 Epoch: 26. Mean Return: 200.0. Mean Length: 200.0
 Epoch: 27. Mean Return: 200.0. Mean Length: 200.0
 Epoch: 28. Mean Return: 200.0. Mean Length: 200.0
 Epoch: 29. Mean Return: 200.0. Mean Length: 200.0
 Epoch: 30. Mean Return: 200.0. Mean Length: 200.0
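
Because every step yields a reward of +1, each episode's return equals its length, which is why the two columns above coincide. Each epoch collects `steps_per_epoch = 4000` steps, so the printed mean is 4000 divided by the number of episodes in that epoch. The episode counts in this sketch are inferred from the printed values and are not logged by the example itself:

```python
steps_per_epoch = 4000

# Episode counts inferred from the log above (an assumption for illustration).
for num_episodes in (222, 21, 20):
    mean_length = steps_per_epoch / num_episodes
    print(num_episodes, mean_length)
```

A mean of 200.0 therefore corresponds to 20 episodes that all ran to the 200-step limit.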

Visualizations

Before training:

After 8 epochs of training:

After 20 epochs of training:
