TensorFlow 2.0 Alpha : Advanced Tutorials : Distributed Training :- tf.distribute.Strategy with Training Loops (Translation/Commentary)

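Translated by : ClassCat Co., Ltd. Sales Information
Created : 04/13/2019

* This page is a translation, with supplementary explanation where appropriate, of the following page from TF 2.0 Alpha – Advanced Tutorials – Distributed training on the official TensorFlow site: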

tf.distribute.Strategy with Training Loops

* The sample code has been verified to run, but it has been extended or modified where necessary.
* Feel free to link to this page; we would appreciate a note at sales-info@classcat.com if you do.

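Distributed Training :- tf.distribute.Strategy with Training Loops

This tutorial demonstrates how to use tf.distribute.Strategy with a custom training loop.
We train a simple CNN model on the fashion MNIST dataset. The fashion MNIST dataset contains 60000 training images of size 28 x 28 and 10000 test images of size 28 x 28.

We use a custom training loop to train the model because it gives us flexibility and finer control over training. It also makes the model and the training loop easier to debug.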

from __future__ import absolute_import, division, print_function, unicode_literals

# Import TensorFlow
!pip install -q tensorflow==2.0.0-alpha0
import tensorflow as tf

# Helper libraries
import numpy as np
import os

print(tf.__version__)

2.0.0-alpha0

Download the fashion MNIST dataset

fashion_mnist = tf.keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

# Adding a dimension to the array -> new shape == (28, 28, 1)
# We are doing this because the first layer in our model is a convolutional 
# layer and it requires a 4D input (batch_size, height, width, channels).
# batch_size dimension will be added later on.
train_images = train_images[..., None]
test_images = test_images[..., None]

# Getting the images in [0, 1] range.
train_images = train_images / np.float32(255)
test_images = test_images / np.float32(255)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
32768/29515 [=================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 1s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
8192/5148 [===============================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 0s 0us/step

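Create a strategy to distribute the variables and the graph

How does the tf.distribute.MirroredStrategy strategy work?

All the variables and the model graph are replicated on the replicas.
The input is distributed evenly across the replicas.
Each replica calculates the loss and gradients for the input it received.
The gradients are synced across all the replicas by summing them.
After the sync, the same update is applied to the copies of the variables on each replica.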
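
The five steps above can be sketched without TensorFlow. The following is a minimal, framework-free simulation, not how MirroredStrategy is implemented: the one-weight model y = w * x, the squared-error loss, and the helper names grad_sum and mirrored_step are all assumptions made for illustration.

```python
# Framework-free simulation of synchronous data-parallel training.
# Hypothetical model: y = w * x with squared-error loss, one weight copy per replica.

def grad_sum(w, shard):
    """Sum over the shard of d/dw (w*x - y)^2 = 2 * (w*x - y) * x."""
    return sum(2.0 * (w * x - y) * x for x, y in shard)


def mirrored_step(weights, batch, lr):
    """One synchronous step: shard the batch, compute, sum, update."""
    num_replicas = len(weights)
    global_batch_size = len(batch)
    # The input is split evenly across the replicas.
    shards = [batch[i::num_replicas] for i in range(num_replicas)]
    # Each replica computes the gradient for its own shard,
    # scaled by the global batch size.
    local_grads = [grad_sum(w, shard) / global_batch_size
                   for w, shard in zip(weights, shards)]
    # The gradients are synced by summing them across replicas.
    total_grad = sum(local_grads)
    # The same update is applied to every copy of the variable.
    return [w - lr * total_grad for w in weights]


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # data from y = 2x

weights = [0.0, 0.0]  # two replicas, identical initial copies
single = [0.0]        # one replica seeing the whole batch
for _ in range(50):
    weights = mirrored_step(weights, batch, lr=0.05)
    single = mirrored_step(single, batch, lr=0.05)

print(weights, single)
```

Because every replica starts from the same value and receives the same summed gradient, the copies never diverge, and the result matches training on one device with the full batch up to floating-point rounding.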

Note: You can put all the code below inside a single scope. It is split into several code cells here for illustration.

# If the list of devices is not specified in the 
# `tf.distribute.MirroredStrategy` constructor, it will be auto-detected.
strategy = tf.distribute.MirroredStrategy()

WARNING: Logging before flag parsing goes to stderr.
W0405 15:21:07.446863 140252361586432 cross_device_ops.py:1111] Not all devices in `tf.distribute.Strategy` are visible to TensorFlow.

print ('Number of devices: {}'.format(strategy.num_replicas_in_sync))

Number of devices: 1

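Set up the input pipeline

If a model is trained on multiple GPUs, the batch size should be increased accordingly to make effective use of the extra computing power. The learning rate should also be tuned accordingly.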

BUFFER_SIZE = len(train_images)

BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

EPOCHS = 10
train_steps_per_epoch = len(train_images) // BATCH_SIZE
test_steps_per_epoch = len(test_images) // BATCH_SIZE
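
To see how these quantities change with the number of replicas, the arithmetic can be tabulated in plain Python. This is only a sketch: the helper input_config and its base_lr parameter are hypothetical, and multiplying the learning rate by the replica count is one common heuristic (linear scaling), not something this tutorial prescribes.

```python
def input_config(num_examples, batch_size_per_replica, num_replicas, base_lr):
    """Derive global batch size, steps per epoch, and a scaled learning rate."""
    global_batch_size = batch_size_per_replica * num_replicas
    # Floor division: a final partial batch is not counted as a step.
    steps_per_epoch = num_examples // global_batch_size
    uncounted = num_examples - steps_per_epoch * global_batch_size
    return {
        "global_batch_size": global_batch_size,
        "steps_per_epoch": steps_per_epoch,
        "uncounted_examples": uncounted,
        # Assumption: linear scaling heuristic, shown for illustration only.
        "scaled_lr": base_lr * num_replicas,
    }


# 60000 training images, 64 examples per replica, as in this tutorial.
for n in (1, 2, 4):
    print(n, input_config(60000, 64, n, base_lr=0.001))
```

With one replica this gives 937 steps per epoch and leaves 32 examples outside the counted steps; with four replicas it gives 234 steps and leaves 96.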

strategy.experimental_make_numpy_iterator creates an iterator that distributes the data evenly across all the replicas.

This is more efficient than using tf.data.Dataset.from_tensor_slices directly because it avoids recording the training data as a constant in the graph.

If you are not using strategy.experimental_make_numpy_iterator, create the iterator inside strategy.scope like this:

train_dataset = tf.data.Dataset.from_tensor_slices( (train_images, train_labels)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
train_iterator = strategy.make_dataset_iterator(train_dataset)

Note: This API is expected to change in the near future.

with strategy.scope():
  train_iterator = strategy.experimental_make_numpy_iterator(
      (train_images, train_labels), BATCH_SIZE, shuffle=BUFFER_SIZE)

  test_iterator = strategy.experimental_make_numpy_iterator(
      (test_images, test_labels), BATCH_SIZE, shuffle=None)

Model creation

Create a model using tf.keras.Sequential. You can also use the Model Subclassing API to do this.

def create_model():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu'),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Conv2D(64, 3, activation='relu'),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10, activation='softmax')
    ])

  return model

# Create a checkpoint directory to store the checkpoints.
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")

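Define the loss function

Normally, on a single machine with 1 GPU/CPU, the loss is divided by the number of examples in the batch of input.

So, how is the loss calculated when using a tf.distribute.Strategy?

For example, say you have 4 GPUs and a batch size of 64. One batch of input is distributed across the replicas (4 GPUs), so each replica gets an input of size 16.
The model on each replica does a forward pass with its respective input and calculates the loss. Now, instead of dividing the loss by the number of examples in its respective input (16), the loss is divided by the global input size (64).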

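Why is this done?

This is done because after the gradients are calculated on each replica, they are synced across the replicas by summing them. Since each per-replica loss is already scaled by 1/64, the summed gradients equal the gradient of the mean loss over the full batch of 64; dividing by 16 instead would make the summed gradient 4 times too large.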

How do you handle this in TensorFlow?

tf.keras.losses handles this automatically.
If you distribute a custom loss function, do not implement it using tf.reduce_mean (which divides by the local batch size); divide the sum by the global batch size instead : scale_loss = tf.reduce_sum(loss) * (1. / global_batch_size)
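
The 4 GPU / batch size 64 example can be checked numerically in plain Python. This is a sketch with made-up per-example loss values (the list per_example_losses is hypothetical), showing why each replica should divide by the global batch size when the per-replica results are summed.

```python
import math

# Hypothetical per-example losses for one global batch of 64 examples.
per_example_losses = [float(i % 7) for i in range(64)]

num_replicas = 4
global_batch_size = len(per_example_losses)          # 64
local_batch_size = global_batch_size // num_replicas  # 16

# Each replica receives a contiguous shard of 16 examples.
shards = [per_example_losses[r * local_batch_size:(r + 1) * local_batch_size]
          for r in range(num_replicas)]

# Reference value: the mean loss over the whole batch on a single device.
global_mean = sum(per_example_losses) / global_batch_size

# Each replica divides its summed loss by the GLOBAL batch size, then the
# per-replica values are summed (mirroring how gradients are synced).
scaled_by_global = sum(sum(shard) / global_batch_size for shard in shards)

# Each replica divides by its LOCAL batch size instead (a local mean).
scaled_by_local = sum(sum(shard) / local_batch_size for shard in shards)

print(global_mean, scaled_by_global, scaled_by_local)
```

Dividing by the global batch size reproduces the single-device mean, while summing local means overshoots it by a factor equal to the number of replicas.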

with strategy.scope():
  loss_object = tf.keras.losses.SparseCategoricalCrossentropy()

Define the metrics to track loss and accuracy

These metrics track the loss and the accuracy. You can use .result() to get the accumulated statistics at any time.
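
The accumulate, read, and reset pattern behind these metrics can be illustrated with a small stand-in class. This is a plain-Python sketch, not the Keras implementation: the class RunningMean and its update method are hypothetical names, and only result and reset_states mirror method names used in this tutorial.

```python
class RunningMean:
    """Hypothetical stand-in that accumulates values and reports their mean."""

    def __init__(self):
        self.reset_states()

    def reset_states(self):
        # Clear the accumulated statistics, e.g. at the end of an epoch.
        self.total = 0.0
        self.count = 0

    def update(self, value):
        # Fold one more observation into the running totals.
        self.total += value
        self.count += 1

    def result(self):
        # Can be read at any time; returns 0.0 before any update.
        return self.total / self.count if self.count else 0.0


loss_metric = RunningMean()
epoch_means = []
for step_losses in ([0.5, 0.25, 0.75], [0.25, 0.25]):  # two made-up epochs
    for loss in step_losses:
        loss_metric.update(loss)
    epoch_means.append(loss_metric.result())
    loss_metric.reset_states()  # start the next epoch from a clean state

print(epoch_means)
```

Resetting at the end of each epoch is what keeps the reported values per-epoch rather than cumulative over the whole run.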

with strategy.scope():
  train_loss = tf.keras.metrics.Mean(name='train_loss')
  test_loss = tf.keras.metrics.Mean(name='test_loss')

  train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='train_accuracy')
  test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='test_accuracy')

Training loop

# model and optimizer must be created under `strategy.scope`.
with strategy.scope():
  model = create_model()

  optimizer = tf.keras.optimizers.Adam()
  
  checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)

with strategy.scope():
  # Train step
  def train_step(inputs):
    images, labels = inputs

    with tf.GradientTape() as tape:
      predictions = model(images, training=True)
      loss = loss_object(labels, predictions)

    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss(loss)
    train_accuracy(labels, predictions)

  # Test step
  def test_step(inputs):
    images, labels = inputs

    predictions = model(images, training=False)
    t_loss = loss_object(labels, predictions)

    test_loss(t_loss)
    test_accuracy(labels, predictions)

with strategy.scope():
  # `experimental_run` replicates the provided computation and runs it 
  # with the distributed input.
  
  @tf.function
  def distributed_train():
    return strategy.experimental_run(train_step, train_iterator)
  
  @tf.function
  def distributed_test():
    return strategy.experimental_run(test_step, test_iterator)
    
  for epoch in range(EPOCHS):
    # Note: This code is expected to change in the near future.
    
    # TRAIN LOOP
    # Initialize the iterator
    train_iterator.initialize()
    for _ in range(train_steps_per_epoch):
      distributed_train()

    # TEST LOOP
    test_iterator.initialize()
    for _ in range(test_steps_per_epoch):
      distributed_test()
    
    if epoch % 2 == 0:
      checkpoint.save(checkpoint_prefix)

    template = ("Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, "
                "Test Accuracy: {}")
    print (template.format(epoch+1, train_loss.result(), 
                           train_accuracy.result()*100, test_loss.result(), 
                           test_accuracy.result()*100))
    
    train_loss.reset_states()
    test_loss.reset_states()
    train_accuracy.reset_states()
    test_accuracy.reset_states()

Epoch 1, Loss: 0.5225825309753418, Accuracy: 80.9698486328125, Test Loss: 0.432892769575119, Test Accuracy: 84.58534240722656
Epoch 2, Loss: 0.34400758147239685, Accuracy: 87.69676971435547, Test Loss: 0.34699392318725586, Test Accuracy: 87.46995544433594
Epoch 3, Loss: 0.2954360246658325, Accuracy: 89.28929138183594, Test Loss: 0.2975158095359802, Test Accuracy: 89.20272827148438
Epoch 4, Loss: 0.26176968216896057, Accuracy: 90.3832015991211, Test Loss: 0.28567320108413696, Test Accuracy: 89.95392608642578
Epoch 5, Loss: 0.23702645301818848, Accuracy: 91.29869079589844, Test Loss: 0.2787027060985565, Test Accuracy: 89.69351196289062
Epoch 6, Loss: 0.21761010587215424, Accuracy: 92.06243133544922, Test Loss: 0.2878926694393158, Test Accuracy: 89.53324890136719
Epoch 7, Loss: 0.19936144351959229, Accuracy: 92.64107513427734, Test Loss: 0.2678224444389343, Test Accuracy: 90.64503479003906
Epoch 8, Loss: 0.184087872505188, Accuracy: 93.18470001220703, Test Loss: 0.28124910593032837, Test Accuracy: 90.25440979003906
Epoch 9, Loss: 0.1695927232503891, Accuracy: 93.6966323852539, Test Loss: 0.2541474401950836, Test Accuracy: 90.96554565429688
Epoch 10, Loss: 0.15550555288791656, Accuracy: 94.21858215332031, Test Loss: 0.2556819021701813, Test Accuracy: 90.79527282714844

Restore the latest checkpoint and test

A model checkpointed with a tf.distribute.Strategy can be restored with or without a strategy.

eval_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='eval_accuracy')

new_model = create_model()
new_optimizer = tf.keras.optimizers.Adam()

test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(BATCH_SIZE)

@tf.function
def eval_step(images, labels):
  predictions = new_model(images, training=False)
  eval_accuracy(labels, predictions)

checkpoint = tf.train.Checkpoint(optimizer=new_optimizer, model=new_model)
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

for images, labels in test_dataset:
  eval_step(images, labels)

print ('Accuracy after restoring the saved model without strategy: {}'.format(
    eval_accuracy.result()*100))

W0405 15:22:20.914693 140252361586432 tf_logging.py:161] Entity <method-wrapper '__call__' of weakref object at 0x7f8e145d07c8> could not be transformed and will be staged without change. Error details can be found in the logs when running with the env variable AUTOGRAPH_VERBOSITY >= 1. Please report this to the AutoGraph team. Cause: Object conversion is not yet supported. If you are trying to convert code that uses an existing object, try including the creation of that object in the conversion. For example, instead of converting the method of a class, try converting the entire class instead. See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/README.md#using-the-functional-api for more information.
W0405 15:22:20.920872 140252361586432 tf_logging.py:161] Entity <method-wrapper '__call__' of weakref object at 0x7f8e14520598> could not be transformed and will be staged without change. Error details can be found in the logs when running with the env variable AUTOGRAPH_VERBOSITY >= 1. Please report this to the AutoGraph team. Cause: Object conversion is not yet supported. If you are trying to convert code that uses an existing object, try including the creation of that object in the conversion. For example, instead of converting the method of a class, try converting the entire class instead. See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/README.md#using-the-functional-api for more information.
W0405 15:22:20.925516 140252361586432 tf_logging.py:161] Entity <method-wrapper '__call__' of weakref object at 0x7f8e145d02c8> could not be transformed and will be staged without change. Error details can be found in the logs when running with the env variable AUTOGRAPH_VERBOSITY >= 1. Please report this to the AutoGraph team. Cause: Object conversion is not yet supported. If you are trying to convert code that uses an existing object, try including the creation of that object in the conversion. For example, instead of converting the method of a class, try converting the entire class instead. See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/README.md#using-the-functional-api for more information.
W0405 15:22:20.930133 140252361586432 tf_logging.py:161] Entity <method-wrapper '__call__' of weakref object at 0x7f8e145d0318> could not be transformed and will be staged without change. Error details can be found in the logs when running with the env variable AUTOGRAPH_VERBOSITY >= 1. Please report this to the AutoGraph team. Cause: Object conversion is not yet supported. If you are trying to convert code that uses an existing object, try including the creation of that object in the conversion. For example, instead of converting the method of a class, try converting the entire class instead. See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/README.md#using-the-functional-api for more information.
W0405 15:22:20.934626 140252361586432 tf_logging.py:161] Entity <method-wrapper '__call__' of weakref object at 0x7f8e145d0368> could not be transformed and will be staged without change. Error details can be found in the logs when running with the env variable AUTOGRAPH_VERBOSITY >= 1. Please report this to the AutoGraph team. Cause: Object conversion is not yet supported. If you are trying to convert code that uses an existing object, try including the creation of that object in the conversion. For example, instead of converting the method of a class, try converting the entire class instead. See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/README.md#using-the-functional-api for more information.
W0405 15:22:20.938569 140252361586432 tf_logging.py:161] Entity <method-wrapper '__call__' of weakref object at 0x7f8e145d03b8> could not be transformed and will be staged without change. Error details can be found in the logs when running with the env variable AUTOGRAPH_VERBOSITY >= 1. Please report this to the AutoGraph team. Cause: Object conversion is not yet supported. If you are trying to convert code that uses an existing object, try including the creation of that object in the conversion. For example, instead of converting the method of a class, try converting the entire class instead. See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/README.md#using-the-functional-api for more information.
W0405 15:22:20.942585 140252361586432 tf_logging.py:161] Entity <method-wrapper '__call__' of weakref object at 0x7f8e145d0408> could not be transformed and will be staged without change. Error details can be found in the logs when running with the env variable AUTOGRAPH_VERBOSITY >= 1. Please report this to the AutoGraph team. Cause: Object conversion is not yet supported. If you are trying to convert code that uses an existing object, try including the creation of that object in the conversion. For example, instead of converting the method of a class, try converting the entire class instead. See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/README.md#using-the-functional-api for more information.
W0405 15:22:20.946619 140252361586432 tf_logging.py:161] Entity <method-wrapper '__call__' of weakref object at 0x7f8e145d06d8> could not be transformed and will be staged without change. Error details can be found in the logs when running with the env variable AUTOGRAPH_VERBOSITY >= 1. Please report this to the AutoGraph team. Cause: Object conversion is not yet supported. If you are trying to convert code that uses an existing object, try including the creation of that object in the conversion. For example, instead of converting the method of a class, try converting the entire class instead. See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/README.md#using-the-functional-api for more information.

Accuracy after restoring the saved model without strategy: 90.97000122070312

End
