TensorFlow 2.0 : Advanced Tutorials : Distributed training :- Multi-worker training with Estimator (translation/commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 10/31/2019

* This page is a translation, with supplementary explanations where appropriate, of the following page from TF 2.0 – Advanced Tutorials – Distributed training on the TensorFlow org site:

Multi-worker training with Estimator

* The sample code has been tested, and has been extended or modified where necessary.
* Feel free to link to this page; we would appreciate a note to sales-info@classcat.com if you do.

Distributed training :- Multi-worker training with Estimator

Overview

Note: While you can use the tf.distribute API with Estimator, we recommend using Keras with tf.distribute instead (see Multi-worker Training with Keras). Estimator training with tf.distribute.Strategy currently has limited support.

This tutorial demonstrates how tf.distribute.Strategy can be used for distributed multi-worker training with tf.estimator. If you write your code using tf.estimator and want to scale beyond a single machine with high performance, this tutorial is for you.

Before getting started, please read the tf.distribute.Strategy guide. The multi-GPU training guide is also relevant, because this tutorial uses the same model.

Setup

First, set up TensorFlow and the necessary imports.

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow_datasets as tfds
import tensorflow as tf
tfds.disable_progress_bar()

import os, json

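Input function

This tutorial uses the MNIST dataset from TensorFlow Datasets. The code here is similar to the multi-GPU training tutorial with one key difference: when using Estimator for multi-worker training, the dataset must be sharded to ensure model convergence. The input data is sharded by worker index, so that each worker processes a distinct 1/num_workers portion of the dataset.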

BUFFER_SIZE = 10000
BATCH_SIZE = 64

def input_fn(mode, input_context=None):
  datasets, info = tfds.load(name='mnist',
                                with_info=True,
                                as_supervised=True)
  mnist_dataset = (datasets['train'] if mode == tf.estimator.ModeKeys.TRAIN else
                   datasets['test'])

  def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

  if input_context:
    mnist_dataset = mnist_dataset.shard(input_context.num_input_pipelines,
                                        input_context.input_pipeline_id)
  return mnist_dataset.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

Another reasonable approach to achieving convergence is to shuffle the dataset with a different seed on each worker.
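The sharding rule can be illustrated without TensorFlow. The sketch below assumes that Dataset.shard(num_shards, index) keeps every num_shards-th element starting at position index, which is how tf.data documents it. The helper name shard_indices is hypothetical and only models that rule on a plain Python list.

```python
def shard_indices(elements, num_shards, index):
    """Model of Dataset.shard: keep every num_shards-th element, starting at index."""
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    return elements[index::num_shards]


examples = list(range(10))  # stand-in for ten training examples
num_workers = 2

# Each worker sees a disjoint 1/num_workers slice of the data.
shards = [shard_indices(examples, num_workers, i) for i in range(num_workers)]
for worker_index, shard in enumerate(shards):
    print("worker", worker_index, "->", shard)
```

Together the shards cover every example exactly once, so no worker repeats another worker's data within a pass.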

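Multi-worker configuration

One of the key differences in this tutorial (compared with the multi-GPU training tutorial) is the multi-worker setup. The TF_CONFIG environment variable is the standard way to specify the cluster configuration to each worker that is part of the cluster.

TF_CONFIG has two components: cluster and task. cluster provides information about the entire cluster, namely its workers and parameter servers. task provides information about the current task. In this example, the task type is worker and the task index is 0.

For illustration purposes, this tutorial shows how to set TF_CONFIG with two workers on localhost. In practice, you would create multiple workers on external IP addresses and ports, and set TF_CONFIG appropriately on each worker, i.e. change the task index.

Warning: Do not execute the following code in Colab. TensorFlow's runtime will attempt to create a gRPC server at the specified IP address and port, which is likely to fail.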

os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': ["localhost:12345", "localhost:23456"]
    },
    'task': {'type': 'worker', 'index': 0}
})
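Because only the task index differs between workers, the per-worker values can be generated from one host list. The following is a minimal sketch using only the standard library; the helper name make_tf_config is hypothetical, and the host list simply reuses the two localhost addresses from the example above.

```python
import json


def make_tf_config(worker_hosts, task_index, task_type="worker"):
    """Return the TF_CONFIG JSON string for one task in the cluster."""
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index out of range for the worker list")
    return json.dumps({
        "cluster": {"worker": list(worker_hosts)},
        "task": {"type": task_type, "index": task_index},
    })


worker_hosts = ["localhost:12345", "localhost:23456"]

# One configuration string per worker; the cluster part is identical in each.
configs = [make_tf_config(worker_hosts, i) for i in range(len(worker_hosts))]
for config in configs:
    print(config)
```

Each worker process would then export its own string as TF_CONFIG before TensorFlow starts.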

Define the model

Write the layers, the optimizer, and the loss function for training. As in the multi-GPU training tutorial, this tutorial defines the model with Keras layers.

LEARNING_RATE = 1e-4
def model_fn(features, labels, mode):
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10)
  ])
  logits = model(features, training=False)

  if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = {'logits': logits}
    return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

  optimizer = tf.compat.v1.train.GradientDescentOptimizer(
      learning_rate=LEARNING_RATE)
  loss = tf.keras.losses.SparseCategoricalCrossentropy(
      from_logits=True, reduction=tf.keras.losses.Reduction.NONE)(labels, logits)
  loss = tf.reduce_sum(loss) * (1. / BATCH_SIZE)
  if mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec(mode, loss=loss)

  return tf.estimator.EstimatorSpec(
      mode=mode,
      loss=loss,
      train_op=optimizer.minimize(
          loss, tf.compat.v1.train.get_or_create_global_step()))

Note: The learning rate is fixed in this example, but in general you may need to adjust it based on the global batch size.
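One common way to make that adjustment is the linear scaling heuristic, which multiplies the learning rate by the ratio of the global batch size to a reference batch size. The tutorial does not prescribe this rule, so treat the sketch below as an assumption; the function name and the choice of 64 as the reference batch size are hypothetical.

```python
def scaled_learning_rate(base_lr, per_replica_batch_size, num_replicas,
                         reference_batch_size):
    """Linear scaling heuristic (an assumption, not part of the tutorial)."""
    # The global batch is what all replicas process together in one step.
    global_batch_size = per_replica_batch_size * num_replicas
    learning_rate = base_lr * global_batch_size / reference_batch_size
    return learning_rate, global_batch_size


# Two replicas, each consuming batches of 64 examples.
learning_rate, global_batch_size = scaled_learning_rate(
    base_lr=1e-4, per_replica_batch_size=64, num_replicas=2,
    reference_batch_size=64)
print(global_batch_size, learning_rate)
```

Whether linear scaling is appropriate depends on the model and optimizer, so the result is a starting point for tuning rather than a fixed recipe.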

MultiWorkerMirroredStrategy

To train the model, use an instance of tf.distribute.experimental.MultiWorkerMirroredStrategy. MultiWorkerMirroredStrategy creates copies of all variables in the model's layers on each device across all workers. It uses CollectiveOps, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The tf.distribute.Strategy guide has more details about this strategy.

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

INFO:tensorflow:Single-worker CollectiveAllReduceStrategy with local_devices = ('/device:GPU:0',), communication = CollectiveCommunication.AUTO

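Train and evaluate the model

Next, specify the distribution strategy in the RunConfig for the estimator, then train and evaluate by calling tf.estimator.train_and_evaluate. This tutorial distributes only the training by specifying the strategy via train_distribute. It is also possible to distribute the evaluation via eval_distribute.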

config = tf.estimator.RunConfig(train_distribute=strategy)

classifier = tf.estimator.Estimator(
    model_fn=model_fn, model_dir='/tmp/multiworker', config=config)
tf.estimator.train_and_evaluate(
    classifier,
    train_spec=tf.estimator.TrainSpec(input_fn=input_fn),
    eval_spec=tf.estimator.EvalSpec(input_fn=input_fn)
)

INFO:tensorflow:Initializing RunConfig with distribution strategies.
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Using config: {'_master': '', '_save_checkpoints_steps': None, '_num_ps_replicas': 0, '_is_chief': True, '_evaluation_master': '', '_service': None, '_train_distribute': <tensorflow.python.distribute.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7fea2d0fb828>, '_global_id_in_cluster': 0, '_save_summary_steps': 100, '_experimental_max_worker_delay_secs': None, '_keep_checkpoint_every_n_hours': 10000, '_distribute_coordinator_mode': None, '_log_step_count_steps': 100, '_task_type': 'worker', '_keep_checkpoint_max': 5, '_num_worker_replicas': 1, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_device_fn': None, '_session_creation_timeout_secs': 7200, '_tf_random_seed': None, '_save_checkpoints_secs': 600, '_model_dir': '/tmp/multiworker', '_experimental_distribute': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fea2d0fb9b0>, '_protocol': None, '_task_id': 0, '_eval_distribute': None}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
INFO:tensorflow:The `input_fn` accepts an `input_context` which will be given by DistributionStrategy
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/multiworker/model.ckpt.
INFO:tensorflow:loss = 2.3090205, step = 0
INFO:tensorflow:global_step/sec: 137.125
INFO:tensorflow:loss = 2.2972226, step = 100 (0.732 sec)
INFO:tensorflow:global_step/sec: 145.665
INFO:tensorflow:loss = 2.2918024, step = 200 (0.686 sec)
INFO:tensorflow:global_step/sec: 137.544
INFO:tensorflow:loss = 2.305677, step = 300 (0.727 sec)
INFO:tensorflow:global_step/sec: 137.924
INFO:tensorflow:loss = 2.2915964, step = 400 (0.725 sec)
INFO:tensorflow:global_step/sec: 137.804
INFO:tensorflow:loss = 2.2914124, step = 500 (0.725 sec)
INFO:tensorflow:global_step/sec: 142.391
INFO:tensorflow:loss = 2.2710123, step = 600 (0.703 sec)
INFO:tensorflow:global_step/sec: 138.232
INFO:tensorflow:loss = 2.272681, step = 700 (0.723 sec)
INFO:tensorflow:global_step/sec: 160.382
INFO:tensorflow:loss = 2.2810445, step = 800 (0.623 sec)
INFO:tensorflow:global_step/sec: 643.312
INFO:tensorflow:loss = 2.2849498, step = 900 (0.154 sec)
INFO:tensorflow:Saving checkpoints for 938 into /tmp/multiworker/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-10-01T01:21:23Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/multiworker/model.ckpt-938
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [20/100]
INFO:tensorflow:Evaluation [30/100]
INFO:tensorflow:Evaluation [40/100]
INFO:tensorflow:Evaluation [50/100]
INFO:tensorflow:Evaluation [60/100]
INFO:tensorflow:Evaluation [70/100]
INFO:tensorflow:Evaluation [80/100]
INFO:tensorflow:Evaluation [90/100]
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Finished evaluation at 2019-10-01-01:21:25
INFO:tensorflow:Saving dict for global step 938: global_step = 938, loss = 2.2761376
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 938: /tmp/multiworker/model.ckpt-938
INFO:tensorflow:Loss for final step: 1.138028.

({'global_step': 938, 'loss': 2.2761376}, [])
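The final step count of 938 in this output is consistent with a single pass over the 60,000 MNIST training examples in batches of 64, which fits the "Single-worker" line logged when the strategy was created. The sketch below reproduces that arithmetic; the helper name steps_per_pass is hypothetical, and the two-worker figure is an estimate that assumes each worker reads only its own shard.

```python
import math

NUM_TRAIN_EXAMPLES = 60000  # size of the MNIST training split
BATCH_SIZE = 64


def steps_per_pass(num_examples, batch_size, num_shards=1):
    """Steps needed to consume one shard once (the last batch may be partial)."""
    # Shard 0 is the largest shard when the split is uneven.
    examples_per_shard = math.ceil(num_examples / num_shards)
    return math.ceil(examples_per_shard / batch_size)


single_worker_steps = steps_per_pass(NUM_TRAIN_EXAMPLES, BATCH_SIZE)
two_worker_steps = steps_per_pass(NUM_TRAIN_EXAMPLES, BATCH_SIZE, num_shards=2)
print(single_worker_steps, two_worker_steps)
```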

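Optimize training performance

You now have a model and a multi-worker capable Estimator powered by tf.distribute.Strategy. You can try the following techniques to optimize the performance of multi-worker training:

Increase the batch size: The batch size specified here is per GPU. In general, the largest batch size that fits in GPU memory is advisable.
Cast variables: Cast the variables to tf.float if possible. The official ResNet model includes an example of how this can be done.
Use collective communication: MultiWorkerMirroredStrategy provides multiple collective communication implementations.
- RING implements ring-based collectives using gRPC as the cross-host communication layer.
- NCCL uses Nvidia's NCCL to implement collectives.
- AUTO defers the choice to the runtime.
The best choice of collective implementation depends on the number and kind of GPUs and on the network interconnect in the cluster. To override the automatic choice, specify a valid value for the communication parameter of MultiWorkerMirroredStrategy's constructor, e.g. communication=tf.distribute.experimental.CollectiveCommunication.NCCL.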
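A small wrapper can make the override configurable. The sketch below validates a name against the three options listed above and only then builds the strategy. The helper names resolve_communication and build_strategy are hypothetical, and TensorFlow is imported inside build_strategy so that the validation step can be exercised on a machine without TensorFlow installed.

```python
VALID_COMMUNICATION = ("AUTO", "RING", "NCCL")


def resolve_communication(name):
    """Normalise and validate a collective implementation name."""
    normalised = name.strip().upper()
    if normalised not in VALID_COMMUNICATION:
        raise ValueError(
            "communication must be one of %s, got %r"
            % (", ".join(VALID_COMMUNICATION), name))
    return normalised


def build_strategy(name="AUTO"):
    """Create a MultiWorkerMirroredStrategy with the requested implementation."""
    import tensorflow as tf  # imported lazily; requires TensorFlow 2.0

    communication = getattr(
        tf.distribute.experimental.CollectiveCommunication,
        resolve_communication(name))
    return tf.distribute.experimental.MultiWorkerMirroredStrategy(
        communication=communication)


print(resolve_communication("nccl"))
```

Calling build_strategy("NCCL") on a GPU cluster would then be equivalent to passing the enum value directly to the constructor.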

Other code examples

An end-to-end example for multi-worker training in tensorflow/ecosystem using Kubernetes templates. This example starts with a Keras model and converts it to an Estimator using the tf.keras.estimator.model_to_estimator API.
The official ResNet50 model, which can be trained using either MirroredStrategy or MultiWorkerMirroredStrategy.
