TensorFlow 2.0 Alpha : Advanced Tutorials : Loading data :- Load images with tf.data (translation/commentary)

Translation : ClassCat Sales Information Co., Ltd.
Created : 04/11/2019

* This page is a translation, with supplementary notes where appropriate, of the following page from TF 2.0 Alpha – Advanced Tutorials – Loading data on the official TensorFlow site:

Load images with tf.data

* The sample code has been checked to run, but has been extended or modified where necessary.
* Feel free to link to this page; we would appreciate a note at sales-info@classcat.com if you do.

Loading data :- Load images with tf.data

This tutorial provides a simple example of how to load an image dataset using tf.data.

The dataset used in this example is distributed as directories of images, with one class of image per directory.

Setup

from __future__ import absolute_import, division, print_function, unicode_literals

!pip install -q tensorflow==2.0.0-alpha0
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

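Download and inspect the dataset

Retrieve the images

Before you start any training, you need a set of images to teach the network about the new classes you want to recognize. To get started, an archive of creative-commons licensed flower photos has been prepared.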

import pathlib
data_root_orig = tf.keras.utils.get_file(origin='https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
                                         fname='flower_photos', untar=True)
data_root = pathlib.Path(data_root_orig)
print(data_root)

Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
228818944/228813984 [==============================] - 9s 0us/step
/root/.keras/datasets/flower_photos

After downloading 218 MB, you should now have a copy of the flower photos available:

for item in data_root.iterdir():
  print(item)

/root/.keras/datasets/flower_photos/sunflowers
/root/.keras/datasets/flower_photos/LICENSE.txt
/root/.keras/datasets/flower_photos/tulips
/root/.keras/datasets/flower_photos/daisy
/root/.keras/datasets/flower_photos/roses
/root/.keras/datasets/flower_photos/dandelion

import random
all_image_paths = list(data_root.glob('*/*'))
all_image_paths = [str(path) for path in all_image_paths]
random.shuffle(all_image_paths)

image_count = len(all_image_paths)
image_count

all_image_paths[:10]

['/root/.keras/datasets/flower_photos/sunflowers/9555827829_74e6f60f1d_m.jpg',
 '/root/.keras/datasets/flower_photos/dandelion/10437652486_aa86c14985.jpg',
 '/root/.keras/datasets/flower_photos/roses/1801614110_bb9fa46830.jpg',
 '/root/.keras/datasets/flower_photos/roses/7187035716_5d0fb95c31_n.jpg',
 '/root/.keras/datasets/flower_photos/tulips/8677713853_1312f65e71.jpg',
 '/root/.keras/datasets/flower_photos/sunflowers/16832961488_5f7e70eb5e_n.jpg',
 '/root/.keras/datasets/flower_photos/roses/685724528_6cd5cbe203.jpg',
 '/root/.keras/datasets/flower_photos/tulips/8689672277_b289909f97_n.jpg',
 '/root/.keras/datasets/flower_photos/tulips/16582481123_06e8e6b966_n.jpg',
 '/root/.keras/datasets/flower_photos/dandelion/2453532367_fc373df4de.jpg']

Inspect the images

Take a quick look at a few of the images, so you know what you are dealing with:

import os
attributions = (data_root/"LICENSE.txt").open(encoding='utf-8').readlines()[4:]
attributions = [line.split(' CC-BY') for line in attributions]
attributions = dict(attributions)

import IPython.display as display

def caption_image(image_path):
    image_rel = pathlib.Path(image_path).relative_to(data_root)
    return "Image (CC BY 2.0) " + ' - '.join(attributions[str(image_rel)].split(' - ')[:-1])

for n in range(3):
  image_path = random.choice(all_image_paths)
  display.display(display.Image(image_path))
  print(caption_image(image_path))
  print()

Image (CC BY 2.0)  by Warren Rachele

Image (CC BY 2.0)  by William Warby

Image (CC BY 2.0)  by liz west
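
The caption lookup above depends on how each line of LICENSE.txt is split. As a minimal sketch, assuming each line has the form `<relative path> CC-BY <attribution> - <URL>`, the parsing can be traced on a single made-up line (the path, author, and URL below are hypothetical and not taken from the dataset):

```python
# Hypothetical LICENSE.txt line; the real entries are assumed to follow this layout.
sample_line = "roses/0000000000_example.jpg CC-BY by Jane Doe - https://example.com/photo/1\n"

# Splitting on ' CC-BY' yields [relative_path, rest]; dict() turns such pairs into a lookup table.
key, rest = sample_line.split(' CC-BY')
attributions = dict([(key, rest)])

def caption_for(rel_path):
    # Drop the trailing ' - <URL>' field and keep everything before it.
    text = ' - '.join(attributions[rel_path].split(' - ')[:-1])
    return "Image (CC BY 2.0) " + text

print(caption_for("roses/0000000000_example.jpg"))
```

Because the text after `CC-BY` starts with a space, the caption contains two consecutive spaces, which matches the captions printed above.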

Determine the label for each image

List the available labels:

label_names = sorted(item.name for item in data_root.glob('*/') if item.is_dir())
label_names

['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']

Assign an index to each label:

label_to_index = dict((name, index) for index,name in enumerate(label_names))
label_to_index

{'daisy': 0, 'dandelion': 1, 'roses': 2, 'sunflowers': 3, 'tulips': 4}

Create a list of every file and its label index:

all_image_labels = [label_to_index[pathlib.Path(path).parent.name]
                    for path in all_image_paths]

print("First 10 labels indices: ", all_image_labels[:10])

First 10 labels indices:  [3, 1, 2, 2, 4, 3, 2, 4, 4, 1]
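
The labelling steps above can be tried without downloading anything. The following is a small sketch on a throwaway directory tree; the class and file names are made up for illustration, and only the standard library is used:

```python
import pathlib
import tempfile

# Build a tiny stand-in for the dataset layout: one sub-directory per class.
root = pathlib.Path(tempfile.mkdtemp())
for class_name, files in {"roses": ["a.jpg", "b.jpg"], "daisy": ["c.jpg"]}.items():
    (root / class_name).mkdir()
    for file_name in files:
        (root / class_name / file_name).touch()
(root / "LICENSE.txt").touch()  # a top-level file that must not become a label

# '*/*' only matches entries inside a sub-directory, so LICENSE.txt is skipped.
paths = sorted(str(p) for p in root.glob('*/*'))
label_names = sorted(p.name for p in root.iterdir() if p.is_dir())
label_to_index = {name: index for index, name in enumerate(label_names)}
labels = [label_to_index[pathlib.Path(p).parent.name] for p in paths]

print(label_names)
print(labels)
```

The label of each file comes purely from the name of its parent directory, which is why the directory layout matters more than the file names.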

Load and format the images

TensorFlow includes all the tools you need to load and process images:

img_path = all_image_paths[0]
img_path

'/root/.keras/datasets/flower_photos/sunflowers/9555827829_74e6f60f1d_m.jpg'

Here is the raw data:

img_raw = tf.io.read_file(img_path)
print(repr(img_raw)[:100]+"...")

<tf.Tensor: id=1, shape=(), dtype=string, numpy=b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x...

Decode it into an image tensor:

img_tensor = tf.image.decode_image(img_raw)

print(img_tensor.shape)
print(img_tensor.dtype)

(240, 160, 3)

Resize it for your model:

img_final = tf.image.resize(img_tensor, [192, 192])
img_final = img_final/255.0
print(img_final.shape)
print(img_final.numpy().min())
print(img_final.numpy().max())

(192, 192, 3)
0.0
0.99987745

Wrap these up in simple functions for later.

def preprocess_image(image):
  image = tf.image.decode_jpeg(image, channels=3)
  image = tf.image.resize(image, [192, 192])
  image /= 255.0  # normalize to [0,1] range

  return image

def load_and_preprocess_image(path):
  image = tf.io.read_file(path)
  return preprocess_image(image)

import matplotlib.pyplot as plt

image_path = all_image_paths[0]
label = all_image_labels[0]

plt.imshow(load_and_preprocess_image(image_path))
plt.grid(False)
plt.xlabel(caption_image(image_path))
plt.title(label_names[label].title())
print()

Build a tf.data.Dataset

A dataset of images

The easiest way to build a tf.data.Dataset is to use the from_tensor_slices method.

Slicing the array of strings results in a dataset of strings:

path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)

The shapes and types describe the content of each item in the dataset. In this case it is a set of scalar binary strings.

print(path_ds)

<TensorSliceDataset shapes: (), types: tf.string>

Now create a new dataset that loads and formats images on the fly by mapping load_and_preprocess_image over the dataset of paths.

image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)

import matplotlib.pyplot as plt

plt.figure(figsize=(8,8))
for n,image in enumerate(image_ds.take(4)):
  plt.subplot(2,2,n+1)
  plt.imshow(image)
  plt.grid(False)
  plt.xticks([])
  plt.yticks([])
  plt.xlabel(caption_image(all_image_paths[n]))

A dataset of (image, label) pairs

Using the same from_tensor_slices method, you can build a dataset of labels.

label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(all_image_labels, tf.int64))

for label in label_ds.take(10):
  print(label_names[label.numpy()])

sunflowers
dandelion
roses
roses
tulips
sunflowers
roses
tulips
tulips
dandelion

Since the datasets are in the same order, you can simply zip them together to get a dataset of (image, label) pairs.

image_label_ds = tf.data.Dataset.zip((image_ds, label_ds))

The new dataset's shapes and types are themselves tuples of shapes and types, describing each field:

print(image_label_ds)

<ZipDataset shapes: ((192, 192, 3), ()), types: (tf.float32, tf.int64)>

Note: When you have arrays like all_image_labels and all_image_paths, an alternative to tf.data.Dataset.zip is to slice the pair of arrays.

ds = tf.data.Dataset.from_tensor_slices((all_image_paths, all_image_labels))

# The tuples are unpacked into the positional arguments of the mapped function 
def load_and_preprocess_from_path_label(path, label):
  return load_and_preprocess_image(path), label

image_label_ds = ds.map(load_and_preprocess_from_path_label)
image_label_ds

<MapDataset shapes: ((192, 192, 3), ()), types: (tf.float32, tf.int32)>

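Basic methods for training

To train a model with this dataset you will want the data:

To be well shuffled.
To be batched.
To repeat forever.
To have batches available as soon as possible.

These features can be easily added using the tf.data API.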

BATCH_SIZE = 32

# Setting a shuffle buffer size as large as the dataset ensures that the data is
# completely shuffled.
ds = image_label_ds.shuffle(buffer_size=image_count)
ds = ds.repeat()
ds = ds.batch(BATCH_SIZE)
# `prefetch` lets the dataset fetch batches, in the background while the model is training.
ds = ds.prefetch(buffer_size=AUTOTUNE)
ds

<PrefetchDataset shapes: ((None, 192, 192, 3), (None,)), types: (tf.float32, tf.int32)>

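There are a few things to note here:

The order is important.
- A .shuffle after a .repeat would shuffle items across epoch boundaries (some items would be seen twice before others are seen at all).
- A .shuffle after a .batch would shuffle the order of the batches, but would not shuffle the items across batches.
Use a buffer_size the same size as the dataset for a full shuffle. Up to the dataset size, larger values provide better randomization but use more memory.
The shuffle buffer is filled before any elements are pulled from it, so a large buffer_size may cause a delay when the dataset is starting.
The shuffled dataset does not report the end of the dataset until the shuffle buffer is completely empty. The dataset is then restarted by .repeat, causing another wait for the shuffle buffer to be filled.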
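
To make the buffer behaviour in these notes concrete, here is a pure-Python sketch of a fixed-size shuffle buffer. It is an illustration of the idea only, not the tf.data implementation; the helper names `shuffle_buffer` and `repeat` are made up for this example:

```python
import random

def shuffle_buffer(items, buffer_size, rng):
    """Yield items in the order a fixed-size shuffle buffer would emit them."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # fill the buffer before emitting anything
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]          # emit a random buffered element...
        buffer[index] = item         # ...and replace it with the incoming one
    rng.shuffle(buffer)              # the source is exhausted: drain what is left
    yield from buffer

def repeat(make_epoch, epochs):
    """Restart the source `epochs` times, one full pass after another."""
    for _ in range(epochs):
        yield from make_epoch()

rng = random.Random(0)
data = list(range(10))

# buffer_size == 1 cannot reorder anything.
unshuffled = list(shuffle_buffer(data, 1, rng))

# Shuffle, then repeat: every block of len(data) items is one complete epoch.
shuffle_then_repeat = list(repeat(lambda: shuffle_buffer(data, len(data), rng), 2))

# Repeat, then shuffle: items from both epochs share one buffer, so epochs can mix.
repeat_then_shuffle = list(shuffle_buffer(data * 2, len(data) * 2, rng))

print(unshuffled)
print(shuffle_then_repeat)
print(repeat_then_shuffle)
```

With a buffer as large as the data, nothing is emitted until the whole input has been read, which is the start-up delay described in the notes.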

This last point can be addressed by using the tf.data.Dataset.apply method with the fused tf.data.experimental.shuffle_and_repeat function:

ds = image_label_ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE)
ds = ds.prefetch(buffer_size=AUTOTUNE)
ds

<PrefetchDataset shapes: ((None, 192, 192, 3), (None,)), types: (tf.float32, tf.int32)>

Pipe the dataset to a model

Fetch a copy of MobileNet v2 from tf.keras.applications.

This will be used for a simple transfer learning example.

Set the MobileNet weights to be non-trainable:

mobile_net = tf.keras.applications.MobileNetV2(input_shape=(192, 192, 3), include_top=False)
mobile_net.trainable=False

Downloading data from https://github.com/JonathanCMitchell/mobilenet_v2_keras/releases/download/v1.1/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_192_no_top.h5
9412608/9406464 [==============================] - 1s 0us/step

This model expects its input to be normalized to the [-1, 1] range:

help(keras_applications.mobilenet_v2.preprocess_input)

...
This function applies the "Inception" preprocessing which converts
the RGB values from [0, 255] to [-1, 1] 
...

So before passing the input to the MobileNet model, you need to convert it from the [0, 1] range to [-1, 1].

def change_range(image,label):
  return 2*image-1, label

keras_ds = ds.map(change_range)
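
The two rescaling steps (division by 255 in preprocess_image, then change_range) can be checked on plain numbers. This is a small sketch on scalar values rather than tensors; the function names are made up for the illustration:

```python
def to_unit_range(pixel):
    # [0, 255] -> [0, 1], the same arithmetic as in preprocess_image
    return pixel / 255.0

def change_range_value(value):
    # [0, 1] -> [-1, 1], the same arithmetic as in change_range
    return 2 * value - 1

for pixel in (0, 51, 255):
    print(pixel, change_range_value(to_unit_range(pixel)))
```

The darkest pixel value maps to -1 and the brightest to 1, which is the range the model expects.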

MobileNet returns a 6×6 spatial grid of features for each image.

Pass it a batch of images to see:

# The dataset may take a few seconds to start, as it fills its shuffle buffer.
image_batch, label_batch = next(iter(keras_ds))

feature_map_batch = mobile_net(image_batch)
print(feature_map_batch.shape)

(32, 6, 6, 1280)

Then build a model wrapped around MobileNet, and use tf.keras.layers.GlobalAveragePooling2D to average over those spatial dimensions before the output tf.keras.layers.Dense layer:

model = tf.keras.Sequential([
  mobile_net,
  tf.keras.layers.GlobalAveragePooling2D(),
  tf.keras.layers.Dense(len(label_names))])

Now it produces outputs of the expected shape:

logit_batch = model(image_batch).numpy()

print("min logit:", logit_batch.min())
print("max logit:", logit_batch.max())
print()

print("Shape:", logit_batch.shape)

min logit: -2.1251462
max logit: 2.2703485

Shape: (32, 5)
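
What averaging over the spatial dimensions means can be shown on a tiny example. This is a pure-Python sketch on nested lists, not the Keras layer itself; the 2×2 grid with 3 channels is made up and stands in for the 6×6×1280 feature map:

```python
def global_average_pool(feature_map):
    """Average a [height][width][channels] nested list over height and width."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    return [
        sum(feature_map[y][x][c] for y in range(height) for x in range(width))
        / (height * width)
        for c in range(channels)
    ]

# A made-up 2x2 grid with 3 channels.
tiny_map = [[[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
            [[5.0, 0.0, 2.0], [7.0, 4.0, 2.0]]]
pooled = global_average_pool(tiny_map)
print(pooled)  # one value per channel
```

Each image is reduced to one number per channel, which is why the Dense layer sees a vector of length 1280 rather than a grid.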

Compile the model to describe the training procedure:

model.compile(optimizer=tf.keras.optimizers.Adam(), 
              loss='sparse_categorical_crossentropy',
              metrics=["accuracy"])

There are 2 trainable variables: the Dense weights and bias:

len(model.trainable_variables)

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
mobilenetv2_1.00_192 (Model) (None, 6, 6, 1280)        2257984   
_________________________________________________________________
global_average_pooling2d (Gl (None, 1280)              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 6405      
=================================================================
Total params: 2,264,389
Trainable params: 6,405
Non-trainable params: 2,257,984
_________________________________________________________________
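
The parameter counts in the summary can be reproduced by hand. The sketch below only uses numbers that appear in the summary above:

```python
feature_channels = 1280      # channels after GlobalAveragePooling2D, from the summary
num_classes = 5              # len(label_names)
frozen_params = 2257984      # MobileNetV2 parameters, from the summary

# A Dense layer has one weight per (input, output) pair plus one bias per output.
dense_params = feature_channels * num_classes + num_classes
total_params = frozen_params + dense_params
print(dense_params, total_params)
```

Only the Dense layer's parameters are trainable because the MobileNet weights were frozen earlier.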

Train the model.

Normally you would specify a realistic number of steps per epoch, but for demonstration purposes only 3 steps are run here.

steps_per_epoch=tf.math.ceil(len(all_image_paths)/BATCH_SIZE).numpy()
steps_per_epoch

115.0
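
The value 115 follows from rounding up the number of batches needed to cover the dataset once. As a sketch, assuming the dataset contains 3670 images (the tutorial does not print image_count, so this figure is an assumption):

```python
import math

BATCH_SIZE = 32
image_count = 3670   # assumed dataset size; not shown in the output above

steps_per_epoch = math.ceil(image_count / BATCH_SIZE)
# Size the final batch of one pass would have if the data were not repeated first.
last_batch_size = image_count - (steps_per_epoch - 1) * BATCH_SIZE
print(steps_per_epoch, last_batch_size)
```

Because the pipeline repeats the data before batching, every batch it produces is full; the partial last batch only appears when a single pass is batched on its own.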

# Train on keras_ds, whose images were rescaled to the [-1, 1] range MobileNet expects.
model.fit(keras_ds, epochs=1, steps_per_epoch=3)

3/3 [==============================] - 10s 3s/step - loss: 9.4592 - accuracy: 0.1562

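Performance

Note: This section only shows a few easy tricks that may help performance. For an in-depth guide, see Input Pipeline Performance.

The simple pipeline used above reads each file individually, on every epoch. This is fine for local training on a CPU, but may not be sufficient for GPU training, and is entirely unsuitable for any kind of distributed training.

To investigate, first build a simple function to check the performance of the datasets: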

import time
default_timeit_steps = 2*steps_per_epoch+1

def timeit(ds, steps=default_timeit_steps):
  overall_start = time.time()
  # Fetch a single batch to prime the pipeline (fill the shuffle buffer),
  # before starting the timer
  it = iter(ds.take(steps+1))
  next(it)

  start = time.time()
  for i,(images,labels) in enumerate(it):
    if i%10 == 0:
      print('.',end='')
  print()
  end = time.time()

  duration = end-start
  print("{} batches: {} s".format(steps, duration))
  print("{:0.5f} Images/s".format(BATCH_SIZE*steps/duration))
  print("Total time: {}s".format(end-overall_start))

The performance of the current dataset is:

ds = image_label_ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
ds

<PrefetchDataset shapes: ((None, 192, 192, 3), (None,)), types: (tf.float32, tf.int32)>

timeit(ds)

........................
231.0 batches: 22.15600061416626 s
333.63422 Images/s
Total time: 31.439151525497437s
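
The throughput figure printed by timeit is simply images per timed second. The sketch below recomputes it from the numbers shown above:

```python
BATCH_SIZE = 32
steps_per_epoch = 115
steps = 2 * steps_per_epoch + 1          # default_timeit_steps
duration = 22.15600061416626             # seconds, copied from the output above

images_per_second = BATCH_SIZE * steps / duration
print("{} batches".format(steps))
print("{:0.5f} Images/s".format(images_per_second))
```

Note that the first batch, fetched before the timer starts, is excluded from the duration, so the time to fill the shuffle buffer only shows up in the total time.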

Cache

Use tf.data.Dataset.cache to easily cache calculations across epochs. This is especially performant if the data fits in memory.

Here the images are cached after being pre-processed (decoded and resized):

ds = image_label_ds.cache()
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
ds

<PrefetchDataset shapes: ((None, 192, 192, 3), (None,)), types: (tf.float32, tf.int32)>

timeit(ds)

........................
231.0 batches: 0.7710902690887451 s
9586.42625 Images/s
Total time: 9.224573850631714s
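
The effect of caching can be pictured with a small pure-Python analogue. This is only a sketch of the idea (store results on the first pass, replay them afterwards), not how tf.data implements its cache; `CachedSource` and `fake_decode` are made-up names:

```python
class CachedSource:
    """Replay the results of an expensive per-item function after the first full pass."""

    def __init__(self, items, expensive_fn):
        self._items = items
        self._fn = expensive_fn
        self._cache = None

    def __iter__(self):
        if self._cache is not None:
            yield from self._cache       # later passes: no recomputation
            return
        filled = []
        for item in self._items:
            result = self._fn(item)
            filled.append(result)
            yield result
        self._cache = filled             # only a completed pass populates the cache

calls = []

def fake_decode(path):
    calls.append(path)                   # stands in for read + decode + resize
    return path.upper()

source = CachedSource(["a.jpg", "b.jpg"], fake_decode)
first_epoch = list(source)
second_epoch = list(source)
print(len(calls))
```

The expensive function runs once per item in total, however many epochs follow, which is where the large jump in images per second comes from.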

One disadvantage of using an in-memory cache is that the cache must be rebuilt on each run, giving the same start-up delay every time the dataset is started:

timeit(ds)

........................
231.0 batches: 0.8165614604949951 s
9052.59476 Images/s
Total time: 9.110528230667114s

If the data does not fit in memory, use a cache file:

ds = image_label_ds.cache(filename='./cache.tf-data')
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(1)
ds

<PrefetchDataset shapes: ((None, 192, 192, 3), (None,)), types: (tf.float32, tf.int32)>

timeit(ds)

........................
231.0 batches: 2.653285264968872 s
2785.98012 Images/s
Total time: 15.944914102554321s

The cache file also has the advantage that it can be used to quickly restart the dataset without rebuilding the cache. Note how much faster it is the second time:

timeit(ds)

........................
231.0 batches: 2.185948371887207 s
3381.59862 Images/s
Total time: 3.409513235092163s
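
The start-up cost of each cached run can be read off the timeit outputs as the total time minus the timed duration. The sketch below does that subtraction for the four cached runs above, using only the reported numbers:

```python
# (total seconds, timed seconds) copied from the four cached timeit runs above
runs = {
    "in-memory cache, 1st run": (9.224573850631714, 0.7710902690887451),
    "in-memory cache, 2nd run": (9.110528230667114, 0.8165614604949951),
    "cache file, 1st run": (15.944914102554321, 2.653285264968872),
    "cache file, 2nd run": (3.409513235092163, 2.185948371887207),
}

# Start-up cost is whatever happened before the timer inside timeit started.
startup = {name: total - timed for name, (total, timed) in runs.items()}
for name, seconds in startup.items():
    print("{}: {:0.2f} s before the first timed batch".format(name, seconds))
```

In these measurements only the second run with the cache file avoids the multi-second start-up, while the per-batch throughput of the file cache stays below that of the in-memory cache.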

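TFRecord File

Raw image data

TFRecord files are a simple format for storing a sequence of binary blobs. By packing multiple examples into the same file, TensorFlow is able to read multiple examples at once, which is especially important for performance when using a remote storage service such as GCS.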
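
The idea of packing many binary blobs into one file can be sketched with a simple length-prefixed layout. This is a simplified illustration and not the actual TFRecord format, which additionally stores checksums for each record; the helper names are made up:

```python
import io
import struct

def write_records(stream, blobs):
    """Write each blob preceded by its length so records can be split apart again."""
    for blob in blobs:
        stream.write(struct.pack('<Q', len(blob)))   # 8-byte little-endian length
        stream.write(blob)

def read_records(stream):
    """Yield the blobs back in the order they were written."""
    while True:
        header = stream.read(8)
        if not header:
            return                                   # clean end of stream
        (length,) = struct.unpack('<Q', header)
        yield stream.read(length)

buffer = io.BytesIO()
write_records(buffer, [b'\xff\xd8first', b'second', b''])
buffer.seek(0)
records = list(read_records(buffer))
print(records)
```

Reading such a file is one sequential scan instead of one open-and-read per image, which is the property that helps with remote storage.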

First, build a TFRecord file from the raw image data:

image_ds = tf.data.Dataset.from_tensor_slices(all_image_paths).map(tf.io.read_file)
tfrec = tf.data.experimental.TFRecordWriter('images.tfrec')
tfrec.write(image_ds)

Next, build a dataset that reads from the TFRecord file and decodes and reformats the images using the preprocess_image function defined earlier.

image_ds = tf.data.TFRecordDataset('images.tfrec').map(preprocess_image)

Zip that with the labels dataset defined earlier to get the expected (image, label) pairs.

ds = tf.data.Dataset.zip((image_ds, label_ds))
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds=ds.batch(BATCH_SIZE).prefetch(AUTOTUNE)
ds

<PrefetchDataset shapes: ((None, 192, 192, 3), (None,)), types: (tf.float32, tf.int64)>

timeit(ds)

........................
231.0 batches: 21.571000337600708 s
342.68230 Images/s
Total time: 31.82367253303528s

This is slower than the cached version because the preprocessing is not cached.

Serialized Tensors

To save some of the preprocessing to the TFRecord file, first make a dataset of the processed images, as before:

paths_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)
image_ds = paths_ds.map(load_and_preprocess_image)
image_ds

<MapDataset shapes: (192, 192, 3), types: tf.float32>

Now, instead of a dataset of .jpeg strings, this is a dataset of tensors. To serialize it to a TFRecord file, first convert the dataset of tensors to a dataset of strings.

ds = image_ds.map(tf.io.serialize_tensor)
ds

<MapDataset shapes: (), types: tf.string>

tfrec = tf.data.experimental.TFRecordWriter('images.tfrec')
tfrec.write(ds)

With the preprocessing cached, data can be loaded from the TFRecord file quite efficiently. Just remember to de-serialize each tensor before trying to use it.

ds = tf.data.TFRecordDataset('images.tfrec')

def parse(x):
  result = tf.io.parse_tensor(x, out_type=tf.float32)
  result = tf.reshape(result, [192, 192, 3])
  return result

ds = ds.map(parse, num_parallel_calls=AUTOTUNE)
ds

<ParallelMapDataset shapes: (192, 192, 3), types: tf.float32>

Now add the labels and apply the same standard operations as before:

ds = tf.data.Dataset.zip((ds, label_ds))
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds=ds.batch(BATCH_SIZE).prefetch(AUTOTUNE)
ds

<PrefetchDataset shapes: ((None, 192, 192, 3), (None,)), types: (tf.float32, tf.int64)>

timeit(ds)

........................
231.0 batches: 1.9944682121276855 s
3706.25110 Images/s
Total time: 2.8945322036743164s

End
