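Keras 2 : Guide : Working with preprocessing layers (translation / commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 10/29/2021 (keras 2.6.0)

* This page is a translation of the following Keras document, with supplementary explanations added where appropriate:

Working with preprocessing layers (Author: Francois Chollet, Mark Omernick)

* The sample code has been verified to run; where necessary, it has been extended or modified.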

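Keras preprocessing

The Keras preprocessing layers API allows developers to build Keras-native input processing pipelines. These input processing pipelines can be used as independent preprocessing code in non-Keras workflows, combined directly with Keras models, and exported as part of a Keras SavedModel.

With Keras preprocessing layers, you can build and export models that are truly end-to-end: models that accept raw images or raw structured data as input, and models that handle feature normalization or feature value indexing on their own.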

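Available preprocessing layers

Text preprocessing

tf.keras.layers.TextVectorization : turns raw strings into an encoded representation that can be read by an Embedding layer or a Dense layer.

Numerical features preprocessing

tf.keras.layers.Normalization : performs feature-wise normalization of input features.
tf.keras.layers.Discretization : turns continuous numerical features into integer categorical features.

Categorical features preprocessing

tf.keras.layers.CategoryEncoding : turns integer categorical features into one-hot, multi-hot, or count dense representations.
tf.keras.layers.Hashing : performs categorical feature hashing, also known as the "hashing trick".
tf.keras.layers.StringLookup : turns string categorical values into an encoded representation that can be read by an Embedding layer or a Dense layer.
tf.keras.layers.IntegerLookup : turns integer categorical values into an encoded representation that can be read by an Embedding layer or a Dense layer.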

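Image preprocessing

These layers standardize the inputs of an image model.

tf.keras.layers.Resizing : resizes a batch of images to a target size.
tf.keras.layers.Rescaling : rescales and offsets the values of a batch of images (e.g. going from inputs in the [0, 255] range to inputs in the [0, 1] range).
tf.keras.layers.CenterCrop : returns a center crop of a batch of images.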
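
The rescale-and-offset operation mentioned for Rescaling is an element-wise affine map, inputs * scale + offset. The following is a minimal NumPy sketch of that arithmetic, not the layer itself; the helper name rescale and the toy batch are made up for illustration.

```python
import numpy as np

def rescale(images, scale, offset=0.0):
    """Apply the affine map `images * scale + offset` element-wise."""
    return images.astype("float32") * scale + offset

# A made-up batch of two 2x2 single-channel images with pixel values in [0, 255].
batch = np.array([[[[0], [255]], [[51], [102]]],
                  [[[255], [0]], [[204], [153]]]], dtype="uint8")

unit_range = rescale(batch, scale=1.0 / 255)                   # values in [0, 1]
signed_range = rescale(batch, scale=1.0 / 127.5, offset=-1.0)  # values in [-1, 1]

print(unit_range.min(), unit_range.max())
print(signed_range.min(), signed_range.max())
```

The second call shows why an offset is useful: a scale alone can only map [0, 255] onto a range that starts at 0.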

Image data augmentation

These layers apply random augmentation transforms to a batch of images. They are only active during training.

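The adapt() method

Some preprocessing layers have an internal state that can be computed from a sample of the training data. The stateful preprocessing layers are:

TextVectorization: holds a mapping between string tokens and integer indices.
StringLookup and IntegerLookup: hold a mapping between input values and output indices.
Normalization: holds the mean and standard deviation of the features.
Discretization: holds information about value bucket boundaries.

Crucially, these layers are non-trainable. Their state is not set during training; it must be set before training, either by initializing them from a precomputed constant, or by "adapting" them on data.

You set the state of a preprocessing layer by exposing it to training data, via the adapt() method: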

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

data = np.array([[0.1, 0.2, 0.3], [0.8, 0.9, 1.0], [1.5, 1.6, 1.7],])
layer = layers.Normalization()
layer.adapt(data)
normalized_data = layer(data)

print("Features mean: %.2f" % (normalized_data.numpy().mean()))
print("Features std: %.2f" % (normalized_data.numpy().std()))

Features mean: -0.00
Features std: 1.00
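
To make the "internal state" concrete, here is a minimal NumPy sketch of the statistics involved, assuming per-feature (per-column) mean and variance; it mirrors the idea of the example above and is not the layer's actual implementation.

```python
import numpy as np

data = np.array([[0.1, 0.2, 0.3], [0.8, 0.9, 1.0], [1.5, 1.6, 1.7]])

# "adapt": compute one mean and one variance per feature (column).
mean = data.mean(axis=0)
variance = data.var(axis=0)

# "call": standardize each feature with the stored statistics.
normalized = (data - mean) / np.sqrt(variance)

print("Per-feature mean:", mean)
print("Features mean: %.2f" % normalized.mean())
print("Features std: %.2f" % normalized.std())
```

Because the statistics are fixed once computed, applying the same transform to validation or test data reuses the training-set mean and variance rather than recomputing them.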

The adapt() method takes either a Numpy array or a tf.data.Dataset object. In the case of StringLookup and TextVectorization, you can also pass a list of strings:

data = [
    "ξεῖν᾽, ἦ τοι μὲν ὄνειροι ἀμήχανοι ἀκριτόμυθοι",
    "γίγνοντ᾽, οὐδέ τι πάντα τελείεται ἀνθρώποισι.",
    "δοιαὶ γάρ τε πύλαι ἀμενηνῶν εἰσὶν ὀνείρων:",
    "αἱ μὲν γὰρ κεράεσσι τετεύχαται, αἱ δ᾽ ἐλέφαντι:",
    "τῶν οἳ μέν κ᾽ ἔλθωσι διὰ πριστοῦ ἐλέφαντος,",
    "οἵ ῥ᾽ ἐλεφαίρονται, ἔπε᾽ ἀκράαντα φέροντες:",
    "οἱ δὲ διὰ ξεστῶν κεράων ἔλθωσι θύραζε,",
    "οἵ ῥ᾽ ἔτυμα κραίνουσι, βροτῶν ὅτε κέν τις ἴδηται.",
]
layer = layers.TextVectorization()
layer.adapt(data)
vectorized_text = layer(data)
print(vectorized_text)

tf.Tensor(
[[37 12 25  5  9 20 21  0  0]
 [51 34 27 33 29 18  0  0  0]
 [49 52 30 31 19 46 10  0  0]
 [ 7  5 50 43 28  7 47 17  0]
 [24 35 39 40  3  6 32 16  0]
 [ 4  2 15 14 22 23  0  0  0]
 [36 48  6 38 42  3 45  0  0]
 [ 4  2 13 41 53  8 44 26 11]], shape=(8, 9), dtype=int64)
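
The tensor above shows rows padded with 0 and token indices starting at 2. A plain-Python sketch of how such a mapping could be built is given below, assuming lowercasing, ASCII punctuation stripping, whitespace splitting, index 0 for padding and index 1 for unknown tokens. The function names are hypothetical, and the way ties between equally frequent tokens are ordered here may differ from what the layer does, so exact indices are not expected to match.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and drop ASCII punctuation (a simplification)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(corpus):
    """Most frequent tokens first; index 0 = padding, index 1 = unknown."""
    counts = Counter(tok for line in corpus for tok in standardize(line).split())
    return ["", "[UNK]"] + [tok for tok, _ in counts.most_common()]

def vectorize(corpus, vocabulary):
    """Map tokens to indices and right-pad every row with 0."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    rows = [[index.get(tok, 1) for tok in standardize(line).split()] for line in corpus]
    width = max(len(row) for row in rows)
    return [row + [0] * (width - len(row)) for row in rows]

corpus = ["The cat sat.", "The cat saw the dog!"]
vocabulary = build_vocabulary(corpus)
print(vocabulary)
print(vectorize(corpus + ["A bird sat"], vocabulary))
```

The last sentence contains two words that were never seen while building the vocabulary; both fall back to index 1.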

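In addition, adaptable layers always expose an option to set state directly, via constructor arguments or weight assignment. If the intended state values are known at layer construction time, or are calculated outside of the adapt() call, they can be set without relying on the layer's internal computation. For instance, if external vocabulary files for the TextVectorization, StringLookup, or IntegerLookup layers already exist, they can be loaded directly into the lookup tables by passing a path to the vocabulary file in the layer's constructor arguments.

Here is an example that instantiates a StringLookup layer with a precomputed vocabulary: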

vocab = ["a", "b", "c", "d"]
data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
layer = layers.StringLookup(vocabulary=vocab)
vectorized_data = layer(data)
print(vectorized_data)

tf.Tensor(
[[1 3 4]
 [4 0 2]], shape=(2, 3), dtype=int64)
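
The printed indices can be reproduced with an ordinary dictionary, which may help when reasoning about what the lookup does. This is a plain-Python sketch under the assumption (read off the output above) that known values are numbered from 1 in vocabulary order and unseen values map to 0; the names token_to_index and lookup are illustrative only.

```python
vocab = ["a", "b", "c", "d"]
OOV_INDEX = 0  # assumption: a single out-of-vocabulary slot at index 0

# Known values are numbered from 1 in vocabulary order.
token_to_index = {token: i + 1 for i, token in enumerate(vocab)}

def lookup(rows):
    """Replace every value with its index, falling back to the OOV slot."""
    return [[token_to_index.get(token, OOV_INDEX) for token in row] for row in rows]

data = [["a", "c", "d"], ["d", "z", "b"]]
print(lookup(data))  # [[1, 3, 4], [4, 0, 2]]
```

The value "z" is not in the vocabulary, so it is the only entry mapped to 0.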

Preprocessing data before the model or inside the model

There are two ways you can use preprocessing layers:

Option 1: Make them part of the model, like this:

inputs = keras.Input(shape=input_shape)
x = preprocessing_layer(inputs)
outputs = rest_of_the_model(x)
model = keras.Model(inputs, outputs)

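With this option, preprocessing happens on device, synchronously with the rest of the model execution, which means it benefits from GPU acceleration. If you are training on GPU, this is the best option for the Normalization layer and for all image preprocessing and data augmentation layers.

Option 2: Apply it to your tf.data.Dataset, so as to obtain a dataset that yields batches of preprocessed data, like this: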

dataset = dataset.map(lambda x, y: (preprocessing_layer(x), y))

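With this option, preprocessing happens on CPU, asynchronously, and is buffered before going into the model. In addition, if you call dataset.prefetch(tf.data.AUTOTUNE) on your dataset, the preprocessing happens efficiently in parallel with training: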

dataset = dataset.map(lambda x, y: (preprocessing_layer(x), y))
dataset = dataset.prefetch(tf.data.AUTOTUNE)
model.fit(dataset, ...)

Option 2 is the best choice for TextVectorization and for all structured data preprocessing layers. It can also be a good option if you are training on CPU and use image preprocessing layers.
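
The asynchronous, buffered behaviour of option 2 is essentially a producer-consumer arrangement. The sketch below illustrates that idea with a background thread and a bounded queue from the Python standard library. It is only a conceptual model: tf.data is implemented differently, and the names prefetch, raw_batches, and the stand-in preprocessing function are hypothetical.

```python
import queue
import threading

def prefetch(batches, preprocess, buffer_size=2):
    """Yield preprocess(batch) for each batch, computing ahead in a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buffer.put(preprocess(batch))  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item

# Hypothetical stand-ins: scaling by 1/255 plays the preprocessing layer,
# and summing the batch plays the training step.
raw_batches = [[0, 51, 255], [102, 204, 153]]
for batch in prefetch(raw_batches, lambda b: [v / 255 for v in b]):
    print(sum(batch))
```

While the consumer works on one batch, the producer is already preparing the next ones, up to buffer_size batches ahead.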

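When running on TPU, you should always place preprocessing layers in the tf.data pipeline (with the exception of Normalization and Rescaling, which run fine on TPU and are commonly used as the first layer in an image model).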

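Benefits of doing preprocessing inside the model at inference time

Even if you go with option 2, you may later want to export an inference-only end-to-end model that includes the preprocessing layers. The key benefit of doing this is that it makes your model portable and helps reduce training/serving skew.

When all data preprocessing is part of the model, other people can load and use your model without having to know how each feature is expected to be encoded and normalized. Your inference model can process raw images or raw structured data, and does not require users of the model to be aware of details such as the tokenization scheme used for text, the indexing scheme used for categorical features, or whether image pixel values are normalized to [-1, +1] or to [0, 1]. This is especially powerful if you are exporting your model to another runtime, such as TensorFlow.js: you do not have to reimplement your preprocessing pipeline in JavaScript.

If you initially put your preprocessing layers in your tf.data pipeline, you can still export an inference model that packages the preprocessing. Simply instantiate a new model that chains your preprocessing layers and your training model: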

inputs = keras.Input(shape=input_shape)
x = preprocessing_layer(inputs)
outputs = training_model(x)
inference_model = keras.Model(inputs, outputs)

Quick recipes

Image data augmentation

Note that image data augmentation layers are only active during training (similarly to the Dropout layer).

from tensorflow import keras
from tensorflow.keras import layers

# Create a data augmentation stage with horizontal flipping, rotations, zooms
data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.1),
    ]
)

# Load some data
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
input_shape = x_train.shape[1:]
classes = 10

# Create a tf.data pipeline of augmented images (and their labels)
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.batch(16).map(lambda x, y: (data_augmentation(x), y))


# Create a model and train it on the augmented image data
inputs = keras.Input(shape=input_shape)
x = layers.Rescaling(1.0 / 255)(inputs)  # Rescale inputs
outputs = keras.applications.ResNet50(  # Add the rest of the model
    weights=None, input_shape=input_shape, classes=classes
)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
model.fit(train_dataset, steps_per_epoch=5)

5/5 [==============================] - 11s 527ms/step - loss: 9.2445

You can see a similar setup in action in the example image classification from scratch.
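
The "only active during training" behaviour can be pictured as a transform guarded by a training flag. Below is a minimal NumPy sketch of a random horizontal flip written that way. The function name, the per-image coin flip, and the 0.5 probability are assumptions of this sketch rather than a description of the RandomFlip layer's internals.

```python
import numpy as np

def random_flip_horizontal(images, training, rng):
    """Flip each image left-right with probability 0.5, but only when training."""
    if not training:
        return images  # inference: pass inputs through unchanged
    flip = rng.random(len(images)) < 0.5
    flipped = images[:, :, ::-1, :]  # reverse the width axis of (batch, h, w, c)
    return np.where(flip[:, None, None, None], flipped, images)

rng = np.random.default_rng(0)
images = np.arange(2 * 2 * 3 * 1).reshape(2, 2, 3, 1)

# At inference time the batch comes back untouched.
print(np.array_equal(random_flip_horizontal(images, False, rng), images))  # True

# At training time each image is either unchanged or mirrored.
augmented = random_flip_horizontal(images, True, rng)
print(augmented[..., 0])
```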

Normalizing numerical features

# Load some data
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
x_train = x_train.reshape((len(x_train), -1))
input_shape = x_train.shape[1:]
classes = 10

# Create a Normalization layer and set its internal state using the training data
normalizer = layers.Normalization()
normalizer.adapt(x_train)

# Create a model that includes the normalization layer
inputs = keras.Input(shape=input_shape)
x = normalizer(inputs)
outputs = layers.Dense(classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)

# Train the model
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train)

1563/1563 [==============================] - 2s 889us/step - loss: 2.1196

<keras.callbacks.History at 0x162738a50>

Encoding string categorical features via one-hot encoding

# Define some toy data
data = tf.constant([["a"], ["b"], ["c"], ["b"], ["c"], ["a"]])

# Use StringLookup to build an index of the feature values and encode output.
lookup = layers.StringLookup(output_mode="one_hot")
lookup.adapt(data)

# Convert new test data (which includes unknown feature values)
test_data = tf.constant([["a"], ["b"], ["c"], ["d"], ["e"], [""]])
encoded_data = lookup(test_data)
print(encoded_data)

tf.Tensor(
[[0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]], shape=(6, 4), dtype=float32)

Note that index 0 is reserved for out-of-vocabulary values (values that were not seen during adapt()): in the output above, the unseen values "d", "e", and "" all map to it.

You can see the StringLookup layer in action in the Structured data classification from scratch example.
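
The one-hot tensor printed above can be reproduced with a dictionary lookup followed by row selection from an identity matrix. This NumPy sketch assumes the vocabulary order implied by that output ("c", "b", "a" at indices 1 to 3) and an out-of-vocabulary slot at index 0; the names used are illustrative.

```python
import numpy as np

# Assumption: vocabulary order read off the printed tensor
# ("c" -> 1, "b" -> 2, "a" -> 3); index 0 is the out-of-vocabulary slot.
vocabulary = ["c", "b", "a"]
index = {token: i + 1 for i, token in enumerate(vocabulary)}

def one_hot(values):
    """Look up each value, then pick the matching row of an identity matrix."""
    ids = [index.get(v, 0) for v in values]
    return np.eye(len(vocabulary) + 1, dtype="float32")[ids]

encoded = one_hot(["a", "b", "c", "d", "e", ""])
print(encoded)
```

Every unseen value shares the same one-hot row, so the model cannot tell "d" from "e"; it only learns that the value was unknown.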

Encoding integer categorical features via one-hot encoding

# Define some toy data
data = tf.constant([[10], [20], [20], [10], [30], [0]])

# Use IntegerLookup to build an index of the feature values and encode output.
lookup = layers.IntegerLookup(output_mode="one_hot")
lookup.adapt(data)

# Convert new test data (which includes unknown feature values)
test_data = tf.constant([[10], [10], [20], [50], [60], [0]])
encoded_data = lookup(test_data)
print(encoded_data)

tf.Tensor(
[[0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1.]], shape=(6, 5), dtype=float32)

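Note that in the output above index 0 is the out-of-vocabulary slot (the unseen values 50 and 60 both map to it), while the value 0, which was seen during adapt(), is encoded as an ordinary vocabulary entry. How missing and out-of-vocabulary values are handled can be configured with the mask_token and oov_token constructor arguments of IntegerLookup.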

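You can see the IntegerLookup layer in action in the example structured data classification from scratch.

Applying the hashing trick to an integer categorical feature

If you have a categorical feature that can take many different values (on the order of 10e3 or higher), where each value appears only a few times in the data, indexing and one-hot encoding the feature values becomes impractical and ineffective. Instead, it can be a good idea to apply the "hashing trick": hash the values to a vector of fixed size. This keeps the size of the feature space manageable and removes the need for explicit indexing.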

# Sample data: 10,000 random integers with values between 0 and 100,000
data = np.random.randint(0, 100000, size=(10000, 1))

# Use the Hashing layer to hash the values to the range [0, 63]
hasher = layers.Hashing(num_bins=64, salt=1337)

# Use the CategoryEncoding layer to multi-hot encode the hashed values
encoder = layers.CategoryEncoding(num_tokens=64, output_mode="multi_hot")
encoded_data = encoder(hasher(data))
print(encoded_data.shape)

(10000, 64)
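
The mechanics of the hashing trick fit in a few lines of standard-library Python. In this sketch SHA-256 is used as a stand-in hash purely for illustration, so the bin assignments will not match those of the Hashing layer; the function names and the salt format are likewise assumptions.

```python
import hashlib
import numpy as np

def hash_bin(value, num_bins, salt="1337"):
    """Map any value to a bin in [0, num_bins) with a salted, stable hash."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins

def multi_hot(bins, num_bins):
    """One row per sample with a 1 in the column of its bin."""
    encoded = np.zeros((len(bins), num_bins), dtype="float32")
    encoded[np.arange(len(bins)), bins] = 1.0
    return encoded

values = np.random.randint(0, 100000, size=10000)
bins = [hash_bin(v, num_bins=64) for v in values]
encoded = multi_hot(bins, num_bins=64)
print(encoded.shape, min(bins), max(bins))
```

No vocabulary is ever stored: the bin is a pure function of the value, so values never seen before are encoded just as easily. The price is that distinct values can collide in the same bin.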

Encoding text as a sequence of token indices

This is how you should preprocess text to be passed to an Embedding layer.

# Define some text data to adapt the layer
adapt_data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)

# Create a TextVectorization layer
text_vectorizer = layers.TextVectorization(output_mode="int")
# Index the vocabulary via `adapt()`
text_vectorizer.adapt(adapt_data)

# Try out the layer
print(
    "Encoded text:\n", text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)

# Create a simple model
inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(input_dim=text_vectorizer.vocabulary_size(), output_dim=16)(inputs)
x = layers.GRU(8)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

# Create a labeled dataset (which includes unknown tokens)
train_dataset = tf.data.Dataset.from_tensor_slices(
    (["The Brain is deeper than the sea", "for if they are held Blue to Blue"], [1, 0])
)

# Preprocess the string inputs, turning them into int sequences
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
# Train the model on the int sequences
print("\nTraining model...")
model.compile(optimizer="rmsprop", loss="mse")
model.fit(train_dataset)

# For inference, you can export a model that accepts strings as input
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = model(x)
end_to_end_model = keras.Model(inputs, outputs)

# Call the end-to-end model on test data (which includes unknown tokens)
print("\nCalling end-to-end model on test string...")
test_data = tf.constant(["The one the other will absorb"])
test_output = end_to_end_model(test_data)
print("Model output:", test_output)

Encoded text:
 [[ 2 19 14  1  9  2  1]]

Training model...
1/1 [==============================] - 2s 2s/step - loss: 0.5064

Calling end-to-end model on test string...
Model output: tf.Tensor([[0.04655712]], shape=(1, 1), dtype=float32)

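You can see the TextVectorization layer in action, combined with an Embedding layer, in the example text classification from scratch.

Note that when training such a model, for best performance, you should always use the TextVectorization layer as part of the input pipeline.

Encoding text as a dense matrix of ngrams with multi-hot encoding

This is how you should preprocess text to be passed to a Dense layer.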

# Define some text data to adapt the layer
adapt_data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "multi_hot" output_mode
# and ngrams=2 (index all bigrams)
text_vectorizer = layers.TextVectorization(output_mode="multi_hot", ngrams=2)
# Index the bigrams via `adapt()`
text_vectorizer.adapt(adapt_data)

# Try out the layer
print(
    "Encoded text:\n", text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)

# Create a simple model
inputs = keras.Input(shape=(text_vectorizer.vocabulary_size(),))
outputs = layers.Dense(1)(inputs)
model = keras.Model(inputs, outputs)

# Create a labeled dataset (which includes unknown tokens)
train_dataset = tf.data.Dataset.from_tensor_slices(
    (["The Brain is deeper than the sea", "for if they are held Blue to Blue"], [1, 0])
)

# Preprocess the string inputs, turning them into multi-hot vectors
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
# Train the model on the multi-hot vectors
print("\nTraining model...")
model.compile(optimizer="rmsprop", loss="mse")
model.fit(train_dataset)

# For inference, you can export a model that accepts strings as input
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = model(x)
end_to_end_model = keras.Model(inputs, outputs)

# Call the end-to-end model on test data (which includes unknown tokens)
print("\nCalling end-to-end model on test string...")
test_data = tf.constant(["The one the other will absorb"])
test_output = end_to_end_model(test_data)
print("Model output:", test_output)

Encoded text:
 [[1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.]]

Training model...
1/1 [==============================] - 0s 186ms/step - loss: 0.2771

Calling end-to-end model on test string...
Model output: tf.Tensor([[-0.96185416]], shape=(1, 1), dtype=float32)
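
To see what a multi-hot vector over unigrams and bigrams contains, here is a plain-Python sketch. It assumes that ngrams=2 covers both single tokens and adjacent pairs and that column 0 collects everything outside the vocabulary, which is consistent with the leading 1 in the encoded text above. The toy vocabulary and function names are hypothetical.

```python
def ngrams_up_to_2(text):
    """Return the unigrams followed by the bigrams of a lowercased sentence."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def multi_hot(text, vocabulary):
    """1.0 where a vocabulary entry occurs in the text; column 0 collects unknowns."""
    index = {gram: i for i, gram in enumerate(vocabulary)}
    row = [0.0] * len(vocabulary)
    for gram in ngrams_up_to_2(text):
        row[index.get(gram, 0)] = 1.0
    return row

# Hypothetical toy vocabulary; entry 0 is the out-of-vocabulary slot.
vocabulary = ["[UNK]", "the", "brain", "the brain", "is", "sky"]
print(ngrams_up_to_2("The Brain is deeper"))
print(multi_hot("The Brain is deeper", vocabulary))
```

Unlike the integer-sequence encoding, the result has a fixed width equal to the vocabulary size and discards word order beyond what the bigrams capture.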

Encoding text as a dense matrix of ngrams with TF-IDF weighting

This is an alternative way of preprocessing text before passing it to a Dense layer.

# Define some text data to adapt the layer
adapt_data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "tf-idf" output_mode
# (multi-hot with TF-IDF weighting) and ngrams=2 (index all bigrams)
text_vectorizer = layers.TextVectorization(output_mode="tf-idf", ngrams=2)
# Index the bigrams and learn the TF-IDF weights via `adapt()`
text_vectorizer.adapt(adapt_data)

# Try out the layer
print(
    "Encoded text:\n", text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)

# Create a simple model
inputs = keras.Input(shape=(text_vectorizer.vocabulary_size(),))
outputs = layers.Dense(1)(inputs)
model = keras.Model(inputs, outputs)

# Create a labeled dataset (which includes unknown tokens)
train_dataset = tf.data.Dataset.from_tensor_slices(
    (["The Brain is deeper than the sea", "for if they are held Blue to Blue"], [1, 0])
)

# Preprocess the string inputs, turning them into TF-IDF-weighted vectors
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
# Train the model on the TF-IDF-weighted vectors
print("\nTraining model...")
model.compile(optimizer="rmsprop", loss="mse")
model.fit(train_dataset)

# For inference, you can export a model that accepts strings as input
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = model(x)
end_to_end_model = keras.Model(inputs, outputs)

# Call the end-to-end model on test data (which includes unknown tokens)
print("\nCalling end-to-end model on test string...")
test_data = tf.constant(["The one the other will absorb"])
test_output = end_to_end_model(test_data)
print("Model output:", test_output)

Encoded text:
 [[5.461647  1.6945957 0.        0.        0.        0.        0.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        1.0986123 1.0986123 1.0986123 0.        0.
  0.        0.        0.        0.        0.        0.        0.
  1.0986123 0.        0.        0.        0.        0.        0.
  0.        1.0986123 1.0986123 0.        0.        0.       ]]

Training model...
1/1 [==============================] - 0s 241ms/step - loss: 9.6274

Calling end-to-end model on test string...
Model output: tf.Tensor([[-1.0759696]], shape=(1, 1), dtype=float32)
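
The non-zero weights in the encoded text above are consistent with a smoothed inverse document frequency, idf = ln(1 + N / (1 + df)), multiplied by the term count: ln(3) ≈ 1.0986123 for an entry found in one of the four adaptation sentences, and 2 · ln(7/3) ≈ 1.6945957 for "the". The sketch below assumes that formula. It is inferred from the printed numbers, not taken from the Keras source, and it handles unigrams only for brevity.

```python
import math

def idf(num_documents, document_frequency):
    """Smoothed inverse document frequency: ln(1 + N / (1 + df))."""
    return math.log(1 + num_documents / (1 + document_frequency))

adapt_data = [
    "The Brain is wider than the Sky",
    "For put them side by side",
    "The one the other will contain",
    "With ease and You beside",
]
documents = [set(line.lower().split()) for line in adapt_data]

def tf_idf(token, text):
    """Term count in `text` times the idf learned from the adaptation data."""
    term_frequency = text.lower().split().count(token)
    document_frequency = sum(token in doc for doc in documents)
    return term_frequency * idf(len(documents), document_frequency)

query = "The Brain is deeper than the sea"
print(round(tf_idf("the", query), 7))    # 2 occurrences, found in 2 of 4 documents
print(round(tf_idf("brain", query), 7))  # 1 occurrence, found in 1 of 4 documents
```

Frequent entries such as "the" get a smaller per-occurrence weight than rare ones, which is the point of the weighting.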

Important gotchas

Working with lookup layers with very large vocabularies

You may find yourself working with a very large vocabulary in a TextVectorization, a StringLookup layer, or an IntegerLookup layer. Typically, a vocabulary larger than 500MB would be considered “very large”.

In such a case, for best performance, you should avoid using adapt(). Instead, pre-compute your vocabulary in advance (you could use Apache Beam or TF Transform for this) and store it in a file. Then load the vocabulary into the layer at construction time by passing the filepath as the vocabulary argument.
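
As a small-scale illustration of the pre-computation step, the sketch below counts tokens with the standard library and writes them to a text file, most frequent first. The one-token-per-line layout, the file location, and the function name write_vocabulary are assumptions of this sketch; a corpus large enough to need this treatment would use a distributed tool such as those named above.

```python
import os
import tempfile
from collections import Counter

def write_vocabulary(lines, path, max_tokens=None):
    """Count whitespace-separated tokens and write the most frequent ones, one per line."""
    counts = Counter(token for line in lines for token in line.lower().split())
    tokens = [token for token, _ in counts.most_common(max_tokens)]
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(tokens) + "\n")
    return tokens

corpus = ["the brain is wider than the sky", "the one the other will contain"]
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.txt")  # hypothetical location
tokens = write_vocabulary(corpus, vocab_path, max_tokens=5)
print(tokens)
```

The resulting path is what would then be passed as the vocabulary argument when constructing the layer.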

Using lookup layers on a TPU pod or with ParameterServerStrategy

There is an outstanding issue that causes performance to degrade when using a TextVectorization, StringLookup, or IntegerLookup layer while training on a TPU pod or on multiple machines via ParameterServerStrategy. This is slated to be fixed in TensorFlow 2.7.
