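Keras 2 : examples : Image classification with ConvMixer (translation/commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 11/15/2021 (keras 2.7.0)

* This page is a translation of the following Keras documentation, with supplementary explanations added where appropriate: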

Code examples : Computer Vision : Image classification with ConvMixer (Author: Sayak Paul)

* The sample code has been verified to run, with additions and modifications made where necessary.
* Feel free to link to this page; we would appreciate a short note to sales-info@classcat.com.

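Keras 2 : examples : Image classification with ConvMixer

Description: An all-convolutional network applied to patches of images.

Introduction

Vision Transformers (ViT; Dosovitskiy et al.) extract small patches from the input images, linearly project them, and then apply Transformer (Vaswani et al.) blocks. The application of ViTs to image recognition tasks is quickly becoming a promising area of research, because ViTs eliminate the need for strong inductive biases (such as convolutions) to model locality. This presents them as a general computation primitive capable of learning from the training data alone, with as few inductive priors as possible. ViTs yield great downstream performance when trained with proper regularization, data augmentation, and relatively large datasets.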
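The "extract patches, then project them linearly" step can be sketched in NumPy. This is a minimal illustration under assumed sizes (a 32x32 RGB image, 2x2 patches, an 8-dimensional embedding, random weights, no bias), not the ViT reference implementation. It also checks that the same result is obtained from a convolution whose kernel size and stride both equal the patch size, which is the form a convolutional patch-embedding stem takes.

```python
import numpy as np

rng = np.random.default_rng(0)
patch_size, embed_dim = 2, 8  # illustrative sizes, chosen for this sketch
image = rng.normal(size=(32, 32, 3))  # height, width, channels
weights = rng.normal(size=(patch_size * patch_size * 3, embed_dim))

# 1) Cut the image into non-overlapping patches and flatten each patch.
grid = 32 // patch_size
patches = image.reshape(grid, patch_size, grid, patch_size, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(grid * grid, -1)

# 2) Linearly project every flattened patch to an embedding vector.
embeddings = patches @ weights

# The same computation written as a convolution whose kernel size and
# stride both equal the patch size (no padding, no bias).
kernel = weights.reshape(patch_size, patch_size, 3, embed_dim)
conv_out = np.zeros((grid, grid, embed_dim))
for i in range(grid):
    for j in range(grid):
        window = image[
            i * patch_size : (i + 1) * patch_size,
            j * patch_size : (j + 1) * patch_size,
        ]
        conv_out[i, j] = np.tensordot(window, kernel, axes=3)

print(embeddings.shape)
print(np.allclose(embeddings, conv_out.reshape(grid * grid, embed_dim)))
```

With these sizes there are 16 x 16 = 256 patches, each flattened to 2 x 2 x 3 = 12 values before projection.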

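In the Patches Are All You Need paper (note: at the time of writing, it is a submission to the ICLR 2022 conference), the authors extend the idea of using patches to train an all-convolutional network and demonstrate competitive results. Their architecture, named ConvMixer, borrows recipes from recent isotropic architectures such as ViT and MLP-Mixer (Tolstikhin et al.): using the same depth and resolution across different layers of the network, residual connections, and so on.

In this example, we implement the ConvMixer model and demonstrate its performance on the CIFAR-10 dataset.

To use the AdamW optimizer, we need to install TensorFlow Addons: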

pip install -U -q tensorflow-addons

from tensorflow.keras import layers
from tensorflow import keras

import matplotlib.pyplot as plt
import tensorflow_addons as tfa
import tensorflow as tf
import numpy as np

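Hyperparameters

To keep the run time short, we train the model for only 10 epochs. To focus on the core ideas of ConvMixer, we do not use other training-specific elements such as RandAugment (Cubuk et al.). If you are interested in learning more about those details, please refer to the original paper.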

learning_rate = 0.001
weight_decay = 0.0001
batch_size = 128
num_epochs = 10

Load the CIFAR-10 dataset

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
val_split = 0.1

val_indices = int(len(x_train) * val_split)
new_x_train, new_y_train = x_train[val_indices:], y_train[val_indices:]
x_val, y_val = x_train[:val_indices], y_train[:val_indices]

print(f"Training data samples: {len(new_x_train)}")
print(f"Validation data samples: {len(x_val)}")
print(f"Test data samples: {len(x_test)}")

Training data samples: 45000
Validation data samples: 5000
Test data samples: 10000
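The sample counts above also determine how many batches each dataset yields per epoch. The sketch below redoes that arithmetic in plain Python, assuming the 50,000-image CIFAR-10 training set and that the final, smaller batch is kept rather than dropped; `num_batches` is a helper introduced only for this illustration.

```python
import math

num_train_total = 50_000  # CIFAR-10 training images before the split
num_test = 10_000
val_split = 0.1
batch_size = 128

# Same split arithmetic as above: the first 10% become validation data.
num_val = int(num_train_total * val_split)
num_train = num_train_total - num_val


def num_batches(num_samples, batch_size):
    # The last, partially filled batch is kept, so round up.
    return math.ceil(num_samples / batch_size)


print(num_train, num_val, num_test)
print(num_batches(num_train, batch_size))
print(num_batches(num_val, batch_size))
print(num_batches(num_test, batch_size))
```

The training and test results (352 and 79 batches) correspond to the `352/352` and `79/79` step counters in the training logs.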

Prepare tf.data.Dataset objects

Our data augmentation pipeline differs from the one the authors used for the CIFAR-10 dataset, but it is sufficient for the purpose of this example.

image_size = 32
auto = tf.data.AUTOTUNE

data_augmentation = keras.Sequential(
    [layers.RandomCrop(image_size, image_size), layers.RandomFlip("horizontal"),],
    name="data_augmentation",
)


def make_datasets(images, labels, is_train=False):
    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    if is_train:
        dataset = dataset.shuffle(batch_size * 10)
    dataset = dataset.batch(batch_size)
    if is_train:
        dataset = dataset.map(
            lambda x, y: (data_augmentation(x), y), num_parallel_calls=auto
        )
    return dataset.prefetch(auto)


train_dataset = make_datasets(new_x_train, new_y_train, is_train=True)
val_dataset = make_datasets(x_val, y_val)
test_dataset = make_datasets(x_test, y_test)

2021-10-17 03:43:59.588315: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-17 03:43:59.596532: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-17 03:43:59.597211: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-17 03:43:59.622016: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-17 03:43:59.622853: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-17 03:43:59.623542: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-17 03:43:59.624174: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-17 03:44:00.067659: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-17 03:44:00.068334: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-17 03:44:00.068970: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-17 03:44:00.069615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14684 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:04.0, compute capability: 7.0

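ConvMixer utilities

The following figure (taken from the original paper) depicts the ConvMixer model:

ConvMixer is very similar to the MLP-Mixer model, with the following key differences:

* Instead of fully-connected layers, it uses standard convolution layers.
* Instead of LayerNorm (which is common for ViTs and MLP-Mixers), it uses BatchNorm.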
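The practical difference between the two normalization schemes is the set of axes over which statistics are computed. The NumPy sketch below illustrates this on a random batch of feature maps. It is a simplification: the learned scale and offset are omitted, BatchNorm is shown in training mode only (batch statistics, no moving averages), and the tensor sizes and `eps` value are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# A batch of feature maps: (batch, height, width, channels).
x = rng.normal(loc=3.0, scale=2.0, size=(4, 16, 16, 8))
eps = 1e-3

# BatchNorm (training mode): one mean/variance per channel, computed
# across the batch and all spatial positions.
bn_mean = x.mean(axis=(0, 1, 2), keepdims=True)
bn_var = x.var(axis=(0, 1, 2), keepdims=True)
x_bn = (x - bn_mean) / np.sqrt(bn_var + eps)

# LayerNorm over the channel axis: one mean/variance per sample and
# spatial position, computed across that position's channels.
ln_mean = x.mean(axis=-1, keepdims=True)
ln_var = x.var(axis=-1, keepdims=True)
x_ln = (x - ln_mean) / np.sqrt(ln_var + eps)

print(bn_mean.shape)  # statistics per channel
print(ln_mean.shape)  # statistics per sample and position
```

BatchNorm statistics therefore depend on the other samples in the batch, while LayerNorm statistics do not.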

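Two types of convolution layers are used in ConvMixer: (1) depthwise convolutions, which mix the spatial locations of the image, and (2) pointwise convolutions (which follow the depthwise convolutions), which mix channel-wise information across the patches. Another key point is the use of larger kernel sizes to allow a larger receptive field.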
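What "mixing spatial locations" and "mixing channels" mean can be made concrete with a small NumPy sketch. The two helper functions below are hypothetical, written only for this illustration (no bias, stride 1, zero "same" padding, odd kernel size); they are not the Keras layer implementations.

```python
import numpy as np


def depthwise_conv(x, kernel):
    """One k x k filter per channel, with "same" zero padding (k odd)."""
    k = kernel.shape[0]
    pad = k // 2
    x_pad = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            window = x_pad[i : i + k, j : j + k, :]
            # Sum over the spatial window only; channels stay separate.
            out[i, j] = (window * kernel).sum(axis=(0, 1))
    return out


def pointwise_conv(x, kernel):
    """1 x 1 convolution: the same channel-mixing matrix at every position."""
    return x @ kernel


rng = np.random.default_rng(0)
channels, kernel_size = 4, 3  # small illustrative sizes
x = rng.normal(size=(6, 6, channels))
dw_kernel = rng.normal(size=(kernel_size, kernel_size, channels))
pw_kernel = rng.normal(size=(channels, channels))

dw_out = depthwise_conv(x, dw_kernel)
pw_out = pointwise_conv(dw_out, pw_kernel)
print(dw_out.shape, pw_out.shape)
```

In this sketch, changing one input channel only changes the same channel of the depthwise output, and changing one spatial position only changes the same position of the pointwise output. Stacking the two is what lets information flow across both space and channels.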

def activation_block(x):
    x = layers.Activation("gelu")(x)
    return layers.BatchNormalization()(x)


def conv_stem(x, filters: int, patch_size: int):
    x = layers.Conv2D(filters, kernel_size=patch_size, strides=patch_size)(x)
    return activation_block(x)


def conv_mixer_block(x, filters: int, kernel_size: int):
    # Depthwise convolution.
    x0 = x
    x = layers.DepthwiseConv2D(kernel_size=kernel_size, padding="same")(x)
    x = layers.Add()([activation_block(x), x0])  # Residual.

    # Pointwise convolution.
    x = layers.Conv2D(filters, kernel_size=1)(x)
    x = activation_block(x)

    return x


def get_conv_mixer_256_8(
    image_size=32, filters=256, depth=8, kernel_size=5, patch_size=2, num_classes=10
):
    """ConvMixer-256/8: https://openreview.net/pdf?id=TVHS5Y4dNvM.
    The hyperparameter values are taken from the paper.
    """
    inputs = keras.Input((image_size, image_size, 3))
    x = layers.Rescaling(scale=1.0 / 255)(inputs)

    # Extract patch embeddings.
    x = conv_stem(x, filters, patch_size)

    # ConvMixer blocks.
    for _ in range(depth):
        x = conv_mixer_block(x, filters, kernel_size)

    # Classification block.
    x = layers.GlobalAvgPool2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    return keras.Model(inputs, outputs)

The model used in this experiment is termed ConvMixer-256/8, where 256 denotes the number of channels and 8 denotes the depth. The resulting model has only 0.8 million parameters.
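The feature-map shapes inside this model can be traced by hand. The sketch below uses the default arguments of `get_conv_mixer_256_8` and standard convolution arithmetic; the receptive-field figure is the theoretical value implied by the kernel sizes and strides, which is an assumption of this illustration rather than something measured on the trained model.

```python
image_size, patch_size = 32, 2
filters, depth, kernel_size = 256, 8, 5

shapes = [(image_size, image_size, 3)]

# Patch-embedding stem: kernel size == stride == patch_size, no padding.
grid = (image_size - patch_size) // patch_size + 1
shapes.append((grid, grid, filters))

# Every ConvMixer block uses stride 1 and "same" padding, so the feature
# map keeps its shape from block to block (the isotropic design).
for _ in range(depth):
    shapes.append((grid, grid, filters))

# Theoretical receptive field of one output position, in input pixels.
receptive_field = patch_size  # one stem output sees exactly one patch
jump = patch_size  # input pixels between neighbouring positions
for _ in range(depth):
    # Only the depthwise convolutions widen the receptive field;
    # the 1 x 1 pointwise convolutions leave it unchanged.
    receptive_field += (kernel_size - 1) * jump

print(shapes[0], "->", shapes[1], "->", shapes[-1])
print(receptive_field)
```

Under these assumptions every block operates on a 16 x 16 x 256 feature map, and after eight blocks the theoretical receptive field (66 pixels) is wider than the 32-pixel input.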

Model training and evaluation utility

# Code reference:
# https://keras.io/examples/vision/image_classification_with_vision_transformer/.


def run_experiment(model):
    optimizer = tfa.optimizers.AdamW(
        learning_rate=learning_rate, weight_decay=weight_decay
    )

    model.compile(
        optimizer=optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

    checkpoint_filepath = "/tmp/checkpoint"
    checkpoint_callback = keras.callbacks.ModelCheckpoint(
        checkpoint_filepath,
        monitor="val_accuracy",
        save_best_only=True,
        save_weights_only=True,
    )

    history = model.fit(
        train_dataset,
        validation_data=val_dataset,
        epochs=num_epochs,
        callbacks=[checkpoint_callback],
    )

    model.load_weights(checkpoint_filepath)
    _, accuracy = model.evaluate(test_dataset)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return history, model
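As a reading aid for the loss values reported during training, here is a minimal per-sample sketch of sparse categorical cross-entropy, the loss named in `model.compile` above. It is a simplification: Keras averages over the batch and clips probabilities for numerical stability, both omitted here, and the function below is a stand-in written for this illustration.

```python
import math


def sparse_categorical_crossentropy(probabilities, label):
    """Loss for one sample: negative log-probability of the true class."""
    return -math.log(probabilities[label])


num_classes = 10
true_label = 3

# A model that assigns the same probability to every class.
uniform = [1.0 / num_classes] * num_classes
# A model that puts 90% of the probability mass on the correct class.
confident = [0.9 if i == true_label else 0.1 / 9 for i in range(num_classes)]

chance_loss = sparse_categorical_crossentropy(uniform, true_label)
confident_loss = sparse_categorical_crossentropy(confident, true_label)
print(round(chance_loss, 4))     # 2.3026, i.e. ln(10)
print(round(confident_loss, 4))  # 0.1054
```

A loss above ln(10) ≈ 2.30 on a 10-class problem therefore means the predictions are, on average, worse than a uniform guess, which typically happens when the model is confidently wrong.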

Train and evaluate the model

conv_mixer_model = get_conv_mixer_256_8()
history, conv_mixer_model = run_experiment(conv_mixer_model)

2021-10-17 03:44:01.291445: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)

Epoch 1/10

2021-10-17 03:44:04.721186: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8005

352/352 [==============================] - 29s 70ms/step - loss: 1.2272 - accuracy: 0.5592 - val_loss: 3.9422 - val_accuracy: 0.1196
Epoch 2/10
352/352 [==============================] - 24s 69ms/step - loss: 0.7813 - accuracy: 0.7278 - val_loss: 0.8860 - val_accuracy: 0.6898
Epoch 3/10
352/352 [==============================] - 24s 68ms/step - loss: 0.5947 - accuracy: 0.7943 - val_loss: 0.6175 - val_accuracy: 0.7856
Epoch 4/10
352/352 [==============================] - 24s 69ms/step - loss: 0.4801 - accuracy: 0.8330 - val_loss: 0.5634 - val_accuracy: 0.8064
Epoch 5/10
352/352 [==============================] - 24s 68ms/step - loss: 0.4065 - accuracy: 0.8599 - val_loss: 0.5359 - val_accuracy: 0.8166
Epoch 6/10
352/352 [==============================] - 24s 68ms/step - loss: 0.3473 - accuracy: 0.8804 - val_loss: 0.5257 - val_accuracy: 0.8228
Epoch 7/10
352/352 [==============================] - 24s 68ms/step - loss: 0.3071 - accuracy: 0.8944 - val_loss: 0.4982 - val_accuracy: 0.8264
Epoch 8/10
352/352 [==============================] - 24s 68ms/step - loss: 0.2655 - accuracy: 0.9083 - val_loss: 0.5032 - val_accuracy: 0.8346
Epoch 9/10
352/352 [==============================] - 24s 68ms/step - loss: 0.2328 - accuracy: 0.9194 - val_loss: 0.5225 - val_accuracy: 0.8326
Epoch 10/10
352/352 [==============================] - 24s 68ms/step - loss: 0.2115 - accuracy: 0.9278 - val_loss: 0.5063 - val_accuracy: 0.8372
79/79 [==============================] - 2s 19ms/step - loss: 0.5412 - accuracy: 0.8325
Test accuracy: 83.25%

(Translator's note: results of our own run)

Epoch 1/10
352/352 [==============================] - 67s 132ms/step - loss: 1.2052 - accuracy: 0.5668 - val_loss: 4.7296 - val_accuracy: 0.1062
Epoch 2/10
352/352 [==============================] - 46s 130ms/step - loss: 0.7737 - accuracy: 0.7288 - val_loss: 0.7785 - val_accuracy: 0.7268
Epoch 3/10
352/352 [==============================] - 46s 130ms/step - loss: 0.5985 - accuracy: 0.7933 - val_loss: 0.6307 - val_accuracy: 0.7792
Epoch 4/10
352/352 [==============================] - 46s 130ms/step - loss: 0.4844 - accuracy: 0.8325 - val_loss: 0.6092 - val_accuracy: 0.7904
Epoch 5/10
352/352 [==============================] - 46s 130ms/step - loss: 0.4094 - accuracy: 0.8584 - val_loss: 0.5261 - val_accuracy: 0.8190
Epoch 6/10
352/352 [==============================] - 46s 130ms/step - loss: 0.3503 - accuracy: 0.8792 - val_loss: 0.4990 - val_accuracy: 0.8304
Epoch 7/10
352/352 [==============================] - 46s 129ms/step - loss: 0.3073 - accuracy: 0.8942 - val_loss: 0.5283 - val_accuracy: 0.8266
Epoch 8/10
352/352 [==============================] - 46s 129ms/step - loss: 0.2676 - accuracy: 0.9080 - val_loss: 0.4914 - val_accuracy: 0.8370
Epoch 9/10
352/352 [==============================] - 45s 129ms/step - loss: 0.2317 - accuracy: 0.9205 - val_loss: 0.5179 - val_accuracy: 0.8316
Epoch 10/10
352/352 [==============================] - 46s 130ms/step - loss: 0.2188 - accuracy: 0.9237 - val_loss: 0.5427 - val_accuracy: 0.8306
79/79 [==============================] - 3s 33ms/step - loss: 0.5455 - accuracy: 0.8264
Test accuracy: 82.64%
CPU times: user 6min 32s, sys: 10.9 s, total: 6min 42s
Wall time: 11min 42s

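The gap between training and validation performance can be mitigated by using additional regularization techniques. Nevertheless, reaching ~83% accuracy within 10 epochs with 0.8 million parameters is a strong result.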
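The two training logs above can be summarized with a few lines of plain Python. The numbers are copied from the logs; `summarize` is a helper introduced for this illustration. It reports which epoch a checkpoint callback with `save_best_only=True` monitoring `val_accuracy` would end up keeping, and the gap between training and validation accuracy after the last epoch.

```python
# (final training accuracy, per-epoch validation accuracy) from the logs.
runs = {
    "original": (
        0.9278,
        [0.1196, 0.6898, 0.7856, 0.8064, 0.8166,
         0.8228, 0.8264, 0.8346, 0.8326, 0.8372],
    ),
    "translator": (
        0.9237,
        [0.1062, 0.7268, 0.7792, 0.7904, 0.8190,
         0.8304, 0.8266, 0.8370, 0.8316, 0.8306],
    ),
}


def summarize(train_accuracy, val_accuracy):
    # Epoch (1-based) with the highest validation accuracy: the weights
    # that are reloaded before evaluating on the test set.
    best_epoch = max(range(len(val_accuracy)), key=val_accuracy.__getitem__) + 1
    # Train/validation gap after the final epoch.
    gap = round(train_accuracy - val_accuracy[-1], 4)
    return best_epoch, gap


for name, (train_accuracy, val_accuracy) in runs.items():
    print(name, summarize(train_accuracy, val_accuracy))
```

In the first run the best checkpoint is the final epoch; in the translator's run it is epoch 8, so the reported test accuracy of that run comes from the epoch-8 weights. In both runs the final gap is roughly nine percentage points.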

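Visualizing the internals of ConvMixer

We can visualize the patch embeddings and the learned convolution filters. Recall that each patch embedding and each intermediate feature map has the same number of channels (256 in this case). This makes the visualization utility easier to implement.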

# Code reference: https://bit.ly/3awIRbP.


def visualization_plot(weights, idx=1):
    # First, apply min-max normalization to the
    # given weights to avoid isotropic scaling.
    p_min, p_max = weights.min(), weights.max()
    weights = (weights - p_min) / (p_max - p_min)

    # Visualize all the filters.
    num_filters = 256
    plt.figure(figsize=(8, 8))

    for i in range(num_filters):
        current_weight = weights[:, :, :, i]
        if current_weight.shape[-1] == 1:
            current_weight = current_weight.squeeze()
        ax = plt.subplot(16, 16, idx)
        ax.set_xticks([])
        ax.set_yticks([])
        plt.imshow(current_weight)
        idx += 1


# We first visualize the learned patch embeddings.
patch_embeddings = conv_mixer_model.layers[2].get_weights()[0]
visualization_plot(patch_embeddings)

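Even though we did not train the network to convergence, we can see that different patches show different patterns. Some share similarities with others, while some are very different. These visualizations are more salient with larger image sizes.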
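The array handling behind this plot can be checked on random data. The sketch below uses a random stand-in for the stem kernel, with the shape implied by 2 x 2 patches, RGB input and 256 filters, and applies the same min-max scaling as `visualization_plot`. It makes no claim about the trained weights themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for the stem kernel: (patch, patch, RGB, filters).
weights = rng.normal(size=(2, 2, 3, 256))

# Min-max scaling maps all weights jointly into [0, 1], the range
# matplotlib expects for floating-point RGB images.
w_min, w_max = weights.min(), weights.max()
scaled = (weights - w_min) / (w_max - w_min)

# Each filter is a tiny 2 x 2 RGB image; 256 filters fill a 16 x 16 grid.
first_filter = scaled[:, :, :, 0]
print(first_filter.shape)
print(scaled.min(), scaled.max())
print(16 * 16 == weights.shape[-1])
```

Because the scaling uses one global minimum and maximum, relative magnitudes between filters are preserved in the plot.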

Similarly, we can visualize the raw convolution kernels. This can help us understand the patterns to which a given kernel is receptive.

# First, print the indices of the convolution layers that are not
# pointwise convolutions.
for i, layer in enumerate(conv_mixer_model.layers):
    if isinstance(layer, layers.DepthwiseConv2D):
        if layer.get_config()["kernel_size"] == (5, 5):
            print(i, layer)

idx = 26  # Taking a kernel from the middle of the network.

kernel = conv_mixer_model.layers[idx].get_weights()[0]
kernel = np.expand_dims(kernel.squeeze(), axis=2)
visualization_plot(kernel)

5 <keras.layers.convolutional.DepthwiseConv2D object at 0x7f9e74854990>
12 <keras.layers.convolutional.DepthwiseConv2D object at 0x7f9e747df910>
19 <keras.layers.convolutional.DepthwiseConv2D object at 0x7f9e6c5c9e10>
26 <keras.layers.convolutional.DepthwiseConv2D object at 0x7f9e74906750>
33 <keras.layers.convolutional.DepthwiseConv2D object at 0x7f9e74902390>
40 <keras.layers.convolutional.DepthwiseConv2D object at 0x7f9e748ee690>
47 <keras.layers.convolutional.DepthwiseConv2D object at 0x7f9e7493dfd0>
54 <keras.layers.convolutional.DepthwiseConv2D object at 0x7f9e6c4e8a10>

We see that different filters in the kernel have different locality spans, and this pattern is likely to evolve with more training.
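Two details of the kernel-visualization snippet can be verified without a trained model: the reshaping applied to the depthwise kernel before plotting, and where `idx = 26` falls in the network. The sketch below uses a zero-filled stand-in with the weight layout of `DepthwiseConv2D` (kernel height, kernel width, channels, depth multiplier) and takes the layer indices from the printed list above.

```python
import numpy as np

# Stand-in for DepthwiseConv2D weights: (k, k, channels, depth_multiplier).
kernel = np.zeros((5, 5, 256, 1))

# squeeze() drops the trailing multiplier axis; expand_dims() inserts a
# single "colour" axis so each of the 256 filters is a 5 x 5 x 1 image.
kernel = np.expand_dims(kernel.squeeze(), axis=2)
print(kernel.shape)

# Depthwise layers start at index 5, and every ConvMixer block adds seven
# layers (depthwise, activation, BN, add, pointwise, activation, BN).
first_index, layers_per_block, depth = 5, 7, 8
depthwise_indices = [first_index + layers_per_block * b for b in range(depth)]
print(depthwise_indices)
print(depthwise_indices.index(26) + 1)  # block number of idx = 26
```

Index 26 is therefore the depthwise convolution of the fourth of the eight blocks.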

Final notes

There is a recent trend of fusing convolutions with other data-agnostic operations such as self-attention. The following works are along this line of research:
