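Keras 2 : examples : Image classification with Swin Transformers (translation/commentary)
Translation : ClassCat Co., Ltd. Sales Information
Created : 12/19/2021 (keras 2.7.0)
* This page is a translation of the following Keras documentation, with supplementary explanations added where appropriate: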
- Code examples : Computer Vision : Image classification with Swin Transformers (Author: Rishit Dagli)
* The sample code has been verified to run, but it has been extended or modified where necessary.
* You are welcome to link to this page freely, though we would appreciate a note to sales-info@classcat.com.
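Keras 2 : examples : Image classification with Swin Transformers
Description: Image classification using Swin Transformers, a general-purpose backbone for computer vision.
This example implements Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Liu et al. for image classification, and demonstrates it on the CIFAR-100 dataset.
Swin Transformer (Shifted Window Transformer) can serve as a general-purpose backbone for computer vision. It is a hierarchical Transformer whose representations are computed with shifted windows. The shifted window scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while still allowing cross-window connections. The architecture has the flexibility to model information at various scales and has linear computational complexity with respect to image size.
This example requires TensorFlow 2.5 or higher, as well as TensorFlow Addons, which can be installed using the following command: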
!pip install -U tensorflow-addons
Setup
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow import keras
from tensorflow.keras import layers
Prepare the data
We load the CIFAR-100 dataset through tf.keras.datasets, normalize the images, and convert the integer labels to one-hot encoded vectors.
num_classes = 100
input_shape = (32, 32, 3)
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(f"x_train shape: {x_train.shape} - y_train shape: {y_train.shape}")
print(f"x_test shape: {x_test.shape} - y_test shape: {y_test.shape}")
plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(x_train[i])
plt.show()
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169009152/169001437 [==============================] - 3s 0us/step
169017344/169001437 [==============================] - 3s 0us/step
x_train shape: (50000, 32, 32, 3) - y_train shape: (50000, 100)
x_test shape: (10000, 32, 32, 3) - y_test shape: (10000, 100)
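The two preprocessing steps in this section (scaling pixels to [0, 1] and one-hot encoding the labels) can be illustrated with a small NumPy sketch. It uses hypothetical toy arrays rather than the real dataset, and `to_one_hot` is an illustrative stand-in for what `keras.utils.to_categorical` does, not its actual implementation.

```python
import numpy as np


def to_one_hot(labels, num_classes):
    """Turn integer labels of shape (N,) or (N, 1) into one-hot rows of shape (N, num_classes)."""
    labels = np.asarray(labels).reshape(-1)
    one_hot = np.zeros((labels.size, num_classes), dtype="float32")
    one_hot[np.arange(labels.size), labels] = 1.0
    return one_hot


# Hypothetical toy data standing in for CIFAR-100: 4 images, labels stored as (N, 1).
rng = np.random.default_rng(0)
toy_images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
toy_labels = np.array([[3], [0], [99], [42]])

toy_images_scaled = toy_images / 255.0  # pixel values now lie in [0, 1]
toy_targets = to_one_hot(toy_labels, num_classes=100)

print(toy_images_scaled.min() >= 0.0, toy_images_scaled.max() <= 1.0)
print(toy_targets.shape, toy_targets.argmax(axis=1))
```

Each row of `toy_targets` contains a single 1 at the position of the class index, which is the target format that a categorical cross-entropy loss expects.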
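Configure the hyperparameters
A key parameter to pick is patch_size, the size of the input patches. To use each pixel as an individual input, you can set patch_size to (1, 1). Below, we take inspiration from the original paper's settings for training on ImageNet-1K and keep most of those settings for this example.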
patch_size = (2, 2) # 2-by-2 sized patches
dropout_rate = 0.03 # Dropout rate
num_heads = 8 # Attention heads
embed_dim = 64 # Embedding dimension
num_mlp = 256 # MLP layer size
qkv_bias = True # Convert embedded patches to query, key, and values with a learnable additive value
window_size = 2 # Size of attention window
shift_size = 1 # Size of shifting window
image_dimension = 32 # Initial image size
num_patch_x = input_shape[0] // patch_size[0]
num_patch_y = input_shape[1] // patch_size[1]
learning_rate = 1e-3
batch_size = 128
num_epochs = 40
validation_split = 0.1
weight_decay = 0.0001
label_smoothing = 0.1
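To make the configuration more concrete, the following sketch derives a few quantities that follow from these settings by plain arithmetic. The values are copied from the configuration above; names such as `num_windows` and `head_dim` are introduced here for illustration only.

```python
# Values copied from the configuration above.
input_shape = (32, 32, 3)
patch_size = (2, 2)
window_size = 2
embed_dim = 64
num_heads = 8

num_patch_x = input_shape[0] // patch_size[0]  # patches along the first image axis
num_patch_y = input_shape[1] // patch_size[1]  # patches along the second image axis
num_tokens = num_patch_x * num_patch_y  # tokens fed to the Transformer blocks

# Illustrative names, not part of the original example:
num_windows = (num_patch_x // window_size) * (num_patch_y // window_size)
tokens_per_window = window_size * window_size
head_dim = embed_dim // num_heads  # channels handled by each attention head

print(num_patch_x, num_patch_y, num_tokens)
print(num_windows, tokens_per_window, head_dim)
```

With these settings each image becomes a 16 x 16 grid of tokens, attention runs inside 64 windows of 4 tokens each, and every head works on 8 channels.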
Helper functions
We create two helper functions, together with a DropPath layer, to help us get a sequence of patches from the image, merge patches, and apply dropout.
def window_partition(x, window_size):
    _, height, width, channels = x.shape
    patch_num_y = height // window_size
    patch_num_x = width // window_size
    x = tf.reshape(
        x, shape=(-1, patch_num_y, window_size, patch_num_x, window_size, channels)
    )
    x = tf.transpose(x, (0, 1, 3, 2, 4, 5))
    windows = tf.reshape(x, shape=(-1, window_size, window_size, channels))
    return windows


def window_reverse(windows, window_size, height, width, channels):
    patch_num_y = height // window_size
    patch_num_x = width // window_size
    x = tf.reshape(
        windows,
        shape=(-1, patch_num_y, patch_num_x, window_size, window_size, channels),
    )
    x = tf.transpose(x, perm=(0, 1, 3, 2, 4, 5))
    x = tf.reshape(x, shape=(-1, height, width, channels))
    return x
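The reshape/transpose pattern in the two functions above may be easier to follow on a tiny array. The sketch below mirrors them in NumPy (the `_np` names are illustrative, not part of the original example) and checks that partitioning followed by reversing returns the original feature map.

```python
import numpy as np


def window_partition_np(x, window_size):
    """Split (B, H, W, C) into non-overlapping windows of shape (ws, ws, C)."""
    batch, height, width, channels = x.shape
    rows, cols = height // window_size, width // window_size
    x = x.reshape(batch, rows, window_size, cols, window_size, channels)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # bring the two window-grid axes together
    return x.reshape(-1, window_size, window_size, channels)


def window_reverse_np(windows, window_size, height, width, channels):
    """Inverse of window_partition_np: stitch windows back into (B, H, W, C)."""
    rows, cols = height // window_size, width // window_size
    x = windows.reshape(-1, rows, cols, window_size, window_size, channels)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, height, width, channels)


feature_map = np.arange(16).reshape(1, 4, 4, 1)  # values 0..15 laid out row by row
windows = window_partition_np(feature_map, window_size=2)
restored = window_reverse_np(windows, 2, 4, 4, 1)

print(windows.shape)        # (4, 2, 2, 1)
print(windows[0, :, :, 0])  # top-left window: [[0 1] [4 5]]
print(np.array_equal(restored, feature_map))  # True
```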


class DropPath(layers.Layer):
    def __init__(self, drop_prob=None, **kwargs):
        super(DropPath, self).__init__(**kwargs)
        self.drop_prob = drop_prob

    def call(self, x):
        input_shape = tf.shape(x)
        batch_size = input_shape[0]
        rank = x.shape.rank
        shape = (batch_size,) + (1,) * (rank - 1)
        random_tensor = (1 - self.drop_prob) + tf.random.uniform(shape, dtype=x.dtype)
        path_mask = tf.floor(random_tensor)
        output = tf.math.divide(x, 1 - self.drop_prob) * path_mask
        return output
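The DropPath layer drops whole samples (entire residual branches) rather than individual activations. A minimal NumPy sketch of the same arithmetic, with the illustrative name `drop_path_np`, shows that each sample is either zeroed or rescaled by 1 / (1 - drop_prob), so the expected value is preserved.

```python
import numpy as np


def drop_path_np(x, drop_prob, rng):
    """Zero out whole samples with probability drop_prob and rescale the survivors."""
    keep_prob = 1.0 - drop_prob
    # One random draw per sample, broadcast over all remaining axes.
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    # floor(keep_prob + U[0, 1)) equals 1 with probability keep_prob, else 0.
    path_mask = np.floor(keep_prob + rng.uniform(size=shape))
    return x / keep_prob * path_mask


rng = np.random.default_rng(0)
x = np.ones((10000, 4, 8))
out = drop_path_np(x, drop_prob=0.2, rng=rng)
per_sample = out[:, 0, 0]

print(np.unique(np.round(per_sample, 6)))  # only two distinct values appear
print(per_sample.mean())                   # close to 1.0
```

Note that the layer as written above has no `training` argument, so the mask appears to be applied whenever the layer is called, including at inference time.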
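Window based multi-head self-attention
Usually Transformers perform global self-attention, where the relationships between a token and all other tokens are computed. This global computation has quadratic complexity with respect to the number of tokens. Here, as the original paper suggests, we compute self-attention within local windows, in a non-overlapping manner. Whereas global self-attention is quadratic in the number of patches, window-based self-attention is linear in it and therefore scales easily.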
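As a rough illustration of the quadratic versus linear claim, the sketch below counts query-key pairs only, assuming cost is proportional to that count and ignoring projections and the MLP. The function names are illustrative.

```python
def global_attention_pairs(num_tokens):
    """Every token attends to every token: N * N query-key pairs."""
    return num_tokens * num_tokens


def window_attention_pairs(num_tokens, tokens_per_window):
    """Tokens attend only within their own window: (N / M) windows of M * M pairs."""
    num_windows = num_tokens // tokens_per_window
    return num_windows * tokens_per_window * tokens_per_window


for side in (16, 32, 64):  # token grid of side x side
    n = side * side
    print(side, global_attention_pairs(n), window_attention_pairs(n, tokens_per_window=4))
```

Doubling the grid side multiplies the token count by 4; the global count then grows by a factor of 16, while the windowed count grows only by a factor of 4.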
class WindowAttention(layers.Layer):
    def __init__(
        self, dim, window_size, num_heads, qkv_bias=True, dropout_rate=0.0, **kwargs
    ):
        super(WindowAttention, self).__init__(**kwargs)
        self.dim = dim
        self.window_size = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = layers.Dense(dim * 3, use_bias=qkv_bias)
        self.dropout = layers.Dropout(dropout_rate)
        self.proj = layers.Dense(dim)

    def build(self, input_shape):
        num_window_elements = (2 * self.window_size[0] - 1) * (
            2 * self.window_size[1] - 1
        )
        self.relative_position_bias_table = self.add_weight(
            shape=(num_window_elements, self.num_heads),
            initializer=tf.initializers.Zeros(),
            trainable=True,
        )
        coords_h = np.arange(self.window_size[0])
        coords_w = np.arange(self.window_size[1])
        coords_matrix = np.meshgrid(coords_h, coords_w, indexing="ij")
        coords = np.stack(coords_matrix)
        coords_flatten = coords.reshape(2, -1)
        relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]
        relative_coords = relative_coords.transpose([1, 2, 0])
        relative_coords[:, :, 0] += self.window_size[0] - 1
        relative_coords[:, :, 1] += self.window_size[1] - 1
        relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1
        relative_position_index = relative_coords.sum(-1)

        self.relative_position_index = tf.Variable(
            initial_value=tf.convert_to_tensor(relative_position_index), trainable=False
        )

    def call(self, x, mask=None):
        _, size, channels = x.shape
        head_dim = channels // self.num_heads
        x_qkv = self.qkv(x)
        x_qkv = tf.reshape(x_qkv, shape=(-1, size, 3, self.num_heads, head_dim))
        x_qkv = tf.transpose(x_qkv, perm=(2, 0, 3, 1, 4))
        q, k, v = x_qkv[0], x_qkv[1], x_qkv[2]
        q = q * self.scale
        k = tf.transpose(k, perm=(0, 1, 3, 2))
        attn = q @ k

        num_window_elements = self.window_size[0] * self.window_size[1]
        relative_position_index_flat = tf.reshape(
            self.relative_position_index, shape=(-1,)
        )
        relative_position_bias = tf.gather(
            self.relative_position_bias_table, relative_position_index_flat
        )
        relative_position_bias = tf.reshape(
            relative_position_bias, shape=(num_window_elements, num_window_elements, -1)
        )
        relative_position_bias = tf.transpose(relative_position_bias, perm=(2, 0, 1))
        attn = attn + tf.expand_dims(relative_position_bias, axis=0)

        if mask is not None:
            nW = mask.get_shape()[0]
            mask_float = tf.cast(
                tf.expand_dims(tf.expand_dims(mask, axis=1), axis=0), tf.float32
            )
            attn = (
                tf.reshape(attn, shape=(-1, nW, self.num_heads, size, size))
                + mask_float
            )
            attn = tf.reshape(attn, shape=(-1, self.num_heads, size, size))
            attn = keras.activations.softmax(attn, axis=-1)
        else:
            attn = keras.activations.softmax(attn, axis=-1)
        attn = self.dropout(attn)

        x_qkv = attn @ v
        x_qkv = tf.transpose(x_qkv, perm=(0, 2, 1, 3))
        x_qkv = tf.reshape(x_qkv, shape=(-1, size, channels))
        x_qkv = self.proj(x_qkv)
        x_qkv = self.dropout(x_qkv)
        return x_qkv
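The index arithmetic in `build` maps every (query, key) pair inside a window to one row of the relative position bias table. The sketch below repeats those NumPy steps as a standalone function (the function name is illustrative) for the 2 x 2 window used in this example.

```python
import numpy as np


def relative_position_index(window_h, window_w):
    """For each (query, key) token pair in a window, return the bias-table row to use."""
    coords = np.stack(
        np.meshgrid(np.arange(window_h), np.arange(window_w), indexing="ij")
    )
    coords_flatten = coords.reshape(2, -1)  # row 0: h coordinates, row 1: w coordinates
    relative = coords_flatten[:, :, None] - coords_flatten[:, None, :]
    relative = relative.transpose(1, 2, 0).copy()  # (tokens, tokens, 2)
    relative[:, :, 0] += window_h - 1  # shift offsets so they start at 0
    relative[:, :, 1] += window_w - 1
    relative[:, :, 0] *= 2 * window_w - 1  # flatten (dh, dw) into a single index
    return relative.sum(-1)


index = relative_position_index(2, 2)
print(index)
# [[4 3 1 0]
#  [5 4 2 1]
#  [7 6 4 3]
#  [8 7 5 4]]
```

The diagonal is constant (zero offset always maps to the same row), and the indices range over 0..8, matching the (2 * 2 - 1) * (2 * 2 - 1) = 9 rows allocated for `relative_position_bias_table`.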
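The complete Swin Transformer model
Finally, we put together the complete Swin Transformer by replacing the standard multi-head attention (MHA) with shifted windows attention. As suggested in the original paper, we create a model comprising a shifted window-based MHA layer followed by a 2-layer MLP with GELU nonlinearity in between, applying LayerNormalization before each attention layer and each MLP, and a residual connection after each of these layers.
Notice that we only create a simple MLP with 2 Dense and 2 Dropout layers. Often you will see models using ResNet-50 as the MLP, which is quite standard in the literature. However, in this paper the authors use a 2-layer MLP with GELU nonlinearity in between.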
class SwinTransformer(layers.Layer):
    def __init__(
        self,
        dim,
        num_patch,
        num_heads,
        window_size=7,
        shift_size=0,
        num_mlp=1024,
        qkv_bias=True,
        dropout_rate=0.0,
        **kwargs,
    ):
        super(SwinTransformer, self).__init__(**kwargs)

        self.dim = dim  # number of input dimensions
        self.num_patch = num_patch  # number of embedded patches
        self.num_heads = num_heads  # number of attention heads
        self.window_size = window_size  # size of window
        self.shift_size = shift_size  # size of window shift
        self.num_mlp = num_mlp  # number of MLP nodes

        self.norm1 = layers.LayerNormalization(epsilon=1e-5)
        self.attn = WindowAttention(
            dim,
            window_size=(self.window_size, self.window_size),
            num_heads=num_heads,
            qkv_bias=qkv_bias,
            dropout_rate=dropout_rate,
        )
        self.drop_path = DropPath(dropout_rate)
        self.norm2 = layers.LayerNormalization(epsilon=1e-5)

        self.mlp = keras.Sequential(
            [
                layers.Dense(num_mlp),
                layers.Activation(keras.activations.gelu),
                layers.Dropout(dropout_rate),
                layers.Dense(dim),
                layers.Dropout(dropout_rate),
            ]
        )

        if min(self.num_patch) < self.window_size:
            self.shift_size = 0
            self.window_size = min(self.num_patch)

    def build(self, input_shape):
        if self.shift_size == 0:
            self.attn_mask = None
        else:
            height, width = self.num_patch
            h_slices = (
                slice(0, -self.window_size),
                slice(-self.window_size, -self.shift_size),
                slice(-self.shift_size, None),
            )
            w_slices = (
                slice(0, -self.window_size),
                slice(-self.window_size, -self.shift_size),
                slice(-self.shift_size, None),
            )
            mask_array = np.zeros((1, height, width, 1))
            count = 0
            for h in h_slices:
                for w in w_slices:
                    mask_array[:, h, w, :] = count
                    count += 1
            mask_array = tf.convert_to_tensor(mask_array)

            # mask array to windows
            mask_windows = window_partition(mask_array, self.window_size)
            mask_windows = tf.reshape(
                mask_windows, shape=[-1, self.window_size * self.window_size]
            )
            attn_mask = tf.expand_dims(mask_windows, axis=1) - tf.expand_dims(
                mask_windows, axis=2
            )
            attn_mask = tf.where(attn_mask != 0, -100.0, attn_mask)
            attn_mask = tf.where(attn_mask == 0, 0.0, attn_mask)
            self.attn_mask = tf.Variable(initial_value=attn_mask, trainable=False)

    def call(self, x):
        height, width = self.num_patch
        _, num_patches_before, channels = x.shape
        x_skip = x
        x = self.norm1(x)
        x = tf.reshape(x, shape=(-1, height, width, channels))
        if self.shift_size > 0:
            shifted_x = tf.roll(
                x, shift=[-self.shift_size, -self.shift_size], axis=[1, 2]
            )
        else:
            shifted_x = x

        x_windows = window_partition(shifted_x, self.window_size)
        x_windows = tf.reshape(
            x_windows, shape=(-1, self.window_size * self.window_size, channels)
        )
        attn_windows = self.attn(x_windows, mask=self.attn_mask)

        attn_windows = tf.reshape(
            attn_windows, shape=(-1, self.window_size, self.window_size, channels)
        )
        shifted_x = window_reverse(
            attn_windows, self.window_size, height, width, channels
        )
        if self.shift_size > 0:
            x = tf.roll(
                shifted_x, shift=[self.shift_size, self.shift_size], axis=[1, 2]
            )
        else:
            x = shifted_x

        x = tf.reshape(x, shape=(-1, height * width, channels))
        x = self.drop_path(x)
        x = x_skip + x
        x_skip = x
        x = self.norm2(x)
        x = self.mlp(x)
        x = self.drop_path(x)
        x = x_skip + x
        return x
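The attention mask built in `build` is easier to inspect on a small grid. The sketch below reproduces the same steps in NumPy for a hypothetical 4 x 4 token grid with window size 2 and shift size 1 (the function name and the grid size are chosen for illustration): tokens are labelled by region, the label map is cut into windows, and any pair of tokens with different labels receives -100 so that softmax effectively ignores it.

```python
import numpy as np


def shifted_window_mask(height, width, window_size, shift_size):
    """Return the region label map and the per-window additive attention mask."""
    region_ids = np.zeros((height, width))
    slices = (
        slice(0, -window_size),
        slice(-window_size, -shift_size),
        slice(-shift_size, None),
    )
    count = 0
    for h in slices:
        for w in slices:
            region_ids[h, w] = count
            count += 1
    # Cut the label map into non-overlapping windows, one flattened row per window.
    rows, cols = height // window_size, width // window_size
    windows = region_ids.reshape(rows, window_size, cols, window_size)
    windows = windows.transpose(0, 2, 1, 3).reshape(-1, window_size * window_size)
    # Pairs of tokens from different regions get a large negative bias.
    diff = windows[:, None, :] - windows[:, :, None]
    return region_ids, np.where(diff != 0, -100.0, 0.0)


region_ids, mask = shifted_window_mask(height=4, width=4, window_size=2, shift_size=1)
print(region_ids)
print(mask.shape)  # (4, 4, 4): one 4 x 4 mask per window
print(mask[0])     # first window lies in a single region, so nothing is masked
print(mask[3])     # last window mixes four regions, so only the diagonal is kept
```

After the cyclic shift performed by `tf.roll`, the windows along the bottom and right edges contain tokens that were not neighbours in the original image; the mask keeps those tokens from attending to each other.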
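Model training and evaluation
Extract and embed patches
We first create 3 layers to help us with extracting, embedding and merging patches from the images, on top of which we will later use the Swin Transformer class we built.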
class PatchExtract(layers.Layer):
    def __init__(self, patch_size, **kwargs):
        super(PatchExtract, self).__init__(**kwargs)
        self.patch_size_x = patch_size[0]
        # Use the second entry for the second axis (same value for square patches).
        self.patch_size_y = patch_size[1]

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=(1, self.patch_size_x, self.patch_size_y, 1),
            strides=(1, self.patch_size_x, self.patch_size_y, 1),
            rates=(1, 1, 1, 1),
            padding="VALID",
        )
        patch_dim = patches.shape[-1]
        # Note: this assumes a square grid of patches.
        patch_num = patches.shape[1]
        return tf.reshape(patches, (batch_size, patch_num * patch_num, patch_dim))


class PatchEmbedding(layers.Layer):
    def __init__(self, num_patch, embed_dim, **kwargs):
        super(PatchEmbedding, self).__init__(**kwargs)
        self.num_patch = num_patch
        self.proj = layers.Dense(embed_dim)
        self.pos_embed = layers.Embedding(input_dim=num_patch, output_dim=embed_dim)

    def call(self, patch):
        pos = tf.range(start=0, limit=self.num_patch, delta=1)
        return self.proj(patch) + self.pos_embed(pos)


class PatchMerging(tf.keras.layers.Layer):
    def __init__(self, num_patch, embed_dim):
        super(PatchMerging, self).__init__()
        self.num_patch = num_patch
        self.embed_dim = embed_dim
        self.linear_trans = layers.Dense(2 * embed_dim, use_bias=False)

    def call(self, x):
        height, width = self.num_patch
        _, _, C = x.get_shape().as_list()
        x = tf.reshape(x, shape=(-1, height, width, C))
        x0 = x[:, 0::2, 0::2, :]
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = tf.concat((x0, x1, x2, x3), axis=-1)
        x = tf.reshape(x, shape=(-1, (height // 2) * (width // 2), 4 * C))
        return self.linear_trans(x)
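The shape changes performed by patch extraction and patch merging can be checked with a NumPy sketch. The `_np` functions below are illustrative re-implementations of the reshaping only (no learned projections), and the zero arrays are hypothetical stand-ins for real data.

```python
import numpy as np


def extract_patches_np(images, patch_size):
    """(B, H, W, C) -> (B, num_patches, patch_h * patch_w * C) for non-overlapping patches."""
    batch, height, width, channels = images.shape
    ph, pw = patch_size
    x = images.reshape(batch, height // ph, ph, width // pw, pw, channels)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(batch, (height // ph) * (width // pw), ph * pw * channels)


def merge_patches_np(tokens, grid):
    """Concatenate each 2 x 2 group of neighbouring tokens along the channel axis."""
    height, width = grid
    channels = tokens.shape[-1]
    x = tokens.reshape(-1, height, width, channels)
    x = np.concatenate(
        (x[:, 0::2, 0::2], x[:, 1::2, 0::2], x[:, 0::2, 1::2], x[:, 1::2, 1::2]),
        axis=-1,
    )
    return x.reshape(-1, (height // 2) * (width // 2), 4 * channels)


images = np.zeros((8, 32, 32, 3), dtype="float32")  # stand-in batch of images
patches = extract_patches_np(images, (2, 2))
embedded = np.zeros((8, 256, 64), dtype="float32")  # stand-in for PatchEmbedding output
merged = merge_patches_np(embedded, (16, 16))

print(patches.shape)  # (8, 256, 12)
print(merged.shape)   # (8, 64, 256)
```

In the PatchMerging layer above, the concatenated 4 * C channels are then projected down to 2 * embed_dim by the bias-free Dense layer, so the token count drops by a factor of 4 while the channel width doubles.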
Build the model
We put together the Swin Transformer model.
input = layers.Input(input_shape)
x = layers.RandomCrop(image_dimension, image_dimension)(input)
x = layers.RandomFlip("horizontal")(x)
x = PatchExtract(patch_size)(x)
x = PatchEmbedding(num_patch_x * num_patch_y, embed_dim)(x)
x = SwinTransformer(
    dim=embed_dim,
    num_patch=(num_patch_x, num_patch_y),
    num_heads=num_heads,
    window_size=window_size,
    shift_size=0,
    num_mlp=num_mlp,
    qkv_bias=qkv_bias,
    dropout_rate=dropout_rate,
)(x)
x = SwinTransformer(
    dim=embed_dim,
    num_patch=(num_patch_x, num_patch_y),
    num_heads=num_heads,
    window_size=window_size,
    shift_size=shift_size,
    num_mlp=num_mlp,
    qkv_bias=qkv_bias,
    dropout_rate=dropout_rate,
)(x)
x = PatchMerging((num_patch_x, num_patch_y), embed_dim=embed_dim)(x)
x = layers.GlobalAveragePooling1D()(x)
output = layers.Dense(num_classes, activation="softmax")(x)
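As a reading aid, the per-sample tensor shapes through the model above can be traced by hand. The sketch below only does the arithmetic implied by the layers (it does not build or run the Keras model), and `trace_shapes` is an illustrative helper name.

```python
def trace_shapes(image_size=32, patch=2, embed_dim=64, num_classes=100):
    """Return (stage, per-sample shape) pairs implied by the layer definitions above."""
    tokens = (image_size // patch) ** 2
    return [
        ("input", (image_size, image_size, 3)),
        ("PatchExtract", (tokens, patch * patch * 3)),
        ("PatchEmbedding", (tokens, embed_dim)),
        ("SwinTransformer (shift_size=0)", (tokens, embed_dim)),
        ("SwinTransformer (shifted)", (tokens, embed_dim)),
        ("PatchMerging", (tokens // 4, 2 * embed_dim)),
        ("GlobalAveragePooling1D", (2 * embed_dim,)),
        ("Dense softmax", (num_classes,)),
    ]


stages = trace_shapes()
for name, shape in stages:
    print(f"{name:32s} {shape}")
```

The two Swin Transformer blocks leave the token grid unchanged; only PatchMerging reduces the number of tokens before pooling.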
Train on CIFAR-100
We train the model on CIFAR-100. In this example we train for only 40 epochs to keep the training time short. In practice, you should train for 150 epochs to reach convergence.
model = keras.Model(input, output)
model.compile(
    loss=keras.losses.CategoricalCrossentropy(label_smoothing=label_smoothing),
    optimizer=tfa.optimizers.AdamW(
        learning_rate=learning_rate, weight_decay=weight_decay
    ),
    metrics=[
        keras.metrics.CategoricalAccuracy(name="accuracy"),
        keras.metrics.TopKCategoricalAccuracy(5, name="top-5-accuracy"),
    ],
)

history = model.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    epochs=num_epochs,
    validation_split=validation_split,
)
2021-09-13 08:03:23.935873: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/40
352/352 [==============================] - 19s 34ms/step - loss: 4.1679 - accuracy: 0.0817 - top-5-accuracy: 0.2551 - val_loss: 3.8964 - val_accuracy: 0.1242 - val_top-5-accuracy: 0.3568
Epoch 2/40
352/352 [==============================] - 11s 32ms/step - loss: 3.7278 - accuracy: 0.1617 - top-5-accuracy: 0.4246 - val_loss: 3.6518 - val_accuracy: 0.1756 - val_top-5-accuracy: 0.4580
Epoch 3/40
352/352 [==============================] - 11s 32ms/step - loss: 3.5245 - accuracy: 0.2077 - top-5-accuracy: 0.4946 - val_loss: 3.4609 - val_accuracy: 0.2248 - val_top-5-accuracy: 0.5222
Epoch 4/40
352/352 [==============================] - 11s 32ms/step - loss: 3.3856 - accuracy: 0.2408 - top-5-accuracy: 0.5430 - val_loss: 3.3515 - val_accuracy: 0.2514 - val_top-5-accuracy: 0.5540
Epoch 5/40
352/352 [==============================] - 11s 32ms/step - loss: 3.2772 - accuracy: 0.2697 - top-5-accuracy: 0.5760 - val_loss: 3.3012 - val_accuracy: 0.2712 - val_top-5-accuracy: 0.5758
Epoch 6/40
352/352 [==============================] - 11s 32ms/step - loss: 3.1845 - accuracy: 0.2915 - top-5-accuracy: 0.6071 - val_loss: 3.2104 - val_accuracy: 0.2866 - val_top-5-accuracy: 0.5994
Epoch 7/40
352/352 [==============================] - 11s 32ms/step - loss: 3.1104 - accuracy: 0.3126 - top-5-accuracy: 0.6288 - val_loss: 3.1408 - val_accuracy: 0.3038 - val_top-5-accuracy: 0.6176
Epoch 8/40
352/352 [==============================] - 11s 32ms/step - loss: 3.0616 - accuracy: 0.3268 - top-5-accuracy: 0.6423 - val_loss: 3.0853 - val_accuracy: 0.3138 - val_top-5-accuracy: 0.6408
Epoch 9/40
352/352 [==============================] - 11s 31ms/step - loss: 3.0237 - accuracy: 0.3349 - top-5-accuracy: 0.6541 - val_loss: 3.0882 - val_accuracy: 0.3130 - val_top-5-accuracy: 0.6370
Epoch 10/40
352/352 [==============================] - 11s 31ms/step - loss: 2.9877 - accuracy: 0.3438 - top-5-accuracy: 0.6649 - val_loss: 3.0532 - val_accuracy: 0.3298 - val_top-5-accuracy: 0.6482
Epoch 11/40
352/352 [==============================] - 11s 31ms/step - loss: 2.9571 - accuracy: 0.3520 - top-5-accuracy: 0.6712 - val_loss: 3.0547 - val_accuracy: 0.3320 - val_top-5-accuracy: 0.6450
Epoch 12/40
352/352 [==============================] - 11s 31ms/step - loss: 2.9238 - accuracy: 0.3640 - top-5-accuracy: 0.6798 - val_loss: 2.9833 - val_accuracy: 0.3462 - val_top-5-accuracy: 0.6602
Epoch 13/40
352/352 [==============================] - 11s 31ms/step - loss: 2.9048 - accuracy: 0.3674 - top-5-accuracy: 0.6869 - val_loss: 2.9779 - val_accuracy: 0.3458 - val_top-5-accuracy: 0.6724
Epoch 14/40
352/352 [==============================] - 11s 31ms/step - loss: 2.8822 - accuracy: 0.3717 - top-5-accuracy: 0.6923 - val_loss: 2.9549 - val_accuracy: 0.3552 - val_top-5-accuracy: 0.6748
Epoch 15/40
352/352 [==============================] - 11s 31ms/step - loss: 2.8578 - accuracy: 0.3826 - top-5-accuracy: 0.6981 - val_loss: 2.9447 - val_accuracy: 0.3584 - val_top-5-accuracy: 0.6786
Epoch 16/40
352/352 [==============================] - 11s 31ms/step - loss: 2.8404 - accuracy: 0.3852 - top-5-accuracy: 0.7024 - val_loss: 2.9087 - val_accuracy: 0.3650 - val_top-5-accuracy: 0.6842
Epoch 17/40
352/352 [==============================] - 11s 31ms/step - loss: 2.8234 - accuracy: 0.3910 - top-5-accuracy: 0.7076 - val_loss: 2.8884 - val_accuracy: 0.3748 - val_top-5-accuracy: 0.6868
Epoch 18/40
352/352 [==============================] - 11s 31ms/step - loss: 2.8014 - accuracy: 0.3974 - top-5-accuracy: 0.7124 - val_loss: 2.8979 - val_accuracy: 0.3696 - val_top-5-accuracy: 0.6908
Epoch 19/40
352/352 [==============================] - 11s 31ms/step - loss: 2.7928 - accuracy: 0.3961 - top-5-accuracy: 0.7172 - val_loss: 2.8873 - val_accuracy: 0.3756 - val_top-5-accuracy: 0.6924
Epoch 20/40
352/352 [==============================] - 11s 31ms/step - loss: 2.7800 - accuracy: 0.4026 - top-5-accuracy: 0.7186 - val_loss: 2.8544 - val_accuracy: 0.3834 - val_top-5-accuracy: 0.7004
Epoch 21/40
352/352 [==============================] - 11s 31ms/step - loss: 2.7659 - accuracy: 0.4095 - top-5-accuracy: 0.7236 - val_loss: 2.8626 - val_accuracy: 0.3840 - val_top-5-accuracy: 0.6896
Epoch 22/40
352/352 [==============================] - 11s 31ms/step - loss: 2.7499 - accuracy: 0.4098 - top-5-accuracy: 0.7278 - val_loss: 2.8621 - val_accuracy: 0.3868 - val_top-5-accuracy: 0.6944
Epoch 23/40
352/352 [==============================] - 11s 31ms/step - loss: 2.7389 - accuracy: 0.4136 - top-5-accuracy: 0.7305 - val_loss: 2.8527 - val_accuracy: 0.3834 - val_top-5-accuracy: 0.7002
Epoch 24/40
352/352 [==============================] - 11s 31ms/step - loss: 2.7219 - accuracy: 0.4198 - top-5-accuracy: 0.7360 - val_loss: 2.9078 - val_accuracy: 0.3738 - val_top-5-accuracy: 0.6796
Epoch 25/40
352/352 [==============================] - 11s 32ms/step - loss: 2.7119 - accuracy: 0.4195 - top-5-accuracy: 0.7373 - val_loss: 2.8470 - val_accuracy: 0.3840 - val_top-5-accuracy: 0.6994
Epoch 26/40
352/352 [==============================] - 11s 32ms/step - loss: 2.7079 - accuracy: 0.4214 - top-5-accuracy: 0.7355 - val_loss: 2.8101 - val_accuracy: 0.3934 - val_top-5-accuracy: 0.7130
Epoch 27/40
352/352 [==============================] - 11s 31ms/step - loss: 2.6925 - accuracy: 0.4280 - top-5-accuracy: 0.7398 - val_loss: 2.8660 - val_accuracy: 0.3804 - val_top-5-accuracy: 0.6996
Epoch 28/40
352/352 [==============================] - 11s 31ms/step - loss: 2.6864 - accuracy: 0.4273 - top-5-accuracy: 0.7430 - val_loss: 2.7863 - val_accuracy: 0.4014 - val_top-5-accuracy: 0.7234
Epoch 29/40
352/352 [==============================] - 11s 31ms/step - loss: 2.6763 - accuracy: 0.4324 - top-5-accuracy: 0.7472 - val_loss: 2.7852 - val_accuracy: 0.4030 - val_top-5-accuracy: 0.7158
Epoch 30/40
352/352 [==============================] - 11s 31ms/step - loss: 2.6656 - accuracy: 0.4356 - top-5-accuracy: 0.7489 - val_loss: 2.7991 - val_accuracy: 0.3940 - val_top-5-accuracy: 0.7104
Epoch 31/40
352/352 [==============================] - 11s 31ms/step - loss: 2.6589 - accuracy: 0.4383 - top-5-accuracy: 0.7512 - val_loss: 2.7938 - val_accuracy: 0.3966 - val_top-5-accuracy: 0.7148
Epoch 32/40
352/352 [==============================] - 11s 31ms/step - loss: 2.6509 - accuracy: 0.4367 - top-5-accuracy: 0.7530 - val_loss: 2.8226 - val_accuracy: 0.3944 - val_top-5-accuracy: 0.7092
Epoch 33/40
352/352 [==============================] - 11s 31ms/step - loss: 2.6384 - accuracy: 0.4432 - top-5-accuracy: 0.7565 - val_loss: 2.8171 - val_accuracy: 0.3964 - val_top-5-accuracy: 0.7060
Epoch 34/40
352/352 [==============================] - 11s 31ms/step - loss: 2.6317 - accuracy: 0.4446 - top-5-accuracy: 0.7561 - val_loss: 2.7923 - val_accuracy: 0.3970 - val_top-5-accuracy: 0.7134
Epoch 35/40
352/352 [==============================] - 11s 31ms/step - loss: 2.6241 - accuracy: 0.4447 - top-5-accuracy: 0.7574 - val_loss: 2.7664 - val_accuracy: 0.4108 - val_top-5-accuracy: 0.7180
Epoch 36/40
352/352 [==============================] - 11s 31ms/step - loss: 2.6199 - accuracy: 0.4467 - top-5-accuracy: 0.7586 - val_loss: 2.7480 - val_accuracy: 0.4078 - val_top-5-accuracy: 0.7242
Epoch 37/40
352/352 [==============================] - 11s 31ms/step - loss: 2.6127 - accuracy: 0.4506 - top-5-accuracy: 0.7608 - val_loss: 2.7651 - val_accuracy: 0.4052 - val_top-5-accuracy: 0.7218
Epoch 38/40
352/352 [==============================] - 11s 31ms/step - loss: 2.6025 - accuracy: 0.4520 - top-5-accuracy: 0.7620 - val_loss: 2.7641 - val_accuracy: 0.4114 - val_top-5-accuracy: 0.7254
Epoch 39/40
352/352 [==============================] - 11s 31ms/step - loss: 2.5934 - accuracy: 0.4542 - top-5-accuracy: 0.7670 - val_loss: 2.7453 - val_accuracy: 0.4120 - val_top-5-accuracy: 0.7200
Epoch 40/40
352/352 [==============================] - 11s 31ms/step - loss: 2.5859 - accuracy: 0.4565 - top-5-accuracy: 0.7688 - val_loss: 2.7504 - val_accuracy: 0.4118 - val_top-5-accuracy: 0.7268
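The step count of 352 per epoch in the log above follows from the dataset size, the validation split, and the batch size. A small arithmetic check, assuming the last partial batch counts as a step:

```python
import math

num_train = 50000        # CIFAR-100 training images
validation_split = 0.1   # fraction held out by model.fit
batch_size = 128

num_fit = round(num_train * (1 - validation_split))  # samples actually trained on
steps_per_epoch = math.ceil(num_fit / batch_size)    # last partial batch counts as a step

print(num_fit, steps_per_epoch)
```

The same reasoning explains the 313 steps reported during evaluation: `model.evaluate` uses a default batch size of 32, and ceil(10000 / 32) = 313.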
(Translator's note: experimental results, 150 epochs)
Epoch 1/150
352/352 [==============================] - 20s 38ms/step - loss: 4.1301 - accuracy: 0.0846 - top-5-accuracy: 0.2674 - val_loss: 3.8970 - val_accuracy: 0.1250 - val_top-5-accuracy: 0.3638
Epoch 2/150
352/352 [==============================] - 12s 34ms/step - loss: 3.7000 - accuracy: 0.1683 - top-5-accuracy: 0.4339 - val_loss: 3.6077 - val_accuracy: 0.1920 - val_top-5-accuracy: 0.4612
Epoch 3/150
352/352 [==============================] - 12s 34ms/step - loss: 3.5177 - accuracy: 0.2093 - top-5-accuracy: 0.4956 - val_loss: 3.4902 - val_accuracy: 0.2208 - val_top-5-accuracy: 0.5098
Epoch 4/150
352/352 [==============================] - 12s 34ms/step - loss: 3.3923 - accuracy: 0.2387 - top-5-accuracy: 0.5394 - val_loss: 3.3481 - val_accuracy: 0.2542 - val_top-5-accuracy: 0.5624
Epoch 5/150
352/352 [==============================] - 12s 34ms/step - loss: 3.2882 - accuracy: 0.2662 - top-5-accuracy: 0.5732 - val_loss: 3.2579 - val_accuracy: 0.2832 - val_top-5-accuracy: 0.5824
Epoch 6/150
352/352 [==============================] - 12s 34ms/step - loss: 3.2103 - accuracy: 0.2845 - top-5-accuracy: 0.5984 - val_loss: 3.2234 - val_accuracy: 0.2880 - val_top-5-accuracy: 0.5980
Epoch 7/150
352/352 [==============================] - 12s 34ms/step - loss: 3.1424 - accuracy: 0.3038 - top-5-accuracy: 0.6186 - val_loss: 3.1792 - val_accuracy: 0.2992 - val_top-5-accuracy: 0.6106
Epoch 8/150
352/352 [==============================] - 12s 34ms/step - loss: 3.0884 - accuracy: 0.3190 - top-5-accuracy: 0.6345 - val_loss: 3.1070 - val_accuracy: 0.3136 - val_top-5-accuracy: 0.6226
Epoch 9/150
352/352 [==============================] - 12s 34ms/step - loss: 3.0361 - accuracy: 0.3314 - top-5-accuracy: 0.6516 - val_loss: 3.0654 - val_accuracy: 0.3280 - val_top-5-accuracy: 0.6458
Epoch 10/150
352/352 [==============================] - 12s 34ms/step - loss: 2.9961 - accuracy: 0.3422 - top-5-accuracy: 0.6634 - val_loss: 3.0433 - val_accuracy: 0.3298 - val_top-5-accuracy: 0.6490
...
Epoch 141/150
352/352 [==============================] - 12s 34ms/step - loss: 2.2936 - accuracy: 0.5506 - top-5-accuracy: 0.8343 - val_loss: 2.6108 - val_accuracy: 0.4584 - val_top-5-accuracy: 0.7632
Epoch 142/150
352/352 [==============================] - 12s 34ms/step - loss: 2.2941 - accuracy: 0.5499 - top-5-accuracy: 0.8346 - val_loss: 2.6441 - val_accuracy: 0.4536 - val_top-5-accuracy: 0.7566
Epoch 143/150
352/352 [==============================] - 12s 34ms/step - loss: 2.2891 - accuracy: 0.5514 - top-5-accuracy: 0.8372 - val_loss: 2.6164 - val_accuracy: 0.4566 - val_top-5-accuracy: 0.7568
Epoch 144/150
352/352 [==============================] - 12s 34ms/step - loss: 2.2924 - accuracy: 0.5507 - top-5-accuracy: 0.8349 - val_loss: 2.6333 - val_accuracy: 0.4540 - val_top-5-accuracy: 0.7508
Epoch 145/150
352/352 [==============================] - 12s 34ms/step - loss: 2.2867 - accuracy: 0.5504 - top-5-accuracy: 0.8354 - val_loss: 2.6772 - val_accuracy: 0.4434 - val_top-5-accuracy: 0.7514
Epoch 146/150
352/352 [==============================] - 12s 34ms/step - loss: 2.2822 - accuracy: 0.5534 - top-5-accuracy: 0.8394 - val_loss: 2.6050 - val_accuracy: 0.4564 - val_top-5-accuracy: 0.7574
Epoch 147/150
352/352 [==============================] - 12s 34ms/step - loss: 2.2825 - accuracy: 0.5556 - top-5-accuracy: 0.8374 - val_loss: 2.6184 - val_accuracy: 0.4558 - val_top-5-accuracy: 0.7592
Epoch 148/150
352/352 [==============================] - 12s 34ms/step - loss: 2.2815 - accuracy: 0.5528 - top-5-accuracy: 0.8372 - val_loss: 2.6166 - val_accuracy: 0.4518 - val_top-5-accuracy: 0.7568
Epoch 149/150
352/352 [==============================] - 12s 34ms/step - loss: 2.2794 - accuracy: 0.5560 - top-5-accuracy: 0.8375 - val_loss: 2.6359 - val_accuracy: 0.4502 - val_top-5-accuracy: 0.7566
Epoch 150/150
352/352 [==============================] - 12s 34ms/step - loss: 2.2864 - accuracy: 0.5521 - top-5-accuracy: 0.8368 - val_loss: 2.6472 - val_accuracy: 0.4466 - val_top-5-accuracy: 0.7552
Let’s visualize the training progress of the model.
plt.plot(history.history["loss"], label="train_loss")
plt.plot(history.history["val_loss"], label="val_loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Train and Validation Losses Over Epochs", fontsize=14)
plt.legend()
plt.grid()
plt.show()
Let’s display the final results of the training on CIFAR-100.
loss, accuracy, top_5_accuracy = model.evaluate(x_test, y_test)
print(f"Test loss: {round(loss, 2)}")
print(f"Test accuracy: {round(accuracy * 100, 2)}%")
print(f"Test top 5 accuracy: {round(top_5_accuracy * 100, 2)}%")
313/313 [==============================] - 3s 8ms/step - loss: 2.7039 - accuracy: 0.4288 - top-5-accuracy: 0.7366
Test loss: 2.7
Test accuracy: 42.88%
Test top 5 accuracy: 73.66%

313/313 [==============================] - 5s 16ms/step - loss: 2.5971 - accuracy: 0.4655 - top-5-accuracy: 0.7600
Test loss: 2.6
Test accuracy: 46.55%
Test top 5 accuracy: 76.0%
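The Swin Transformer model we just trained has only 152K parameters, and it reaches roughly 74% test top-5 accuracy (73.66% in the run above) within just 40 epochs, without any signs of overfitting, as seen in the graph above. This means we can train this network for longer (perhaps with a bit more regularization) and obtain even better performance. The performance can be improved further with additional techniques such as a cosine decay learning rate schedule or other data augmentation techniques. While experimenting, I tried training the model for 150 epochs with a slightly higher dropout and greater embedding dimensions which pushes the performance to ~72% test accuracy on CIFAR-100 as you can see in the screenshot.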
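The cosine decay learning rate schedule mentioned above is not used in this example. As a minimal sketch of the idea, assuming decay from the base learning rate of 1e-3 down to a hypothetical floor `min_lr` over a fixed number of steps (the function name and the `min_lr` parameter are illustrative):

```python
import math


def cosine_decay_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Learning rate that follows half a cosine wave from base_lr down to min_lr."""
    progress = min(step, total_steps) / total_steps  # clamp so the rate stays at min_lr afterwards
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = 352 * 40  # steps per epoch times epochs, as in the 40-epoch run
for step in (0, total_steps // 4, total_steps // 2, total_steps):
    print(step, cosine_decay_lr(step, total_steps))
```

The rate starts at the base value, passes through half of it at the midpoint, and approaches the floor at the end of training. How such a schedule is wired into the optimizer depends on the library version, so that part is left out here.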
The authors present a top-1 accuracy of 87.3% on ImageNet. The authors also present a number of experiments to study how input sizes, optimizers etc. affect the final performance of this model. The authors further present using this model for object detection, semantic segmentation and instance segmentation as well and report competitive results for these. You are strongly advised to also check out the original paper.
This example takes inspiration from the official PyTorch and TensorFlow implementations.
End