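Keras 2 : examples : Generative Deep Learning – Neural Style Transfer with AdaIN (translation/commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 07/23/2022 (keras 2.9.0)

* This page is a translation of the following Keras documentation, with supplementary explanations added where appropriate: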

Code examples : Generative Deep Learning : Neural Style Transfer with AdaIN (Author: Aritra Roy Gosthipaty, Ritwik Raha)

* The sample code has been tested, and additions or modifications have been made where necessary.
* Feel free to link to this page; we would appreciate a note at sales-info@classcat.com.


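Keras 2 : examples : Generative Deep Learning – Neural Style Transfer with AdaIN

Description : Neural Style Transfer with Adaptive Instance Normalization.

Introduction

Neural Style Transfer is the process of transferring the style of one image onto the content of another. It was first introduced in the seminal paper "A Neural Algorithm of Artistic Style" by Gatys et al. A major limitation of the technique proposed in that work is its runtime, because the algorithm uses a slow iterative optimization process.

Follow-up papers that introduced Batch Normalization, Instance Normalization and Conditional Instance Normalization allowed style transfer to be performed in new ways that no longer require a slow iterative process.

Following these papers, the authors Xun Huang and Serge Belongie proposed Adaptive Instance Normalization (AdaIN), which allows arbitrary style transfer in real time.

In this example we implement Adaptive Instance Normalization for Neural Style Transfer. The figure below shows the output of our AdaIN model trained for only 30 epochs.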

Style transfer sample gallery

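You can also try out the model with your own images using this Hugging Face demo (translator's note: broken link).

Setup

We begin by importing the necessary packages. Setting a seed is also advisable for reproducibility, although the listing below does not set one. The global variables are hyperparameters that you can change as you like.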

import os
import glob
import imageio
import numpy as np
from tqdm import tqdm
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds
from tensorflow.keras import layers

# Defining the global variables.
IMAGE_SIZE = (224, 224)
BATCH_SIZE = 64
# Training for a single epoch due to time constraints.
# Please use at least 30 epochs to see good results.
EPOCHS = 1
AUTOTUNE = tf.data.AUTOTUNE
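A minimal sketch of seeding for reproducibility, since the imports and globals above do not include a seed. The seed value 42 and the helper name `seed_everything` are assumptions made for illustration. Only the Python and NumPy generators are seeded here; with TensorFlow one would typically also call `tf.random.set_seed` with the same value.

```python
import random

import numpy as np

# Hypothetical seed value; the original does not specify one.
SEED = 42


def seed_everything(seed):
    """Seeds the Python and NumPy global random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    # With TensorFlow imported as `tf`, one would typically also call
    # tf.random.set_seed(seed) here.


seed_everything(SEED)
first_draw = np.random.rand(3)

# Re-seeding reproduces the same sequence of draws.
seed_everything(SEED)
second_draw = np.random.rand(3)
```

Seeding does not by itself make the whole pipeline deterministic (for example, parallel `map` calls and GPU kernels can still introduce variation), so this should be read as a starting point only.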

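Style transfer sample gallery

For Neural Style Transfer we need style images and content images. In this example we use Best Artworks of All Time as the style dataset and Pascal VOC as the content dataset.

This deviates from the implementation in the original paper, where the authors use WIKI-Art as the style dataset and MSCOCO as the content dataset. We do this to create a minimal yet reproducible example.

Downloading the dataset from Kaggle

The Best Artworks of All Time dataset is hosted on Kaggle and can easily be downloaded in Colab by following these steps:

Follow the instructions here to obtain a Kaggle API key (if you do not have one).
Use the following command to upload the Kaggle API key.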
```
from google.colab import files
files.upload()
```

Use the following commands to move the API key to the appropriate directory and download the dataset.

$ mkdir ~/.kaggle
$ cp kaggle.json ~/.kaggle/
$ chmod 600 ~/.kaggle/kaggle.json
$ kaggle datasets download ikarus777/best-artworks-of-all-time
$ unzip -qq best-artworks-of-all-time.zip
$ rm -rf images
$ mv resized artwork
$ rm best-artworks-of-all-time.zip artists.csv

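tf.data pipeline

In this section, we build the tf.data pipeline for the project. For the style dataset, we decode, convert and resize the images from the folder. For the content images, we use the tfds module, so they are already provided as a tf.data dataset.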

def decode_and_resize(image_path):
    """Decodes and resizes an image from the image file path.

    Args:
        image_path: The image file path. The image is resized to the
            global IMAGE_SIZE.

    Returns:
        A resized image.
    """
    image = tf.io.read_file(image_path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.convert_image_dtype(image, dtype="float32")
    image = tf.image.resize(image, IMAGE_SIZE)
    return image


def extract_image_from_voc(element):
    """Extracts image from the PascalVOC dataset.

    Args:
        element: A dictionary of data. The image is resized to the
            global IMAGE_SIZE.

    Returns:
        A resized image.
    """
    image = element["image"]
    image = tf.image.convert_image_dtype(image, dtype="float32")
    image = tf.image.resize(image, IMAGE_SIZE)
    return image


# Get the image file paths for the style images.
style_images = os.listdir("/content/artwork/resized")
style_images = [os.path.join("/content/artwork/resized", path) for path in style_images]

# split the style images in train, val and test
total_style_images = len(style_images)
train_style = style_images[: int(0.8 * total_style_images)]
val_style = style_images[int(0.8 * total_style_images) : int(0.9 * total_style_images)]
test_style = style_images[int(0.9 * total_style_images) :]

# Build the style and content tf.data datasets.
train_style_ds = (
    tf.data.Dataset.from_tensor_slices(train_style)
    .map(decode_and_resize, num_parallel_calls=AUTOTUNE)
    .repeat()
)
train_content_ds = tfds.load("voc", split="train").map(extract_image_from_voc).repeat()

val_style_ds = (
    tf.data.Dataset.from_tensor_slices(val_style)
    .map(decode_and_resize, num_parallel_calls=AUTOTUNE)
    .repeat()
)
val_content_ds = (
    tfds.load("voc", split="validation").map(extract_image_from_voc).repeat()
)

test_style_ds = (
    tf.data.Dataset.from_tensor_slices(test_style)
    .map(decode_and_resize, num_parallel_calls=AUTOTUNE)
    .repeat()
)
test_content_ds = (
    tfds.load("voc", split="test")
    .map(extract_image_from_voc, num_parallel_calls=AUTOTUNE)
    .repeat()
)

# Zipping the style and content datasets.
train_ds = (
    tf.data.Dataset.zip((train_style_ds, train_content_ds))
    .shuffle(BATCH_SIZE * 2)
    .batch(BATCH_SIZE)
    .prefetch(AUTOTUNE)
)

val_ds = (
    tf.data.Dataset.zip((val_style_ds, val_content_ds))
    .shuffle(BATCH_SIZE * 2)
    .batch(BATCH_SIZE)
    .prefetch(AUTOTUNE)
)

test_ds = (
    tf.data.Dataset.zip((test_style_ds, test_content_ds))
    .shuffle(BATCH_SIZE * 2)
    .batch(BATCH_SIZE)
    .prefetch(AUTOTUNE)
)

Downloading and preparing dataset voc/2007/4.0.0 (download: 868.85 MiB, generated: Unknown size, total: 868.85 MiB) to /root/tensorflow_datasets/voc/2007/4.0.0...

Dl Completed...: 0 url [00:00, ? url/s]

Dl Size...: 0 MiB [00:00, ? MiB/s]

Extraction completed...: 0 file [00:00, ? file/s]

0 examples [00:00, ? examples/s]

Shuffling and writing examples to /root/tensorflow_datasets/voc/2007/4.0.0.incompleteP16YU5/voc-test.tfrecord

  0%|          | 0/4952 [00:00<?, ? examples/s]

0 examples [00:00, ? examples/s]

Shuffling and writing examples to /root/tensorflow_datasets/voc/2007/4.0.0.incompleteP16YU5/voc-train.tfrecord

  0%|          | 0/2501 [00:00<?, ? examples/s]

0 examples [00:00, ? examples/s]

Shuffling and writing examples to /root/tensorflow_datasets/voc/2007/4.0.0.incompleteP16YU5/voc-validation.tfrecord

  0%|          | 0/2510 [00:00<?, ? examples/s]

Dataset voc downloaded and prepared to /root/tensorflow_datasets/voc/2007/4.0.0. Subsequent calls will reuse this data.
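The 80/10/10 split of the style images in the pipeline above relies on slicing a list by index. The sketch below uses hypothetical file names (not the real dataset) to show where the boundaries fall. Note that `os.listdir` returns entries in arbitrary order, so sorting the paths first, as assumed here, is one way to keep the split stable across runs; the helper name `split_paths` is illustrative only.

```python
def split_paths(paths, train_end_frac=0.8, val_end_frac=0.9):
    """Splits a list of paths into train, validation and test lists."""
    paths = sorted(paths)  # assumption: sort for a stable order
    total = len(paths)
    train_end = int(train_end_frac * total)
    val_end = int(val_end_frac * total)
    return paths[:train_end], paths[train_end:val_end], paths[val_end:]


# Hypothetical file names standing in for the style images.
example_paths = [f"artwork_{i:03d}.jpg" for i in range(25)]
train, val, test = split_paths(example_paths)
print(len(train), len(val), len(test))
```

Because the boundaries are truncated with `int`, the validation and test parts can differ in size by one element when the total is not a multiple of ten.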

Visualizing the data

It is always better to visualize the data before training. To check the correctness of our preprocessing pipeline, we visualize 10 samples from the dataset.

style, content = next(iter(train_ds))
fig, axes = plt.subplots(nrows=10, ncols=2, figsize=(5, 30))
[ax.axis("off") for ax in np.ravel(axes)]

for (axis, style_image, content_image) in zip(axes, style[0:10], content[0:10]):
    (ax_style, ax_content) = axis
    ax_style.imshow(style_image)
    ax_style.set_title("Style Image")

    ax_content.imshow(content_image)
    ax_content.set_title("Content Image")

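Architecture

The style transfer network takes a content image and a style image as inputs and outputs the style-transferred image. The authors of AdaIN propose a simple encoder-decoder structure to achieve this.

The content image (C) and the style image (S) are both fed to the encoder network. The encoder outputs (feature maps) are then fed to the AdaIN layer, which computes a combined feature map. This feature map is then fed to a randomly initialized decoder network, which serves as the generator for the neural style transferred image.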

AdaIn equation

The style feature map (fs) and the content feature map (fc) are fed to the AdaIN layer. This layer produces the combined feature map t. The function g represents the decoder (generator) network.
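In symbols, this description can be written as follows. The symbol T for the generated image is an assumption introduced here; the text itself only names fc, fs, t and g.

```latex
t = \mathrm{AdaIN}(f_c, f_s), \qquad T = g(t)
```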

Encoder

The encoder is a part of the VGG19 model pre-trained on ImageNet. We slice the model at the block4_conv1 layer. This output layer is the one proposed by the authors in their paper.

def get_encoder():
    vgg19 = keras.applications.VGG19(
        include_top=False,
        weights="imagenet",
        input_shape=(*IMAGE_SIZE, 3),
    )
    vgg19.trainable = False
    mini_vgg19 = keras.Model(vgg19.input, vgg19.get_layer("block4_conv1").output)

    inputs = layers.Input([*IMAGE_SIZE, 3])
    mini_vgg19_out = mini_vgg19(inputs)
    return keras.Model(inputs, mini_vgg19_out, name="mini_vgg19")

Adaptive Instance Normalization

The AdaIN layer takes in the features of the content and style images. The layer can be defined by the following equation:

AdaIn formula

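Here sigma is the standard deviation and mu is the mean of the variable concerned. In the equation above, the mean and variance of the content feature map fc are aligned with the mean and variance of the style feature map fs.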
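Written out with these symbols, the formula takes the following form, where x stands for the content feature map and y for the style feature map (the names x and y are assumptions used here for readability). It matches the computation performed by the `ada_in` function in this section.

```latex
\mathrm{AdaIN}(x, y) = \sigma(y) \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \mu(y)
```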

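It is important to note that the AdaIN layer proposed by the authors uses no parameters other than the mean and variance. The layer also has no trainable parameters. This is why we use a Python function instead of a Keras layer. The function takes the style and content feature maps, computes the mean and standard deviation of each, and returns the adaptive instance normalized feature map.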

def get_mean_std(x, epsilon=1e-5):
    axes = [1, 2]

    # Compute the mean and standard deviation of a tensor.
    mean, variance = tf.nn.moments(x, axes=axes, keepdims=True)
    standard_deviation = tf.sqrt(variance + epsilon)
    return mean, standard_deviation


def ada_in(style, content):
    """Computes the AdaIn feature map.

    Args:
        style: The style feature map.
        content: The content feature map.

    Returns:
        The AdaIN feature map.
    """
    content_mean, content_std = get_mean_std(content)
    style_mean, style_std = get_mean_std(style)
    t = style_std * (content - content_mean) / content_std + style_mean
    return t
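As a sanity check on the idea, the following NumPy sketch mirrors `get_mean_std` and `ada_in` on random arrays and compares the statistics of the result with those of the style features. The function names with the `_np` suffix and the random test data are assumptions made for illustration; they are not part of the original example.

```python
import numpy as np


def get_mean_std_np(x, epsilon=1e-5):
    """Per-sample, per-channel mean and std over the spatial axes (H, W)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    variance = x.var(axis=(1, 2), keepdims=True)
    return mean, np.sqrt(variance + epsilon)


def ada_in_np(style, content):
    """NumPy counterpart of `ada_in` for arrays shaped (N, H, W, C)."""
    content_mean, content_std = get_mean_std_np(content)
    style_mean, style_std = get_mean_std_np(style)
    return style_std * (content - content_mean) / content_std + style_mean


rng = np.random.default_rng(0)
# Hypothetical feature maps: 2 samples, 8x8 spatial grid, 4 channels.
content = rng.normal(loc=0.0, scale=1.0, size=(2, 8, 8, 4))
style = rng.normal(loc=3.0, scale=2.0, size=(2, 8, 8, 4))

t = ada_in_np(style, content)
t_mean, t_std = get_mean_std_np(t)
style_mean, style_std = get_mean_std_np(style)

# The output statistics should closely match those of the style features.
print(np.abs(t_mean - style_mean).max(), np.abs(t_std - style_std).max())
```

The match in standard deviation is approximate rather than exact because of the small `epsilon` added for numerical stability.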

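Decoder

The authors specify that the decoder network must mirror the encoder network. We symmetrically inverted the encoder to build the decoder, using UpSampling2D layers to increase the spatial resolution of the feature maps.

Note that the authors warn against using any normalization layer in the decoder network, and indeed show that including batch normalization or instance normalization hurts the performance of the overall network.

This is the only trainable part of the entire architecture.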

def get_decoder():
    config = {"kernel_size": 3, "strides": 1, "padding": "same", "activation": "relu"}
    decoder = keras.Sequential(
        [
            layers.InputLayer((None, None, 512)),
            layers.Conv2D(filters=512, **config),
            layers.UpSampling2D(),
            layers.Conv2D(filters=256, **config),
            layers.Conv2D(filters=256, **config),
            layers.Conv2D(filters=256, **config),
            layers.Conv2D(filters=256, **config),
            layers.UpSampling2D(),
            layers.Conv2D(filters=128, **config),
            layers.Conv2D(filters=128, **config),
            layers.UpSampling2D(),
            layers.Conv2D(filters=64, **config),
            layers.Conv2D(
                filters=3,
                kernel_size=3,
                strides=1,
                padding="same",
                activation="sigmoid",
            ),
        ]
    )
    return decoder
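To see why the decoder uses three UpSampling2D layers, it helps to trace the spatial size through the encoder and the decoder. The sketch below assumes the standard VGG19 layout, in which three stride-2 max-pooling layers precede block4_conv1, and the default UpSampling2D factor of 2. The helper names are illustrative only.

```python
def encoder_output_size(input_size, num_pools=3):
    """Spatial size after `num_pools` stride-2 poolings (floor division)."""
    size = input_size
    for _ in range(num_pools):
        size //= 2
    return size


def decoder_output_size(feature_size, num_upsamples=3):
    """Spatial size after `num_upsamples` factor-2 upsamplings."""
    return feature_size * 2 ** num_upsamples


IMAGE_SIDE = 224  # one side of IMAGE_SIZE in this example
feature_side = encoder_output_size(IMAGE_SIDE)
restored_side = decoder_output_size(feature_side)
print(feature_side, restored_side)
```

Under these assumptions the decoder restores the input resolution, which is what allows the generated image to be passed back through the loss network.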

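Loss functions

Here we build the loss functions for the neural style transfer model. The authors propose using a pre-trained VGG-19 to compute the loss function of the network. It is important to keep in mind that this is used only to train the decoder network. The total loss (Lt) is a weighted combination of the content loss (Lc) and the style loss (Ls). The lambda term is used to vary the amount of style that is transferred.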
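Written out, the weighted combination described in this paragraph is the following, where lambda corresponds to the `style_weight` argument used in the training code of this example.

```latex
L_t = L_c + \lambda L_s
```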

The total loss

Content loss

This is the Euclidean distance between the content image features and the features of the neural style transferred image.

The content loss

Here the authors propose using the output t of the AdaIN layer as the content target, rather than the features of the original image. This is done to speed up convergence.
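As a sketch, with f denoting the encoder and g the decoder (f is a symbol assumed here; the text only names g and t), the content loss can be written as:

```latex
L_c = \lVert f(g(t)) - t \rVert_2
```

The code in this example passes `keras.losses.MeanSquaredError` as the loss function, so in practice a mean squared error is computed in place of the plain Euclidean norm.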

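Style loss

Rather than using the more commonly used Gram matrix, the authors propose computing the difference between statistical features (mean and variance), which is conceptually cleaner. This can be easily visualized via the following equation: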

The style loss

Here theta denotes the layers in VGG-19 used to compute the loss. In this case they correspond to:

block1_conv1
block2_conv1
block3_conv1
block4_conv1
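A sketch of the style loss in LaTeX, following the description in this section: s is the style image, g(t) the generated image, and theta_i the i-th of the L loss layers. The index i and the count L are notation assumed here, and the symbol theta follows the wording of the text.

```latex
L_s = \sum_{i=1}^{L} \lVert \mu(\theta_i(g(t))) - \mu(\theta_i(s)) \rVert_2
    + \sum_{i=1}^{L} \lVert \sigma(\theta_i(g(t))) - \sigma(\theta_i(s)) \rVert_2
```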

def get_loss_net():
    vgg19 = keras.applications.VGG19(
        include_top=False, weights="imagenet", input_shape=(*IMAGE_SIZE, 3)
    )
    vgg19.trainable = False
    layer_names = ["block1_conv1", "block2_conv1", "block3_conv1", "block4_conv1"]
    outputs = [vgg19.get_layer(name).output for name in layer_names]
    mini_vgg19 = keras.Model(vgg19.input, outputs)

    inputs = layers.Input([*IMAGE_SIZE, 3])
    mini_vgg19_out = mini_vgg19(inputs)
    return keras.Model(inputs, mini_vgg19_out, name="loss_net")

Neural style transfer

This is the trainer module. We wrap the encoder and decoder inside a tf.keras.Model subclass. This allows us to customize what happens in the model.fit() loop.

class NeuralStyleTransfer(tf.keras.Model):
    def __init__(self, encoder, decoder, loss_net, style_weight, **kwargs):
        super().__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        self.loss_net = loss_net
        self.style_weight = style_weight

    def compile(self, optimizer, loss_fn):
        super().compile()
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.style_loss_tracker = keras.metrics.Mean(name="style_loss")
        self.content_loss_tracker = keras.metrics.Mean(name="content_loss")
        self.total_loss_tracker = keras.metrics.Mean(name="total_loss")

    def train_step(self, inputs):
        style, content = inputs

        # Initialize the content and style loss.
        loss_content = 0.0
        loss_style = 0.0

        with tf.GradientTape() as tape:
            # Encode the style and content image.
            style_encoded = self.encoder(style)
            content_encoded = self.encoder(content)

            # Compute the AdaIN target feature maps.
            t = ada_in(style=style_encoded, content=content_encoded)

            # Generate the neural style transferred image.
            reconstructed_image = self.decoder(t)

            # Compute the losses.
            reconstructed_vgg_features = self.loss_net(reconstructed_image)
            style_vgg_features = self.loss_net(style)
            loss_content = self.loss_fn(t, reconstructed_vgg_features[-1])
            for inp, out in zip(style_vgg_features, reconstructed_vgg_features):
                mean_inp, std_inp = get_mean_std(inp)
                mean_out, std_out = get_mean_std(out)
                loss_style += self.loss_fn(mean_inp, mean_out) + self.loss_fn(
                    std_inp, std_out
                )
            loss_style = self.style_weight * loss_style
            total_loss = loss_content + loss_style

        # Compute gradients and optimize the decoder.
        trainable_vars = self.decoder.trainable_variables
        gradients = tape.gradient(total_loss, trainable_vars)
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        # Update the trackers.
        self.style_loss_tracker.update_state(loss_style)
        self.content_loss_tracker.update_state(loss_content)
        self.total_loss_tracker.update_state(total_loss)
        return {
            "style_loss": self.style_loss_tracker.result(),
            "content_loss": self.content_loss_tracker.result(),
            "total_loss": self.total_loss_tracker.result(),
        }

    def test_step(self, inputs):
        style, content = inputs

        # Initialize the content and style loss.
        loss_content = 0.0
        loss_style = 0.0

        # Encode the style and content image.
        style_encoded = self.encoder(style)
        content_encoded = self.encoder(content)

        # Compute the AdaIN target feature maps.
        t = ada_in(style=style_encoded, content=content_encoded)

        # Generate the neural style transferred image.
        reconstructed_image = self.decoder(t)

        # Compute the losses.
        recons_vgg_features = self.loss_net(reconstructed_image)
        style_vgg_features = self.loss_net(style)
        loss_content = self.loss_fn(t, recons_vgg_features[-1])
        for inp, out in zip(style_vgg_features, recons_vgg_features):
            mean_inp, std_inp = get_mean_std(inp)
            mean_out, std_out = get_mean_std(out)
            loss_style += self.loss_fn(mean_inp, mean_out) + self.loss_fn(
                std_inp, std_out
            )
        loss_style = self.style_weight * loss_style
        total_loss = loss_content + loss_style

        # Update the trackers.
        self.style_loss_tracker.update_state(loss_style)
        self.content_loss_tracker.update_state(loss_content)
        self.total_loss_tracker.update_state(total_loss)
        return {
            "style_loss": self.style_loss_tracker.result(),
            "content_loss": self.content_loss_tracker.result(),
            "total_loss": self.total_loss_tracker.result(),
        }

    @property
    def metrics(self):
        return [
            self.style_loss_tracker,
            self.content_loss_tracker,
            self.total_loss_tracker,
        ]
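The loss computation inside `train_step` and `test_step` can be followed more easily in isolation. The NumPy sketch below reproduces the same arithmetic on hypothetical feature maps; the helper names, the random data and the use of only two loss layers are assumptions made for illustration and do not replace the Keras implementation above.

```python
import numpy as np


def mean_std(x, epsilon=1e-5):
    """Per-sample, per-channel mean and std over the spatial axes."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = np.sqrt(x.var(axis=(1, 2), keepdims=True) + epsilon)
    return mean, std


def mse(a, b):
    """Mean squared error over all elements."""
    return float(np.mean((a - b) ** 2))


def style_transfer_loss(t, style_feats, recon_feats, style_weight):
    """Content loss plus weighted style loss, mirroring `train_step`."""
    # Content loss: AdaIN target vs. deepest features of the output image.
    loss_content = mse(t, recon_feats[-1])
    # Style loss: match channel-wise mean and std at every loss layer.
    loss_style = 0.0
    for inp, out in zip(style_feats, recon_feats):
        mean_inp, std_inp = mean_std(inp)
        mean_out, std_out = mean_std(out)
        loss_style += mse(mean_inp, mean_out) + mse(std_inp, std_out)
    loss_style *= style_weight
    return loss_content, loss_style, loss_content + loss_style


rng = np.random.default_rng(0)
# Hypothetical feature maps for two loss layers, shaped (N, H, W, C).
style_feats = [rng.normal(size=(2, 8, 8, 4)), rng.normal(size=(2, 4, 4, 8))]
recon_feats = [f.copy() for f in style_feats]
t = recon_feats[-1].copy()

# Identical features give zero loss; shifting them makes the loss positive.
zero_losses = style_transfer_loss(t, style_feats, recon_feats, style_weight=4.0)
shifted = [f + 1.0 for f in recon_feats]
shifted_losses = style_transfer_loss(t, style_feats, shifted, style_weight=4.0)
print(zero_losses, shifted_losses)
```

Adding a constant to the features changes their mean but not their standard deviation, so in the shifted case only the mean term of the style loss contributes.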

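Train monitor callback

This callback is used to visualize the style transfer output of the model at the end of each epoch. The objective of style transfer cannot be quantified properly and has to be judged subjectively by a viewer. For this reason, visualization is a key aspect of evaluating the model.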

test_style, test_content = next(iter(test_ds))


class TrainMonitor(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Encode the style and content image.
        test_style_encoded = self.model.encoder(test_style)
        test_content_encoded = self.model.encoder(test_content)

        # Compute the AdaIN features.
        test_t = ada_in(style=test_style_encoded, content=test_content_encoded)
        test_reconstructed_image = self.model.decoder(test_t)

        # Plot the Style, Content and the NST image.
        fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20, 5))
        ax[0].imshow(tf.keras.preprocessing.image.array_to_img(test_style[0]))
        ax[0].set_title(f"Style: {epoch:03d}")

        ax[1].imshow(tf.keras.preprocessing.image.array_to_img(test_content[0]))
        ax[1].set_title(f"Content: {epoch:03d}")

        ax[2].imshow(
            tf.keras.preprocessing.image.array_to_img(test_reconstructed_image[0])
        )
        ax[2].set_title(f"NST: {epoch:03d}")

        plt.show()
        plt.close()

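Training the model

In this section, we define the optimizer, the loss function, and the trainer module. We compile the trainer module with the optimizer and the loss function and then train it.

Note: We train the model for a single epoch due to time constraints, but at least 30 epochs of training are needed to see good results.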

optimizer = keras.optimizers.Adam(learning_rate=1e-5)
loss_fn = keras.losses.MeanSquaredError()

encoder = get_encoder()
loss_net = get_loss_net()
decoder = get_decoder()

model = NeuralStyleTransfer(
    encoder=encoder, decoder=decoder, loss_net=loss_net, style_weight=4.0
)

model.compile(optimizer=optimizer, loss_fn=loss_fn)

history = model.fit(
    train_ds,
    epochs=EPOCHS,
    steps_per_epoch=50,
    validation_data=val_ds,
    validation_steps=50,
    callbacks=[TrainMonitor()],
)

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg19/vgg19_weights_tf_dim_ordering_tf_kernels_notop.h5
80142336/80134624 [==============================] - 1s 0us/step
80150528/80134624 [==============================] - 1s 0us/step
50/50 [==============================] - ETA: 0s - style_loss: 213.1439 - content_loss: 141.1564 - total_loss: 354.3002

50/50 [==============================] - 124s 2s/step - style_loss: 213.1439 - content_loss: 141.1564 - total_loss: 354.3002 - val_style_loss: 167.0819 - val_content_loss: 129.0497 - val_total_loss: 296.1316
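The `fit` call above fixes `steps_per_epoch` and `validation_steps` because the datasets are repeated indefinitely, so an epoch has no natural end. A quick calculation, using the split sizes reported in the tfds download log earlier on this page (2501 training and 2510 validation examples for VOC 2007), shows how many image pairs one epoch consumes relative to a single pass over the content data. The constant names for the split sizes are introduced here for illustration.

```python
BATCH_SIZE = 64
STEPS_PER_EPOCH = 50
VALIDATION_STEPS = 50

# Split sizes taken from the tfds download log for voc/2007.
NUM_TRAIN_CONTENT = 2501
NUM_VAL_CONTENT = 2510

train_pairs_per_epoch = BATCH_SIZE * STEPS_PER_EPOCH
val_pairs_per_epoch = BATCH_SIZE * VALIDATION_STEPS

# More than one pass over the content images, hence the need for .repeat().
train_passes = train_pairs_per_epoch / NUM_TRAIN_CONTENT
val_passes = val_pairs_per_epoch / NUM_VAL_CONTENT
print(train_pairs_per_epoch, round(train_passes, 2), round(val_passes, 2))
```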

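Inference

After training the model, we need to run inference with it. We pass arbitrary content and style images from the test dataset and look at the output images.

NOTE: To try the model on your own images, you can use this Hugging Face demo (translator's note: broken link).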

for style, content in test_ds.take(1):
    style_encoded = model.encoder(style)
    content_encoded = model.encoder(content)
    t = ada_in(style=style_encoded, content=content_encoded)
    reconstructed_image = model.decoder(t)
    fig, axes = plt.subplots(nrows=10, ncols=3, figsize=(10, 30))
    [ax.axis("off") for ax in np.ravel(axes)]

    for axis, style_image, content_image, reconstructed_image in zip(
        axes, style[0:10], content[0:10], reconstructed_image[0:10]
    ):
        (ax_style, ax_content, ax_reconstructed) = axis
        ax_style.imshow(style_image)
        ax_style.set_title("Style Image")
        ax_content.imshow(content_image)
        ax_content.set_title("Content Image")
        ax_reconstructed.imshow(reconstructed_image)
        ax_reconstructed.set_title("NST Image")

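Conclusion

Adaptive Instance Normalization allows arbitrary style transfer in real time. It is also important to note that the authors' novel proposal achieves this solely by aligning the statistical features (mean and standard deviation) of the style and content images.

Note: AdaIN also serves as the basis for Style-GAN.

Reference

TF implementation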
