Keras 2 : examples : Generative Deep Learning - Denoising Diffusion Implicit Models (translation / commentary)

Translation : ClassCat Co., Ltd. Sales Information
Last updated : 12/19/2022 (keras 2.11.0)
Created : 09/04/2022 (keras 2.9.0)

* This page is a translation of the following Keras document, with supplementary explanations added where appropriate:

Code examples : Generative Deep Learning : Denoising Diffusion Implicit Models (Author: András Béres : Created : 2022/06/24)

* The sample code has been verified to run, and has been extended or modified where necessary.


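Description : Generating images of flowers with denoising diffusion implicit models.

Introduction

What are diffusion models?

Recently, denoising diffusion models, including score-based generative models, have gained popularity as a powerful class of generative models that can rival even generative adversarial networks (GANs) in image synthesis quality. They tend to generate more diverse samples, while being stable to train and easy to scale. Recent large diffusion models such as DALL-E 2 and Imagen have shown remarkable text-to-image generation capability. One of their drawbacks, however, is that sampling is slow, because generating an image requires multiple forward passes.

Diffusion refers to the process of turning a structured signal (an image) into noise step by step. By simulating diffusion, we can generate noisy images from our training images and train a neural network to denoise them. Using the trained network, we can simulate the opposite of diffusion, reverse diffusion, which is the process of an image emerging from noise.

One-sentence summary: diffusion models are trained to denoise noisy images, and can generate images by iteratively denoising pure noise.

Goal of this example

This code example is intended to be a minimal but feature-complete (with a generation quality metric) implementation of diffusion models, with modest compute requirements and reasonable performance. My implementation choices and hyperparameter tuning were made with these goals in mind.

The literature on diffusion models is currently mathematically quite complex, with multiple theoretical frameworks (score matching, differential equations, Markov chains) and sometimes even conflicting notations (see Appendix C.2), so trying to understand it can be daunting. The view taken in this example is that these models learn to separate a noisy image into its image and Gaussian noise components.

In this example I made an effort to break down all long mathematical expressions into digestible pieces and gave all variables explanatory names. I also included numerous links to the relevant literature to help interested readers dive deeper into the topic, in the hope that this code example will become a good starting point for practitioners learning about diffusion models.

In the following sections, we will implement a continuous-time version of Denoising Diffusion Implicit Models (DDIMs) with deterministic sampling.

Setup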

import math
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_datasets as tfds

from tensorflow import keras
from keras import layers

Hyperparameters

# data
dataset_name = "oxford_flowers102"
dataset_repetitions = 5
num_epochs = 1  # train for at least 50 epochs for good results
image_size = 64
# KID = Kernel Inception Distance, see related section
kid_image_size = 75
kid_diffusion_steps = 5
plot_diffusion_steps = 20

# sampling
min_signal_rate = 0.02
max_signal_rate = 0.95

# architecture
embedding_dims = 32
embedding_max_frequency = 1000.0
widths = [32, 64, 96, 128]
block_depth = 2

# optimization
batch_size = 64
ema = 0.999
learning_rate = 1e-3
weight_decay = 1e-4

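Data pipeline

We will use the Oxford Flowers 102 dataset for generating images of flowers. It is a diverse natural dataset containing around 8,000 images. Unfortunately, the official splits are imbalanced, as most of the images are contained in the test split. We therefore create new splits (80% train, 20% validation) using the Tensorflow Datasets slicing API. As preprocessing we apply center crops, and we repeat the dataset multiple times (the reason is given in the next section).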

def preprocess_image(data):
    # center crop image
    height = tf.shape(data["image"])[0]
    width = tf.shape(data["image"])[1]
    crop_size = tf.minimum(height, width)
    image = tf.image.crop_to_bounding_box(
        data["image"],
        (height - crop_size) // 2,
        (width - crop_size) // 2,
        crop_size,
        crop_size,
    )

    # resize and clip
    # for image downsampling it is important to turn on antialiasing
    image = tf.image.resize(image, size=[image_size, image_size], antialias=True)
    return tf.clip_by_value(image / 255.0, 0.0, 1.0)


def prepare_dataset(split):
    # the validation dataset is shuffled as well, because data order matters
    # for the KID estimation
    return (
        tfds.load(dataset_name, split=split, shuffle_files=True)
        .map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE)
        .cache()
        .repeat(dataset_repetitions)
        .shuffle(10 * batch_size)
        .batch(batch_size, drop_remainder=True)
        .prefetch(buffer_size=tf.data.AUTOTUNE)
    )


# load dataset
train_dataset = prepare_dataset("train[:80%]+validation[:80%]+test[:80%]")
val_dataset = prepare_dataset("train[80%:]+validation[80%:]+test[80%:]")
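
As a rough sanity check of the pipeline above, the number of batches per epoch can be estimated by hand. The sketch below is plain Python and does not load the dataset. The split sizes are assumptions taken from the TFDS catalog entry for oxford_flowers102 (they are not stated in this tutorial), and percent slices are assumed to round to the nearest example.

```python
# Split sizes are assumptions (TFDS catalog values for oxford_flowers102),
# not numbers stated in this tutorial.
split_sizes = {"train": 1020, "validation": 1020, "test": 6149}

dataset_repetitions = 5
batch_size = 64


def steps_per_epoch(fraction):
    # examples selected by a percent slice such as "train[:80%]";
    # rounding to the nearest integer is assumed here
    num_examples = sum(round(size * fraction) for size in split_sizes.values())
    # repeat() multiplies the number of examples, and batch() with
    # drop_remainder=True discards the final incomplete batch
    return (num_examples * dataset_repetitions) // batch_size


train_steps = steps_per_epoch(0.8)
val_steps = steps_per_epoch(0.2)
print(train_steps, val_steps)
```

Under these assumptions the training estimate agrees with the 511 steps per epoch reported in the training log further below.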

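Kernel inception distance

Kernel Inception Distance (KID) is an image quality metric that was proposed as a replacement for the popular Frechet Inception Distance (FID). I prefer KID to FID because it is simpler to implement, can be estimated per batch, and is computationally lighter.

In this example, the images are evaluated at the minimal possible resolution of the Inception network (75x75 instead of 299x299), and the metric is only measured on the validation set for computational efficiency. For the same reason, the number of sampling steps at evaluation is limited to 5.

Since the dataset is relatively small, we go over the train and validation splits multiple times per epoch. The KID estimation is noisy and compute-intensive, so we want to evaluate it only after many training iterations, but then over many iterations.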

class KID(keras.metrics.Metric):
    def __init__(self, name, **kwargs):
        super().__init__(name=name, **kwargs)

        # KID is estimated per batch and is averaged across batches
        self.kid_tracker = keras.metrics.Mean(name="kid_tracker")

        # a pretrained InceptionV3 is used without its classification layer
        # transform the pixel values to the 0-255 range, then use the same
        # preprocessing as during pretraining
        self.encoder = keras.Sequential(
            [
                keras.Input(shape=(image_size, image_size, 3)),
                layers.Rescaling(255.0),
                layers.Resizing(height=kid_image_size, width=kid_image_size),
                layers.Lambda(keras.applications.inception_v3.preprocess_input),
                keras.applications.InceptionV3(
                    include_top=False,
                    input_shape=(kid_image_size, kid_image_size, 3),
                    weights="imagenet",
                ),
                layers.GlobalAveragePooling2D(),
            ],
            name="inception_encoder",
        )

    def polynomial_kernel(self, features_1, features_2):
        feature_dimensions = tf.cast(tf.shape(features_1)[1], dtype=tf.float32)
        return (features_1 @ tf.transpose(features_2) / feature_dimensions + 1.0) ** 3.0

    def update_state(self, real_images, generated_images, sample_weight=None):
        real_features = self.encoder(real_images, training=False)
        generated_features = self.encoder(generated_images, training=False)

        # compute polynomial kernels using the two sets of features
        kernel_real = self.polynomial_kernel(real_features, real_features)
        kernel_generated = self.polynomial_kernel(
            generated_features, generated_features
        )
        kernel_cross = self.polynomial_kernel(real_features, generated_features)

        # estimate the squared maximum mean discrepancy using the average kernel values
        batch_size = tf.shape(real_features)[0]
        batch_size_f = tf.cast(batch_size, dtype=tf.float32)
        mean_kernel_real = tf.reduce_sum(kernel_real * (1.0 - tf.eye(batch_size))) / (
            batch_size_f * (batch_size_f - 1.0)
        )
        mean_kernel_generated = tf.reduce_sum(
            kernel_generated * (1.0 - tf.eye(batch_size))
        ) / (batch_size_f * (batch_size_f - 1.0))
        mean_kernel_cross = tf.reduce_mean(kernel_cross)
        kid = mean_kernel_real + mean_kernel_generated - 2.0 * mean_kernel_cross

        # update the average KID estimate
        self.kid_tracker.update_state(kid)

    def result(self):
        return self.kid_tracker.result()

    def reset_state(self):
        self.kid_tracker.reset_state()
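
To make the estimator in update_state() easier to follow, here is a minimal sketch of the same arithmetic in plain Python on tiny hand-made feature vectors. The feature values and the helper name kid_estimate are hypothetical, and both sets are assumed to have the same size, as the batches do in the metric above. In the real metric the features come from the InceptionV3 encoder.

```python
def polynomial_kernel(features_1, features_2):
    # cubic polynomial kernel, normalized by the feature dimension
    dim = len(features_1[0])
    return [
        [(sum(a * b for a, b in zip(u, v)) / dim + 1.0) ** 3 for v in features_2]
        for u in features_1
    ]


def kid_estimate(real_features, generated_features):
    n = len(real_features)  # both sets are assumed to contain n samples
    kernel_real = polynomial_kernel(real_features, real_features)
    kernel_generated = polynomial_kernel(generated_features, generated_features)
    kernel_cross = polynomial_kernel(real_features, generated_features)

    def off_diagonal_mean(kernel):
        # within-set averages skip the diagonal (a sample paired with itself)
        total = sum(kernel[i][j] for i in range(n) for j in range(n) if i != j)
        return total / (n * (n - 1))

    mean_cross = sum(sum(row) for row in kernel_cross) / (n * n)
    return (
        off_diagonal_mean(kernel_real)
        + off_diagonal_mean(kernel_generated)
        - 2.0 * mean_cross
    )


# hypothetical 2-dimensional "features"
real = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
near = [[value + 0.1 for value in row] for row in real]
far = [[value + 2.0 for value in row] for row in real]

kid_near = kid_estimate(real, near)
kid_far = kid_estimate(real, far)
print(kid_near, kid_far)
```

The set that is shifted further from the real features receives the larger estimate. With so few samples the estimate is very noisy and can even be negative, which is one reason the metric above averages it over many batches.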

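Network architecture

Here we specify the architecture of the neural network that we will use for denoising. We build a U-Net with identical input and output dimensions. U-Net is a popular semantic segmentation architecture, whose main idea is to progressively downsample and then upsample its input image, and to add skip connections between layers that have the same resolution. Unlike in usual autoencoders, these help with gradient flow and avoid introducing a representation bottleneck. Based on this, one can view diffusion models as denoising autoencoders without a bottleneck.

The network takes two inputs: the noisy images and the variances of their noise components. The latter is required because denoising a signal requires different operations at different levels of noise. We transform the noise variances using sinusoidal embeddings, similarly to the positional encodings used in both transformers and NeRF. This helps the network to be highly sensitive to the noise level, which is crucial for good performance. We implement the sinusoidal embeddings using a Lambda layer.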
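
The idea behind the sinusoidal embedding can be sketched for a single scalar in plain Python. This is only an illustration that mirrors the constants from the hyperparameter section. The minimum frequency of 1.0 matches the value used inside sinusoidal_embedding() below, and the real implementation operates on tensors.

```python
import math

embedding_dims = 32
embedding_min_frequency = 1.0
embedding_max_frequency = 1000.0


def sinusoidal_embedding(noise_variance):
    half = embedding_dims // 2
    # frequencies spaced evenly on a log scale, endpoints included
    log_min = math.log(embedding_min_frequency)
    log_max = math.log(embedding_max_frequency)
    frequencies = [
        math.exp(log_min + (log_max - log_min) * i / (half - 1)) for i in range(half)
    ]
    angular_speeds = [2.0 * math.pi * frequency for frequency in frequencies]
    sines = [math.sin(speed * noise_variance) for speed in angular_speeds]
    cosines = [math.cos(speed * noise_variance) for speed in angular_speeds]
    return sines + cosines


embedding_a = sinusoidal_embedding(0.50)
embedding_b = sinusoidal_embedding(0.51)
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(embedding_a, embedding_b)))
print(len(embedding_a), distance)
```

Two noise variances that differ by only 0.01 end up much further apart than that in the embedding space, which is the sensitivity to the noise level mentioned above.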

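Some other considerations:

* We build the network using the Keras Functional API, and use closures to build blocks of layers in a consistent style.
* Diffusion models embed the index of the timestep of the diffusion process instead of the noise variance, while score-based models (Table 1) usually use some function of the noise level. I prefer the latter, so that the sampling schedule can be changed at inference time without retraining the network.
* Diffusion models input the embedding to each convolution block separately. For simplicity we input it only at the start of the network, which in my experience barely decreases performance, because the skip and residual connections help the information propagate through the network properly.
* In the literature, it is common to use attention layers at lower resolutions for better global coherence. I omitted them for simplicity.
* We disable the learnable center and scale parameters of the batch normalization layers, since the convolution layers that follow make them redundant.
* As a good practice, we initialize the kernel of the last convolution to all zeros, so that the network predicts only zeros after initialization, which is the mean of its targets. This improves the behaviour at the start of training and makes the mean squared error loss start at exactly 1.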
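
The last point can be checked numerically without any network. The sketch below assumes that the targets are standard normal noise values and that the freshly initialized network outputs zeros. The sample count and the seed are arbitrary choices for illustration.

```python
import random

random.seed(0)
num_values = 100_000

# targets: standard normal noise; predictions: all zeros after initialization
noise_targets = [random.gauss(0.0, 1.0) for _ in range(num_values)]
zero_predictions = [0.0] * num_values

pairs = list(zip(noise_targets, zero_predictions))
mse = sum((target - pred) ** 2 for target, pred in pairs) / num_values
mae = sum(abs(target - pred) for target, pred in pairs) / num_values
print(mse, mae)
```

The mean squared error comes out close to 1, the variance of the noise. The mean absolute error, which is the loss actually used for training later in this example, starts near sqrt(2 / pi), roughly 0.8.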

def sinusoidal_embedding(x):
    embedding_min_frequency = 1.0
    frequencies = tf.exp(
        tf.linspace(
            tf.math.log(embedding_min_frequency),
            tf.math.log(embedding_max_frequency),
            embedding_dims // 2,
        )
    )
    angular_speeds = 2.0 * math.pi * frequencies
    embeddings = tf.concat(
        [tf.sin(angular_speeds * x), tf.cos(angular_speeds * x)], axis=3
    )
    return embeddings


def ResidualBlock(width):
    def apply(x):
        input_width = x.shape[3]
        if input_width == width:
            residual = x
        else:
            residual = layers.Conv2D(width, kernel_size=1)(x)
        x = layers.BatchNormalization(center=False, scale=False)(x)
        x = layers.Conv2D(
            width, kernel_size=3, padding="same", activation=keras.activations.swish
        )(x)
        x = layers.Conv2D(width, kernel_size=3, padding="same")(x)
        x = layers.Add()([x, residual])
        return x

    return apply


def DownBlock(width, block_depth):
    def apply(x):
        x, skips = x
        for _ in range(block_depth):
            x = ResidualBlock(width)(x)
            skips.append(x)
        x = layers.AveragePooling2D(pool_size=2)(x)
        return x

    return apply


def UpBlock(width, block_depth):
    def apply(x):
        x, skips = x
        x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
        for _ in range(block_depth):
            x = layers.Concatenate()([x, skips.pop()])
            x = ResidualBlock(width)(x)
        return x

    return apply


def get_network(image_size, widths, block_depth):
    noisy_images = keras.Input(shape=(image_size, image_size, 3))
    noise_variances = keras.Input(shape=(1, 1, 1))

    e = layers.Lambda(sinusoidal_embedding)(noise_variances)
    e = layers.UpSampling2D(size=image_size, interpolation="nearest")(e)

    x = layers.Conv2D(widths[0], kernel_size=1)(noisy_images)
    x = layers.Concatenate()([x, e])

    skips = []
    for width in widths[:-1]:
        x = DownBlock(width, block_depth)([x, skips])

    for _ in range(block_depth):
        x = ResidualBlock(widths[-1])(x)

    for width in reversed(widths[:-1]):
        x = UpBlock(width, block_depth)([x, skips])

    x = layers.Conv2D(3, kernel_size=1, kernel_initializer="zeros")(x)

    return keras.Model([noisy_images, noise_variances], x, name="residual_unet")

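This showcases the power of the Functional API. Note how we built a relatively complex U-Net with skip connections, residual blocks, multiple inputs, and sinusoidal embeddings in 80 lines of code!

Diffusion model

Diffusion schedule

Let us say that a diffusion process starts at time = 0 and ends at time = 1. This variable will be called the diffusion time, and it can be either discrete (common in diffusion models) or continuous (common in score-based models). I choose the latter, so that the number of sampling steps can be changed at inference time.

We need a function that tells us, at each point in the diffusion process, the noise level and signal level of the noisy image corresponding to the current diffusion time. This will be called the diffusion schedule (see diffusion_schedule()).

This schedule outputs two quantities: the noise_rate and the signal_rate (corresponding to sqrt(1 - alpha) and sqrt(alpha) in the DDIM paper, respectively). We generate the noisy image by weighting the random noise and the training image by their corresponding rates and adding them together.

Since the (standard normal) random noises and the (normalized) images both have zero mean and unit variance, the noise rate and signal rate can be interpreted as the standard deviations of their components in the noisy image, while the squares of the rates can be interpreted as their variances (or their power in the signal processing sense). The rates are always set so that their squared sum is 1, meaning that the noisy images always have unit variance, just like their unscaled components.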
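
Written out, the argument looks as follows. The symbols are notation introduced only for this note: $x_0$ is a normalized image, $\epsilon$ is the noise, and $s_t$ and $n_t$ are the signal rate and noise rate at diffusion time $t$. The noise is assumed to be sampled independently of the image.

```latex
x_t = s_t \, x_0 + n_t \, \epsilon,
\qquad s_t = \sqrt{\alpha_t}, \quad n_t = \sqrt{1 - \alpha_t}

\operatorname{Var}(x_t)
  = s_t^2 \operatorname{Var}(x_0) + n_t^2 \operatorname{Var}(\epsilon)
  = s_t^2 + n_t^2
  = 1
```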

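We will use a simplified, continuous version of the cosine schedule (Section 3.2), which is quite commonly used in the literature. This schedule is symmetric, it is slow towards the start and end of the diffusion process, and it also has a nice geometric interpretation based on the trigonometric properties of the unit circle: the signal rate and the noise rate are the cosine and the sine of a diffusion angle.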
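
A scalar version of this schedule can be sketched in plain Python, using the same constants as the hyperparameter section. The tensor version used for training is the diffusion_schedule() method shown later. The five evenly spaced times are only an illustrative choice.

```python
import math

min_signal_rate = 0.02
max_signal_rate = 0.95


def diffusion_schedule(diffusion_time):
    # diffusion time -> angle on the unit circle
    start_angle = math.acos(max_signal_rate)
    end_angle = math.acos(min_signal_rate)
    angle = start_angle + diffusion_time * (end_angle - start_angle)
    # angle -> rates; sin^2 + cos^2 = 1 keeps the variance at 1
    noise_rate = math.sin(angle)
    signal_rate = math.cos(angle)
    return noise_rate, signal_rate


schedule = [(step / 4, *diffusion_schedule(step / 4)) for step in range(5)]
for diffusion_time, noise_rate, signal_rate in schedule:
    print(f"time={diffusion_time:.2f} noise={noise_rate:.3f} signal={signal_rate:.3f}")
```

At time 0 the signal rate equals max_signal_rate and at time 1 it equals min_signal_rate, so the image is never completely clean and never completely destroyed.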

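Training process

The training procedure of denoising diffusion models (see train_step() and denoise()) is the following: we sample random diffusion times uniformly, and mix the training images with random Gaussian noise at rates corresponding to the diffusion times. Then we train the model to separate the noisy image into its two components.

Usually, the neural network is trained to predict the unscaled noise component, from which the predicted image component can be calculated using the signal and noise rates. In theory, pixelwise mean squared error should be used. However, I recommend using mean absolute error instead (as this implementation does), which produces better results on this dataset.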
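
The relation between the predicted noise and the predicted image can be illustrated on a single pixel in plain Python. All numbers below are hypothetical, and image_from_noise is a helper name made up for this sketch. It applies the same formula as denoise().

```python
import math
import random

random.seed(0)
image_value = 0.7  # one normalized pixel (hypothetical value)
noise_value = random.gauss(0.0, 1.0)

signal_rate = 0.6
noise_rate = math.sqrt(1.0 - signal_rate**2)  # squared rates sum to 1

# forward mixing, as in train_step()
noisy_value = signal_rate * image_value + noise_rate * noise_value


def image_from_noise(noisy, pred_noise):
    # same formula as in denoise(): remove the predicted noise, then rescale
    return (noisy - noise_rate * pred_noise) / signal_rate


exact_image = image_from_noise(noisy_value, noise_value)
# a noise prediction that is off by 0.1
approx_image = image_from_noise(noisy_value, noise_value + 0.1)
image_error = abs(approx_image - image_value)
print(exact_image, image_error)
```

A perfect noise prediction recovers the pixel exactly, while an error in the noise prediction is scaled by noise_rate / signal_rate in the image prediction. This suggests why the image loss can be larger than the noise loss at high noise levels.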

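Sampling (reverse diffusion)

When sampling (see reverse_diffusion()), at each step we take the previous estimate of the noisy image and separate it into image and noise using our network. Then we recombine these components using the signal and noise rates of the following step.

Although a similar view is shown in Equation 12 of the DDIM paper, I believe the above explanation of the sampling equation is not widely known.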
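
The separate-and-remix loop can be tried on a one-dimensional toy problem without a trained network. In the sketch below, ideal_denoise is a hypothetical closed-form stand-in for the network: it assumes a dataset that contains only the values -1 and +1 with equal probability, for which the posterior mean of the image given the noisy value is a tanh. Unlike denoise(), it computes the image first and derives the noise from it, but the two predictions satisfy the same mixing relation.

```python
import math

min_signal_rate = 0.02
max_signal_rate = 0.95


def diffusion_schedule(diffusion_time):
    # scalar version of the cosine schedule
    start_angle = math.acos(max_signal_rate)
    end_angle = math.acos(min_signal_rate)
    angle = start_angle + diffusion_time * (end_angle - start_angle)
    return math.sin(angle), math.cos(angle)  # noise_rate, signal_rate


def ideal_denoise(noisy_value, noise_rate, signal_rate):
    # hypothetical stand-in for the trained network, valid only for a toy
    # dataset made of -1 and +1 with equal probability
    pred_image = math.tanh(signal_rate * noisy_value / noise_rate**2)
    pred_noise = (noisy_value - signal_rate * pred_image) / noise_rate
    return pred_noise, pred_image


def reverse_diffusion(initial_noise, diffusion_steps):
    step_size = 1.0 / diffusion_steps
    next_noisy_value = initial_noise
    for step in range(diffusion_steps):
        noisy_value = next_noisy_value

        # separate the current noisy value into its components
        diffusion_time = 1.0 - step * step_size
        noise_rate, signal_rate = diffusion_schedule(diffusion_time)
        pred_noise, pred_image = ideal_denoise(noisy_value, noise_rate, signal_rate)

        # remix the components using the rates of the next step
        next_noise_rate, next_signal_rate = diffusion_schedule(
            diffusion_time - step_size
        )
        next_noisy_value = next_signal_rate * pred_image + next_noise_rate * pred_noise
    return pred_image


sample_pos = reverse_diffusion(1.0, diffusion_steps=10)
sample_neg = reverse_diffusion(-1.0, diffusion_steps=10)
print(sample_pos, sample_neg)
```

Because the procedure is deterministic, the initial noise alone decides which of the two data values the sample converges to.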

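This example only implements the deterministic sampling procedure from DDIM, which corresponds to eta = 0 in the paper. One can also use stochastic sampling (in which case the model becomes a Denoising Diffusion Probabilistic Model (DDPM)), where a part of the predicted noise is replaced with the same or a larger amount of random noise (see Equation 16 and the text below it).

Stochastic sampling can be used without retraining the network (since both models are trained in the same way), and it can improve sample quality, but on the other hand it usually requires more sampling steps.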
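
The code in this example does not implement stochastic sampling. As a hedged sketch of what a single stochastic remix step could look like, the block below rewrites sigma from Equation 16 of the DDIM paper in terms of the signal and noise rates (alpha = signal_rate^2, 1 - alpha = noise_rate^2). The function names and numeric values are hypothetical, and only 0 <= eta <= 1 is covered.

```python
import math
import random


def split_noise_rate(noise_rate, signal_rate, next_noise_rate, next_signal_rate, eta):
    # sigma as in Equation 16 of the DDIM paper, expressed with rates
    sigma = (
        eta
        * (next_noise_rate / noise_rate)
        * math.sqrt(1.0 - (signal_rate / next_signal_rate) ** 2)
    )
    # the rest of the next-step noise budget is filled with predicted noise
    kept_noise_rate = math.sqrt(next_noise_rate**2 - sigma**2)
    return kept_noise_rate, sigma


def stochastic_remix(
    pred_image, pred_noise, noise_rate, signal_rate,
    next_noise_rate, next_signal_rate, eta, rng,
):
    kept_noise_rate, sigma = split_noise_rate(
        noise_rate, signal_rate, next_noise_rate, next_signal_rate, eta
    )
    fresh_noise = rng.gauss(0.0, 1.0)
    return (
        next_signal_rate * pred_image
        + kept_noise_rate * pred_noise
        + sigma * fresh_noise
    )


# hypothetical rates for the current step and the next (less noisy) step
rates = dict(noise_rate=0.8, signal_rate=0.6, next_noise_rate=0.6, next_signal_rate=0.8)

kept_0, sigma_0 = split_noise_rate(eta=0.0, **rates)
kept_1, sigma_1 = split_noise_rate(eta=1.0, **rates)
deterministic = stochastic_remix(0.5, 0.2, eta=0.0, rng=random.Random(0), **rates)
print(kept_0, sigma_0, kept_1, sigma_1, deterministic)
```

With eta = 0 the step reduces to the deterministic remix used in reverse_diffusion(). With eta = 1 part of the predicted noise is replaced by fresh random noise, while the total noise variance of the next step stays the same.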

class DiffusionModel(keras.Model):
    def __init__(self, image_size, widths, block_depth):
        super().__init__()

        self.normalizer = layers.Normalization()
        self.network = get_network(image_size, widths, block_depth)
        self.ema_network = keras.models.clone_model(self.network)

    def compile(self, **kwargs):
        super().compile(**kwargs)

        self.noise_loss_tracker = keras.metrics.Mean(name="n_loss")
        self.image_loss_tracker = keras.metrics.Mean(name="i_loss")
        self.kid = KID(name="kid")

    @property
    def metrics(self):
        return [self.noise_loss_tracker, self.image_loss_tracker, self.kid]

    def denormalize(self, images):
        # convert the pixel values back to 0-1 range
        images = self.normalizer.mean + images * self.normalizer.variance**0.5
        return tf.clip_by_value(images, 0.0, 1.0)

    def diffusion_schedule(self, diffusion_times):
        # diffusion times -> angles
        start_angle = tf.acos(max_signal_rate)
        end_angle = tf.acos(min_signal_rate)

        diffusion_angles = start_angle + diffusion_times * (end_angle - start_angle)

        # angles -> signal and noise rates
        signal_rates = tf.cos(diffusion_angles)
        noise_rates = tf.sin(diffusion_angles)
        # note that their squared sum is always: sin^2(x) + cos^2(x) = 1

        return noise_rates, signal_rates

    def denoise(self, noisy_images, noise_rates, signal_rates, training):
        # the exponential moving average weights are used at evaluation
        if training:
            network = self.network
        else:
            network = self.ema_network

        # predict noise component and calculate the image component using it
        pred_noises = network([noisy_images, noise_rates**2], training=training)
        pred_images = (noisy_images - noise_rates * pred_noises) / signal_rates

        return pred_noises, pred_images

    def reverse_diffusion(self, initial_noise, diffusion_steps):
        # reverse diffusion = sampling
        num_images = initial_noise.shape[0]
        step_size = 1.0 / diffusion_steps

        # important line:
        # at the first sampling step, the "noisy image" is pure noise
        # but its signal rate is assumed to be nonzero (min_signal_rate)
        next_noisy_images = initial_noise
        for step in range(diffusion_steps):
            noisy_images = next_noisy_images

            # separate the current noisy image to its components
            diffusion_times = tf.ones((num_images, 1, 1, 1)) - step * step_size
            noise_rates, signal_rates = self.diffusion_schedule(diffusion_times)
            pred_noises, pred_images = self.denoise(
                noisy_images, noise_rates, signal_rates, training=False
            )
            # network used in eval mode

            # remix the predicted components using the next signal and noise rates
            next_diffusion_times = diffusion_times - step_size
            next_noise_rates, next_signal_rates = self.diffusion_schedule(
                next_diffusion_times
            )
            next_noisy_images = (
                next_signal_rates * pred_images + next_noise_rates * pred_noises
            )
            # this new noisy image will be used in the next step

        return pred_images

    def generate(self, num_images, diffusion_steps):
        # noise -> images -> denormalized images
        initial_noise = tf.random.normal(shape=(num_images, image_size, image_size, 3))
        generated_images = self.reverse_diffusion(initial_noise, diffusion_steps)
        generated_images = self.denormalize(generated_images)
        return generated_images

    def train_step(self, images):
        # normalize images to have standard deviation of 1, like the noises
        images = self.normalizer(images, training=True)
        noises = tf.random.normal(shape=(batch_size, image_size, image_size, 3))

        # sample uniform random diffusion times
        diffusion_times = tf.random.uniform(
            shape=(batch_size, 1, 1, 1), minval=0.0, maxval=1.0
        )
        noise_rates, signal_rates = self.diffusion_schedule(diffusion_times)
        # mix the images with noises accordingly
        noisy_images = signal_rates * images + noise_rates * noises

        with tf.GradientTape() as tape:
            # train the network to separate noisy images to their components
            pred_noises, pred_images = self.denoise(
                noisy_images, noise_rates, signal_rates, training=True
            )

            noise_loss = self.loss(noises, pred_noises)  # used for training
            image_loss = self.loss(images, pred_images)  # only used as metric

        gradients = tape.gradient(noise_loss, self.network.trainable_weights)
        self.optimizer.apply_gradients(zip(gradients, self.network.trainable_weights))

        self.noise_loss_tracker.update_state(noise_loss)
        self.image_loss_tracker.update_state(image_loss)

        # track the exponential moving averages of weights
        for weight, ema_weight in zip(self.network.weights, self.ema_network.weights):
            ema_weight.assign(ema * ema_weight + (1 - ema) * weight)

        # KID is not measured during the training phase for computational efficiency
        return {m.name: m.result() for m in self.metrics[:-1]}

    def test_step(self, images):
        # normalize images to have standard deviation of 1, like the noises
        images = self.normalizer(images, training=False)
        noises = tf.random.normal(shape=(batch_size, image_size, image_size, 3))

        # sample uniform random diffusion times
        diffusion_times = tf.random.uniform(
            shape=(batch_size, 1, 1, 1), minval=0.0, maxval=1.0
        )
        noise_rates, signal_rates = self.diffusion_schedule(diffusion_times)
        # mix the images with noises accordingly
        noisy_images = signal_rates * images + noise_rates * noises

        # use the network to separate noisy images to their components
        pred_noises, pred_images = self.denoise(
            noisy_images, noise_rates, signal_rates, training=False
        )

        noise_loss = self.loss(noises, pred_noises)
        image_loss = self.loss(images, pred_images)

        self.image_loss_tracker.update_state(image_loss)
        self.noise_loss_tracker.update_state(noise_loss)

        # measure KID between real and generated images
        # this is computationally demanding, kid_diffusion_steps has to be small
        images = self.denormalize(images)
        generated_images = self.generate(
            num_images=batch_size, diffusion_steps=kid_diffusion_steps
        )
        self.kid.update_state(images, generated_images)

        return {m.name: m.result() for m in self.metrics}

    def plot_images(self, epoch=None, logs=None, num_rows=3, num_cols=6):
        # plot random generated images for visual evaluation of generation quality
        generated_images = self.generate(
            num_images=num_rows * num_cols,
            diffusion_steps=plot_diffusion_steps,
        )

        plt.figure(figsize=(num_cols * 2.0, num_rows * 2.0))
        for row in range(num_rows):
            for col in range(num_cols):
                index = row * num_cols + col
                plt.subplot(num_rows, num_cols, index + 1)
                plt.imshow(generated_images[index])
                plt.axis("off")
        plt.tight_layout()
        plt.show()
        plt.close()

Training

# create and compile the model
model = DiffusionModel(image_size, widths, block_depth)
# below tensorflow 2.9:
# pip install tensorflow_addons
# import tensorflow_addons as tfa
# optimizer=tfa.optimizers.AdamW
model.compile(
    optimizer=keras.optimizers.experimental.AdamW(
        learning_rate=learning_rate, weight_decay=weight_decay
    ),
    loss=keras.losses.mean_absolute_error,
)
# pixelwise mean absolute error is used as loss

# save the best model based on the validation KID metric
checkpoint_path = "checkpoints/diffusion_model"
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    save_weights_only=True,
    monitor="val_kid",
    mode="min",
    save_best_only=True,
)

# calculate mean and variance of training dataset for normalization
model.normalizer.adapt(train_dataset)

# run training and plot generated images periodically
model.fit(
    train_dataset,
    epochs=num_epochs,
    validation_data=val_dataset,
    callbacks=[
        keras.callbacks.LambdaCallback(on_epoch_end=model.plot_images),
        checkpoint_callback,
    ],
)

511/511 [==============================] - 159s 273ms/step - n_loss: 0.2107 - i_loss: 0.4176 - val_n_loss: 0.7908 - val_i_loss: 2.5105 - val_kid: 2.0629

Inference

# load the best model and generate images
model.load_weights(checkpoint_path)
model.plot_images()

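Results

By running the training for at least 50 epochs (which takes 2 hours on a T4 GPU and 30 minutes on an A100 GPU), one can get high-quality image generations using this code example.

The evolution of a batch of images over an 80-epoch training (color artifacts are due to GIF compression):

Images generated using between 1 and 20 sampling steps from the same initial noise:

Interpolation (spherical) between initial noise samples: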
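
Spherical interpolation is not part of the code in this example. The sketch below shows one common formulation (slerp) in plain Python, which may differ from the method used to produce the figure. The function names and the two-dimensional vectors are hypothetical stand-ins for flattened initial noise tensors.

```python
import math


def vector_norm(vector):
    return math.sqrt(sum(value * value for value in vector))


def slerp(noise_a, noise_b, weight):
    # angle between the two noise vectors
    dot = sum(a * b for a, b in zip(noise_a, noise_b))
    cosine = dot / (vector_norm(noise_a) * vector_norm(noise_b))
    angle = math.acos(max(-1.0, min(1.0, cosine)))
    if angle < 1e-8:
        # nearly parallel vectors: fall back to linear interpolation
        return [(1.0 - weight) * a + weight * b for a, b in zip(noise_a, noise_b)]
    coeff_a = math.sin((1.0 - weight) * angle) / math.sin(angle)
    coeff_b = math.sin(weight * angle) / math.sin(angle)
    return [coeff_a * a + coeff_b * b for a, b in zip(noise_a, noise_b)]


noise_a = [1.0, 0.0]
noise_b = [0.0, 1.0]
spherical_mid = slerp(noise_a, noise_b, 0.5)
linear_mid = [0.5 * a + 0.5 * b for a, b in zip(noise_a, noise_b)]
print(vector_norm(spherical_mid), vector_norm(linear_mid))
```

The spherical midpoint keeps the norm of the endpoints, while the linear midpoint is shorter. For Gaussian noise a shorter vector means a lower variance than the network saw during training, which is a common motivation for interpolating spherically.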

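Deterministic sampling process (noisy images on top, predicted images on bottom, 40 steps):

Stochastic sampling process (noisy images on top, predicted images on bottom, 80 steps):

Trained model and demo available on HuggingFace:

Trained model : HuggingFace Model DDIM
Demo : HuggingFace Spaces DDIM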
