Keras 2 : Guide : KerasCV – High-performance image generation using Stable Diffusion (translation/commentary)

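Translation : ClassCat Co., Ltd. Sales Information
Created : 12/23/2022 (keras 2.11.0)

* This page is a translation of the following Keras document, with supplementary explanations added where appropriate:

High-performance image generation using Stable Diffusion in KerasCV (Author : fchollet, lukewood, divamgupta ; Created : 2022/09/25)

* The sample code has been verified to run, with additions and modifications where necessary.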

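Description : Generate new images using KerasCV's StableDiffusion model.

Overview

In this guide, we will show how to generate novel images based on a text prompt using the KerasCV implementation of stability.ai's text-to-image model, Stable Diffusion.

Stable Diffusion is a powerful, open-source text-to-image generation model. While there are multiple open-source implementations that let you easily create images from text prompts, KerasCV's offers a few distinct advantages. These include XLA compilation and mixed precision support, which together achieve state-of-the-art generation speed.

In this guide, we will explore KerasCV's Stable Diffusion implementation, show how to use these powerful performance boosts, and examine the performance benefits they offer.

To get started, let's install a few dependencies and sort out some imports: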

!pip install --upgrade keras-cv

import time
import keras_cv
from tensorflow import keras
import matplotlib.pyplot as plt

Introduction

Unlike most tutorials, where we first explain a topic and then show how to implement it, with text-to-image generation it is easier to show than to tell.

Check out the power of keras_cv.models.StableDiffusion().

First, we construct a model:

model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

Next, we give it a prompt:

images = model.text_to_image("photograph of an astronaut riding a horse", batch_size=3)


def plot_images(images):
    plt.figure(figsize=(20, 20))
    for i in range(len(images)):
        ax = plt.subplot(1, len(images), i + 1)
        plt.imshow(images[i])
        plt.axis("off")


plot_images(images)

25/25 [==============================] - 19s 317ms/step

Pretty incredible!

But that’s not all this model can do. Let’s try a more complex prompt:

images = model.text_to_image(
    "cute magical flying dog, fantasy art, "
    "golden color, high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting, mystery, adventure",
    batch_size=3,
)
plot_images(images)

25/25 [==============================] - 8s 316ms/step

The possibilities are literally endless (or at least extend to the boundaries of Stable Diffusion's latent manifold).

Wait, how does this even work?

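Unlike what you might expect at this point, Stable Diffusion doesn't actually run on magic. It's a kind of "latent diffusion model". Let's dig into what that means.

You may be familiar with the idea of super-resolution: it's possible to train a deep learning model to denoise an input image and thereby turn it into a higher-resolution version. The deep learning model doesn't do this by magically recovering the information that's missing from the noisy, low-resolution input. Rather, the model uses its training data distribution to hallucinate the visual details that would be most likely given the input. To learn more about super-resolution, you can check out the following Keras.io tutorials: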

Image Super-Resolution using an Efficient Sub-Pixel CNN
Enhanced Deep Residual Networks for single-image super-resolution

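When you push this idea to the limit, you may start asking: what if we just run such a model on pure noise? The model would then "denoise the noise" and start hallucinating a brand new image. By repeating the process multiple times, you can gradually turn a small patch of noise into a clear, high-resolution artificial picture.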

This is the key idea of latent diffusion, proposed in High-Resolution Image Synthesis with Latent Diffusion Models (first posted in 2021). To understand diffusion in depth, you can check the Keras.io tutorial Denoising Diffusion Implicit Models.

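Now, to go from latent diffusion to a text-to-image system, you still need to add one key feature: the ability to control the generated visual contents via prompt keywords. This is done via "conditioning", a classic deep learning technique which consists of concatenating to the noise patch a vector that represents a bit of text, then training the model on a dataset of {image: caption} pairs.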

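This gives rise to the Stable Diffusion architecture. Stable Diffusion consists of three parts:

A text encoder, which turns your prompt into a latent vector.
A diffusion model, which repeatedly "denoises" a 64×64 latent image patch.
A decoder, which turns the final 64×64 latent patch into a higher-resolution 512×512 image.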

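First, your text prompt is projected into a latent vector space by the text encoder, which is simply a pretrained, frozen language model. Then that prompt vector is concatenated to a randomly generated noise patch, which is repeatedly "denoised" by the diffusion model over a series of "steps" (the more steps you run, the clearer and nicer your image will be; the default value is 50 steps).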

Finally, the 64×64 latent image is sent through the decoder to render it properly in high resolution.
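The three-stage flow described above can be sketched with toy stand-ins to make the array shapes concrete. Nothing below is KerasCV code: `encode_text`, `denoise_step`, `decode`, and `text_to_image_sketch` are hypothetical names, the "denoiser" is a simple blend rather than a trained network, and the (77, 768) context shape and 4 latent channels are assumptions based on Stable Diffusion v1 that the text above does not state.

```python
import numpy as np

# Hypothetical stand-ins only; none of these functions are KerasCV APIs.
# Assumed shapes: 77 tokens x 768 dims for the prompt, 64x64x4 for the latent.
rng = np.random.default_rng(0)


def encode_text(prompt):
    """Stand-in text encoder: maps a prompt to a (77, 768) context array."""
    seed = sum(ord(ch) for ch in prompt)
    return np.random.default_rng(seed).standard_normal((77, 768)).astype("float32")


def denoise_step(latent, context):
    """Stand-in diffusion model: pulls the latent slightly toward a target."""
    target = np.full_like(latent, context.mean())
    return 0.9 * latent + 0.1 * target


def decode(latent):
    """Stand-in decoder: upsamples the 64x64 latent 8x to a 512x512 RGB image."""
    upsampled = latent[..., :3].repeat(8, axis=0).repeat(8, axis=1)
    value_range = upsampled.max() - upsampled.min()
    scaled = (upsampled - upsampled.min()) / (value_range + 1e-8)
    return (scaled * 255).astype("uint8")


def text_to_image_sketch(prompt, num_steps=50):
    context = encode_text(prompt)  # 1. text encoder
    latent = rng.standard_normal((64, 64, 4)).astype("float32")  # pure noise
    for _ in range(num_steps):  # 2. repeated "denoising"
        latent = denoise_step(latent, context)
    return decode(latent)  # 3. decoder


image = text_to_image_sketch("photograph of an astronaut riding a horse")
print(image.shape, image.dtype)
```

The output is not a meaningful picture; the point is only the order of the stages and the 64×64 to 512×512 change in resolution.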

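All in all, it's a pretty simple system: the Keras implementation fits in four files that total fewer than 500 lines of code.

But this relatively simple system starts looking like magic once you train it on billions of pictures and their captions. As Feynman said about the universe: “It’s not complicated, it’s just a lot of it!”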

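Perks of KerasCV

With several implementations of Stable Diffusion publicly available, why should you use keras_cv.models.StableDiffusion?

Aside from the easy-to-use API, KerasCV's Stable Diffusion model comes with some powerful advantages, including:

Graph mode execution
XLA compilation through jit_compile=True
Support for mixed precision computation

When these are combined, the KerasCV Stable Diffusion model runs orders of magnitude faster than naive implementations. This section shows how to activate all of these features, and the resulting performance gain from using them.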

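For the purposes of comparison, we ran benchmarks comparing the runtime of the HuggingFace diffusers implementation of Stable Diffusion against the KerasCV implementation. Both implementations were tasked with generating 3 images with a step count of 50 for each image. In this benchmark, we used a Tesla T4 GPU.

All of the benchmarks are open source on GitHub and may be re-run on Colab to reproduce the results. The benchmark results are displayed in the table below:

A 30% improvement in execution time on the Tesla T4! While the improvement is much smaller on the V100, we generally expect the benchmark results to consistently favor KerasCV across all NVIDIA GPUs.

For the sake of completeness, both cold-start and warm-start generation times are reported. Cold-start execution time includes the one-time cost of model creation and compilation, and is therefore negligible in a production environment (where you would reuse the same model instance many times). Regardless, here are the cold-start numbers:

Runtime results from running this guide may vary, but in our testing the KerasCV implementation of Stable Diffusion is significantly faster than its PyTorch counterpart. This may be largely attributed to XLA compilation.

Note : The performance benefits of each optimization can vary significantly between hardware setups.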
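To make the cold-start versus warm-start distinction concrete, here is a minimal timing sketch. It does not use KerasCV: `make_slow_to_start_model` is a hypothetical stand-in whose first call pays a one-time setup cost (playing the role of model creation, graph tracing, and compilation), and the sleep durations are arbitrary assumptions.

```python
import time


def make_slow_to_start_model(setup_seconds=0.2, step_seconds=0.01):
    """Hypothetical stand-in for a model with a one-time setup cost."""
    state = {"ready": False}

    def generate(prompt):
        if not state["ready"]:
            time.sleep(setup_seconds)  # one-time cost, paid on the first call only
            state["ready"] = True
        time.sleep(step_seconds)  # per-call work, paid on every call
        return f"image for: {prompt}"

    return generate


def timed(fn, *args):
    """Return the wall-clock seconds taken by a single call to fn."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


generate = make_slow_to_start_model()
cold = timed(generate, "warming up the model")  # includes the one-time cost
warm = timed(generate, "a cute otter")  # same instance, already warmed up
print(f"cold start: {cold:.2f}s, warm start: {warm:.2f}s")
```

This is the same pattern the benchmarks in this guide follow: run the model once to warm it up, then time a second call on the same instance.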

To get started, let’s first benchmark our unoptimized model:

benchmark_result = []
start = time.time()
images = model.text_to_image(
    "A cute otter in a rainbow whirlpool holding shells, watercolor",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["Standard", end - start])
plot_images(images)

print(f"Standard model: {(end - start):.2f} seconds")
keras.backend.clear_session()  # Clear session to preserve memory.

25/25 [==============================] - 8s 316ms/step
Standard model: 8.17 seconds

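Mixed precision

"Mixed precision" consists of performing computation using float16 precision while storing weights in the float32 format. This takes advantage of the fact that, on modern NVIDIA GPUs, float16 operations are backed by significantly faster kernels than their float32 counterparts.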
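The trade-off between the two formats can be seen with plain NumPy, independent of Keras. This is only an illustration of the number formats, not of how Keras implements its mixed precision policy: float16 takes half the memory of float32, but it cannot represent a small change to a value near 1.0, which is one reason to keep stored variables in float32.

```python
import numpy as np

# Storage: float16 uses half the bytes of float32.
weights = np.ones(1_000_000, dtype=np.float32)
half = weights.astype(np.float16)
print(weights.nbytes, half.nbytes)  # 4000000 2000000

# Precision: near 1.0, float16 values are about 0.001 apart,
# so adding 0.0001 is rounded away entirely.
small = 1e-4
lost_in_float16 = np.float16(1.0) + np.float16(small) == np.float16(1.0)
lost_in_float32 = np.float32(1.0) + np.float32(small) == np.float32(1.0)
print(lost_in_float16, lost_in_float32)  # True False
```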

Enabling mixed precision computation in Keras (and therefore for keras_cv.models.StableDiffusion) is as simple as calling:

keras.mixed_precision.set_global_policy("mixed_float16")

INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA A100-SXM4-40GB, compute capability 8.0

That’s all. Out of the box – it just works.

model = keras_cv.models.StableDiffusion()

print("Compute dtype:", model.diffusion_model.compute_dtype)
print(
    "Variable dtype:",
    model.diffusion_model.variable_dtype,
)

Compute dtype: float16
Variable dtype: float32

As you can see, the model constructed above now uses mixed precision computation: it leverages the speed of float16 operations for computation while storing variables in float32 precision.

# Warm up model to run graph tracing before benchmarking.
model.text_to_image("warming up the model", batch_size=3)

start = time.time()
images = model.text_to_image(
    "a cute magical flying dog, fantasy art, "
    "golden color, high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting, mystery, adventure",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["Mixed Precision", end - start])
plot_images(images)

print(f"Mixed precision model: {(end - start):.2f} seconds")
keras.backend.clear_session()

25/25 [==============================] - 15s 226ms/step
25/25 [==============================] - 6s 226ms/step
Mixed precision model: 6.02 seconds

XLA Compilation

TensorFlow comes with the XLA (Accelerated Linear Algebra) compiler built in. keras_cv.models.StableDiffusion supports a jit_compile argument out of the box. Setting this argument to True enables XLA compilation, resulting in a significant speed-up.

Let’s use this below:

# Set back to the default for benchmarking purposes.
keras.mixed_precision.set_global_policy("float32")

model = keras_cv.models.StableDiffusion(jit_compile=True)
# Before we benchmark the model, we run inference once to make sure the TensorFlow
# graph has already been traced.
images = model.text_to_image("An avocado armchair", batch_size=3)
plot_images(images)

25/25 [==============================] - 36s 245ms/step

Let’s benchmark our XLA model:

start = time.time()
images = model.text_to_image(
    "A cute otter in a rainbow whirlpool holding shells, watercolor",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["XLA", end - start])
plot_images(images)

print(f"With XLA: {(end - start):.2f} seconds")
keras.backend.clear_session()

25/25 [==============================] - 6s 245ms/step
With XLA: 6.27 seconds

On an A100 GPU, the runs above show roughly a 1.3x speedup over the standard model (6.27 seconds versus 8.17 seconds). Fantastic!

Putting it all together

So, how do you assemble the world's most performant Stable Diffusion inference pipeline (as of September 2022)?

With these two lines of code:

keras.mixed_precision.set_global_policy("mixed_float16")
model = keras_cv.models.StableDiffusion(jit_compile=True)

And to use it…

# Let's make sure to warm up the model
images = model.text_to_image(
    "Teddy bears conducting machine learning research",
    batch_size=3,
)
plot_images(images)

25/25 [==============================] - 39s 157ms/step

Exactly how fast is it? Let’s find out!

start = time.time()
images = model.text_to_image(
    "A mysterious dark stranger visits the great pyramids of egypt, "
    "high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["XLA + Mixed Precision", end - start])
plot_images(images)

print(f"XLA + mixed precision: {(end - start):.2f} seconds")

25/25 [==============================] - 4s 158ms/step
XLA + mixed precision: 4.25 seconds

Let’s check out the results:

print("{:<20} {:<20}".format("Model", "Runtime"))
for result in benchmark_result:
    name, runtime = result
    print("{:<20} {:<20}".format(name, runtime))

Model                 Runtime             
Standard              8.17177152633667    
Mixed Precision       6.022329568862915   
XLA                   6.265935659408569   
XLA + Mixed Precision 4.252242088317871

It only took our fully-optimized model four seconds to generate three novel images from a text prompt on an A100 GPU.
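The relative gains are easier to read as speedup factors. The short sketch below takes the four runtimes printed above, divides the standard runtime by each one, and prints an aligned table. Note that the benchmarked prompts differ between runs, so these ratios are indicative rather than a controlled comparison.

```python
# Runtimes in seconds, copied from the benchmark output above.
benchmark_result = [
    ["Standard", 8.17177152633667],
    ["Mixed Precision", 6.022329568862915],
    ["XLA", 6.265935659408569],
    ["XLA + Mixed Precision", 4.252242088317871],
]

# Speedup of each configuration relative to the unoptimized model.
baseline = benchmark_result[0][1]
speedups = {name: baseline / runtime for name, runtime in benchmark_result}

# Size the first column to the longest name so the rows line up.
width = max(len(name) for name, _ in benchmark_result)
print(f"{'Model':<{width}}  {'Runtime (s)':>11}  {'Speedup':>7}")
for name, runtime in benchmark_result:
    print(f"{name:<{width}}  {runtime:>11.2f}  {speedups[name]:>6.2f}x")
```

With these numbers, mixed precision alone gives about 1.36x, XLA alone about 1.30x, and the two together about 1.92x.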

Conclusions

KerasCV offers a state-of-the-art implementation of Stable Diffusion, and through the use of XLA and mixed precision it delivers the fastest Stable Diffusion pipeline available as of September 2022.

Normally, at the end of a keras.io tutorial we leave you with some future directions for further learning. This time, we leave you with one idea:

Go run your own prompts through the model! It is an absolute blast!

If you have your own NVIDIA GPU, or an M1 MacBook Pro, you can also run the model locally on your machine. (Note that when running on an M1 MacBook Pro, you should not enable mixed precision, as it is not yet well supported by Apple's Metal runtime.)
