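TensorFlow 2.0: Advanced Tutorials: Text: Image Captioning with Visual Attention (Translation/Commentary)

Translation: ClassCat Co., Ltd. Sales Information
Created: 11/13/2019

* This page is a translation, with supplementary explanations where appropriate, of the following page from TF 2.0 - Advanced Tutorials - Text on the TensorFlow org site:

Image captioning with visual attention

* The sample code has been checked to run, and has been extended or modified where necessary.
* Feel free to link to this page; a short note to sales-info@classcat.com would be appreciated.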

Text: Image Captioning with Visual Attention

Given an image like the example below, the goal is to generate a caption such as "a surfer riding on a wave".

Image source; license: Public Domain

To accomplish this, you use an attention-based model, which lets you see which parts of the image the model focuses on as it generates a caption.

The model architecture is similar to Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.

This notebook is an end-to-end example. When you run it, it downloads the MS-COCO dataset, preprocesses and caches a subset of images using Inception V3, trains an encoder-decoder model, and generates captions for new images using the trained model.

In this example, you train the model on a relatively small amount of data: the first 30,000 captions, which cover about 20,000 images (the dataset has multiple captions per image).

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

# You'll generate plots of attention in order to see which parts of an image
# our model focuses on during captioning
import matplotlib.pyplot as plt

# Scikit-learn includes many helpful utilities
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

import re
import numpy as np
import os
import time
import json
from glob import glob
from PIL import Image
import pickle

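Download and prepare the MS-COCO dataset

You use the MS-COCO dataset to train the model. The dataset contains over 82,000 images, each of which has at least 5 different caption annotations. The code below downloads and extracts the dataset automatically.

Caution: large download ahead. You use the training set, which is a 13 GB file.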

annotation_zip = tf.keras.utils.get_file('captions.zip',
                                          cache_subdir=os.path.abspath('.'),
                                          origin = 'http://images.cocodataset.org/annotations/annotations_trainval2014.zip',
                                          extract = True)
annotation_file = os.path.dirname(annotation_zip)+'/annotations/captions_train2014.json'

name_of_zip = 'train2014.zip'
if not os.path.exists(os.path.abspath('.') + '/' + name_of_zip):
  image_zip = tf.keras.utils.get_file(name_of_zip,
                                      cache_subdir=os.path.abspath('.'),
                                      origin = 'http://images.cocodataset.org/zips/train2014.zip',
                                      extract = True)
  PATH = os.path.dirname(image_zip)+'/train2014/'
else:
  PATH = os.path.abspath('.')+'/train2014/'

Downloading data from http://images.cocodataset.org/annotations/annotations_trainval2014.zip
252878848/252872794 [==============================] - 8s 0us/step
Downloading data from http://images.cocodataset.org/zips/train2014.zip
12593045504/13510573713 [==========================>...] - ETA: 27s

Optional: limit the size of the training set

To speed up training for this tutorial, you use a subset of 30,000 captions and their corresponding images to train the model. Using more data would result in better captioning quality.

# Read the json file
with open(annotation_file, 'r') as f:
    annotations = json.load(f)

# Store captions and image names in vectors
all_captions = []
all_img_name_vector = []

for annot in annotations['annotations']:
    caption = '<start> ' + annot['caption'] + ' <end>'
    image_id = annot['image_id']
    full_coco_image_path = PATH + 'COCO_train2014_' + '%012d.jpg' % (image_id)

    all_img_name_vector.append(full_coco_image_path)
    all_captions.append(caption)

# Shuffle captions and image_names together
# Set a random state
train_captions, img_name_vector = shuffle(all_captions,
                                          all_img_name_vector,
                                          random_state=1)

# Select the first 30000 captions from the shuffled set
num_examples = 30000
train_captions = train_captions[:num_examples]
img_name_vector = img_name_vector[:num_examples]

len(train_captions), len(all_captions)

(30000, 414113)

Preprocess the images using InceptionV3

Next, you run each image through InceptionV3 (pretrained on Imagenet) and extract features from its last convolutional layer.

First, convert the images into the format InceptionV3 expects:
* Resize the image to 299px by 299px.
* Preprocess the image with the preprocess_input method to normalize it so that its pixels fall in the range -1 to 1, which matches the format of the images used to train InceptionV3.

def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path
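
The normalization step can be illustrated without TensorFlow. The sketch below assumes that the InceptionV3 preprocess_input scaling is equivalent to dividing by 127.5 and subtracting 1; the helper name scale_to_unit_range is hypothetical and only stands in for that call.

```python
import numpy as np

def scale_to_unit_range(pixels):
    """Map pixel values in [0, 255] to floats in [-1, 1]."""
    pixels = np.asarray(pixels, dtype=np.float32)
    return pixels / 127.5 - 1.0

# darkest, mid-gray and brightest pixel values
sample = np.array([0.0, 127.5, 255.0], dtype=np.float32)
scaled = scale_to_unit_range(sample)
print(scaled)
```

The darkest pixel maps to -1, the brightest to 1, and mid-gray to 0.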

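Initialize InceptionV3 and load the pretrained Imagenet weights

Now you create a tf.keras model whose output layer is the last convolutional layer in the InceptionV3 architecture. The shape of the output of this layer is 8x8x2048. You use the last convolutional layer because you are using attention in this example. You do not perform this step during training, because it could become a bottleneck.

You forward each image through the network and save the resulting features to disk, one .npy file per image, named after the image path.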

image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output

image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.5/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5
87916544/87910968 [==============================] - 3s 0us/step

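Cache the features extracted from InceptionV3

You preprocess each image with InceptionV3 and cache the output to disk. Caching the output in RAM would be faster but also memory intensive, requiring 8 * 8 * 2048 floats per image. At the time of writing, this exceeds the memory limits of Colab (currently 12 GB of memory).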
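
As a rough check on that memory figure, the arithmetic can be written out. This is a back-of-the-envelope sketch that assumes 4-byte float32 values and takes the "about 20,000 images" figure from the introduction; the constant names are illustrative.

```python
FLOATS_PER_IMAGE = 8 * 8 * 2048   # feature map size stated in the text
BYTES_PER_FLOAT = 4               # assumption: features are float32
APPROX_IMAGES = 20_000            # approximate count of unique images

bytes_per_image = FLOATS_PER_IMAGE * BYTES_PER_FLOAT
total_gib = bytes_per_image * APPROX_IMAGES / 2**30

print(bytes_per_image)        # 524288 bytes, i.e. 0.5 MiB per image
print(round(total_gib, 2))    # 9.77
```

Under these assumptions the features alone take close to 10 GiB, which leaves very little of a 12 GB budget for the model, the dataset pipeline and the Python process itself.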

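Performance could be improved with a more sophisticated caching strategy (for example, sharding the images to reduce random-access disk I/O), but that would require more code.

The caching takes about 10 minutes to run in Colab with a GPU. If you would like to see a progress bar, you can do the following: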

* Install tqdm:
  !pip install -q tqdm
* Import tqdm:
  from tqdm import tqdm
* Change the following line:
  for img, path in image_dataset:
  to:
  for img, path in tqdm(image_dataset):

# Get unique images
encode_train = sorted(set(img_name_vector))

# Feel free to change batch_size according to your system configuration
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(
  load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(16)

for img, path in image_dataset:
  batch_features = image_features_extract_model(img)
  batch_features = tf.reshape(batch_features,
                              (batch_features.shape[0], -1, batch_features.shape[3]))

  for bf, p in zip(batch_features, path):
    path_of_feature = p.numpy().decode("utf-8")
    np.save(path_of_feature, bf.numpy())
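
The reshape in the loop above flattens the 8x8 spatial grid into 64 locations while keeping the channel axis intact. A small NumPy sketch of the same operation, using 4 channels as a stand-in for 2048 to keep the arrays small:

```python
import numpy as np

# toy batch: 2 images, 8x8 grid, 4 channels (the tutorial has 2048 channels)
batch = np.arange(2 * 8 * 8 * 4, dtype=np.float32).reshape(2, 8, 8, 4)

flat = batch.reshape(batch.shape[0], -1, batch.shape[3])
print(flat.shape)  # (2, 64, 4)

# grid cell (row 3, column 5) ends up at flat index 3 * 8 + 5
same = np.array_equal(flat[0, 3 * 8 + 5], batch[0, 3, 5])
print(same)  # True
```

Each of the 64 rows is the feature vector of one image region, which is what the attention mechanism later weighs.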

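Preprocess and tokenize the captions

* First, you tokenize the captions (for example, by splitting on spaces). This gives a vocabulary of all the unique words in the data (for example, "surfing", "football", and so on).
* Next, you limit the vocabulary size to the top 5,000 words (to save memory). You replace all other words with the token "UNK" (unknown).
* You then create word-to-index and index-to-word mappings.
* Finally, you pad all sequences to the same length as the longest one.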

# Find the maximum length of any caption in our dataset
def calc_max_length(tensor):
    return max(len(t) for t in tensor)

# Choose the top 5000 words from the vocabulary
top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token="<unk>",
                                                  filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
tokenizer.fit_on_texts(train_captions)
train_seqs = tokenizer.texts_to_sequences(train_captions)

tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

# Create the tokenized vectors
train_seqs = tokenizer.texts_to_sequences(train_captions)

# Pad each vector to the max_length of the captions
# If you do not provide a max_length value, pad_sequences calculates it automatically
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')

# Calculates the max_length, which is used to store the attention weights
max_length = calc_max_length(train_seqs)
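
The four steps (tokenize, cap the vocabulary, build the mappings, pad) can be sketched in plain Python. This is a simplified illustration, not how the Keras Tokenizer is implemented: it skips lowercasing and punctuation filtering, and the two captions and the tiny top_k value are made up for the example.

```python
from collections import Counter

captions = ["<start> a man riding a wave <end>",
            "<start> a dog surfing <end>"]
top_k = 6  # hypothetical tiny vocabulary size; the tutorial uses 5000

# 1. tokenize by splitting on spaces and count word frequencies
counts = Counter(word for c in captions for word in c.split())

# 2./3. keep the most frequent words; index 0 is padding, index 1 is unknown
word_index = {"<pad>": 0, "<unk>": 1}
for word, _ in counts.most_common(top_k - 2):
    word_index[word] = len(word_index)
index_word = {i: w for w, i in word_index.items()}

# map every caption to indices, sending rare words to <unk>
seqs = [[word_index.get(w, word_index["<unk>"]) for w in c.split()]
        for c in captions]

# 4. pad at the end ("post") to the length of the longest caption
max_len = max(len(s) for s in seqs)
padded = [s + [0] * (max_len - len(s)) for s in seqs]
print(padded)
```

Words outside the kept vocabulary ("dog", "surfing") all share the `<unk>` index, and the shorter caption is filled with zeros, which the loss function later masks out.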

Split the data into training and testing

# Create training and validation sets using an 80-20 split
img_name_train, img_name_val, cap_train, cap_val = train_test_split(img_name_vector,
                                                                    cap_vector,
                                                                    test_size=0.2,
                                                                    random_state=0)

len(img_name_train), len(cap_train), len(img_name_val), len(cap_val)

(24000, 24000, 6000, 6000)

Create a tf.data dataset for training

Your images and captions are ready! Next, create a tf.data dataset to use for training the model.

# Feel free to change these parameters according to your system's configuration

BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = len(tokenizer.word_index) + 1
num_steps = len(img_name_train) // BATCH_SIZE
# Shape of the vector extracted from InceptionV3 is (64, 2048)
# These two variables represent that vector shape
features_shape = 2048
attention_features_shape = 64

# Load the numpy files
def map_func(img_name, cap):
  img_tensor = np.load(img_name.decode('utf-8')+'.npy')
  return img_tensor, cap

dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))

# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item1, item2: tf.numpy_function(
          map_func, [item1, item2], [tf.float32, tf.int32]),
          num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Shuffle and batch
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

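Model

Fun fact: the decoder below is identical to the one in the example for Neural Machine Translation with Attention.

The model architecture is inspired by the Show, Attend and Tell paper.

* In this example, you extract the features from the last convolutional layer of InceptionV3, which gives a vector of shape (8, 8, 2048).
* You squash that to a shape of (64, 2048).
* This vector is then passed through the CNN Encoder (which consists of a single fully connected layer).
* The RNN (here a GRU) attends over the image to predict the next word.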

class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, features, hidden):
    # features(CNN_encoder output) shape == (batch_size, 64, embedding_dim)

    # hidden shape == (batch_size, hidden_size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
    hidden_with_time_axis = tf.expand_dims(hidden, 1)

    # score shape == (batch_size, 64, hidden_size)
    score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

    # attention_weights shape == (batch_size, 64, 1)
    # you get 1 at the last axis because you are applying score to self.V
    attention_weights = tf.nn.softmax(self.V(score), axis=1)

    # context_vector shape after sum == (batch_size, embedding_dim)
    context_vector = attention_weights * features
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights
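
To make the shapes in the attention computation concrete, here is a NumPy sketch of the same arithmetic. Random matrices stand in for the learned Dense kernels, biases are omitted, and all sizes are toy values chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, locations, feat_dim, hidden_dim, units = 2, 64, 16, 8, 4  # toy sizes

features = rng.normal(size=(batch, locations, feat_dim))  # encoder output
hidden = rng.normal(size=(batch, hidden_dim))             # decoder state

# stand-ins for the kernels of W1, W2 and V
W1 = rng.normal(size=(feat_dim, units))
W2 = rng.normal(size=(hidden_dim, units))
V = rng.normal(size=(units, 1))

# additive score: the hidden state is broadcast over the 64 locations
score = np.tanh(features @ W1 + (hidden @ W2)[:, None, :])  # (batch, 64, units)

# softmax over the location axis, so each image's weights sum to 1
logits = score @ V                                          # (batch, 64, 1)
logits = logits - logits.max(axis=1, keepdims=True)         # numerical stability
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# weighted sum of the location features
context = (weights * features).sum(axis=1)                  # (batch, feat_dim)
print(weights.shape, context.shape)
```

The softmax runs over axis 1 (the 64 locations), which is why the weights can be reshaped to an 8x8 map and drawn over the image later.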

class CNN_Encoder(tf.keras.Model):
    # Since you have already extracted the features and saved them as .npy files
    # This encoder passes those features through a Fully connected layer
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        # shape after fc == (batch_size, 64, embedding_dim)
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x

class RNN_Decoder(tf.keras.Model):
  def __init__(self, embedding_dim, units, vocab_size):
    super(RNN_Decoder, self).__init__()
    self.units = units

    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc1 = tf.keras.layers.Dense(self.units)
    self.fc2 = tf.keras.layers.Dense(vocab_size)

    self.attention = BahdanauAttention(self.units)

  def call(self, x, features, hidden):
    # defining attention as a separate model
    context_vector, attention_weights = self.attention(features, hidden)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + embedding_dim)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # shape == (batch_size, max_length, hidden_size)
    x = self.fc1(output)

    # x shape == (batch_size * max_length, hidden_size)
    x = tf.reshape(x, (-1, x.shape[2]))

    # output shape == (batch_size * max_length, vocab)
    x = self.fc2(x)

    return x, state, attention_weights

  def reset_state(self, batch_size):
    return tf.zeros((batch_size, self.units))

encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)
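
The mask in loss_function zeroes the loss at padded positions. A NumPy sketch with made-up per-token losses shows the effect, along with one alternative normalization for comparison; the alternative is not what the tutorial code does.

```python
import numpy as np

real = np.array([5, 3, 0, 0])                       # 0 marks padding
per_token_loss = np.array([0.2, 0.4, 1.5, 2.5])     # hypothetical loss values

mask = (real != 0).astype(per_token_loss.dtype)
masked = per_token_loss * mask                      # padded entries become 0

# what loss_function returns: mean over all entries, padded ones included
mean_over_batch = masked.mean()

# alternative (not used in the tutorial): mean over real tokens only
mean_over_real_tokens = masked.sum() / mask.sum()

print(mean_over_batch, mean_over_real_tokens)
```

Because tf.reduce_mean divides by the full batch size, time steps where most captions have already ended contribute a smaller loss than the real tokens alone would suggest.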

Checkpoint

checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer = optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

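start_epoch = 0
if ckpt_manager.latest_checkpoint:
  start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
  # restore the saved weights so that training actually resumes from them
  ckpt.restore(ckpt_manager.latest_checkpoint)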
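
The start_epoch value comes from the numeric suffix of the checkpoint path. The sketch below isolates that parsing step; the path shown is a hypothetical example of what ckpt_manager.latest_checkpoint might contain, and checkpoint_number is an illustrative helper name.

```python
def checkpoint_number(path):
    """Return the trailing counter from a path such as './checkpoints/train/ckpt-3'."""
    return int(path.split('-')[-1])

latest = './checkpoints/train/ckpt-3'  # hypothetical latest checkpoint path
saves = checkpoint_number(latest)
print(saves)  # 3
```

Note that this suffix counts how many times the checkpoint has been saved. Since the training loop saves only every fifth epoch, the number is not necessarily the number of epochs completed.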

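Training

* You extract the features stored in the respective .npy files and then pass those features through the encoder.
* The encoder output, the hidden state (initialized to 0) and the decoder input (which is the start token) are passed to the decoder.
* The decoder returns the predictions and the decoder hidden state.
* The decoder hidden state is then passed back into the model, and the predictions are used to calculate the loss.
* You use teacher forcing to decide the next input to the decoder.
* Teacher forcing is the technique where the target word is passed as the next input to the decoder.
* The final step is to compute the gradients by backpropagation and apply them with the optimizer.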
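
Teacher forcing can be shown on a single caption. In this plain-Python sketch the helper teacher_forcing_pairs is hypothetical; it lists, for each time step, the word fed to the decoder and the word the decoder is expected to predict.

```python
def teacher_forcing_pairs(target):
    """Return (decoder_input, expected_output) pairs for one caption."""
    return [(target[i - 1], target[i]) for i in range(1, len(target))]

caption = ['<start>', 'a', 'surfer', 'riding', '<end>']
pairs = teacher_forcing_pairs(caption)
for dec_input, expected in pairs:
    print(dec_input, '->', expected)
```

The decoder input at every step is the ground-truth previous word, regardless of what the model predicted, which is what the training step does when it assigns target[:, i] to dec_input.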

# adding this in a separate cell because if you run the training cell
# many times, the loss_plot array will be reset
loss_plot = []

@tf.function
def train_step(img_tensor, target):
  loss = 0

  # initializing the hidden state for each batch
  # because the captions are not related from image to image
  hidden = decoder.reset_state(batch_size=target.shape[0])

  dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)

  with tf.GradientTape() as tape:
      features = encoder(img_tensor)

      for i in range(1, target.shape[1]):
          # passing the features through the decoder
          predictions, hidden, _ = decoder(dec_input, features, hidden)

          loss += loss_function(target[:, i], predictions)

          # using teacher forcing
          dec_input = tf.expand_dims(target[:, i], 1)

  total_loss = (loss / int(target.shape[1]))

  trainable_variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, trainable_variables)

  optimizer.apply_gradients(zip(gradients, trainable_variables))

  return loss, total_loss

EPOCHS = 20

for epoch in range(start_epoch, EPOCHS):
    start = time.time()
    total_loss = 0

    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss += t_loss

        if batch % 100 == 0:
            print ('Epoch {} Batch {} Loss {:.4f}'.format(
              epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))
    # storing the epoch end loss value to plot later
    loss_plot.append(total_loss / num_steps)

    if epoch % 5 == 0:
      ckpt_manager.save()

    print ('Epoch {} Loss {:.6f}'.format(epoch + 1,
                                         total_loss/num_steps))
    print ('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Epoch 1 Batch 0 Loss 2.0988
Epoch 1 Batch 100 Loss 1.1463
Epoch 1 Batch 200 Loss 1.0366
Epoch 1 Batch 300 Loss 0.9083
Epoch 1 Loss 1.085102
Time taken for 1 epoch 131.9147825241089 sec

Epoch 2 Batch 0 Loss 0.8748
Epoch 2 Batch 100 Loss 0.7652
Epoch 2 Batch 200 Loss 0.7708
Epoch 2 Batch 300 Loss 0.7578
Epoch 2 Loss 0.812931
Time taken for 1 epoch 49.82423710823059 sec

Epoch 3 Batch 0 Loss 0.8118
Epoch 3 Batch 100 Loss 0.7946
Epoch 3 Batch 200 Loss 0.7396
Epoch 3 Batch 300 Loss 0.6746
Epoch 3 Loss 0.735972
Time taken for 1 epoch 49.87708878517151 sec

Epoch 4 Batch 0 Loss 0.6586
Epoch 4 Batch 100 Loss 0.7100
Epoch 4 Batch 200 Loss 0.6617
Epoch 4 Batch 300 Loss 0.7083
Epoch 4 Loss 0.688551
Time taken for 1 epoch 49.925899028778076 sec

Epoch 5 Batch 0 Loss 0.6543
Epoch 5 Batch 100 Loss 0.6995
Epoch 5 Batch 200 Loss 0.6268
Epoch 5 Batch 300 Loss 0.6577
Epoch 5 Loss 0.650644
Time taken for 1 epoch 49.861159801483154 sec

Epoch 6 Batch 0 Loss 0.6031
Epoch 6 Batch 100 Loss 0.5955
Epoch 6 Batch 200 Loss 0.6627
Epoch 6 Batch 300 Loss 0.5704
Epoch 6 Loss 0.617129
Time taken for 1 epoch 50.251293659210205 sec

Epoch 7 Batch 0 Loss 0.5712
Epoch 7 Batch 100 Loss 0.5685
Epoch 7 Batch 200 Loss 0.5779
Epoch 7 Batch 300 Loss 0.5544
Epoch 7 Loss 0.586807
Time taken for 1 epoch 50.4225971698761 sec

Epoch 8 Batch 0 Loss 0.5429
Epoch 8 Batch 100 Loss 0.5585
Epoch 8 Batch 200 Loss 0.5514
Epoch 8 Batch 300 Loss 0.5229
Epoch 8 Loss 0.555478
Time taken for 1 epoch 50.04306435585022 sec

Epoch 9 Batch 0 Loss 0.5150
Epoch 9 Batch 100 Loss 0.5001
Epoch 9 Batch 200 Loss 0.5294
Epoch 9 Batch 300 Loss 0.5434
Epoch 9 Loss 0.526150
Time taken for 1 epoch 49.535995960235596 sec

Epoch 10 Batch 0 Loss 0.4677
Epoch 10 Batch 100 Loss 0.5044
Epoch 10 Batch 200 Loss 0.4583
Epoch 10 Batch 300 Loss 0.4794
Epoch 10 Loss 0.496457
Time taken for 1 epoch 50.2047655582428 sec

Epoch 11 Batch 0 Loss 0.4150
Epoch 11 Batch 100 Loss 0.4491
Epoch 11 Batch 200 Loss 0.4283
Epoch 11 Batch 300 Loss 0.4874
Epoch 11 Loss 0.465688
Time taken for 1 epoch 50.450185775756836 sec

Epoch 12 Batch 0 Loss 0.4305
Epoch 12 Batch 100 Loss 0.4535
Epoch 12 Batch 200 Loss 0.4198
Epoch 12 Batch 300 Loss 0.4154
Epoch 12 Loss 0.437214
Time taken for 1 epoch 49.61044931411743 sec

Epoch 13 Batch 0 Loss 0.4156
Epoch 13 Batch 100 Loss 0.4067
Epoch 13 Batch 200 Loss 0.4412
Epoch 13 Batch 300 Loss 0.4066
Epoch 13 Loss 0.429518
Time taken for 1 epoch 50.13954949378967 sec

Epoch 14 Batch 0 Loss 0.3823
Epoch 14 Batch 100 Loss 0.4156
Epoch 14 Batch 200 Loss 0.3560
Epoch 14 Batch 300 Loss 0.4084
Epoch 14 Loss 0.387618
Time taken for 1 epoch 49.05424618721008 sec

Epoch 15 Batch 0 Loss 0.3724
Epoch 15 Batch 100 Loss 0.3452
Epoch 15 Batch 200 Loss 0.3371
Epoch 15 Batch 300 Loss 0.3183
Epoch 15 Loss 0.358968
Time taken for 1 epoch 49.87037777900696 sec

Epoch 16 Batch 0 Loss 0.3415
Epoch 16 Batch 100 Loss 0.3094
Epoch 16 Batch 200 Loss 0.3534
Epoch 16 Batch 300 Loss 0.3220
Epoch 16 Loss 0.340680
Time taken for 1 epoch 50.09799098968506 sec

Epoch 17 Batch 0 Loss 0.3501
Epoch 17 Batch 100 Loss 0.3355
Epoch 17 Batch 200 Loss 0.3027
Epoch 17 Batch 300 Loss 0.3440
Epoch 17 Loss 0.318385
Time taken for 1 epoch 49.605764865875244 sec

Epoch 18 Batch 0 Loss 0.3254
Epoch 18 Batch 100 Loss 0.3095
Epoch 18 Batch 200 Loss 0.2968
Epoch 18 Batch 300 Loss 0.2670
Epoch 18 Loss 0.295994
Time taken for 1 epoch 49.70194149017334 sec

Epoch 19 Batch 0 Loss 0.3094
Epoch 19 Batch 100 Loss 0.3093
Epoch 19 Batch 200 Loss 0.2804
Epoch 19 Batch 300 Loss 0.2976
Epoch 19 Loss 0.278130
Time taken for 1 epoch 49.86575794219971 sec

Epoch 20 Batch 0 Loss 0.2911
Epoch 20 Batch 100 Loss 0.2470
Epoch 20 Batch 200 Loss 0.2651
Epoch 20 Batch 300 Loss 0.2656
Epoch 20 Loss 0.258760
Time taken for 1 epoch 50.28017234802246 sec

plt.plot(loss_plot)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.show()

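Caption!

* The evaluate function is similar to the training loop, except that you don't use teacher forcing here. The input to the decoder at each time step is its previous prediction, along with the hidden state and the encoder output.
* Stop predicting when the model predicts the end token.
* Store the attention weights for every time step.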
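
The decoding loop described here is greedy: each step feeds the previous prediction back in. The sketch below assumes a stand-in step function (a fixed next-word table) in place of the real decoder, and the name greedy_decode is illustrative.

```python
def greedy_decode(step_fn, start_token, end_token, max_length):
    """Feed each prediction back in until end_token or max_length is reached."""
    result = []
    token = start_token
    for _ in range(max_length):
        token = step_fn(token)      # the model's prediction becomes the next input
        result.append(token)
        if token == end_token:
            break
    return result

# hypothetical stand-in for the decoder: a fixed next-word table
next_word = {'<start>': 'a', 'a': 'surfer', 'surfer': '<end>'}
caption = greedy_decode(next_word.get, '<start>', '<end>', max_length=10)
print(caption)  # ['a', 'surfer', '<end>']
```

The max_length bound matters because a model that never emits the end token would otherwise loop forever.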

def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))

    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))

    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()
        result.append(tokenizer.index_word[predicted_id])

        if tokenizer.index_word[predicted_id] == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot

def plot_attention(image, result, attention_plot):
    temp_image = np.array(Image.open(image))

    fig = plt.figure(figsize=(10, 10))

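    len_result = len(result)
    # square grid large enough to hold one subplot per predicted word
    grid_size = max(int(np.ceil(np.sqrt(len_result))), 1)
    for l in range(len_result):
        temp_att = np.resize(attention_plot[l], (8, 8))
        ax = fig.add_subplot(grid_size, grid_size, l+1)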
        ax.set_title(result[l])
        img = ax.imshow(temp_image)
        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())

    plt.tight_layout()
    plt.show()

# captions on the validation set
rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)

print ('Real Caption:', real_caption)
print ('Prediction Caption:', ' '.join(result))
plot_attention(image, result, attention_plot)
# opening the image
Image.open(img_name_val[rid])

Real Caption: <start> a bike parked next to a bucket filled with lots of oranges <end>
Prediction Caption: a man sitting in the banana <end>

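Try it on your own images

For fun, below is a method you can use to caption your own images with the model you have just trained. Keep in mind that it was trained on a relatively small amount of data, and your images may differ from the training data (so be prepared for strange results!).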

image_url = 'https://tensorflow.org/images/surf.jpg'
image_extension = image_url[-4:]
image_path = tf.keras.utils.get_file('image'+image_extension,
                                     origin=image_url)

result, attention_plot = evaluate(image_path)
print ('Prediction Caption:', ' '.join(result))
plot_attention(image_path, result, attention_plot)
# opening the image
Image.open(image_path)

Downloading data from https://tensorflow.org/images/surf.jpg
65536/64400 [==============================] - 0s 2us/step
Prediction Caption: a man on a surfboard riding a surfboard <end>
