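Keras 2 : examples : Video Classification with a CNN-RNN Architecture (translation / commentary)

Translation: ClassCat Co., Ltd. Sales Information
Created: 12/20/2021 (keras 2.7.0)

* This page is a translation, with supplementary explanations where appropriate, of the following Keras document:

Code examples : Computer Vision : Video Classification with a CNN-RNN Architecture (Author: Sayak Paul)

* The sample code has been verified to run; where necessary it has been extended or modified.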

Description: Training a video classifier with transfer learning and a recurrent model on the UCF101 dataset.

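This example demonstrates video classification, an important use case with applications in recommendations, security, and so on. We will use the UCF101 dataset to build our video classifier. The dataset consists of videos categorized into different actions, such as cricket shot, punching, and biking. It is commonly used to build action recognizers, which are an application of video classification.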

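A video consists of an ordered sequence of frames. Each frame contains spatial information, and the sequence of those frames contains temporal information. To model both of these aspects, we use a hybrid architecture that consists of convolutions (for spatial processing) and recurrent layers (for temporal processing). Specifically, we use a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) consisting of GRU layers. This kind of hybrid architecture is commonly known as a CNN-RNN.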

This example requires TensorFlow 2.5 or higher, as well as TensorFlow Docs, which can be installed using the following command:

!pip install -q git+https://github.com/tensorflow/docs

Data collection

To keep the runtime of this example relatively short, we will use a subsampled version of the original UCF101 dataset. Refer to this notebook to see how the subsampling was done.

!wget -q https://git.io/JGc31 -O ucf101_top5.tar.gz
!tar xf ucf101_top5.tar.gz

Setup

from tensorflow_docs.vis import embed
from tensorflow import keras
from imutils import paths

import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import numpy as np
import imageio
import cv2
import os

Define hyperparameters

IMG_SIZE = 224
BATCH_SIZE = 64
EPOCHS = 10

MAX_SEQ_LENGTH = 20
NUM_FEATURES = 2048

Data preparation

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

print(f"Total videos for training: {len(train_df)}")
print(f"Total videos for testing: {len(test_df)}")

train_df.sample(10)

Total videos for training: 594
Total videos for testing: 224

	video_name	tag
149	v_PlayingCello_g12_c05.avi	PlayingCello
317	v_Punch_g19_c05.avi	Punch
438	v_ShavingBeard_g20_c03.avi	ShavingBeard
559	v_TennisSwing_g20_c02.avi	TennisSwing
368	v_ShavingBeard_g09_c03.avi	ShavingBeard
241	v_Punch_g08_c04.avi	Punch
398	v_ShavingBeard_g14_c03.avi	ShavingBeard
111	v_CricketShot_g25_c01.avi	CricketShot
119	v_PlayingCello_g08_c02.avi	PlayingCello
249	v_Punch_g09_c05.avi	Punch

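One of the many challenges of training video classifiers is figuring out a way to feed the videos to a network. This blog post discusses five such methods. Since a video is an ordered sequence of frames, we could just extract the frames and put them in a 3D tensor. However, the number of frames may differ from video to video, which would prevent us from stacking them into batches (unless we use padding). As an alternative, we can save video frames at a fixed interval until a maximum frame count is reached. In this example we will do the following: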

Capture the frames of a video.
Extract frames from the video until a maximum frame count is reached.
If the video's frame count is less than the maximum frame count, pad the video with zeros.
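
The truncate-or-pad step above can be sketched in plain NumPy. This is only an illustration under assumptions: `pad_features` is a hypothetical helper that is not part of the original example, and it operates on per-frame feature vectors (as the example does later) rather than on raw frames. It also returns a boolean mask marking which timesteps hold real data.

```python
import numpy as np

MAX_SEQ_LENGTH = 20  # same value as in the example's hyperparameters
NUM_FEATURES = 2048  # same value as in the example's hyperparameters


def pad_features(features, max_len=MAX_SEQ_LENGTH):
    """Truncate or zero-pad a (num_frames, num_features) array to max_len rows.

    Returns the fixed-length array and a boolean mask (True = real frame).
    Hypothetical helper, written for illustration only.
    """
    length = min(max_len, features.shape[0])
    padded = np.zeros((max_len, features.shape[1]), dtype="float32")
    mask = np.zeros((max_len,), dtype="bool")
    padded[:length] = features[:length]
    mask[:length] = True
    return padded, mask


short_clip = np.ones((12, NUM_FEATURES), dtype="float32")  # fewer than 20 frames
long_clip = np.ones((35, NUM_FEATURES), dtype="float32")   # more than 20 frames

padded_short, mask_short = pad_features(short_clip)
padded_long, mask_long = pad_features(long_clip)
print(padded_short.shape, int(mask_short.sum()))
print(padded_long.shape, int(mask_long.sum()))
```

Both clips end up with the same shape, so they can be stacked into one batch; the mask records how many leading timesteps are real.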

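Note that this workflow is identical to problems involving text sequences. Videos in the UCF101 dataset are known not to contain extreme variations in objects and actions across frames. Because of this, it may be enough to consider only a few frames for the learning task. However, this approach may not generalize well to other video classification problems. We will use OpenCV's VideoCapture() to read frames from the videos.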

# The following two methods are taken from this tutorial:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub


def crop_center_square(frame):
    y, x = frame.shape[0:2]
    min_dim = min(y, x)
    start_x = (x // 2) - (min_dim // 2)
    start_y = (y // 2) - (min_dim // 2)
    return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]


def load_video(path, max_frames=0, resize=(IMG_SIZE, IMG_SIZE)):
    cap = cv2.VideoCapture(path)
    frames = []
    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                break
            frame = crop_center_square(frame)
            frame = cv2.resize(frame, resize)
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)

            if len(frames) == max_frames:
                break
    finally:
        cap.release()
    return np.array(frames)
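
The text above mentions saving frames at a fixed interval, whereas `load_video` keeps consecutive frames from the start of the clip. A minimal sketch of evenly spaced sampling follows; `sample_frame_indices` is a hypothetical helper introduced here for illustration and is not part of the original example.

```python
import numpy as np


def sample_frame_indices(num_frames, max_frames):
    """Return up to `max_frames` evenly spaced frame indices in [0, num_frames)."""
    if num_frames <= max_frames:
        # Short clip: keep every frame.
        return np.arange(num_frames)
    # Long clip: spread the indices over the whole duration.
    return np.linspace(0, num_frames - 1, num=max_frames).round().astype(int)


indices = sample_frame_indices(num_frames=100, max_frames=20)
print(indices)
```

The resulting indices could be used to index the array returned by `load_video`, at the cost of decoding every frame first.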

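We can use a pre-trained network to extract meaningful features from the extracted frames. The Keras Applications module provides a number of state-of-the-art models pre-trained on the ImageNet-1k dataset. We will use the InceptionV3 model for this purpose.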

def build_feature_extractor():
    feature_extractor = keras.applications.InceptionV3(
        weights="imagenet",
        include_top=False,
        pooling="avg",
        input_shape=(IMG_SIZE, IMG_SIZE, 3),
    )
    preprocess_input = keras.applications.inception_v3.preprocess_input

    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)

    outputs = feature_extractor(preprocessed)
    return keras.Model(inputs, outputs, name="feature_extractor")


feature_extractor = build_feature_extractor()

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_v3/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5
87916544/87910968 [==============================] - 0s 0us/step
87924736/87910968 [==============================] - 0s 0us/step

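The labels of the videos are strings. Neural networks do not understand string values, so they must be converted to some numerical form before they are fed to the model. Here we will use the StringLookup layer to encode the class labels as integers.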

label_processor = keras.layers.StringLookup(
    num_oov_indices=0, vocabulary=np.unique(train_df["tag"])
)
print(label_processor.get_vocabulary())

['CricketShot', 'PlayingCello', 'Punch', 'ShavingBeard', 'TennisSwing']
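
To make the encoding concrete, the following NumPy-only sketch maps string tags to their positions in a sorted vocabulary. It is meant as a rough analogue of what the lookup above produces when no out-of-vocabulary index is reserved, not as a description of the layer's internals; the `tags` array is made-up sample data.

```python
import numpy as np

# Made-up sample of tags, for illustration only.
tags = np.array(["Punch", "CricketShot", "TennisSwing", "Punch", "PlayingCello"])

# `vocabulary` is the sorted set of unique tags; `encoded` holds, for each
# tag, its integer position in that vocabulary.
vocabulary, encoded = np.unique(tags, return_inverse=True)
print(vocabulary.tolist())  # ['CricketShot', 'PlayingCello', 'Punch', 'TennisSwing']
print(encoded.tolist())     # [2, 0, 3, 2, 1]
```

Indexing the vocabulary with the integer codes (`vocabulary[encoded]`) recovers the original strings, which is the same idea the inference code uses when it prints class names.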

Finally, we can put all the pieces together to create our data processing utility.

def prepare_all_videos(df, root_dir):
    num_samples = len(df)
    video_paths = df["video_name"].values.tolist()
    labels = df["tag"].values
    labels = label_processor(labels[..., None]).numpy()

    # `frame_masks` and `frame_features` are what we will feed to our sequence model.
    # `frame_masks` will contain a bunch of booleans denoting if a timestep is
    # masked with padding or not.
    frame_masks = np.zeros(shape=(num_samples, MAX_SEQ_LENGTH), dtype="bool")
    frame_features = np.zeros(
        shape=(num_samples, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
    )

    # For each video.
    for idx, path in enumerate(video_paths):
        # Gather all its frames and add a batch dimension.
        frames = load_video(os.path.join(root_dir, path))
        frames = frames[None, ...]

        # Initialize placeholders to store the masks and features of the current video.
        temp_frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
        temp_frame_features = np.zeros(
            shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32"
        )

        # Extract features from the frames of the current video.
        for i, batch in enumerate(frames):
            video_length = batch.shape[0]
            length = min(MAX_SEQ_LENGTH, video_length)
            for j in range(length):
                temp_frame_features[i, j, :] = feature_extractor.predict(
                    batch[None, j, :]
                )
            temp_frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

        frame_features[idx,] = temp_frame_features.squeeze()
        frame_masks[idx,] = temp_frame_mask.squeeze()

    return (frame_features, frame_masks), labels


train_data, train_labels = prepare_all_videos(train_df, "train")
test_data, test_labels = prepare_all_videos(test_df, "test")

print(f"Frame features in train set: {train_data[0].shape}")
print(f"Frame masks in train set: {train_data[1].shape}")

2021-09-13 14:08:18.486751: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)

Frame features in train set: (594, 20, 2048)
Frame masks in train set: (594, 20)

The above code block will take about 20 minutes to execute, depending on the machine it is run on.
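
Part of that time likely comes from calling the feature extractor once per frame. A possible alternative, sketched below under assumptions, is to pass all frames of a video in a single call. Both helper functions are hypothetical, and `dummy_extractor` is a stand-in for `feature_extractor.predict` so that the sketch runs without TensorFlow; any speed-up on a real model is not measured here.

```python
import numpy as np

NUM_FEATURES = 2048


def extract_per_frame(frames, extractor, max_len):
    """One extractor call per frame, mirroring the loop in the example."""
    length = min(max_len, frames.shape[0])
    return np.stack([extractor(frames[None, j])[0] for j in range(length)])


def extract_batched(frames, extractor, max_len):
    """A single extractor call on all frames that will be kept."""
    return extractor(frames[:max_len])


def dummy_extractor(batch):
    # Stand-in for a real CNN: maps every frame to a constant vector
    # holding that frame's mean pixel value.
    means = batch.reshape(batch.shape[0], -1).mean(axis=1, keepdims=True)
    return np.repeat(means, NUM_FEATURES, axis=1).astype("float32")


rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(7, 32, 32, 3)).astype("float32")

per_frame = extract_per_frame(video, dummy_extractor, max_len=20)
batched = extract_batched(video, dummy_extractor, max_len=20)
print(per_frame.shape, batched.shape)
print(np.allclose(per_frame, batched))
```

With a real Keras model, the batched variant would call `predict` once per video instead of once per frame; the zero-padding and mask logic from the example would still be applied afterwards.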

The sequence model

Now, we can feed this data to a sequence model consisting of recurrent layers like GRU.

# Utility for our sequence model.
def get_sequence_model():
    class_vocab = label_processor.get_vocabulary()

    frame_features_input = keras.Input((MAX_SEQ_LENGTH, NUM_FEATURES))
    mask_input = keras.Input((MAX_SEQ_LENGTH,), dtype="bool")

    # Refer to the following tutorial to understand the significance of using `mask`:
    # https://keras.io/api/layers/recurrent_layers/gru/
    x = keras.layers.GRU(16, return_sequences=True)(
        frame_features_input, mask=mask_input
    )
    x = keras.layers.GRU(8)(x)
    x = keras.layers.Dropout(0.4)(x)
    x = keras.layers.Dense(8, activation="relu")(x)
    output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)

    rnn_model = keras.Model([frame_features_input, mask_input], output)

    rnn_model.compile(
        loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
    )
    return rnn_model


# Utility for running experiments.
def run_experiment():
    filepath = "/tmp/video_classifier"
    checkpoint = keras.callbacks.ModelCheckpoint(
        filepath, save_weights_only=True, save_best_only=True, verbose=1
    )

    seq_model = get_sequence_model()
    history = seq_model.fit(
        [train_data[0], train_data[1]],
        train_labels,
        validation_split=0.3,
        epochs=EPOCHS,
        callbacks=[checkpoint],
    )

    seq_model.load_weights(filepath)
    _, accuracy = seq_model.evaluate([test_data[0], test_data[1]], test_labels)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")

    return history, seq_model


_, sequence_model = run_experiment()

Epoch 1/10
13/13 [==============================] - 4s 101ms/step - loss: 1.5259 - accuracy: 0.3157 - val_loss: 1.4732 - val_accuracy: 0.3408
Epoch 00001: val_loss improved from inf to 1.47325, saving model to /tmp/video_classifi

...

Epoch 00009: val_loss did not improve from 1.47325
Epoch 10/10
13/13 [==============================] - 0s 20ms/step - loss: 0.6519 - accuracy: 0.8265 - val_loss: 1.9150 - val_accuracy: 0.3464

Epoch 00010: val_loss did not improve from 1.47325
7/7 [==============================] - 1s 5ms/step - loss: 1.3806 - accuracy: 0.6875
Test accuracy: 68.75%

Epoch 1/10
13/13 [==============================] - ETA: 0s - loss: 1.3961 - accuracy: 0.4000
Epoch 00001: val_loss improved from inf to 2.12746, saving model to /tmp/video_classifier
13/13 [==============================] - 10s 208ms/step - loss: 1.3961 - accuracy: 0.4000 - val_loss: 2.1275 - val_accuracy: 0.2682
Epoch 2/10
11/13 [========================>.....] - ETA: 0s - loss: 1.1441 - accuracy: 0.5739
Epoch 00002: val_loss improved from 2.12746 to 2.12174, saving model to /tmp/video_classifier
13/13 [==============================] - 0s 19ms/step - loss: 1.1290 - accuracy: 0.5904 - val_loss: 2.1217 - val_accuracy: 0.3352
Epoch 3/10
11/13 [========================>.....] - ETA: 0s - loss: 1.0144 - accuracy: 0.6790
Epoch 00003: val_loss did not improve from 2.12174
13/13 [==============================] - 0s 17ms/step - loss: 1.0214 - accuracy: 0.6892 - val_loss: 2.2151 - val_accuracy: 0.3408
Epoch 4/10
13/13 [==============================] - ETA: 0s - loss: 0.9488 - accuracy: 0.7855
Epoch 00004: val_loss did not improve from 2.12174
13/13 [==============================] - 0s 18ms/step - loss: 0.9488 - accuracy: 0.7855 - val_loss: 2.3421 - val_accuracy: 0.3408
Epoch 5/10
13/13 [==============================] - ETA: 0s - loss: 0.8538 - accuracy: 0.8145
Epoch 00005: val_loss did not improve from 2.12174
13/13 [==============================] - 0s 18ms/step - loss: 0.8538 - accuracy: 0.8145 - val_loss: 2.4863 - val_accuracy: 0.3408
Epoch 6/10
10/13 [======================>.......] - ETA: 0s - loss: 0.8008 - accuracy: 0.8531
Epoch 00006: val_loss did not improve from 2.12174
13/13 [==============================] - 0s 17ms/step - loss: 0.8075 - accuracy: 0.8602 - val_loss: 2.6000 - val_accuracy: 0.3408
Epoch 7/10
10/13 [======================>.......] - ETA: 0s - loss: 0.7974 - accuracy: 0.8156
Epoch 00007: val_loss did not improve from 2.12174
13/13 [==============================] - 0s 17ms/step - loss: 0.7830 - accuracy: 0.8241 - val_loss: 2.7102 - val_accuracy: 0.3408
Epoch 8/10
13/13 [==============================] - ETA: 0s - loss: 0.7129 - accuracy: 0.8940
Epoch 00008: val_loss did not improve from 2.12174
13/13 [==============================] - 0s 17ms/step - loss: 0.7129 - accuracy: 0.8940 - val_loss: 2.8141 - val_accuracy: 0.3408
Epoch 9/10
11/13 [========================>.....] - ETA: 0s - loss: 0.6945 - accuracy: 0.8920
Epoch 00009: val_loss did not improve from 2.12174
13/13 [==============================] - 0s 17ms/step - loss: 0.6862 - accuracy: 0.9012 - val_loss: 2.9116 - val_accuracy: 0.3408
Epoch 10/10
10/13 [======================>.......] - ETA: 0s - loss: 0.6299 - accuracy: 0.8906
Epoch 00010: val_loss did not improve from 2.12174
13/13 [==============================] - 0s 17ms/step - loss: 0.6359 - accuracy: 0.8892 - val_loss: 3.0316 - val_accuracy: 0.3408
7/7 [==============================] - 2s 7ms/step - loss: 1.3497 - accuracy: 0.6696
Test accuracy: 66.96%
CPU times: user 17 s, sys: 1.08 s, total: 18 s
Wall time: 15.9 s
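
In the logs above, validation accuracy stays at 0.3408 from epoch 3 onward while training accuracy climbs to roughly 0.89. One possible contributor, stated here as an assumption rather than a diagnosis, is that `validation_split` in Keras takes the last fraction of the provided arrays before any shuffling; if the rows of train.csv happen to be grouped by class (not verified here), the held-out portion would cover only some classes. A minimal sketch of shuffling the arrays in unison beforehand, using a hypothetical helper and toy data:

```python
import numpy as np


def shuffle_in_unison(features, masks, labels, seed=42):
    """Apply one random permutation to features, masks, and labels."""
    order = np.random.default_rng(seed).permutation(len(labels))
    return features[order], masks[order], labels[order]


# Toy stand-ins for train_data[0], train_data[1], and train_labels,
# deliberately sorted by class to mimic an ordered CSV file.
labels = np.repeat(np.arange(5), 4)[:, None]  # shape (20, 1)
features = np.arange(20, dtype="float32")[:, None, None] * np.ones(
    (1, 3, 2), dtype="float32"
)  # shape (20, 3, 2); sample i is filled with the value i
masks = np.ones((20, 3), dtype="bool")

features_s, masks_s, labels_s = shuffle_in_unison(features, masks, labels)

# The last 30% of the arrays is what `validation_split=0.3` would hold out.
print("classes in unshuffled tail:", np.unique(labels[-6:]).tolist())
print("classes in shuffled tail:  ", np.unique(labels_s[-6:]).tolist())
```

Because the same permutation is applied to all three arrays, each sample keeps its own mask and label after shuffling.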

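Note: To keep the runtime of this example relatively short, we used only a few training examples. This number of training examples is low with respect to the sequence model being used, which has 99,909 trainable parameters. You are encouraged to sample more data from the UCF101 dataset using the notebook mentioned above and train the same model.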
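
The parameter count quoted in the note can be reproduced by hand. The sketch below assumes the Keras GRU default of `reset_after=True`, under which each of the three gates has an input kernel, a recurrent kernel, and two bias vectors; the helper functions are hypothetical and exist only for this arithmetic.

```python
def gru_params(input_dim, units):
    # Three gates, each with an input kernel (input_dim x units), a
    # recurrent kernel (units x units), and two bias vectors (2 x units).
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


NUM_FEATURES = 2048
NUM_CLASSES = 5

total = (
    gru_params(NUM_FEATURES, 16)     # first GRU layer: 99,168
    + gru_params(16, 8)              # second GRU layer: 624
    + dense_params(8, 8)             # hidden Dense layer: 72
    + dense_params(8, NUM_CLASSES)   # softmax output layer: 45
)
print(total)  # 99909
```

Almost all of the parameters sit in the first GRU layer, because its input kernel scales with the 2048-dimensional frame features.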

Inference

def prepare_single_video(frames):
    frames = frames[None, ...]
    frame_mask = np.zeros(shape=(1, MAX_SEQ_LENGTH,), dtype="bool")
    frame_features = np.zeros(shape=(1, MAX_SEQ_LENGTH, NUM_FEATURES), dtype="float32")

    for i, batch in enumerate(frames):
        video_length = batch.shape[0]
        length = min(MAX_SEQ_LENGTH, video_length)
        for j in range(length):
            frame_features[i, j, :] = feature_extractor.predict(batch[None, j, :])
        frame_mask[i, :length] = 1  # 1 = not masked, 0 = masked

    return frame_features, frame_mask


def sequence_prediction(path):
    class_vocab = label_processor.get_vocabulary()

    frames = load_video(os.path.join("test", path))
    frame_features, frame_mask = prepare_single_video(frames)
    probabilities = sequence_model.predict([frame_features, frame_mask])[0]

    for i in np.argsort(probabilities)[::-1]:
        print(f"  {class_vocab[i]}: {probabilities[i] * 100:5.2f}%")
    return frames


# This utility is for visualization.
# Referenced from:
# https://www.tensorflow.org/hub/tutorials/action_recognition_with_tf_hub
def to_gif(images):
    converted_images = images.astype(np.uint8)
    imageio.mimsave("animation.gif", converted_images, fps=10)
    return embed.embed_file("animation.gif")


test_video = np.random.choice(test_df["video_name"].values.tolist())
print(f"Test video path: {test_video}")
test_frames = sequence_prediction(test_video)
to_gif(test_frames[:MAX_SEQ_LENGTH])

Test video path: v_PlayingCello_g05_c03.avi
  PlayingCello: 25.61%
  CricketShot: 24.82%
  ShavingBeard: 19.38%
  TennisSwing: 17.43%
  Punch: 12.77%

Test video path: v_CricketShot_g02_c03.avi
  CricketShot: 66.34%
  Punch: 16.68%
  PlayingCello:  7.00%
  TennisSwing:  6.48%
  ShavingBeard:  3.51%

Next steps

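In this example, we used transfer learning to extract meaningful features from video frames. You could also fine-tune the pre-trained network to see how that affects the end results.
For speed-accuracy trade-offs, you can try out other models available in tf.keras.applications.
Try different values of MAX_SEQ_LENGTH to observe how that affects performance.
Train on a larger number of classes and see whether you can still get good performance.
Following this tutorial, try a pre-trained action recognition model from DeepMind.
Rolling-averaging can be a useful technique for video classification, and it can be combined with a standard image classification model to run inference on videos. This tutorial will help you understand how to use rolling-averaging with an image classifier.
When there are variations between the frames of a video, not all frames may be equally important for deciding its category. In those situations, putting a self-attention layer in the sequence model will likely yield better results.
Following this book chapter, you can implement Transformer-based models for processing videos.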
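
As a rough illustration of the rolling-averaging idea mentioned in the list above, the following NumPy sketch smooths per-frame class probabilities over a short trailing window. The helper name, the window size, and the probability values are all made up for this illustration; it does not reproduce the tutorial referenced in the list.

```python
import numpy as np


def rolling_average(frame_probs, window=5):
    """Average each frame's probabilities with up to `window - 1` preceding frames."""
    smoothed = np.empty_like(frame_probs)
    for t in range(len(frame_probs)):
        start = max(0, t - window + 1)
        smoothed[t] = frame_probs[start : t + 1].mean(axis=0)
    return smoothed


# Hypothetical per-frame probabilities for 3 classes; the fourth frame is a
# one-off misprediction in an otherwise consistent clip.
frame_probs = np.array(
    [
        [0.80, 0.10, 0.10],
        [0.70, 0.20, 0.10],
        [0.90, 0.05, 0.05],
        [0.20, 0.70, 0.10],
        [0.80, 0.10, 0.10],
    ]
)
smoothed = rolling_average(frame_probs, window=3)
print(frame_probs.argmax(axis=1).tolist())  # [0, 0, 0, 1, 0]
print(smoothed.argmax(axis=1).tolist())     # [0, 0, 0, 0, 0]
```

Averaging over neighbouring frames suppresses the isolated flicker, which is why the technique can stabilize frame-by-frame predictions from an image classifier.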
