Keras 2 : examples : Handwriting recognition (variable-length sequences) (translation/commentary)
Translation: ClassCat Co., Ltd. Sales Information
Created: 11/20/2021 (keras 2.7.0)
* This page is a translation of the following Keras documentation, with supplementary explanations added where appropriate:
- Code examples : Computer Vision : Handwriting recognition (Authors: A_K_Nain, Sayak Paul)
* The sample code has been verified to run, but has been supplemented and modified where necessary.
Description: Training a handwriting recognition model with variable-length sequences.
Introduction
This example shows how the Captcha OCR example can be extended to the IAM Dataset, which has variable-length ground-truth targets. Each sample in the dataset is an image of some handwritten text, and its corresponding target is the string present in the image. The IAM Dataset is widely used across many OCR benchmarks, so we hope this example can serve as a good starting point for building OCR systems.
Data collection
!wget -q https://git.io/J0fjL -O IAM_Words.zip
!unzip -qq IAM_Words.zip
!
!mkdir data
!mkdir data/words
!tar -xf IAM_Words/words.tgz -C data/words
!mv IAM_Words/words.txt data
Preview how the dataset is structured. Lines that begin with "#" are just metadata.
!head -20 data/words.txt
#--- words.txt ---------------------------------------------------------------#
#
# iam database word information
#
# format: a01-000u-00-00 ok 154 1 408 768 27 51 AT A
#
#     a01-000u-00-00  -> word id for line 00 in form a01-000u
#     ok              -> result of word segmentation
#                          ok: word was correctly
#                          er: segmentation of word can be bad
#
#     154             -> graylevel to binarize the line containing this word
#     1               -> number of components for this word
#     408 768 27 51   -> bounding box around this word in x,y,w,h format
#     AT              -> the grammatical tag for this word, see the
#                        file tagset.txt for an explanation
#     A               -> the transcription for this word
#
a01-000u-00-00 ok 154 408 768 27 51 AT A
a01-000u-00-01 ok 154 507 766 213 48 NN MOVE
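(Translator's note) For orientation, this is how the pieces of a words.txt entry that the rest of this example actually relies on can be pulled out; a minimal sketch using the second data line shown above, with variable names of our own choosing:
# Illustration only: the fields of a words.txt entry used later in this example.
sample = "a01-000u-00-01 ok 154 507 766 213 48 NN MOVE"
parts = sample.strip().split(" ")
word_id = parts[0]         # "a01-000u-00-01" -> image words/a01/a01-000u/a01-000u-00-01.png
segmentation = parts[1]    # "ok" / "er": result of the word segmentation
transcription = parts[-1]  # "MOVE": the ground-truth text for this image
print(word_id, segmentation, transcription)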
Imports
from tensorflow.keras.layers.experimental.preprocessing import StringLookup
from tensorflow import keras
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import os
np.random.seed(42)
tf.random.set_seed(42)
Dataset splitting
base_path = "data"
words_list = []
words = open(f"{base_path}/words.txt", "r").readlines()
for line in words:
if line[0] == "#":
continue
if line.split(" ")[1] != "err": # We don't need to deal with errored entries.
words_list.append(line)
len(words_list)
np.random.shuffle(words_list)
We will split the dataset into three subsets with a 90:5:5 ratio (train:validation:test).
split_idx = int(0.9 * len(words_list))
train_samples = words_list[:split_idx]
test_samples = words_list[split_idx:]
val_split_idx = int(0.5 * len(test_samples))
validation_samples = test_samples[:val_split_idx]
test_samples = test_samples[val_split_idx:]
assert len(words_list) == len(train_samples) + len(validation_samples) + len(
test_samples
)
print(f"Total training samples: {len(train_samples)}")
print(f"Total validation samples: {len(validation_samples)}")
print(f"Total test samples: {len(test_samples)}")
Total training samples: 86810
Total validation samples: 4823
Total test samples: 4823
Data input pipeline
We start building our data input pipeline by first preparing the image paths.
base_image_path = os.path.join(base_path, "words")
def get_image_paths_and_labels(samples):
paths = []
corrected_samples = []
for (i, file_line) in enumerate(samples):
line_split = file_line.strip()
line_split = line_split.split(" ")
# Each line split will have this format for the corresponding image:
# part1/part1-part2/part1-part2-part3.png
image_name = line_split[0]
partI = image_name.split("-")[0]
partII = image_name.split("-")[1]
img_path = os.path.join(
base_image_path, partI, partI + "-" + partII, image_name + ".png"
)
if os.path.getsize(img_path):
paths.append(img_path)
corrected_samples.append(file_line.split("\n")[0])
return paths, corrected_samples
train_img_paths, train_labels = get_image_paths_and_labels(train_samples)
validation_img_paths, validation_labels = get_image_paths_and_labels(validation_samples)
test_img_paths, test_labels = get_image_paths_and_labels(test_samples)
Then we prepare the ground-truth labels.
# Find maximum length and the size of the vocabulary in the training data.
train_labels_cleaned = []
characters = set()
max_len = 0
for label in train_labels:
label = label.split(" ")[-1].strip()
for char in label:
characters.add(char)
max_len = max(max_len, len(label))
train_labels_cleaned.append(label)
print("Maximum length: ", max_len)
print("Vocab size: ", len(characters))
# Check some label samples.
train_labels_cleaned[:10]
Maximum length: 21
Vocab size: 78
['sure', 'he', 'during', 'of', 'booty', 'gastronomy', 'boy', 'The', 'and', 'in']
Next, we clean up the validation and test labels as well.
def clean_labels(labels):
cleaned_labels = []
for label in labels:
label = label.split(" ")[-1].strip()
cleaned_labels.append(label)
return cleaned_labels
validation_labels_cleaned = clean_labels(validation_labels)
test_labels_cleaned = clean_labels(test_labels)
Building the character vocabulary
Keras provides different preprocessing layers to deal with different modalities of data. This guide provides a comprehensive introduction. Our example involves preprocessing labels at the character level. This means that if there are two labels, e.g. "cat" and "dog", then our character vocabulary should be {a, c, d, g, o, t} (without any special tokens). We use the StringLookup layer for this purpose.
AUTOTUNE = tf.data.AUTOTUNE
# Mapping characters to integers.
char_to_num = StringLookup(vocabulary=list(characters), mask_token=None)
# Mapping integers back to original characters.
num_to_char = StringLookup(
vocabulary=char_to_num.get_vocabulary(), mask_token=None, invert=True
)
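(Translator's note) As a quick sanity check of the two mappings, here is a minimal round trip; the exact integer ids depend on the (shuffled) vocabulary, so the printed values will vary:
# Characters -> integer ids -> characters.
ids = char_to_num(tf.strings.unicode_split("cat", input_encoding="UTF-8"))
back = tf.strings.reduce_join(num_to_char(ids)).numpy().decode("utf-8")
print(ids.numpy(), back)  # three vocabulary ids, followed by "cat"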
Resizing images without distortion
Instead of square images, many OCR models work with rectangular images. This becomes clear when we visualize a few samples from the dataset. While aspect-ratio-agnostic resizing of square images does not introduce a significant amount of distortion, this is not the case for rectangular images. Resizing images to a uniform size is nevertheless a requirement for mini-batching, so we need to perform our resizing such that the following criteria are met:
- The aspect ratio is preserved.
- The content of the image is not affected.
def distortion_free_resize(image, img_size):
w, h = img_size
image = tf.image.resize(image, size=(h, w), preserve_aspect_ratio=True)
    # Check the amount of padding needed.
pad_height = h - tf.shape(image)[0]
pad_width = w - tf.shape(image)[1]
    # Only necessary if you want to do the same amount of padding on both sides.
if pad_height % 2 != 0:
height = pad_height // 2
pad_height_top = height + 1
pad_height_bottom = height
else:
pad_height_top = pad_height_bottom = pad_height // 2
if pad_width % 2 != 0:
width = pad_width // 2
pad_width_left = width + 1
pad_width_right = width
else:
pad_width_left = pad_width_right = pad_width // 2
image = tf.pad(
image,
paddings=[
[pad_height_top, pad_height_bottom],
[pad_width_left, pad_width_right],
[0, 0],
],
)
image = tf.transpose(image, perm=[1, 0, 2])
image = tf.image.flip_left_right(image)
return image
If we just went with a plain resize, the images would end up looking stretched. Notice how such a resize introduces unnecessary stretching of the handwriting.
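(Translator's note) For reference, the plain resize used for that comparison is essentially a direct tf.image.resize to the target size, with no aspect-ratio preservation and no padding; a minimal sketch with a helper name of our own:
# Naive resize that ignores the aspect ratio; this is what stretches the handwriting.
def plain_resize(image, img_size):
    w, h = img_size
    return tf.image.resize(image, size=(h, w))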
Putting the utilities together
batch_size = 64
padding_token = 99
image_width = 128
image_height = 32
def preprocess_image(image_path, img_size=(image_width, image_height)):
image = tf.io.read_file(image_path)
image = tf.image.decode_png(image, 1)
image = distortion_free_resize(image, img_size)
image = tf.cast(image, tf.float32) / 255.0
return image
def vectorize_label(label):
label = char_to_num(tf.strings.unicode_split(label, input_encoding="UTF-8"))
length = tf.shape(label)[0]
pad_amount = max_len - length
label = tf.pad(label, paddings=[[0, pad_amount]], constant_values=padding_token)
return label
def process_images_labels(image_path, label):
image = preprocess_image(image_path)
label = vectorize_label(label)
return {"image": image, "label": label}
def prepare_dataset(image_paths, labels):
dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels)).map(
process_images_labels, num_parallel_calls=AUTOTUNE
)
return dataset.batch(batch_size).cache().prefetch(AUTOTUNE)
Preparing tf.data.Dataset objects
train_ds = prepare_dataset(train_img_paths, train_labels_cleaned)
validation_ds = prepare_dataset(validation_img_paths, validation_labels_cleaned)
test_ds = prepare_dataset(test_img_paths, test_labels_cleaned)
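(Translator's note) A quick shape check on a single batch; the expected shapes below follow from the settings above (batch_size=64, a 128x32 target size with the transpose applied, max_len=21):
# Images come out as (batch, width, height, 1) after distortion_free_resize,
# labels are padded to max_len with padding_token.
for batch in train_ds.take(1):
    print(batch["image"].shape)  # expected: (64, 128, 32, 1)
    print(batch["label"].shape)  # expected: (64, 21)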
Visualizing a few samples
for data in train_ds.take(1):
images, labels = data["image"], data["label"]
_, ax = plt.subplots(4, 4, figsize=(15, 8))
for i in range(16):
img = images[i]
img = tf.image.flip_left_right(img)
img = tf.transpose(img, perm=[1, 0, 2])
img = (img * 255.0).numpy().clip(0, 255).astype(np.uint8)
img = img[:, :, 0]
        # Gather indices where label != padding_token.
label = labels[i]
indices = tf.gather(label, tf.where(tf.math.not_equal(label, padding_token)))
# Convert to string.
label = tf.strings.reduce_join(num_to_char(indices))
label = label.numpy().decode("utf-8")
ax[i // 4, i % 4].imshow(img, cmap="gray")
ax[i // 4, i % 4].set_title(label)
ax[i // 4, i % 4].axis("off")
plt.show()
Notice how the content of the original images is kept as faithful as possible, with padding applied accordingly.
Model
Our model uses CTC loss as an endpoint layer. For a deeper understanding of CTC loss, refer to this post.
class CTCLayer(keras.layers.Layer):
def __init__(self, name=None):
super().__init__(name=name)
self.loss_fn = keras.backend.ctc_batch_cost
def call(self, y_true, y_pred):
batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")
input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")
loss = self.loss_fn(y_true, y_pred, input_length, label_length)
self.add_loss(loss)
# At test time, just return the computed predictions.
return y_pred
def build_model():
# Inputs to the model
input_img = keras.Input(shape=(image_width, image_height, 1), name="image")
labels = keras.layers.Input(name="label", shape=(None,))
# First conv block.
x = keras.layers.Conv2D(
32,
(3, 3),
activation="relu",
kernel_initializer="he_normal",
padding="same",
name="Conv1",
)(input_img)
x = keras.layers.MaxPooling2D((2, 2), name="pool1")(x)
# Second conv block.
x = keras.layers.Conv2D(
64,
(3, 3),
activation="relu",
kernel_initializer="he_normal",
padding="same",
name="Conv2",
)(x)
x = keras.layers.MaxPooling2D((2, 2), name="pool2")(x)
    # We have used two max pooling layers with pool size and strides of 2.
# Hence, downsampled feature maps are 4x smaller. The number of
# filters in the last layer is 64. Reshape accordingly before
# passing the output to the RNN part of the model.
new_shape = ((image_width // 4), (image_height // 4) * 64)
x = keras.layers.Reshape(target_shape=new_shape, name="reshape")(x)
x = keras.layers.Dense(64, activation="relu", name="dense1")(x)
x = keras.layers.Dropout(0.2)(x)
# RNNs.
x = keras.layers.Bidirectional(
keras.layers.LSTM(128, return_sequences=True, dropout=0.25)
)(x)
x = keras.layers.Bidirectional(
keras.layers.LSTM(64, return_sequences=True, dropout=0.25)
)(x)
# +2 is to account for the two special tokens introduced by the CTC loss.
# The recommendation comes here: https://git.io/J0eXP.
x = keras.layers.Dense(
len(char_to_num.get_vocabulary()) + 2, activation="softmax", name="dense2"
)(x)
# Add CTC layer for calculating CTC loss at each step.
output = CTCLayer(name="ctc_loss")(labels, x)
# Define the model.
model = keras.models.Model(
inputs=[input_img, labels], outputs=output, name="handwriting_recognizer"
)
# Optimizer.
opt = keras.optimizers.Adam()
# Compile the model and return.
model.compile(optimizer=opt)
return model
# Get the model.
model = build_model()
model.summary()
Model: "handwriting_recognizer" __________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== image (InputLayer) [(None, 128, 32, 1)] 0 __________________________________________________________________________________________________ Conv1 (Conv2D) (None, 128, 32, 32) 320 image[0][0] __________________________________________________________________________________________________ pool1 (MaxPooling2D) (None, 64, 16, 32) 0 Conv1[0][0] __________________________________________________________________________________________________ Conv2 (Conv2D) (None, 64, 16, 64) 18496 pool1[0][0] __________________________________________________________________________________________________ pool2 (MaxPooling2D) (None, 32, 8, 64) 0 Conv2[0][0] __________________________________________________________________________________________________ reshape (Reshape) (None, 32, 512) 0 pool2[0][0] __________________________________________________________________________________________________ dense1 (Dense) (None, 32, 64) 32832 reshape[0][0] __________________________________________________________________________________________________ dropout (Dropout) (None, 32, 64) 0 dense1[0][0] __________________________________________________________________________________________________ bidirectional (Bidirectional) (None, 32, 256) 197632 dropout[0][0] __________________________________________________________________________________________________ bidirectional_1 (Bidirectional) (None, 32, 128) 164352 bidirectional[0][0] __________________________________________________________________________________________________ label (InputLayer) [(None, None)] 0 __________________________________________________________________________________________________ dense2 (Dense) (None, 32, 81) 10449 bidirectional_1[0][0] __________________________________________________________________________________________________ ctc_loss (CTCLayer) (None, 32, 81) 0 label[0][0] dense2[0][0] ================================================================================================== Total params: 424,081 Trainable params: 424,081 Non-trainable params: 0 __________________________________________________________________________________________________
Evaluation metric
Edit distance is the most widely used metric for evaluating OCR models. In this section, we will implement it and use it as a callback to monitor our model.
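(Translator's note) To make the metric concrete before wiring it into a callback, here is a tiny worked example; tf.edit_distance expects sparse tensors, and normalize=False returns the raw number of edit operations:
# Turning "kitten" into "sitting" takes 3 edits (two substitutions, one insertion).
hypothesis = tf.sparse.from_dense(tf.constant([list("kitten")]))
truth = tf.sparse.from_dense(tf.constant([list("sitting")]))
print(tf.edit_distance(hypothesis, truth, normalize=False).numpy())  # [3.]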
We first segregate the validation images and their labels for convenience.
validation_images = []
validation_labels = []
for batch in validation_ds:
validation_images.append(batch["image"])
validation_labels.append(batch["label"])
Next, we create a callback to monitor the edit distances.
def calculate_edit_distance(labels, predictions):
# Get a single batch and convert its labels to sparse tensors.
    sparse_labels = tf.cast(tf.sparse.from_dense(labels), dtype=tf.int64)
# Make predictions and convert them to sparse tensors.
input_len = np.ones(predictions.shape[0]) * predictions.shape[1]
predictions_decoded = keras.backend.ctc_decode(
predictions, input_length=input_len, greedy=True
)[0][0][:, :max_len]
sparse_predictions = tf.cast(
tf.sparse.from_dense(predictions_decoded), dtype=tf.int64
)
# Compute individual edit distances and average them out.
edit_distances = tf.edit_distance(
        sparse_predictions, sparse_labels, normalize=False
)
return tf.reduce_mean(edit_distances)
class EditDistanceCallback(keras.callbacks.Callback):
def __init__(self, pred_model):
super().__init__()
self.prediction_model = pred_model
def on_epoch_end(self, epoch, logs=None):
edit_distances = []
for i in range(len(validation_images)):
labels = validation_labels[i]
predictions = self.prediction_model.predict(validation_images[i])
edit_distances.append(calculate_edit_distance(labels, predictions).numpy())
print(
f"Mean edit distance for epoch {epoch + 1}: {np.mean(edit_distances):.4f}"
)
Training
Now we are ready to kick off model training.
epochs = 10 # To get good results this should be at least 50.
model = build_model()
prediction_model = keras.models.Model(
model.get_layer(name="image").input, model.get_layer(name="dense2").output
)
edit_distance_callback = EditDistanceCallback(prediction_model)
# Train the model.
history = model.fit(
train_ds,
validation_data=validation_ds,
epochs=epochs,
callbacks=[edit_distance_callback],
)
Epoch 1/10
1357/1357 [==============================] - 89s 51ms/step - loss: 13.6670 - val_loss: 11.8041
Mean edit distance for epoch 1: 20.5117
Epoch 2/10
1357/1357 [==============================] - 48s 36ms/step - loss: 10.6864 - val_loss: 9.6994
Mean edit distance for epoch 2: 20.1167
Epoch 3/10
1357/1357 [==============================] - 48s 35ms/step - loss: 9.0437 - val_loss: 8.0355
Mean edit distance for epoch 3: 19.7270
Epoch 4/10
1357/1357 [==============================] - 48s 35ms/step - loss: 7.6098 - val_loss: 6.4239
Mean edit distance for epoch 4: 19.1106
Epoch 5/10
1357/1357 [==============================] - 48s 35ms/step - loss: 6.3194 - val_loss: 4.9814
Mean edit distance for epoch 5: 18.4894
Epoch 6/10
1357/1357 [==============================] - 48s 35ms/step - loss: 5.3417 - val_loss: 4.1307
Mean edit distance for epoch 6: 18.1909
Epoch 7/10
1357/1357 [==============================] - 48s 35ms/step - loss: 4.6396 - val_loss: 3.7706
Mean edit distance for epoch 7: 18.1224
Epoch 8/10
1357/1357 [==============================] - 48s 35ms/step - loss: 4.1926 - val_loss: 3.3682
Mean edit distance for epoch 8: 17.9387
Epoch 9/10
1357/1357 [==============================] - 48s 36ms/step - loss: 3.8532 - val_loss: 3.1829
Mean edit distance for epoch 9: 17.9074
Epoch 10/10
1357/1357 [==============================] - 49s 36ms/step - loss: 3.5769 - val_loss: 2.9221
Mean edit distance for epoch 10: 17.7960
(Translator's note: results of the translator's own run)
Epoch 1/10
1357/1357 [==============================] - ETA: 0s - loss: 13.6126
Mean edit distance for epoch 1: 20.4893
1357/1357 [==============================] - 142s 88ms/step - loss: 13.6126 - val_loss: 11.8357
Epoch 2/10
1357/1357 [==============================] - ETA: 0s - loss: 10.5591
Mean edit distance for epoch 2: 20.0652
1357/1357 [==============================] - 63s 46ms/step - loss: 10.5591 - val_loss: 9.5134
Epoch 3/10
1356/1357 [============================>.] - ETA: 0s - loss: 8.7765
Mean edit distance for epoch 3: 19.5805
1357/1357 [==============================] - 63s 46ms/step - loss: 8.7769 - val_loss: 7.6806
Epoch 4/10
1356/1357 [============================>.] - ETA: 0s - loss: 7.1568
Mean edit distance for epoch 4: 18.8701
1357/1357 [==============================] - 63s 47ms/step - loss: 7.1568 - val_loss: 5.8427
Epoch 5/10
1357/1357 [==============================] - ETA: 0s - loss: 5.9351
Mean edit distance for epoch 5: 18.4914
1357/1357 [==============================] - 64s 47ms/step - loss: 5.9351 - val_loss: 4.8136
Epoch 6/10
1356/1357 [============================>.] - ETA: 0s - loss: 5.1171
Mean edit distance for epoch 6: 18.2129
1357/1357 [==============================] - 63s 47ms/step - loss: 5.1172 - val_loss: 4.0588
Epoch 7/10
1357/1357 [==============================] - ETA: 0s - loss: 4.5114
Mean edit distance for epoch 7: 18.0457
1357/1357 [==============================] - 63s 47ms/step - loss: 4.5114 - val_loss: 3.5931
Epoch 8/10
1356/1357 [============================>.] - ETA: 0s - loss: 4.0710
Mean edit distance for epoch 8: 17.8817
1357/1357 [==============================] - 63s 47ms/step - loss: 4.0710 - val_loss: 3.2655
Epoch 9/10
1356/1357 [============================>.] - ETA: 0s - loss: 3.7387
Mean edit distance for epoch 9: 17.7877
1357/1357 [==============================] - 63s 47ms/step - loss: 3.7388 - val_loss: 2.9869
Epoch 10/10
1356/1357 [============================>.] - ETA: 0s - loss: 3.5015
Mean edit distance for epoch 10: 17.7893
1357/1357 [==============================] - 66s 48ms/step - loss: 3.5016 - val_loss: 2.9744
CPU times: user 16min 33s, sys: 1min 49s, total: 18min 23s
Wall time: 14min 27s
(50 epochs)
Epoch 1/50 1357/1357 [==============================] - ETA: 0s - loss: 13.4471Mean edit distance for epoch 1: 20.4373 1357/1357 [==============================] - 74s 50ms/step - loss: 13.4471 - val_loss: 11.8533 Epoch 2/50 1356/1357 [============================>.] - ETA: 0s - loss: 10.7246Mean edit distance for epoch 2: 20.0584 1357/1357 [==============================] - 65s 48ms/step - loss: 10.7249 - val_loss: 9.6146 Epoch 3/50 1357/1357 [==============================] - ETA: 0s - loss: 8.9240Mean edit distance for epoch 3: 19.7461 1357/1357 [==============================] - 65s 48ms/step - loss: 8.9240 - val_loss: 7.8377 Epoch 4/50 1356/1357 [============================>.] - ETA: 0s - loss: 7.3277Mean edit distance for epoch 4: 19.0892 1357/1357 [==============================] - 65s 48ms/step - loss: 7.3277 - val_loss: 6.1525 Epoch 5/50 1357/1357 [==============================] - ETA: 0s - loss: 6.0136Mean edit distance for epoch 5: 18.5805 1357/1357 [==============================] - 65s 48ms/step - loss: 6.0136 - val_loss: 4.8196 Epoch 6/50 1357/1357 [==============================] - ETA: 0s - loss: 5.1234Mean edit distance for epoch 6: 18.2658 1357/1357 [==============================] - 65s 48ms/step - loss: 5.1234 - val_loss: 4.0891 Epoch 7/50 1357/1357 [==============================] - ETA: 0s - loss: 4.5151Mean edit distance for epoch 7: 18.1098 1357/1357 [==============================] - 65s 48ms/step - loss: 4.5151 - val_loss: 3.6587 Epoch 8/50 1356/1357 [============================>.] - ETA: 0s - loss: 4.0568Mean edit distance for epoch 8: 17.8928 1357/1357 [==============================] - 66s 48ms/step - loss: 4.0570 - val_loss: 3.2468 Epoch 9/50 1357/1357 [==============================] - ETA: 0s - loss: 3.7304Mean edit distance for epoch 9: 17.8605 1357/1357 [==============================] - 66s 49ms/step - loss: 3.7304 - val_loss: 3.0323 Epoch 10/50 1357/1357 [==============================] - ETA: 0s - loss: 3.4922Mean edit distance for epoch 10: 17.7409 1357/1357 [==============================] - 66s 49ms/step - loss: 3.4922 - val_loss: 2.8043 Epoch 11/50 1356/1357 [============================>.] - ETA: 0s - loss: 3.2765Mean edit distance for epoch 11: 17.7393 1357/1357 [==============================] - 66s 49ms/step - loss: 3.2767 - val_loss: 2.6934 Epoch 12/50 1356/1357 [============================>.] - ETA: 0s - loss: 3.1172Mean edit distance for epoch 12: 17.6257 1357/1357 [==============================] - 68s 50ms/step - loss: 3.1174 - val_loss: 2.5190 Epoch 13/50 1357/1357 [==============================] - ETA: 0s - loss: 2.9876Mean edit distance for epoch 13: 17.6952 1357/1357 [==============================] - 66s 48ms/step - loss: 2.9876 - val_loss: 2.5399 Epoch 14/50 1356/1357 [============================>.] - ETA: 0s - loss: 2.8918Mean edit distance for epoch 14: 17.6334 1357/1357 [==============================] - 67s 50ms/step - loss: 2.8918 - val_loss: 2.4650 Epoch 15/50 1356/1357 [============================>.] - ETA: 0s - loss: 2.7731Mean edit distance for epoch 15: 17.5921 1357/1357 [==============================] - 66s 49ms/step - loss: 2.7731 - val_loss: 2.3714 Epoch 16/50 1356/1357 [============================>.] 
- ETA: 0s - loss: 2.6854Mean edit distance for epoch 16: 17.5724 1357/1357 [==============================] - 66s 48ms/step - loss: 2.6857 - val_loss: 2.2714 Epoch 17/50 1357/1357 [==============================] - ETA: 0s - loss: 2.6109Mean edit distance for epoch 17: 17.5473 1357/1357 [==============================] - 66s 48ms/step - loss: 2.6109 - val_loss: 2.1894 Epoch 18/50 1356/1357 [============================>.] - ETA: 0s - loss: 2.5395Mean edit distance for epoch 18: 17.5076 1357/1357 [==============================] - 66s 48ms/step - loss: 2.5398 - val_loss: 2.1756 Epoch 19/50 1356/1357 [============================>.] - ETA: 0s - loss: 2.4615Mean edit distance for epoch 19: 17.4859 1357/1357 [==============================] - 65s 48ms/step - loss: 2.4616 - val_loss: 2.1148 Epoch 20/50 1357/1357 [==============================] - ETA: 0s - loss: 2.4156Mean edit distance for epoch 20: 17.4874 1357/1357 [==============================] - 66s 49ms/step - loss: 2.4156 - val_loss: 2.0923 Epoch 21/50 1356/1357 [============================>.] - ETA: 0s - loss: 2.3655Mean edit distance for epoch 21: 17.4633 1357/1357 [==============================] - 65s 48ms/step - loss: 2.3655 - val_loss: 2.0256 Epoch 22/50 1357/1357 [==============================] - ETA: 0s - loss: 2.3188Mean edit distance for epoch 22: 17.4456 1357/1357 [==============================] - 65s 48ms/step - loss: 2.3188 - val_loss: 2.0074 Epoch 23/50 1357/1357 [==============================] - ETA: 0s - loss: 2.2649Mean edit distance for epoch 23: 17.4421 1357/1357 [==============================] - 65s 48ms/step - loss: 2.2649 - val_loss: 1.9864 Epoch 24/50 1357/1357 [==============================] - ETA: 0s - loss: 2.2271Mean edit distance for epoch 24: 17.4225 1357/1357 [==============================] - 65s 48ms/step - loss: 2.2271 - val_loss: 1.9584 Epoch 25/50 1357/1357 [==============================] - ETA: 0s - loss: 2.1922Mean edit distance for epoch 25: 17.4272 1357/1357 [==============================] - 64s 47ms/step - loss: 2.1922 - val_loss: 1.9683 Epoch 26/50 1356/1357 [============================>.] - ETA: 0s - loss: 2.1556Mean edit distance for epoch 26: 17.4341 1357/1357 [==============================] - 65s 48ms/step - loss: 2.1558 - val_loss: 1.9814 Epoch 27/50 1356/1357 [============================>.] - ETA: 0s - loss: 2.1258Mean edit distance for epoch 27: 17.4183 1357/1357 [==============================] - 65s 48ms/step - loss: 2.1258 - val_loss: 1.9473 Epoch 28/50 1356/1357 [============================>.] - ETA: 0s - loss: 2.0964Mean edit distance for epoch 28: 17.4343 1357/1357 [==============================] - 64s 47ms/step - loss: 2.0964 - val_loss: 1.9463 Epoch 29/50 1356/1357 [============================>.] - ETA: 0s - loss: 2.0501Mean edit distance for epoch 29: 17.4031 1357/1357 [==============================] - 65s 48ms/step - loss: 2.0501 - val_loss: 1.9043 Epoch 30/50 1356/1357 [============================>.] - ETA: 0s - loss: 2.0311Mean edit distance for epoch 30: 17.4161 1357/1357 [==============================] - 65s 48ms/step - loss: 2.0313 - val_loss: 1.8567 Epoch 31/50 1356/1357 [============================>.] - ETA: 0s - loss: 2.0114Mean edit distance for epoch 31: 17.4091 1357/1357 [==============================] - 66s 49ms/step - loss: 2.0116 - val_loss: 1.8696 Epoch 32/50 1356/1357 [============================>.] 
- ETA: 0s - loss: 1.9824Mean edit distance for epoch 32: 17.3840 1357/1357 [==============================] - 66s 49ms/step - loss: 1.9825 - val_loss: 1.9015 Epoch 33/50 1357/1357 [==============================] - ETA: 0s - loss: 1.9547Mean edit distance for epoch 33: 17.3754 1357/1357 [==============================] - 66s 49ms/step - loss: 1.9547 - val_loss: 1.8477 Epoch 34/50 1357/1357 [==============================] - ETA: 0s - loss: 1.9186Mean edit distance for epoch 34: 17.3707 1357/1357 [==============================] - 66s 48ms/step - loss: 1.9186 - val_loss: 1.8276 Epoch 35/50 1357/1357 [==============================] - ETA: 0s - loss: 1.9059Mean edit distance for epoch 35: 17.3801 1357/1357 [==============================] - 65s 48ms/step - loss: 1.9059 - val_loss: 1.8287 Epoch 36/50 1357/1357 [==============================] - ETA: 0s - loss: 1.8761Mean edit distance for epoch 36: 17.4086 1357/1357 [==============================] - 67s 49ms/step - loss: 1.8761 - val_loss: 1.8549 Epoch 37/50 1357/1357 [==============================] - ETA: 0s - loss: 1.8859Mean edit distance for epoch 37: 17.3773 1357/1357 [==============================] - 67s 49ms/step - loss: 1.8859 - val_loss: 1.7979 Epoch 38/50 1356/1357 [============================>.] - ETA: 0s - loss: 1.8336Mean edit distance for epoch 38: 17.3572 1357/1357 [==============================] - 66s 49ms/step - loss: 1.8335 - val_loss: 1.7572 Epoch 39/50 1357/1357 [==============================] - ETA: 0s - loss: 1.8296Mean edit distance for epoch 39: 17.3655 1357/1357 [==============================] - 66s 49ms/step - loss: 1.8296 - val_loss: 1.7855 Epoch 40/50 1356/1357 [============================>.] - ETA: 0s - loss: 1.8015Mean edit distance for epoch 40: 17.3752 1357/1357 [==============================] - 66s 49ms/step - loss: 1.8014 - val_loss: 1.7763 Epoch 41/50 1356/1357 [============================>.] - ETA: 0s - loss: 1.8819Mean edit distance for epoch 41: 17.3934 1357/1357 [==============================] - 66s 48ms/step - loss: 1.8819 - val_loss: 1.8190 Epoch 42/50 1357/1357 [==============================] - ETA: 0s - loss: 1.7901Mean edit distance for epoch 42: 17.3488 1357/1357 [==============================] - 65s 48ms/step - loss: 1.7901 - val_loss: 1.7245 Epoch 43/50 1357/1357 [==============================] - ETA: 0s - loss: 1.7541Mean edit distance for epoch 43: 17.3599 1357/1357 [==============================] - 67s 49ms/step - loss: 1.7541 - val_loss: 1.7311 Epoch 44/50 1356/1357 [============================>.] - ETA: 0s - loss: 1.7241Mean edit distance for epoch 44: 17.3524 1357/1357 [==============================] - 65s 48ms/step - loss: 1.7242 - val_loss: 1.7827 Epoch 45/50 1356/1357 [============================>.] - ETA: 0s - loss: 1.7227Mean edit distance for epoch 45: 17.3653 1357/1357 [==============================] - 65s 48ms/step - loss: 1.7228 - val_loss: 1.7457 Epoch 46/50 1356/1357 [============================>.] - ETA: 0s - loss: 1.7197Mean edit distance for epoch 46: 17.3513 1357/1357 [==============================] - 65s 48ms/step - loss: 1.7199 - val_loss: 1.7366 Epoch 47/50 1356/1357 [============================>.] - ETA: 0s - loss: 1.6876Mean edit distance for epoch 47: 17.3430 1357/1357 [==============================] - 65s 48ms/step - loss: 1.6875 - val_loss: 1.7244 Epoch 48/50 1356/1357 [============================>.] 
- ETA: 0s - loss: 1.7167Mean edit distance for epoch 48: 17.3273 1357/1357 [==============================] - 66s 48ms/step - loss: 1.7167 - val_loss: 1.6987 Epoch 49/50 1356/1357 [============================>.] - ETA: 0s - loss: 1.7589Mean edit distance for epoch 49: 17.3550 1357/1357 [==============================] - 65s 48ms/step - loss: 1.7588 - val_loss: 1.7602 Epoch 50/50 1356/1357 [============================>.] - ETA: 0s - loss: 1.6587Mean edit distance for epoch 50: 17.3387 1357/1357 [==============================] - 65s 48ms/step - loss: 1.6589 - val_loss: 1.6907 CPU times: user 1h 16min 31s, sys: 8min 38s, total: 1h 25min 9s Wall time: 1h 1min 48s
Inference
# A utility function to decode the output of the network.
def decode_batch_predictions(pred):
input_len = np.ones(pred.shape[0]) * pred.shape[1]
# Use greedy search. For complex tasks, you can use beam search.
results = keras.backend.ctc_decode(pred, input_length=input_len, greedy=True)[0][0][
:, :max_len
]
# Iterate over the results and get back the text.
output_text = []
for res in results:
res = tf.gather(res, tf.where(tf.math.not_equal(res, -1)))
res = tf.strings.reduce_join(num_to_char(res)).numpy().decode("utf-8")
output_text.append(res)
return output_text
# Let's check results on some test samples.
for batch in test_ds.take(1):
batch_images = batch["image"]
_, ax = plt.subplots(4, 4, figsize=(15, 8))
preds = prediction_model.predict(batch_images)
pred_texts = decode_batch_predictions(preds)
for i in range(16):
img = batch_images[i]
img = tf.image.flip_left_right(img)
img = tf.transpose(img, perm=[1, 0, 2])
img = (img * 255.0).numpy().clip(0, 255).astype(np.uint8)
img = img[:, :, 0]
title = f"Prediction: {pred_texts[i]}"
ax[i // 4, i % 4].imshow(img, cmap="gray")
ax[i // 4, i % 4].set_title(title)
ax[i // 4, i % 4].axis("off")
plt.show()
To get better results, the model should be trained for at least 50 epochs.
(50 epochs)
Final remarks
- The prediction_model is fully compatible with TensorFlow Lite. If you are interested, you can use it inside a mobile application; this notebook may be useful in that regard (see the conversion sketch after this list).
- Not all the training examples are perfectly aligned, as can be observed in this example. This can degrade model performance for complex sequences. To that end, we can leverage Spatial Transformer Networks (Jaderberg et al.), which can help the model learn affine transformations that maximize its performance.
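(Translator's note) As a pointer for the first remark, here is a minimal TensorFlow Lite conversion sketch; it is not part of the original example, and whether the SELECT_TF_OPS fallback is required depends on how the bidirectional LSTM layers get converted:
# Export the CTC-free prediction model to TensorFlow Lite.
converter = tf.lite.TFLiteConverter.from_keras_model(prediction_model)
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # standard TFLite ops
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TF ops where needed
]
tflite_model = converter.convert()
with open("handwriting_recognizer.tflite", "wb") as f:
    f.write(tflite_model)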