TensorFlow 2.0 : Beginner Tutorials : Keras ML basics :- Text classification with preprocessed text : Movie reviews (translation/commentary)

Translation : ClassCat Co., Ltd., Sales Information
Created : 10/03/2019

* This page is a translation, with supplementary explanation where appropriate, of the following page from TF 2.0 – Beginner Tutorials – ML basics with Keras on the TensorFlow org site:

Text classification with preprocessed text: Movie reviews

* The sample code has been verified to run, but has been extended or modified where necessary.
* Feel free to link to this page; a short note to sales-info@classcat.com would be appreciated.

Keras ML basics :- Text classification with preprocessed text : Movie reviews

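This notebook classifies movie reviews as positive or negative using the text of the review. This is an example of binary (two-class) classification, an important and widely applicable kind of machine learning problem.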

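We use the IMDB dataset, which contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and test sets are balanced, meaning they contain an equal number of positive and negative reviews.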

This notebook uses tf.keras, a high-level API for building and training models in TensorFlow. For a more advanced text classification tutorial using tf.keras, see the MLCC Text Classification Guide.

Setup

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf

from tensorflow import keras

import tensorflow_datasets as tfds
tfds.disable_progress_bar()

import numpy as np

print(tf.__version__)

2.0.0

Download the IMDB dataset

The IMDB movie reviews dataset comes packaged in tfds.

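It has already been preprocessed so that the reviews (sequences of words) have been converted to sequences of integers, where each integer represents a specific word in a dictionary.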

To encode your own text, see the Loading text tutorial.

The following code downloads the IMDB dataset to your machine (or uses a cached copy if you have already downloaded it):

(train_data, test_data), info = tfds.load(
    # Use the version pre-encoded with an ~8k vocabulary.
    'imdb_reviews/subwords8k', 
    # Return the train/test datasets as a tuple.
    split = (tfds.Split.TRAIN, tfds.Split.TEST),
    # Return (example, label) pairs from the dataset (instead of a dictionary).
    as_supervised=True,
    # Also return the `info` structure. 
    with_info=True)

Downloading and preparing dataset imdb_reviews (80.23 MiB) to /home/kbuilder/tensorflow_datasets/imdb_reviews/subwords8k/0.1.0...
WARNING:tensorflow:From /home/kbuilder/.local/lib/python3.5/site-packages/tensorflow_datasets/core/file_format_adapter.py:209: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`

Dataset imdb_reviews downloaded and prepared to /home/kbuilder/tensorflow_datasets/imdb_reviews/subwords8k/0.1.0. Subsequent calls will reuse this data.

Try the encoder

The dataset info includes the text encoder (a tfds.features.text.SubwordTextEncoder).

encoder = info.features['text'].encoder

print ('Vocabulary size: {}'.format(encoder.vocab_size))

Vocabulary size: 8185

This text encoder will reversibly encode any string:

sample_string = 'Hello TensorFlow.'

encoded_string = encoder.encode(sample_string)
print ('Encoded string is {}'.format(encoded_string))

original_string = encoder.decode(encoded_string)
print ('The original string: "{}"'.format(original_string))

assert original_string == sample_string

Encoded string is [4025, 222, 6307, 2327, 4043, 2120, 7975]
The original string: "Hello TensorFlow."

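The encoder encodes a string by breaking it into subwords or characters when a word is not in its dictionary. So the more a string resembles the dataset, the shorter the encoded representation will be.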

for ts in encoded_string:
  print ('{} ----> {}'.format(ts, encoder.decode([ts])))

4025 ----> Hell
222 ----> o 
6307 ----> Ten
2327 ----> sor
4043 ----> Fl
2120 ----> ow
7975 ----> .
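To make the fallback to subwords and characters concrete, here is a minimal sketch of greedy longest-match splitting. It is an illustration only, not the algorithm SubwordTextEncoder actually uses; the function `greedy_subword_split` and the toy vocabulary are hypothetical and were chosen so that the pieces mirror the output above.

```python
def greedy_subword_split(text, vocab):
    """Split text into the longest pieces found in vocab, else single characters."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the vocabulary matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No known piece starts at this position: fall back to one character.
            pieces.append(text[i])
            i += 1
    return pieces


# Hypothetical toy vocabulary (not the real ~8k vocabulary of the dataset).
toy_vocab = {"Hell", "o ", "Ten", "sor", "Fl", "ow", "."}
pieces = greedy_subword_split("Hello TensorFlow.", toy_vocab)
print(pieces)
```

Because every character ends up in exactly one piece, joining the pieces restores the original string, which is the sense in which such an encoding is reversible.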

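Explore the data

Let's take a moment to understand the format of the data. The dataset comes preprocessed: each example is an array of integers representing the word pieces of a movie review.

The text of each review has been converted to integers, where each integer represents a specific word piece in the dictionary.

Each label is an integer value of either 0 or 1, where 0 is a negative review and 1 is a positive review.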

Here is what the first review looks like:

for train_example, train_label in train_data.take(1):
  print('Encoded text:', train_example[:10].numpy())
  print('Label:', train_label.numpy())

Encoded text: [ 249    4  277  309  560    6 6639 4574    2   12]
Label: 1

The info structure contains the encoder/decoder. The encoder can be used to recover the original text:

encoder.decode(train_example)

"As a lifelong fan of Dickens, I have invariably been disappointed by adaptations of his novels.

Although his works presented an extremely accurate re-telling of human life at every level in Victorian Britain, throughout them all was a pervasive thread of humour that could be both playful or sarcastic as the narrative dictated. In a way, he was a literary caricaturist and cartoonist. He could be serious and hilarious in the same sentence. He pricked pride, lampooned arrogance, celebrated modesty, and empathised with loneliness and poverty. It may be a cliché, but he was a people's writer.

And it is the comedy that is so often missing from his interpretations. At the time of writing, Oliver Twist is being dramatised in serial form on BBC television. All of the misery and cruelty is their, but non of the humour, irony, and savage lampoonery. The result is just a dark, dismal experience: the story penned by a journalist rather than a novelist. It's not really Dickens at all.

'Oliver!', on the other hand, is much closer to the mark. The mockery of officialdom is perfectly interpreted, from the blustering beadle to the drunken magistrate. The classic stand-off between the beadle and Mr Brownlow, in which the law is described as 'a ass, a idiot' couldn't have been better done. Harry Secombe is an ideal choice.

But the blinding cruelty is also there, the callous indifference of the state, the cold, hunger, poverty and loneliness are all presented just as surely as The Master would have wished.

And then there is crime. Ron Moody is a treasure as the sleazy Jewish fence, whilst Oliver Reid has Bill Sykes to perfection.

Perhaps not surprisingly, Lionel Bart - himself a Jew from London's east-end - takes a liberty with Fagin by re-interpreting him as a much more benign fellow than was Dicken's original. In the novel, he was utterly ruthless, sending some of his own boys to the gallows in order to protect himself (though he was also caught and hanged). Whereas in the movie, he is presented as something of a wayward father-figure, a sort of charitable thief rather than a corrupter of children, the latter being a long-standing anti-semitic sentiment. Otherwise, very few liberties are taken with Dickens's original. All of the most memorable elements are included. Just enough menace and violence is retained to ensure narrative fidelity whilst at the same time allowing for children' sensibilities. Nancy is still beaten to death, Bullseye narrowly escapes drowning, and Bill Sykes gets a faithfully graphic come-uppance.

Every song is excellent, though they do incline towards schmaltz. Mark Lester mimes his wonderfully. Both his and my favourite scene is the one in which the world comes alive to 'who will buy'. It's schmaltzy, but it's Dickens through and through.

I could go on. I could commend the wonderful set-pieces, the contrast of the rich and poor. There is top-quality acting from more British regulars than you could shake a stick at.

I ought to give it 10 points, but I'm feeling more like Scrooge today. Soak it up with your Christmas dinner. No original has been better realised."

Prepare the data for training

We want to create batches of training data for the model. The reviews are all different lengths, so use padded_batch to zero-pad the sequences while batching:

BUFFER_SIZE = 1000

train_batches = (
    train_data
    .shuffle(BUFFER_SIZE)
    .padded_batch(32, train_data.output_shapes))

test_batches = (
    test_data
    .padded_batch(32, train_data.output_shapes))

Each batch has shape (batch_size, sequence_length). Because the padding is dynamic, each batch will have a different length:

for example_batch, label_batch in train_batches.take(2):
  print("Batch shape:", example_batch.shape)
  print("label shape:", label_batch.shape)

Batch shape: (32, 1186)
label shape: (32,)
Batch shape: (32, 1111)
label shape: (32,)
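The batch shapes differ because each batch is padded only up to its own longest sequence. The following pure-Python sketch illustrates the idea; the helper `padded_batches` and the sample reviews are hypothetical and stand in for what padded_batch does on a tf.data Dataset.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Yield batches in which every sequence is right-padded to the batch maximum."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        max_len = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (max_len - len(seq)) for seq in batch]


# Hypothetical encoded reviews of different lengths.
reviews = [[249, 4, 277], [12], [7, 8, 9, 10, 11], [5, 6]]
batches = list(padded_batches(reviews, batch_size=2))
for batch in batches:
    print("Batch shape:", (len(batch), len(batch[0])))
```

The first batch is padded to length 3 and the second to length 5, so the sequence dimension varies from batch to batch while the batch size stays fixed.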

Build the model

A neural network is created by stacking layers. This requires two main architectural decisions:

How many layers to use in the model?
How many hidden units to use for each layer?

In this example, the input data consists of an array of word indices. The labels to predict are either 0 or 1. Let's build a "Continuous bag of words (CBOW)" style model for this problem:

Caution: This model does not use masking, so the zero-padding is treated as part of the input and the padding length may affect the output. To fix this, see the masking and padding guide.

model = keras.Sequential([
  keras.layers.Embedding(encoder.vocab_size, 16),
  keras.layers.GlobalAveragePooling1D(),
  keras.layers.Dense(1, activation='sigmoid')])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 16)          130960    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
=================================================================
Total params: 130,977
Trainable params: 130,977
Non-trainable params: 0
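The parameter counts in the summary can be checked by hand. The short calculation below uses only the vocabulary size printed earlier and the layer arguments from the model definition; the variable names are illustrative.

```python
vocab_size = 8185      # encoder.vocab_size printed earlier
embedding_dim = 16     # second argument of keras.layers.Embedding

embedding_params = vocab_size * embedding_dim  # one 16-dim vector per vocabulary entry
pooling_params = 0                             # averaging has no trainable weights
dense_params = embedding_dim * 1 + 1           # 16 weights plus 1 bias for the single unit

total_params = embedding_params + pooling_params + dense_params
print(embedding_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding table; the classifier on top of the pooled vector has only 17.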

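The layers are stacked sequentially to build the classifier:

The first layer is an Embedding layer. It takes the integer-encoded vocabulary and looks up the embedding vector for each word index. These vectors are learned as the model trains. The vectors add a dimension to the output array; the resulting dimensions are (batch, sequence, embedding).
Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This lets the model handle variable-length input in the simplest way possible.
In the model as defined above, this fixed-length vector has 16 elements (the embedding dimension) and is passed directly to the final layer; there is no intermediate Dense layer with hidden units.
The last layer is densely connected with a single output node. Using the sigmoid activation function, this value is a float between 0 and 1, representing a probability, or confidence level.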
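The forward pass through these layers, and the effect of unmasked padding mentioned in the caution above, can be sketched in plain Python. Everything here is a toy assumption: the 2-dimensional embedding table, the weights, the helper names, and the choice of id 0 as padding are made up for illustration and are not values from the trained model.

```python
import math


def embed(token_ids, table):
    """Look up one embedding vector per token id."""
    return [table[token_id] for token_id in token_ids]


def average_pool(vectors, token_ids=None, pad_id=None):
    """Average over the sequence dimension, optionally skipping padded positions."""
    if pad_id is not None:
        vectors = [v for v, t in zip(vectors, token_ids) if t != pad_id]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def dense_sigmoid(x, weights, bias):
    """Single output unit with a sigmoid activation."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def predict(token_ids, table, weights, bias, pad_id=None):
    pooled = average_pool(embed(token_ids, table), token_ids, pad_id)
    return dense_sigmoid(pooled, weights, bias)


# Hypothetical embedding table: 4 token ids, 2 dimensions; id 0 plays the padding role.
table = [[0.5, -0.5], [1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
weights, bias = [0.8, -0.3], 0.1

review = [1, 2, 3]
padded = review + [0, 0, 0]

p_review = predict(review, table, weights, bias)
p_padded = predict(padded, table, weights, bias)            # padding is averaged in
p_masked = predict(padded, table, weights, bias, pad_id=0)  # padding is skipped
print(p_review, p_padded, p_masked)
```

Without masking, the padding id has its own embedding vector that is averaged in with the real tokens, so the same review gives a different output depending on how much padding its batch added. Skipping the padded positions restores the unpadded result.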

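Hidden units

In the model above, the Embedding and GlobalAveragePooling1D layers sit between the input and the output. The number of outputs (units, nodes, or neurons) of a layer is the dimension of that layer's representational space, which is 16 here. In other words, it is the amount of freedom the network is allowed when learning an internal representation.

If a model has more hidden units (a higher-dimensional representation space), and/or more layers, then the network can learn more complex representations. However, this makes the network more computationally expensive and may lead to learning unwanted patterns: patterns that improve performance on the training data but not on the test data. This is called overfitting, and we'll explore it later.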

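Loss function and optimizer

A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), we'll use the binary_crossentropy loss function.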

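This isn't the only choice for a loss function; you could, for instance, choose mean_squared_error. But, generally, binary_crossentropy is better for dealing with probabilities: it measures the "distance" between probability distributions or, in our case, between the ground-truth distribution and the predictions.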
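A small numeric comparison shows why cross-entropy suits probability outputs. This is a per-example sketch written from the standard definitions, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid taking log(0).

```python
import math


def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Per-example binary cross-entropy for a label in {0, 1}."""
    p = min(max(p_pred, eps), 1.0 - eps)  # clip to keep the logarithm finite
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def squared_error(y_true, p_pred):
    """Per-example squared error, for comparison."""
    return (y_true - p_pred) ** 2


# The true label is 1; the prediction gets progressively worse.
for p in (0.9, 0.5, 0.01):
    print(p, binary_crossentropy(1, p), squared_error(1, p))
```

The squared error can never exceed 1, while the cross-entropy grows without bound as a confident prediction turns out wrong, so confident mistakes are penalized much more heavily.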

Later, when we are exploring regression problems (say, to predict the price of a house), we will see how to use another loss function called mean squared error.

Now, configure the model to use an optimizer and a loss function:

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

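Train the model

Train the model by passing the Dataset object to the model's fit function. Set the number of epochs. Here validation_steps=30 limits each validation pass to 30 batches of the test data.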

history = model.fit(train_batches,
                    epochs=10,
                    validation_data=test_batches,
                    validation_steps=30)

Epoch 1/10
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow_core/python/ops/nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

782/782 [==============================] - 8s 10ms/step - loss: 0.6822 - accuracy: 0.5978 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/10
782/782 [==============================] - 5s 7ms/step - loss: 0.6213 - accuracy: 0.7495 - val_loss: 0.5959 - val_accuracy: 0.7635
Epoch 3/10
782/782 [==============================] - 5s 7ms/step - loss: 0.5426 - accuracy: 0.8070 - val_loss: 0.5296 - val_accuracy: 0.8073
Epoch 4/10
782/782 [==============================] - 5s 7ms/step - loss: 0.4762 - accuracy: 0.8382 - val_loss: 0.4752 - val_accuracy: 0.8313
Epoch 5/10
782/782 [==============================] - 5s 7ms/step - loss: 0.4220 - accuracy: 0.8644 - val_loss: 0.4334 - val_accuracy: 0.8500
Epoch 6/10
782/782 [==============================] - 5s 7ms/step - loss: 0.3823 - accuracy: 0.8769 - val_loss: 0.4054 - val_accuracy: 0.8562
Epoch 7/10
782/782 [==============================] - 5s 7ms/step - loss: 0.3508 - accuracy: 0.8861 - val_loss: 0.3760 - val_accuracy: 0.8677
Epoch 8/10
782/782 [==============================] - 5s 7ms/step - loss: 0.3234 - accuracy: 0.8931 - val_loss: 0.3566 - val_accuracy: 0.8750
Epoch 9/10
782/782 [==============================] - 5s 7ms/step - loss: 0.3033 - accuracy: 0.9000 - val_loss: 0.3406 - val_accuracy: 0.8802
Epoch 10/10
782/782 [==============================] - 5s 7ms/step - loss: 0.2872 - accuracy: 0.9036 - val_loss: 0.3285 - val_accuracy: 0.8844

Evaluate the model

Now let's see how the model performs. Two values are returned: the loss (a number that represents the error; lower values are better) and the accuracy.

loss, accuracy = model.evaluate(test_batches)

print("Loss: ", loss)
print("Accuracy: ", accuracy)

782/782 [==============================] - 3s 4ms/step - loss: 0.3320 - accuracy: 0.8772
Loss:  0.33202573521743955
Accuracy:  0.8772

This fairly naive approach achieves an accuracy of about 88% (0.8772 in the run above). With more advanced approaches, the model should get closer to 95%.
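For a sigmoid output, accuracy is obtained by thresholding each predicted probability and counting matches with the labels. The sketch below assumes a threshold of 0.5; the helper `binary_accuracy` and the sample labels and probabilities are hypothetical.

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical labels and model outputs for five reviews.
labels = [1, 0, 1, 1, 0]
probabilities = [0.91, 0.20, 0.45, 0.70, 0.62]
accuracy = binary_accuracy(labels, probabilities)
print("Accuracy:", accuracy)
```

Accuracy only records which side of the threshold a prediction falls on, whereas the loss also reflects how confident each prediction was. This is why the two metrics can move differently.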

Create a graph of accuracy and loss over time

model.fit() returns a History object that contains a dictionary with everything that happened during training:

history_dict = history.history
history_dict.keys()

dict_keys(['loss', 'val_accuracy', 'accuracy', 'val_loss'])

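There are four entries: one for each metric monitored during training and validation. We can use these to plot the training and validation loss for comparison, as well as the training and validation accuracy: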

import matplotlib.pyplot as plt

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

<Figure size 640x480 with 1 Axes>

plt.clf()   # clear figure

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

plt.show()

In these plots, the dots represent the training loss and accuracy, and the solid lines are the validation loss and accuracy.

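Notice that the training loss decreases with each epoch and the training accuracy increases with each epoch. This is expected when using gradient descent optimization: it should minimize the desired quantity on every iteration.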

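This is not the case for the validation loss and accuracy: they appear to peak after about twenty epochs. The run above stops at 10 epochs, where the validation accuracy (0.8844) is still rising but already trails the training accuracy (0.9036). This is an example of overfitting: the model performs better on the training data than it does on data it has never seen before. After this point, the model over-optimizes and learns representations specific to the training data that do not generalize to test data.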

For this particular case, we could prevent overfitting by simply stopping the training after twenty or so epochs. Later, you'll see how to do this automatically with a callback.
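The stopping rule itself is simple to state: stop once the validation loss has failed to improve for a set number of epochs. tf.keras provides an EarlyStopping callback for this. The sketch below is not that callback's implementation; it is a plain-Python illustration with a hypothetical helper `early_stopping_epoch` and a `patience` parameter. The first list copies the validation losses from epochs 2 to 10 of the log above, and the three values appended to it are invented to show the rule firing.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based position in val_losses at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for position, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return position
    return None


# Validation losses from epochs 2-10 of the training log above.
logged = [0.5959, 0.5296, 0.4752, 0.4334, 0.4054, 0.3760, 0.3566, 0.3406, 0.3285]
# Hypothetical continuation in which the validation loss starts to rise.
hypothetical = logged + [0.3250, 0.3262, 0.3301]

print(early_stopping_epoch(logged))
print(early_stopping_epoch(hypothetical))
```

On the logged values the rule never fires, because the validation loss improves at every epoch. It fires only on the invented continuation, after two consecutive entries without improvement.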
