TensorFlow 2.0 Alpha : Beginner Tutorials : テキストとシークエンス :- RNN でテキスト分類 (翻訳/解説)

翻訳 : (株)クラスキャットセールスインフォメーション
作成日時 : 04/02/2019

* 本ページは、TensorFlow の本家サイトの TF 2.0 Alpha – Beginner Tutorials – Text and sequences の以下のページを翻訳した上で適宜、補足説明したものです：

Text classification with an RNN

* サンプルコードの動作確認はしておりますが、必要な場合には適宜、追加改変しています。
* ご自由にリンクを張って頂いてかまいませんが、sales-info@classcat.com までご一報いただけると嬉しいです。

テキストとシークエンス :- RNN でテキスト分類

このテキスト分類チュートリアルはセンチメント分析のために IMDB 巨大映画レビューデータセット上でリカレント・ニューラルネットワークを訓練します。

from __future__ import absolute_import, division, print_function

!pip install -q tensorflow-gpu==2.0.0-alpha0
import tensorflow_datasets as tfds
import tensorflow as tf

matplotlib をインポートしてグラフをプロットするためのヘルパー関数を作成します :

import matplotlib.pyplot as plt


def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

入力パイプラインをセットアップする

IMDB 巨大映画レビューデータセットは二値分類データセットです — 総てのレビューはポジティブかネガティブなセンチメントを持ちます。

TFDS を使用してデータセットをダウンロードします。データセットは作り付けの部分語字句解析器 (= subword tokenizer) を装備しています。

dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, 
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

これは部分語字句解析器ですから、それは任意の文字列を渡すことができて字句解析器はそれをトークン化します。

tokenizer = info.features['text'].encoder

print ('Vocabulary size: {}'.format(tokenizer.vocab_size))

Vocabulary size: 8185

sample_string = 'TensorFlow is cool.'

tokenized_string = tokenizer.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer.decode(tokenized_string)
print ('The original string: {}'.format(original_string))

assert original_string == sample_string

Tokenized string is [6307, 2327, 4043, 4265, 9, 2724, 7975]
The original string: TensorFlow is cool.

字句解析器は単語がその辞書にない場合には文字列を部分語に分解してエンコードします。

for ts in tokenized_string:
  print ('{} ----> {}'.format(ts, tokenizer.decode([ts])))

6307 ----> Ten
2327 ----> sor
4043 ----> Fl
4265 ----> ow 
9 ----> is 
2724 ----> cool
7975 ----> .

BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)

test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)

モデルを作成する

tf.keras.Sequential モデルを構築して埋め込み層から始めます。埋め込み層は単語毎に一つのベクトルをストアします。呼び出されたとき、それは単語インデックスのシークエンスをベクトルのシークエンスに変換します。これらのベクトルは訓練可能です。(十分なデータ上で) 訓練後、類似の意味を持つ単語はしばしば同様のベクトルを持ちます。

このインデックス-検索は tf.keras.layers.Dense 層を通した one-hot エンコード・ベクトルを渡す等値の演算よりも遥かにより効率的です。

リカレント・ニューラルネットワーク (RNN) は要素を通した iterate によるシークエンス入力を処理します。RNN は一つのタイムスタンプからの出力をそれらの入力 — そして次へと渡します。

tf.keras.layers.Bidirectional ラッパーはまた RNN 層とともに使用できます。これは RNN 層を通して入力を foward そして backward に伝播してそれから出力を連結します。これは RNN が長期の (= long range) 依存性を学習する手助けをします。

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

訓練プロセスを構成するために Keras モデルをコンパイルします :

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

モデルを訓練する

history = model.fit(train_dataset, epochs=10, 
                    validation_data=test_dataset)

Epoch 1/10
391/391 [==============================] - 75s 191ms/step - loss: 0.5536 - accuracy: 0.7140 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/10
391/391 [==============================] - 73s 187ms/step - loss: 0.3922 - accuracy: 0.8311 - val_loss: 0.5141 - val_accuracy: 0.7940
Epoch 3/10
391/391 [==============================] - 71s 182ms/step - loss: 0.3120 - accuracy: 0.8807 - val_loss: 0.4517 - val_accuracy: 0.8098
Epoch 4/10
391/391 [==============================] - 78s 199ms/step - loss: 0.2548 - accuracy: 0.9030 - val_loss: 0.4383 - val_accuracy: 0.8235
Epoch 5/10
391/391 [==============================] - 72s 185ms/step - loss: 0.2387 - accuracy: 0.9078 - val_loss: 0.4918 - val_accuracy: 0.8214
Epoch 6/10
391/391 [==============================] - 71s 182ms/step - loss: 0.1905 - accuracy: 0.9293 - val_loss: 0.4849 - val_accuracy: 0.8162
Epoch 7/10
391/391 [==============================] - 71s 182ms/step - loss: 0.1900 - accuracy: 0.9282 - val_loss: 0.5919 - val_accuracy: 0.8257
Epoch 8/10
391/391 [==============================] - 74s 190ms/step - loss: 0.1321 - accuracy: 0.9526 - val_loss: 0.6331 - val_accuracy: 0.7657
Epoch 9/10
391/391 [==============================] - 73s 187ms/step - loss: 0.3290 - accuracy: 0.8516 - val_loss: 0.6709 - val_accuracy: 0.6501
Epoch 10/10
391/391 [==============================] - 70s 180ms/step - loss: 0.3074 - accuracy: 0.8692 - val_loss: 0.5533 - val_accuracy: 0.7873

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

    391/Unknown - 19s 47ms/step - loss: 0.5533 - accuracy: 0.7873Test Loss: 0.553319326714
Test Accuracy: 0.787320017815

上のモデルはシークエンスに適用されるパディングをマスクしていません。これはパッドされたシークエンス上で訓練してパッドされていないシークエンス上でテストする場合に歪みに繋がる可能性があります。理想的にはモデルはパディングを無視することを学習するでしょうが、下で見れるようにそれは出力上で小さい効果を持つだけです。

prediction が >= 0.5 であれば、それはポジティブでそうでなければネガティブです。

def pad_to_size(vec, size):
  zeros = [0] * (size - len(vec))
  vec.extend(zeros)
  return vec

def sample_predict(sentence, pad):
  tokenized_sample_pred_text = tokenizer.encode(sample_pred_text)
  
  if pad:
    tokenized_sample_pred_text = pad_to_size(tokenized_sample_pred_text, 64)
  
  predictions = model.predict(tf.expand_dims(tokenized_sample_pred_text, 0))

  return (predictions)

# predict on a sample text without padding.

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print (predictions)

[[ 0.68914342]]

# predict on a sample text with padding

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print (predictions)

[[ 0.68634349]]

plot_graphs(history, 'accuracy')

plot_graphs(history, 'loss')

2 つあるいはそれ以上の LSTM 層をスタックする

Keras リカレント層は return_sequences コンストラクタ引数で制御される 2 つの利用可能なモードを持ちます :

各タイムスタンプのための連続する出力の完全なシークエンスを返すか (shape (batch_size, timesteps, output_features) の 3D tensor)、
各入力シークエンスのための最後の出力だけを返します (shape (batch_size, output_features) の 2D tensor)。

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
        64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

history = model.fit(train_dataset, epochs=10, 
                    validation_data=test_dataset)

Epoch 1/10
391/391 [==============================] - 155s 397ms/step - loss: 0.6349 - accuracy: 0.6162 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/10
391/391 [==============================] - 155s 396ms/step - loss: 0.6333 - accuracy: 0.6134 - val_loss: 0.5872 - val_accuracy: 0.6914
Epoch 3/10
391/391 [==============================] - 153s 391ms/step - loss: 0.4199 - accuracy: 0.8217 - val_loss: 0.4361 - val_accuracy: 0.8187
Epoch 4/10
391/391 [==============================] - 156s 398ms/step - loss: 0.3088 - accuracy: 0.8785 - val_loss: 0.4131 - val_accuracy: 0.8319
Epoch 5/10
391/391 [==============================] - 153s 391ms/step - loss: 0.3328 - accuracy: 0.8564 - val_loss: 0.4689 - val_accuracy: 0.7958
Epoch 6/10
391/391 [==============================] - 156s 398ms/step - loss: 0.2383 - accuracy: 0.9128 - val_loss: 0.4299 - val_accuracy: 0.8404
Epoch 7/10
391/391 [==============================] - 152s 388ms/step - loss: 0.2426 - accuracy: 0.9039 - val_loss: 0.4934 - val_accuracy: 0.8299
Epoch 8/10
391/391 [==============================] - 155s 396ms/step - loss: 0.1638 - accuracy: 0.9440 - val_loss: 0.5106 - val_accuracy: 0.8279
Epoch 9/10
391/391 [==============================] - 150s 383ms/step - loss: 0.1616 - accuracy: 0.9420 - val_loss: 0.5287 - val_accuracy: 0.8245
Epoch 10/10
391/391 [==============================] - 154s 394ms/step - loss: 0.1120 - accuracy: 0.9643 - val_loss: 0.5646 - val_accuracy: 0.8070

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

    391/Unknown - 45s 115ms/step - loss: 0.5646 - accuracy: 0.8070Test Loss: 0.564571284348
Test Accuracy: 0.80703997612

# predict on a sample text without padding.

sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print (predictions)

[[ 0.00393916]]

# predict on a sample text with padding

sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print (predictions)

[[ 0.01098633]]

plot_graphs(history, 'accuracy')

plot_graphs(history, 'loss')

GRU 層のような他の存在するリカレント層を調べてください。

以上

2019年4月
月	火	水	木	金	土	日
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30