Keras 2 : examples : NLP – 事前訓練済み単語埋め込みの使用 (翻訳/解説)

翻訳 : (株)クラスキャットセールスインフォメーション
作成日時 : 06/01/2022 (keras 2.9.0)

* 本ページは、Keras の以下のドキュメントを翻訳した上で適宜、補足説明したものです：

Code examples : Natural Language Processing : Using pre-trained word embeddings (Author: fchollet)

* サンプルコードの動作確認はしておりますが、必要な場合には適宜、追加改変しています。
* ご自由にリンクを張って頂いてかまいませんが、sales-info@classcat.com までご一報いただけると嬉しいです。

クラスキャット人工知能研究開発支援サービス

◆ クラスキャットは人工知能・テレワークに関する各種サービスを提供しています。お気軽にご相談ください :

人工知能研究開発支援
1. 人工知能研修サービス(経営者層向けオンサイト研修)
2. テクニカルコンサルティングサービス
3. 実証実験(プロトタイプ構築)
4. アプリケーションへの実装
人工知能研修サービス
PoC(概念実証)を失敗させないための支援

◆ 人工知能とビジネスをテーマに WEB セミナーを定期的に開催しています。スケジュール。

お住まいの地域に関係なく Web ブラウザからご参加頂けます。事前登録 が必要ですのでご注意ください。

◆ お問合せ : 本件に関するお問い合わせ先は下記までお願いいたします。

株式会社クラスキャット セールス・マーケティング本部セールス・インフォメーション
sales-info@classcat.com ; Web: www.classcat.com ; ClassCatJP

Keras 2 : examples : 自然言語処理 – 事前訓練済み単語埋め込みの使用

Description : 事前訓練済み GloVe 単語埋め込みを使用した Newsgroup20 データセット上のテキスト分類。

セットアップ

import numpy as np
import tensorflow as tf
from tensorflow import keras

イントロダクション

この例では、事前訓練済み単語埋め込みを使用するテキスト分類モデルを訓練する方法を示します。

Newsgroup20 データセットを扱います、20 の異なるトピックカテゴリーに属する 20,000 のメッセージボードのメッセージのセットです。

事前訓練済み単語埋め込みについては、GloVe 埋め込みを使用します。

Newsgroup20 データのダウンロード

data_path = keras.utils.get_file(
    "news20.tar.gz",
    "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
    untar=True,
)

データを見てみましょう

import os
import pathlib

data_dir = pathlib.Path(data_path).parent / "20_newsgroup"
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)

fnames = os.listdir(data_dir / "comp.graphics")
print("Number of files in comp.graphics:", len(fnames))
print("Some example filenames:", fnames[:5])

Number of directories: 20
Directory names: ['talk.politics.mideast', 'rec.autos', 'comp.sys.mac.hardware', 'alt.atheism', 'rec.sport.baseball', 'comp.os.ms-windows.misc', 'rec.sport.hockey', 'sci.crypt', 'sci.med', 'talk.politics.misc', 'rec.motorcycles', 'comp.windows.x', 'comp.graphics', 'comp.sys.ibm.pc.hardware', 'sci.electronics', 'talk.politics.guns', 'sci.space', 'soc.religion.christian', 'misc.forsale', 'talk.religion.misc']
Number of files in comp.graphics: 1000
Some example filenames: ['38254', '38402', '38630', '38865', '38891']

ここに一つのファイルが含むもののサンプルがあります :

print(open(data_dir / "comp.graphics" / "38987").read())

Newsgroups: comp.graphics
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!howland.reston.ans.net!agate!dog.ee.lbl.gov!network.ucsd.edu!usc!rpi!nason110.its.rpi.edu!mabusj
From: mabusj@nason110.its.rpi.edu (Jasen M. Mabus)
Subject: Looking for Brain in CAD
Message-ID: <c285m+p@rpi.edu>
Nntp-Posting-Host: nason110.its.rpi.edu
Reply-To: mabusj@rpi.edu
Organization: Rensselaer Polytechnic Institute, Troy, NY.
Date: Thu, 29 Apr 1993 23:27:20 GMT
Lines: 7

Jasen Mabus
RPI student

I am looking for a hman brain in any CAD (.dxf,.cad,.iges,.cgm,etc.) or picture (.gif,.jpg,.ras,etc.) format for an animation demonstration. If any has or knows of a location please reply by e-mail to mabusj@rpi.edu.

Thank you in advance,
Jasen Mabus

ご覧のように、明示的であるにせよ (最初の行が文字通りカテゴリー名) 、暗黙的であるにせよ e.g. ファイルされた Organization によって、ファイルのカテゴリーを漏らしているヘッダ行があります、ヘッダを除去しましょう :

samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        fpath = dirpath / fname
        f = open(fpath, encoding="latin-1")
        content = f.read()
        lines = content.split("\n")
        lines = lines[10:]
        content = "\n".join(lines)
        samples.append(content)
        labels.append(class_index)
    class_index += 1

print("Classes:", class_names)
print("Number of samples:", len(samples))

Processing alt.atheism, 1000 files found
Processing comp.graphics, 1000 files found
Processing comp.os.ms-windows.misc, 1000 files found
Processing comp.sys.ibm.pc.hardware, 1000 files found
Processing comp.sys.mac.hardware, 1000 files found
Processing comp.windows.x, 1000 files found
Processing misc.forsale, 1000 files found
Processing rec.autos, 1000 files found
Processing rec.motorcycles, 1000 files found
Processing rec.sport.baseball, 1000 files found
Processing rec.sport.hockey, 1000 files found
Processing sci.crypt, 1000 files found
Processing sci.electronics, 1000 files found
Processing sci.med, 1000 files found
Processing sci.space, 1000 files found
Processing soc.religion.christian, 997 files found
Processing talk.politics.guns, 1000 files found
Processing talk.politics.mideast, 1000 files found
Processing talk.politics.misc, 1000 files found
Processing talk.religion.misc, 1000 files found
Classes: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Number of samples: 19997

実際には想定する数のファイルを持たない一つのカテゴリーがありますが、違いは十分に小さく、問題は均衡した分類問題のままです。

データを訓練 & 検証セットにシャッフルして分割する

# Shuffle the data
seed = 1337
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]

語彙インデックスの作成

データセットで見つかった語彙をインデックスするために TextVectorization を使用しましょう。後で、サンプルをベクトル化するために同じ層インスタンスを使用します。

層は上位 20,000 単語だけを考慮して、実際には 200 トークン長になるようにシークエンスを切り詰めるかパディングします。

from tensorflow.keras.layers import TextVectorization

vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

vectorizer.get_vocabulary() を通して使用される計算済みの語彙を取得できます。上位 5 単語をプリントしましょう :

vectorizer.get_vocabulary()[:5]

['', '[UNK]', 'the', 'to', 'of']

テストシークエンスをベクトル化しましょう :

output = vectorizer([["the cat sat on the mat"]])
output.numpy()[0, :6]

array([   2, 3697, 1686,   15,    2, 5943])

ご覧のように、”the” は “2” として表されます。”the” が語彙の最初の単語であるとすると、何故 0 でないのでしょう？それはインデックス 0 はパディングのために予約され、そしてインデックス 1 は “out of vocabulary” トークンのために予約されているからです。

ここに単語をインデックスにマップする辞書があります :

voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

ご覧のように、テストセンテンスについて上と同じエンコーディングを得ます :

test = ["the", "cat", "sat", "on", "the", "mat"]
[word_index[w] for w in test]

[2, 3697, 1686, 15, 2, 5943]

事前訓練済み埋め込みのロード

事前訓練済み GloVe 埋め込みをダウンロードしましょう (822M zip ファイル)。

次のコマンドを実行する必要があります :

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

アーカイブは様々なサイズのテキスト・エンコードされたベクトルを含みます : 50-次元、100-次元、200-次元、300-次元です。100D のものを使用します。

単語 (文字列) を NumPy ベクトル表現にマップする辞書を作成しましょう :

path_to_glove_file = os.path.join(
    os.path.expanduser("~"), ".keras/datasets/glove.6B.100d.txt"
)

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.

そして、Keras Embedding で使用できる、対応する埋め込み行列を準備しましょう。それは単純な NumPy 行列で、そこではインデックス i のエントリは vectorizer の語彙内のインデックス i の単語に対する事前訓練済みベクトルです。

num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 17999 words (2001 misses)

次に、事前訓練済み単語埋め込み行列を Embedding 層にロードします。埋め込みを固定し続けるために trainable=False を設定することに注意してください (訓練の間にそれらを更新することを望みません)。

from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

モデルの構築

グローバル max プーリングを持つ単純な 1D convnet と最後に分類器です。

from tensorflow.keras import layers

int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding (Embedding)        (None, None, 100)         2000200   
_________________________________________________________________
conv1d (Conv1D)              (None, None, 128)         64128     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, None, 128)         0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, None, 128)         82048     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, None, 128)         0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, None, 128)         82048     
_________________________________________________________________
global_max_pooling1d (Global (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 128)               16512     
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 20)                2580      
=================================================================
Total params: 2,247,516
Trainable params: 247,316
Non-trainable params: 2,000,200
_________________________________________________________________

モデルの訓練

最初に、文字列のリストデータを整数インデックスの NumPy 配列に変換します。配列は右パディング (= right-padded) されています。

x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)

softmax 分類を行っていますので、損失として categorical crossentropy を使用します。更に、ラベルが整数ですので sparse_categorical_crossentropy を使用します。

model.compile(
    loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))

Epoch 1/20
125/125 [==============================] - 8s 57ms/step - loss: 2.8766 - acc: 0.0945 - val_loss: 2.0770 - val_acc: 0.2956
Epoch 2/20
125/125 [==============================] - 7s 58ms/step - loss: 2.0792 - acc: 0.2887 - val_loss: 1.6626 - val_acc: 0.4076
Epoch 3/20
125/125 [==============================] - 7s 60ms/step - loss: 1.5632 - acc: 0.4527 - val_loss: 1.3000 - val_acc: 0.5609
Epoch 4/20
125/125 [==============================] - 8s 60ms/step - loss: 1.2945 - acc: 0.5612 - val_loss: 1.2282 - val_acc: 0.5944
Epoch 5/20
125/125 [==============================] - 8s 61ms/step - loss: 1.1137 - acc: 0.6209 - val_loss: 1.0695 - val_acc: 0.6409
Epoch 6/20
125/125 [==============================] - 8s 61ms/step - loss: 0.9556 - acc: 0.6718 - val_loss: 1.1743 - val_acc: 0.6124
Epoch 7/20
125/125 [==============================] - 8s 61ms/step - loss: 0.8235 - acc: 0.7172 - val_loss: 1.0126 - val_acc: 0.6602
Epoch 8/20
125/125 [==============================] - 8s 65ms/step - loss: 0.7268 - acc: 0.7475 - val_loss: 1.0608 - val_acc: 0.6632
Epoch 9/20
125/125 [==============================] - 8s 63ms/step - loss: 0.6441 - acc: 0.7759 - val_loss: 1.0606 - val_acc: 0.6664
Epoch 10/20
125/125 [==============================] - 8s 63ms/step - loss: 0.5409 - acc: 0.8120 - val_loss: 1.0380 - val_acc: 0.6884
Epoch 11/20
125/125 [==============================] - 8s 65ms/step - loss: 0.4846 - acc: 0.8273 - val_loss: 1.1073 - val_acc: 0.6729
Epoch 12/20
125/125 [==============================] - 8s 62ms/step - loss: 0.4173 - acc: 0.8553 - val_loss: 1.1256 - val_acc: 0.6864
Epoch 13/20
125/125 [==============================] - 8s 63ms/step - loss: 0.3419 - acc: 0.8808 - val_loss: 1.1576 - val_acc: 0.6979
Epoch 14/20
125/125 [==============================] - 8s 68ms/step - loss: 0.2869 - acc: 0.9053 - val_loss: 1.1381 - val_acc: 0.6974
Epoch 15/20
125/125 [==============================] - 8s 67ms/step - loss: 0.2617 - acc: 0.9118 - val_loss: 1.3850 - val_acc: 0.6747
Epoch 16/20
125/125 [==============================] - 8s 67ms/step - loss: 0.2543 - acc: 0.9152 - val_loss: 1.3119 - val_acc: 0.6972
Epoch 17/20
125/125 [==============================] - 8s 66ms/step - loss: 0.2109 - acc: 0.9267 - val_loss: 1.3145 - val_acc: 0.6954
Epoch 18/20
125/125 [==============================] - 8s 64ms/step - loss: 0.1939 - acc: 0.9364 - val_loss: 1.4054 - val_acc: 0.7009
Epoch 19/20
125/125 [==============================] - 8s 67ms/step - loss: 0.1873 - acc: 0.9379 - val_loss: 1.7441 - val_acc: 0.6667
Epoch 20/20
125/125 [==============================] - 9s 70ms/step - loss: 0.1762 - acc: 0.9420 - val_loss: 1.5269 - val_acc: 0.6927

<tensorflow.python.keras.callbacks.History at 0x157134890>

end-to-end モデルのエクスポート

さて、入力としてインデックスのシークエンスではなく、任意の長さの文字列を取る Model オブジェクトをエクスポートすることを望むかもしれません。それは、入力前処理パイプラインについて心配する必要がないので、モデルを遥かに可搬にします。

vectorizer は実際には Keras 層ですので、それは簡単です :

string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)

probabilities = end_to_end_model.predict(
    [["this message is about computer graphics and 3D modeling"]]
)

class_names[np.argmax(probabilities[0])]

'comp.graphics'

以上

2022年6月
月	火	水	木	金	土	日
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30