AutoKeras 1.0 : Tutorials : Text Classification (Translation/Commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 03/21/2020

* This page is a translation of the following AutoKeras page, with supplementary explanations added where appropriate:

Getting Started : Text Classification

* The sample code has been checked to run, but it has been extended or modified where necessary.
* You are welcome to link to this page; a short note to sales-info@classcat.com would be appreciated.

Tutorials : Text Classification

A Simple Example

The first step is to prepare your data. Here we use the IMDB dataset as an example.

import numpy as np
from tensorflow.keras.datasets import imdb

# Load the IMDB dataset as integer sequences with Keras.
index_offset = 3  # word index offset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=1000,
                                                      index_from=index_offset)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
# Prepare the dictionary of index to word.
word_to_id = imdb.get_word_index()
word_to_id = {k: (v + index_offset) for k, v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
id_to_word = {value: key for key, value in word_to_id.items()}
# Convert the word indices to words.
x_train = list(map(lambda sentence: ' '.join(
    id_to_word[i] for i in sentence), x_train))
x_test = list(map(lambda sentence: ' '.join(
    id_to_word[i] for i in sentence), x_test))
# np.str was a deprecated alias of the built-in str and is gone from recent NumPy.
x_train = np.array(x_train, dtype=str)
x_test = np.array(x_test, dtype=str)
print(x_train.shape)  # (25000,)
print(y_train.shape)  # (25000, 1)
print(x_train[0][:50])  # <START> this film was just brilliant casting
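
The decoding step above depends on the index offset: Keras reserves ids 0, 1 and 2 for padding, start-of-sequence and out-of-vocabulary markers, so every word rank is shifted by 3. The following is a minimal sketch of that idea on a made-up four-word vocabulary; `toy_word_index`, `decode` and `encoded_review` are hypothetical names used only for illustration, not part of the tutorial or of Keras.

```python
# Toy illustration of the index offset (hypothetical 4-word vocabulary).
INDEX_OFFSET = 3  # ids 0, 1, 2 are reserved for special tokens

# Stand-in for imdb.get_word_index(): word -> frequency rank, starting at 1.
toy_word_index = {"this": 1, "film": 2, "was": 3, "brilliant": 4}

# Shift every rank by the offset, then register the reserved tokens.
word_to_id = {word: rank + INDEX_OFFSET for word, rank in toy_word_index.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
id_to_word = {idx: word for word, idx in word_to_id.items()}


def decode(sequence):
    """Turn a list of integer ids back into a space-separated string."""
    return " ".join(id_to_word[i] for i in sequence)


# 1 marks the start of a review; 2 replaces any word outside the vocabulary.
encoded_review = [1, 4, 5, 6, 2, 7]
decoded_review = decode(encoded_review)
print(decoded_review)  # <START> this film was <UNK> brilliant
```

If the three reserved ids were missing from `id_to_word`, the lookup would fail on the very first token of every review, since each sequence begins with id 1.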

The second step is to run the TextClassifier.

import autokeras as ak

# Initialize the text classifier.
clf = ak.TextClassifier(max_trials=10) # It tries 10 different models.
# Feed the text classifier with training data.
clf.fit(x_train, y_train)
# Predict with the best model.
predicted_y = clf.predict(x_test)
# Evaluate the best model with testing data.
print(clf.evaluate(x_test, y_test))

Validation Data

By default, AutoKeras uses the last 20% of the training data as validation data. As the example below shows, you can use validation_split to specify the percentage.

clf.fit(x_train,
        y_train,
        # Split the training data and use the last 15% as validation data.
        validation_split=0.15)
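
To make "the last 15%" concrete, here is a rough sketch of a tail split on plain Python lists. It is not AutoKeras's internal code; `split_last_fraction` is a hypothetical helper, and the library's own rounding of the cut point may differ from the `round` used here.

```python
def split_last_fraction(samples, labels, validation_split):
    """Hold out the final `validation_split` fraction of the data for validation."""
    if not 0.0 < validation_split < 1.0:
        raise ValueError("validation_split must be strictly between 0 and 1")
    n_total = len(samples)
    # Assumption: round to the nearest whole sample.
    n_val = round(n_total * validation_split)
    cut = n_total - n_val
    return (samples[:cut], labels[:cut]), (samples[cut:], labels[cut:])


texts = [f"review {i}" for i in range(20)]
targets = [i % 2 for i in range(20)]
(train_x, train_y), (val_x, val_y) = split_last_fraction(texts, targets, 0.15)
print(len(train_x), len(val_x))  # 17 3
print(val_x)  # ['review 17', 'review 18', 'review 19']
```

Because the split is positional rather than random, data that is sorted by label or by date should be shuffled before a tail split like this is used.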

Instead of splitting it off from the training data, you can also supply your own validation set with validation_data.

split = 5000
x_val = x_train[split:]
y_val = y_train[split:]
x_train = x_train[:split]
y_train = y_train[:split]
clf.fit(x_train,
        y_train,
        # Use your own validation set.
        validation_data=(x_val, y_val))

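Customized Search Space

Advanced users may customize the search space by using AutoModel instead of TextClassifier. You can configure TextBlock for some high-level settings, e.g., vectorizer for the type of text vectorization method to use. You can use 'sequence', which uses TextToIntSequence to convert the words to integers and Embedding to embed the integer sequences, or 'ngram', which uses TextToNgramVector to vectorize the sentences. You can also leave these arguments unspecified, in which case the different choices are tuned automatically. See the following example for details.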

import autokeras as ak

input_node = ak.TextInput()
output_node = ak.TextBlock(vectorizer='ngram')(input_node)
output_node = ak.ClassificationHead()(output_node)
clf = ak.AutoModel(inputs=input_node, outputs=output_node, max_trials=10)
clf.fit(x_train, y_train)
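
The difference between the 'sequence' and 'ngram' vectorizers can be illustrated with a toy sketch. This is only a conceptual illustration under simplifying assumptions, not how AutoKeras implements either option: `to_int_sequence`, `to_ngram_counts` and `toy_vocabulary` are hypothetical names, and a real n-gram vectorizer would typically emit a fixed-length weighted vector rather than raw counts.

```python
from collections import Counter


def to_int_sequence(text, vocabulary, unknown_id=0):
    """Map each word to its integer id, keeping word order."""
    return [vocabulary.get(word, unknown_id) for word in text.split()]


def to_ngram_counts(text, n=2):
    """Count the word n-grams in a sentence, discarding their positions."""
    words = text.split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(ngrams)


toy_vocabulary = {"this": 1, "film": 2, "was": 3, "good": 4}
sentence = "this film was good this film"

int_sequence = to_int_sequence(sentence, toy_vocabulary)
bigram_counts = to_ngram_counts(sentence, n=2)
print(int_sequence)                # [1, 2, 3, 4, 1, 2]
print(bigram_counts["this film"])  # 2
```

The integer sequence preserves order and is suited to an embedding followed by a sequence model, while the n-gram representation keeps only local word groupings and their frequencies.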

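Using AutoModel is similar to using the Keras functional API. Basically, you are building a graph whose edges are blocks and whose nodes are the intermediate outputs of blocks. You add an edge from input_node to output_node with output_node = ak.[some_block]([block_args])(input_node).

You can also use finer-grained blocks to customize the search space even further. See the following example.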
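
The "blocks are edges, intermediate outputs are nodes" picture can be mimicked in a few lines of plain Python. The `Node`, `Block` and `path_to` names below are hypothetical and exist only for this sketch; they say nothing about how AutoKeras represents its graph internally.

```python
class Node:
    """An intermediate output in the graph; remembers the block that produced it."""

    def __init__(self, produced_by=None, source=None):
        self.produced_by = produced_by  # the edge (block) leading into this node
        self.source = source            # the node that edge starts from


class Block:
    """An edge: calling it on a node returns a new node connected to the input."""

    def __init__(self, name, **block_args):
        self.name = name
        self.block_args = block_args

    def __call__(self, input_node):
        return Node(produced_by=self, source=input_node)


def path_to(node):
    """Walk back from a node to the graph input, listing block names in order."""
    names = []
    while node.produced_by is not None:
        names.append(node.produced_by.name)
        node = node.source
    return list(reversed(names))


input_node = Node()
output_node = Block("TextBlock", vectorizer="ngram")(input_node)
output_node = Block("ClassificationHead")(output_node)
block_path = path_to(output_node)
print(block_path)  # ['TextBlock', 'ClassificationHead']
```

Reassigning `output_node` at each step, as the tutorial code does, simply moves the name along the chain; the earlier nodes stay reachable through the `source` links.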

import autokeras as ak

input_node = ak.TextInput()
output_node = ak.TextToIntSequence()(input_node)
output_node = ak.Embedding()(output_node)
# Use separable Conv layers in Keras.
output_node = ak.ConvBlock(separable=True)(output_node)
output_node = ak.ClassificationHead()(output_node)
clf = ak.AutoModel(inputs=input_node, outputs=output_node, max_trials=10)
clf.fit(x_train, y_train)

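Data Format

The AutoKeras TextClassifier is quite flexible about the data format.

For the text, the input data should be one-dimensional. For the classification labels, AutoKeras accepts both plain labels, i.e. strings or integers, and one-hot encoded labels, i.e. vectors of 0s and 1s.

Using the tf.data.Dataset format for the training data is also supported. Because the labels are wrapped in a tensorflow Dataset, they must be one-hot encoded for multi-class classification. Since the IMDB dataset is a binary classification task, its labels should not be one-hot encoded.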
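
As a small illustration of the two label formats, the sketch below converts plain labels into one-hot vectors. `one_hot_encode` is a hypothetical helper written for this page, and ordering the classes alphabetically is an assumption of the sketch, not something the tutorial specifies.

```python
def one_hot_encode(labels):
    """Convert plain labels (strings or integers) into one-hot vectors."""
    # Assumption: classes are ordered by sorting the distinct label values.
    classes = sorted(set(labels))
    index_of = {label: i for i, label in enumerate(classes)}
    vectors = [
        [1 if index_of[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]
    return vectors, classes


plain_labels = ["neg", "pos", "neutral", "pos"]
one_hot_labels, class_order = one_hot_encode(plain_labels)
print(class_order)     # ['neg', 'neutral', 'pos']
print(one_hot_labels)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
```

Each row contains exactly one 1, placed at the position of that sample's class in `class_order`.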

import tensorflow as tf
train_set = tf.data.Dataset.from_tensor_slices(((x_train, ), (y_train, )))
test_set = tf.data.Dataset.from_tensor_slices(((x_test, ), (y_test, )))

clf = ak.TextClassifier(max_trials=10)
# Feed the tensorflow Dataset to the classifier.
clf.fit(train_set)
# Predict with the best model.
predicted_y = clf.predict(test_set)
# Evaluate the best model with testing data.
print(clf.evaluate(test_set))

End
