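TensorFlow (Hub) : Tutorials : ML at production scale : Building a text classifier with TF-Hub (translation / commentary)

Translation : ClassCat Co., Ltd. Sales Information
Last updated : 07/16/2018 (v1.9)
Created : 04/06/2018

* Adjusted because the documentation structure changed in TensorFlow 1.9.
* This page is a translation, with supplementary explanations where appropriate, of Tutorials : ML at production scale : How to build a simple text classifier with TF-Hub
from the official TensorFlow site: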

https://www.tensorflow.org/hub/tutorials/text_classification_with_tf_hub

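Overview

TF-Hub is a platform for sharing machine learning expertise packaged in reusable resources, notably pre-trained modules. This tutorial is organized into two main parts.

Introduction : Training a text classifier with TF-Hub

We use a TF-Hub text embedding module to train a simple sentiment classifier with a reasonable baseline accuracy. We then analyze the predictions to make sure the model is reasonable and propose improvements to increase the accuracy.

Advanced : Transfer learning analysis

In this section, we use various TF-Hub modules to compare their effect on the accuracy of the estimator and to demonstrate the advantages and pitfalls of transfer learning.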

Getting started

Data

We will try to solve the Large Movie Review Dataset v1.0 task from Maas et al. The dataset consists of IMDB movie reviews labeled by positivity from 1 to 10. The task is to label each review as negative or positive.

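# Imports used throughout this tutorial (TensorFlow 1.x APIs).
import os
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import tensorflow_hub as hub

# Load all files from a directory in a DataFrame.
def load_directory_data(directory):
  data = {}
  data["sentence"] = []
  data["sentiment"] = []
  for file_path in os.listdir(directory):
    with tf.gfile.GFile(os.path.join(directory, file_path), "r") as f:
      data["sentence"].append(f.read())
      # File names look like "<id>_<rating>.txt"; keep the rating part.
      data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
  return pd.DataFrame.from_dict(data)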

# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
  pos_df = load_directory_data(os.path.join(directory, "pos"))
  neg_df = load_directory_data(os.path.join(directory, "neg"))
  pos_df["polarity"] = 1
  neg_df["polarity"] = 0
  return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)

# Download and process the dataset files.
def download_and_load_datasets(force_download=False):
  dataset = tf.keras.utils.get_file(
      fname="aclImdb.tar.gz", 
      origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
      extract=True)

  train_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                       "aclImdb", "train"))
  test_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                      "aclImdb", "test"))

  return train_df, test_df

# Reduce logging output.
tf.logging.set_verbosity(tf.logging.ERROR)

train_df, test_df = download_and_load_datasets()
train_df.head()

    Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    84131840/84125825 [==============================] - 1s 0us/step
    84140032/84125825 [==============================] - 1s 0us/step

	sentence	sentiment	polarity
0	I just rented this today….heard lots of good…	1	0
1	Outrage is pretty good movie! Robert Culp was …	10	1
2	OK, as everyone has pointed out, this film is …	3	0
3	I am a current A.S.L. Student & was forced to …	4	0
4	Redundant, but again the case. If you enjoy th…	2	0
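
The sentiment column above comes from the review file names, which follow the pattern <id>_<rating>.txt. The following is a minimal sketch of that parsing step in isolation. The helper name parse_rating and the sample file names are illustrative assumptions, not part of the tutorial. Note that the tutorial code stores the captured group as a string, whereas this sketch converts it to an integer.

```python
import re

# Review files are named "<id>_<rating>.txt", for example "0_9.txt".
RATING_PATTERN = re.compile(r"\d+_(\d+)\.txt")


def parse_rating(file_name):
    """Return the integer rating encoded in a review file name (hypothetical helper)."""
    match = RATING_PATTERN.match(file_name)
    if match is None:
        raise ValueError("unexpected file name: {!r}".format(file_name))
    return int(match.group(1))


# Sample file names made up for illustration.
for name in ["0_9.txt", "127_3.txt", "4715_10.txt"]:
    print(name, "->", parse_rating(name))
```

Raising on an unexpected name is a design choice of this sketch; the tutorial code would instead fail with an AttributeError when re.match returns None.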

Model

Input functions

The Estimator framework provides input functions that wrap Pandas dataframes.

# Training input on the whole training set with no limit on training epochs.
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    train_df, train_df["polarity"], num_epochs=None, shuffle=True)

# Prediction on the whole training set.
predict_train_input_fn = tf.estimator.inputs.pandas_input_fn(
    train_df, train_df["polarity"], shuffle=False)
# Prediction on the test set.
predict_test_input_fn = tf.estimator.inputs.pandas_input_fn(
    test_df, test_df["polarity"], shuffle=False)

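Feature columns

TF-Hub provides a feature column that applies a module to the given text feature and passes the module's outputs on to the model. This tutorial uses the nnlm-en-dim128 module. For the purposes of this tutorial, the most important facts are:

The module takes a batch of sentences in a 1-D tensor of strings as input.
The module is responsible for preprocessing the sentences (e.g. removing punctuation and splitting on spaces).
The module works with any input (e.g. nnlm-en-dim128 hashes words not present in the vocabulary into ~20,000 buckets).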

embedded_text_feature_column = hub.text_embedding_column(
    key="sentence", 
    module_spec="https://tfhub.dev/google/nnlm-en-dim128/1")
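
To make the preprocessing and out-of-vocabulary facts above more concrete, here is a rough pure-Python sketch of the idea. It is not the module's actual implementation: the tiny vocabulary, the CRC32 hash, the id layout, and the helper names tokenize and token_to_id are all assumptions chosen for illustration.

```python
import string
import zlib

# Hypothetical tiny vocabulary; the real module ships its own, much larger one.
VOCABULARY = {"this": 0, "movie": 1, "was": 2, "great": 3}
# Order of magnitude quoted in the text for nnlm-en-dim128.
NUM_OOV_BUCKETS = 20000


def tokenize(sentence):
    """Strip ASCII punctuation and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.translate(table).split()


def token_to_id(token):
    """Known words keep their vocabulary id; unknown words get a hashed bucket id."""
    if token in VOCABULARY:
        return VOCABULARY[token]
    # Assumption: a deterministic hash places the unknown word in one bucket.
    bucket = zlib.crc32(token.encode("utf-8")) % NUM_OOV_BUCKETS
    return len(VOCABULARY) + bucket


sentence = "this movie was great, truly great!"
tokens = tokenize(sentence)
ids = [token_to_id(token) for token in tokens]
print(tokens)
print(ids)
```

Because unknown words are hashed rather than rejected, any input string yields a valid sequence of ids, which is why the module can work with arbitrary text.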

Estimator

For classification we can use a DNN Classifier.

estimator = tf.estimator.DNNClassifier(
    hidden_units=[500, 100],
    feature_columns=[embedded_text_feature_column],
    n_classes=2,
    optimizer=tf.train.AdagradOptimizer(learning_rate=0.003))

Training

Train the estimator for a reasonable number of steps.

# Training for 1,000 steps means 128,000 training examples with the default
# batch size. This is roughly equivalent to 5 epochs since the training dataset
# contains 25,000 examples.
estimator.train(input_fn=train_input_fn, steps=1000);
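
The arithmetic in the comment above can be checked with a short calculation. This is only a sketch: the batch size of 128 is taken from the comment's statement about the default batch size, and the variable names are illustrative.

```python
# Values taken from the tutorial: 1,000 steps over 25,000 training examples.
steps = 1000
num_training_examples = 25000
# Assumption: 128 is the default batch size, as the tutorial's comment implies.
batch_size = 128

# Each step consumes one batch of examples.
examples_seen = steps * batch_size
epochs = examples_seen / float(num_training_examples)

print("examples seen:", examples_seen)
print("approximate epochs:", epochs)
```

If you change the batch size or the number of steps, the same calculation tells you how many passes over the training data the run corresponds to.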

Prediction

Run predictions for both the training set and the test set.

train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)

print("Training set accuracy: {accuracy}".format(**train_eval_result))
print("Test set accuracy: {accuracy}".format(**test_eval_result))

    Training set accuracy: 0.802160024643
    Test set accuracy: 0.792879998684

Confusion matrix

We can visually check the confusion matrix to understand the distribution of misclassifications.

def get_predictions(estimator, input_fn):
  return [x["class_ids"][0] for x in estimator.predict(input_fn=input_fn)]

LABELS = [
    "negative", "positive"
]

# Create a confusion matrix on training data.
with tf.Graph().as_default():
  cm = tf.confusion_matrix(train_df["polarity"], 
                           get_predictions(estimator, predict_train_input_fn))
  with tf.Session() as session:
    cm_out = session.run(cm)

# Normalize the confusion matrix so that each row sums to 1.
cm_out = cm_out.astype(float) / cm_out.sum(axis=1)[:, np.newaxis]

sns.heatmap(cm_out, annot=True, xticklabels=LABELS, yticklabels=LABELS);
plt.xlabel("Predicted");
plt.ylabel("True");
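
For readers who want to see what the confusion matrix and its row normalization compute without TensorFlow, here is a small pure-Python sketch. The labels are made up for illustration, and the helper names confusion_matrix and normalize_rows are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes=2):
    """Count examples for each (true class, predicted class) pair."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def normalize_rows(matrix):
    """Scale each row to sum to 1 so that cells read as per-class rates."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / float(total) if total else 0.0 for count in row])
    return normalized


# Hypothetical labels, for illustration only (0 = negative, 1 = positive).
true_labels = [0, 0, 0, 0, 1, 1, 1, 1]
predicted_labels = [0, 0, 0, 1, 1, 1, 0, 0]

counts = confusion_matrix(true_labels, predicted_labels)
rates = normalize_rows(counts)
print(counts)
print(rates)
```

Rows correspond to the true class and columns to the predicted class, matching the axis labels of the heatmap above. After normalization the diagonal holds the per-class recall.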

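Further improvements

Regression on sentiment : We used a classifier to assign each example to a polarity class. But we actually have another categorical feature at our disposal: sentiment. Here the classes represent a scale, and the underlying value (positive/negative) could be mapped well onto a continuous range. We could exploit this property by computing a regression (DNN Regressor) instead of a classification (DNN Classifier).
Larger module : For this tutorial we used a small module to limit memory use. There are modules with larger vocabularies and larger embedding spaces that could give additional accuracy points.
Parameter tuning : We can improve the accuracy by tuning meta-parameters such as the learning rate or the number of steps, especially when using a different module. A validation set is very important for getting reasonable results, because it is very easy to set up a model that learns to predict the training data without generalizing well to the test set.
More complex model : We used a module that computes a sentence embedding by embedding each individual word and then combining the word embeddings by averaging. One could also use a sequential module (e.g. the Universal Sentence Encoder module) to better capture the nature of sentences, or an ensemble of two or more TF-Hub modules.
Regularization : To prevent overfitting, we could try an optimizer that performs some form of regularization, for example the Proximal Adagrad Optimizer.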
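
As a small illustration of the validation set mentioned under parameter tuning, the sketch below holds out part of the training data. It works on a plain list of row indices rather than the tutorial's DataFrame, and the function name split_train_validation, the 10% fraction, and the seed are assumptions made for this example.

```python
import random


def split_train_validation(rows, validation_fraction=0.1, seed=0):
    """Shuffle a copy of rows and hold out a fraction for validation."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    num_validation = int(len(shuffled) * validation_fraction)
    return shuffled[num_validation:], shuffled[:num_validation]


# Stand-in for the 25,000 training examples: one index per row.
rows = list(range(25000))
train_rows, validation_rows = split_train_validation(rows)
print(len(train_rows), len(validation_rows))
```

Meta-parameters would then be compared on the held-out rows, leaving the test set untouched until the final evaluation.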

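Advanced : Transfer learning analysis

Transfer learning makes it possible to save training resources and to achieve good model generalization even when training on a small dataset. In this part, we demonstrate this by training with two different TF-Hub modules:

nnlm-en-dim128 : a pre-trained text embedding module,
random-nnlm-en-dim128 : a text embedding module that has the same vocabulary and network as nnlm-en-dim128, but whose weights were only randomly initialized and never trained on real data.

And by training in two modes:

training only the classifier (i.e. freezing the module), and
training the classifier together with the module.

Let's run a few training and evaluation rounds to see how using the various modules affects the accuracy.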

def train_and_evaluate_with_module(hub_module, train_module=False):
  embedded_text_feature_column = hub.text_embedding_column(
      key="sentence", module_spec=hub_module, trainable=train_module)

  estimator = tf.estimator.DNNClassifier(
      hidden_units=[500, 100],
      feature_columns=[embedded_text_feature_column],
      n_classes=2,
      optimizer=tf.train.AdagradOptimizer(learning_rate=0.003))

  estimator.train(input_fn=train_input_fn, steps=1000)

  train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
  test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)

  training_set_accuracy = train_eval_result["accuracy"]
  test_set_accuracy = test_eval_result["accuracy"]

  return {
      "Training accuracy": training_set_accuracy,
      "Test accuracy": test_set_accuracy
  }

results = {}
results["nnlm-en-dim128"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/nnlm-en-dim128/1")
results["nnlm-en-dim128-with-module-training"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/nnlm-en-dim128/1", True)
results["random-nnlm-en-dim128"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/random-nnlm-en-dim128/1")
results["random-nnlm-en-dim128-with-module-training"] = train_and_evaluate_with_module(
    "https://tfhub.dev/google/random-nnlm-en-dim128/1", True)

Let's look at the results.

pd.DataFrame.from_dict(results, orient="index")

	Training accuracy	Test accuracy
nnlm-en-dim128	0.80176	0.79324
nnlm-en-dim128-with-module-training	0.94912	0.86996
random-nnlm-en-dim128	0.72244	0.67456
random-nnlm-en-dim128-with-module-training	0.76584	0.72180

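We can already see some patterns, but first we should establish the baseline accuracy of the test set, that is, the lower bound that can be achieved by outputting only the label of the most represented class: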

estimator.evaluate(input_fn=predict_test_input_fn)["accuracy_baseline"]

0.5
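
The same baseline can be reproduced by hand. The sketch below assumes a balanced test split of 12,500 negative and 12,500 positive reviews, which is consistent with the 0.5 reported above; the helper name majority_class_baseline is hypothetical.

```python
from collections import Counter


def majority_class_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / float(len(labels))


# Assumption: the test split is balanced between the two polarity classes.
test_labels = [0] * 12500 + [1] * 12500
print(majority_class_baseline(test_labels))
```

On an imbalanced dataset this baseline would be higher than 0.5, so it is worth computing before interpreting any accuracy figure.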

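Assigning the most represented class gives an accuracy of 50%. There are a couple of things to notice here:

Perhaps surprisingly, a model can still be learned on top of fixed, random embeddings. The reason is that even if every word in the dictionary is mapped to a random vector, the estimator can separate the space purely with its fully connected layers.
Allowing the module with random embeddings to be trained increases both training and test accuracy, compared with training the classifier alone.
Training the module with pre-trained embeddings also increases both accuracies. Note, however, the overfitting on the training set. Training a pre-trained module can be risky even with regularization, in the sense that the embedding weights no longer represent a language model trained on diverse data; instead they converge to the ideal representation of the new dataset.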
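
The overfitting noted above can be quantified from the reported accuracies as the gap between training and test accuracy. The numbers below are copied from the results table in this tutorial; the variable names are illustrative.

```python
# (training accuracy, test accuracy) copied from the results table above.
results = {
    "nnlm-en-dim128": (0.80176, 0.79324),
    "nnlm-en-dim128-with-module-training": (0.94912, 0.86996),
    "random-nnlm-en-dim128": (0.72244, 0.67456),
    "random-nnlm-en-dim128-with-module-training": (0.76584, 0.72180),
}

# A larger gap means the model fits the training set better than it generalizes.
gaps = {
    name: train_accuracy - test_accuracy
    for name, (train_accuracy, test_accuracy) in results.items()
}

for name in sorted(gaps, key=gaps.get, reverse=True):
    print("{:<45} gap = {:.5f}".format(name, gaps[name]))
```

The pre-trained module with module training has both the highest test accuracy and the largest gap, which matches the warning about overfitting in the last point.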
