TensorFlow 2.0 Beta : Beginner Tutorials : ML basics : Classify structured data (translation and commentary)

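Translation: ClassCat Co., Ltd. Sales Information
Created: 06/22/2019

* This page is a translation, with supplementary explanation where appropriate, of the following page from TF 2.0 Tutorials : Beginner Tutorials : ML basics on the official TensorFlow site: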

Classify structured data

* The sample code has been verified to run, but has been extended or modified where necessary.
* Feel free to link to this page; we would appreciate a note at sales-info@classcat.com.

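ML basics : Classify structured data

This tutorial demonstrates how to classify structured data (e.g. tabular data in a CSV). We use Keras to define the model, and feature columns as a bridge to map from the columns in the CSV to the features used to train the model. This tutorial contains complete code to:

Load a CSV file using Pandas.
Build an input pipeline to batch and shuffle the rows using tf.data.
Map from columns in the CSV to features used to train the model using feature columns.
Build, train, and evaluate a model using Keras.

The dataset

We use a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The CSV has a few hundred rows. Each row describes a patient, and each column describes an attribute. We use this information to predict whether a patient has heart disease, which is a binary classification task on this dataset.

The following is a description of this dataset. Notice that there are both numeric and categorical columns.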

Column	Description	Feature type	Data type
Age	Age in years	Numerical	integer
Sex	(1 = male; 0 = female)	Categorical	integer
CP	Chest pain type (0, 1, 2, 3, 4)	Categorical	integer
Trestbps	Resting blood pressure (in mm Hg on admission to the hospital)	Numerical	integer
Chol	Serum cholesterol in mg/dl	Numerical	integer
FBS	(Fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)	Categorical	integer
RestECG	Resting electrocardiographic results (0, 1, 2)	Categorical	integer
Thalach	Maximum heart rate achieved	Numerical	integer
Exang	Exercise-induced angina (1 = yes; 0 = no)	Categorical	integer
Oldpeak	ST depression induced by exercise relative to rest	Numerical	float
Slope	Slope of the peak exercise ST segment	Numerical	integer
CA	Number of major vessels (0-3) colored by fluoroscopy	Numerical	integer
Thal	3 = normal; 6 = fixed defect; 7 = reversible defect	Categorical	string
Target	Diagnosis of heart disease (1 = true; 0 = false)	Classification	integer

Import TensorFlow and other libraries

!pip install -q scikit-learn

from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd

!pip install -q tensorflow==2.0.0-beta1
import tensorflow as tf

from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

Use Pandas to create a dataframe

Pandas is a Python library with many helpful utilities for loading and working with structured data. We use Pandas to download the dataset from a URL and load it into a dataframe.

URL = 'https://storage.googleapis.com/applied-dl/heart.csv'
dataframe = pd.read_csv(URL)
dataframe.head()

	age	sex	cp	trestbps	chol	fbs	restecg	thalach	exang	oldpeak	slope	ca	thal	target
0	63	1	1	145	233	1	2	150	0	0.3	3	0	fixed	0
1	67	1	4	160	286	0	2	108	1	1.5	2	3	normal	1
2	67	1	4	120	229	0	2	129	1	2.6	2	2	reversible	0
3	37	1	3	130	250	0	0	187	0	3.5	3	0	normal	0
4	41	0	2	130	204	0	2	172	0	1.4	1	0	normal	0

Split the dataframe into train, validation, and test

The dataset we downloaded was a single CSV file. We split it into train, validation, and test sets.

train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

193 train examples
49 validation examples
61 test examples
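
The sizes printed above follow from applying test_size=0.2 twice. The following is a minimal sketch of that arithmetic, not the library's implementation. It assumes the CSV has 303 rows (the sum of the three printed counts) and that the held-out fraction is rounded up, with the remainder going to the training side. The helper name split_sizes is hypothetical.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test) for a single fractional split.

    Assumption: the held-out part is rounded up and the
    remainder goes to the training part.
    """
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test

n_total = 193 + 49 + 61                      # 303 rows, from the printed counts
n_rest, n_test = split_sizes(n_total, 0.2)   # first split: train+val vs. test
n_train, n_val = split_sizes(n_rest, 0.2)    # second split: train vs. val

print(n_train, 'train examples')
print(n_val, 'validation examples')
print(n_test, 'test examples')
```

Because the second split is taken from the remaining 80% of the rows, the validation set ends up at roughly 16% of the full dataset rather than 20%.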

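Create an input pipeline using tf.data

Next, we wrap the dataframes with tf.data. This lets us use feature columns as a bridge to map from the columns in the Pandas dataframe to the features used to train the model. If we were working with a very large CSV file (so large that it does not fit into memory), we would use tf.data to read it from disk directly. That is not covered in this tutorial.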

# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('target')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

batch_size = 5 # A small batch size is used for demonstration purposes
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

Understand the input pipeline

Now that we have created the input pipeline, let's call it to see the format of the data it returns. We have used a small batch size to keep the output readable.

for feature_batch, label_batch in train_ds.take(1):
  print('Every feature:', list(feature_batch.keys()))
  print('A batch of ages:', feature_batch['age'])
  print('A batch of targets:', label_batch )

Every feature: ['fbs', 'ca', 'slope', 'trestbps', 'thalach', 'restecg', 'oldpeak', 'cp', 'age', 'exang', 'thal', 'sex', 'chol']
A batch of ages: tf.Tensor([61 40 67 50 60], shape=(5,), dtype=int32)
A batch of targets: tf.Tensor([0 0 0 0 0], shape=(5,), dtype=int32)

We can see that the dataset returns a dictionary of column names (from the dataframe) that map to column values from rows in the dataframe.
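
To make the shape of these batches concrete, here is a plain-Python sketch of what df_to_dataset does conceptually: pop the label column, optionally shuffle the row order, then emit (features dictionary, labels) pairs batch by batch. It is a stand-in for illustration only and does not use tf.data. The function name batches_from_columns and the seed parameter are hypothetical, and the toy data is the first five rows of the dataframe shown earlier.

```python
import random

def batches_from_columns(columns, label_name, batch_size, shuffle=True, seed=0):
    """Yield (features, labels) batches from a dict of equal-length columns.

    A plain-Python stand-in for from_tensor_slices + shuffle + batch.
    """
    columns = dict(columns)                 # copy so the caller's dict is untouched
    labels = columns.pop(label_name)
    order = list(range(len(labels)))
    if shuffle:
        random.Random(seed).shuffle(order)  # shuffle row indices, not the data
    for start in range(0, len(order), batch_size):
        rows = order[start:start + batch_size]
        features = {name: [values[i] for i in rows]
                    for name, values in columns.items()}
        yield features, [labels[i] for i in rows]

# First five rows of the dataframe shown earlier (three columns only).
toy = {
    'age': [63, 67, 67, 37, 41],
    'thal': ['fixed', 'normal', 'reversible', 'normal', 'normal'],
    'target': [0, 1, 0, 0, 0],
}

for feature_batch, label_batch in batches_from_columns(toy, 'target', batch_size=2):
    print(sorted(feature_batch.keys()), feature_batch['age'], label_batch)
```

Each batch is a dictionary keyed by column name, with the label column removed, which is the structure that feature columns expect as input.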

Demonstrate several types of feature column

TensorFlow provides many types of feature columns. In this section, we create several types of feature columns and demonstrate how they transform a column from the dataframe.

# We will use this batch to demonstrate several types of feature columns
example_batch = next(iter(train_ds))[0]

# A utility method to create a feature column
# and to transform a batch of data
def demo(feature_column):
  feature_layer = layers.DenseFeatures(feature_column)
  print(feature_layer(example_batch).numpy())

Numeric columns

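The output of a feature column becomes the input to the model (using the demo function defined above, we can see exactly how each column from the dataframe is transformed). A numeric column is the simplest type of column. It is used to represent real-valued features. When using this column, your model receives the column value from the dataframe unchanged.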

age = feature_column.numeric_column("age")
demo(age)

[[61.]
 [40.]
 [67.]
 [50.]
 [60.]]

In the heart disease dataset, most columns from the dataframe are numeric.

Bucketized columns

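Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. Consider raw data that represents a person's age. Instead of representing age as a numeric column, we could split it into several buckets using a bucketized column. Notice that the one-hot values below describe which age range each row matches.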

age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
demo(age_buckets)

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]
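
The one-hot rows above can be reproduced by hand. The sketch below uses only the standard library and assumes that each bucket includes its lower boundary, so 40 falls in the bucket starting at 40. The helper names bucket_index and one_hot are hypothetical, and the ages are the batch printed earlier.

```python
from bisect import bisect_right

BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def bucket_index(value, boundaries):
    """Index of the bucket containing value; a bucket includes its lower boundary."""
    return bisect_right(boundaries, value)

def one_hot(index, depth):
    """Vector of length depth with a single 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

ages = [61, 40, 67, 50, 60]      # the batch of ages shown above
depth = len(BOUNDARIES) + 1      # 10 boundaries give 11 buckets
for age in ages:
    print(age, one_hot(bucket_index(age, BOUNDARIES), depth))
```

Ten boundaries produce eleven buckets because values below 18 and values of 65 or more each get a bucket of their own.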

Categorical columns

In this dataset, thal is represented as a string (e.g. 'fixed', 'normal', or 'reversible'). We cannot feed strings directly to a model. Instead, we must first map them to numeric values. The categorical vocabulary columns provide a way to represent strings as a one-hot vector (much like the age buckets seen above). The vocabulary can be passed as a list using categorical_column_with_vocabulary_list, or loaded from a file using categorical_column_with_vocabulary_file.

thal = feature_column.categorical_column_with_vocabulary_list(
      'thal', ['fixed', 'normal', 'reversible'])

thal_one_hot = feature_column.indicator_column(thal)
demo(thal_one_hot)

WARNING: Logging before flag parsing goes to stderr.
W0614 17:35:53.184868 140449077593856 deprecation.py:323] From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column_v2.py:2655: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0614 17:35:53.188472 140449077593856 deprecation.py:323] From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4215: IndicatorColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
W0614 17:35:53.189301 140449077593856 deprecation.py:323] From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4270: VocabularyListCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.

[[0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]]

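In a more complex dataset, many columns would be categorical (e.g. strings). Feature columns are most valuable when working with categorical data. Although there is only one categorical column in this dataset, we use it to demonstrate several important types of feature columns that you could use when working with other datasets.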
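
As a rough illustration of the vocabulary lookup, the following standard-library sketch encodes a thal string by its position in the vocabulary list. It is not TensorFlow code. The helper name vocabulary_one_hot is hypothetical, and the handling of a value that is missing from the vocabulary (an all-zero row) is an assumption made for this sketch.

```python
VOCABULARY = ['fixed', 'normal', 'reversible']

def vocabulary_one_hot(value, vocabulary):
    """One-hot encode value by its position in vocabulary.

    Assumption: a value missing from the vocabulary becomes an all-zero row.
    """
    return [1.0 if value == entry else 0.0 for entry in vocabulary]

for thal_value in ['fixed', 'normal', 'reversible', 'unknown']:
    print(thal_value, vocabulary_one_hot(thal_value, VOCABULARY))
```

With this vocabulary order, 'reversible' maps to the third position, which is the pattern seen in every row of the one-hot output above.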

Embedding columns

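Suppose that instead of having just a few possible strings, we have thousands (or more) of values per category. For a number of reasons, as the number of categories grows large, it becomes infeasible to train a neural network using one-hot encodings. We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, dense vector in which each cell can contain any number, not just 0 or 1. The size of the embedding (8, in the example below) is a parameter that must be tuned.

Key point: using an embedding column is best when a categorical column has many possible values. We are using one here for demonstration purposes, so that you have a complete example you can modify for a different dataset in the future.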

# Notice the input to the embedding column is the categorical column
# we previously created
thal_embedding = feature_column.embedding_column(thal, dimension=8)
demo(thal_embedding)

[[ 0.13279669  0.19413401 -0.69587415 -0.6805197   0.3184564   0.45431668
  -0.13196784 -0.57410216]
 [ 0.13279669  0.19413401 -0.69587415 -0.6805197   0.3184564   0.45431668
  -0.13196784 -0.57410216]
 [ 0.13279669  0.19413401 -0.69587415 -0.6805197   0.3184564   0.45431668
  -0.13196784 -0.57410216]
 [ 0.13279669  0.19413401 -0.69587415 -0.6805197   0.3184564   0.45431668
  -0.13196784 -0.57410216]
 [ 0.13279669  0.19413401 -0.69587415 -0.6805197   0.3184564   0.45431668
  -0.13196784 -0.57410216]]
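
Conceptually, an embedding column is a lookup table with one row of numbers per category. The sketch below is a hypothetical, untrained stand-in written with the standard library: the table values are drawn at random once (the seed and spread are arbitrary choices for this sketch), whereas in a real model they are weights updated during training.

```python
import random

VOCABULARY = ['fixed', 'normal', 'reversible']
DIMENSION = 8

# Hypothetical embedding table: one row of DIMENSION floats per category.
# In a real model these values are trainable weights; here they are
# drawn once at random and left untrained.
rng = random.Random(0)
embedding_table = [[rng.gauss(0.0, 0.35) for _ in range(DIMENSION)]
                   for _ in VOCABULARY]

def embed(value):
    """Look up the dense vector for a category string."""
    return embedding_table[VOCABULARY.index(value)]

batch = ['reversible'] * 5
vectors = [embed(v) for v in batch]
print(len(vectors), 'rows of', len(vectors[0]), 'values each')
print(all(v == vectors[0] for v in vectors))   # same category, same vector
```

This also suggests why the five rows printed above are identical: rows that share a category share a single table row, so they receive the same 8 values.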

Hashed feature columns

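Another way to represent a categorical column with a large number of values is to use categorical_column_with_hash_bucket. This feature column computes a hash value of the input, then selects one of the hash_bucket_size buckets to encode a string. When using this column, you do not need to provide the vocabulary, and you can choose to make the number of hash buckets significantly smaller than the number of actual categories to save space.

Key point: an important downside of this technique is that there may be collisions in which different strings are mapped to the same bucket. In practice, this can work well for some datasets regardless.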

thal_hashed = feature_column.categorical_column_with_hash_bucket(
      'thal', hash_bucket_size=1000)
demo(feature_column.indicator_column(thal_hashed))

W0614 17:35:53.226184 140449077593856 deprecation.py:323] From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4270: HashedCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
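
The idea behind hash buckets, including the collision risk, can be sketched without TensorFlow. In the example below, md5 is only a stand-in: TensorFlow uses its own hash function, so the actual bucket ids will differ. The helper name hash_bucket is hypothetical.

```python
import hashlib

def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket id in [0, hash_bucket_size).

    md5 is only a stand-in here; TensorFlow uses its own hash function,
    so actual bucket ids will differ.
    """
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

values = ['fixed', 'normal', 'reversible']
for size in (1000, 2):
    ids = [hash_bucket(v, size) for v in values]
    collided = len(set(ids)) < len(ids)
    print(size, ids, 'collision' if collided else 'no collision')
```

With only 2 buckets for 3 distinct strings, at least two strings must share a bucket. This is the collision trade-off you accept when the bucket count is smaller than the number of categories.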

Crossed feature columns

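Combining features into a single feature, better known as a feature cross, enables a model to learn separate weights for each combination of features. Here, we create a new feature that is the cross of age and thal. Note that crossed_column does not build the full table of all possible combinations (which could be very large). Instead, it is backed by a hashed_column, so you can choose how large the table is.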

crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000)
demo(feature_column.indicator_column(crossed_feature))

W0614 17:35:53.243324 140449077593856 deprecation.py:323] From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column_v2.py:4270: CrossedColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
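
A feature cross backed by hashing can be sketched as follows: bucketize the age, combine the bucket with the thal string into one key, and hash that key into a fixed number of slots. This is an illustration under stated assumptions, not TensorFlow's implementation. The key format and the md5 hash are hypothetical stand-ins, so the ids will not match those produced by crossed_column.

```python
import hashlib
from bisect import bisect_right

BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def crossed_bucket(age, thal, hash_bucket_size=1000):
    """Hash the (age bucket, thal) pair into one of hash_bucket_size ids.

    md5 is a stand-in; TensorFlow's own hashing gives different ids.
    """
    age_bucket = bisect_right(BOUNDARIES, age)   # same bucketing as age_buckets
    key = '{}_X_{}'.format(age_bucket, thal)     # hypothetical key format
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

print(crossed_bucket(61, 'reversible'))
print(crossed_bucket(63, 'reversible'))   # same age bucket, so same id
print(crossed_bucket(61, 'normal'))       # a different combination
```

Ages 61 and 63 fall in the same age bucket, so with the same thal value they form the same combination and share one id. The model can then learn a single weight for that combination.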

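Choose which columns to use

We have seen how to use several types of feature columns. Now we use them to train a model. The goal of this tutorial is to show the complete code (e.g. the mechanics) needed to work with feature columns. We have selected a few columns to train the model below arbitrarily.

Key point: if your aim is to build an accurate model, try a larger dataset of your own, and think carefully about which features are the most meaningful to include and how they should be represented.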

feature_columns = []

# numeric cols
for header in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']:
  feature_columns.append(feature_column.numeric_column(header))

# bucketized cols
age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)

# indicator cols
thal = feature_column.categorical_column_with_vocabulary_list(
      'thal', ['fixed', 'normal', 'reversible'])
thal_one_hot = feature_column.indicator_column(thal)
feature_columns.append(thal_one_hot)

# embedding cols
thal_embedding = feature_column.embedding_column(thal, dimension=8)
feature_columns.append(thal_embedding)

# crossed cols
crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000)
crossed_feature = feature_column.indicator_column(crossed_feature)
feature_columns.append(crossed_feature)

Create a feature layer

Now that we have defined our feature columns, we use a DenseFeatures layer to input them to the Keras model.

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

Earlier, we used a small batch size to demonstrate how feature columns worked. We create a new input pipeline with a larger batch size.

batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

Create, compile, and train the model

model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(128, activation='relu'),
  layers.Dense(128, activation='relu'),
  layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'],
              run_eagerly=True)

model.fit(train_ds,
          validation_data=val_ds,
          epochs=5)

Epoch 1/5
7/7 [==============================] - 1s 142ms/step - loss: 2.5852 - accuracy: 0.5992 - val_loss: 1.5481 - val_accuracy: 0.6735
Epoch 2/5
7/7 [==============================] - 0s 30ms/step - loss: 1.4630 - accuracy: 0.5475 - val_loss: 0.8428 - val_accuracy: 0.6735
Epoch 3/5
7/7 [==============================] - 0s 31ms/step - loss: 0.6788 - accuracy: 0.7359 - val_loss: 0.8275 - val_accuracy: 0.6531
Epoch 4/5
7/7 [==============================] - 0s 30ms/step - loss: 0.8789 - accuracy: 0.6067 - val_loss: 0.7656 - val_accuracy: 0.6327
Epoch 5/5
7/7 [==============================] - 0s 32ms/step - loss: 0.6756 - accuracy: 0.6843 - val_loss: 0.7049 - val_accuracy: 0.6735

<tensorflow.python.keras.callbacks.History at 0x7fbccd338668>

loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)

2/2 [==============================] - 0s 18ms/step - loss: 0.4494 - accuracy: 0.7869
Accuracy 0.78688526
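
The step counts in the logs above (7/7 during training, 2/2 during evaluation) follow from the dataset sizes and the batch size of 32. A small sketch of that arithmetic, assuming the last batch is allowed to be partial; the helper name steps_per_epoch is hypothetical:

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of batches needed to cover n_examples once (last batch may be partial)."""
    return math.ceil(n_examples / batch_size)

train_steps = steps_per_epoch(193, 32)   # 193 training examples
test_steps = steps_per_epoch(61, 32)     # 61 test examples
print(train_steps, test_steps)

# The reported test accuracy is consistent with 48 of 61 correct predictions.
print(48 / 61)
```

With a test set this small, a single prediction changes the accuracy by more than 1.6 percentage points, so the reported figure should be read as a rough indication only.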

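Key point: you will typically see the best results with deep learning on much larger and more complex datasets. When working with a small dataset like this one, we recommend using a decision tree or random forest as a strong baseline. The goal of this tutorial is not to train an accurate model but to demonstrate the mechanics of working with structured data, so that you have code to use as a starting point when working with your own datasets in the future.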

That's all.
