This tutorial shows how to load a pandas DataFrame into a tf.data.Dataset.
It uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The dataset (a CSV file) contains a few hundred rows. Each row describes a patient, and each column describes an attribute.
This data can be used to predict whether a patient has heart disease, which is a binary classification problem.
Read data using pandas
import pandas as pd
import tensorflow as tf
Download the CSV file containing the heart dataset.
csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')
Downloading data from https://storage.googleapis.com/applied-dl/heart.csv 16384/13273 [=====================================] - 0s 0us/step
Read the CSV file using pandas.
df = pd.read_csv(csv_file)
df.head()
df.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object
Convert the thal column, the only column in the DataFrame with the object dtype, to discrete integer codes.
df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
df.head()
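To make the conversion above concrete, the following sketch applies the same two steps to a small, made-up column. The name `toy` and the values in it are assumptions chosen for illustration; they are not a listing of what the real thal column contains.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
toy = pd.DataFrame({'thal': ['normal', 'fixed', 'reversible', 'normal', None]})

# pd.Categorical infers the categories from the unique values and sorts them.
toy['thal'] = pd.Categorical(toy['thal'])
categories = list(toy['thal'].cat.categories)

# .cat.codes replaces each value with the integer index of its category.
# Missing values are encoded as -1.
codes = toy['thal'].cat.codes.tolist()

print(categories)  # ['fixed', 'normal', 'reversible']
print(codes)       # [1, 0, 2, 1, -1]
```

The codes are plain integers, so a model that receives them directly will treat them as ordered numeric values.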
Load data using tf.data.Dataset
Use the tf.data.Dataset.from_tensor_slices method to read the values from a pandas DataFrame.
One advantage of tf.data.Dataset is that it lets you build data pipelines that are both simple and highly efficient. See the loading data guide for details.
target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
for feat, targ in dataset.take(5):
    print('Features: {}, Target: {}'.format(feat, targ))
Features: [ 63. 1. 1. 145. 233. 1. 2. 150. 0. 2.3 3. 0. 2. ], Target: 0
Features: [ 67. 1. 4. 160. 286. 0. 2. 108. 1. 1.5 2. 3. 3. ], Target: 1
Features: [ 67. 1. 4. 120. 229. 0. 2. 129. 1. 2.6 2. 2. 4. ], Target: 0
Features: [ 37. 1. 3. 130. 250. 0. 0. 187. 0. 3.5 3. 0. 3. ], Target: 0
Features: [ 41. 0. 2. 130. 204. 0. 2. 172. 0. 1.4 1. 0. 3. ], Target: 0
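The slicing behaviour is easier to see on a tiny input. This is a minimal sketch with made-up arrays (`features`, `labels`, and `toy_dataset` are hypothetical names, not part of the tutorial): every array is sliced along its first axis, so each dataset element is one row paired with one label.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 3 examples with 2 features each, plus 3 labels.
features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
labels = np.array([0, 1, 0])

# from_tensor_slices slices every array along its first axis, so each
# dataset element is one (feature_row, label) pair.
toy_dataset = tf.data.Dataset.from_tensor_slices((features, labels))

elements = list(toy_dataset.as_numpy_iterator())
for feat, targ in elements:
    print('Features: {}, Target: {}'.format(feat, targ))
```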
Because pd.Series implements the __array__ protocol, it can be used nearly anywhere you would use an np.array or a tf.Tensor.
tf.constant(df['thal'])
<tf.Tensor: shape=(303,), dtype=int8, numpy= array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4, 2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4, 3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2, 4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3, 3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2, 4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4, 3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4], dtype=int8)>
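As a rough illustration of the protocol itself, the sketch below uses NumPy only. The `Celsius` class is a hypothetical container invented for this example; it shows that any object defining `__array__` can be handed to an array-conversion function, which is the same hook pd.Series provides.

```python
import numpy as np
import pandas as pd


class Celsius:
    """Hypothetical container that exposes its data via __array__."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook when the object is passed to np.asarray.
        return np.array(self._values, dtype=dtype)


series = pd.Series([2, 3, 4], dtype='int8')

# Both objects convert through the same protocol.
from_series = np.asarray(series)
from_custom = np.asarray(Celsius([20.5, 21.0]))

print(from_series.dtype, from_series.tolist())
print(from_custom.tolist())
```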
Shuffle and batch the dataset.
train_dataset = dataset.shuffle(len(df)).batch(1)
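The effect of the two calls can be checked on a stand-in dataset. The sketch below is an illustration under assumptions: `toy` holds the integers 0 to 9 instead of the heart data, and the batch size of 4 is chosen only to make the final partial batch visible.

```python
import tensorflow as tf

# Hypothetical stand-in for the real dataset: the integers 0..9.
toy = tf.data.Dataset.range(10)

# A shuffle buffer as large as the dataset shuffles all elements together;
# batch(4) then groups consecutive elements, leaving a smaller final batch.
shuffled = toy.shuffle(10, seed=0).batch(4)
batches = [b.tolist() for b in shuffled.as_numpy_iterator()]

batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # [4, 4, 2]
```

With batch(1), as used in the tutorial, every row becomes its own batch, so one epoch over 303 rows takes 303 steps.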
Create and train a model
def get_compiled_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
model = get_compiled_model()
model.fit(train_dataset, epochs=15)
Epoch 1/15
WARNING:tensorflow:Layer dense is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2. The layer has dtype float32 because its dtype defaults to floatx. If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2. To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.
303/303 [==============================] - 1s 2ms/step - loss: 9.6765 - accuracy: 0.6139
Epoch 2/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6386 - accuracy: 0.7096
Epoch 3/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6770 - accuracy: 0.7129
Epoch 4/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5383 - accuracy: 0.7228
Epoch 5/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5702 - accuracy: 0.7789
Epoch 6/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5445 - accuracy: 0.7756
Epoch 7/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5119 - accuracy: 0.7756
Epoch 8/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6449 - accuracy: 0.7393
Epoch 9/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4893 - accuracy: 0.7756
Epoch 10/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4827 - accuracy: 0.7855
Epoch 11/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5166 - accuracy: 0.7525
Epoch 12/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4551 - accuracy: 0.8053
Epoch 13/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4804 - accuracy: 0.8053
Epoch 14/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4786 - accuracy: 0.7888
Epoch 15/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4304 - accuracy: 0.7987
<tensorflow.python.keras.callbacks.History at 0x7f83c0af59e8>
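The warning in the log appears because the feature array is float64 while the layers default to float32. One possible way to avoid the implicit cast, sketched here under assumptions, is to cast the features once before building the dataset. The arrays `features` and `labels` are hypothetical stand-ins for df.values and target.values.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for df.values and target.values.
features = np.array([[63.0, 1.0, 2.3], [67.0, 1.0, 1.5]])  # float64 by default
labels = np.array([0, 1])

# Cast once, up front, so every element already matches the layers' dtype.
features32 = features.astype(np.float32)
dataset32 = tf.data.Dataset.from_tensor_slices((features32, labels))

# element_spec reports the dtype each dataset component will have.
feature_spec, label_spec = dataset32.element_spec
print(feature_spec.dtype)
```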
Alternative to feature columns
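Passing a dictionary as input to a model is straightforward: create a matching dictionary of tf.keras.layers.Input layers, apply any preprocessing, and stack the inputs using the functional API. You can use this as an alternative to feature columns.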
inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)
x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model_func = tf.keras.Model(inputs=inputs, outputs=output)
model_func.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])
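The stacking step is the part that turns a dictionary of per-column inputs into a single feature matrix. The sketch below illustrates it on concrete tensors rather than Input layers; the dictionary `batch` and its values are assumptions made up for this example.

```python
import tensorflow as tf

# Hypothetical batch of 3 examples with two scalar features each.
batch = {
    'age': tf.constant([63.0, 67.0, 37.0]),
    'chol': tf.constant([233.0, 286.0, 250.0]),
}

# Stacking along the last axis turns N tensors of shape (batch,)
# into one tensor of shape (batch, N), which a Dense layer can consume.
stacked = tf.stack(list(batch.values()), axis=-1)

print(stacked.shape)  # (3, 2)
```

Each row of `stacked` holds the features of one example in dictionary order, which mirrors the layout of df.values in the earlier sections.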
The easiest way to preserve the column structure of a pandas DataFrame when using tf.data is to convert the DataFrame to a dictionary and slice that dictionary.
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
    print(dict_slice)
({'age': <tf.Tensor: shape=(16,), dtype=int32, numpy= array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57], dtype=int32)>, 'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>, 'cp': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>, 'trestbps': <tf.Tensor: shape=(16,), dtype=int32, numpy= array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130, 120, 172, 150], dtype=int32)>, 'chol': <tf.Tensor: shape=(16,), dtype=int32, numpy= array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256, 263, 199, 168], dtype=int32)>, 'fbs': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>, 'restecg': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0], dtype=int32)>, 'thalach': <tf.Tensor: shape=(16,), dtype=int32, numpy= array([150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142, 173, 162, 174], dtype=int32)>, 'exang': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'oldpeak': <tf.Tensor: shape=(16,), dtype=float32, numpy= array([2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6, 0. , 0.5, 1.6], dtype=float32)>, 'slope': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1], dtype=int32)>, 'ca': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'thal': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3], dtype=int32)>}, <tf.Tensor: shape=(16,), dtype=int64, numpy=array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])>)
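To see what is being sliced, it can help to look at the dictionary on its own. This is a small pandas-only sketch; the frame `toy` is a hypothetical two-column stand-in for the heart DataFrame.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the heart DataFrame.
toy = pd.DataFrame({'age': [63, 67, 37], 'oldpeak': [2.3, 1.5, 3.5]})

# orient='list' maps each column name to a plain Python list of its values,
# so slicing along the first axis yields one {'column': value} dict per row.
columns = toy.to_dict('list')
print(columns)  # {'age': [63, 67, 37], 'oldpeak': [2.3, 1.5, 3.5]}

# dict(toy) is an alternative that keeps each column as a pd.Series,
# and therefore keeps the original dtypes.
arrays = dict(toy)
print(arrays['age'].dtype)  # int64
```

Plain Python lists carry no dtype of their own, which is why the integer columns in the output above are int32 rather than the int64 reported by df.dtypes.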
model_func.fit(dict_slices, epochs=15)
Epoch 1/15
19/19 [==============================] - 0s 3ms/step - loss: 33.5475 - accuracy: 0.2739
Epoch 2/15
19/19 [==============================] - 0s 3ms/step - loss: 14.4634 - accuracy: 0.3135
Epoch 3/15
19/19 [==============================] - 0s 3ms/step - loss: 2.5373 - accuracy: 0.6667
Epoch 4/15
19/19 [==============================] - 0s 3ms/step - loss: 1.7439 - accuracy: 0.7228
Epoch 5/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5952 - accuracy: 0.7360
Epoch 6/15
19/19 [==============================] - 0s 3ms/step - loss: 1.6009 - accuracy: 0.7261
Epoch 7/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5803 - accuracy: 0.7261
Epoch 8/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5652 - accuracy: 0.7261
Epoch 9/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5536 - accuracy: 0.7261
Epoch 10/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5390 - accuracy: 0.7261
Epoch 11/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5240 - accuracy: 0.7261
Epoch 12/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5087 - accuracy: 0.7261
Epoch 13/15
19/19 [==============================] - 0s 3ms/step - loss: 1.4926 - accuracy: 0.7261
Epoch 14/15
19/19 [==============================] - 0s 3ms/step - loss: 1.4762 - accuracy: 0.7294
Epoch 15/15
19/19 [==============================] - 0s 3ms/step - loss: 1.4593 - accuracy: 0.7327
<tensorflow.python.keras.callbacks.History at 0x7f83c0af0cf8>