Load a pandas DataFrame with tf.data


This tutorial provides an example of loading data from a pandas DataFrame into a tf.data.Dataset.

This tutorial uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The CSV file contains several hundred rows; each row describes a patient, and each column describes an attribute.

You will use this data to predict whether a patient has heart disease, which is a binary classification task.

Read data using pandas

import pandas as pd
import tensorflow as tf

Download the CSV file containing the heart dataset.

csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')
Downloading data from https://storage.googleapis.com/applied-dl/heart.csv
16384/13273 [=====================================] - 0s 0us/step

Read the CSV file using pandas.

df = pd.read_csv(csv_file)
df.head()
df.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object

Convert the thal column, which is the only object-typed column in the dataframe, to a discrete numeric value.

df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
df.head()
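If you later need to map the integer codes back to the original string labels, you can capture the categorical index before overwriting the column. A minimal sketch, meant to run before the cell above; the thal_categories name is illustrative and not part of the original tutorial:

# Keep the category index so integer codes can be mapped back to the original labels.
thal_categories = pd.Categorical(df['thal']).categories
# thal_categories[code] then recovers the original label for a given integer code.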

Load data using tf.data.Dataset

Use the tf.data.Dataset.from_tensor_slices method to read the values from the pandas dataframe.

The advantage of using a tf.data.Dataset is that it lets you write simple and highly efficient data pipelines. See the loading data guide for more details.

target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
for feat, targ in dataset.take(5):
  print ('Features: {}, Target: {}'.format(feat, targ))
Features: [ 63.    1.    1.  145.  233.    1.    2.  150.    0.    2.3   3.    0.
   2. ], Target: 0
Features: [ 67.    1.    4.  160.  286.    0.    2.  108.    1.    1.5   2.    3.
   3. ], Target: 1
Features: [ 67.    1.    4.  120.  229.    0.    2.  129.    1.    2.6   2.    2.
   4. ], Target: 0
Features: [ 37.    1.    3.  130.  250.    0.    0.  187.    0.    3.5   3.    0.
   3. ], Target: 0
Features: [ 41.    0.    2.  130.  204.    0.    2.  172.    0.    1.4   1.    0.
   3. ], Target: 0

Because pd.Series implements the __array__ protocol, it can be used almost anywhere you would use an np.array or a tf.Tensor.

tf.constant(df['thal'])
<tf.Tensor: shape=(303,), dtype=int8, numpy=
array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3,
       3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4,
       2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4,
       4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4,
       3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4,
       3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4,
       3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
       4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2,
       4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3,
       3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2,
       4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4,
       3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3,
       3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4], dtype=int8)>
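The same __array__ protocol means NumPy can consume the Series directly as well; a short sketch:

import numpy as np

# The Series converts straight into a plain NumPy array.
np.array(df['thal'])[:10]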

Shuffle and batch the dataset.

train_dataset = dataset.shuffle(len(df)).batch(1)
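A batch size of 1 keeps this example simple, but a typical pipeline would use a larger batch and overlap preprocessing with training via prefetch. A sketch under that assumption; the batch size of 32 and the tuned_dataset name are illustrative choices, not part of the original tutorial:

# Larger batches plus prefetching usually make better use of the hardware.
tuned_dataset = dataset.shuffle(len(df)).batch(32).prefetch(tf.data.experimental.AUTOTUNE)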

Create and train a model

def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  return model
model = get_compiled_model()
model.fit(train_dataset, epochs=15)
Epoch 1/15
WARNING:tensorflow:Layer dense is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because its dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

303/303 [==============================] - 1s 2ms/step - loss: 9.6765 - accuracy: 0.6139
Epoch 2/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6386 - accuracy: 0.7096
Epoch 3/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6770 - accuracy: 0.7129
Epoch 4/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5383 - accuracy: 0.7228
Epoch 5/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5702 - accuracy: 0.7789
Epoch 6/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5445 - accuracy: 0.7756
Epoch 7/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5119 - accuracy: 0.7756
Epoch 8/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6449 - accuracy: 0.7393
Epoch 9/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4893 - accuracy: 0.7756
Epoch 10/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4827 - accuracy: 0.7855
Epoch 11/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5166 - accuracy: 0.7525
Epoch 12/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4551 - accuracy: 0.8053
Epoch 13/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4804 - accuracy: 0.8053
Epoch 14/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4786 - accuracy: 0.7888
Epoch 15/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4304 - accuracy: 0.7987

<tensorflow.python.keras.callbacks.History at 0x7f83c0af59e8>
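After training, the same tf.data pipeline can be used to evaluate the model or to generate predictions. A minimal sketch, evaluating on the training data only because this tutorial does not create a separate test split:

# Loss and accuracy over the dataset used for training above.
loss, accuracy = model.evaluate(train_dataset)

# Predicted probabilities; values above 0.5 correspond to the positive class.
predictions = model.predict(train_dataset)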

Alternative to feature columns

Passing a dictionary as the input to a model is as easy as creating a matching dictionary of tf.keras.layers.Input layers, applying any preprocessing, and stacking them up using the functional API. You can use this as an alternative to feature columns.

inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

The easiest way to preserve the column structure of a pandas DataFrame when used with tf.data is to convert the DataFrame to a dictionary and take slices of that dictionary.

dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
  print (dict_slice)
({'age': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57],
      dtype=int32)>, 'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>, 'cp': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>, 'trestbps': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130,
       120, 172, 150], dtype=int32)>, 'chol': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256,
       263, 199, 168], dtype=int32)>, 'fbs': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>, 'restecg': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0], dtype=int32)>, 'thalach': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142,
       173, 162, 174], dtype=int32)>, 'exang': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'oldpeak': <tf.Tensor: shape=(16,), dtype=float32, numpy=
array([2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6,
       0. , 0.5, 1.6], dtype=float32)>, 'slope': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1], dtype=int32)>, 'ca': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'thal': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3], dtype=int32)>}, <tf.Tensor: shape=(16,), dtype=int64, numpy=array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])>)

model_func.fit(dict_slices, epochs=15)
Epoch 1/15
19/19 [==============================] - 0s 3ms/step - loss: 33.5475 - accuracy: 0.2739
Epoch 2/15
19/19 [==============================] - 0s 3ms/step - loss: 14.4634 - accuracy: 0.3135
Epoch 3/15
19/19 [==============================] - 0s 3ms/step - loss: 2.5373 - accuracy: 0.6667
Epoch 4/15
19/19 [==============================] - 0s 3ms/step - loss: 1.7439 - accuracy: 0.7228
Epoch 5/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5952 - accuracy: 0.7360
Epoch 6/15
19/19 [==============================] - 0s 3ms/step - loss: 1.6009 - accuracy: 0.7261
Epoch 7/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5803 - accuracy: 0.7261
Epoch 8/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5652 - accuracy: 0.7261
Epoch 9/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5536 - accuracy: 0.7261
Epoch 10/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5390 - accuracy: 0.7261
Epoch 11/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5240 - accuracy: 0.7261
Epoch 12/15
19/19 [==============================] - 0s 3ms/step - loss: 1.5087 - accuracy: 0.7261
Epoch 13/15
19/19 [==============================] - 0s 3ms/step - loss: 1.4926 - accuracy: 0.7261
Epoch 14/15
19/19 [==============================] - 0s 3ms/step - loss: 1.4762 - accuracy: 0.7294
Epoch 15/15
19/19 [==============================] - 0s 3ms/step - loss: 1.4593 - accuracy: 0.7327

<tensorflow.python.keras.callbacks.History at 0x7f83c0af0cf8>
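The dict-input model expects the same dictionary structure at prediction time. A minimal sketch; the sample variable below is illustrative and simply takes the first five rows of the dataframe:

import numpy as np

# Build a dict of per-column arrays for the first five patients and predict.
sample = {name: np.array(values[:5]) for name, values in df.to_dict('list').items()}
model_func.predict(sample)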