Load a pandas DataFrame with tf.data


This tutorial shows an example of loading a pandas DataFrame and reading its data into a tf.data.Dataset.

This tutorial uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The CSV contains several hundred rows; each row describes a patient and each column an attribute.

We will use this data to predict whether a patient has heart disease, which is a binary classification problem.

Read data using pandas

from __future__ import absolute_import, division, print_function, unicode_literals

import pandas as pd
import tensorflow as tf

Download the CSV file containing the heart dataset.

csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')
Downloading data from https://storage.googleapis.com/applied-dl/heart.csv
16384/13273 [=====================================] - 0s 0us/step

Read the CSV using pandas.

df = pd.read_csv(csv_file)
df.head()
df.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object

Convert the thal column, the only object-typed column in the dataframe, to discrete numeric values.

df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
df.head()
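As a minimal sketch of what pd.Categorical and cat.codes do, consider a toy series (the string values here are illustrative, not the actual contents of the thal column):

```python
import pandas as pd

# Toy series standing in for an object-typed column
thal = pd.Series(['fixed', 'normal', 'reversible', 'normal'])

# pd.Categorical assigns each distinct value an integer code;
# categories are sorted, so 'fixed' -> 0, 'normal' -> 1, 'reversible' -> 2
codes = pd.Categorical(thal).codes

print(list(codes))  # [0, 1, 2, 1]
```

Repeated values share a code, so the column becomes a compact integer representation suitable for feeding to a model.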

Load data using tf.data.Dataset

Use the tf.data.Dataset.from_tensor_slices method to read values from the pandas dataframe.

One of the advantages of tf.data.Dataset is that it lets you write simple, highly efficient data pipelines. See the loading data guide for details.

target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
for feat, targ in dataset.take(5):
  print ('Features: {}, Target: {}'.format(feat, targ))
Features: [ 63.    1.    1.  145.  233.    1.    2.  150.    0.    2.3   3.    0.
   2. ], Target: 0
Features: [ 67.    1.    4.  160.  286.    0.    2.  108.    1.    1.5   2.    3.
   3. ], Target: 1
Features: [ 67.    1.    4.  120.  229.    0.    2.  129.    1.    2.6   2.    2.
   4. ], Target: 0
Features: [ 37.    1.    3.  130.  250.    0.    0.  187.    0.    3.5   3.    0.
   3. ], Target: 0
Features: [ 41.    0.    2.  130.  204.    0.    2.  172.    0.    1.4   1.    0.
   3. ], Target: 0

Since pd.Series implements the __array__ protocol, it can be used nearly anywhere you would use an np.array or a tf.Tensor.

tf.constant(df['thal'])
<tf.Tensor: shape=(303,), dtype=int32, numpy=
array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3,
       3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4,
       2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4,
       4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4,
       3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4,
       3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4,
       3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
       4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2,
       4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3,
       3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2,
       4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4,
       3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3,
       3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4], dtype=int32)>
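The same protocol also means a column converts to a plain NumPy array with no pandas-specific API; a minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, 4, 3])

# np.asarray calls s.__array__() under the hood and returns an ndarray
arr = np.asarray(s)

print(type(arr).__name__, arr.tolist())  # ndarray [2, 3, 4, 3]
```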

Shuffle the dataset and batch it.

train_dataset = dataset.shuffle(len(df)).batch(1)

Create and train a model

def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  return model
model = get_compiled_model()
model.fit(train_dataset, epochs=15)
Train for 303 steps
Epoch 1/15
303/303 [==============================] - 1s 5ms/step - loss: 2.8594 - accuracy: 0.6997
Epoch 2/15
303/303 [==============================] - 1s 3ms/step - loss: 0.6263 - accuracy: 0.6964
Epoch 3/15
303/303 [==============================] - 1s 3ms/step - loss: 0.5709 - accuracy: 0.7294
Epoch 4/15
303/303 [==============================] - 1s 3ms/step - loss: 0.5664 - accuracy: 0.7492
Epoch 5/15
303/303 [==============================] - 1s 3ms/step - loss: 0.5399 - accuracy: 0.7459
Epoch 6/15
303/303 [==============================] - 1s 3ms/step - loss: 0.5361 - accuracy: 0.7393
Epoch 7/15
303/303 [==============================] - 1s 3ms/step - loss: 0.5178 - accuracy: 0.7525
Epoch 8/15
303/303 [==============================] - 1s 3ms/step - loss: 0.5113 - accuracy: 0.7690
Epoch 9/15
303/303 [==============================] - 1s 3ms/step - loss: 0.5216 - accuracy: 0.7426
Epoch 10/15
303/303 [==============================] - 1s 3ms/step - loss: 0.5012 - accuracy: 0.7558
Epoch 11/15
303/303 [==============================] - 1s 3ms/step - loss: 0.4947 - accuracy: 0.7624
Epoch 12/15
303/303 [==============================] - 1s 3ms/step - loss: 0.4745 - accuracy: 0.7558
Epoch 13/15
303/303 [==============================] - 1s 3ms/step - loss: 0.4704 - accuracy: 0.7789
Epoch 14/15
303/303 [==============================] - 1s 3ms/step - loss: 0.4706 - accuracy: 0.7591
Epoch 15/15
303/303 [==============================] - 1s 3ms/step - loss: 0.4522 - accuracy: 0.7855

<tensorflow.python.keras.callbacks.History at 0x7f3c101ebb00>

Alternative to feature columns

Passing a dictionary as an input to a model is as easy as creating a matching dictionary of tf.keras.layers.Input layers, applying any preprocessing, and stacking them up using the functional API. You can use this as an alternative to feature columns.

inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

The easiest way to preserve the column structure of a pandas DataFrame when used with tf.data is to convert the DataFrame to a dictionary and slice that dictionary.
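To see the shape of what to_dict('list') produces, here is a minimal sketch with a toy two-column frame (the column names and values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({'age': [63, 67], 'chol': [233, 286]})

# 'list' orientation keeps one key per column, each mapping to a plain Python list
d = toy.to_dict('list')

print(d)  # {'age': [63, 67], 'chol': [233, 286]}
```

Passing such a dictionary to from_tensor_slices gives a dataset whose elements are themselves dictionaries keyed by column name.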

dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
  print (dict_slice)
({'age': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57],
      dtype=int32)>, 'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>, 'cp': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>, 'trestbps': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130,
       120, 172, 150], dtype=int32)>, 'chol': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256,
       263, 199, 168], dtype=int32)>, 'fbs': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>, 'restecg': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0], dtype=int32)>, 'thalach': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142,
       173, 162, 174], dtype=int32)>, 'exang': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'oldpeak': <tf.Tensor: shape=(16,), dtype=float32, numpy=
array([2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6,
       0. , 0.5, 1.6], dtype=float32)>, 'slope': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1], dtype=int32)>, 'ca': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'thal': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3], dtype=int32)>}, <tf.Tensor: shape=(16,), dtype=int64, numpy=array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])>)
model_func.fit(dict_slices, epochs=15)
Train for 19 steps
Epoch 1/15
19/19 [==============================] - 0s 22ms/step - loss: 6.8100 - accuracy: 0.5611
Epoch 2/15
19/19 [==============================] - 0s 4ms/step - loss: 5.8396 - accuracy: 0.4653
Epoch 3/15
19/19 [==============================] - 0s 4ms/step - loss: 4.9932 - accuracy: 0.5215
Epoch 4/15
19/19 [==============================] - 0s 4ms/step - loss: 4.2268 - accuracy: 0.4917
Epoch 5/15
19/19 [==============================] - 0s 4ms/step - loss: 3.4864 - accuracy: 0.5017
Epoch 6/15
19/19 [==============================] - 0s 4ms/step - loss: 2.8785 - accuracy: 0.5182
Epoch 7/15
19/19 [==============================] - 0s 4ms/step - loss: 2.4413 - accuracy: 0.5347
Epoch 8/15
19/19 [==============================] - 0s 4ms/step - loss: 2.1580 - accuracy: 0.5512
Epoch 9/15
19/19 [==============================] - 0s 4ms/step - loss: 1.9228 - accuracy: 0.5578
Epoch 10/15
19/19 [==============================] - 0s 4ms/step - loss: 1.7238 - accuracy: 0.5677
Epoch 11/15
19/19 [==============================] - 0s 4ms/step - loss: 1.5732 - accuracy: 0.5743
Epoch 12/15
19/19 [==============================] - 0s 4ms/step - loss: 1.4488 - accuracy: 0.5776
Epoch 13/15
19/19 [==============================] - 0s 4ms/step - loss: 1.3455 - accuracy: 0.5908
Epoch 14/15
19/19 [==============================] - 0s 4ms/step - loss: 1.2603 - accuracy: 0.6007
Epoch 15/15
19/19 [==============================] - 0s 4ms/step - loss: 1.1828 - accuracy: 0.6073

<tensorflow.python.keras.callbacks.History at 0x7f3c100cf0b8>