Load pandas DataFrames with tf.data

This tutorial shows how to load pandas DataFrames into a tf.data.Dataset.

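This tutorial uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The CSV file contains a few hundred rows. Each row describes a patient, and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which is a binary classification task.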

Read data using pandas

import pandas as pd
import tensorflow as tf

Download the CSV file containing the heart dataset.

csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/download.tensorflow.org/data/heart.csv')
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/heart.csv
13273/13273 [==============================] - 0s 0us/step

Read the CSV file using pandas.

df = pd.read_csv(csv_file)
df.head()
df.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object
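
Before a DataFrame is packed into a numeric tensor, it can help to list the columns that are not numeric, since those need encoding first. The following is a minimal sketch on a small made-up frame; the name `toy_df` and its values are illustrative assumptions, not taken from heart.csv.

```python
import pandas as pd

# Hypothetical toy frame that mimics the mix of column types shown above.
toy_df = pd.DataFrame({
    'age': [63, 67, 37],
    'oldpeak': [2.3, 1.5, 3.5],
    'thal': ['fixed', 'normal', 'reversible'],
})

# Columns that are not numeric must be encoded before they can share
# a single numeric tensor with the other columns.
non_numeric = toy_df.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```

On the real dataset, the same call on `df` should single out the thal column, given the dtypes listed above.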

Convert the thal column, which has dtype object in the DataFrame, to discrete numeric values.

# Treat thal as a categorical column, then replace each label with its integer code.
df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
df.head()
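
To see what the conversion does, here is a small sketch of how `pd.Categorical` assigns codes. The labels below are hypothetical stand-ins, not the actual contents of the thal column.

```python
import pandas as pd

# Hypothetical labels; None stands in for a missing value.
labels = pd.Series(['normal', 'fixed', 'reversible', 'normal', None])

cat = pd.Categorical(labels)

# Categories are the sorted unique values; each code is an index into them.
print(list(cat.categories))
# Missing values receive the code -1.
print(list(cat.codes))

# The codes can be mapped back to the original labels when needed.
lookup = dict(enumerate(cat.categories))
print(lookup)
```

Because the codes follow the sorted order of the labels, they carry no numeric meaning of their own. Treating them as plain numbers, as this tutorial does, is a simplification.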

Load data using tf.data.Dataset

Use tf.data.Dataset.from_tensor_slices to read the values from a pandas DataFrame.

One advantage of using tf.data.Dataset is that it lets you write simple, highly efficient data pipelines. See the loading data guide to learn more.

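# Separate the label column from the feature columns.
target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))

# Inspect the first five (features, target) pairs.
for feat, targ in dataset.take(5):
  print('Features: {}, Target: {}'.format(feat, targ))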
Features: [ 63.    1.    1.  145.  233.    1.    2.  150.    0.    2.3   3.    0.
   2. ], Target: 0
Features: [ 67.    1.    4.  160.  286.    0.    2.  108.    1.    1.5   2.    3.
   3. ], Target: 1
Features: [ 67.    1.    4.  120.  229.    0.    2.  129.    1.    2.6   2.    2.
   4. ], Target: 0
Features: [ 37.    1.    3.  130.  250.    0.    0.  187.    0.    3.5   3.    0.
   3. ], Target: 0
Features: [ 41.    0.    2.  130.  204.    0.    2.  172.    0.    1.4   1.    0.
   3. ], Target: 0
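
The features above print as floats even though most columns are integers. A minimal sketch on a made-up frame suggests why: `.values` packs all columns into one NumPy array with a single dtype. The names `toy_df`, `toy_target`, and `toy_ds` are illustrative assumptions.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical toy frame with one integer and one float feature column.
toy_df = pd.DataFrame({
    'age': [63, 67, 37],
    'oldpeak': [2.3, 1.5, 3.5],
    'target': [0, 1, 0],
})
toy_target = toy_df.pop('target')

# .values returns a single 2-D array, so the integer column is upcast
# to float64 to share a dtype with the float column.
toy_ds = tf.data.Dataset.from_tensor_slices((toy_df.values, toy_target.values))

# element_spec describes the shape and dtype of one (features, target) pair.
features_spec, target_spec = toy_ds.element_spec
print(features_spec)
print(target_spec)
```

Each dataset element is one row: a feature vector with one entry per column, paired with a scalar label.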

Because pd.Series implements the __array__ protocol, it can be used transparently nearly anywhere you would use an np.array or a tf.Tensor.

tf.constant(df['thal'])
<tf.Tensor: shape=(303,), dtype=int8, numpy=
array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3,
       3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4,
       2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4,
       4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4,
       3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4,
       3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4,
       3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
       4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2,
       4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3,
       3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2,
       4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4,
       3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3,
       3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4], dtype=int8)>
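
The same protocol can be observed with NumPy alone. This is a small sketch on a made-up Series; the values are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, 4], name='thal')

# pd.Series exposes __array__, which NumPy calls to obtain the data.
print(hasattr(s, '__array__'))

# Explicit conversion to an ndarray.
arr = np.asarray(s)

# Functions that expect arrays can take the Series directly.
joined = np.concatenate([s, s])
print(arr, joined)
```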

Shuffle and batch the dataset.

train_dataset = dataset.shuffle(len(df)).batch(1)
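
To illustrate what shuffle and batch do, here is a minimal sketch on a toy dataset of ten integers. The batch size of 4 is an arbitrary choice for the illustration, whereas the tutorial itself uses a batch size of 1.

```python
import tensorflow as tf

# Ten integers stand in for ten rows of a dataset.
toy = tf.data.Dataset.range(10)

# A shuffle buffer as large as the dataset shuffles all elements together;
# batch(4) then groups consecutive shuffled elements.
batches = [b.numpy().tolist() for b in toy.shuffle(buffer_size=10).batch(4)]

# The last batch holds the remainder when the size does not divide evenly.
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

With `batch(1)`, every training step sees a single row. Larger batches are a common choice, but which size works best here is not something the tutorial states.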

Create and train a model

def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  return model

model = get_compiled_model()
model.fit(train_dataset, epochs=15)
Epoch 1/15
303/303 [==============================] - 2s 2ms/step - loss: 5.3122 - accuracy: 0.5611
Epoch 2/15
303/303 [==============================] - 1s 2ms/step - loss: 2.6524 - accuracy: 0.5776
Epoch 3/15
303/303 [==============================] - 1s 2ms/step - loss: 2.0161 - accuracy: 0.5809
Epoch 4/15
303/303 [==============================] - 1s 2ms/step - loss: 1.5642 - accuracy: 0.5908
Epoch 5/15
303/303 [==============================] - 1s 2ms/step - loss: 1.0725 - accuracy: 0.6535
Epoch 6/15
303/303 [==============================] - 1s 2ms/step - loss: 0.9007 - accuracy: 0.6733
Epoch 7/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6849 - accuracy: 0.7096
Epoch 8/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6602 - accuracy: 0.7525
Epoch 9/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7093 - accuracy: 0.7195
Epoch 10/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6546 - accuracy: 0.7426
Epoch 11/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6147 - accuracy: 0.7327
Epoch 12/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6334 - accuracy: 0.7492
Epoch 13/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6366 - accuracy: 0.7558
Epoch 14/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5893 - accuracy: 0.7294
Epoch 15/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5902 - accuracy: 0.7525
<keras.callbacks.History at 0x7f8c08467cd0>
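
The 303/303 counter in the log is the number of batches per epoch: the dataset has 303 rows, and the batch size is 1. A short sketch of that arithmetic, using a hypothetical helper named `steps_per_epoch`:

```python
import math

def steps_per_epoch(num_rows, batch_size):
    # A final partial batch still counts as one step, hence the ceiling.
    return math.ceil(num_rows / batch_size)

print(steps_per_epoch(303, 1))   # 303
print(steps_per_epoch(303, 16))  # 19
```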

Alternative to feature columns

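Passing a dictionary as input to a model is as easy as creating a matching dictionary of tf.keras.layers.Input layers, applying any preprocessing, and stacking them using the functional API. You can use this as an alternative to feature columns.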

inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])
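
As a smaller illustration of dictionary inputs, the sketch below builds a model over three of the columns and calls it on a dict of zero-filled arrays. It is a variation, not the code above: it assumes inputs of shape (1,) joined with a Concatenate layer instead of scalar inputs joined with tf.stack, and the names `toy_inputs` and `toy_model` are hypothetical.

```python
import numpy as np
import tensorflow as tf

feature_names = ['age', 'chol', 'thalach']  # subset chosen for illustration

# One Input per feature, keyed by the feature name.
toy_inputs = {name: tf.keras.layers.Input(shape=(1,), name=name)
              for name in feature_names}

# Join the per-feature inputs into a single (batch, 3) tensor.
x = tf.keras.layers.Concatenate(axis=-1)(list(toy_inputs.values()))
x = tf.keras.layers.Dense(4, activation='relu')(x)
toy_output = tf.keras.layers.Dense(1, activation='sigmoid')(x)
toy_model = tf.keras.Model(inputs=toy_inputs, outputs=toy_output)

# Feed a dict whose keys match the Input names.
batch = {name: np.zeros((4, 1), dtype='float32') for name in feature_names}
preds = toy_model(batch)
print(tuple(preds.shape))
```

The model looks up each input by key, so the order of the entries in the dict passed at call time does not matter.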

When used with tf.data, the easiest way to preserve the column structure of a pd.DataFrame is to convert the pd.DataFrame to a dict and slice that dictionary.

dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
  print(dict_slice)
({'age': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57],
      dtype=int32)>, 'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>, 'cp': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>, 'trestbps': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130,
       120, 172, 150], dtype=int32)>, 'chol': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256,
       263, 199, 168], dtype=int32)>, 'fbs': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>, 'restecg': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0], dtype=int32)>, 'thalach': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142,
       173, 162, 174], dtype=int32)>, 'exang': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'oldpeak': <tf.Tensor: shape=(16,), dtype=float32, numpy=
array([2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6,
       0. , 0.5, 1.6], dtype=float32)>, 'slope': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1], dtype=int32)>, 'ca': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'thal': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3], dtype=int32)>}, <tf.Tensor: shape=(16,), dtype=int64, numpy=array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])>)
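
Slicing a dictionary keeps each column as a separate named tensor. If a single feature matrix is needed later, the columns can be stacked inside the pipeline. The following is a minimal sketch on made-up data; `toy_columns`, `toy_labels`, and `stack_features` are hypothetical names.

```python
import tensorflow as tf

# Hypothetical columns in the dict-of-lists layout that to_dict('list') yields.
toy_columns = {'age': [63, 67, 37], 'oldpeak': [2.3, 1.5, 3.5]}
toy_labels = [0, 1, 0]

ds = tf.data.Dataset.from_tensor_slices((toy_columns, toy_labels)).batch(2)

def stack_features(features, label):
    # Cast every column to float32 and stack into a (batch, num_columns) tensor.
    cols = [tf.cast(features[k], tf.float32) for k in sorted(features)]
    return tf.stack(cols, axis=-1), label

stacked = ds.map(stack_features)
shapes = [tuple(x.shape) for x, _ in stacked]
print(shapes)
```

Sorting the keys fixes the column order, so the stacked matrix has the same layout in every batch.
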
model_func.fit(dict_slices, epochs=15)
Epoch 1/15
19/19 [==============================] - 0s 4ms/step - loss: 77.0238 - accuracy: 0.2739
Epoch 2/15
19/19 [==============================] - 0s 4ms/step - loss: 56.7002 - accuracy: 0.2739
Epoch 3/15
19/19 [==============================] - 0s 4ms/step - loss: 37.6057 - accuracy: 0.2739
Epoch 4/15
19/19 [==============================] - 0s 3ms/step - loss: 19.2780 - accuracy: 0.2739
Epoch 5/15
19/19 [==============================] - 0s 4ms/step - loss: 5.9422 - accuracy: 0.4224
Epoch 6/15
19/19 [==============================] - 0s 3ms/step - loss: 3.6487 - accuracy: 0.6271
Epoch 7/15
19/19 [==============================] - 0s 4ms/step - loss: 3.4792 - accuracy: 0.6205
Epoch 8/15
19/19 [==============================] - 0s 3ms/step - loss: 3.3901 - accuracy: 0.6007
Epoch 9/15
19/19 [==============================] - 0s 4ms/step - loss: 3.3224 - accuracy: 0.5908
Epoch 10/15
19/19 [==============================] - 0s 3ms/step - loss: 3.2393 - accuracy: 0.5908
Epoch 11/15
19/19 [==============================] - 0s 3ms/step - loss: 3.1605 - accuracy: 0.5941
Epoch 12/15
19/19 [==============================] - 0s 3ms/step - loss: 3.0813 - accuracy: 0.5941
Epoch 13/15
19/19 [==============================] - 0s 3ms/step - loss: 2.9993 - accuracy: 0.5941
Epoch 14/15
19/19 [==============================] - 0s 4ms/step - loss: 2.9160 - accuracy: 0.5974
Epoch 15/15
19/19 [==============================] - 0s 4ms/step - loss: 2.8319 - accuracy: 0.6007
<keras.callbacks.History at 0x7f8c080f5fd0>