TF 2.0 is out! Get hands-on practice at TF World, Oct 28-31. Use code TF20 for 20% off select passes. Register now

对结构化数据进行分类

在 tensorflow.google.cn 上查看 在 Google Colab 运行 在 Github 上查看源代码 下载此 notebook

本教程演示了如何对结构化数据进行分类(例如,CSV 中的表格数据)。我们将使用 Keras 来定义模型,将特征列(feature columns) 作为从 CSV 中的列(columns)映射到用于训练模型的特征(features)的桥梁。本教程包括了以下内容的完整代码:

  • Pandas 导入 CSV 文件。
  • tf.data 建立了一个输入流水线(pipeline),用于对行进行分批(batch)和随机排序(shuffle)。
  • 用特征列将 CSV 中的列映射到用于训练模型的特征。
  • 用 Keras 构建,训练并评估模型。

数据集

我们将使用一个小型 数据集,该数据集由克利夫兰心脏病诊所基金会(Cleveland Clinic Foundation for Heart Disease)提供。CSV 中有几百行数据。每行描述了一个病人(patient),每列描述了一个属性(attribute)。我们将使用这些信息来预测一位病人是否患有心脏病,这是在该数据集上的二分类任务。

下面是该数据集的描述。 请注意,有数值(numeric)和类别(categorical)类型的列。

描述 特征类型 数据类型
Age 年龄以年为单位 Numerical integer
Sex (1 = 男;0 = 女) Categorical integer
CP 胸痛类型(0,1,2,3,4) Categorical integer
Trestbpd 静息血压(入院时,以mm Hg计) Numerical integer
Chol 血清胆固醇(mg/dl) Numerical integer
FBS (空腹血糖> 120 mg/dl)(1 = true;0 = false) Categorical integer
RestECG 静息心电图结果(0,1,2) Categorical integer
Thalach 达到的最大心率 Numerical integer
Exang 运动诱发心绞痛(1 =是;0 =否) Categorical integer
Oldpeak 与休息时相比由运动引起的 ST 节段下降 Numerical integer
Slope 在运动高峰 ST 段的斜率 Numerical float
CA 荧光透视法染色的大血管动脉(0-3)的数量 Numerical integer
Thal 3 =正常;6 =固定缺陷;7 =可逆缺陷 Categorical string
Target 心脏病诊断(1 = true;0 = false) Classification integer

导入 TensorFlow 和其他库

!pip install -q sklearn
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd

import tensorflow as tf

from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

使用 Pandas 创建一个 dataframe

Pandas 是一个 Python 库,它有许多有用的实用程序,用于加载和处理结构化数据。我们将使用 Pandas 从 URL下载数据集,并将其加载到 dataframe 中。

URL = 'https://storage.googleapis.com/applied-dl/heart.csv'
dataframe = pd.read_csv(URL)
dataframe.head()

将 dataframe 拆分为训练、验证和测试集

我们下载的数据集是一个 CSV 文件。 我们将其拆分为训练、验证和测试集。

train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
193 train examples
49 validation examples
61 test examples

用 tf.data 创建输入流水线

接下来,我们将使用 tf.data 包装 dataframe。这让我们能将特征列作为一座桥梁,该桥梁将 Pandas dataframe 中的列映射到用于训练模型的特征。如果我们使用一个非常大的 CSV 文件(非常大以至于它不能放入内存),我们将使用 tf.data 直接从磁盘读取它。本教程不涉及这一点。

# 一种从 Pandas Dataframe 创建 tf.data 数据集的实用程序方法(utility method)
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('target')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds
batch_size = 5 # 小批量大小用于演示
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

理解输入流水线

现在我们已经创建了输入流水线,让我们调用它来查看它返回的数据的格式。 我们使用了一小批量大小来保持输出的可读性。

for feature_batch, label_batch in train_ds.take(1):
  print('Every feature:', list(feature_batch.keys()))
  print('A batch of ages:', feature_batch['age'])
  print('A batch of targets:', label_batch )
Every feature: ['restecg', 'fbs', 'cp', 'thalach', 'thal', 'exang', 'age', 'oldpeak', 'slope', 'sex', 'trestbps', 'ca', 'chol']
A batch of ages: tf.Tensor([50 60 52 60 52], shape=(5,), dtype=int32)
A batch of targets: tf.Tensor([1 1 0 0 0], shape=(5,), dtype=int32)

我们可以看到数据集返回了一个字典,该字典从列名称(来自 dataframe)映射到 dataframe 中行的列值。

演示几种特征列

TensorFlow 提供了多种特征列。本节中,我们将创建几类特征列,并演示特征列如何转换 dataframe 中的列。

# 我们将使用该批数据演示几种特征列
example_batch = next(iter(train_ds))[0]
# 用于创建一个特征列
# 并转换一批次数据的一个实用程序方法
def demo(feature_column):
  feature_layer = layers.DenseFeatures(feature_column)
  print(feature_layer(example_batch).numpy())

数值列

一个特征列的输出将成为模型的输入(使用上面定义的 demo 函数,我们将能准确地看到 dataframe 中的每列的转换方式)。 数值列(numeric column) 是最简单的列类型。它用于表示实数特征。使用此列时,模型将从 dataframe 中接收未更改的列值。

age = feature_column.numeric_column("age")
demo(age)
WARNING:tensorflow:Layer dense_features is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

[[64.]
 [45.]
 [60.]
 [43.]
 [57.]]

在这个心脏病数据集中,dataframe 中的大多数列都是数值列。

分桶列

通常,您不希望将数字直接输入模型,而是根据数值范围将其值分成不同的类别。考虑代表一个人年龄的原始数据。我们可以用 分桶列(bucketized column)将年龄分成几个分桶(buckets),而不是将年龄表示成数值列。请注意下面的 one-hot 数值表示每行匹配的年龄范围。

age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
demo(age_buckets)
WARNING:tensorflow:Layer dense_features_1 is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]

分类列

在此数据集中,thal 用字符串表示(如 'fixed','normal',或 'reversible')。我们无法直接将字符串提供给模型。相反,我们必须首先将它们映射到数值。分类词汇列(categorical vocabulary columns)提供了一种用 one-hot 向量表示字符串的方法(就像您在上面看到的年龄分桶一样)。词汇表可以用 categorical_column_with_vocabulary_list 作为 list 传递,或者用 categorical_column_with_vocabulary_file 从文件中加载。

thal = feature_column.categorical_column_with_vocabulary_list(
      'thal', ['fixed', 'normal', 'reversible'])

thal_one_hot = feature_column.indicator_column(thal)
demo(thal_one_hot)
WARNING:tensorflow:Layer dense_features_2 is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4273: IndicatorColumn._variable_shape (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4328: VocabularyListCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
[[0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]]

在更复杂的数据集中,许多列都是分类列(如 strings)。在处理分类数据时,特征列最有价值。尽管在该数据集中只有一列分类列,但我们将使用它来演示在处理其他数据集时,可以使用的几种重要的特征列。

嵌入列

假设我们不是只有几个可能的字符串,而是每个类别有数千(或更多)值。 由于多种原因,随着类别数量的增加,使用 one-hot 编码训练神经网络变得不可行。我们可以使用嵌入列来克服此限制。嵌入列(embedding column)将数据表示为一个低维度密集向量,而非多维的 one-hot 向量,该低维度密集向量可以包含任何数,而不仅仅是 0 或 1。嵌入的大小(在下面的示例中为 8)是必须调整的参数。

关键点:当分类列具有许多可能的值时,最好使用嵌入列。我们在这里使用嵌入列用于演示目的,为此您有一个完整的示例,以在将来可以修改用于其他数据集。

# 注意到嵌入列的输入是我们之前创建的类别列
thal_embedding = feature_column.embedding_column(thal, dimension=8)
demo(thal_embedding)
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:353: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.add_weight` method instead.
WARNING:tensorflow:Layer dense_features_3 is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow_core/python/ops/embedding_ops.py:802: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
[[ 0.10606854  0.10008239 -0.24108642  0.0186569  -0.08850511  0.14744145
   0.0191577   0.02641513]
 [ 0.10606854  0.10008239 -0.24108642  0.0186569  -0.08850511  0.14744145
   0.0191577   0.02641513]
 [ 0.2372275   0.27889338 -0.42129204  0.03703581 -0.3204124   0.35321754
   0.47390804 -0.22264898]
 [ 0.10606854  0.10008239 -0.24108642  0.0186569  -0.08850511  0.14744145
   0.0191577   0.02641513]
 [ 0.10606854  0.10008239 -0.24108642  0.0186569  -0.08850511  0.14744145
   0.0191577   0.02641513]]

经过哈希处理的特征列

表示具有大量数值的分类列的另一种方法是使用 categorical_column_with_hash_bucket。该特征列计算输入的一个哈希值,然后选择一个 hash_bucket_size 分桶来编码字符串。使用此列时,您不需要提供词汇表,并且可以选择使 hash_buckets 的数量远远小于实际类别的数量以节省空间。

关键点:该技术的一个重要缺点是可能存在冲突,不同的字符串被映射到同一个范围。实际上,无论如何,经过哈希处理的特征列对某些数据集都有效。

thal_hashed = feature_column.categorical_column_with_hash_bucket(
      'thal', hash_bucket_size=1000)
demo(feature_column.indicator_column(thal_hashed))
WARNING:tensorflow:Layer dense_features_4 is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4328: HashedCategoricalColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

组合的特征列

将多种特征组合到一个特征中,称为特征组合(feature crosses),它让模型能够为每种特征组合学习单独的权重。此处,我们将创建一个 age 和 thal 组合的新特征。请注意,crossed_column 不会构建所有可能组合的完整列表(可能非常大)。相反,它由 hashed_column 支持,因此您可以选择表的大小。

crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000)
demo(feature_column.indicator_column(crossed_feature))
WARNING:tensorflow:Layer dense_features_5 is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.5/site-packages/tensorflow_core/python/feature_column/feature_column_v2.py:4328: CrossedColumn._num_buckets (from tensorflow.python.feature_column.feature_column_v2) is deprecated and will be removed in a future version.
Instructions for updating:
The old _FeatureColumn APIs are being deprecated. Please use the new FeatureColumn APIs instead.
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

选择要使用的列

我们已经了解了如何使用几种类型的特征列。 现在我们将使用它们来训练模型。本教程的目标是向您展示使用特征列所需的完整代码(例如,机制)。我们任意地选择了几列来训练我们的模型。

关键点:如果您的目标是建立一个准确的模型,请尝试使用您自己的更大的数据集,并仔细考虑哪些特征最有意义,以及如何表示它们。

feature_columns = []

# 数值列
for header in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']:
  feature_columns.append(feature_column.numeric_column(header))

# 分桶列
age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)

# 分类列
thal = feature_column.categorical_column_with_vocabulary_list(
      'thal', ['fixed', 'normal', 'reversible'])
thal_one_hot = feature_column.indicator_column(thal)
feature_columns.append(thal_one_hot)

# 嵌入列
thal_embedding = feature_column.embedding_column(thal, dimension=8)
feature_columns.append(thal_embedding)

# 组合列
crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000)
crossed_feature = feature_column.indicator_column(crossed_feature)
feature_columns.append(crossed_feature)

建立一个新的特征层

现在我们已经定义了我们的特征列,我们将使用密集特征(DenseFeatures)层将特征列输入到我们的 Keras 模型中。

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

之前,我们使用一个小批量大小来演示特征列如何运转。我们将创建一个新的更大批量的输入流水线。

batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

创建,编译和训练模型

model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(128, activation='relu'),
  layers.Dense(128, activation='relu'),
  layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'],
              run_eagerly=True)

model.fit(train_ds,
          validation_data=val_ds,
          epochs=5)
WARNING:tensorflow:Layer sequential is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

Epoch 1/5
7/7 [==============================] - 1s 95ms/step - loss: 1.7189 - accuracy: 0.6736 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/5
7/7 [==============================] - 0s 45ms/step - loss: 0.8126 - accuracy: 0.6321 - val_loss: 0.7444 - val_accuracy: 0.7347
Epoch 3/5
7/7 [==============================] - 0s 49ms/step - loss: 1.2775 - accuracy: 0.7358 - val_loss: 1.1860 - val_accuracy: 0.7551
Epoch 4/5
7/7 [==============================] - 0s 49ms/step - loss: 0.7634 - accuracy: 0.7047 - val_loss: 0.8533 - val_accuracy: 0.5102
Epoch 5/5
7/7 [==============================] - 0s 48ms/step - loss: 0.6421 - accuracy: 0.7202 - val_loss: 0.7496 - val_accuracy: 0.7551

<tensorflow.python.keras.callbacks.History at 0x7fc68ac81c50>
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)
2/2 [==============================] - 0s 26ms/step - loss: 0.8340 - accuracy: 0.6721
Accuracy 0.6721311

关键点:通常使用更大更复杂的数据集进行深度学习,您将看到最佳结果。使用像这样的小数据集时,我们建议使用决策树或随机森林作为强有力的基准。本教程的目的不是训练一个准确的模型,而是演示处理结构化数据的机制,这样,在将来使用自己的数据集时,您有可以使用的代码作为起点。

下一步

了解有关分类结构化数据的更多信息的最佳方法是亲自尝试。我们建议寻找另一个可以使用的数据集,并使用和上面相似的代码,训练一个模型,对其分类。要提高准确率,请仔细考虑模型中包含哪些特征,以及如何表示这些特征。