Load a pandas.DataFrame

This tutorial provides an example of how to load a pandas DataFrame into a tf.data.Dataset.

This tutorial uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The CSV contains several hundred rows. Each row describes a patient, and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which in this dataset is a binary classification task.

Read data using pandas

# Colab only: run the IPython magic `%tensorflow_version 2.x` first if the runtime defaults to TensorFlow 1.x.
import pandas as pd
import tensorflow as tf

Download the CSV file containing the heart dataset.

csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')
Downloading data from https://storage.googleapis.com/applied-dl/heart.csv
16384/13273 [=====================================] - 0s 0us/step

Read the CSV file using pandas.

df = pd.read_csv(csv_file)
df.head()
df.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object

Convert the thal column, which has the object dtype in the dataframe, to a discrete numerical value.

df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
df.head()
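
To make the encoding concrete, here is a small sketch on a toy Series. The string values are made up for illustration and are not taken from heart.csv. pd.Categorical sorts the distinct values, and the codes give each value's position in that sorted list (a missing value would be encoded as -1).

```python
import pandas as pd

# Hypothetical toy values, chosen only to illustrate the encoding.
toy = pd.Series(['normal', 'fixed', 'reversible', 'normal'])

categorical = pd.Categorical(toy)
categories = list(categorical.categories)  # distinct values, sorted
codes = list(categorical.codes)            # integer position of each value

print(categories)
print(codes)
```

Because the codes follow the sorted order of the categories, they carry no meaning beyond identifying the category.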

Load data using tf.data.Dataset

Use tf.data.Dataset.from_tensor_slices to read the values from a pandas dataframe.

One of the advantages of using tf.data.Dataset is that it lets you write simple, highly efficient data pipelines. See the loading data guide for more information.

target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
for feat, targ in dataset.take(5):
  print('Features: {}, Target: {}'.format(feat, targ))
Features: [ 63.    1.    1.  145.  233.    1.    2.  150.    0.    2.3   3.    0.
   2. ], Target: 0
Features: [ 67.    1.    4.  160.  286.    0.    2.  108.    1.    1.5   2.    3.
   3. ], Target: 1
Features: [ 67.    1.    4.  120.  229.    0.    2.  129.    1.    2.6   2.    2.
   4. ], Target: 0
Features: [ 37.    1.    3.  130.  250.    0.    0.  187.    0.    3.5   3.    0.
   3. ], Target: 0
Features: [ 41.    0.    2.  130.  204.    0.    2.  172.    0.    1.4   1.    0.
   3. ], Target: 0
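
Conceptually, from_tensor_slices cuts each array along its first axis and pairs up the pieces, so element i of the dataset is (row i of the features, label i). The NumPy sketch below mimics that pairing. It is only an analogy, not how tf.data is implemented, and the small arrays are illustrative stand-ins for df.values and target.values.

```python
import numpy as np

# Illustrative feature matrix (3 rows, 2 columns) and labels.
features = np.array([[63.0, 1.0], [67.0, 1.0], [37.0, 0.0]])
labels = np.array([0, 1, 0])

# Slicing along the first axis pairs row i of the features with label i.
pairs = list(zip(features, labels))

for feat, targ in pairs:
    print('Features: {}, Target: {}'.format(feat, targ))
```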

Because a pd.Series implements the __array__ protocol, it can be used transparently nearly anywhere you would use a np.array or a tf.Tensor.

tf.constant(df['thal'])
<tf.Tensor: shape=(303,), dtype=int8, numpy=
array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3,
       3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4,
       2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4,
       4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4,
       3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4,
       3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4,
       3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
       4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2,
       4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3,
       3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2,
       4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4,
       3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3,
       3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4], dtype=int8)>
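
The same protocol can be observed with NumPy alone. In the sketch below, np.asarray converts a Series through __array__, and a small hypothetical class (Readings, invented for this example) shows that any object defining __array__ gets the same treatment.

```python
import numpy as np
import pandas as pd


class Readings:
    """Hypothetical container used only to demonstrate __array__."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this when it needs an ndarray built from the object.
        return np.array(self._values, dtype=dtype)


series = pd.Series([2, 3, 4], dtype='int8')

as_array = np.asarray(series)               # Series -> ndarray via __array__
custom = np.asarray(Readings([1.5, 2.5]))   # custom object -> ndarray

print(as_array.dtype, as_array)
print(custom)
```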

Shuffle and batch the dataset.

train_dataset = dataset.shuffle(len(df)).batch(1)
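
Calling shuffle(len(df)) uses a buffer as large as the dataset, which amounts to a full random permutation of the rows. batch(n) then groups consecutive elements, and the last batch may be smaller. The NumPy sketch below shows the same idea as an analogy. The seed, the 10 stand-in elements, and the batch size of 4 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed so the sketch is repeatable
rows = np.arange(10)            # stand-in for 10 dataset elements

# A buffer as large as the dataset amounts to a full permutation.
shuffled = rng.permutation(rows)

# Group consecutive elements; the final batch may be smaller.
batch_size = 4
batches = [shuffled[i:i + batch_size]
           for i in range(0, len(shuffled), batch_size)]

print([len(b) for b in batches])
```

The tutorial uses batch(1), so every row is its own training step, which is why the training log reports 303 steps per epoch.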

Create and train a model

def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
  ])

  model.compile(optimizer='adam',
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                metrics=['accuracy'])
  return model
model = get_compiled_model()
model.fit(train_dataset, epochs=15)
Epoch 1/15
WARNING:tensorflow:Layer dense is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because its dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

303/303 [==============================] - 1s 2ms/step - loss: 3.3850 - accuracy: 0.6964
Epoch 2/15
303/303 [==============================] - 1s 2ms/step - loss: 1.8797 - accuracy: 0.6931
Epoch 3/15
303/303 [==============================] - 1s 2ms/step - loss: 1.3348 - accuracy: 0.7063
Epoch 4/15
303/303 [==============================] - 1s 2ms/step - loss: 1.5040 - accuracy: 0.6997
Epoch 5/15
303/303 [==============================] - 1s 2ms/step - loss: 1.0072 - accuracy: 0.7393
Epoch 6/15
303/303 [==============================] - 1s 2ms/step - loss: 0.8372 - accuracy: 0.7822
Epoch 7/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7832 - accuracy: 0.7888
Epoch 8/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7457 - accuracy: 0.7921
Epoch 9/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6368 - accuracy: 0.7789
Epoch 10/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7353 - accuracy: 0.7756
Epoch 11/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6158 - accuracy: 0.8218
Epoch 12/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5253 - accuracy: 0.7954
Epoch 13/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7066 - accuracy: 0.7921
Epoch 14/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6731 - accuracy: 0.7921
Epoch 15/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7600 - accuracy: 0.7756

<tensorflow.python.keras.callbacks.History at 0x7f3f5f32c710>
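
The last layer, Dense(1), has no activation, so the model outputs raw scores (logits), and the loss is told so with from_logits=True. The NumPy sketch below shows what that amounts to mathematically: a sigmoid applied to the logit, followed by binary cross-entropy, computed here in a numerically stable form. It illustrates the formula only and is not Keras's actual implementation. The sample labels and logits are made up.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_from_logits(y_true, logits):
    # Stable form of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logits).
    return (np.maximum(logits, 0) - logits * y_true
            + np.log1p(np.exp(-np.abs(logits))))


y_true = np.array([0.0, 1.0, 1.0])
logits = np.array([-1.2, 0.3, 2.0])  # made-up raw model outputs

per_example = bce_from_logits(y_true, logits)
probabilities = sigmoid(logits)      # logits -> probabilities

# Direct (less stable) formula, for comparison.
naive = -(y_true * np.log(probabilities)
          + (1 - y_true) * np.log(1 - probabilities))

print(per_example.mean(), naive.mean())
```

By the same reasoning, the trained model's predictions are logits. Passing them through tf.sigmoid is one way to obtain probabilities.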

Alternative to feature columns

Passing a dictionary as input to a model is as easy as creating a matching dictionary of tf.keras.layers.Input layers, applying any preprocessing, and stacking them using the functional API. You can use this as an alternative to feature columns.

inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1)(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                   metrics=['accuracy'])
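
The stacking step is what turns the per-column inputs into a single feature matrix. Each input declared with shape=() carries one scalar per example, so a batch of it has shape (batch,), and stacking on the last axis yields (batch, number_of_columns). The NumPy sketch below shows the shape arithmetic with np.stack as a stand-in for tf.stack, using a small illustrative batch with two columns.

```python
import numpy as np

# Illustrative batch of 3 patients with two scalar features each.
batch = {
    'age': np.array([63.0, 67.0, 37.0]),
    'chol': np.array([233.0, 286.0, 250.0]),
}

# Each value has shape (3,); stacking on the last axis gives shape (3, 2).
x = np.stack(list(batch.values()), axis=-1)

print(x.shape)
print(x[0])
```

The column order in the stacked matrix follows the order of the dictionary's keys.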

The easiest way to preserve the column structure of a pd.DataFrame when used with tf.data is to convert the pd.DataFrame to a dict and slice that dictionary.

dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
  print(dict_slice)
({'age': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57],
      dtype=int32)>, 'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>, 'cp': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>, 'trestbps': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130,
       120, 172, 150], dtype=int32)>, 'chol': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256,
       263, 199, 168], dtype=int32)>, 'fbs': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>, 'restecg': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0], dtype=int32)>, 'thalach': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142,
       173, 162, 174], dtype=int32)>, 'exang': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'oldpeak': <tf.Tensor: shape=(16,), dtype=float32, numpy=
array([2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6,
       0. , 0.5, 1.6], dtype=float32)>, 'slope': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1], dtype=int32)>, 'ca': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'thal': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3], dtype=int32)>}, <tf.Tensor: shape=(16,), dtype=int64, numpy=array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])>)
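
The structure that to_dict('list') produces is easier to see on a tiny frame. The pandas-only sketch below uses a toy two-column DataFrame built for illustration. It shows the column-to-list mapping and what slicing that dictionary row by row means conceptually. tf.data performs the slicing itself, so the list comprehension here is only an analogy.

```python
import pandas as pd

# Toy frame for illustration; the tutorial applies the same call to df.
toy_df = pd.DataFrame({'age': [63, 67, 37], 'oldpeak': [2.3, 1.5, 3.5]})

# One key per column, each mapped to the list of that column's values.
columns = toy_df.to_dict('list')

# Slicing the dict row by row yields one small dict per example.
rows = [{key: values[i] for key, values in columns.items()}
        for i in range(len(toy_df))]

print(columns)
print(rows[0])
```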

model_func.fit(dict_slices, epochs=15)
Epoch 1/15
19/19 [==============================] - 0s 2ms/step - loss: 2.8664 - accuracy: 0.6799
Epoch 2/15
19/19 [==============================] - 0s 2ms/step - loss: 1.2796 - accuracy: 0.5842
Epoch 3/15
19/19 [==============================] - 0s 2ms/step - loss: 0.8998 - accuracy: 0.6766
Epoch 4/15
19/19 [==============================] - 0s 3ms/step - loss: 0.8758 - accuracy: 0.6931
Epoch 5/15
19/19 [==============================] - 0s 2ms/step - loss: 0.8052 - accuracy: 0.6964
Epoch 6/15
19/19 [==============================] - 0s 2ms/step - loss: 0.7569 - accuracy: 0.6898
Epoch 7/15
19/19 [==============================] - 0s 2ms/step - loss: 0.7212 - accuracy: 0.6931
Epoch 8/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6975 - accuracy: 0.7063
Epoch 9/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6805 - accuracy: 0.6997
Epoch 10/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6660 - accuracy: 0.7030
Epoch 11/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6535 - accuracy: 0.7096
Epoch 12/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6415 - accuracy: 0.7096
Epoch 13/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6296 - accuracy: 0.7096
Epoch 14/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6207 - accuracy: 0.7129
Epoch 15/15
19/19 [==============================] - 0s 2ms/step - loss: 0.6114 - accuracy: 0.7162

<tensorflow.python.keras.callbacks.History at 0x7f3f8d4789b0>