Load a pandas.DataFrame


This tutorial provides an example of how to load a pandas DataFrame into a tf.data.Dataset.

This tutorial uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The CSV contains a few hundred rows. Each row describes a patient, and each column describes an attribute of that patient. We will use this information to predict whether a patient has heart disease, which in this dataset is a binary classification task.

Read data using pandas

from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import pandas as pd
import tensorflow as tf

Download the CSV file containing the patients' heart data.

csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')
Downloading data from https://storage.googleapis.com/applied-dl/heart.csv
16384/13273 [=====================================] - 0s 0us/step

Read the CSV file using pandas.

df = pd.read_csv(csv_file)
df.head()
df.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object

Convert the thal column, which is stored as an object in the dataframe, to a discrete numeric value.

df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
df.head()
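
To check which integer code each original string was assigned, you can inspect the Categorical before taking the codes. A minimal sketch (re-reading the raw CSV, since df has already been converted):

raw = pd.read_csv(csv_file)                # unconverted copy of the data
thal_cat = pd.Categorical(raw['thal'])
# Codes are assigned in the order of `categories`, so position == code:
print(dict(zip(thal_cat.categories, range(len(thal_cat.categories)))))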

Load data using tf.data.Dataset

Use tf.data.Dataset.from_tensor_slices to read the values from the pandas dataframe.

One of the advantages of using a tf.data.Dataset is that it allows you to write simple, highly efficient data pipelines. Read the loading data guide to find out more.

target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
for feat, targ in dataset.take(5):
  print ('Features: {}, Target: {}'.format(feat, targ))
Features: [ 63.    1.    1.  145.  233.    1.    2.  150.    0.    2.3   3.    0.
   2. ], Target: 0
Features: [ 67.    1.    4.  160.  286.    0.    2.  108.    1.    1.5   2.    3.
   3. ], Target: 1
Features: [ 67.    1.    4.  120.  229.    0.    2.  129.    1.    2.6   2.    2.
   4. ], Target: 0
Features: [ 37.    1.    3.  130.  250.    0.    0.  187.    0.    3.5   3.    0.
   3. ], Target: 0
Features: [ 41.    0.    2.  130.  204.    0.    2.  172.    0.    1.4   1.    0.
   3. ], Target: 0
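
Like any tf.data.Dataset, this dataset can be extended with further lazily evaluated transformations. As an illustration (an extra step, not required by this tutorial), a map that casts the features to float32:

# Each transformation returns a new dataset; nothing runs until iteration:
dataset_f32 = dataset.map(
    lambda feats, targ: (tf.cast(feats, tf.float32), targ))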

Because pd.Series implements the __array__ protocol, it can be used almost anywhere you would use an np.array or a tf.Tensor.

tf.constant(df['thal'])
<tf.Tensor: shape=(303,), dtype=int8, numpy=
array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3,
       3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4,
       2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4,
       4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4,
       3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4,
       3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4,
       3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
       4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2,
       4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3,
       3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2,
       4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4,
       3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3,
       3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4], dtype=int8)>
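
The same applies on the NumPy side, for example:

import numpy as np

# np.array accepts the Series directly thanks to the __array__ protocol:
np.array(df['thal'])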

Shuffle and batch the dataset.

train_dataset = dataset.shuffle(len(df)).batch(1)
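
Optionally, you can append a prefetch step so that data preparation overlaps with model execution; this is a common tf.data idiom, not part of the original pipeline:

# AUTOTUNE lets tf.data tune the prefetch buffer size dynamically:
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)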

Create and train a model

def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])

  model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  return model
model = get_compiled_model()
model.fit(train_dataset, epochs=15)
Epoch 1/15
WARNING:tensorflow:Layer dense is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because its dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

303/303 [==============================] - 1s 2ms/step - loss: 3.5384 - accuracy: 0.5314
Epoch 2/15
303/303 [==============================] - 1s 2ms/step - loss: 0.7229 - accuracy: 0.7096
Epoch 3/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6647 - accuracy: 0.7162
Epoch 4/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6092 - accuracy: 0.7030
Epoch 5/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5701 - accuracy: 0.7063
Epoch 6/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5498 - accuracy: 0.7228
Epoch 7/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5415 - accuracy: 0.7492
Epoch 8/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5292 - accuracy: 0.7195
Epoch 9/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5309 - accuracy: 0.7195
Epoch 10/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5121 - accuracy: 0.7426
Epoch 11/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5126 - accuracy: 0.7327
Epoch 12/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5009 - accuracy: 0.7327
Epoch 13/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4915 - accuracy: 0.7690
Epoch 14/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4867 - accuracy: 0.7558
Epoch 15/15
303/303 [==============================] - 1s 2ms/step - loss: 0.4620 - accuracy: 0.7723

<tensorflow.python.keras.callbacks.History at 0x7fc8bd00da20>
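
The trained model can then be evaluated against the same dataset object, for example:

# Returns the final loss and accuracy over the dataset:
model.evaluate(train_dataset)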

An alternative to feature columns

Passing a dictionary as the input to a model is as easy as creating a matching dictionary of tf.keras.layers.Input layers, applying any pre-processing, and stacking the layers up using the functional API. You can use this as an alternative to feature columns.

inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])
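
Because each column arrives as its own Input tensor, per-feature pre-processing can be applied before the stack. A minimal sketch, with a made-up rescaling constant purely for illustration:

# Hypothetical per-feature step: rescale 'age' before stacking it with the rest.
age_scaled = tf.cast(inputs['age'], tf.float32) / 100.0   # illustrative constant
others = [tf.cast(v, tf.float32) for k, v in inputs.items() if k != 'age']
x_pre = tf.stack([age_scaled] + others, axis=-1)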

The easiest way to preserve the column structure of a pd.DataFrame when used with tf.data is to convert the pd.DataFrame to a dict, and then slice that dictionary.
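
To see what this conversion produces, you can inspect the dict directly; df.to_dict('list') maps each column name to a plain Python list of that column's values:

cols = df.to_dict('list')
print(list(cols)[:3])     # the first few column names
print(cols['age'][:5])    # the first five values of the 'age' column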

dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
  print (dict_slice)
({'age': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57],
      dtype=int32)>, 'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>, 'cp': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>, 'trestbps': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130,
       120, 172, 150], dtype=int32)>, 'chol': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256,
       263, 199, 168], dtype=int32)>, 'fbs': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>, 'restecg': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0], dtype=int32)>, 'thalach': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142,
       173, 162, 174], dtype=int32)>, 'exang': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'oldpeak': <tf.Tensor: shape=(16,), dtype=float32, numpy=
array([2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6,
       0. , 0.5, 1.6], dtype=float32)>, 'slope': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1], dtype=int32)>, 'ca': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'thal': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3], dtype=int32)>}, <tf.Tensor: shape=(16,), dtype=int64, numpy=array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])>)

model_func.fit(dict_slices, epochs=15)
Epoch 1/15
19/19 [==============================] - 0s 3ms/step - loss: 79.9160 - accuracy: 0.2739
Epoch 2/15
19/19 [==============================] - 0s 3ms/step - loss: 68.3307 - accuracy: 0.2739
Epoch 3/15
19/19 [==============================] - 0s 2ms/step - loss: 57.2151 - accuracy: 0.2739
Epoch 4/15
19/19 [==============================] - 0s 2ms/step - loss: 46.2963 - accuracy: 0.2739
Epoch 5/15
19/19 [==============================] - 0s 2ms/step - loss: 35.4408 - accuracy: 0.2739
Epoch 6/15
19/19 [==============================] - 0s 2ms/step - loss: 24.5233 - accuracy: 0.2739
Epoch 7/15
19/19 [==============================] - 0s 2ms/step - loss: 13.4328 - accuracy: 0.2739
Epoch 8/15
19/19 [==============================] - 0s 2ms/step - loss: 3.4026 - accuracy: 0.4092
Epoch 9/15
19/19 [==============================] - 0s 3ms/step - loss: 1.3497 - accuracy: 0.7162
Epoch 10/15
19/19 [==============================] - 0s 2ms/step - loss: 1.1025 - accuracy: 0.6634
Epoch 11/15
19/19 [==============================] - 0s 2ms/step - loss: 1.1010 - accuracy: 0.6403
Epoch 12/15
19/19 [==============================] - 0s 2ms/step - loss: 1.0536 - accuracy: 0.6502
Epoch 13/15
19/19 [==============================] - 0s 2ms/step - loss: 1.0217 - accuracy: 0.6469
Epoch 14/15
19/19 [==============================] - 0s 2ms/step - loss: 0.9914 - accuracy: 0.6469
Epoch 15/15
19/19 [==============================] - 0s 2ms/step - loss: 0.9558 - accuracy: 0.6502

<tensorflow.python.keras.callbacks.History at 0x7fc8bd000320>