
Load a pandas.DataFrame


This tutorial provides an example of how to load pandas dataframes into a tf.data.Dataset .

This tutorial uses a small dataset provided by the Cleveland Clinic Foundation for Heart Disease. The CSV contains a few hundred rows. Each row describes a patient, and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which in this dataset is a binary classification task.

Read data using pandas

import pandas as pd
import tensorflow as tf

Download the csv file containing the heart dataset.

csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')
Downloading data from https://storage.googleapis.com/applied-dl/heart.csv
16384/13273 [=====================================] - 0s 0us/step

Read the csv file using pandas.

df = pd.read_csv(csv_file)
df.head()
df.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object

Convert the thal column, which is an object in the dataframe, to a discrete numerical value.

df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
df.head()
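
For reference, pd.Categorical assigns an integer code to each distinct label. A minimal sketch of how to inspect that mapping (re-reading the file here is purely for illustration, since df['thal'] has already been overwritten, and the exact labels depend on the CSV contents):

thal_categories = pd.Categorical(pd.read_csv(csv_file)['thal'])
# Map each integer code back to its original label.
print(dict(enumerate(thal_categories.categories)))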

Load data using tf.data.Dataset

Use tf.data.Dataset.from_tensor_slices to read the values from the pandas dataframe.

One of the advantages of using a tf.data.Dataset is that it allows you to write simple, highly efficient data pipelines. Read the loading data guide to find out more.

target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))
for feat, targ in dataset.take(5):
  print ('Features: {}, Target: {}'.format(feat, targ))
Features: [ 63.    1.    1.  145.  233.    1.    2.  150.    0.    2.3   3.    0.
   2. ], Target: 0
Features: [ 67.    1.    4.  160.  286.    0.    2.  108.    1.    1.5   2.    3.
   3. ], Target: 1
Features: [ 67.    1.    4.  120.  229.    0.    2.  129.    1.    2.6   2.    2.
   4. ], Target: 0
Features: [ 37.    1.    3.  130.  250.    0.    0.  187.    0.    3.5   3.    0.
   3. ], Target: 0
Features: [ 41.    0.    2.  130.  204.    0.    2.  172.    0.    1.4   1.    0.
   3. ], Target: 0

Since pd.Series implements the __array__ protocol, it can be used transparently nearly anywhere you would use an np.array or a tf.Tensor .

tf.constant(df['thal'])
<tf.Tensor: shape=(303,), dtype=int8, numpy=
array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3,
       3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4,
       2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4,
       4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4,
       3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4,
       3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4,
       3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
       4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2,
       4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3,
       3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2,
       4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4,
       3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3,
       3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4], dtype=int8)>
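
The same protocol lets NumPy consume the Series directly. A minimal check (the original does not import numpy, so the import below is added for illustration):

import numpy as np

# np.asarray consumes the Series via __array__, just like tf.constant.
print(np.asarray(df['thal'])[:5])  # e.g. [2 3 4 3 3]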

Shuffle and batch the dataset.

train_dataset = dataset.shuffle(len(df)).batch(1)
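
As a sketch of the "highly efficient pipelines" mentioned above, you could also batch more aggressively and overlap input preparation with training via prefetch. The batch size here is an illustrative choice, and the rest of the tutorial continues with the batch-size-1 dataset:

tuned_dataset = (dataset
                 .shuffle(len(df))
                 .batch(32)
                 .prefetch(tf.data.experimental.AUTOTUNE))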

Create and train a model

def get_compiled_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
  ])

  model.compile(optimizer='adam',
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                metrics=['accuracy'])
  return model
model = get_compiled_model()
model.fit(train_dataset, epochs=15)
Epoch 1/15
WARNING:tensorflow:Layer dense is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because its dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

303/303 [==============================] - 1s 2ms/step - loss: 2.4922 - accuracy: 0.6007
Epoch 2/15
303/303 [==============================] - 1s 2ms/step - loss: 1.4119 - accuracy: 0.6634
Epoch 3/15
303/303 [==============================] - 1s 2ms/step - loss: 0.8979 - accuracy: 0.7228
Epoch 4/15
303/303 [==============================] - 1s 2ms/step - loss: 0.9051 - accuracy: 0.7030
Epoch 5/15
303/303 [==============================] - 1s 2ms/step - loss: 0.8189 - accuracy: 0.7657
Epoch 6/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6901 - accuracy: 0.7690
Epoch 7/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6077 - accuracy: 0.7921
Epoch 8/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6459 - accuracy: 0.7690
Epoch 9/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6458 - accuracy: 0.7690
Epoch 10/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6309 - accuracy: 0.7591
Epoch 11/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5737 - accuracy: 0.7921
Epoch 12/15
303/303 [==============================] - 1s 2ms/step - loss: 0.6617 - accuracy: 0.7789
Epoch 13/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5025 - accuracy: 0.8053
Epoch 14/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5386 - accuracy: 0.7822
Epoch 15/15
303/303 [==============================] - 1s 2ms/step - loss: 0.5999 - accuracy: 0.7690

<tensorflow.python.keras.callbacks.History at 0x7f770ea8dfd0>
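
As a quick sanity check after training (not part of the original tutorial), you can evaluate the model on the same dataset:

loss, accuracy = model.evaluate(train_dataset)
print('Accuracy on the training data: {:.2%}'.format(accuracy))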

Alternative to feature columns

Passing a dictionary as input to a model is as easy as creating a matching dictionary of tf.keras.layers.Input layers, applying any pre-processing, and stacking them up using the functional API . You can use this as an alternative to feature columns .

inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1)(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                   metrics=['accuracy'])

The easiest way to preserve the column structure of a pd.DataFrame when used with tf.data is to convert the pd.DataFrame to a dict, and slice that dictionary.

dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
  print (dict_slice)
({'age': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57],
      dtype=int32)>, 'sex': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>, 'cp': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>, 'trestbps': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130,
       120, 172, 150], dtype=int32)>, 'chol': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256,
       263, 199, 168], dtype=int32)>, 'fbs': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>, 'restecg': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0], dtype=int32)>, 'thalach': <tf.Tensor: shape=(16,), dtype=int32, numpy=
array([150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142,
       173, 162, 174], dtype=int32)>, 'exang': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'oldpeak': <tf.Tensor: shape=(16,), dtype=float32, numpy=
array([2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6,
       0. , 0.5, 1.6], dtype=float32)>, 'slope': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1], dtype=int32)>, 'ca': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=int32)>, 'thal': <tf.Tensor: shape=(16,), dtype=int32, numpy=array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3], dtype=int32)>}, <tf.Tensor: shape=(16,), dtype=int64, numpy=array([0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0])>)

model_func.fit(dict_slices, epochs=15)
Epoch 1/15
19/19 [==============================] - 0s 3ms/step - loss: 43.4762 - accuracy: 0.7261
Epoch 2/15
19/19 [==============================] - 0s 3ms/step - loss: 35.6609 - accuracy: 0.7261
Epoch 3/15
19/19 [==============================] - 0s 3ms/step - loss: 28.0946 - accuracy: 0.7261
Epoch 4/15
19/19 [==============================] - 0s 3ms/step - loss: 20.6991 - accuracy: 0.7195
Epoch 5/15
19/19 [==============================] - 0s 3ms/step - loss: 14.4886 - accuracy: 0.6700
Epoch 6/15
19/19 [==============================] - 0s 3ms/step - loss: 11.5720 - accuracy: 0.5479
Epoch 7/15
19/19 [==============================] - 0s 3ms/step - loss: 10.9353 - accuracy: 0.5083
Epoch 8/15
19/19 [==============================] - 0s 3ms/step - loss: 10.3756 - accuracy: 0.5413
Epoch 9/15
19/19 [==============================] - 0s 3ms/step - loss: 9.8816 - accuracy: 0.5347
Epoch 10/15
19/19 [==============================] - 0s 3ms/step - loss: 9.4354 - accuracy: 0.5314
Epoch 11/15
19/19 [==============================] - 0s 3ms/step - loss: 9.0222 - accuracy: 0.5347
Epoch 12/15
19/19 [==============================] - 0s 2ms/step - loss: 8.6444 - accuracy: 0.5347
Epoch 13/15
19/19 [==============================] - 0s 3ms/step - loss: 8.2889 - accuracy: 0.5347
Epoch 14/15
19/19 [==============================] - 0s 3ms/step - loss: 7.9492 - accuracy: 0.5314
Epoch 15/15
19/19 [==============================] - 0s 3ms/step - loss: 7.6179 - accuracy: 0.5347

<tensorflow.python.keras.callbacks.History at 0x7f770ea8a518>
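
To get predictions out of the dictionary-input model, you can feed it a batch of features directly. A minimal sketch (not in the original tutorial; tf.sigmoid converts the raw logits to probabilities, since the model was compiled with from_logits=True):

for features, labels in dict_slices.take(1):
  # The functional model accepts the same dict structure it was trained on.
  probabilities = tf.sigmoid(model_func(features))
  print(probabilities[:5])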