Load CSV data


This tutorial provides an example of how to load CSV data from a file into a tf.data.Dataset.

The data used in this tutorial is taken from the Titanic passenger list. The model will predict a passenger's likelihood of survival based on characteristics such as age, sex, ticket class, and whether the person was traveling alone.

Setup

import functools

import numpy as np
import tensorflow as tf
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
32768/30874 [===============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/eval.csv
16384/13049 [=====================================] - 0s 0us/step

# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

Load data

To start, let's look at the top of the CSV file to see how it is formatted.

!head {train_file_path}
survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
1,female,38.0,1,0,71.2833,First,C,Cherbourg,n
1,female,26.0,0,0,7.925,Third,unknown,Southampton,y
1,female,35.0,1,0,53.1,First,C,Southampton,n
0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y
0,male,2.0,3,1,21.075,Third,unknown,Southampton,n
1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n
1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n
1,female,4.0,1,1,16.7,Third,G,Southampton,n

You can load this using pandas and pass NumPy arrays to TensorFlow. If you need to scale up to large sets of files, or need a loader that integrates with TensorFlow and tf.data, use the tf.data.experimental.make_csv_dataset function:
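For the pandas route, here is a minimal sketch. It uses a small in-memory frame instead of the downloaded Titanic file so the shapes are easy to check; with the real file you would call pd.read_csv(train_file_path) instead:

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the Titanic CSV.
df = pd.DataFrame({
    'survived': [0, 1, 1],
    'age': [22.0, 38.0, 26.0],
    'fare': [7.25, 71.2833, 7.925],
})

# Separate the label column from the features.
labels = df.pop('survived').to_numpy()    # shape (3,)
features = df.to_numpy(dtype=np.float32)  # shape (3, 2)

# These arrays can be passed directly to model.fit(features, labels),
# or wrapped with tf.data.Dataset.from_tensor_slices((features, labels)).
```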

The only column you need to identify explicitly is the one with the values the model is intended to predict.

LABEL_COLUMN = 'survived'
LABELS = [0, 1]

Now read the CSV data from the file and create a dataset.

(For the full documentation, see tf.data.experimental.make_csv_dataset.)

def get_dataset(file_path, **kwargs):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=5, # Artificially small to make examples easier to show.
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True, 
      **kwargs)
  return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)
def show_batch(dataset):
  for batch, label in dataset.take(1):
    for key, value in batch.items():
      print("{:20s}: {}".format(key,value.numpy()))

Each item in the dataset is a batch, represented as a tuple of (many examples, many labels). The data from those examples is arranged in column-based tensors (rather than row-based tensors), each with as many elements as the batch size (5 in this case).
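The column-based layout can be illustrated with plain NumPy, using a toy batch rather than the actual dataset:

```python
import numpy as np

# A batch of 5 examples stored column-wise: one array per feature.
batch = {
    'age':  np.array([18., 23., 30., 21., 60.]),
    'fare': np.array([23., 263., 24.15, 8.05, 39.]),
}

# Each column has batch_size elements...
assert all(v.shape == (5,) for v in batch.values())

# ...and stacking the columns on the last axis yields the row-based
# (batch_size, num_features) layout a dense layer expects.
rows = np.stack(list(batch.values()), axis=-1)
print(rows.shape)  # (5, 2)
```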

It might help to see this for yourself.

show_batch(raw_train_data)
sex                 : [b'female' b'female' b'female' b'male' b'male']
age                 : [18. 23. 30. 21. 60.]
n_siblings_spouses  : [0 3 1 0 1]
parch               : [1 2 1 0 1]
fare                : [ 23.   263.    24.15   8.05  39.  ]
class               : [b'Second' b'First' b'Third' b'Third' b'Second']
deck                : [b'unknown' b'C' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton'
 b'Southampton']
alone               : [b'n' b'n' b'n' b'y' b'n']

As you can see, the columns in the CSV file are named. The dataset constructor will pick these names up automatically. If the file you are working with does not contain the column names in the first line, pass them as a list of strings to the column_names argument of the make_csv_dataset function.

CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

temp_dataset = get_dataset(train_file_path, column_names=CSV_COLUMNS)

show_batch(temp_dataset)
sex                 : [b'female' b'male' b'male' b'female' b'male']
age                 : [28. 34. 28. 28. 37.]
n_siblings_spouses  : [1 1 0 0 2]
parch               : [0 1 0 1 0]
fare                : [133.65   14.4    35.5    55.      7.925]
class               : [b'First' b'Third' b'First' b'First' b'Third']
deck                : [b'unknown' b'unknown' b'C' b'E' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton'
 b'Southampton']
alone               : [b'n' b'n' b'y' b'n' b'n']

This example is going to use all the available columns. If you need to omit some columns from the dataset, create a list of just the columns you plan to use and pass it to the (optional) select_columns argument of the constructor.

SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'class', 'deck', 'alone']

temp_dataset = get_dataset(train_file_path, select_columns=SELECT_COLUMNS)

show_batch(temp_dataset)
age                 : [42. 28. 19. 43. 50.]
n_siblings_spouses  : [1 0 0 0 2]
class               : [b'Second' b'Third' b'Third' b'Third' b'First']
deck                : [b'unknown' b'unknown' b'F' b'unknown' b'unknown']
alone               : [b'n' b'y' b'y' b'y' b'n']

Data pre-processing

A CSV file can contain a variety of data types. Typically you want to convert those mixed types into a fixed-length vector before feeding the data into your model.

TensorFlow has a built-in system for describing common input conversions: tf.feature_column; see this tutorial for details.

You can preprocess your data using any tool you like (such as nltk or sklearn) and then pass the processed output to TensorFlow.

The primary advantage of doing the preprocessing inside your model is that when you export the model, it includes the preprocessing. That way you can pass the raw data straight to your model.

Continuous data

If your data is already in an appropriate numeric format, you can pack it into a vector before passing it to the model:

SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']
DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
temp_dataset = get_dataset(train_file_path, 
                           select_columns=SELECT_COLUMNS,
                           column_defaults = DEFAULTS)

show_batch(temp_dataset)
age                 : [28. 28. 18. 80. 49.]
n_siblings_spouses  : [0. 1. 0. 0. 1.]
parch               : [0. 0. 0. 0. 0.]
fare                : [  7.75  146.521  11.5    30.     89.104]

example_batch, labels_batch = next(iter(temp_dataset)) 

Here is a simple function that will pack together all the columns:

def pack(features, label):
  return tf.stack(list(features.values()), axis=-1), label

Apply this to each element of the dataset:

packed_dataset = temp_dataset.map(pack)

for features, labels in packed_dataset.take(1):
  print(features.numpy())
  print()
  print(labels.numpy())
[[20.     0.     0.     8.05 ]
 [33.     0.     0.     8.654]
 [26.     0.     0.     7.887]
 [45.     0.     0.     7.75 ]
 [24.     0.     0.    49.504]]

[0 0 0 0 1]

If you have mixed data types, you may want to separate out the simple numeric fields. The tf.feature_column API can handle them, but this incurs some overhead and should be avoided unless really necessary. Switch back to the mixed dataset:

show_batch(raw_train_data)
sex                 : [b'female' b'male' b'female' b'male' b'female']
age                 : [44.  4. 28. 18. 25.]
n_siblings_spouses  : [0 1 1 1 1]
parch               : [1 1 0 1 2]
fare                : [ 57.979  11.133  82.171   7.854 151.55 ]
class               : [b'First' b'Third' b'First' b'Third' b'First']
deck                : [b'B' b'unknown' b'unknown' b'unknown' b'C']
embark_town         : [b'Cherbourg' b'Southampton' b'Cherbourg' b'Southampton' b'Southampton']
alone               : [b'n' b'n' b'n' b'n' b'n']

example_batch, labels_batch = next(iter(temp_dataset)) 

So define a more general preprocessor that selects a list of numeric features and packs them into a single column:

class PackNumericFeatures(object):
  def __init__(self, names):
    self.names = names

  def __call__(self, features, labels):
    numeric_features = [features.pop(name) for name in self.names]
    numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
    numeric_features = tf.stack(numeric_features, axis=-1)
    features['numeric'] = numeric_features

    return features, labels
NUMERIC_FEATURES = ['age','n_siblings_spouses','parch', 'fare']

packed_train_data = raw_train_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))

packed_test_data = raw_test_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))
show_batch(packed_train_data)
sex                 : [b'female' b'male' b'male' b'female' b'male']
class               : [b'First' b'First' b'Third' b'First' b'Third']
deck                : [b'C' b'E' b'unknown' b'B' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton'
 b'Southampton']
alone               : [b'y' b'y' b'y' b'y' b'n']
numeric             : [[ 35.      0.      0.    135.633]
 [ 47.      0.      0.     38.5  ]
 [ 27.      0.      0.      8.663]
 [ 19.      0.      0.     30.   ]
 [  2.      3.      1.     21.075]]

example_batch, labels_batch = next(iter(packed_train_data)) 

Data normalization

Continuous data should always be normalized.

import pandas as pd
desc = pd.read_csv(train_file_path)[NUMERIC_FEATURES].describe()
desc
MEAN = np.array(desc.T['mean'])
STD = np.array(desc.T['std'])
def normalize_numeric_data(data, mean, std):
  # Center the data
  return (data-mean)/std
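As a quick check of what normalize_numeric_data does, here is the same arithmetic on a toy batch using NumPy broadcasting; the mean/std values below are made up for illustration, not the actual Titanic statistics:

```python
import numpy as np

def normalize_numeric_data(data, mean, std):
  # Center and scale each column: the 1-D mean/std arrays broadcast
  # across the rows of the 2-D data array.
  return (data - mean) / std

data = np.array([[30.0, 10.0],
                 [50.0, 30.0]])
mean = np.array([40.0, 20.0])  # hypothetical per-column means
std = np.array([10.0, 10.0])   # hypothetical per-column std-devs

print(normalize_numeric_data(data, mean, std))
# [[-1. -1.]
#  [ 1.  1.]]
```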

Now create a numeric column. The tf.feature_column.numeric_column API accepts a normalizer_fn argument, which will be run on each batch.

Bind the MEAN and STD into the normalizer fn using functools.partial.

normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)

numeric_column = tf.feature_column.numeric_column('numeric', normalizer_fn=normalizer, shape=[len(NUMERIC_FEATURES)])
numeric_columns = [numeric_column]

# See what you just created.
numeric_column
NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalize_numeric_data at 0x7fe504212b70>, mean=array([29.631,  0.545,  0.38 , 34.385]), std=array([12.512,  1.151,  0.793, 54.598])))

When you train the model, include this feature column to select and center this block of numeric data:

example_batch['numeric']
<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[ 16.   ,   0.   ,   0.   ,  10.5  ],
       [ 64.   ,   1.   ,   4.   , 263.   ],
       [ 30.   ,   1.   ,   0.   ,  16.1  ],
       [ 54.   ,   1.   ,   0.   ,  59.4  ],
       [ 14.   ,   1.   ,   0.   ,  11.242]], dtype=float32)>
numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()
array([[-1.089, -0.474, -0.479, -0.437],
       [ 2.747,  0.395,  4.565,  4.187],
       [ 0.029,  0.395, -0.479, -0.335],
       [ 1.948,  0.395, -0.479,  0.458],
       [-1.249,  0.395, -0.479, -0.424]], dtype=float32)

The mean-based normalization used here requires knowing the mean of each column ahead of time.

Categorical data

Some of the columns in the CSV data are categorical columns. That is, the content should be one of a limited set of options.

Use the tf.feature_column API to create a collection with a tf.feature_column.indicator_column for each categorical column.

CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southhampton', 'Queenstown'],
    'alone' : ['y', 'n']
}
categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))
# See what you just created.
categorical_columns
[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southhampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]
categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
print(categorical_layer(example_batch).numpy()[0])
[1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
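The one-hot pattern in that output can be sketched without feature columns, using a hypothetical helper and toy vocabularies:

```python
import numpy as np

def one_hot(value, vocabulary):
  # 1.0 at the vocabulary index of `value`, 0.0 elsewhere;
  # all zeros for out-of-vocabulary values, matching default_value=-1
  # with num_oov_buckets=0 in the indicator columns above.
  vec = np.zeros(len(vocabulary), dtype=np.float32)
  if value in vocabulary:
    vec[vocabulary.index(value)] = 1.0
  return vec

print(one_hot('female', ['male', 'female']))          # [0. 1.]
print(one_hot('Third', ['First', 'Second', 'Third'])) # [0. 0. 1.]
```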

This will be part of the data-processing input when you build the model.

Combined preprocessing layer

Add the two feature-column collections and pass them to tf.keras.layers.DenseFeatures to create an input layer that will extract and preprocess both input types:

preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numeric_columns)
print(preprocessing_layer(example_batch).numpy()[0])
[ 1.     0.     0.     1.     0.     0.     0.     0.     0.     0.
  0.     0.     0.     0.     0.     0.     0.     0.    -1.089 -0.474
 -0.479 -0.437  1.     0.   ]

Build the model

Build a tf.keras.Sequential, starting with the preprocessing_layer.

model = tf.keras.Sequential([
  preprocessing_layer,
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

Train, evaluate, and predict

Now the model can be instantiated and trained.

train_data = packed_train_data.shuffle(500)
test_data = packed_test_data
model.fit(train_data, epochs=20)
Epoch 1/20
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'collections.OrderedDict'> input: OrderedDict([('sex', <tf.Tensor 'ExpandDims_4:0' shape=(None, 1) dtype=string>), ('class', <tf.Tensor 'ExpandDims_1:0' shape=(None, 1) dtype=string>), ('deck', <tf.Tensor 'ExpandDims_2:0' shape=(None, 1) dtype=string>), ('embark_town', <tf.Tensor 'ExpandDims_3:0' shape=(None, 1) dtype=string>), ('alone', <tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=string>), ('numeric', <tf.Tensor 'IteratorGetNext:4' shape=(None, 4) dtype=float32>)])
Consider rewriting this model with the Functional API.
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'collections.OrderedDict'> input: OrderedDict([('sex', <tf.Tensor 'ExpandDims_4:0' shape=(None, 1) dtype=string>), ('class', <tf.Tensor 'ExpandDims_1:0' shape=(None, 1) dtype=string>), ('deck', <tf.Tensor 'ExpandDims_2:0' shape=(None, 1) dtype=string>), ('embark_town', <tf.Tensor 'ExpandDims_3:0' shape=(None, 1) dtype=string>), ('alone', <tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=string>), ('numeric', <tf.Tensor 'IteratorGetNext:4' shape=(None, 4) dtype=float32>)])
Consider rewriting this model with the Functional API.
126/126 [==============================] - 0s 3ms/step - loss: 0.4878 - accuracy: 0.7767
Epoch 2/20
126/126 [==============================] - 0s 3ms/step - loss: 0.4183 - accuracy: 0.8230
Epoch 3/20
126/126 [==============================] - 0s 3ms/step - loss: 0.4003 - accuracy: 0.8246
Epoch 4/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3924 - accuracy: 0.8341
Epoch 5/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3725 - accuracy: 0.8357
Epoch 6/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3716 - accuracy: 0.8469
Epoch 7/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3604 - accuracy: 0.8453
Epoch 8/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3523 - accuracy: 0.8389
Epoch 9/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3550 - accuracy: 0.8517
Epoch 10/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3441 - accuracy: 0.8469
Epoch 11/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3440 - accuracy: 0.8453
Epoch 12/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3397 - accuracy: 0.8405
Epoch 13/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3339 - accuracy: 0.8612
Epoch 14/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3300 - accuracy: 0.8596
Epoch 15/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3273 - accuracy: 0.8596
Epoch 16/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3204 - accuracy: 0.8612
Epoch 17/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3108 - accuracy: 0.8628
Epoch 18/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3086 - accuracy: 0.8596
Epoch 19/20
126/126 [==============================] - 0s 3ms/step - loss: 0.3083 - accuracy: 0.8740
Epoch 20/20
126/126 [==============================] - 0s 3ms/step - loss: 0.2998 - accuracy: 0.8756

<tensorflow.python.keras.callbacks.History at 0x7fe504177400>

Once the model is trained, you can check its accuracy on the test_data set.

test_loss, test_accuracy = model.evaluate(test_data)

print('\n\nTest Loss {}, Test Accuracy {}'.format(test_loss, test_accuracy))
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'collections.OrderedDict'> input: OrderedDict([('sex', <tf.Tensor 'ExpandDims_4:0' shape=(None, 1) dtype=string>), ('class', <tf.Tensor 'ExpandDims_1:0' shape=(None, 1) dtype=string>), ('deck', <tf.Tensor 'ExpandDims_2:0' shape=(None, 1) dtype=string>), ('embark_town', <tf.Tensor 'ExpandDims_3:0' shape=(None, 1) dtype=string>), ('alone', <tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=string>), ('numeric', <tf.Tensor 'IteratorGetNext:4' shape=(None, 4) dtype=float32>)])
Consider rewriting this model with the Functional API.
53/53 [==============================] - 0s 3ms/step - loss: 0.4902 - accuracy: 0.8068


Test Loss 0.4901774525642395, Test Accuracy 0.8068181872367859

Use tf.keras.Model.predict to infer labels on a batch, or on a dataset of batches.

predictions = model.predict(test_data)

# Show some results
for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
  print("Predicted survival: {:.2%}".format(prediction[0]),
        " | Actual outcome: ",
        ("SURVIVED" if bool(survived) else "DIED"))
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'collections.OrderedDict'> input: OrderedDict([('sex', <tf.Tensor 'ExpandDims_4:0' shape=(None, 1) dtype=string>), ('class', <tf.Tensor 'ExpandDims_1:0' shape=(None, 1) dtype=string>), ('deck', <tf.Tensor 'ExpandDims_2:0' shape=(None, 1) dtype=string>), ('embark_town', <tf.Tensor 'ExpandDims_3:0' shape=(None, 1) dtype=string>), ('alone', <tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=string>), ('numeric', <tf.Tensor 'IteratorGetNext:4' shape=(None, 4) dtype=float32>)])
Consider rewriting this model with the Functional API.
Predicted survival: 99.94%  | Actual outcome:  SURVIVED
Predicted survival: 99.92%  | Actual outcome:  DIED
Predicted survival: 38.18%  | Actual outcome:  SURVIVED
Predicted survival: 46.39%  | Actual outcome:  DIED
Predicted survival: 77.95%  | Actual outcome:  SURVIVED