
tf.data: Build TensorFlow input pipelines


The tf.data API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training. The pipeline for a text model might involve extracting symbols from raw text data, converting them to embedding identifiers with a lookup table, and batching together sequences of different lengths. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations.

The tf.data API introduces a tf.data.Dataset abstraction that represents a sequence of elements, in which each element consists of one or more components. For example, in an image pipeline, an element might be a single training example, with a pair of tensor components representing the image and its label.

There are two distinct ways to create a dataset:

  • A data source constructs a Dataset from data stored in memory or in one or more files.

  • A data transformation constructs a dataset from one or more tf.data.Dataset objects.

 import tensorflow as tf
 
 import pathlib
import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

np.set_printoptions(precision=4)
 

Basic mechanics

To create an input pipeline, you must start with a data source. For example, to construct a Dataset from data in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data is stored in a file in the recommended TFRecord format, you can use tf.data.TFRecordDataset().

Once you have a Dataset object, you can transform it into a new Dataset by chaining method calls on the tf.data.Dataset object. For example, you can apply per-element transformations such as Dataset.map(), and multi-element transformations such as Dataset.batch(). See the documentation for tf.data.Dataset for a complete list of transformations.
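
For illustration, here is a minimal sketch (the values are arbitrary) chaining a per-element Dataset.map() with a multi-element Dataset.batch():

 doubled = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])
 doubled = doubled.map(lambda x: x * 2)  # per-element transformation
 doubled = doubled.batch(3)              # multi-element transformation

 for batch in doubled:
   print(batch.numpy())  # expected: [2 4 6], then [8 10 12]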

The Dataset object is a Python iterable. This makes it possible to consume its elements using a for loop:

 dataset = tf.data.Dataset.from_tensor_slices([8, 3, 0, 8, 2, 1])
dataset
 
<TensorSliceDataset shapes: (), types: tf.int32>
 for elem in dataset:
  print(elem.numpy())
 
8
3
0
8
2
1

Or by explicitly creating a Python iterator using iter and consuming its elements using next:

 it = iter(dataset)

print(next(it).numpy())
 
8

Alternatively, dataset elements can be consumed using the reduce transformation, which reduces all elements to produce a single result. The following example demonstrates how to use the reduce transformation to compute the sum of a dataset of integers.

 print(dataset.reduce(0, lambda state, value: state + value).numpy())
 
22

Dataset structure

A dataset contains elements that each have the same (nested) structure, and the individual components of the structure can be of any type representable by tf.TypeSpec, including tf.Tensor, tf.sparse.SparseTensor, tf.RaggedTensor, tf.TensorArray, or tf.data.Dataset.

The Dataset.element_spec property allows you to inspect the type of each element component. The property returns a nested structure of tf.TypeSpec objects, matching the structure of the element, which may be a single component, a tuple of components, or a nested tuple of components. For example:

 dataset1 = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))

dataset1.element_spec
 
TensorSpec(shape=(10,), dtype=tf.float32, name=None)
 dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset2.element_spec
 
(TensorSpec(shape=(), dtype=tf.float32, name=None),
 TensorSpec(shape=(100,), dtype=tf.int32, name=None))
 dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

dataset3.element_spec
 
(TensorSpec(shape=(10,), dtype=tf.float32, name=None),
 (TensorSpec(shape=(), dtype=tf.float32, name=None),
  TensorSpec(shape=(100,), dtype=tf.int32, name=None)))
 # Dataset containing a sparse tensor.
dataset4 = tf.data.Dataset.from_tensors(tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4]))

dataset4.element_spec
 
SparseTensorSpec(TensorShape([3, 4]), tf.int32)
 # Use value_type to see the type of value represented by the element spec
dataset4.element_spec.value_type
 
tensorflow.python.framework.sparse_tensor.SparseTensor
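
Element components are not limited to dense and sparse tensors. As a minimal sketch (assuming a TF 2.x release in which Dataset.from_tensor_slices accepts ragged tensors), a dataset of tf.RaggedTensor rows looks like this:

 # Dataset containing ragged rows of varying length (an illustrative sketch).
 dataset5 = tf.data.Dataset.from_tensor_slices(tf.ragged.constant([[1, 2], [3], [4, 5, 6]]))

 dataset5.element_spec  # expected: RaggedTensorSpec(TensorShape([None]), tf.int32, ...)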

The Dataset transformations support datasets of any structure. When using the Dataset.map() and Dataset.filter() transformations, which apply a function to each element, the element structure determines the arguments of the function:

 dataset1 = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([4, 10], minval=1, maxval=10, dtype=tf.int32))

dataset1
 
<TensorSliceDataset shapes: (10,), types: tf.int32>
 for z in dataset1:
  print(z.numpy())
 
[8 1 2 6 1 7 2 6 1 3]
[6 5 6 5 3 5 2 5 3 6]
[5 8 4 8 3 1 4 6 4 8]
[2 4 5 8 3 5 7 9 4 2]

 dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset2
 
<TensorSliceDataset shapes: ((), (100,)), types: (tf.float32, tf.int32)>
 dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

dataset3
 
<ZipDataset shapes: ((10,), ((), (100,))), types: (tf.int32, (tf.float32, tf.int32))>
 for a, (b,c) in dataset3:
  print('shapes: {a.shape}, {b.shape}, {c.shape}'.format(a=a, b=b, c=c))
 
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
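
As a minimal sketch of that rule (the variable names are arbitrary): each element of dataset3 is a pair (a, (b, c)), so a mapped function receives one argument per top-level component, with the nested tuple kept intact:

 # `pair` is the nested (b, c) tuple; keep only its scalar component.
 scalars_ds = dataset3.map(lambda a, pair: pair[0])

 scalars_ds.element_spec  # expected: TensorSpec(shape=(), dtype=tf.float32, name=None)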

Reading input data

Consuming NumPy arrays

See Loading NumPy arrays for more examples.

If all of your input data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects and use Dataset.from_tensor_slices().

 train, test = tf.keras.datasets.fashion_mnist.load_data()
 
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
32768/29515 [=================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
8192/5148 [===============================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 0s 0us/step

 images, labels = train
images = images/255

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset
 
<TensorSliceDataset shapes: ((28, 28), ()), types: (tf.float64, tf.uint8)>

Consuming Python generators

Another common data source that can easily be ingested as a tf.data.Dataset is the Python generator.

 def count(stop):
  i = 0
  while i<stop:
    yield i
    i += 1
 
 for n in count(5):
  print(n)
 
0
1
2
3
4

The Dataset.from_generator constructor converts the Python generator into a fully functional tf.data.Dataset.

The constructor takes a callable as input, rather than an iterator. This allows it to restart the generator when it reaches the end. It takes an optional args argument, which is passed as the callable's arguments.

The output_types argument is required because tf.data builds a tf.Graph internally, and graph edges require a tf.dtype.

 ds_counter = tf.data.Dataset.from_generator(count, args=[25], output_types=tf.int32, output_shapes = (), )
 
 for count_batch in ds_counter.repeat().batch(10).take(10):
  print(count_batch.numpy())
 
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24  0  1  2  3  4]
[ 5  6  7  8  9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24  0  1  2  3  4]
[ 5  6  7  8  9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]

The output_shapes argument is not required, but is highly recommended as many TensorFlow operations do not support tensors with an unknown rank. If the length of a particular axis is unknown or variable, set it as None in the output_shapes.

It's also important to note that the output_shapes and output_types follow the same nesting rules as other dataset methods.

Here is an example generator that demonstrates both aspects: it returns tuples of arrays, where the second array is a vector with unknown length.

 def gen_series():
  i = 0
  while True:
    size = np.random.randint(0, 10)
    yield i, np.random.normal(size=(size,))
    i += 1
 
 for i, series in gen_series():
  print(i, ":", str(series))
  if i > 5:
    break
 
0 : [-1.978  -1.0531  0.1959 -2.1618]
1 : [ 1.9185 -0.1874  0.5084]
2 : [0.1441 0.3987 0.7737 0.9266 1.5057 0.9151]
3 : [ 0.681  -0.6155 -0.1231 -0.2429  0.6892  1.2571 -1.7588 -1.6575 -0.5375]
4 : [-0.5567  1.5298  0.7242  0.2213]
5 : [ 1.5572 -0.6856]
6 : [-1.0965 -0.336   1.2405  0.6006]

The first output is an int32, the second is a float32.

The first item is a scalar, shape (), and the second is a vector of unknown length, shape (None,).

 ds_series = tf.data.Dataset.from_generator(
    gen_series, 
    output_types=(tf.int32, tf.float32), 
    output_shapes=((), (None,)))

ds_series
 
<FlatMapDataset shapes: ((), (None,)), types: (tf.int32, tf.float32)>

Now it can be used like a regular tf.data.Dataset. Note that when batching a dataset with a variable shape, you need to use Dataset.padded_batch.

 ds_series_batch = ds_series.shuffle(20).padded_batch(10)

ids, sequence_batch = next(iter(ds_series_batch))
print(ids.numpy())
print()
print(sequence_batch.numpy())
 
[ 2 10 18  3  6 15 25 23  0  4]

[[ 1.2665 -0.6274  0.4076  1.0146  0.      0.      0.      0.    ]
 [ 0.8091 -0.0683 -0.1464  0.2734  0.7461 -0.1009  0.      0.    ]
 [-0.9381  1.5075  0.      0.      0.      0.      0.      0.    ]
 [ 1.5705  0.4438  0.      0.      0.      0.      0.      0.    ]
 [-0.4692 -1.8328 -2.2838  0.7418  0.0172 -0.3547 -1.4502 -1.2786]
 [-1.574   0.      0.      0.      0.      0.      0.      0.    ]
 [-0.9274  1.4758  0.      0.      0.      0.      0.      0.    ]
 [-0.5043  0.7066  0.9599 -1.2986  0.      0.      0.      0.    ]
 [ 0.      0.      0.      0.      0.      0.      0.      0.    ]
 [-0.4893 -0.6937  0.      0.      0.      0.      0.      0.    ]]

For a more realistic example, try wrapping preprocessing.image.ImageDataGenerator as a tf.data.Dataset.

First download the data:

 flowers = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)
 
Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
228818944/228813984 [==============================] - 2s 0us/step

Create the image.ImageDataGenerator

 img_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, rotation_range=20)
 
 images, labels = next(img_gen.flow_from_directory(flowers))
 
Found 3670 images belonging to 5 classes.

 print(images.dtype, images.shape)
print(labels.dtype, labels.shape)
 
float32 (32, 256, 256, 3)
float32 (32, 5)

 ds = tf.data.Dataset.from_generator(
    img_gen.flow_from_directory, args=[flowers], 
    output_types=(tf.float32, tf.float32), 
    output_shapes=([32,256,256,3], [32,5])
)

ds
 
<FlatMapDataset shapes: ((32, 256, 256, 3), (32, 5)), types: (tf.float32, tf.float32)>

Consuming TFRecord data

See Loading TFRecords for an end-to-end example.

The tf.data API supports a variety of file formats so that you can process large datasets that do not fit in memory. For example, the TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data. The tf.data.TFRecordDataset class enables you to stream over the contents of one or more TFRecord files as part of an input pipeline.

Here is an example using the test file from the French Street Name Signs (FSNS) dataset.

# Download a test file of serialized tf.train.Example records.
fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")
 
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001
7905280/7904079 [==============================] - 0s 0us/step

The filenames argument to the TFRecordDataset initializer can either be a string, a list of strings, or a tf.Tensor of strings. Therefore, if you have two sets of files for training and validation purposes, you can create a factory method that produces the dataset, taking filenames as an input argument:

 dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])
dataset
 
<TFRecordDatasetV2 shapes: (), types: tf.string>
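
For instance, a hypothetical factory function (the name make_dataset is illustrative, not part of the API) might look like:

 def make_dataset(filenames, batch_size=32):
   # `filenames` could be the training or the validation file list.
   return tf.data.TFRecordDataset(filenames=filenames).batch(batch_size)

 train_ds = make_dataset([fsns_test_file])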

Many TensorFlow projects use serialized tf.train.Example records in their TFRecord files. These need to be decoded before they can be inspected:

 raw_example = next(iter(dataset))
parsed = tf.train.Example.FromString(raw_example.numpy())

parsed.features.feature['image/text']
 
bytes_list {
  value: "Rue Perreyon"
}

Consuming text data

See Loading Text for an end-to-end example.

Many datasets are distributed as one or more text files. The tf.data.TextLineDataset provides an easy way to extract lines from one or more text files. Given one or more filenames, a TextLineDataset will produce one string-valued element per line of those files.

 directory_url = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
file_names = ['cowper.txt', 'derby.txt', 'butler.txt']

file_paths = [
    tf.keras.utils.get_file(file_name, directory_url + file_name)
    for file_name in file_names
]
 
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
819200/815980 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
811008/809730 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt
811008/807992 [==============================] - 0s 0us/step

 dataset = tf.data.TextLineDataset(file_paths)
 

Here are the first few lines of the first file:

 for line in dataset.take(5):
  print(line.numpy())
 
b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
b'His wrath pernicious, who ten thousand woes'
b"Caused to Achaia's host, sent many a soul"
b'Illustrious into Ades premature,'
b'And Heroes gave (so stood the will of Jove)'

To alternate lines between files use Dataset.interleave. This makes it easier to shuffle files together. Here are the first, second, and third lines from each translation:

 files_ds = tf.data.Dataset.from_tensor_slices(file_paths)
lines_ds = files_ds.interleave(tf.data.TextLineDataset, cycle_length=3)

for i, line in enumerate(lines_ds.take(9)):
  if i % 3 == 0:
    print()
  print(line.numpy())
 

b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;"
b"\xef\xbb\xbfOf Peleus' son, Achilles, sing, O Muse,"
b'\xef\xbb\xbfSing, O goddess, the anger of Achilles son of Peleus, that brought'

b'His wrath pernicious, who ten thousand woes'
b'The vengeance, deep and deadly; whence to Greece'
b'countless ills upon the Achaeans. Many a brave soul did it send'

b"Caused to Achaia's host, sent many a soul"
b'Unnumbered ills arose; which many a soul'
b'hurrying down to Hades, and many a hero did it yield a prey to dogs and'

By default, a TextLineDataset yields every line of each file, which may not be desirable, for example, if the file starts with a header line or contains comments. These lines can be removed using the Dataset.skip() or Dataset.filter() transformations. Here, you skip the first line, then filter to find only survivors.

 titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)
 
Downloading data from https://storage.googleapis.com/tf-datasets/titanic/train.csv
32768/30874 [===============================] - 0s 0us/step

 for line in titanic_lines.take(10):
  print(line.numpy())
 
b'survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone'
b'0,male,22.0,1,0,7.25,Third,unknown,Southampton,n'
b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
b'1,female,26.0,0,0,7.925,Third,unknown,Southampton,y'
b'1,female,35.0,1,0,53.1,First,C,Southampton,n'
b'0,male,28.0,0,0,8.4583,Third,unknown,Queenstown,y'
b'0,male,2.0,3,1,21.075,Third,unknown,Southampton,n'
b'1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n'
b'1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n'
b'1,female,4.0,1,1,16.7,Third,G,Southampton,n'

 def survived(line):
  return tf.not_equal(tf.strings.substr(line, 0, 1), "0")

survivors = titanic_lines.skip(1).filter(survived)
 
 for line in survivors.take(10):
  print(line.numpy())
 
b'1,female,38.0,1,0,71.2833,First,C,Cherbourg,n'
b'1,female,26.0,0,0,7.925,Third,unknown,Southampton,y'
b'1,female,35.0,1,0,53.1,First,C,Southampton,n'
b'1,female,27.0,0,2,11.1333,Third,unknown,Southampton,n'
b'1,female,14.0,1,0,30.0708,Second,unknown,Cherbourg,n'
b'1,female,4.0,1,1,16.7,Third,G,Southampton,n'
b'1,male,28.0,0,0,13.0,Second,unknown,Southampton,y'
b'1,female,28.0,0,0,7.225,Third,unknown,Cherbourg,y'
b'1,male,28.0,0,0,35.5,First,A,Southampton,y'
b'1,female,38.0,1,5,31.3875,Third,unknown,Southampton,n'

Consuming CSV data

See Loading CSV Files and Loading Pandas DataFrames for more examples.

The CSV file format is a popular format for storing tabular data in plain text.

For example:

 titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
 
 df = pd.read_csv(titanic_file, index_col=None)
df.head()
 

If your data fits in memory, the same Dataset.from_tensor_slices method works on dictionaries, allowing this data to be easily imported:

 titanic_slices = tf.data.Dataset.from_tensor_slices(dict(df))

for feature_batch in titanic_slices.take(1):
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))
 
  'survived'          : 0
  'sex'               : b'male'
  'age'               : 22.0
  'n_siblings_spouses': 1
  'parch'             : 0
  'fare'              : 7.25
  'class'             : b'Third'
  'deck'              : b'unknown'
  'embark_town'       : b'Southampton'
  'alone'             : b'n'

A more scalable approach is to load from disk as necessary.

The tf.data module provides methods to extract records from one or more CSV files that comply with RFC 4180.

The experimental.make_csv_dataset function is the high-level interface for reading sets of CSV files. It supports column type inference and many other features, like batching and shuffling, to make usage simple.

 titanic_batches = tf.data.experimental.make_csv_dataset(
    titanic_file, batch_size=4,
    label_name="survived")
 
 for feature_batch, label_batch in titanic_batches.take(1):
  print("'survived': {}".format(label_batch))
  print("features:")
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))
 
'survived': [0 1 0 0]
features:
  'sex'               : [b'male' b'female' b'male' b'male']
  'age'               : [28. 42. 43. 21.]
  'n_siblings_spouses': [0 0 0 0]
  'parch'             : [0 0 0 0]
  'fare'              : [47.1    13.      8.05    8.6625]
  'class'             : [b'First' b'Second' b'Third' b'Third']
  'deck'              : [b'unknown' b'unknown' b'unknown' b'unknown']
  'embark_town'       : [b'Southampton' b'Southampton' b'Southampton' b'Southampton']
  'alone'             : [b'y' b'y' b'y' b'y']

You can use the select_columns argument if you only need a subset of columns.

 titanic_batches = tf.data.experimental.make_csv_dataset(
    titanic_file, batch_size=4,
    label_name="survived", select_columns=['class', 'fare', 'survived'])
 
 for feature_batch, label_batch in titanic_batches.take(1):
  print("'survived': {}".format(label_batch))
  for key, value in feature_batch.items():
    print("  {!r:20s}: {}".format(key, value))
 
'survived': [1 0 1 1]
  'fare'              : [24.15    0.     13.8583 53.1   ]
  'class'             : [b'Third' b'Second' b'Second' b'First']

There is also a lower-level experimental.CsvDataset class which provides finer-grained control. It does not support column type inference; instead, you must specify the type of each column.

 titanic_types  = [tf.int32, tf.string, tf.float32, tf.int32, tf.int32, tf.float32, tf.string, tf.string, tf.string, tf.string] 
dataset = tf.data.experimental.CsvDataset(titanic_file, titanic_types , header=True)

for line in dataset.take(10):
  print([item.numpy() for item in line])
 
[0, b'male', 22.0, 1, 0, 7.25, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 38.0, 1, 0, 71.2833, b'First', b'C', b'Cherbourg', b'n']
[1, b'female', 26.0, 0, 0, 7.925, b'Third', b'unknown', b'Southampton', b'y']
[1, b'female', 35.0, 1, 0, 53.1, b'First', b'C', b'Southampton', b'n']
[0, b'male', 28.0, 0, 0, 8.4583, b'Third', b'unknown', b'Queenstown', b'y']
[0, b'male', 2.0, 3, 1, 21.075, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 27.0, 0, 2, 11.1333, b'Third', b'unknown', b'Southampton', b'n']
[1, b'female', 14.0, 1, 0, 30.0708, b'Second', b'unknown', b'Cherbourg', b'n']
[1, b'female', 4.0, 1, 1, 16.7, b'Third', b'G', b'Southampton', b'n']
[0, b'male', 20.0, 0, 0, 8.05, b'Third', b'unknown', b'Southampton', b'y']

If some columns are empty, this low-level interface allows you to provide default values instead of column types.

 %%writefile missing.csv
1,2,3,4
,2,3,4
1,,3,4
1,2,,4
1,2,3,
,,,
 
Writing missing.csv

# Creates a dataset that reads all of the records from the CSV file,
# which has four integer columns that may have missing values.

record_defaults = [999,999,999,999]
dataset = tf.data.experimental.CsvDataset("missing.csv", record_defaults)
dataset = dataset.map(lambda *items: tf.stack(items))
dataset
 
<MapDataset shapes: (4,), types: tf.int32>
 for line in dataset:
  print(line.numpy())
 
[1 2 3 4]
[999   2   3   4]
[  1 999   3   4]
[  1   2 999   4]
[  1   2   3 999]
[999 999 999 999]

By default, a CsvDataset yields every column of every line of the file, which may not be desirable, for example if the file starts with a header line that should be ignored, or if some columns are not required in the input. These lines and fields can be removed with the header and select_cols arguments respectively.

# Creates a dataset that reads all of the records from the CSV file,
# skipping the header and extracting data from columns 1 and 3.
record_defaults = [999, 999] # Only provide defaults for the selected columns
dataset = tf.data.experimental.CsvDataset("missing.csv", record_defaults, select_cols=[1, 3])
dataset = dataset.map(lambda *items: tf.stack(items))
dataset
 
<MapDataset shapes: (2,), types: tf.int32>
 for line in dataset:
  print(line.numpy())
 
[2 4]
[2 4]
[999   4]
[2 4]
[  2 999]
[999 999]

Consuming sets of files

There are many datasets distributed as a set of files, where each file is an example.

 flowers_root = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)
flowers_root = pathlib.Path(flowers_root)

 

The root directory contains a directory for each class:

 for item in flowers_root.glob("*"):
  print(item.name)
 
sunflowers
daisy
LICENSE.txt
roses
tulips
dandelion

The files in each class directory are examples:

 list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))

for f in list_ds.take(5):
  print(f.numpy())
 
b'/home/kbuilder/.keras/datasets/flower_photos/dandelion/7243478942_30bf542a2d_m.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/tulips/4525067924_177ea3bfb4.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/tulips/7002703410_3e97b29da5_n.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/daisy/6299910262_336309ffa5_n.jpg'
b'/home/kbuilder/.keras/datasets/flower_photos/sunflowers/6140661443_bb48344226.jpg'

Read the data using the tf.io.read_file function and extract the label from the path, returning (image, label) pairs:

 def process_path(file_path):
  label = tf.strings.split(file_path, os.sep)[-2]
  return tf.io.read_file(file_path), label

labeled_ds = list_ds.map(process_path)
 
 for image_raw, label_text in labeled_ds.take(1):
  print(repr(image_raw.numpy()[:100]))
  print()
  print(label_text.numpy())
 
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00H\x00H\x00\x00\xff\xe2\x0cXICC_PROFILE\x00\x01\x01\x00\x00\x0cHLino\x02\x10\x00\x00mntrRGB XYZ \x07\xce\x00\x02\x00\t\x00\x06\x001\x00\x00acspMSFT\x00\x00\x00\x00IEC sRGB\x00\x00\x00\x00\x00\x00'

b'tulips'

Batching dataset elements

Simple batching

The simplest form of batching stacks n consecutive elements of a dataset into a single element. The Dataset.batch() transformation does exactly this, with the same constraints as the tf.stack() operator, applied to each component of the elements: that is, for each component i, all elements must have a tensor of the exact same shape.

 inc_dataset = tf.data.Dataset.range(100)
dec_dataset = tf.data.Dataset.range(0, -100, -1)
dataset = tf.data.Dataset.zip((inc_dataset, dec_dataset))
batched_dataset = dataset.batch(4)

for batch in batched_dataset.take(4):
  print([arr.numpy() for arr in batch])
 
[array([0, 1, 2, 3]), array([ 0, -1, -2, -3])]
[array([4, 5, 6, 7]), array([-4, -5, -6, -7])]
[array([ 8,  9, 10, 11]), array([ -8,  -9, -10, -11])]
[array([12, 13, 14, 15]), array([-12, -13, -14, -15])]

While tf.data tries to propagate shape information, the default settings of Dataset.batch result in an unknown batch size because the last batch may not be full. Note the Nones in the shape:

 batched_dataset
 
<BatchDataset shapes: ((None,), (None,)), types: (tf.int64, tf.int64)>

Use the drop_remainder argument to ignore that last batch and get full shape propagation:

 batched_dataset = dataset.batch(7, drop_remainder=True)
batched_dataset
 
<BatchDataset shapes: ((7,), (7,)), types: (tf.int64, tf.int64)>

Batching tensors with padding

The above recipe works for tensors that all have the same size. However, many models (e.g. sequence models) work with input data that can have varying size (e.g. sequences of different lengths). To handle this case, the Dataset.padded_batch transformation enables you to batch tensors of different shapes by specifying one or more dimensions in which they may be padded.

 dataset = tf.data.Dataset.range(100)
dataset = dataset.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
dataset = dataset.padded_batch(4, padded_shapes=(None,))

for batch in dataset.take(2):
  print(batch.numpy())
  print()

 
[[0 0 0]
 [1 0 0]
 [2 2 0]
 [3 3 3]]

[[4 4 4 4 0 0 0]
 [5 5 5 5 5 0 0]
 [6 6 6 6 6 6 0]
 [7 7 7 7 7 7 7]]


The Dataset.padded_batch transformation allows you to set different padding for each dimension of each component, and it may be variable-length (signified by None in the example above) or constant-length. It is also possible to override the padding value, which defaults to 0.
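
For example, here is a minimal sketch (rebuilding the dataset above) that pads with -1 instead of the default 0:

 ds = tf.data.Dataset.range(100)
 ds = ds.map(lambda x: tf.fill([tf.cast(x, tf.int32)], x))
 # padding_values must match the element dtype (tf.int64 here).
 ds = ds.padded_batch(4, padded_shapes=(None,),
                      padding_values=tf.constant(-1, dtype=tf.int64))
 # The shorter rows of each batch are now filled with -1 rather than 0.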

Training workflows

Processing multiple epochs

The tf.data API offers two main ways to process multiple epochs of the same data.

The simplest way to iterate over a dataset in multiple epochs is to use the Dataset.repeat() transformation. First, create a dataset of Titanic data:

 titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)
 
 def plot_batch_sizes(ds):
  batch_sizes = [batch.shape[0] for batch in ds]
  plt.bar(range(len(batch_sizes)), batch_sizes)
  plt.xlabel('Batch number')
  plt.ylabel('Batch size')
 

Applying the Dataset.repeat() transformation with no arguments will repeat the input indefinitely.
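
As a minimal sketch of that behavior, bounding the infinite stream with take:

 ds = tf.data.Dataset.range(3).repeat()  # cycles through 0, 1, 2 indefinitely
 print([n.numpy() for n in ds.take(7)])  # expected: [0, 1, 2, 0, 1, 2, 0]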

The Dataset.repeat transformation concatenates its arguments without signaling the end of one epoch and the beginning of the next. Because of this, a Dataset.batch applied after Dataset.repeat will yield batches that straddle epoch boundaries:

 titanic_batches = titanic_lines.repeat(3).batch(128)
plot_batch_sizes(titanic_batches)
 

[Plot: batch size vs. batch number — batches straddle epoch boundaries]

If you need clear epoch separation, put Dataset.batch before the repeat:

 titanic_batches = titanic_lines.batch(128).repeat(3)

plot_batch_sizes(titanic_batches)
 

[Plot: batch size vs. batch number — each epoch ends with a smaller final batch]

If you would like to perform a custom computation (e.g. to collect statistics) at the end of each epoch, then it's simplest to restart the dataset iteration on each epoch:

 epochs = 3
dataset = titanic_lines.batch(128)

for epoch in range(epochs):
  for batch in dataset:
    print(batch.shape)
  print("End of epoch: ", epoch)
 
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch:  0
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch:  1
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch:  2

Randomly shuffling input data

The Dataset.shuffle() transformation maintains a fixed-size buffer and chooses the next element uniformly at random from that buffer.

Add an index to the dataset so you can see the effect:

 lines = tf.data.TextLineDataset(titanic_file)
counter = tf.data.experimental.Counter()

dataset = tf.data.Dataset.zip((counter, lines))
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(20)
dataset
 
<BatchDataset shapes: ((None,), (None,)), types: (tf.int64, tf.string)>

Since the buffer_size is 100, and the batch size is 20, the first batch contains no elements with an index over 120.

 n,line_batch = next(iter(dataset))
print(n.numpy())
 
[ 63  48 101  12 103  52   6  39   4   9  93  91   5  86  79  64  95  33
 102  50]

As with Dataset.batch, the order relative to Dataset.repeat matters.

Dataset.shuffle doesn't signal the end of an epoch until the shuffle buffer is empty. So a shuffle placed before a repeat will show every element of one epoch before moving to the next:

 dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.shuffle(buffer_size=100).batch(10).repeat(2)

print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(60).take(5):
  print(n.numpy())
 
Here are the item ID's near the epoch boundary:

[613 609 624 553 608 583 493 617 611 610]
[217 508 579 601 319 616 606 549 618 623]
[416 567 404 622 283 458 503 602]
[ 87  68  56  16   6  62   1  89  58 106]
[98 80 43 10 67 44 19 34 13 57]

 shuffle_repeat = [n.numpy().mean() for n, line_batch in shuffled]
plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.ylabel("Mean item ID")
plt.legend()
 
<matplotlib.legend.Legend at 0x7fe0a00a1d68>

[Plot: mean item ID per batch for shuffle().repeat()]

But a repeat before a shuffle mixes the epoch boundaries together:

 dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.repeat(2).shuffle(buffer_size=100).batch(10)

print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(55).take(15):
  print(n.numpy())
 
Here are the item ID's near the epoch boundary:

[440  15   8 599 567  18 550   5  19  17]
[ 12 501 571 473 466  21 531 596 580 555]
[  3 573  38 563  25 416 595  29  46 602]
[485 566 561  16 331 615 386  28 609  41]
[611 622 575  10 589  61 598 527  52  35]
[ 55 597  42  23  13  47  11 505  68 582]
[612 613  75  43   7 392  74 452  82 509]
[  9  44  62 491  71 343  51 590  60  98]
[  6  95 619  86 625 537 617  85 465   0]
[ 88  27  92 101 109 111 104  24  36 113]
[103 118  79  53  70  40 121 100  65  33]
[562 588 124 125  64  84  83  67 610 130]
[  4 142 131  90 518 129 143 112   2 551]
[377  91 140  76  50  48 526 553 156 591]
[105 128  69 114  93 520 154  56 145 115]

 repeat_shuffle = [n.numpy().mean() for n, line_batch in shuffled]

plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.plot(repeat_shuffle, label="repeat().shuffle()")
plt.ylabel("Mean item ID")
plt.legend()
 
<matplotlib.legend.Legend at 0x7fe0582fbb70>

[Plot: mean item ID per batch, comparing shuffle().repeat() with repeat().shuffle()]

Preprocessing data

The Dataset.map(f) transformation produces a new dataset by applying a given function f to each element of the input dataset. It is based on the map() function that is commonly applied to lists (and other structures) in functional programming languages. The function f takes the tf.Tensor objects that represent a single element in the input, and returns the tf.Tensor objects that will represent a single element in the new dataset. Its implementation uses standard TensorFlow operations to transform one element into another.

This section covers common examples of how to use Dataset.map().

Decoding image data and resizing it

When training a neural network on real-world image data, it is often necessary to convert images of different sizes to a common size, so that they may be batched into a fixed size.

Rebuild the flower filenames dataset:

 list_ds = tf.data.Dataset.list_files(str(flowers_root/'*/*'))
 

Write a function that manipulates the dataset elements.

 # Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def parse_image(filename):
  parts = tf.strings.split(filename, os.sep)
  label = parts[-2]

  image = tf.io.read_file(filename)
  image = tf.image.decode_jpeg(image)
  image = tf.image.convert_image_dtype(image, tf.float32)
  image = tf.image.resize(image, [128, 128])
  return image, label
 

Test that it works.

 file_path = next(iter(list_ds))
image, label = parse_image(file_path)

def show(image, label):
  plt.figure()
  plt.imshow(image)
  plt.title(label.numpy().decode('utf-8'))
  plt.axis('off')

show(image, label)
 

[Image: a flower photo displayed with its label as the title]

Map it over the dataset.

 images_ds = list_ds.map(parse_image)

for image, label in images_ds.take(2):
  show(image, label)
 

[Images: two flower photos displayed with their labels]

Applying arbitrary Python logic

For performance reasons, use TensorFlow operations for preprocessing your data whenever possible. However, it is sometimes useful to call external Python libraries when parsing your input data. You can use the tf.py_function() operation in a Dataset.map() transformation.

For example, if you want to apply a random rotation, the tf.image module only has tf.image.rot90, which is not very useful for image augmentation.

To demonstrate tf.py_function, try using the scipy.ndimage.rotate function instead:

 import scipy.ndimage as ndimage

def random_rotate_image(image):
  image = ndimage.rotate(image, np.random.uniform(-30, 30), reshape=False)
  return image
 
 image, label = next(iter(images_ds))
image = random_rotate_image(image)
show(image, label)
 
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).

[Image: the randomly rotated flower photo]

To use this function with Dataset.map the same caveats apply as with Dataset.from_generator: you need to describe the return shapes and types when you apply the function:

 def tf_random_rotate_image(image, label):
  im_shape = image.shape
  [image,] = tf.py_function(random_rotate_image, [image], [tf.float32])
  image.set_shape(im_shape)
  return image, label
 
 rot_ds = images_ds.map(tf_random_rotate_image)

for image, label in rot_ds.take(2):
  show(image, label)
 
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).

[Images: two randomly rotated flower photos]

Parsing tf.Example protocol buffer messages

Many input pipelines extract tf.train.Example protocol buffer messages from a TFRecord format. Each tf.train.Example record contains one or more "features", and the input pipeline typically converts these features into tensors.

 fsns_test_file = tf.keras.utils.get_file("fsns.tfrec", "https://storage.googleapis.com/download.tensorflow.org/data/fsns-20160927/testdata/fsns-00000-of-00001")
dataset = tf.data.TFRecordDataset(filenames = [fsns_test_file])
dataset
 
<TFRecordDatasetV2 shapes: (), types: tf.string>

You can work with tf.train.Example protos outside of a tf.data.Dataset to understand the data:

 raw_example = next(iter(dataset))
parsed = tf.train.Example.FromString(raw_example.numpy())

feature = parsed.features.feature
raw_img = feature['image/encoded'].bytes_list.value[0]
img = tf.image.decode_png(raw_img)
plt.imshow(img)
plt.axis('off')
_ = plt.title(feature["image/text"].bytes_list.value[0])
 

[Image: FSNS street sign photo, titled 'Rue Perreyon']

 raw_example = next(iter(dataset))
 
 def tf_parse(eg):
  example = tf.io.parse_example(
      eg[tf.newaxis], {
          'image/encoded': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
          'image/text': tf.io.FixedLenFeature(shape=(), dtype=tf.string)
      })
  return example['image/encoded'][0], example['image/text'][0]
 
 img, txt = tf_parse(raw_example)
print(txt.numpy())
print(repr(img.numpy()[:20]), "...")
 
b'Rue Perreyon'
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02X' ...

 decoded = dataset.map(tf_parse)
decoded
 
<MapDataset shapes: ((), ()), types: (tf.string, tf.string)>
 image_batch, text_batch = next(iter(decoded.batch(10)))
image_batch.shape
 
TensorShape([10])

Time series windowing

For an end-to-end time series example see: Time series forecasting.

Time series data is often organized with the time axis intact.

Use a simple Dataset.range to demonstrate:

 range_ds = tf.data.Dataset.range(100000)
 

Typically, models based on this sort of data will want a contiguous time slice.

The simplest approach is to batch the data:

Using batch

 batches = range_ds.batch(10, drop_remainder=True)

for batch in batches.take(5):
  print(batch.numpy())
 
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]

Or, to make dense predictions one step into the future, you might shift the features and labels one step relative to each other:

 def dense_1_step(batch):
  # Shift features and labels one step relative to each other.
  return batch[:-1], batch[1:]

predict_dense_1_step = batches.map(dense_1_step)

for features, label in predict_dense_1_step.take(3):
  print(features.numpy(), " => ", label.numpy())
 
[0 1 2 3 4 5 6 7 8]  =>  [1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18]  =>  [11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28]  =>  [21 22 23 24 25 26 27 28 29]

To predict a whole window instead of a fixed offset, you can split the batches into two parts:

 batches = range_ds.batch(15, drop_remainder=True)

def label_next_5_steps(batch):
  return (batch[:-5],   # Inputs: all except the last 5 steps
          batch[-5:])   # Labels: the last 5 steps

predict_5_steps = batches.map(label_next_5_steps)

for features, label in predict_5_steps.take(3):
  print(features.numpy(), " => ", label.numpy())
 
[0 1 2 3 4 5 6 7 8 9]  =>  [10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]  =>  [25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]  =>  [40 41 42 43 44]

To allow some overlap between the features of one batch and the labels of another, use Dataset.zip:

 feature_length = 10
label_length = 5

features = range_ds.batch(feature_length, drop_remainder=True)
labels = range_ds.batch(feature_length).skip(1).map(lambda labels: labels[:-5])

predict_5_steps = tf.data.Dataset.zip((features, labels))

for features, label in predict_5_steps.take(3):
  print(features.numpy(), " => ", label.numpy())
 
[0 1 2 3 4 5 6 7 8 9]  =>  [10 11 12 13 14]
[10 11 12 13 14 15 16 17 18 19]  =>  [20 21 22 23 24]
[20 21 22 23 24 25 26 27 28 29]  =>  [30 31 32 33 34]

Using window

While using Dataset.batch works, there are situations where you may need finer control. The Dataset.window method gives you complete control, but requires some care: it returns a Dataset of Datasets. See Dataset structure for details.

 window_size = 5

windows = range_ds.window(window_size, shift=1)
for sub_ds in windows.take(5):
  print(sub_ds)
 
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>
<_VariantDataset shapes: (), types: tf.int64>

The Dataset.flat_map method can take a dataset of datasets and flatten it into a single dataset:

  for x in windows.flat_map(lambda x: x).take(30):
   print(x.numpy(), end=' ')
 
WARNING:tensorflow:AutoGraph could not transform <function <lambda> at 0x7fe0582dbbf8> and will run it as-is.
Cause: could not parse the source code:

for x in windows.flat_map(lambda x: x).take(30):

This error may be avoided by creating the lambda in a standalone statement.

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function <lambda> at 0x7fe0582dbbf8> and will run it as-is.
Cause: could not parse the source code:

for x in windows.flat_map(lambda x: x).take(30):

This error may be avoided by creating the lambda in a standalone statement.

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
0 1 2 3 4 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9 

In nearly all cases, you will want to .batch the dataset first:

 def sub_to_batch(sub):
  return sub.batch(window_size, drop_remainder=True)

for example in windows.flat_map(sub_to_batch).take(5):
  print(example.numpy())
 
[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]

Now you can see that the shift argument controls how far each window moves over.

Putting this together, you might write this function:

 def make_window_dataset(ds, window_size=5, shift=1, stride=1):
  windows = ds.window(window_size, shift=shift, stride=stride)

  def sub_to_batch(sub):
    return sub.batch(window_size, drop_remainder=True)

  windows = windows.flat_map(sub_to_batch)
  return windows

 
 ds = make_window_dataset(range_ds, window_size=10, shift = 5, stride=3)

for example in ds.take(10):
  print(example.numpy())
 
[ 0  3  6  9 12 15 18 21 24 27]
[ 5  8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34 37]
[15 18 21 24 27 30 33 36 39 42]
[20 23 26 29 32 35 38 41 44 47]
[25 28 31 34 37 40 43 46 49 52]
[30 33 36 39 42 45 48 51 54 57]
[35 38 41 44 47 50 53 56 59 62]
[40 43 46 49 52 55 58 61 64 67]
[45 48 51 54 57 60 63 66 69 72]

Then it's easy to extract labels, as before:

 dense_labels_ds = ds.map(dense_1_step)

for inputs,labels in dense_labels_ds.take(3):
  print(inputs.numpy(), "=>", labels.numpy())
 
[ 0  3  6  9 12 15 18 21 24] => [ 3  6  9 12 15 18 21 24 27]
[ 5  8 11 14 17 20 23 26 29] => [ 8 11 14 17 20 23 26 29 32]
[10 13 16 19 22 25 28 31 34] => [13 16 19 22 25 28 31 34 37]

Resampling

When working with a dataset that is very class-imbalanced, you may want to resample the dataset. tf.data provides two methods to do this. The credit card fraud dataset is a good example of this kind of problem.

 zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/download.tensorflow.org/data/creditcard.zip',
    fname='creditcard.zip',
    extract=True)

csv_path = zip_path.replace('.zip', '.csv')
 
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/creditcard.zip
69156864/69155632 [==============================] - 10s 0us/step

 creditcard_ds = tf.data.experimental.make_csv_dataset(
    csv_path, batch_size=1024, label_name="Class",
    # Set the column types: 30 floats and an int.
    column_defaults=[float()]*30+[int()])
 

Now, check the distribution of classes; it is highly skewed:

 def count(counts, batch):
  features, labels = batch
  class_1 = labels == 1
  class_1 = tf.cast(class_1, tf.int32)

  class_0 = labels == 0
  class_0 = tf.cast(class_0, tf.int32)

  counts['class_0'] += tf.reduce_sum(class_0)
  counts['class_1'] += tf.reduce_sum(class_1)

  return counts
 
 counts = creditcard_ds.take(10).reduce(
    initial_state={'class_0': 0, 'class_1': 0},
    reduce_func = count)

counts = np.array([counts['class_0'].numpy(),
                   counts['class_1'].numpy()]).astype(np.float32)

fractions = counts/counts.sum()
print(fractions)
 
[0.996 0.004]

A common approach to training with an imbalanced dataset is to balance it. tf.data includes a few methods that enable this workflow:

Datasets sampling

One approach to resampling a dataset is to use sample_from_datasets. This is most applicable when you have a separate data.Dataset for each class.

Here, just use filter to generate them from the credit card fraud data:

 negative_ds = (
  creditcard_ds
    .unbatch()
    .filter(lambda features, label: label==0)
    .repeat())
positive_ds = (
  creditcard_ds
    .unbatch()
    .filter(lambda features, label: label==1)
    .repeat())
 
WARNING:tensorflow:AutoGraph could not transform <function <lambda> at 0x7fe0a01fd1e0> and will run it as-is.
Cause: could not parse the source code:

    .filter(lambda features, label: label==0)

This error may be avoided by creating the lambda in a standalone statement.

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function <lambda> at 0x7fe0a01fd1e0> and will run it as-is.
Cause: could not parse the source code:

    .filter(lambda features, label: label==0)

This error may be avoided by creating the lambda in a standalone statement.

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function <lambda> at 0x7fe058159620> and will run it as-is.
Cause: could not parse the source code:

    .filter(lambda features, label: label==1)

This error may be avoided by creating the lambda in a standalone statement.

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <function <lambda> at 0x7fe058159620> and will run it as-is.
Cause: could not parse the source code:

    .filter(lambda features, label: label==1)

This error may be avoided by creating the lambda in a standalone statement.

To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert

 for features, label in positive_ds.batch(10).take(1):
  print(label.numpy())
 
[1 1 1 1 1 1 1 1 1 1]

To use tf.data.experimental.sample_from_datasets, pass the datasets and the weight for each:

 balanced_ds = tf.data.experimental.sample_from_datasets(
    [negative_ds, positive_ds], [0.5, 0.5]).batch(10)
 

Now the dataset produces examples of each class with 50/50 probability:

 for features, labels in balanced_ds.take(10):
  print(labels.numpy())
 
[0 0 0 1 0 0 1 1 0 0]
[1 1 0 0 1 1 0 0 0 1]
[1 1 0 0 0 0 0 0 1 1]
[1 0 0 1 1 0 0 0 1 0]
[1 1 0 0 0 1 1 0 0 1]
[0 0 1 1 1 0 0 1 1 0]
[0 0 0 1 0 0 0 1 1 1]
[1 1 0 1 0 0 1 0 1 1]
[0 1 0 0 1 1 0 0 0 1]
[0 1 0 0 1 0 0 1 1 0]

Rejection resampling

One problem with the above experimental.sample_from_datasets approach is that it needs a separate tf.data.Dataset per class. Using Dataset.filter works, but results in all the data being loaded twice.

The data.experimental.rejection_resample function can be applied to a dataset to rebalance it while loading it only once. Elements will be dropped from the dataset to achieve balance.

data.experimental.rejection_resample takes a class_func argument. This class_func is applied to each dataset element, and is used to determine which class an example belongs to for the purposes of balancing.

The elements of creditcard_ds are already (features, label) pairs, so the class_func just needs to return those labels:

 def class_func(features, label):
  return label
 

The resampler also needs a target distribution, and optionally an initial distribution estimate:

 resampler = tf.data.experimental.rejection_resample(
    class_func, target_dist=[0.5, 0.5], initial_dist=fractions)
 

The resampler deals with individual examples, so you must unbatch the dataset before applying it:

 resample_ds = creditcard_ds.unbatch().apply(resampler).batch(10)
 
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/data/experimental/ops/resampling.py:156: Print (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2018-08-20.
Instructions for updating:
Use tf.print instead of tf.Print. Note that tf.print returns a no-output operator that directly prints the output. Outside of defuns or eager mode, this operator will not be executed unless it is directly specified in session.run or used as a control dependency for other operators. This is only a concern in graph mode. Below is an example of how to ensure tf.print executes in graph mode:


The resampler returns (class, example) pairs created from the output of the class_func. In this case, the example was already a (feature, label) pair, so use map to drop the extra copy of the labels:

 balanced_ds = resample_ds.map(lambda extra_label, features_and_label: features_and_label)
 

Now the dataset produces examples of each class with 50/50 probability:

 for features, labels in balanced_ds.take(10):
  print(labels.numpy())
 
[1 0 1 1 1 1 0 1 1 1]
[0 0 1 1 1 0 1 0 1 1]
[1 0 0 1 0 0 0 0 0 1]
[1 1 1 1 1 1 0 1 1 1]
[1 1 0 1 0 0 0 0 1 0]
[1 0 0 0 1 0 1 0 1 0]
[0 0 0 0 0 0 1 0 0 0]
[0 0 0 1 1 0 0 1 0 1]
[0 0 1 1 1 1 0 0 1 1]
[0 0 1 1 0 1 0 1 1 0]

Iterator checkpointing

TensorFlow supports taking checkpoints so that when your training process restarts, it can restore the latest checkpoint to recover most of its progress. In addition to checkpointing the model variables, you can also checkpoint the progress of the dataset iterator. This can be useful if you have a large dataset and don't want to start it from the beginning on each restart. Note, however, that iterator checkpoints may be large, since transformations such as shuffle and prefetch require buffering elements within the iterator.

To include your iterator in a checkpoint, pass the iterator to the tf.train.Checkpoint constructor.

 range_ds = tf.data.Dataset.range(20)

iterator = iter(range_ds)
ckpt = tf.train.Checkpoint(step=tf.Variable(0), iterator=iterator)
manager = tf.train.CheckpointManager(ckpt, '/tmp/my_ckpt', max_to_keep=3)

print([next(iterator).numpy() for _ in range(5)])

save_path = manager.save()

print([next(iterator).numpy() for _ in range(5)])

ckpt.restore(manager.latest_checkpoint)

print([next(iterator).numpy() for _ in range(5)])
 
[0, 1, 2, 3, 4]
[5, 6, 7, 8, 9]
[5, 6, 7, 8, 9]

Using high-level APIs

tf.keras

The tf.keras API simplifies many aspects of creating and executing machine learning models. Its .fit(), .evaluate(), and .predict() APIs support datasets as inputs. Here is a quick dataset and model setup:

 train, test = tf.keras.datasets.fashion_mnist.load_data()

images, labels = train
images = images/255.0
labels = labels.astype(np.int32)
 
 fmnist_train_ds = tf.data.Dataset.from_tensor_slices((images, labels))
fmnist_train_ds = fmnist_train_ds.shuffle(5000).batch(32)

model = tf.keras.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(10)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=['accuracy'])
 

Passing a dataset of (feature, label) pairs is all that's needed for Model.fit and Model.evaluate:

 model.fit(fmnist_train_ds, epochs=2)
 
Epoch 1/2
WARNING:tensorflow:Layer flatten is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.

If you intended to run this layer in float32, you can safely ignore this warning. If in doubt, this warning is likely only an issue if you are porting a TensorFlow 1.X model to TensorFlow 2.

To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

1875/1875 [==============================] - 4s 2ms/step - loss: 0.6031 - accuracy: 0.7937
Epoch 2/2
1875/1875 [==============================] - 3s 2ms/step - loss: 0.4620 - accuracy: 0.8416

<tensorflow.python.keras.callbacks.History at 0x7fe13f9ed3c8>

If you pass an infinite dataset, for example by calling Dataset.repeat(), you just need to also pass the steps_per_epoch argument:

 model.fit(fmnist_train_ds.repeat(), epochs=2, steps_per_epoch=20)
 
Epoch 1/2
20/20 [==============================] - 0s 2ms/step - loss: 0.4050 - accuracy: 0.8672
Epoch 2/2
20/20 [==============================] - 0s 2ms/step - loss: 0.4077 - accuracy: 0.8703

<tensorflow.python.keras.callbacks.History at 0x7fe0ca13edd8>

For evaluation you can pass the number of evaluation steps:

 loss, accuracy = model.evaluate(fmnist_train_ds)
print("Loss :", loss)
print("Accuracy :", accuracy)
 
1875/1875 [==============================] - 3s 2ms/step - loss: 0.4474 - accuracy: 0.8439
Loss : 0.4474281072616577
Accuracy : 0.843916654586792

For long datasets, set the number of steps to evaluate:

 loss, accuracy = model.evaluate(fmnist_train_ds.repeat(), steps=10)
print("Loss :", loss)
print("Accuracy :", accuracy)
 
10/10 [==============================] - 0s 2ms/step - loss: 0.5262 - accuracy: 0.8156
Loss : 0.5262183547019958
Accuracy : 0.815625011920929

The labels are not required when calling Model.predict:

 predict_ds = tf.data.Dataset.from_tensor_slices(images).batch(32)
result = model.predict(predict_ds, steps = 10)
print(result.shape)
 
(320, 10)

But the labels are ignored if you do pass a dataset containing them:

 result = model.predict(fmnist_train_ds, steps = 10)
print(result.shape)
 
(320, 10)

tf.estimator

To use a Dataset in the input_fn of a tf.estimator.Estimator, simply return the Dataset from the input_fn and the framework will take care of consuming its elements for you. For example:

 import tensorflow_datasets as tfds

def train_input_fn():
  titanic = tf.data.experimental.make_csv_dataset(
      titanic_file, batch_size=32,
      label_name="survived")
  titanic_batches = (
      titanic.cache().repeat().shuffle(500)
      .prefetch(tf.data.experimental.AUTOTUNE))
  return titanic_batches
 
 embark = tf.feature_column.categorical_column_with_hash_bucket('embark_town', 32)
cls = tf.feature_column.categorical_column_with_vocabulary_list('class', ['First', 'Second', 'Third']) 
age = tf.feature_column.numeric_column('age')
 
 import tempfile
model_dir = tempfile.mkdtemp()
model = tf.estimator.LinearClassifier(
    model_dir=model_dir,
    feature_columns=[embark, cls, age],
    n_classes=2
)
 
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpefmfuc4o', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

 model = model.train(input_fn=train_input_fn, steps=100)
 
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/feature_column/feature_column_v2.py:540: Layer.add_variable (from tensorflow.python.keras.engine.base_layer_v1) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.add_weight` method instead.
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/keras/optimizer_v2/ftrl.py:144: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpefmfuc4o/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 0.6931472, step = 0
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 100...
INFO:tensorflow:Saving checkpoints for 100 into /tmp/tmpefmfuc4o/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 100...
INFO:tensorflow:Loss for final step: 0.58668363.

 result = model.evaluate(train_input_fn, steps=10)

for key, value in result.items():
  print(key, ":", value)
 
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-07-23T01:23:29Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpefmfuc4o/model.ckpt-100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/10]
INFO:tensorflow:Evaluation [2/10]
INFO:tensorflow:Evaluation [3/10]
INFO:tensorflow:Evaluation [4/10]
INFO:tensorflow:Evaluation [5/10]
INFO:tensorflow:Evaluation [6/10]
INFO:tensorflow:Evaluation [7/10]
INFO:tensorflow:Evaluation [8/10]
INFO:tensorflow:Evaluation [9/10]
INFO:tensorflow:Evaluation [10/10]
INFO:tensorflow:Inference Time : 0.83507s
INFO:tensorflow:Finished evaluation at 2020-07-23-01:23:30
INFO:tensorflow:Saving dict for global step 100: accuracy = 0.675, accuracy_baseline = 0.58125, auc = 0.71750116, auc_precision_recall = 0.6480325, average_loss = 0.64111984, global_step = 100, label/mean = 0.41875, loss = 0.64111984, precision = 0.85714287, prediction/mean = 0.30204886, recall = 0.26865673
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 100: /tmp/tmpefmfuc4o/model.ckpt-100
accuracy : 0.675
accuracy_baseline : 0.58125
auc : 0.71750116
auc_precision_recall : 0.6480325
average_loss : 0.64111984
label/mean : 0.41875
loss : 0.64111984
precision : 0.85714287
prediction/mean : 0.30204886
recall : 0.26865673
global_step : 100

 for pred in model.predict(train_input_fn):
  for key, value in pred.items():
    print(key, ":", value)
  break
 
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpefmfuc4o/model.ckpt-100
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
logits : [-0.5965]
logistic : [0.3551]
probabilities : [0.6449 0.3551]
class_ids : [0]
classes : [b'0']
all_class_ids : [0 1]
all_classes : [b'0' b'1']