All DatasetBuilders expose various data subsets defined as splits (e.g. train, test). When constructing a tf.data.Dataset instance using tfds.load() or tfds.DatasetBuilder.as_dataset(), one can specify which split(s) to retrieve. It is also possible to retrieve slice(s) of split(s), as well as combinations of those.
Slicing API
Slicing instructions are specified in tfds.load or tfds.DatasetBuilder.as_dataset.
Instructions can be provided as either strings or ReadInstructions. Strings are more compact and readable for simple cases, while ReadInstructions provide more options and may be easier to use with variable slicing parameters.
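When the slicing parameters live in variables, the string form can also be assembled programmatically. The following is a minimal sketch in plain Python; the helper name make_split_spec is hypothetical and not part of tfds, and it only builds the string that would then be passed as the split argument.

```python
def make_split_spec(split, start=None, stop=None, unit='abs'):
    """Builds a slicing string such as 'train[10:20]' or 'train[:10%]'.

    Hypothetical helper for illustration only; it is not part of tfds.
    """
    if unit not in ('abs', '%'):
        raise ValueError(f'unsupported unit: {unit!r}')
    if start is None and stop is None:
        return split  # the full split, no slicing
    suffix = '%' if unit == '%' else ''
    start_str = '' if start is None else f'{start}{suffix}'
    stop_str = '' if stop is None else f'{stop}{suffix}'
    return f'{split}[{start_str}:{stop_str}]'


full = make_split_spec('train')
absolute = make_split_spec('train', 10, 20)
first_10pct = make_split_spec('train', stop=10, unit='%')
# Slices are combined with '+', as in the string API.
combined = '+'.join([first_10pct, make_split_spec('train', start=-80, unit='%')])
```

The resulting strings have the same shape as the ones used in the string API examples, for instance split=combined in a call to tfds.load.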
Examples
Examples using the string API:
import tensorflow_datasets as tfds

# The full `train` split.
train_ds = tfds.load('mnist', split='train')
# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = tfds.load('mnist', split=['train', 'test'])
# The full `train` and `test` splits, interleaved together.
train_test_ds = tfds.load('mnist', split='train+test')
# From record 10 (included) to record 20 (excluded) of `train` split.
train_10_20_ds = tfds.load('mnist', split='train[10:20]')
# The first 10% of train split.
train_10pct_ds = tfds.load('mnist', split='train[:10%]')
# The first 10% of train + the last 80% of train.
train_10_80pct_ds = tfds.load('mnist', split='train[:10%]+train[-80%:]')
# 10-fold cross-validation (see also next section on rounding behavior):
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = tfds.load('mnist', split=[
    f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)
])
trains_ds = tfds.load('mnist', split=[
    f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10)
])
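The cross-validation slices above can be sanity-checked without loading any data. The sketch below is plain Python and makes one loud assumption: a hypothetical split size of 1000 records, chosen so that every percent boundary falls on a whole record and rounding plays no role. The helper pct_range is illustrative and not part of tfds.

```python
NUM_EXAMPLES = 1000  # hypothetical split size; 1% is exactly 10 records


def pct_range(start_pct, stop_pct, num_examples):
    """Record indices covered by [start_pct%:stop_pct%].

    Assumes num_examples is a multiple of 100, so no rounding is needed.
    """
    per_pct = num_examples // 100
    return range(start_pct * per_pct, stop_pct * per_pct)


folds = []
for k in range(0, 100, 10):
    # Validation fold: [k%:(k+10)%]; training fold: [:k%] + [(k+10)%:].
    val = set(pct_range(k, k + 10, NUM_EXAMPLES))
    train = (set(pct_range(0, k, NUM_EXAMPLES)) |
             set(pct_range(k + 10, 100, NUM_EXAMPLES)))
    folds.append((train, val))

# In every fold, each record is used exactly once: for training or validation.
for train, val in folds:
    assert train.isdisjoint(val)
    assert train | val == set(range(NUM_EXAMPLES))
```

When the split size is not a multiple of 100, the boundaries are rounded, so fold sizes may differ slightly.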
Examples using the ReadInstruction API (equivalent to the above):
# The full `train` split.
train_ds = tfds.load('mnist', split=tfds.core.ReadInstruction('train'))
# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = tfds.load('mnist', split=[
    tfds.core.ReadInstruction('train'),
    tfds.core.ReadInstruction('test'),
])
# The full `train` and `test` splits, interleaved together.
ri = tfds.core.ReadInstruction('train') + tfds.core.ReadInstruction('test')
train_test_ds = tfds.load('mnist', split=ri)
# From record 10 (included) to record 20 (excluded) of `train` split.
train_10_20_ds = tfds.load('mnist', split=tfds.core.ReadInstruction(
    'train', from_=10, to=20, unit='abs'))
# The first 10% of train split.
train_10pct_ds = tfds.load('mnist', split=tfds.core.ReadInstruction(
    'train', to=10, unit='%'))
# The first 10% of train + the last 80% of train.
ri = (tfds.core.ReadInstruction('train', to=10, unit='%') +
      tfds.core.ReadInstruction('train', from_=-80, unit='%'))
train_10_80pct_ds = tfds.load('mnist', split=ri)
# 10-fold cross-validation (see also next section on rounding behavior):
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = tfds.load('mnist', split=[
    tfds.core.ReadInstruction('train', from_=k, to=k+10, unit='%')
    for k in range(0, 100, 10)])
trains_ds = tfds.load('mnist', split=[
    (tfds.core.ReadInstruction('train', to=k, unit='%') +
     tfds.core.ReadInstruction('train', from_=k+10, unit='%'))
    for k in range(0, 100, 10)])
tfds.even_splits
tfds.even_splits generates a list of non-overlapping sub-splits of the same size.
assert tfds.even_splits('train', n=3) == [
'train[0%:33%]', 'train[33%:67%]', 'train[67%:100%]',
]
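To make the boundaries in the assertion above concrete, here is a minimal sketch of one way such percent boundaries could be derived. It is not the library's implementation: the function name even_percent_splits is hypothetical, and it assumes each boundary is i * 100 / n rounded to the nearest whole percent.

```python
def even_percent_splits(split, n):
    """Returns n contiguous, non-overlapping percent slices of `split`.

    Illustrative sketch only (not tfds code). Boundary i is i * 100 / n,
    rounded to the nearest integer percent.
    """
    bounds = [round(i * 100 / n) for i in range(n + 1)]
    # Pair each boundary with the next one: (0, 33), (33, 67), (67, 100).
    return [f'{split}[{lo}%:{hi}%]' for lo, hi in zip(bounds, bounds[1:])]


thirds = even_percent_splits('train', 3)
quarters = even_percent_splits('train', 4)
```

Because consecutive slices share a boundary, they never overlap and together cover the whole split.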
Percentage slicing and rounding
If a slice of a split is requested using the percent (%) unit, and the requested slice boundaries do not divide the split evenly into hundredths, then the default behaviour is to round the boundaries to the nearest integer (closest). This means that some slices may contain more examples than others. For example:
# Assuming the "test" split contains 101 records.
# 100 records, from 0 to 100.
tfds.load("mnist", split="test[:99%]")
# 2 records, from 49 to 51.
tfds.load("mnist", split="test[49%:50%]")
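The arithmetic behind those record counts can be sketched in plain Python. This is an illustration, not the library's code: the function names are hypothetical, and the sketch assumes that exact halves round up, which is what the numbers in the example imply. The library's actual tie-breaking may differ.

```python
def closest_boundary(pct, num_examples):
    """Converts a non-negative percent boundary to a record index.

    Rounds pct * num_examples / 100 to the nearest integer using integer
    arithmetic. Assumption: exact halves round up.
    """
    return (2 * pct * num_examples + 100) // 200


def closest_slice(start_pct, stop_pct, num_examples):
    """Returns (first index included, first index excluded)."""
    return (closest_boundary(start_pct, num_examples),
            closest_boundary(stop_pct, num_examples))


NUM_RECORDS = 101  # same hypothetical split size as in the example

first_99pct = closest_slice(0, 99, NUM_RECORDS)  # 99.99 rounds to 100
one_pct = closest_slice(49, 50, NUM_RECORDS)     # 49.49 -> 49, 50.5 -> 51
```

The one-percent slice ends up holding 2 records rather than 1, which is the size imbalance described above.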
Alternatively, the user can use the rounding pct1_dropremainder, so that the specified percentage boundaries are treated as multiples of 1%. This option should be used when consistency is needed (e.g. len(5%) == 5 * len(1%)). It means that the last examples may be truncated if info.splits[split_name].num_examples % 100 != 0.
Example:
# Records 0 (included) to 99 (excluded).
split = tfds.core.ReadInstruction(
    'test',
    to=99,
    rounding='pct1_dropremainder',
    unit='%',
)
tfds.load("mnist", split=split)
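The consistency property of pct1_dropremainder can be illustrated with a small sketch. The function names below are hypothetical and the code is not taken from tfds; it assumes that 1% corresponds to num_examples // 100 records and that the remainder is dropped. It also rejects splits with fewer than 100 records, since 1% would then be zero records.

```python
def pct1_boundary(pct, num_examples):
    """Converts a non-negative percent boundary to a record index.

    Assumption: 1% is num_examples // 100 records; leftover records at the
    end of the split are never reached.
    """
    if num_examples < 100:
        raise ValueError('this sketch needs at least 100 records')
    return pct * (num_examples // 100)


def pct1_len(start_pct, stop_pct, num_examples):
    """Number of records in [start_pct%:stop_pct%]."""
    return (pct1_boundary(stop_pct, num_examples) -
            pct1_boundary(start_pct, num_examples))


# With 101 records, [:99%] covers records 0 (included) to 99 (excluded).
end_of_99pct = pct1_boundary(99, 101)

# Consistency: a 5% slice is exactly five times a 1% slice.
five_pct = pct1_len(0, 5, 250)
one_pct = pct1_len(0, 1, 250)

# With 250 records, 1% is 2 records, so [:100%] stops at 200 and the last
# 50 records are dropped.
end_of_full = pct1_boundary(100, 250)
```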
Reproducibility
The sub-split API guarantees that any given split slice (or ReadInstruction) will always produce the same set of records on a given dataset, as long as the major version of the dataset is constant.
For example, tfds.load("mnist:3.0.0", split="train[10:20]") and tfds.load("mnist:3.2.0", split="train[10:20]") will always contain the same elements, regardless of platform, architecture, etc., even though some of the records might have different values (e.g. image encoding, label, ...).
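A pipeline that depends on this guarantee might want to check that two dataset specifications share a major version before comparing slices. The sketch below is plain Python with hypothetical helper names; it assumes specifications of the form name:MAJOR.MINOR.PATCH with purely numeric components and does not handle wildcards or missing versions.

```python
def parse_name_and_version(spec):
    """Splits 'mnist:3.2.0' into ('mnist', (3, 2, 0)).

    Assumes a fully numeric 'name:MAJOR.MINOR.PATCH' specification.
    """
    name, _, version = spec.partition(':')
    return name, tuple(int(part) for part in version.split('.'))


def same_major_version(spec_a, spec_b):
    """True if both specs name the same dataset with the same major version."""
    name_a, version_a = parse_name_and_version(spec_a)
    name_b, version_b = parse_name_and_version(spec_b)
    return name_a == name_b and version_a[0] == version_b[0]


stable = same_major_version('mnist:3.0.0', 'mnist:3.2.0')
changed = same_major_version('mnist:3.0.0', 'mnist:4.0.0')
```

Here stable is True, so the same slice is expected to select the same records in both versions, while changed is False and no such expectation holds.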