
Splits and slicing

Every DatasetBuilder exposes subsets of its data defined as splits (for example: train, test). When constructing a tf.data.Dataset instance with tfds.load() or tfds.DatasetBuilder.as_dataset(), you can specify which split(s) to retrieve. You can also retrieve slices of splits, as well as combinations of splits and slices.

Slicing API

Slicing instructions are passed through the split argument of tfds.load or tfds.DatasetBuilder.as_dataset.

Instructions can be provided either as strings or as ReadInstruction objects. Strings are more compact and readable for simple cases, while ReadInstruction objects provide more options and may be easier to use when the slicing parameters are variables.

Examples

Examples using the string API:

import tensorflow_datasets as tfds

# The full `train` split.
train_ds = tfds.load('mnist', split='train')

# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = tfds.load('mnist', split=['train', 'test'])

# The full `train` and `test` splits, interleaved together.
train_test_ds = tfds.load('mnist', split='train+test')

# From record 10 (included) to record 20 (excluded) of `train` split.
train_10_20_ds = tfds.load('mnist', split='train[10:20]')

# The first 10% of train split.
train_10pct_ds = tfds.load('mnist', split='train[:10%]')

# The first 10% of train + the last 80% of train.
train_10_80pct_ds = tfds.load('mnist', split='train[:10%]+train[-80%:]')

# 10-fold cross-validation (see also next section on rounding behavior):
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = tfds.load('mnist', split=[
    f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)
])
trains_ds = tfds.load('mnist', split=[
    f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10)
])
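
To see exactly which split strings the two comprehensions above produce, they can be evaluated on their own, without loading any data. The following is a minimal sketch; the helper name kfold_split_strings and its parameters are hypothetical and not part of TFDS.

```python
def kfold_split_strings(split='train', step=10):
    """Return (validation, training) split strings for percent-based folds.

    Hypothetical helper for illustration only: it builds the strings and
    never calls TFDS.
    """
    vals = [f'{split}[{k}%:{k + step}%]' for k in range(0, 100, step)]
    trains = [
        f'{split}[:{k}%]+{split}[{k + step}%:]' for k in range(0, 100, step)
    ]
    return vals, trains


vals, trains = kfold_split_strings()
# Print each validation slice next to its complementary training slices.
for val, train in zip(vals, trains):
    print(f'{val:<18} {train}')
```

Note that the first and last training strings each include a slice that selects nothing (train[:0%] and train[100%:]), which is how the [10%:100%] and [0%:90%] cases from the comments above are expressed.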

Examples using the ReadInstruction API (equivalent to the above):

# The full `train` split.
train_ds = tfds.load('mnist', split=tfds.core.ReadInstruction('train'))

# The full `train` split and the full `test` split as two distinct datasets.
train_ds, test_ds = tfds.load('mnist', split=[
    tfds.core.ReadInstruction('train'),
    tfds.core.ReadInstruction('test'),
])

# The full `train` and `test` splits, interleaved together.
ri = tfds.core.ReadInstruction('train') + tfds.core.ReadInstruction('test')
train_test_ds = tfds.load('mnist', split=ri)

# From record 10 (included) to record 20 (excluded) of `train` split.
train_10_20_ds = tfds.load('mnist', split=tfds.core.ReadInstruction(
    'train', from_=10, to=20, unit='abs'))

# The first 10% of train split.
train_10pct_ds = tfds.load('mnist', split=tfds.core.ReadInstruction(
    'train', to=10, unit='%'))

# The first 10% of train + the last 80% of train.
ri = (tfds.core.ReadInstruction('train', to=10, unit='%') +
      tfds.core.ReadInstruction('train', from_=-80, unit='%'))
train_10_80pct_ds = tfds.load('mnist', split=ri)

# 10-fold cross-validation (see also next section on rounding behavior):
# The validation datasets are each going to be 10%:
# [0%:10%], [10%:20%], ..., [90%:100%].
# And the training datasets are each going to be the complementary 90%:
# [10%:100%] (for a corresponding validation set of [0%:10%]),
# [0%:10%] + [20%:100%] (for a validation set of [10%:20%]), ...,
# [0%:90%] (for a validation set of [90%:100%]).
vals_ds = tfds.load('mnist', split=[
    tfds.core.ReadInstruction('train', from_=k, to=k+10, unit='%')
    for k in range(0, 100, 10)])
trains_ds = tfds.load('mnist', split=[
    (tfds.core.ReadInstruction('train', to=k, unit='%') +
     tfds.core.ReadInstruction('train', from_=k+10, unit='%'))
    for k in range(0, 100, 10)])

tfds.even_splits

tfds.even_splits generates a list of non-overlapping sub-splits of the same size.

assert tfds.even_splits('train', n=3) == [
    'train[0%:33%]', 'train[33%:67%]', 'train[67%:100%]',
]
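
The percent boundaries in the assertion above can be reproduced with plain arithmetic. This is a sketch only: even_split_strings is a hypothetical stand-in that rounds each boundary to a whole percent, and the way TFDS computes the boundaries internally may differ.

```python
def even_split_strings(split, n):
    """Build n contiguous, non-overlapping percent slices of `split`.

    Hypothetical stand-in for illustration; not the TFDS implementation.
    """
    # n + 1 boundaries from 0% to 100%, each rounded to a whole percent.
    bounds = [round(100 * i / n) for i in range(n + 1)]
    # Consecutive boundaries form one slice, so slices never overlap.
    return [f'{split}[{lo}%:{hi}%]' for lo, hi in zip(bounds, bounds[1:])]


thirds = even_split_strings('train', 3)
print(thirds)
```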

Percentage slicing and rounding

If a slice of a split is requested using the percent (%) unit, and the requested boundaries do not fall on a whole number of records, the default behavior is to round each boundary to the nearest integer (closest). This means that some slices may contain more examples than others. For example:

# Assuming the "test" split contains 101 records.
# 100 records, from 0 to 100.
tfds.load("mnist", split="test[:99%]")
# 2 records, from 49 to 51.
tfds.load("mnist", split="test[49%:50%]")
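
The effect of nearest-integer rounding on slice sizes can be checked with a few lines of arithmetic. This is a sketch under a stated assumption: pct_to_abs_closest is a hypothetical helper that uses Python's built-in round(), and the library may treat exact ties (values ending in .5) differently, so the example avoids them.

```python
def pct_to_abs_closest(pct, num_examples):
    """Convert a percent boundary to a record index, rounding to nearest.

    Assumption: relies on Python's built-in round(); tie handling in the
    library may differ.
    """
    return int(round(pct * num_examples / 100))


num_examples = 101  # same hypothetical split size as in the example above
# Three slices: [0%:33%], [33%:67%], [67%:100%].
bounds = [pct_to_abs_closest(p, num_examples) for p in (0, 33, 67, 100)]
sizes = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
print(bounds)  # record indices at each boundary
print(sizes)   # number of records in each slice
```

With 101 records, the middle slice ends up two records larger than the other two, even though the requested percentages look nearly equal.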

Alternatively, you can use the pct1_dropremainder rounding, in which the specified percentage boundaries are treated as multiples of 1%. Use this option when consistency is needed (for example: len(5%) == 5 * len(1%)). As a consequence, the last examples may be truncated if info.splits[split_name].num_examples % 100 != 0.

Example:

# Records 0 (included) to 99 (excluded).
split = tfds.core.ReadInstruction(
    'test',
    to=99,
    rounding='pct1_dropremainder',
    unit='%',
)
tfds.load("mnist", split=split)
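
The consistency property described above follows from treating 1% as a fixed whole number of records. The sketch below illustrates that arithmetic for non-negative boundaries only; pct_to_abs_pct1 is a hypothetical helper written from the description in this section, not a copy of the library code.

```python
def pct_to_abs_pct1(pct, num_examples):
    """Convert a non-negative percent boundary to a record index.

    Sketch of the rule described above: 1% is num_examples // 100 records,
    so any remainder at the end of the split is never reached.
    """
    return pct * (num_examples // 100)


num_examples = 101  # hypothetical split size
len_1pct = pct_to_abs_pct1(1, num_examples)
len_5pct = pct_to_abs_pct1(5, num_examples)
end_99pct = pct_to_abs_pct1(99, num_examples)
end_100pct = pct_to_abs_pct1(100, num_examples)
print(len_1pct, len_5pct, end_99pct, end_100pct)
```

Here [:99%] stops at record 99 (excluded), and even [:100%] stops at record 100, so the final record of the 101-record split is dropped.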

Reproducibility

The sub-split API guarantees that any given split slice (or ReadInstruction) will always produce the same set of records on a given dataset, as long as the major version of the dataset stays constant.

For example, tfds.load("mnist:3.0.0", split="train[10:20]") and tfds.load("mnist:3.2.0", split="train[10:20]") will always contain the same elements, regardless of platform, architecture, etc., even though some of the records might have different values (for example: image encoding, label, ...).