tfds.core.SequentialWriter

Class to write a TFDS dataset sequentially.

The SequentialWriter can be used to generate TFDS datasets by directly appending TF Examples to the desired splits.

Once the user creates a SequentialWriter with a given DatasetInfo, they can create splits, append examples to them, and close them whenever they are finished.

Note that:

  • Not closing a split may cause data to be lost.
  • The examples are written to disk in the same order that they are given to the writer.
  • Since the SequentialWriter doesn't know how many examples are going to be written, it can't estimate the optimal number of shards per split. Use the max_examples_per_shard parameter in the constructor to control how many elements there should be per shard.
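
To make the sharding note concrete, the following is a rough sketch of how a fixed cap translates into a shard count. It assumes the writer simply starts a new shard once the current one holds max_examples_per_shard examples; that rollover rule is an assumption made here for illustration. The helper name estimate_num_shards is hypothetical and not part of TFDS.

```python
import math


def estimate_num_shards(num_examples: int, max_examples_per_shard: int) -> int:
    """Estimates the shard count for one split.

    Hypothetical helper, not part of the TFDS API. Assumes a new shard is
    started whenever the current shard reaches max_examples_per_shard.
    """
    if max_examples_per_shard <= 0:
        raise ValueError("max_examples_per_shard must be positive")
    if num_examples < 0:
        raise ValueError("num_examples must be non-negative")
    # Ceiling division: a partially filled last shard still counts as a shard.
    return math.ceil(num_examples / max_examples_per_shard)


print(estimate_num_shards(2500, 1000))
```

Under this assumption, a smaller cap yields more, smaller shards, and a larger cap yields fewer, larger ones.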

The datasets written with this writer can be read directly with tfds.builder_from_directories.

Example:

writer = SequentialWriter(ds_info=ds_info, max_examples_per_shard=1000)
writer.initialize_splits(['train', 'test'])

while (...):
  # Code that generates the examples
  writer.add_examples({'train': [example1, example2],
                       'test': [example3]})
  ...

# close_splits requires a list of split names; close_all closes every open split.
writer.close_all()

Args
ds_info DatasetInfo for this dataset.
max_examples_per_shard maximum number of examples to write per shard.
overwrite if True, ignores and overwrites any existing data. Otherwise, loads the existing dataset and appends the new data (new data is always written as new shards).
file_format an entry in file_adapters.FileFormat.

Methods

add_examples

Adds examples to the splits.

Args
split_examples dictionary of split_name:list_of_examples giving the examples to add to each split. Not all existing splits have to be in the dictionary.

Raises
KeyError if any of the splits doesn't exist.
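
When examples come from a long stream, it can be convenient to pass them to add_examples in fixed-size chunks rather than all at once. The sketch below shows one way to do that. It relies only on the add_examples method documented above and accepts any object exposing it; the helper names batched and add_examples_in_batches, and the default batch_size, are hypothetical and not part of TFDS.

```python
from typing import Any, Iterable, Iterator, List


def batched(items: Iterable[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yields consecutive lists of at most batch_size items, preserving order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Any] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def add_examples_in_batches(
    writer: Any, split_name: str, examples: Iterable[Any], batch_size: int = 100
) -> int:
    """Appends examples to one split in chunks and returns how many were added.

    Hypothetical helper: `writer` is assumed to expose
    add_examples(split_examples) as described in this page.
    """
    total = 0
    for batch in batched(examples, batch_size):
        writer.add_examples({split_name: batch})
        total += len(batch)
    return total
```

Because the writer stores examples in the order it receives them, chunking this way keeps the on-disk order identical to the order of the input stream.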

close_all

Closes all the open splits.
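
Since not closing a split may cause data to be lost, one option is to wrap the writer in a context manager so that close_all runs even if example generation raises. This is a minimal sketch, not something TFDS provides: the name open_splits is hypothetical, and the writer argument is assumed to expose the initialize_splits and close_all methods documented on this page.

```python
import contextlib
from typing import Any, Iterator, Sequence


@contextlib.contextmanager
def open_splits(writer: Any, splits: Sequence[str]) -> Iterator[Any]:
    """Initializes the given splits and guarantees close_all() on exit.

    Hypothetical helper: `writer` is assumed to behave like a
    SequentialWriter (initialize_splits and close_all methods).
    """
    writer.initialize_splits(list(splits))
    try:
        yield writer
    finally:
        # Runs on normal exit and when the body raises.
        writer.close_all()
```

Usage would look like `with open_splits(writer, ['train', 'test']) as w: w.add_examples(...)`. Note that on an exception the splits are still closed, so whatever was appended before the failure is kept; whether that partial data is wanted depends on the use case.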

close_splits

Closes the given list of splits.

Args
splits list of split names.

Raises
KeyError if any of the splits doesn't exist.

initialize_splits

Adds new splits to the dataset.

Args
splits list of split names to add.
fail_if_exists whether to fail if a split already contains data.

Raises
KeyError if the split is already present.