tfds.testing.mock_data

Mock tfds to generate random data.

Usage

  • Usage (automated):
with tfds.testing.mock_data(num_examples=5):
  ds = tfds.load('some_dataset', split='train')

  for ex in ds:  # ds will yield randomly generated examples.
    ex

All calls to tfds.load/tfds.data_source within the context manager then return deterministic mocked data.

  • Usage (manual):

For more control over the generated examples, you can manually override the DatasetBuilder._as_dataset method:

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

num_examples = 5  # Number of fake examples to yield.

def as_dataset(self, *args, **kwargs):
  # Build a dataset of constant images with cycling labels.
  return tf.data.Dataset.from_generator(
      lambda: ({
          'image': np.ones(shape=(28, 28, 1), dtype=np.uint8),
          'label': i % 10,
      } for i in range(num_examples)),
      output_types=self.info.features.dtype,
      output_shapes=self.info.features.shape,
  )

with tfds.testing.mock_data(as_dataset_fn=as_dataset):
  ds = tfds.load('some_dataset', split='train')

  for ex in ds:  # ds will yield the fake examples produced by 'as_dataset'.
    ex

Policy

For improved results, you can copy the true metadata files (dataset_info.json, label.txt, vocabulary files) into data_dir/dataset_name/version. This allows the mocked dataset to use the true metadata computed during generation (split names, etc.).

If the metadata files are not found, the info from the original class is used instead, but the values computed during generation are not available (e.g. the split names are unknown, so any split is accepted).
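
The directory layout described above can be prepared with a few lines of standard-library Python. The sketch below is only an illustration: copy_metadata is a hypothetical helper (not part of tfds), and the file names passed to it should be whichever metadata files your prepared dataset actually contains.

```python
import shutil
from pathlib import Path


def copy_metadata(src_dir, data_dir, dataset_name, version, filenames):
    """Copy metadata files into data_dir/dataset_name/version.

    Hypothetical helper, not part of tfds. Names in `filenames` that do
    not exist in `src_dir` are skipped.
    """
    target = Path(data_dir) / dataset_name / version
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in filenames:
        src = Path(src_dir) / name
        if src.is_file():
            shutil.copy2(src, target / name)
            copied.append(name)
    return target, copied
```

The resulting data_dir can then be passed to mock_data through its data_dir argument, which is documented as being used in MockPolicy.USE_FILES mode.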

Miscellaneous

  • The examples are deterministically generated. The train and test splits will yield the same examples.
  • The actual examples are generated at random using builder.info.features.get_tensor_info().
  • The download and prepare step is always a no-op.
  • Warning: info.splits['train'].num_examples won't match len(list(ds_train)).

Some of those points could be improved. If you have suggestions or run into issues with this function, please open a new issue on our GitHub.
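
One practical consequence of the points above: because generation is deterministic, the mocked train and test splits cannot be used to check that a pipeline treats the two splits differently. The stand-in below uses Python's random module rather than tfds and sketches that behavior under the assumption that every split is produced from the same fixed seed. fake_split and its seed are hypothetical and are not the library's implementation.

```python
import random


def fake_split(num_examples, seed=0):
    """Stand-in for a mocked split: same seed, same examples.

    Hypothetical illustration only; tfds generates its examples from
    the dataset's feature specification, not from this function.
    """
    rng = random.Random(seed)  # A fixed seed makes the output repeatable.
    return [{"label": rng.randrange(10)} for _ in range(num_examples)]


train = fake_split(num_examples=5)
test = fake_split(num_examples=5)

# Both "splits" hold identical examples, and their length comes from
# num_examples rather than from any recorded split metadata.
assert train == test
assert len(train) == 5
```

In a test, this suggests counting examples by iterating over the dataset instead of relying on the num_examples value stored in the split metadata.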

Args

num_examples: Number of fake examples to generate.
num_sub_examples: Number of examples to generate in nested Dataset features.
max_value: The maximum value present in generated tensors. If max_value is None or 0, random numbers are generated in the range 0 to 255.
policy: Strategy to use to generate the fake examples. See tfds.testing.MockPolicy.
as_dataset_fn: If provided, replaces the default random example generator. This function mocks FileAdapterBuilder._as_dataset.
data_dir: Folder containing the metadata files (searched in data_dir/dataset_name/version). Overrides the data_dir kwarg of tfds.load. Used in MockPolicy.USE_FILES mode.
mock_array_record_data_source: Overrides the mock for the underlying ArrayRecord data source, if it is used.
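
The max_value fallback described above can be summarized in a small sketch. effective_max_value is a hypothetical name used only for illustration. It models the documented rule (None or 0 falls back to 255) and is not taken from the tfds source.

```python
def effective_max_value(max_value=None):
    """Return the upper bound implied by the documented max_value rule.

    Hypothetical helper: None and 0 both fall back to 255.
    """
    if not max_value:  # Covers both None and 0.
        return 255
    return max_value


print(effective_max_value(None))  # 255
print(effective_max_value(0))     # 255
print(effective_max_value(10))    # 10
```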

Yields

None