
tff.simulation.datasets.emnist.load_data

tff.simulation.datasets.emnist.load_data(
    only_digits=True,
    cache_dir=None
)

Defined in simulation/datasets/emnist/load_data.py.

Loads the Federated EMNIST dataset.

Downloads and caches the dataset locally. If previously downloaded, tries to load the dataset from cache.

This dataset is derived from the LEAF repository's (https://github.com/TalwalkarLab/leaf) preprocessing of the Extended MNIST dataset, which groups examples by writer. LEAF is described in "LEAF: A Benchmark for Federated Settings" (https://arxiv.org/abs/1812.01097).

Data set sizes:

only_digits=True: 3,383 users, 10 label classes

  • train: 341,873 examples
  • test: 40,832 examples

only_digits=False: 3,400 users, 62 label classes

  • train: 671,585 examples
  • test: 77,483 examples

Rather than holding out specific users, each user's examples are split across train and test so that every user has at least one example in train and one example in test. Writers with fewer than two examples are excluded from the data set.
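The per-writer split described above can be illustrated with a minimal sketch. This is not the LEAF preprocessing code: the function name `split_by_writer`, the `test_fraction` parameter, and the shuffling strategy are all assumptions made for illustration. It only demonstrates the two stated rules: writers with fewer than two examples are dropped, and every remaining writer keeps at least one example on each side.

```python
import random


def split_by_writer(examples_by_writer, test_fraction=0.1, seed=0):
    """Split each writer's examples into train and test (hypothetical helper).

    `test_fraction` is an assumed value, not taken from the documentation.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for writer, examples in examples_by_writer.items():
        # Writers with fewer than two examples cannot appear in both splits.
        if len(examples) < 2:
            continue
        shuffled = list(examples)
        rng.shuffle(shuffled)
        # Clamp so that both splits receive at least one example.
        n_test = round(len(shuffled) * test_fraction)
        n_test = min(max(n_test, 1), len(shuffled) - 1)
        test[writer] = shuffled[:n_test]
        train[writer] = shuffled[n_test:]
    return train, test


# Toy data: integers stand in for examples; "w2" has a single example.
toy = {"w0": list(range(10)), "w1": [0, 1], "w2": [0]}
train, test = split_by_writer(toy)
```

In this toy run, "w2" is absent from both splits, while "w0" and "w1" each appear in both.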

The tf.data.Datasets returned by tff.simulation.ClientData.create_tf_dataset_for_client will yield collections.OrderedDict objects at each iteration, with the following keys and values:

  • 'pixels': a tf.Tensor with dtype=tf.float32 and shape [28, 28], containing the pixels of the handwritten character.
  • 'label': a tf.Tensor with dtype=tf.int32 and shape [1], the class label of the corresponding pixels.
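As a quick sanity check on the element structure listed above, a small duck-typed helper can compare keys and shapes against the documented values. The helper name `check_emnist_element` is hypothetical and not part of TFF; it relies only on each value exposing a `.shape` attribute, so the sketch uses stand-in objects rather than real tensors.

```python
import collections
from types import SimpleNamespace

# Shapes as documented above: 'pixels' is [28, 28], 'label' is [1].
EXPECTED_SHAPES = {"pixels": (28, 28), "label": (1,)}


def check_emnist_element(element):
    """Raise ValueError if `element` does not match the documented structure."""
    if set(element.keys()) != set(EXPECTED_SHAPES):
        raise ValueError("unexpected keys: %r" % sorted(element.keys()))
    for key, expected in EXPECTED_SHAPES.items():
        actual = tuple(element[key].shape)
        if actual != expected:
            raise ValueError("%s has shape %r, expected %r" % (key, actual, expected))
    return True


# Stand-in element; a real one would hold tf.Tensor values instead.
fake_element = collections.OrderedDict(
    pixels=SimpleNamespace(shape=(28, 28)),
    label=SimpleNamespace(shape=(1,)),
)
ok = check_emnist_element(fake_element)
```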

Args:

  • only_digits: (Optional) whether to include only examples from the digit classes [0-9]. If False, also includes lowercase and uppercase characters, for a total of 62 class labels. Defaults to True.
  • cache_dir: (Optional) directory to cache the downloaded file. If None, caches in Keras' default cache directory.

Returns:

Tuple of (train, test) where the tuple elements are tff.simulation.ClientData objects.
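A minimal usage sketch follows. It assumes `tensorflow_federated` is installed and that the `client_ids` property is available on the returned `tff.simulation.ClientData` objects; the helper names `first_client_dataset` and `load_and_peek` are hypothetical. The TFF import is placed inside `load_and_peek` so that the block can be read and imported without triggering a download.

```python
def first_client_dataset(client_data):
    """Return (client_id, dataset) for the first client of a ClientData-like object."""
    client_id = client_data.client_ids[0]
    return client_id, client_data.create_tf_dataset_for_client(client_id)


def load_and_peek(only_digits=True, cache_dir=None):
    """Load Federated EMNIST and return the first training client's dataset.

    Calling this downloads the data on first use (or reads it from the cache).
    """
    # Imported lazily: requires tensorflow_federated to be installed.
    import tensorflow_federated as tff

    train, test = tff.simulation.datasets.emnist.load_data(
        only_digits=only_digits, cache_dir=cache_dir)
    return first_client_dataset(train)
```

Calling `load_and_peek()` returns a client ID together with that client's `tf.data.Dataset`, whose elements have the 'pixels' and 'label' entries described earlier.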