tff.simulation.datasets.cifar100.load_data

Loads a federated version of the CIFAR-100 dataset.

tff.simulation.datasets.cifar100.load_data(
    cache_dir=None
)

The dataset is downloaded and cached locally. If previously downloaded, it tries to load the dataset from cache.

The dataset is derived from the CIFAR-100 dataset. The training examples are partitioned across 500 clients and the testing examples across 100 clients. No two clients share any data samples: the train clients form a true partition of the CIFAR-100 training split, and the test clients form a true partition of the CIFAR-100 testing split. The train clients have string client IDs in the range [0-499], while the test clients have string client IDs in the range [0-99].

The data partitioning is done using a hierarchical Latent Dirichlet Allocation (LDA) process, referred to as the Pachinko Allocation Method (PAM). This method uses a two-stage LDA process: each client has an associated multinomial distribution over the coarse labels of CIFAR-100 and, for each coarse label, a coarse-to-fine multinomial distribution over the fine labels under that coarse label. The coarse label multinomial is drawn from a symmetric Dirichlet with parameter 0.1, and each coarse-to-fine multinomial distribution is drawn from a symmetric Dirichlet with parameter 10. Each client has 100 samples. To generate a sample for the client, we first select a coarse label by drawing from the coarse label multinomial distribution, and then draw a fine label using the coarse-to-fine multinomial distribution for that coarse label. We then randomly draw a sample from CIFAR-100 with that fine label (without replacement). If this exhausts the set of samples with this label, we remove the label from the coarse-to-fine multinomial and renormalize the multinomial distribution.
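The sampling procedure described above can be sketched in plain Python. This is a minimal illustration, not the code used to build the dataset: it tracks only label counts rather than images, numbers the fine labels under coarse label c as c * 5 + j (a toy convention, not the real CIFAR-100 label mapping), and the helper names `dirichlet`, `weighted_choice`, and `pam_partition` are hypothetical. It also assumes that a new client only draws from labels that still have unassigned examples, and that a coarse label is dropped once all of its fine labels are exhausted; the description above does not spell out either case.

```python
import random


def dirichlet(rng, alpha, size):
    """Symmetric Dirichlet draw, built by normalizing Gamma(alpha, 1) samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    if total <= 0.0:  # guard against floating-point underflow for small alpha
        return [1.0 / size] * size
    return [d / total for d in draws]


def weighted_choice(rng, weights):
    """Pick a key of `weights` with probability proportional to its value."""
    keys = list(weights)
    if sum(weights.values()) <= 0.0:  # degenerate case: fall back to uniform
        return rng.choice(keys)
    # random.choices normalizes the weights, so deleting a key renormalizes.
    return rng.choices(keys, weights=[weights[k] for k in keys])[0]


def pam_partition(num_clients=500, samples_per_client=100, num_coarse=20,
                  fine_per_coarse=5, samples_per_label=500, seed=0):
    """Return, for each client, the list of fine labels assigned to it."""
    total = num_coarse * fine_per_coarse * samples_per_label
    if num_clients * samples_per_client > total:
        raise ValueError('not enough examples for the requested partition')
    rng = random.Random(seed)
    # remaining[c][f]: unassigned examples with fine label f under coarse label c.
    remaining = {
        c: {c * fine_per_coarse + j: samples_per_label
            for j in range(fine_per_coarse)}
        for c in range(num_coarse)
    }
    clients = []
    for _ in range(num_clients):
        # Stage 1: coarse-label multinomial from a symmetric Dirichlet(0.1).
        coarse_p = dirichlet(rng, 0.1, num_coarse)
        # Assumption: restrict to labels that still have unassigned examples.
        coarse_w = {c: coarse_p[c] for c in remaining}
        # Stage 2: one coarse-to-fine multinomial per coarse label, Dirichlet(10).
        fine_w = {}
        for c in remaining:
            fine_p = dirichlet(rng, 10.0, fine_per_coarse)
            fine_w[c] = {f: fine_p[f - c * fine_per_coarse] for f in remaining[c]}
        labels = []
        for _ in range(samples_per_client):
            c = weighted_choice(rng, coarse_w)
            f = weighted_choice(rng, fine_w[c])
            labels.append(f)
            remaining[c][f] -= 1  # draw without replacement
            if remaining[c][f] == 0:  # fine label exhausted: remove it
                del remaining[c][f], fine_w[c][f]
                if not remaining[c]:  # assumption: drop the empty coarse label
                    del remaining[c], coarse_w[c]
        clients.append(labels)
    return clients
```

Calling `pam_partition()` with the defaults mirrors the training split (500 clients, 100 samples each, 100 fine labels with 500 examples per label). Because the coarse Dirichlet parameter is small, most clients in such a run concentrate on a few coarse labels, which is what makes the partition heterogeneous.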

Dataset sizes:

  • train: 50,000 examples
  • test: 10,000 examples

The tf.data.Datasets returned by tff.simulation.ClientData.create_tf_dataset_for_client will yield collections.OrderedDict objects at each iteration, with the following keys and values:

  • 'coarse_label': a tf.Tensor with dtype=tf.int64 and shape [1] that corresponds to the coarse label of the associated image. Labels are in the range [0-19].
  • 'image': a tf.Tensor with dtype=tf.uint8 and shape [32, 32, 3], corresponding to the pixels of the image, with values in the range [0, 255].
  • 'label': a tf.Tensor with dtype=tf.int64 and shape [1], the class label of the corresponding image. Labels are in the range [0-99].
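As an illustration of how these keys might be consumed, the following sketch converts one element into an (image, label) pair. The function name `to_model_input` and the choice to scale pixels to [0, 1] are assumptions made for this example, not part of the API; TensorFlow is imported inside the function so that defining it does not require TensorFlow to be installed.

```python
def to_model_input(element):
    """Map one dataset element to an (image, label) pair."""
    import tensorflow as tf  # lazy import: only needed when the function runs

    # 'image' is uint8 in [0, 255]; cast to float32 and scale to [0, 1].
    image = tf.cast(element['image'], tf.float32) / 255.0
    # Use the fine label in [0-99]; 'coarse_label' is ignored in this sketch.
    return image, element['label']
```

A function like this would typically be applied with `dataset.map(to_model_input)` on the tf.data.Dataset of a single client.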

Args:

  • cache_dir: (Optional) directory to cache the downloaded file. If None, caches in Keras' default cache directory.

Returns:

Tuple of (train, test) where the tuple elements are tff.simulation.ClientData objects.
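A minimal usage sketch, assuming tensorflow_federated is installed and the dataset can be downloaded: the helper `summarize_cifar100` is a hypothetical name introduced here, and the import is placed inside the function so that nothing is downloaded until it is called.

```python
def summarize_cifar100(cache_dir=None):
    """Load the federated CIFAR-100 data and report a few basic facts."""
    import tensorflow_federated as tff  # lazy import: only needed when called

    train, test = tff.simulation.datasets.cifar100.load_data(cache_dir=cache_dir)
    first_id = train.client_ids[0]  # client IDs are strings
    dataset = train.create_tf_dataset_for_client(first_id)
    return {
        'num_train_clients': len(train.client_ids),
        'num_test_clients': len(test.client_ids),
        'first_train_client_id': first_id,
        'element_spec': dataset.element_spec,
    }
```

The first call downloads the data into `cache_dir` (or Keras' default cache directory when it is None); later calls should reuse the cached copy.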