tff.simulation.datasets.celeba.load_data

Loads the Federated CelebA dataset.

Downloads and caches the dataset locally. If previously downloaded, tries to load the dataset from cache.

This dataset is derived from the LEAF repository preprocessing of the CelebA dataset, grouping examples by celebrity id. Details about LEAF were published in "LEAF: A Benchmark for Federated Settings", and details about CelebA were published in "Deep Learning Face Attributes in the Wild".

The raw CelebA dataset contains 10,177 unique identities. During LEAF preprocessing, all clients with fewer than 5 examples are removed; this leaves 9,343 clients.

The data is available with train and test splits by clients or by examples. When split by clients, ~90% of clients are assigned to the train set and ~10% to the test set, and all the examples for a given client belong to the same split. When split by examples, each client appears in both the train data and the test data, with ~90% of each client's examples selected for train and ~10% selected for test.
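The two split modes can be illustrated with a toy, pure-Python sketch. The client names, the `toy_split_by_clients` and `toy_split_by_examples` helpers, and the deterministic "first 90%" rule are assumptions made for illustration only; this is not the LEAF or TFF splitting code, which may select clients and examples differently.

```python
def toy_split_by_clients(data, train_fraction=0.9):
    """Assign whole clients to train or test; no client appears in both."""
    client_ids = sorted(data)
    cut = int(len(client_ids) * train_fraction)
    train = {cid: data[cid] for cid in client_ids[:cut]}
    test = {cid: data[cid] for cid in client_ids[cut:]}
    return train, test


def toy_split_by_examples(data, train_fraction=0.9):
    """Keep every client in both splits; divide each client's examples."""
    train, test = {}, {}
    for cid, examples in data.items():
        cut = int(len(examples) * train_fraction)
        train[cid] = examples[:cut]
        test[cid] = examples[cut:]
    return train, test


# Toy federated data: 10 hypothetical clients with 10 examples each.
toy_data = {f"client_{i:02d}": list(range(10)) for i in range(10)}

clients_train, clients_test = toy_split_by_clients(toy_data)
examples_train, examples_test = toy_split_by_examples(toy_data)
```

In the first mode a client's examples never cross the train/test boundary, which suits evaluating on unseen clients; in the second, every client contributes to both splits.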

Data set sizes:

split_by_clients=True:

  • train: 8,408 clients, 180,429 total examples
  • test: 935 clients, 19,859 total examples

split_by_clients=False:

  • train: 9,343 clients, 177,457 total examples
  • test: 9,343 clients, 22,831 total examples
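As a quick consistency check on the figures above, the arithmetic below confirms that both split modes account for the same clients and examples. Only the numbers listed in this section are used; the variable names are illustrative.

```python
# Sizes copied from the lists above: (number of clients, number of examples).
by_clients = {"train": (8_408, 180_429), "test": (935, 19_859)}
by_examples = {"train": (9_343, 177_457), "test": (9_343, 22_831)}

# Split by clients: the client sets are disjoint, so client counts add up.
total_clients = by_clients["train"][0] + by_clients["test"][0]

# Both modes partition the same pool of examples.
total_examples_by_clients = by_clients["train"][1] + by_clients["test"][1]
total_examples_by_examples = by_examples["train"][1] + by_examples["test"][1]

# Fraction of all examples that land in the train split in each mode.
train_fraction_by_clients = by_clients["train"][1] / total_examples_by_clients
train_fraction_by_examples = by_examples["train"][1] / total_examples_by_examples

print(total_clients)               # 9343
print(total_examples_by_clients)   # 200288
print(total_examples_by_examples)  # 200288
print(train_fraction_by_clients, train_fraction_by_examples)
```

Both train fractions come out close to the ~90% described above.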

The tf.data.Datasets returned by tff.simulation.datasets.ClientData.create_tf_dataset_for_client will yield collections.OrderedDict objects at each iteration. These objects have a key/value pair storing the image of the celebrity:

  • 'image': a tf.Tensor with dtype=tf.int64 and shape [84, 84, 3], containing the red/green/blue pixels of the image. Each value is in the range [0, 255].

The OrderedDict objects also contain an additional 40 key/value pairs for the celebrity image attributes, each of the format:

  • {attribute name}: a tf.Tensor with dtype=tf.bool and shape [1], set to True if the celebrity has this attribute in the image, or False if they don't.

The attribute names are: 'five_o_clock_shadow', 'arched_eyebrows', 'attractive', 'bags_under_eyes', 'bald', 'bangs', 'big_lips', 'big_nose', 'black_hair', 'blond_hair', 'blurry', 'brown_hair', 'bushy_eyebrows', 'chubby', 'double_chin', 'eyeglasses', 'goatee', 'gray_hair', 'heavy_makeup', 'high_cheekbones', 'male', 'mouth_slightly_open', 'mustache', 'narrow_eyes', 'no_beard', 'oval_face', 'pale_skin', 'pointy_nose', 'receding_hairline', 'rosy_cheeks', 'sideburns', 'smiling', 'straight_hair', 'wavy_hair', 'wearing_earrings', 'wearing_hat', 'wearing_lipstick', 'wearing_necklace', 'wearing_necktie', 'young'
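A small sanity check for the element structure described above might look like the following sketch. `ATTRIBUTE_NAMES` is copied from the list above; the `missing_or_unexpected_keys` helper is a hypothetical name, not part of TFF. It only inspects keys, so it should work on any mapping, including the OrderedDict objects yielded by a client dataset.

```python
ATTRIBUTE_NAMES = (
    "five_o_clock_shadow", "arched_eyebrows", "attractive", "bags_under_eyes",
    "bald", "bangs", "big_lips", "big_nose", "black_hair", "blond_hair",
    "blurry", "brown_hair", "bushy_eyebrows", "chubby", "double_chin",
    "eyeglasses", "goatee", "gray_hair", "heavy_makeup", "high_cheekbones",
    "male", "mouth_slightly_open", "mustache", "narrow_eyes", "no_beard",
    "oval_face", "pale_skin", "pointy_nose", "receding_hairline",
    "rosy_cheeks", "sideburns", "smiling", "straight_hair", "wavy_hair",
    "wearing_earrings", "wearing_hat", "wearing_lipstick", "wearing_necklace",
    "wearing_necktie", "young",
)

# Every element should carry the image plus the 40 attribute entries.
EXPECTED_KEYS = frozenset(ATTRIBUTE_NAMES) | {"image"}


def missing_or_unexpected_keys(element):
    """Return (missing, unexpected) key sets for one dataset element."""
    keys = set(element.keys())
    return EXPECTED_KEYS - keys, keys - EXPECTED_KEYS


# Stand-in element with placeholder values instead of tensors.
fake_element = {name: False for name in ATTRIBUTE_NAMES}
fake_element["image"] = None
missing, unexpected = missing_or_unexpected_keys(fake_element)
```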

Args:

  • split_by_clients: There are 9,343 clients in the federated CelebA dataset with 5 or more examples. If this argument is True, clients are divided into train and test groups, with 8,408 and 935 clients respectively. If this argument is False, the data is divided by examples instead, i.e., all clients participate in both the train and test groups, with ~90% of the examples belonging to the train group and the rest belonging to the test group.
  • cache_dir: (Optional) Directory in which to cache the downloaded file. If None, caches in Keras' default cache directory.

Returns:

A tuple of (train, test), where the tuple elements are tff.simulation.datasets.ClientData objects.
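A minimal usage sketch follows, assuming tensorflow and tensorflow_federated are installed and the dataset can be downloaded. The `to_image_and_label` and `inspect_first_client` helpers are hypothetical names, and using 'smiling' as the label is an arbitrary choice for illustration.

```python
def to_image_and_label(element):
    """Map one element to (float image in [0, 1], integer 'smiling' label)."""
    import tensorflow as tf  # imported lazily so this module loads without TF

    image = tf.cast(element["image"], tf.float32) / 255.0
    label = tf.cast(element["smiling"], tf.int32)
    return image, label


def inspect_first_client(split_by_clients=True, cache_dir=None):
    """Load the data and return summary info for the first train client."""
    import tensorflow_federated as tff

    # Returns a (train, test) tuple of ClientData objects.
    train, test = tff.simulation.datasets.celeba.load_data(
        split_by_clients=split_by_clients, cache_dir=cache_dir
    )
    client_id = train.client_ids[0]
    dataset = train.create_tf_dataset_for_client(client_id)
    # Take a single preprocessed example from this client's dataset.
    image, label = next(iter(dataset.map(to_image_and_label)))
    return len(train.client_ids), len(test.client_ids), image.shape, label.shape
```

Calling `inspect_first_client()` downloads and caches the dataset on first use, so the first call can take a while.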