tff.simulation.datasets.SqlClientData

A tff.simulation.datasets.ClientData backed by an SQL file.

Inherits From: ClientData

This class expects that the SQL file has two tables: examples and client_metadata.

Each row of the examples table corresponds to a sample in the dataset. This table must contain at least the following three columns:

  • split_name: TEXT column used to split test, holdout, and training examples.
  • client_id: TEXT column identifying which user the example belongs to.
  • serialized_example_proto: A serialized tf.train.Example protocol buffer containing the example data.

Each row of the client_metadata table corresponds to a client in the dataset. This table must contain at least the following three columns:

  • client_id: TEXT column used to identify the client.
  • split_name: TEXT column used to split test, holdout, and training examples.
  • num_examples: INTEGER column containing the number of examples held by this client.
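The two-table layout described above can be sketched with Python's standard `sqlite3` module. This is a minimal illustration, assuming the SQL file is a SQLite database and that `serialized_example_proto` is stored as a BLOB; the helper name `build_example_db`, the client IDs, and the placeholder byte strings are hypothetical. Real rows would hold the output of `tf.train.Example(...).SerializeToString()`.

```python
import os
import sqlite3
import tempfile


def build_example_db(path):
    """Create a small database with the examples and client_metadata tables."""
    conn = sqlite3.connect(path)
    with conn:  # commits the transaction on success
        conn.execute(
            "CREATE TABLE examples ("
            "split_name TEXT, client_id TEXT, serialized_example_proto BLOB)"
        )
        conn.execute(
            "CREATE TABLE client_metadata ("
            "client_id TEXT, split_name TEXT, num_examples INTEGER)"
        )
        # Placeholder payloads (assumption): a real database would store
        # serialized tf.train.Example protocol buffers here.
        examples = [
            ("train", "client_a", b"example-0"),
            ("train", "client_a", b"example-1"),
            ("test", "client_b", b"example-2"),
        ]
        conn.executemany("INSERT INTO examples VALUES (?, ?, ?)", examples)
        # Derive one metadata row per (client, split) from the examples table.
        conn.execute(
            "INSERT INTO client_metadata "
            "SELECT client_id, split_name, COUNT(*) FROM examples "
            "GROUP BY client_id, split_name"
        )
    conn.close()


db_path = os.path.join(tempfile.mkdtemp(), "example.sqlite")
build_example_db(db_path)
```

Deriving `client_metadata` from `examples` keeps `num_examples` consistent with the rows actually stored, which is one simple way to satisfy both table requirements at once.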

Args
database_filepath A str filepath to a SQL database.
split_name An optional str identifier for the split of the database to use. This filters clients and examples based on the split_name column. A value of None means no filtering, selecting all examples.
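The effect of split_name can be pictured as an optional WHERE clause on the split_name column. The sketch below uses plain `sqlite3` and is only an illustration of the filtering semantics, not the query the class itself issues; `select_client_ids` and the sample rows are hypothetical.

```python
import sqlite3


def select_client_ids(conn, split_name=None):
    """Return sorted client IDs, optionally restricted to one split."""
    if split_name is None:
        # None means no filtering: every client is selected.
        query = "SELECT DISTINCT client_id FROM client_metadata"
        params = ()
    else:
        query = (
            "SELECT DISTINCT client_id FROM client_metadata "
            "WHERE split_name = ?"
        )
        params = (split_name,)
    return sorted(row[0] for row in conn.execute(query, params))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE client_metadata ("
    "client_id TEXT, split_name TEXT, num_examples INTEGER)"
)
conn.executemany(
    "INSERT INTO client_metadata VALUES (?, ?, ?)",
    [("client_a", "train", 2), ("client_b", "test", 1)],
)

train_ids = select_client_ids(conn, "train")  # only clients in the train split
all_ids = select_client_ids(conn)             # no filter applied
```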

Attributes
client_ids A list of string identifiers for clients in this dataset.
dataset_computation A tff.Computation accepting a client ID, returning a dataset.

element_type_structure The element type information of the client datasets, that is, the type of the elements returned by datasets in this ClientData object.

serializable_dataset_fn A callable accepting a client ID and returning a tf.data.Dataset.

Note that this callable must be traceable by TF, as it will be used in the context of a tf.function.

Methods

create_tf_dataset_for_client

Creates a new tf.data.Dataset containing the client training examples.

This function creates a dataset for a given client if client_id is contained in the client_ids property of the SqlClientData. Unlike self.serializable_dataset_fn, this method is not serializable.

Args
client_id The string identifier for the desired client.

Returns
A tf.data.Dataset object.

create_tf_dataset_from_all_clients

Creates a new tf.data.Dataset containing all client examples.

This function is intended for use in training centralized, non-distributed models (num_clients=1). This can be useful as a point of comparison against federated models.

Currently, the implementation produces a dataset in which all examples from a single client appear together and in order, so additional shuffling should generally be performed.

Args
seed Optional, a seed to determine the order in which clients are processed in the joined dataset. The seed can be any nonnegative 32-bit integer, an array of such integers, or None.

Returns
A tf.data.Dataset object.
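The ordering behavior described for create_tf_dataset_from_all_clients can be illustrated with a pure-Python analogy. This sketch is not the library's implementation: it uses lists in place of tf.data.Dataset objects and `random.Random` in place of whatever generator the library uses, and `join_all_clients` is a hypothetical name.

```python
import random


def join_all_clients(client_examples, seed=None):
    """Concatenate per-client example lists in a seeded random client order."""
    client_ids = sorted(client_examples)
    # The seed only controls the order in which clients are visited.
    random.Random(seed).shuffle(client_ids)
    joined = []
    for client_id in client_ids:
        # Each client's examples stay contiguous and in their original order.
        joined.extend(client_examples[client_id])
    return joined


client_examples = {"a": ["a0", "a1"], "b": ["b0"], "c": ["c0", "c1"]}
joined = join_all_clients(client_examples, seed=0)
```

Because each client's examples form one contiguous run, a model trained on the joined data without further shuffling would see the clients one after another, which is why additional shuffling is recommended.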

datasets

Yields the tf.data.Dataset for each client in random order.

This function is intended for use in building a static array of client data to be provided to the top-level federated computation.

Args
limit_count Optional, a maximum number of datasets to return.
seed Optional, a seed to determine the order in which clients are processed in the joined dataset. The seed can be any nonnegative 32-bit integer, an array of such integers, or None.
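As a rough picture of how limit_count and seed interact in datasets, the following pure-Python generator yields one per-client list at a time in a seeded random order. It is an analogy under stated assumptions, not the library's code: lists stand in for tf.data.Dataset objects and `iter_client_datasets` is a hypothetical name.

```python
import itertools
import random


def iter_client_datasets(client_examples, limit_count=None, seed=None):
    """Yield each client's examples in random order, up to limit_count clients."""
    client_ids = sorted(client_examples)
    random.Random(seed).shuffle(client_ids)
    # islice with a stop of None yields every client.
    for client_id in itertools.islice(client_ids, limit_count):
        yield client_examples[client_id]


client_examples = {"a": ["a0", "a1"], "b": ["b0"], "c": ["c0", "c1"]}
first_two = list(iter_client_datasets(client_examples, limit_count=2, seed=0))
everything = list(iter_client_datasets(client_examples, seed=0))
```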

from_clients_and_tf_fn

Constructs a ClientData based on the given function.

Args
client_ids A non-empty list of strings to use as input to serializable_dataset_fn.
serializable_dataset_fn A function that takes a client_id from the above list, and returns a tf.data.Dataset. This function must be serializable and usable within the context of a tf.function and tff.Computation.

Raises
TypeError If serializable_dataset_fn is a tff.Computation.

Returns
A ClientData object.

preprocess

Applies preprocess_fn to each client's data.

Args
preprocess_fn A callable accepting a tf.data.Dataset and returning a preprocessed tf.data.Dataset. This function must be traceable by TF.

Returns
A tff.simulation.datasets.ClientData.

Raises
IncompatiblePreprocessFnError If preprocess_fn is a tff.Computation.

train_test_client_split

Returns a pair of (train, test) ClientData.

This method partitions the clients of client_data into two ClientData objects with disjoint sets of ClientData.client_ids. All clients in the test ClientData are guaranteed to have non-empty datasets, but the training ClientData may have clients with no data.

Args
client_data The base ClientData to split.
num_test_clients How many clients to hold out for testing. This can be at most len(client_data.client_ids) - 1, since we don't want to produce empty ClientData.
seed Optional seed to fix shuffling of clients before splitting. The seed can be any nonnegative 32-bit integer, an array of such integers, or None.

Returns
A pair (train_client_data, test_client_data), where test_client_data has num_test_clients selected at random, subject to the constraint they each have at least 1 batch in their dataset.

Raises
ValueError If num_test_clients cannot be satisfied by client_data, or too many clients have empty datasets.
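The contract of train_test_client_split (disjoint client sets, non-empty test clients, ValueError when the request cannot be met) can be sketched in plain Python. This is an illustration of the documented behavior, not the library's implementation: `split_clients` is a hypothetical helper that works on a mapping from client ID to example count, and rejecting a non-positive num_test_clients is an assumption.

```python
import random


def split_clients(num_examples_by_client, num_test_clients, seed=None):
    """Split client IDs into (train_ids, test_ids) with non-empty test clients."""
    client_ids = sorted(num_examples_by_client)
    # At most len(client_ids) - 1 test clients, so the train side is not empty.
    # Rejecting values below 1 is an assumption made for this sketch.
    if not 0 < num_test_clients < len(client_ids):
        raise ValueError("num_test_clients cannot be satisfied")
    random.Random(seed).shuffle(client_ids)
    train_ids, test_ids = [], []
    for client_id in client_ids:
        has_data = num_examples_by_client[client_id] > 0
        if len(test_ids) < num_test_clients and has_data:
            test_ids.append(client_id)
        else:
            # Empty clients, and any clients left over, go to the train side.
            train_ids.append(client_id)
    if len(test_ids) < num_test_clients:
        raise ValueError("too many clients have empty datasets")
    return train_ids, test_ids


counts = {"a": 3, "b": 0, "c": 1, "d": 2}
train_ids, test_ids = split_clients(counts, num_test_clients=2, seed=1)
```

Note that the empty client "b" can only ever land in the training set, which mirrors the statement that the training ClientData may contain clients with no data.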