API for using the service.

This module contains:

  1. server implementations for running the service.
  2. A distribute dataset transformation that moves a dataset's preprocessing to happen in the service.

The service offers a way to improve training speed when the host attached to a training device can't keep up with the data consumption of the model. For example, suppose a host can generate 100 examples/second, but the model can process 200 examples/second. Training speed could be doubled by using the service to generate 200 examples/second.

Before using the service

There are a few things to do before using the service to speed up training.

Understand processing_mode

The service uses a cluster of workers to prepare data for training your model. The processing_mode argument to describes how to leverage multiple workers to process the input dataset. Currently, there are two processing modes to choose from: "distributed_epoch" and "parallel_epochs".

"distributed_epoch" means that the dataset will be split across all service workers. The dispatcher produces "splits" for the dataset and sends them to workers for further processing. For example, if a dataset begins with a list of filenames, the dispatcher will iterate through the filenames and send the filenames to workers, which will perform the rest of the dataset transformations on those files. "distributed_epoch" is useful when your model needs to see each element of the dataset exactly once, or if it needs to see the data in a generally-sequential order. "distributed_epoch" only works for datasets with splittable sources, such as Dataset.from_tensor_slices, Dataset.list_files, or Dataset.range.

"parallel_epochs" means that the entire input dataset will be processed independently by each of the service workers. For this reason, it is important to shuffle data (e.g. filenames) non-deterministically, so that each worker will process the elements of the dataset in a different order. "parallel_epochs"