The TFX User Guide

Introduction

TFX is a Google-production-scale machine learning platform based on TensorFlow. It provides a configuration framework and shared libraries to integrate common components needed to define, launch, and monitor your machine learning system.

Installation

Python PyPI

pip install tensorflow
pip install tfx

Core Concepts

TFX Pipelines

A TFX pipeline defines a data flow through several components, with the goal of implementing a specific ML task (e.g., building and deploying a regression model for specific data). Pipeline components are built upon TFX libraries. The result of a pipeline is a TFX deployment target and/or service of an inference request.

Artifacts

In a pipeline, an artifact is a unit of data that is passed between components. Generally, components have at least one input artifact and one output artifact. All artifacts must have associated metadata, which defines the type and properties of the artifact. Artifacts must be strongly typed with an artifact type registered in the ML Metadata store. The concepts of artifact and artifact type originate from the data model that ML Metadata defines, as described in this document. TFX defines and implements its own artifact type ontology to realize its higher-level functionality. As of TFX 0.14, 10 known artifact types are defined and used throughout the TFX system.

An artifact type has a unique name and a schema of the properties of its instances. TFX uses the artifact type to describe how an artifact is used by components in the pipeline, not necessarily what the artifact's content physically is on a filesystem.

For instance, the Examples artifact type may represent examples materialized as TFRecord files of tensorflow::Example protocol buffers, CSV, JSON, or any other physical format. Regardless, the way Examples are used in a pipeline is exactly the same: they are analyzed to generate statistics, validated against an expected schema, preprocessed before training, supplied to a Trainer to train models, and so forth. Likewise, the Model artifact type may represent trained model objects exported in various physical formats such as TensorFlow SavedModel, ONNX, PMML, or PKL (of various types of model objects in Python). In any case, models are always evaluated, analyzed, and deployed for serving in pipelines.

NOTE: As of TFX 0.14, an Examples artifact is assumed to contain tensorflow::Example protocol buffers in gzip-compressed TFRecord format. A Model artifact is assumed to be a TensorFlow SavedModel. Future versions of TFX may expand those artifact types to support more variants.

To differentiate such possible variants of the same artifact type, ML Metadata defines a set of artifact properties. For instance, one such artifact property for an Examples artifact may be format, whose value may be one of TFRecord, JSON, CSV, and so forth. Artifacts of type Examples can always be passed to a component that is designed to take Examples as an input artifact (for example, a Trainer). However, the actual implementation of the consuming component may adjust its behavior in response to a particular value of the format property, or simply raise a runtime error if it has no implementation for that particular format of Examples.
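
The relationship between an artifact's type and its properties can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the TFX or ML Metadata API: the Artifact class, the SUPPORTED_FORMATS set, and the consume_examples function are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Artifact:
    """Hypothetical stand-in for a typed artifact; not the TFX class."""
    type_name: str                       # e.g. "Examples" or "Model"
    uri: str                             # where the payload lives
    properties: Dict[str, str] = field(default_factory=dict)


# Formats this illustrative consumer knows how to read (an assumption).
SUPPORTED_FORMATS = {"TFRecord", "CSV"}


def consume_examples(artifact: Artifact) -> str:
    """Accept any Examples artifact, then branch on its format property."""
    # The artifact type decides whether the component accepts the input.
    if artifact.type_name != "Examples":
        raise TypeError(f"expected Examples, got {artifact.type_name}")
    # The property decides how the payload is read, or whether it can be.
    fmt = artifact.properties.get("format", "TFRecord")
    if fmt not in SUPPORTED_FORMATS:
        raise RuntimeError(f"no reader implemented for format {fmt!r}")
    return f"reading {artifact.uri} as {fmt}"
```

Calling consume_examples with an Examples artifact whose format is CSV succeeds, while the same artifact type with format JSON is accepted by type but rejected at run time, which mirrors the behavior described above.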

In summary, artifact types define the ontology of artifacts in the entire TFX pipeline system, whereas artifact properties define the ontology specific to an artifact type. Users of the pipeline system can choose to extend such ontology locally to their pipeline applications, by defining and populating new custom properties. Users can also choose to extend the ontology globally for the system as a whole, by introducing new artifact types, and/or modifying predefined type-properties, in which case such extension would be contributed back to the master repository of the pipeline system (the TFX repository).

TFX Pipeline Components

A TFX pipeline is a sequence of components that implement an ML pipeline which is specifically designed for scalable, high-performance machine learning tasks. That includes modeling, training, serving inference, and managing deployments to online, native mobile, and JavaScript targets.

A TFX pipeline typically includes the following components:

  • ExampleGen is the initial input component of a pipeline that ingests and optionally splits the input dataset.

  • StatisticsGen calculates statistics for the dataset.

  • SchemaGen examines the statistics and creates a data schema.

  • ExampleValidator looks for anomalies and missing values in the dataset.

  • Transform performs feature engineering on the dataset.

  • Trainer trains the model.

  • Evaluator performs deep analysis of the training results.

  • ModelValidator helps you validate your exported models, ensuring that they are "good enough" to be pushed to production.

  • Pusher deploys the model on a serving infrastructure.

This diagram illustrates the flow of data between these components:

[Figure: Component Flow]

Anatomy of a Component

TFX components consist of three main pieces:

  • Driver
  • Executor
  • Publisher

[Figure: Component Anatomy]

Driver and Publisher

The driver supplies metadata to the executor by querying the metadata store, while the publisher accepts the results of the executor and stores them in metadata. As a developer you will typically not need to interact with the driver and publisher directly, but messages logged by the driver and publisher may be useful during debugging. See Troubleshooting.

Executor

The executor is where a component performs its processing. As a developer you write code which runs in the executor, based on the requirements of the classes which implement the type of component that you're working with. For example, when you're working on a Transform component you will need to develop a preprocessing_fn.
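
The division of labor between driver, executor, and publisher can be sketched as follows. This is a simplified illustration, not TFX's implementation: InMemoryMetadataStore and run_component are hypothetical names, and the store is a plain list rather than ML Metadata.

```python
from typing import Callable, Dict, List


class InMemoryMetadataStore:
    """Hypothetical stand-in for a metadata store (not the MLMD API)."""

    def __init__(self) -> None:
        self.records: List[Dict[str, str]] = []

    def latest(self, artifact_type: str) -> Dict[str, str]:
        """Return the most recently published artifact of the given type."""
        matches = [r for r in self.records if r["type"] == artifact_type]
        if not matches:
            raise LookupError(f"no {artifact_type} artifact recorded")
        return matches[-1]

    def put(self, record: Dict[str, str]) -> None:
        self.records.append(record)


def run_component(
    store: InMemoryMetadataStore,
    input_type: str,
    output_type: str,
    executor: Callable[[str], str],
) -> Dict[str, str]:
    # Driver: resolve the input artifact by querying the metadata store.
    input_record = store.latest(input_type)
    # Executor: the user-supplied processing step does the actual work.
    output_uri = executor(input_record["uri"])
    # Publisher: record the result so downstream components can find it.
    output_record = {"type": output_type, "uri": output_uri}
    store.put(output_record)
    return output_record
```

In this sketch only the executor argument contains code you would write yourself; the lookup before it and the bookkeeping after it correspond to the driver and publisher roles.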

TFX Libraries

TFX includes both libraries and pipeline components. This diagram illustrates the relationships between TFX libraries and pipeline components:

[Figure: Libraries and Components]

TFX provides several Python packages that are the libraries which are used to create pipeline components. You'll use these libraries to create the components of your pipelines so that your code can focus on the unique aspects of your pipeline.

TFX libraries include:

  • TensorFlow Data Validation (TFDV) is a library for analyzing and validating machine learning data. It is designed to be highly scalable and to work well with TensorFlow and TFX. TFDV includes:

    • Scalable calculation of summary statistics of training and test data.
    • Integration with a viewer for data distributions and statistics, as well as faceted comparison of pairs of datasets (Facets).
    • Automated data-schema generation to describe expectations about data like required values, ranges, and vocabularies.
    • A schema viewer to help you inspect the schema.
    • Anomaly detection to identify anomalies, such as missing features, out-of-range values, or wrong feature types, to name a few.
    • An anomalies viewer so that you can see what features have anomalies and learn more in order to correct them.
  • TensorFlow Transform (TFT) is a library for preprocessing data with TensorFlow. TensorFlow Transform is useful for preprocessing that requires a full pass over the data, for example to:

    • Normalize an input value by mean and standard deviation.
    • Convert strings to integers by generating a vocabulary over all input values.
    • Convert floats to integers by assigning them to buckets based on the observed data distribution.
  • TensorFlow is used for training models with TFX. It ingests training data and modeling code and creates a SavedModel result. It also integrates a feature engineering pipeline created by TensorFlow Transform for preprocessing input data.

  • TensorFlow Model Analysis (TFMA) is a library for evaluating TensorFlow models. It is used along with TensorFlow to create an EvalSavedModel, which becomes the basis for its analysis. It allows users to evaluate their models on large amounts of data in a distributed manner, using the same metrics defined in their trainer. These metrics can be computed over different slices of data and visualized in Jupyter notebooks.

  • TensorFlow Metadata (TFMD) provides standard representations for metadata that are useful when training machine learning models with TensorFlow. The metadata may be produced by hand or automatically during input data analysis, and may be consumed for data validation, exploration, and transformation. The metadata serialization formats include:

    • A schema describing tabular data (e.g., tf.Examples).
    • A collection of summary statistics over such datasets.
  • ML Metadata (MLMD) is a library for recording and retrieving metadata associated with ML developer and data scientist workflows. Most often the metadata uses TFMD representations. MLMD manages persistence using SQLite, MySQL, and other similar data stores.
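
The three full-pass operations listed for TensorFlow Transform above share one pattern: an analysis phase scans the whole dataset once to compute constants, and a per-example phase then applies them. The sketch below illustrates that pattern with the Python standard library only. It does not use the TensorFlow Transform API; Analysis, analyze, and transform are hypothetical names, and the vocabulary ordering is a simplification.

```python
import bisect
import statistics
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Analysis(NamedTuple):
    mean: float
    std: float
    vocab: Dict[str, int]
    boundaries: List[float]


def analyze(values: Sequence[float], tokens: Sequence[str],
            num_buckets: int) -> Analysis:
    """Full pass: derive the constants every later transform depends on."""
    return Analysis(
        mean=statistics.fmean(values),
        std=statistics.pstdev(values),
        # Sorted order keeps ids deterministic; a simplification for the sketch.
        vocab={tok: i for i, tok in enumerate(sorted(set(tokens)))},
        # num_buckets - 1 cut points taken from the observed distribution.
        boundaries=statistics.quantiles(values, n=num_buckets),
    )


def transform(a: Analysis, value: float, token: str) -> Tuple[float, int, int]:
    """Per-example step: reuses the stored constants, needs no further pass."""
    z_score = (value - a.mean) / a.std        # assumes std is non-zero
    token_id = a.vocab.get(token, -1)         # -1 marks out-of-vocabulary
    bucket = bisect.bisect_right(a.boundaries, value)
    return z_score, token_id, bucket
```

Because transform depends only on the stored Analysis, the same constants can be applied to training data and to later inference requests, which is the property that helps avoid training/serving skew.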

Supporting Technologies

Required

  • Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. TFX uses Apache Beam to implement data-parallel pipelines. The pipeline is then executed by one of Beam's supported distributed processing back-ends, which include Apache Flink, Apache Spark, Google Cloud Dataflow, and others.

Optional

Orchestrators such as Apache Airflow and Kubeflow make configuring, operating, monitoring, and maintaining an ML pipeline easier.

  • Apache Airflow is a platform to programmatically author, schedule and monitor workflows. TFX uses Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.

  • Kubeflow is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Kubeflow's goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Kubeflow Pipelines enable composition and execution of reproducible workflows on Kubeflow, integrated with experimentation and notebook based experiences. Kubeflow Pipelines services on Kubernetes include the hosted Metadata store, container based orchestration engine, notebook server, and UI to help users develop, run, and manage complex ML pipelines at scale. The Kubeflow Pipelines SDK allows for creation and sharing of components and composition of pipelines programmatically.
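
The idea of running tasks as a directed acyclic graph while respecting dependencies can be sketched with the standard library alone. This is not Airflow or Kubeflow code. The dependency edges below are an illustrative reading of the component descriptions in this guide, and execution_order is a hypothetical helper; a real pipeline's wiring may differ.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Each key maps to the set of upstream components it depends on.
# These edges are illustrative assumptions, not an official TFX graph.
dependencies: Dict[str, Set[str]] = {
    "ExampleGen": set(),
    "StatisticsGen": {"ExampleGen"},
    "SchemaGen": {"StatisticsGen"},
    "ExampleValidator": {"StatisticsGen", "SchemaGen"},
    "Transform": {"ExampleGen", "SchemaGen"},
    "Trainer": {"Transform", "SchemaGen"},
    "Evaluator": {"ExampleGen", "Trainer"},
    "ModelValidator": {"ExampleGen", "Trainer"},
    "Pusher": {"Trainer", "ModelValidator"},
}


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which every task runs after its inputs."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

Several orders are usually valid for the same graph. An orchestrator is free to pick any of them, or to run independent tasks in parallel, as long as no component starts before the components it depends on have finished.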

Portability and Interoperability

TFX is designed to be portable to multiple environments and orchestration frameworks, including Apache Airflow, Apache Beam, and Kubeflow. It is also portable to different computing platforms, including on-premises and cloud platforms such as the Google Cloud Platform (GCP). In particular, TFX interoperates with several managed GCP services, such as Cloud AI Platform for Training and Prediction, and Cloud Dataflow for distributed data processing for several other aspects of the ML lifecycle.

Model vs. SavedModel

Model

A model is the output of the training process. It is the serialized record of the weights that have been learned during the training process. These weights can be subsequently used to compute predictions for new input examples. For TFX and TensorFlow, 'model' refers to the checkpoints containing the weights learned up to that point.

Note that 'model' might also refer to the definition of the TensorFlow computation graph (i.e. a Python file) that expresses how a prediction will be computed. The two senses may be used interchangeably based on context.

SavedModel

  • What is a SavedModel: a universal, language-neutral, hermetic, recoverable serialization of a TensorFlow model.
  • Why is it important: It enables higher-level systems to produce, transform, and consume TensorFlow models using a single abstraction.

SavedModel is the recommended serialization format for serving a TensorFlow model in production, or exporting a trained model for a native mobile or JavaScript application. For example, to turn a model into a REST service for making predictions, you can serialize the model as a SavedModel and serve it using TensorFlow Serving. See Serving a TensorFlow Model for more information.

Schema

Some TFX components use a description of your input data called a schema. The schema is an instance of schema.proto. Schemas are a type of protocol buffer, more generally known as a "protobuf". The schema can specify data types for feature values, whether a feature has to be present in all examples, allowed value ranges, and other properties. One of the benefits of using TensorFlow Data Validation (TFDV) is that it will automatically generate a schema by inferring types, categories, and ranges from the training data.

Here's an excerpt from a schema protobuf:

...
feature {
  name: "age"
  value_count {
    min: 1
    max: 1
  }
  type: FLOAT
  presence {
    min_fraction: 1
    min_count: 1
  }
}
feature {
  name: "capital-gain"
  value_count {
    min: 1
    max: 1
  }
  type: FLOAT
  presence {
    min_fraction: 1
    min_count: 1
  }
}
...

The following components use the schema:

  • TensorFlow Data Validation
  • TensorFlow Transform

In a typical TFX pipeline TensorFlow Data Validation generates a schema, which is consumed by the other components.
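
To make concrete what checking data against a schema means, here is a small sketch in plain Python. It does not use schema.proto or the TFDV API: SCHEMA is a hypothetical dictionary that mirrors the two features in the excerpt above (treating presence with min_fraction: 1 as "required"), and find_anomalies is an illustrative helper.

```python
from typing import Any, Dict, List

# Hypothetical dictionary mirroring the "age" and "capital-gain" features
# from the excerpt: exactly one FLOAT value, present in every example.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "age": {"type": float, "min_count": 1, "max_count": 1, "required": True},
    "capital-gain": {"type": float, "min_count": 1, "max_count": 1, "required": True},
}


def find_anomalies(example: Dict[str, list],
                   schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Compare one example with the schema and describe each mismatch."""
    anomalies = []
    for name, spec in schema.items():
        values = example.get(name)
        if values is None:
            if spec["required"]:
                anomalies.append(f"{name}: missing")
            continue
        if not spec["min_count"] <= len(values) <= spec["max_count"]:
            anomalies.append(f"{name}: unexpected number of values")
        if any(not isinstance(v, spec["type"]) for v in values):
            anomalies.append(f"{name}: wrong type")
    # Features the schema does not know about are reported as well.
    for name in example:
        if name not in schema:
            anomalies.append(f"{name}: not in schema")
    return anomalies
```

An example such as {"age": [39.0], "capital-gain": [2174.0]} yields no anomalies, while {"age": ["39"]} is reported for both a wrong type and a missing feature.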

Developing with TFX

TFX provides a powerful platform for every phase of a machine learning project, from research, experimentation, and development on your local machine through deployment. To avoid code duplication and eliminate the potential for training/serving skew, it is strongly recommended that you implement your TFX pipeline for both model training and deployment of trained models, and use Transform components, which leverage the TensorFlow Transform library, for both training and inference. By doing so you use the same preprocessing and analysis code consistently, avoid differences between the data used for training and the data fed to your trained models in production, and benefit from writing that code only once.

Data Exploration, Visualization, and Cleaning

TFX pipelines typically begin with an ExampleGen component, which accepts input data and formats it as tf.Examples. Often this is done after the data has been split into training and evaluation datasets so that there are actually two copies of ExampleGen components, one each for training and evaluation. This is typically followed by a StatisticsGen component and a SchemaGen component, which will examine your data and infer a data schema and statistics. The schema and statistics will be consumed by an ExampleValidator component, which will look for anomalies, missing values, and incorrect data types in your data. All of these components leverage the capabilities of the TensorFlow Data Validation library.

TensorFlow Data Validation (TFDV) is a valuable tool when doing initial exploration, visualization, and cleaning of your dataset. TFDV examines your data and infers the data types, categories, and ranges, and then automatically helps identify anomalies and missing values. It also provides visualization tools that can help you examine and understand your dataset. After your pipeline completes you can read metadata from MLMD and use the visualization tools of TFDV in a Jupyter notebook to analyze your data.

Following your initial model training and deployment, TFDV can be used to monitor new data from inference requests to your deployed models, and look for anomalies and/or drift. This is especially useful for time series data that changes over time as a result of trend or seasonality, and can help inform when there are data problems or when models need to be retrained on new data.
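
One simple way to think about drift is to compare the distribution of a categorical feature at training time with its distribution in later inference requests. The sketch below uses the L-infinity distance (the largest difference in frequency for any single value) as the comparison. It is a conceptual illustration only and does not call TFDV; the function names and the 0.1 threshold are assumptions chosen for the example.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def distribution(values: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Relative frequency of each distinct value (input must be non-empty)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def linf_distance(p: Dict[Hashable, float], q: Dict[Hashable, float]) -> float:
    """Largest absolute difference in frequency across all observed values."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def has_drifted(training: Iterable[Hashable], serving: Iterable[Hashable],
                threshold: float = 0.1) -> bool:
    """Flag drift when the distributions differ by more than the threshold."""
    return linf_distance(distribution(training), distribution(serving)) > threshold
```

For instance, if a value accounts for half of the training data but nine tenths of recent requests, the distance is 0.4 and the feature is flagged. A suitable threshold depends on the feature and the application.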

Data Visualization

After you have completed your first run of your data through the section of your pipeline that uses TFDV (typically StatisticsGen, SchemaGen, and ExampleValidator) you can visualize the results in a Jupyter style notebook. For additional runs you can compare these results as you make adjustments, until your data is optimal for your model and application.

You will first query ML Metadata (MLMD) to locate the results of these component executions, and then use the visualization support API in TFDV to create the visualizations in your notebook. This includes tfdv.load_statistics() and tfdv.visualize_statistics(). Using this visualization you can better understand the characteristics of your dataset, and modify it if necessary.

Developing and Training Models

Feature Engineering

A typical TFX pipeline will include a Transform component, which will perform feature engineering by leveraging the capabilities of the TensorFlow Transform (TFT) library. A Transform component consumes the schema created by a SchemaGen component, and applies data transformations to create, combine, and transform the features that will be used to train your model. Cleanup of missing values and conversion of types should also be done in the Transform component if there is ever a possibility that these will also be present in data sent for inference requests. There are some important considerations when designing TensorFlow code for training in TFX.

Modeling and Training

The result of a Transform component is a SavedModel which will be imported and used in your modeling code in TensorFlow, during a Trainer component. This SavedModel includes all of the data engineering transformations that were created in the Transform component, so that the identical transforms are performed using the exact same code during both training and inference. Using the modeling code, including the SavedModel from the Transform component, you can consume your training and evaluation data and train your model.

In the last section of your modeling code you should save your model as both a SavedModel and an EvalSavedModel. Saving as an EvalSavedModel requires you to import and apply the TensorFlow Model Analysis (TFMA) library in your Trainer component.

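import tensorflow_model_analysis as tfma
...

# Export an EvalSavedModel so that TFMA can evaluate the trained estimator.
# estimator, eval_model_dir, and receiver_fn are defined in the elided code.
tfma.export.export_eval_savedmodel(
    estimator=estimator,
    export_dir_base=eval_model_dir,
    eval_input_receiver_fn=receiver_fn)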

Analyzing and Understanding Model Performance

Model Analysis

Following initial model development and training, it's important to analyze and really understand your model's performance. A typical TFX pipeline will include an Evaluator component, which leverages the capabilities of the TensorFlow Model Analysis (TFMA) library and provides a powerful toolset for this phase of development. An Evaluator component consumes the EvalSavedModel that you exported above, and allows you to specify a list of SliceSpecs that you can use when visualizing and analyzing your model's performance. Each SliceSpec defines a slice of your training data that you want to examine, such as particular categories for categorical features, or particular ranges for numerical features.

For example, this would be important for trying to understand your model's performance for different segments of your customers, which could be segmented by annual purchases, geographical data, age group, or gender. This can be especially important for datasets with long tails, where the performance of a dominant group may mask unacceptable performance for important, yet smaller groups. For example, your model may perform well for average employees but fail miserably for executive staff, and it might be important to you to know that.
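
The masking effect described above is easy to reproduce with a few lines of plain Python. This sketch does not use the TFMA API: accuracy_by_slice is a hypothetical helper, and the rows are made-up data constructed so that a dominant group hides a poorly served smaller group.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_slice(rows: List[dict], slice_key: str) -> Dict[str, float]:
    """Accuracy computed separately for each value of the slicing feature."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for row in rows:
        slice_value = row[slice_key]
        totals[slice_value] += 1
        correct[slice_value] += int(row["label"] == row["prediction"])
    return {s: correct[s] / totals[s] for s in totals}


# Made-up evaluation rows: 90 staff predictions, all correct, and
# 10 executive predictions, of which only 2 are correct.
rows = (
    [{"role": "staff", "label": 1, "prediction": 1}] * 90
    + [{"role": "executive", "label": 1, "prediction": 0}] * 8
    + [{"role": "executive", "label": 1, "prediction": 1}] * 2
)

overall = sum(r["label"] == r["prediction"] for r in rows) / len(rows)
per_slice = accuracy_by_slice(rows, "role")
```

Here the overall accuracy is 0.92, which looks healthy, while the per-slice view shows 1.0 for staff and only 0.2 for executives. Slicing is what surfaces the second number.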

Model Analysis and Visualization

After you have completed your first run of your data through training your model and running the Evaluator component (which leverages TFMA) on the training results, you can visualize the results in a Jupyter style notebook. For additional runs you can compare these results as you make adjustments, until your results are optimal for your model and application.

You will first query ML Metadata (MLMD) to locate the results of these component executions, and then use the visualization support API in TFMA to create the visualizations in your notebook. This includes tfma.load_eval_results() and tfma.view.render_slicing_metrics(). Using this visualization you can better understand the characteristics of your model, and modify it if necessary.

Deployment Targets

Once you have developed and trained a model that you're happy with, it's now time to deploy it to one or more deployment target(s) where it will receive inference requests. TFX supports deployment to three classes of deployment targets. Trained models which have been exported as SavedModels can be deployed to any or all of these deployment targets.

[Figure: Component Flow]

Inference: TensorFlow Serving

TensorFlow Serving (TFS) is a flexible, high-performance serving system for machine learning models, designed for production environments. It consumes a SavedModel and will accept inference requests over either REST or gRPC interfaces. It runs as a set of processes on one or more network servers, using one of several advanced architectures to handle synchronization and distributed computation. See the TFS documentation for more information on developing and deploying TFS solutions.

In a typical pipeline a Pusher component will consume SavedModels which have been trained in a Trainer component and deploy them to your TFS infrastructure. This includes handling multiple versions and model updates.
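
TensorFlow Serving reads models from a base directory that contains one numbered subdirectory per version and, by default, serves the highest-numbered one. The sketch below shows how a push step could place a new export into that layout. It is not the Pusher component's implementation: latest_version and push are hypothetical helpers, and the increment-by-one versioning scheme is an assumption made for the example.

```python
import shutil
from pathlib import Path
from typing import Optional


def latest_version(base: Path) -> Optional[int]:
    """Highest numeric version directory under base, or None if empty."""
    versions = [
        int(p.name) for p in base.iterdir()
        if p.is_dir() and p.name.isdigit()
    ]
    return max(versions, default=None)


def push(exported_model: Path, base: Path) -> Path:
    """Copy an exported model into base/<next version> and return that path."""
    base.mkdir(parents=True, exist_ok=True)
    current = latest_version(base)
    # Assumed scheme: start at 1 and increment for each later push.
    target = base / str(1 if current is None else current + 1)
    shutil.copytree(exported_model, target)
    return target
```

Pushing the same export twice produces base/1 and base/2. Older versions stay on disk, which makes it possible to roll back by removing the newest directory.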

Inference in Native Mobile and IoT Applications: TensorFlow Lite

TensorFlow Lite is a suite of tools dedicated to helping developers use their trained TensorFlow models in native mobile and IoT applications. It consumes the same SavedModels as TensorFlow Serving, and applies optimizations such as quantization and pruning to reduce the size and improve the performance of the resulting models for the challenges of running on mobile and IoT devices. See the TensorFlow Lite documentation for more information on using TensorFlow Lite.

Inference in JavaScript: TensorFlow JS

TensorFlow JS is a JavaScript library for training and deploying ML models in the browser and on Node.js. It consumes the same SavedModels as TensorFlow Serving and TensorFlow Lite, and converts them to the TensorFlow.js Web format. See the TensorFlow JS documentation for more details on using TensorFlow JS.

Creating a TFX Pipeline With Airflow

See the Airflow workshop for details.

Creating a TFX Pipeline With Kubeflow

Setup

Kubeflow requires a Kubernetes cluster to run pipelines at scale. See the Kubeflow deployment guidelines, which walk through the options for deploying a Kubeflow cluster.

Configure and run TFX pipeline

Please follow the Kubeflow Pipelines instructions to run the TFX example pipeline on Kubeflow. TFX components have been containerized to compose the Kubeflow pipeline, and the sample illustrates how to configure the pipeline to read a large public dataset and execute training and data processing steps at scale in the cloud.

Command line interface for pipeline actions

TFX provides a unified CLI which helps you perform the full range of pipeline actions, such as creating, updating, running, listing, and deleting pipelines, on various orchestrators including Apache Airflow, Apache Beam, and Kubeflow. For details, please follow these instructions.