
TFX Developer Tutorial


Introduction

This tutorial is designed to introduce TensorFlow Extended (TFX) and help you learn to create your own machine learning pipelines. It runs locally, and shows integration with TFX and TensorBoard as well as interaction with TFX in Jupyter notebooks.

You'll follow a typical ML development process, starting by examining the dataset, and end up with a complete working pipeline. Along the way you'll explore ways to debug and update your pipeline, and measure performance.

Learn more

Please see the TFX User Guide to learn more.

Step by step

You'll gradually create your pipeline by working step by step, following a typical ML development process. Here are the steps:

  1. Set up your environment
  2. Bring up initial pipeline skeleton
  3. Dive into your data
  4. Feature engineering
  5. Training
  6. Analyzing model performance
  7. Ready for production

Prerequisites

  • Linux or macOS
  • Virtualenv
  • Python 2.7
  • Git

Required packages

Depending on your environment, you may need to install several packages:

sudo apt-get install python-dev  \
    build-essential libssl-dev libffi-dev \
    libxml2-dev libxslt1-dev zlib1g-dev \
    python-pip

Tutorial materials

The code for this tutorial is available at: https://github.com/tensorflow/tfx/tree/master/tfx/examples/workshop

The code is organized by the steps that you're working on, so for each step you'll have the code you need and instructions on what to do with it.

The tutorial files include both an exercise and the solution to the exercise, in case you get stuck.

Exercise

  • taxi_pipeline.py
  • taxi_utils.py
  • taxi DAG

Solution

  • taxi_pipeline_solution.py
  • taxi_utils_solution.py
  • taxi_solution DAG

What you're doing

You’re learning how to create an ML pipeline using TFX

  • TFX pipelines are appropriate when you will be deploying a production ML application
  • TFX pipelines are appropriate when datasets are large
  • TFX pipelines are appropriate when training/serving consistency is important
  • TFX pipelines are appropriate when version management for inference is important
  • Google uses TFX pipelines for production ML

You’re following a typical ML development process

  • Ingesting, understanding, and cleaning your data
  • Feature engineering
  • Training
  • Analyzing model performance
  • Lather, rinse, repeat
  • Ready for production

Adding the code for each step

The tutorial is designed so that all the code is included in the files, but all the code for steps 3-7 is commented out and marked with inline comments. The inline comments identify which step the line of code applies to. For example, the code for step 3 is marked with the comment # Step 3.

The code that you will add for each step typically falls into these regions of the code:

  • imports
  • The DAG configuration
  • The list returned from the create_pipeline() call
  • The supporting code in taxi_utils.py

As you go through the tutorial you'll uncomment the lines of code that apply to the tutorial step that you're currently working on. That will add the code for that step, and update your pipeline. As you do that we strongly encourage you to review the code that you're uncommenting.
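
For example, a Step 3 line in taxi_pipeline.py looks something like the sketch below before and after you uncomment it (the component and argument names here are only for illustration; use the ones already present in the tutorial files):

# Before (as shipped): the line is commented out and tagged with its step.
# statistics_gen = StatisticsGen(input_data=example_gen.outputs.examples)  # Step 3

# After you reach Step 3 and uncomment it:
statistics_gen = StatisticsGen(input_data=example_gen.outputs.examples)  # Step 3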

Chicago Taxi Dataset


You're using the Taxi Trips dataset released by the City of Chicago.

You can read more about the dataset in Google BigQuery. Explore the full dataset in the BigQuery UI.

Model Goal - Binary classification

Will the customer tip more or less than 20%?
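
In the tutorial code the label is derived from the tips and fare columns. Here is a minimal sketch of that idea in TensorFlow 1.x (the feature names match the taxi dataset, but check taxi_utils.py for the exact keys and logic used):

import tensorflow as tf

def make_label(tips, fare):
  """Returns 1 if the tip is greater than 20% of the fare, else 0."""
  return tf.cast(tf.greater(tips, fare * 0.2), tf.int64)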

Step 1: Set up your environment

The setup script (setup_demo.sh) installs TFX and Airflow, and configures Airflow in a way that makes it easy to work with for this tutorial.

In a shell:

cd
virtualenv -p python2.7 tfx-env
source ~/tfx-env/bin/activate
mkdir tfx; cd tfx

pip install tensorflow==1.13.1
pip install tfx==0.12.0
git clone https://github.com/tensorflow/tfx.git
cd ~/tfx/tfx/tfx/examples/workshop/setup
./setup_demo.sh

You should review setup_demo.sh to see what it's doing.

Step 2: Bring up initial pipeline skeleton

Hello World

In a shell:

# Open a new terminal window, and in that window ...
source ~/tfx-env/bin/activate
airflow webserver -p 8080

# Open another new terminal window, and in that window ...
source ~/tfx-env/bin/activate
airflow scheduler

# Open yet another new terminal window, and in that window ...
# Assuming that you've cloned the TFX repo into ~/tfx
source ~/tfx-env/bin/activate
cd ~/tfx/tfx/tfx/examples/workshop/notebooks
jupyter notebook

You started a Jupyter notebook server in this step. Later you will run the notebooks in this folder.

In a browser:

  • Open a browser and go to http://127.0.0.1:8080

Troubleshooting

If you have any issues with loading the Airflow console in your web browser, or if there were any errors when you ran airflow webserver, then you may have another application running on port 8080. That's the default port for Airflow, but you can change it to any other user port that's not being used. For example, to run Airflow on port 7070 you could run:

airflow webserver -p 7070

DAG view buttons


  • Use the button on the left to enable the taxi DAG
  • Use the button on the right to refresh the taxi DAG when you make changes
  • Use the button on the right to trigger the taxi DAG
  • Click on taxi to go to the graph view of the taxi DAG


Waiting for the pipeline to complete

After you've triggered your pipeline in the DAGs view, you can watch as your pipeline completes processing. As each component runs the outline color of the component in the DAG graph will change to show its state. When a component has finished processing the outline will turn dark green to show that it's done.

So far you only have the CsvExampleGen component in your pipeline, so you need to wait for it to turn dark green (about a minute).
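
For reference, the heart of the Step 2 skeleton is a single CsvExampleGen component reading the taxi CSV data. The sketch below is only illustrative (import paths and argument names differ slightly between early TFX releases, and the data path is a placeholder); taxi_pipeline.py has the real code:

import os

from tfx.components import CsvExampleGen
from tfx.utils.dsl_utils import csv_input

# Placeholder: point this at wherever setup_demo.sh installed the taxi CSV data.
_data_root = os.path.join(os.environ['HOME'], 'airflow', 'data')

# Ingest the CSV data and split it into training and eval examples.
examples = csv_input(_data_root)
example_gen = CsvExampleGen(input_base=examples)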

Setup complete

Step 3: Dive into your data

The first task in any data science or ML project is to understand and clean the data.

  • Understand the data types for each feature
  • Look for anomalies and missing values
  • Understand the distributions for each feature

Components

Data Components

  • ExampleGen ingests and splits the input dataset.
  • StatisticsGen calculates statistics for the dataset.
  • SchemaGen examines the statistics and creates a data schema.
  • ExampleValidator looks for anomalies and missing values in the dataset.
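
Uncommenting Step 3 wires these components together roughly as follows. This is only a sketch (argument names changed across early TFX releases, and it assumes the example_gen component already defined in the file); the uncommented code in taxi_pipeline.py is authoritative:

from tfx.components import StatisticsGen, SchemaGen, ExampleValidator

# Compute statistics over the ingested examples.
statistics_gen = StatisticsGen(input_data=example_gen.outputs.examples)

# Infer a schema from those statistics.
infer_schema = SchemaGen(stats=statistics_gen.outputs.output)

# Check the data for anomalies against the schema.
validate_stats = ExampleValidator(
    stats=statistics_gen.outputs.output,
    schema=infer_schema.outputs.output)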

In an editor:

  • In ~/airflow/dags uncomment the lines marked Step 3 in taxi_pipeline.py
  • Take a moment to review the code that you uncommented

In a browser:

  • Return to the DAGs list page in Airflow by clicking the "DAGs" link in the top left corner
  • Click the refresh button on the right side for the taxi DAG
    • You should see "DAG [taxi] is now fresh as a daisy"
  • Trigger taxi
  • Wait for pipeline to complete
    • All dark green
    • Use refresh on right side or refresh page


Back on Jupyter:

Earlier, you ran jupyter notebook, which opened a Jupyter session in a browser tab. Now return to that tab in your browser.

  • Open step3.ipynb
  • Follow the notebook


More advanced example

The example presented here is really only meant to get you started. For a more advanced example see the TensorFlow Data Validation Colab.

For more information on using TFDV to explore and validate a dataset, see the examples on tensorflow.org.
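
If you want to explore the data outside of the pipeline, TensorFlow Data Validation can also be used directly, for example in a notebook. A small sketch (the CSV path is a placeholder for wherever your copy of the taxi data lives):

import tensorflow_data_validation as tfdv

# Compute and visualize statistics for a CSV file.
stats = tfdv.generate_statistics_from_csv(data_location='data/taxi_data.csv')
tfdv.visualize_statistics(stats)

# Infer a schema from the statistics and check for anomalies against it.
schema = tfdv.infer_schema(stats)
anomalies = tfdv.validate_statistics(stats, schema)
tfdv.display_anomalies(anomalies)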

Step 4: Feature engineering

You can increase the predictive quality of your data and/or reduce dimensionality with feature engineering.

  • Feature crosses
  • Vocabularies
  • Embeddings
  • PCA
  • Categorical encoding

One of the benefits of using TFX is that you will write your transformation code once, and the resulting transforms will be consistent between training and serving.

Components

Transform

  • Transform performs feature engineering on the dataset.
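
Transform calls a user-supplied preprocessing_fn defined in taxi_utils.py. Below is a stripped-down sketch of the idea; the input feature names are real taxi columns, but the output names are made up and the exact tf.Transform helper names vary a bit between releases, so defer to the code you uncomment in taxi_utils.py:

import tensorflow_transform as tft

def preprocessing_fn(inputs):
  """Transforms raw features into features the model consumes."""
  outputs = {}
  # Scale a numeric feature to zero mean and unit variance.
  outputs['trip_miles_scaled'] = tft.scale_to_z_score(inputs['trip_miles'])
  # Build a vocabulary for a string feature and map it to integer ids.
  outputs['payment_type_id'] = tft.compute_and_apply_vocabulary(
      inputs['payment_type'])
  return outputs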

In an editor:

  • In ~/airflow/dags uncomment the lines marked Step 4 in both taxi_pipeline.py and taxi_utils.py
  • Take a moment to review the code that you uncommented

In a browser:

  • Return to DAGs list page in Airflow
  • Click the refresh button on the right side for the taxi DAG
    • You should see "DAG [taxi] is now fresh as a daisy"
  • Trigger taxi
  • Wait for pipeline to complete
    • All dark green
    • Use refresh on right side or refresh page


Back on Jupyter:

Return to the Jupyter tab in your browser.

  • Open step4.ipynb
  • Follow the notebook

More advanced example

The example presented here is really only meant to get you started. For a more advanced example see the TensorFlow Transform Colab.

Step 5: Training

Train a TensorFlow model with your nice, clean, transformed data.

  • Include the transformations from step 4 so that they are applied consistently
  • Save the results as a SavedModel for production
  • Visualize and explore the training process using TensorBoard
  • Also save an EvalSavedModel for analysis of model performance

Components

  • Trainer trains the model using TensorFlow.
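
In the pipeline file, Step 5 adds a Trainer that points at the training code in taxi_utils.py. Roughly, it looks like the sketch below (argument names are illustrative and differ between TFX versions; the commented-out code in taxi_pipeline.py is authoritative):

# _taxi_utils_file is the path to taxi_utils.py, where the training code lives.
trainer = Trainer(
    module_file=_taxi_utils_file,
    transformed_examples=transform.outputs.transformed_examples,
    schema=infer_schema.outputs.output,
    transform_output=transform.outputs.transform_output,
    train_steps=10000,
    eval_steps=5000)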

In an editor:

  • In ~/airflow/dags uncomment the lines marked Step 5 in both taxi_pipeline.py and taxi_utils.py
  • Take a moment to review the code that you uncommented

In a browser:

  • Return to DAGs list page in Airflow
  • Click the refresh button on the right side for the taxi DAG
    • You should see "DAG [taxi] is now fresh as a daisy"
  • Trigger taxi
  • Wait for pipeline to complete
    • All dark green
    • Use refresh on right side or refresh page


Back on Jupyter:

Return to the Jupyter tab in your browser.

  • Open step5.ipynb
  • Follow the notebook


More advanced example

The example presented here is really only meant to get you started. For a more advanced example see the TensorBoard Tutorial.

Step 6: Analyzing model performance

Understanding more than just the top-level metrics.

  • Users experience model performance for their queries only
  • Poor performance on slices of data can be hidden by top-level metrics
  • Model fairness is important
  • Often key subsets of users or data are very important, and may be small
    • Performance in critical but unusual conditions
    • Performance for key audiences such as influencers

Components

  • Evaluator performs deep analysis of the training results.
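
In the Step 6 notebook you will load the Evaluator's output with TensorFlow Model Analysis and look at metrics sliced by feature values. A hedged sketch of that kind of notebook call (the output path is a placeholder, and trip_start_hour is just one example of a slicing column):

import tensorflow_model_analysis as tfma

# Placeholder: the directory that the Evaluator component wrote its results to.
eval_output_dir = 'path/to/evaluator/output'

# Load the evaluation results and render metrics sliced by a feature.
results = tfma.load_eval_result(output_path=eval_output_dir)
tfma.view.render_slicing_metrics(results, slicing_column='trip_start_hour')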

In an editor:

  • In ~/airflow/dags uncomment the lines marked Step 6 in taxi_pipeline.py
  • Take a moment to review the code that you uncommented

In a browser:

  • Return to DAGs list page in Airflow
  • Click the refresh button on the right side for the taxi DAG
    • You should see "DAG [taxi] is now fresh as a daisy"
  • Trigger taxi
  • Wait for pipeline to complete
    • All dark green
    • Use refresh on right side or refresh page


Back on Jupyter:

Return to the Jupyter tab in your browser.

  • Open step6.ipynb
  • Follow the notebook


More advanced example

The example presented here is really only meant to get you started. For a more advanced example see the TFMA Chicago Taxi Tutorial.

Step 7: Ready for production

If the new model is ready, make it so.

  • If you’re replacing a model that is currently in production, first make sure that the new one is better
  • ModelValidator tells the Pusher component if the model is OK
  • Pusher deploys SavedModels to well-known locations

Deployment targets receive new models from well-known locations

  • TensorFlow Serving
  • TensorFlow Lite
  • TensorFlow.js
  • TensorFlow Hub

Components

  • ModelValidator ensures that the model is "good enough" to be pushed to production.
  • Pusher deploys the model to a serving infrastructure.
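
Step 7 adds both components to the pipeline, and the Pusher only copies a model to the serving directory if the ModelValidator has blessed it. A rough sketch of that wiring (argument names are illustrative and version-dependent, and _serving_model_dir is a placeholder; defer to taxi_pipeline.py):

from tfx.proto import pusher_pb2

# Compare the newly trained model against the last blessed model.
model_validator = ModelValidator(
    examples=example_gen.outputs.examples,
    model=trainer.outputs.output)

# Push the model to a well-known serving directory if it was blessed.
pusher = Pusher(
    model_export=trainer.outputs.output,
    model_blessing=model_validator.outputs.blessing,
    push_destination=pusher_pb2.PushDestination(
        filesystem=pusher_pb2.PushDestination.Filesystem(
            base_directory=_serving_model_dir)))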

In an editor:

  • In ~/airflow/dags uncomment the lines marked Step 7 in taxi_pipeline.py
  • Take a moment to review the code that you uncommented

In a browser:

  • Return to DAGs list page in Airflow
  • Click the refresh button on the right side for the taxi DAG
    • You should see "DAG [taxi] is now fresh as a daisy"
  • Trigger taxi
  • Wait for pipeline to complete
    • All dark green
    • Use refresh on right side or refresh page


Next Steps

You have now trained and validated your model, and exported a SavedModel under the ~/airflow/saved_models/taxi directory. Your model is now ready for production. You can now deploy your model to any of the TensorFlow deployment targets, including:

  • TensorFlow Serving, for serving your model on a server or server farm and processing REST and/or gRPC inference requests.
  • TensorFlow Lite, for including your model in an Android or iOS native mobile application, or in a Raspberry Pi, IoT, or microcontroller application.
  • TensorFlow.js, for running your model in a web browser or Node.js application.
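
Before deploying, you can sanity-check the exported model by loading it back in Python. Here is a small sketch using the TF 1.x SavedModel loader, which matches the TensorFlow version installed above (the export path is a placeholder; look under ~/airflow/saved_models/taxi for the actual timestamped directory):

import tensorflow as tf

# Placeholder: replace with the timestamped export directory produced by the Pusher.
export_dir = '/path/to/saved_models/taxi/<timestamp>'

with tf.Session(graph=tf.Graph()) as sess:
  # Load the SavedModel that was exported for serving.
  meta_graph = tf.saved_model.loader.load(
      sess, [tf.saved_model.tag_constants.SERVING], export_dir)
  # List the available serving signatures.
  print(list(meta_graph.signature_def.keys()))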