Introduction
This document provides instructions to create a TensorFlow Extended (TFX) pipeline
using templates which are provided with the TFX Python package.
Most of the instructions are Linux shell commands; corresponding
Jupyter Notebook code cells which invoke those commands using !
are provided as well.
You will build a pipeline using the Taxi Trips dataset released by the City of Chicago. We strongly encourage you to try to build your own pipeline using your own dataset, utilizing this pipeline as a baseline.
We will build a pipeline which runs in a local environment. If you are interested in using a Kubeflow orchestrator on Google Cloud, please see the TFX on Cloud AI Platform Pipelines tutorial.
Prerequisites
- Linux / MacOS
- Python >= 3.9
You can get all prerequisites easily by running this notebook on Google Colab.
Step 1. Set up your environment.
Throughout this document, we present each command twice: once as a copy-and-paste-ready shell command, and once as a Jupyter notebook cell. If you are using Colab, just skip the shell script blocks and execute the notebook cells.
You should prepare a development environment to build a pipeline.
Install the tfx
Python package. We recommend using virtualenv
in the local environment. You can use the following shell script snippet to set up your environment.
# Create a virtualenv for tfx.
virtualenv -p python3 venv
source venv/bin/activate
# Install python packages.
# TFX is pinned below 1.16 due to the removal of tf.estimator support.
python -m pip install --upgrade "tfx<1.16"
If you are using colab:
import sys
# TFX has a constraint of 1.16 due to the removal of tf.estimator support.
!{sys.executable} -m pip install --upgrade "tfx<1.16"
During installation, pip may print dependency-resolver errors like the following:
ERROR: some-package 0.some_version.1 has requirement other-package!=2.0.,<3,>=1.15, but you'll have other-package 2.0.0 which is incompatible.
Please ignore these errors at this moment.
# Set `PATH` to include user python binary directory.
HOME=%env HOME
PATH=%env PATH
%env PATH={PATH}:{HOME}/.local/bin
Let's check the version of TFX.
python -c "from tfx import version ; print('TFX version: {}'.format(version.__version__))"
!python3 -c "from tfx import version ; print('TFX version: {}'.format(version.__version__))"
TFX version: 1.15.1
And, it's done. We are ready to create a pipeline.
Step 2. Copy predefined template to your project directory.
In this step, we will create a working pipeline project directory and files by copying additional files from a predefined template.
You may give your pipeline a different name by changing the PIPELINE_NAME
below. This will also become the name of the project directory where your files will be put.
export PIPELINE_NAME="my_pipeline"
export PROJECT_DIR=~/tfx/${PIPELINE_NAME}
PIPELINE_NAME="my_pipeline"
import os
# Create a project directory under Colab content directory.
PROJECT_DIR=os.path.join(os.sep,"content",PIPELINE_NAME)
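Note that the two environments place the project in different locations. The following sketch, using the default my_pipeline name from above, shows how each path resolves:

```python
import os

PIPELINE_NAME = "my_pipeline"

# Local shell: ~/tfx/my_pipeline (expanduser resolves "~" to your home).
local_dir = os.path.expanduser(os.path.join("~", "tfx", PIPELINE_NAME))

# Colab: os.sep as the first join argument makes the path absolute,
# so the project lands directly under /content.
colab_dir = os.path.join(os.sep, "content", PIPELINE_NAME)

print(local_dir)
print(colab_dir)  # /content/my_pipeline on a Linux runtime
```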
TFX includes the taxi
template with the TFX python package. If you are planning to solve a point-wise prediction problem, including classification and regression, this template can be used as a starting point.
The tfx template copy
CLI command copies predefined template files into your project directory.
tfx template copy \
--pipeline_name="${PIPELINE_NAME}" \
--destination_path="${PROJECT_DIR}" \
--model=taxi
!tfx template copy \
--pipeline_name={PIPELINE_NAME} \
--destination_path={PROJECT_DIR} \
--model=taxi
Change the working directory context in this notebook to the project directory.
cd ${PROJECT_DIR}
%cd {PROJECT_DIR}
Step 3. Browse your copied source files.
The TFX template provides basic scaffold files to build a pipeline, including Python source code, sample data, and Jupyter Notebooks to analyse the output of the pipeline. The taxi
template uses the same Chicago Taxi dataset and ML model as the Airflow Tutorial.
In Google Colab, you can browse files by clicking the folder icon on the left. Files should be copied under the project directory, whose name is my_pipeline
in this case. You can click directory names to see their contents, and double-click file names to open them.
Here is a brief introduction to each of the Python files.
- pipeline - This directory contains the definition of the pipeline.
  - configs.py — defines common constants for pipeline runners
  - pipeline.py — defines TFX components and a pipeline
- models - This directory contains ML model definitions.
  - features.py, features_test.py — define features for the model
  - preprocessing.py, preprocessing_test.py — define preprocessing jobs using tf.Transform
  - estimator - This directory contains an Estimator based model.
    - constants.py — defines constants of the model
    - model.py, model_test.py — define a DNN model using TF Estimator
  - keras - This directory contains a Keras based model.
    - constants.py — defines constants of the model
    - model.py, model_test.py — define a DNN model using Keras
- local_runner.py, kubeflow_runner.py — define runners for each orchestration engine
You might notice that some files have _test.py
in their name. These are unit tests of the pipeline, and it is recommended to add more unit tests as you implement your own pipelines.
You can run unit tests by supplying the module name of a test file with the -m
flag. You can usually get a module name by deleting the .py
extension and replacing /
with .
. For example:
python -m models.features_test
!{sys.executable} -m models.features_test
!{sys.executable} -m models.keras.model_test
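The path-to-module-name rule above can be sketched as a small helper. This is a hypothetical convenience function for illustration, not part of TFX:

```python
import os

def path_to_module(path: str) -> str:
    """Convert a test-file path like 'models/features_test.py'
    into a module name usable with `python -m`."""
    if path.endswith(".py"):
        path = path[: -len(".py")]  # drop the .py extension
    # replace path separators with dots
    return path.replace(os.sep, ".").replace("/", ".")

print(path_to_module("models/features_test.py"))    # models.features_test
print(path_to_module("models/keras/model_test.py")) # models.keras.model_test
```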
Step 4. Run your first TFX pipeline
You can create a pipeline using the pipeline create
command.
tfx pipeline create --engine=local --pipeline_path=local_runner.py
!tfx pipeline create --engine=local --pipeline_path=local_runner.py
Then, you can run the created pipeline using the run create
command.
tfx run create --engine=local --pipeline_name="${PIPELINE_NAME}"
!tfx run create --engine=local --pipeline_name={PIPELINE_NAME}
If successful, you'll see Component CsvExampleGen is finished.
When you copy the template, only one component, CsvExampleGen, is included in the pipeline.
Step 5. Add components for data validation.
In this step, you will add components for data validation including StatisticsGen
, SchemaGen
, and ExampleValidator
. If you are interested in data validation, please see Get started with TensorFlow Data Validation.
We will modify the copied pipeline definition in pipeline/pipeline.py
. If you are working in your local environment, use your favorite editor to edit the file. If you are working on Google Colab,
1. Click the folder icon on the left to open the Files view.
2. Click my_pipeline to open the directory, click the pipeline directory, and double-click pipeline.py to open the file.
3. Find and uncomment the 3 lines which add StatisticsGen, SchemaGen, and ExampleValidator to the pipeline. (Tip: find comments containing TODO(step 5):.)
Your change will be saved automatically in a few seconds. Make sure that the *
mark in front of pipeline.py
has disappeared from the tab title. There is no save button or shortcut for the file editor in Colab; Python files in the file editor are saved to the runtime environment even in playground
mode.
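Once uncommented, the data-validation section of pipeline.py wires the three components together roughly as below. This is an illustrative fragment based on the public TFX 1.x component APIs, not the template's exact code; variable names may differ, and example_gen refers to the CsvExampleGen instance already defined earlier in the file:

```python
from tfx import v1 as tfx  # requires the tfx package installed

# StatisticsGen computes statistics over the examples from ExampleGen.
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])

# SchemaGen infers a schema from those statistics.
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'])

# ExampleValidator flags anomalies by checking examples against the schema.
example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
```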
You now need to update the existing pipeline with the modified pipeline definition. Use the tfx pipeline update
command to update your pipeline, followed by the tfx run create
command to create a new execution run of your updated pipeline.
# Update the pipeline
tfx pipeline update --engine=local --pipeline_path=local_runner.py
# You can run the pipeline the same way.
tfx run create --engine local --pipeline_name "${PIPELINE_NAME}"
# Update the pipeline
!tfx pipeline update --engine=local --pipeline_path=local_runner.py
# You can run the pipeline the same way.
!tfx run create --engine local --pipeline_name {PIPELINE_NAME}
You should be able to see the output logs from the added components. Our pipeline creates output artifacts in the tfx_pipeline_output/my_pipeline
directory.
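To get a feel for what a run produced, you can walk the output directory. A minimal stdlib-only sketch, assuming the default tfx_pipeline_output/my_pipeline location and the current working directory being the project directory:

```python
import os

def list_artifacts(root: str, max_entries: int = 20) -> list:
    """Collect up to `max_entries` file paths (relative to `root`)
    under a pipeline output directory, sorted for readability."""
    found = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)[:max_entries]

# Inspect the local pipeline output, if the run has produced it.
output_root = os.path.join("tfx_pipeline_output", "my_pipeline")
if os.path.isdir(output_root):
    for path in list_artifacts(output_root):
        print(path)
```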
Step 6. Add components for training.
In this step, you will add components for training and model validation including Transform
, Trainer
, Resolver
, Evaluator
, and Pusher
.
1. Open pipeline/pipeline.py.
2. Find and uncomment the 5 lines which add Transform, Trainer, Resolver, Evaluator, and Pusher to the pipeline. (Tip: find TODO(step 6):.)
As you did before, you now need to update the existing pipeline with the modified pipeline definition. The instructions are the same as Step 5. Update the pipeline using tfx pipeline update
, and create an execution run using tfx run create
.
tfx pipeline update --engine=local --pipeline_path=local_runner.py
tfx run create --engine local --pipeline_name "${PIPELINE_NAME}"
!tfx pipeline update --engine=local --pipeline_path=local_runner.py
!tfx run create --engine local --pipeline_name {PIPELINE_NAME}
When this execution run finishes successfully, you have now created and run your first TFX pipeline using the local orchestrator!
Step 7. (Optional) Try BigQueryExampleGen.
BigQuery is a serverless, highly scalable, and cost-effective cloud data warehouse. BigQuery can be used as a source of training examples in TFX. In this step, we will add BigQueryExampleGen
to the pipeline.
You need a Google Cloud Platform account to use BigQuery. Please prepare a GCP project.
Log in to your project using the colab auth library or the gcloud
utility.
# You need `gcloud` tool to login in local shell environment.
gcloud auth login
if 'google.colab' in sys.modules:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
You should specify your GCP project name to access BigQuery resources using TFX. Set the GOOGLE_CLOUD_PROJECT
environment variable to your project name.
export GOOGLE_CLOUD_PROJECT=YOUR_PROJECT_NAME_HERE
# Set your project name below.
# WARNING! ENTER your project name before running this cell.
%env GOOGLE_CLOUD_PROJECT=YOUR_PROJECT_NAME_HERE
env: GOOGLE_CLOUD_PROJECT=YOUR_PROJECT_NAME_HERE
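A forgotten placeholder is a common failure here, so a quick sanity check before running the pipeline can save a round trip. The helper below is a hypothetical convenience for this tutorial, not part of TFX:

```python
import os

PLACEHOLDER = "YOUR_PROJECT_NAME_HERE"

def gcp_project_is_set(value: str) -> bool:
    """Return True only if `value` looks like a usable project id:
    non-empty and not the tutorial placeholder."""
    return bool(value) and value != PLACEHOLDER

if not gcp_project_is_set(os.environ.get("GOOGLE_CLOUD_PROJECT", "")):
    print("GOOGLE_CLOUD_PROJECT is not set correctly; BigQuery access will fail.")
```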
Open pipeline/pipeline.py. Comment out CsvExampleGen and uncomment the line which creates an instance of BigQueryExampleGen. You also need to uncomment the query argument of the create_pipeline function.
We need to specify which GCP project to use for BigQuery again; this is done by setting --project
in beam_pipeline_args
when creating the pipeline.
Open pipeline/configs.py. Uncomment the definitions of BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS and BIG_QUERY_QUERY. You should replace the project id and the region value in this file with the correct values for your GCP project.
Open local_runner.py. Uncomment the two arguments, query and beam_pipeline_args, of the create_pipeline() function.
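Taken together, the uncommented pieces amount to roughly the following wiring inside the pipeline definition. This is an illustrative fragment, not the template's exact code: query here stands for the SQL string from BIG_QUERY_QUERY in pipeline/configs.py, and the Beam arguments shown in the comment are an example shape only:

```python
from tfx import v1 as tfx  # requires the tfx package installed

# BigQueryExampleGen replaces CsvExampleGen as the example source.
# `query` is the SQL string passed down from pipeline/configs.py.
example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
    query=query)

# Beam must also know which GCP project to bill BigQuery reads to,
# passed via beam_pipeline_args when creating the pipeline, e.g.:
#   ['--project=' + GOOGLE_CLOUD_PROJECT, ...]
```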
Now the pipeline is ready to use BigQuery as an example source. Update the pipeline and create a run as we did in steps 5 and 6.
tfx pipeline update --engine=local --pipeline_path=local_runner.py
tfx run create --engine local --pipeline_name {PIPELINE_NAME}
What's next: Ingest YOUR data to the pipeline.
We made a pipeline for a model using the Chicago Taxi dataset. Now it's time to put your data into the pipeline.
Your data can be stored anywhere your pipeline can access, including GCS or BigQuery. You will need to modify the pipeline definition to access your data.
- If your data is stored in files, modify DATA_PATH in kubeflow_runner.py or local_runner.py and set it to the location of your files. If your data is stored in BigQuery, modify BIG_QUERY_QUERY in pipeline/configs.py to correctly query for your data.
- Add features in models/features.py.
- Modify models/preprocessing.py to transform input data for training.
- Modify models/keras/model.py and models/keras/constants.py to describe your ML model.
  - You can use an estimator based model, too. Change the RUN_FN constant to models.estimator.model.run_fn in pipeline/configs.py.
Please see the Trainer component guide for more information.