tfx.components.Transform

A TFX component to transform the input examples.

Inherits From: BaseComponent


The Transform component wraps TensorFlow Transform (tf.Transform) to preprocess data in a TFX pipeline. This component will load the preprocessing_fn from the input module file, preprocess both the 'train' and 'eval' splits of the input examples, generate the tf.Transform output, and save both the transform function and the transformed examples to the locations requested by the orchestrator.

Providing a preprocessing function

The Transform executor will load the Python module provided in module_file and look specifically for the preprocessing_fn() function within that file, which it uses to transform the input examples.

An example of preprocessing_fn() can be found in the user-supplied code of the TFX Chicago Taxi pipeline example.

Example

# Performs transformations and feature engineering in training and serving.
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=infer_schema.outputs['schema'],
    module_file=module_file)

Please see https://www.tensorflow.org/tfx/transform for more details.
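For context, a minimal sketch of the upstream components the snippet above assumes; exact constructor arguments vary across TFX releases, so treat this as illustrative rather than canonical:

from tfx.components import CsvExampleGen, SchemaGen, StatisticsGen

example_gen = CsvExampleGen(input_base=data_root)  # emits 'train' and 'eval' Examples
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
infer_schema = SchemaGen(statistics=statistics_gen.outputs['statistics'])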

Args

examples A Channel of type standard_artifacts.Examples (required). This should contain the two splits 'train' and 'eval'.
schema A Channel of type standard_artifacts.Schema. This should contain a single schema artifact.
module_file The file path to a python module file, from which the 'preprocessing_fn' function will be loaded. Exactly one of 'module_file' or 'preprocessing_fn' must be supplied.

The function needs to have the following signature:

def preprocessing_fn(inputs: Dict[Text, Any]) -> Dict[Text, Any]:
  ...

where the values of the input and returned Dicts are either tf.Tensor or tf.SparseTensor.
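For illustration, here is a minimal preprocessing_fn sketch using two common tf.Transform analyzers; the feature names 'dense_feature' and 'string_feature' are hypothetical placeholders, not part of this API:

import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Sketch only: the feature names below are hypothetical.
  outputs = {}
  # Full-pass analyzer: scale a numeric feature to zero mean and unit variance.
  outputs['dense_feature_scaled'] = tft.scale_to_z_score(inputs['dense_feature'])
  # Map string values to integer ids backed by a generated vocabulary.
  outputs['string_feature_id'] = tft.compute_and_apply_vocabulary(
      inputs['string_feature'])
  return outputs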

If additional inputs are needed for preprocessing_fn, they can be passed in custom_config:

def preprocessing_fn(inputs: Dict[Text, Any],
                     custom_config: Dict[Text, Any]) -> Dict[Text, Any]:
  ...
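A sketch of wiring custom_config through to preprocessing_fn; the 'num_buckets' key and 'numeric_feature' name are assumptions made for illustration:

import tensorflow_transform as tft

def preprocessing_fn(inputs, custom_config):
  # 'num_buckets' is a hypothetical key supplied via custom_config below.
  return {
      'numeric_feature_bucketized':
          tft.bucketize(inputs['numeric_feature'],
                        custom_config['num_buckets']),
  }

transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=infer_schema.outputs['schema'],
    module_file=module_file,
    custom_config={'num_buckets': 10})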

preprocessing_fn The path to a python function that implements 'preprocessing_fn'. See 'module_file' for the expected signature of the function. Exactly one of 'module_file' or 'preprocessing_fn' must be supplied.
transform_graph Optional output 'TransformPath' channel for the output of 'tf.Transform', which includes an exported TensorFlow graph suitable for both training and serving.
transformed_examples Optional output 'ExamplesPath' channel for materialized transformed examples, which includes both 'train' and 'eval' splits.
input_data Backwards compatibility alias for the 'examples' argument.
instance_name Optional unique instance name. Required only if multiple Transform components are declared in the same pipeline.
materialize If True, write transformed examples as an output. If False, transformed_examples must not be provided.
custom_config A dict which contains additional parameters that will be passed to preprocessing_fn.

Raises

ValueError When both or neither of 'module_file' and 'preprocessing_fn' is supplied.

Attributes

component_id DEPRECATED FUNCTION (use id instead)

component_type DEPRECATED FUNCTION (use type instead)
downstream_nodes

exec_properties

id Node id, unique across all TFX nodes in a pipeline.

If instance name is available, node_id will be: <component_class_name>.<instance_name>; otherwise, node_id will be: <component_class_name>.
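For example (a sketch; 'my_transform' is an arbitrary instance name):

transform = Transform(..., instance_name='my_transform')
transform.id  # -> 'Transform.my_transform'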

inputs

outputs

type

upstream_nodes

Child Classes

class DRIVER_CLASS

class SPEC_CLASS

Methods

add_downstream_node

Experimental: Add another component that must run after this one.

This method enables task-based dependencies by enforcing execution order for synchronous pipelines on supported platforms. Currently, the supported platforms are Airflow, Beam, and Kubeflow Pipelines.

Note that this API call should be considered experimental, and may not work with asynchronous pipelines, sub-pipelines and pipelines with conditional nodes. We also recommend relying on data for capturing dependencies where possible to ensure data lineage is fully captured within MLMD.

It is symmetric with add_upstream_node.

Args
downstream_node a component that must run after this node.
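As a sketch, forcing an ordering between two components that exchange no artifacts; transform and my_validator stand in for any two component instances:

# Run my_validator only after transform completes, despite no data dependency.
transform.add_downstream_node(my_validator)
# The symmetric call from the other side would be:
# my_validator.add_upstream_node(transform)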

add_upstream_node

Experimental: Add another component that must run before this one.

This method enables task-based dependencies by enforcing execution order for synchronous pipelines on supported platforms. Currently, the supported platforms are Airflow, Beam, and Kubeflow Pipelines.

Note that this API call should be considered experimental, and may not work with asynchronous pipelines, sub-pipelines and pipelines with conditional nodes. We also recommend relying on data for capturing dependencies where possible to ensure data lineage is fully captured within MLMD.

It is symmetric with add_downstream_node.

Args
upstream_node a component that must run before this node.

from_json_dict

Convert from dictionary data to an object.

get_id

Gets the id of a node.

This can be used during pipeline authoring time. For example:

from tfx.components import Trainer

resolver = ResolverNode(
    ...,
    model=Channel(
        type=Model,
        producer_component_id=Trainer.get_id('my_trainer')))

Args
instance_name (Optional) instance name of a node. If given, the instance name will be taken into consideration when generating the id.

Returns
an id for the node.

to_json_dict

Convert from an object to a JSON serializable dictionary.
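These two methods implement TFX's Jsonable interface. A speculative round-trip sketch, assuming the tfx.utils.json_utils helpers that wrap this interface:

from tfx.utils import json_utils

serialized = json_utils.dumps(transform)  # serializes via to_json_dict()
restored = json_utils.loads(serialized)   # reconstructs via from_json_dict()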

Class Variables

  • EXECUTOR_SPEC