tfx.components.example_gen.custom_executors.avro_executor.Executor

TFX example gen executor for processing Avro-format data.

Inherits From: BaseExampleGenExecutor

Data type conversion:

  • Integer types are converted to tf.train.Feature with tf.train.Int64List.
  • Float types are converted to tf.train.Feature with tf.train.FloatList.
  • String types are converted to tf.train.Feature with tf.train.BytesList, using UTF-8 encoding.

Note that a single value is converted to a list containing that value, and a missing value is converted to an empty tf.train.Feature().

For details, check the dict_to_example function in example_gen.utils.
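
The conversion rules above can be sketched in plain Python. This is an illustration only, not the dict_to_example implementation: the helper names classify_feature and convert_record are hypothetical, the returned labels merely name which tf.train list type would be used, and treating an empty list like a missing value is an assumption.

```python
from typing import Any, Dict, List, Optional, Tuple


def classify_feature(value: Optional[Any]) -> Tuple[str, List[Any]]:
    """Return (list_kind, values) for one field, following the rules above."""
    # A missing value maps to an empty feature.
    if value is None:
        return ("empty", [])
    # A single value becomes a list containing that value.
    values = value if isinstance(value, list) else [value]
    # Assumption: an empty list is treated like a missing value.
    if not values:
        return ("empty", [])
    first = values[0]
    if isinstance(first, int):
        return ("int64_list", [int(v) for v in values])
    if isinstance(first, float):
        return ("float_list", [float(v) for v in values])
    if isinstance(first, str):
        # Strings are stored as UTF-8 encoded bytes.
        return ("bytes_list", [v.encode("utf-8") for v in values])
    raise ValueError(f"Unsupported value type: {type(first).__name__}")


def convert_record(record: Dict[str, Any]) -> Dict[str, Tuple[str, List[Any]]]:
    """Apply classify_feature to every field of a decoded record."""
    return {key: classify_feature(value) for key, value in record.items()}


# Hypothetical record with one field per rule.
converted = convert_record({"age": 42, "scores": [0.5, 1.5], "name": "zoë", "city": None})
```

Here converted["age"] holds a one-element integer list, while converted["city"] is marked empty because the value is missing.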

Example usage:

from tfx.components.example_gen.component import FileBasedExampleGen
from tfx.components.example_gen.custom_executors import avro_executor
from tfx.utils.dsl_utils import external_input

example_gen = FileBasedExampleGen(
    input=external_input(avro_dir_path),
    executor_class=avro_executor.Executor)

Child Classes

class Context

Methods

Do

Takes an input data source and generates serialized data splits.

The output is intended to be serialized tf.train.Example or tf.train.SequenceExample protocol buffers in gzipped TFRecord format. Subclasses can override this to write any serialized record payload into gzipped TFRecord files, as long as the downstream component can consume it. The payload format is recorded in the payload_format custom property of the output Example artifact.

Args
input_dict Input dict from input key to a list of Artifacts. The contents depend on the specific example gen implementation.
output_dict Output dict from output key to a list of Artifacts.
  • examples: splits of serialized records.
exec_properties A dict of execution properties. The contents depend on the specific example gen implementation.
  • input_base: an external directory containing the data files.
  • input_config: JSON string of example_gen_pb2.Input instance, providing input configuration.
  • output_config: JSON string of example_gen_pb2.Output instance, providing output configuration.
  • output_data_format: Payload format of generated data in output artifact, one of example_gen_pb2.PayloadFormat enum.

Returns
None
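
As a rough illustration of the exec_properties described above, the following sketch builds the two JSON-string entries with the standard json module. The field names (splits, name, pattern) are assumed to mirror the example_gen_pb2.Input message, the directory path and split patterns are hypothetical, and leaving output_config empty when the input already defines its splits is likewise an assumption. output_data_format is omitted because its value comes from the example_gen_pb2.PayloadFormat enum.

```python
import json

# Assumed JSON shape of example_gen_pb2.Input: one entry per pre-defined split.
input_config = {
    "splits": [
        {"name": "train", "pattern": "train/*"},  # hypothetical file pattern
        {"name": "eval", "pattern": "eval/*"},    # hypothetical file pattern
    ]
}

# Assumption: the input already defines its splits, so no output split config is given.
output_config = {}

exec_properties = {
    "input_base": "/path/to/avro_dir",  # hypothetical directory of Avro files
    "input_config": json.dumps(input_config),
    "output_config": json.dumps(output_config),
}
```

Both config entries are plain strings at this point; an executor would parse them back into the corresponding proto messages before use.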

GenerateExamplesByBeam

Converts input source to serialized record splits based on configs.

A custom ExampleGen executor should provide GetInputSourceToExamplePTransform to convert an input split to serialized records. Override this GenerateExamplesByBeam method instead if more complex logic is needed, e.g., custom splitting logic.

Args
pipeline Beam pipeline.
exec_properties A dict of execution properties. The contents depend on the specific example gen implementation.
  • input_base: an external directory containing the data files.
  • input_config: JSON string of example_gen_pb2.Input instance, providing input configuration.
  • output_config: JSON string of example_gen_pb2.Output instance, providing output configuration.
  • output_data_format: Payload format of generated data in output artifact, one of example_gen_pb2.PayloadFormat enum.

Returns
Dict of Beam PCollections keyed by split name. Each PCollection is a single output split that contains serialized records.
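
To make the shape of the return value concrete, here is a minimal sketch of one possible custom splitting scheme. It is not the TFX implementation: plain Python lists stand in for Beam PCollections, the bucket-ratio configuration is an assumption, hashlib.md5 is only a stand-in for whatever hash a real executor would use, and assign_split and split_records are hypothetical helper names.

```python
import hashlib
from typing import Dict, Iterable, List, Sequence, Tuple


def assign_split(record: bytes, buckets: Sequence[Tuple[str, int]]) -> str:
    """Pick a split name for a serialized record by hashing it into buckets."""
    total = sum(count for _, count in buckets)
    if total <= 0:
        raise ValueError("At least one split needs a positive bucket count.")
    # Hash the record so the same record always lands in the same split.
    digest = hashlib.md5(record).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    upper = 0
    for name, count in buckets:
        upper += count
        if bucket < upper:
            return name
    raise AssertionError("bucket index is always below the total bucket count")


def split_records(
    records: Iterable[bytes], buckets: Sequence[Tuple[str, int]]
) -> Dict[str, List[bytes]]:
    """Return a dict keyed by split name, each value holding that split's records."""
    splits: Dict[str, List[bytes]] = {name: [] for name, _ in buckets}
    for record in records:
        splits[assign_split(record, buckets)].append(record)
    return splits


# Hypothetical serialized records and a 2:1 train/eval bucket ratio.
records = [f"record-{i}".encode("utf-8") for i in range(100)]
splits = split_records(records, [("train", 2), ("eval", 1)])
```

Hashing the record content keeps the assignment deterministic across runs, which is why the sketch avoids random sampling.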

GetInputSourceToExamplePTransform

Returns a PTransform that converts Avro records to TF examples.