Official TFX CsvExampleGen component.
Inherits From: FileBasedExampleGen, BaseComponent, BaseNode
```
tfx.components.CsvExampleGen(
    input: Optional[tfx.types.Channel] = None,
    input_base: Optional[Text] = None,
    input_config: Optional[Union[example_gen_pb2.Input, Dict[Text, Any]]] = None,
    output_config: Optional[Union[example_gen_pb2.Output, Dict[Text, Any]]] = None,
    range_config: Optional[Union[range_config_pb2.RangeConfig, Dict[Text, Any]]] = None,
    example_artifacts: Optional[tfx.types.Channel] = None,
    instance_name: Optional[Text] = None
)
```
The CsvExampleGen component takes CSV data and generates train and eval examples for downstream components.

CsvExampleGen encodes column values into `tf.Example` int/float/byte features. For cells with missing values, CsvExampleGen uses:

- `tf.train.Feature(<type>_list=tf.train.<type>List(value=[]))`, when the type can be inferred.
- `tf.train.Feature()`, when the type cannot be inferred from the column.
Note that type inference is performed per input split. If the input is not a single split, users need to ensure that the column types align across all pre-splits.
For example, given the following CSV rows of a split:

```
header: A,B,C,D
row1:   1,,x,0.1
row2:   2,,y,0.2
row3:   3,,,0.3
row4:
```

The output examples will be:

```
example1: 1(int), empty feature(no type), x(string), 0.1(float)
example2: 2(int), empty feature(no type), y(string), 0.2(float)
example3: 3(int), empty feature(no type), empty list(string), 0.3(float)
```

Note that the empty feature is `tf.train.Feature()`, while the empty string-list feature is `tf.train.Feature(bytes_list=tf.train.BytesList(value=[]))`.
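The per-column behavior above can be sketched with plain Python. This is a minimal illustration of the inference rules described in this section, not TFX's actual implementation; the function name and int/float/bytes precedence are assumptions for demonstration.

```python
# Sketch (not TFX code) of per-split, per-column type inference:
# a column is 'int' if every non-empty cell parses as an int,
# 'float' if every non-empty cell parses as a number, otherwise
# 'bytes'; a column with no non-empty cells has no inferable type,
# which corresponds to an empty tf.train.Feature().
from typing import List, Optional

def infer_column_type(cells: List[str]) -> Optional[str]:
    non_empty = [c for c in cells if c != '']
    if not non_empty:
        return None  # no type -> tf.train.Feature()
    try:
        for c in non_empty:
            int(c)
        return 'int'
    except ValueError:
        pass
    try:
        for c in non_empty:
            float(c)
        return 'float'
    except ValueError:
        return 'bytes'

# Columns from the example rows above (A, B, C, D).
rows = [['1', '', 'x', '0.1'],
        ['2', '', 'y', '0.2'],
        ['3', '', '', '0.3']]
types = [infer_column_type(list(col)) for col in zip(*rows)]
print(types)  # ['int', None, 'bytes', 'float']
```

Column C infers as string ('bytes') from rows 1 and 2, so its missing cell in row 3 becomes an empty bytes list rather than a typeless feature.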
| Args | |
|---|---|
| `input` | A Channel of type `standard_artifacts.ExternalArtifact`, which includes one artifact whose uri is an external directory containing the CSV files. (Deprecated by `input_base`.) |
| `input_base` | An external directory containing the CSV files. |
| `input_config` | An `example_gen_pb2.Input` instance, providing input configuration. If unset, the files under `input_base` will be treated as a single split. If any field is provided as a RuntimeParameter, `input_config` should be constructed as a dict with the same field names as the Input proto message. |
| `output_config` | An `example_gen_pb2.Output` instance, providing output configuration. If unset, the default splits will be 'train' and 'eval' with a 2:1 size ratio. If any field is provided as a RuntimeParameter, `output_config` should be constructed as a dict with the same field names as the Output proto message. |
| `range_config` | An optional `range_config_pb2.RangeConfig` instance, specifying the range of span values to consider. If unset, the driver will default to searching for the latest span with no restrictions. |
| `example_artifacts` | Optional channel of 'ExamplesPath' for output train and eval examples. |
| `instance_name` | Optional unique instance name. Necessary if multiple CsvExampleGen components are declared in the same pipeline. |
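The default 2:1 'train'/'eval' split mentioned for `output_config` can be illustrated with a simple hash-based partition. This is a stdlib sketch of the idea only; the hashing scheme and bucket layout are assumptions, not TFX's actual partitioning code.

```python
# Illustrative sketch of a deterministic 2:1 'train'/'eval' split,
# mirroring the default output configuration described above.
# The md5-based bucketing here is an assumption, not a TFX internal.
import hashlib

def assign_split(record: str, buckets=(('train', 2), ('eval', 1))) -> str:
    total = sum(n for _, n in buckets)
    bucket = int(hashlib.md5(record.encode()).hexdigest(), 16) % total
    upper = 0
    for name, n in buckets:
        upper += n
        if bucket < upper:
            return name
    return buckets[-1][0]

# Each record lands deterministically in one split; over many records
# the ratio approaches 2:1.
splits = [assign_split(f'row{i}') for i in range(9)]
```

Because the assignment is a pure function of the record, re-running the split on the same data reproduces the same partition.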
| Attributes | |
|---|---|
| `component_id` | |
| `component_type` | |
| `downstream_nodes` | |
| `exec_properties` | |
| `id` | Node id, unique across all TFX nodes in a pipeline. |
| `inputs` | |
| `outputs` | |
| `type` | |
| `upstream_nodes` | |
Methods
add_downstream_node
```
add_downstream_node(
    downstream_node
)
```
Experimental: Add another component that must run after this one.
This method enables task-based dependencies by enforcing execution order for synchronous pipelines on supported platforms. Currently, the supported platforms are Airflow, Beam, and Kubeflow Pipelines.
Note that this API call should be considered experimental, and may not work with asynchronous pipelines, sub-pipelines and pipelines with conditional nodes. We also recommend relying on data for capturing dependencies where possible to ensure data lineage is fully captured within MLMD.
It is symmetric with `add_upstream_node`.

| Args | |
|---|---|
| `downstream_node` | A component that must run after this node. |
add_upstream_node
```
add_upstream_node(
    upstream_node
)
```
Experimental: Add another component that must run before this one.
This method enables task-based dependencies by enforcing execution order for synchronous pipelines on supported platforms. Currently, the supported platforms are Airflow, Beam, and Kubeflow Pipelines.
Note that this API call should be considered experimental, and may not work with asynchronous pipelines, sub-pipelines and pipelines with conditional nodes. We also recommend relying on data for capturing dependencies where possible to ensure data lineage is fully captured within MLMD.
It is symmetric with `add_downstream_node`.

| Args | |
|---|---|
| `upstream_node` | A component that must run before this node. |
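The edges added by these two methods determine a task-based execution order. The following sketch uses a hypothetical minimal `Node` class (not TFX's `BaseNode`) and the standard library's topological sorter to show how upstream/downstream edges translate into an ordering:

```python
# Hypothetical node with add_upstream_node/add_downstream_node
# semantics, plus a topological order over the resulting edges.
# This illustrates task-based dependencies; it is not TFX code.
from graphlib import TopologicalSorter

class Node:
    def __init__(self, name):
        self.name = name
        self.upstream_nodes = set()
        self.downstream_nodes = set()

    def add_downstream_node(self, other):
        self.downstream_nodes.add(other)
        other.upstream_nodes.add(self)

    def add_upstream_node(self, other):
        # Symmetric with add_downstream_node.
        other.add_downstream_node(self)

gen, trainer = Node('CsvExampleGen'), Node('Trainer')
trainer.add_upstream_node(gen)  # trainer must run after gen

graph = {n.name: {u.name for u in n.upstream_nodes} for n in (gen, trainer)}
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['CsvExampleGen', 'Trainer']
```

In a real pipeline, dependencies carried by data channels are preferred, since they are also recorded as lineage in MLMD.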
from_json_dict
```
@classmethod
from_json_dict(
    dict_data: Dict[Text, Any]
) -> Any
```
Convert from dictionary data to an object.
get_id
```
@classmethod
get_id(
    instance_name: Optional[Text] = None
)
```
Gets the id of a node.
This can be used during pipeline authoring time. For example:

```
from tfx.components import Trainer

resolver = ResolverNode(
    ...,
    model=Channel(
        type=Model,
        producer_component_id=Trainer.get_id('my_trainer')))
```
| Args | |
|---|---|
| `instance_name` | (Optional) Instance name of a node. If given, the instance name will be taken into consideration when generating the id. |
| Returns | |
|---|---|
| An id for the node. |
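The relationship between class name, instance name, and id can be sketched as below. The exact `"<ClassName>.<instance_name>"` format is an assumption for illustration, not a guarantee about TFX's id scheme:

```python
# Sketch of node-id generation from a class name and an optional
# instance name; the dot-joined format is an assumption, not
# necessarily TFX's exact scheme.
from typing import Optional

def get_id(class_name: str, instance_name: Optional[str] = None) -> str:
    if instance_name:
        return f'{class_name}.{instance_name}'
    return class_name

print(get_id('Trainer', 'my_trainer'))  # Trainer.my_trainer
print(get_id('CsvExampleGen'))          # CsvExampleGen
```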
to_json_dict
```
to_json_dict() -> Dict[Text, Any]
```
Convert from an object to a JSON serializable dictionary.
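Together, `from_json_dict` and `to_json_dict` form a dict-based serialization round trip. A minimal sketch of that contract on a plain class (the `Config` class and its fields are illustrative, not TFX's implementation):

```python
# Minimal to_json_dict / from_json_dict round trip on a plain class,
# illustrating the serialization contract (not TFX code).
import json
from typing import Any, Dict

class Config:
    def __init__(self, name: str, splits: int):
        self.name = name
        self.splits = splits

    def to_json_dict(self) -> Dict[str, Any]:
        # Emit only JSON-serializable values.
        return {'name': self.name, 'splits': self.splits}

    @classmethod
    def from_json_dict(cls, dict_data: Dict[str, Any]) -> 'Config':
        return cls(**dict_data)

c = Config('eval', 1)
payload = json.dumps(c.to_json_dict())  # safe to store or transmit
restored = Config.from_json_dict(json.loads(payload))
```

Because the intermediate form is a plain dict, it composes with any JSON transport or storage layer.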
with_id
```
with_id(
    id: Text
) -> "BaseNode"
```
with_platform_config
```
with_platform_config(
    config: message.Message
) -> "BaseComponent"
```
Attaches a proto-form platform config to a component.
The config will be a per-node platform-specific config.
| Args | |
|---|---|
| `config` | Platform config to attach to the component. |
| Returns | |
|---|---|
| The same component itself. |
| Class Variables | |
|---|---|
| `EXECUTOR_SPEC` | `tfx.dsl.components.base.executor_spec.ExecutorClassSpec` |