tfdv.DecodeCSV

Class DecodeCSV

Decodes CSV records into Arrow tables.

Currently, each column in the input CSV is assumed to hold a single value per record.

__init__

__init__(
    column_names,
    delimiter=',',
    skip_blank_lines=True,
    schema=None,
    infer_type_from_schema=False,
    desired_batch_size=constants.DEFAULT_DESIRED_INPUT_BATCH_SIZE
)

Initializes the CSV decoder.

Args:

  • column_names: List of feature names, in the same order as the columns in the CSV file.
  • delimiter: A one-character string used to separate fields.
  • skip_blank_lines: A boolean to indicate whether to skip over blank lines rather than interpreting them as missing values.
  • schema: An optional schema of the input data.
  • infer_type_from_schema: A boolean to indicate whether the feature types should be inferred from the schema. If set to True, an input schema must be provided.
  • desired_batch_size: The desired number of rows in each output Arrow table.
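
To make the roles of column_names, delimiter, and skip_blank_lines concrete, here is a minimal plain-Python sketch of the parsing step. It is not TFDV's implementation: decode_lines is a hypothetical helper, and both the treatment of empty cells as missing (None) and the error raised on a column-count mismatch are assumptions made for illustration.

```python
import csv


def decode_lines(lines, column_names, delimiter=",", skip_blank_lines=True):
    """Hypothetical helper: turn CSV lines into {column_name: value} dicts."""
    records = []
    for line in lines:
        if not line.strip():
            if skip_blank_lines:
                continue
            # Assumption: a blank line that is kept becomes a record
            # in which every column is missing.
            records.append({name: None for name in column_names})
            continue
        # Parse one line at a time; csv.reader takes care of quoting.
        cells = next(csv.reader([line], delimiter=delimiter))
        if len(cells) != len(column_names):
            raise ValueError(
                "Expected %d columns, got %d" % (len(column_names), len(cells))
            )
        # Assumption: an empty cell is treated as a missing value.
        records.append(
            {name: (cell if cell != "" else None)
             for name, cell in zip(column_names, cells)}
        )
    return records


lines = ["1|a", "", "2|"]
records = decode_lines(lines, ["id", "label"], delimiter="|")
```

Because cells are matched to names by position, a column_names list in the wrong order would silently assign values to the wrong features, which is why the order has to match the file.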

Methods

expand

expand(lines)

Decodes the input CSV records into Arrow tables.

Args:

  • lines: A PCollection of strings representing the lines in the CSV file.

Returns:

A PCollection of Arrow tables.
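
The effect of desired_batch_size on the output can be pictured with a small plain-Python sketch that groups decoded rows into column-oriented batches. This is only an illustration under stated assumptions: batch_columns is a hypothetical helper, a dict of lists stands in for an Arrow table, and the smaller final batch is a property of this sketch rather than a documented guarantee of DecodeCSV.

```python
def batch_columns(records, column_names, desired_batch_size):
    """Hypothetical helper: group row dicts into column-oriented batches."""
    batches = []
    for start in range(0, len(records), desired_batch_size):
        chunk = records[start:start + desired_batch_size]
        # One list per column, mirroring the columnar layout of a table.
        batches.append(
            {name: [row[name] for row in chunk] for name in column_names}
        )
    return batches


rows = [
    {"id": "1", "label": "a"},
    {"id": "2", "label": None},
    {"id": "3", "label": "c"},
]
batches = batch_columns(rows, ["id", "label"], desired_batch_size=2)
```

Each batch holds one list per column, so downstream code can process a whole column at once instead of iterating row by row.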