tft.coders.CsvCoder

A coder that encodes and decodes CSV-formatted data.

Args
column_names Tuple of strings. Order must match the order in the file.
schema A Schema proto.
delimiter A one-character string used to separate fields.
secondary_delimiter A one-character string used to separate values within the same field.
multivalent_columns A list of the names of multivalent columns, which are split on the secondary delimiter.

Raises
ValueError If schema is invalid.
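
To illustrate how the two delimiters relate, here is a minimal sketch using only Python's standard `csv` module. It is not the CsvCoder implementation: `split_record` is a hypothetical helper, it returns raw strings rather than typed values, and it ignores the schema entirely.

```python
import csv


def split_record(csv_string, column_names, delimiter=",",
                 secondary_delimiter=None, multivalent_columns=()):
    """Hypothetical helper: map column names to raw string cells."""
    # csv.reader handles quoting; a single record yields a single row.
    row = next(csv.reader([csv_string], delimiter=delimiter))
    if len(row) != len(column_names):
        raise ValueError(
            f"expected {len(column_names)} columns, got {len(row)}")
    record = {}
    for name, cell in zip(column_names, row):
        if name in multivalent_columns:
            # Multivalent cells are split again on the secondary delimiter;
            # an empty cell yields no values.
            record[name] = cell.split(secondary_delimiter) if cell else []
        else:
            record[name] = cell
    return record


example = split_record("1,a|b|c,3.5", ("id", "tags", "score"),
                       secondary_delimiter="|",
                       multivalent_columns=["tags"])
print(example)
```

In this sketch, the primary delimiter separates the three fields and the secondary delimiter separates the values inside the `tags` field.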

Methods

decode

Decodes the given string record according to the schema.

Missing value handling is as follows:

  1. For FixedLenFeature:

    1. If the feature has a default value, that value is used for missing entries.
    2. If the feature has no default value, an exception is raised on missing entries.

  2. For VarLenFeature, an empty array is returned.

  3. For SparseFeature, an exception is raised if only one of the indices or values has a missing entry. If both indices and values are missing, a tuple of two empty arrays is returned.

For multivalent columns, a ValueError is raised if a FixedLenFeature receives the wrong number of values or a SparseFeature receives indices and values of different lengths.
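
The missing-value rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `FixedLen`, `VarLen`, `decode_cell`, and `decode_sparse` are hypothetical names, plain lists stand in for arrays, an empty string is treated as a missing entry, and raising `ValueError` is an assumption, since the text above only says "an exception".

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FixedLen:
    """Hypothetical stand-in for a FixedLenFeature spec."""
    default: Optional[Any] = None


@dataclass
class VarLen:
    """Hypothetical stand-in for a VarLenFeature spec."""


def decode_cell(name, cell, spec, cast=float):
    """Apply rules 1 and 2 to one raw CSV cell."""
    missing = cell == ""
    if isinstance(spec, FixedLen):
        if not missing:
            return cast(cell)
        if spec.default is not None:
            return spec.default  # rule 1.1: fall back to the default
        # rule 1.2: no default, so a missing entry is an error
        raise ValueError(f"column {name!r} is missing and has no default")
    if isinstance(spec, VarLen):
        return [] if missing else [cast(cell)]  # rule 2
    raise TypeError(f"unsupported spec: {spec!r}")


def decode_sparse(index_cell, value_cell, cast=float):
    """Apply rule 3 to the index and value cells of a sparse feature."""
    index_missing = index_cell == ""
    value_missing = value_cell == ""
    if index_missing != value_missing:
        raise ValueError("index and value must both be present or both missing")
    if index_missing:
        return ([], [])  # both missing: a tuple of two empty sequences
    return ([int(index_cell)], [cast(value_cell)])


print(decode_cell("age", "", FixedLen(default=0.0)))
print(decode_cell("tags", "", VarLen()))
print(decode_sparse("", ""))
```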

Args
csv_string String to be decoded.

Returns
Dictionary of column name to value.

Raises
DecodeError If the columns do not match the specified CSV headers.
ValueError If a numeric column contains non-numeric data, if a SparseFeature has missing indices but not values (or vice versa), or if multivalent data has the wrong length.

encode

Encodes a tf.transform encoded dict as a CSV-formatted string.

Args
instance A Python dictionary whose keys are the column names and whose values are fixed-length or variable-length encoded features.

Returns
A csv-formatted string. The order of the columns is given by column_names.
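
As a rough illustration of how the column order is determined, the following sketch writes a dictionary out in `column_names` order using the standard `csv` module. `encode_instance` is a hypothetical helper, not the CsvCoder method, and its handling of list values (an empty list becomes an empty cell, a single-element list becomes that element) is an assumption made for the example.

```python
import csv
import io


def encode_instance(instance, column_names, delimiter=","):
    """Hypothetical helper: serialize a dict as one CSV line."""
    row = []
    for name in column_names:  # output order follows column_names, not the dict
        value = instance[name]
        if isinstance(value, (list, tuple)):
            if len(value) > 1:
                raise ValueError(f"column {name!r} has more than one value")
            value = value[0] if value else ""
        row.append(value)
    buffer = io.StringIO()
    # lineterminator="" keeps the result a single line without a trailing newline.
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="")
    writer.writerow(row)
    return buffer.getvalue()


line = encode_instance({"score": 3.5, "id": 1, "name": "a,b"},
                       ("id", "name", "score"))
print(line)
```

The key order of the input dictionary does not affect the result, and a value containing the delimiter is quoted by the `csv` writer.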