tfio.experimental.columnar.make_avro_dataset

Creates an Avro dataset.

Reads Avro files and parses their contents into tensors.

Args:

  • filenames: A tf.string tensor containing one or more filenames.
  • reader_schema: A tf.string scalar containing the Avro reader schema, used for schema resolution.

  • features: A dictionary that maps keys to feature specifications. Each key describes a single entry or sparse vector in the Avro record, and its specification maps that entry to a tensor. The syntax is as follows:

      features = {'my_meta_data.size':
                  tf.io.FixedLenFeature([], tf.int64)}
    
      Selects the 'size' field from the metadata record stored in the
      field 'my_meta_data'. This example assumes that the size is
      encoded as a long in the Avro record for the metadata.
    
      features = {"my_map_data['source'].ip_addresses":
                  tf.io.VarLenFeature(tf.string)}
    
      Selects the 'ip_addresses' for the 'source' key in the map
      'my_map_data'. This example assumes that IP addresses are
      encoded as strings.
    
      features = {'my_friends[1].first_name':
                  tf.io.FixedLenFeature([], tf.string)}
    
      Selects the 'first_name' of the second friend, at index '1'.
      This assumes that every record has a second friend and that
      each friend has exactly one first name, which is why a
      'FixedLenFeature' is used.
    
      features = {'my_friends[*].first_name':
                  tf.io.VarLenFeature(tf.string)}
    
      Selects all first names in each row. The wildcard '*' indicates
      that every 'first_name' entry in the array is selected.
    
      features = {'sparse_features':
                  tf.io.SparseFeature(index_key='index',
                                      value_key='value',
                                      dtype=tf.float32, size=10)}
    
      This assumes that 'sparse_features' contains an array of records,
      each with an 'index' field that MUST BE LONG and a 'value'
      field holding a float (single precision).
    
  • batch_size: The number of items in a batch; must be > 0.

  • num_parallel_calls: Number of parallel calls

  • label_key: The label key. If None, no label is returned.

  • num_epochs: The number of epochs. If set to None, the dataset cycles indefinitely and the remainder is dropped automatically, so that every batch has the same, static size.

  • input_stream_buffer_size: The size of the input stream buffer in bytes.

  • avro_data_buffer_size: The size of the Avro data buffer in bytes.
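
The feature keys above only resolve if the reader schema contains matching fields. As a minimal sketch, the following builds one possible reader schema for those keys using only the standard library. The record names ('Example', 'MetaData', 'Source', 'Friend', 'SparseEntry') and the helper build_reader_schema are hypothetical and chosen for illustration; the field names and types follow the assumptions stated in the feature examples.

```python
import json


def build_reader_schema() -> str:
    """Return an Avro schema (as a JSON string) matching the example feature keys.

    All record names below are hypothetical; only the field names and
    their types come from the feature examples in the Args section.
    """
    schema = {
        "type": "record",
        "name": "Example",  # hypothetical top-level record name
        "fields": [
            # 'my_meta_data.size' -> a nested record with a long field
            {"name": "my_meta_data", "type": {
                "type": "record", "name": "MetaData",
                "fields": [{"name": "size", "type": "long"}]}},
            # "my_map_data['source'].ip_addresses" -> a map of records,
            # each holding an array of strings
            {"name": "my_map_data", "type": {
                "type": "map", "values": {
                    "type": "record", "name": "Source",
                    "fields": [{"name": "ip_addresses",
                                "type": {"type": "array", "items": "string"}}]}}},
            # 'my_friends[*].first_name' -> an array of records
            {"name": "my_friends", "type": {
                "type": "array", "items": {
                    "type": "record", "name": "Friend",
                    "fields": [{"name": "first_name", "type": "string"}]}}},
            # 'sparse_features' -> records with a long index and a float value
            {"name": "sparse_features", "type": {
                "type": "array", "items": {
                    "type": "record", "name": "SparseEntry",
                    "fields": [{"name": "index", "type": "long"},
                               {"name": "value", "type": "float"}]}}},
        ],
    }
    return json.dumps(schema)


reader_schema = build_reader_schema()
```

The resulting string could then be passed as the reader_schema argument, together with a features dictionary that uses the keys shown above. Whether a given data file resolves against this schema depends on the writer schema stored in that file.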