tfx_bsl.public.tfxio.TFExampleRecord

TFXIO implementation for tf.Example on TFRecord.

Inherits From: TFXIO

file_pattern A file glob pattern to read TFRecords from.
validate Not used; do not set. (Unused in releases after 0.22.1.)
schema A TFMD Schema describing the dataset.
raw_record_column_name If not None, the generated Arrow RecordBatches will contain a column of the given name that contains serialized records.
telemetry_descriptors A set of descriptors that identify the component that is instantiating this TFXIO. These will be used to construct the namespace to contain metrics for profiling and are therefore expected to be identifiers of the component itself and not individual instances of source use.

raw_record_column_name

telemetry_descriptors
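As a minimal sketch of how the constructor arguments above fit together, the helper below builds a TFExampleRecord. The helper name make_example_tfxio, the column name "raw_record", and the descriptor "my_component" are illustrative assumptions, not part of the API. The import sits inside the function only so the snippet can be loaded without tfx_bsl installed.

```python
def make_example_tfxio(file_pattern, schema=None):
    """Builds a TFExampleRecord for the TFRecord files matching `file_pattern`.

    `schema` is an optional TFMD Schema proto describing the dataset.
    """
    # Imported lazily so this sketch can be loaded without tfx_bsl installed.
    from tfx_bsl.public import tfxio

    return tfxio.TFExampleRecord(
        file_pattern=file_pattern,
        schema=schema,
        # Hypothetical column name that will hold the serialized records.
        raw_record_column_name="raw_record",
        # Hypothetical descriptor identifying the calling component.
        telemetry_descriptors=["my_component"],
    )
```

Leaving schema as None is allowed, but several of the methods below may then raise an error.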

Methods

ArrowSchema

Returns the schema of the RecordBatch produced by self.BeamSource().

May raise an error if the TFMD schema was not provided at construction time.

BeamSource

Returns a beam PTransform that produces PCollection[pa.RecordBatch].

Must not raise an error if the TFMD schema was not provided at construction time.

If a TFMD schema was provided at construction time, all the pa.RecordBatches in the resulting PCollection must have the schema returned by self.ArrowSchema(). If a TFMD schema was not provided, the pa.RecordBatches might not share a schema (they may contain different numbers of columns).

Args
batch_size If not None, each pa.RecordBatch produced will be of the specified size. Otherwise the batch size is automatically tuned by Beam.
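The following is a minimal sketch of applying the source to a pipeline. The function name read_record_batches and its parameter names are assumptions made for illustration; example_tfxio stands for an already constructed TFExampleRecord, and pipeline for an Apache Beam pipeline object.

```python
def read_record_batches(pipeline, example_tfxio, batch_size=None):
    """Applies the TFXIO's source PTransform to `pipeline`.

    With a real Beam pipeline, the result is a PCollection[pa.RecordBatch].
    """
    # batch_size=None leaves the batch size to be tuned automatically by Beam.
    source = example_tfxio.BeamSource(batch_size=batch_size)
    return pipeline | source
```

In practice this would typically be called inside a `with beam.Pipeline() as p:` block, passing p as the pipeline argument.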

Project

Projects the dataset represented by this TFXIO.

A Projected TFXIO:

  • Only columns needed for given tensor_names are guaranteed to be produced by self.BeamSource()
  • self.TensorAdapterConfig() and self.TensorFlowDataset() are trimmed to contain only those tensors.
  • It retains a reference to the original (unprojected) TFXIO, so its TensorAdapter knows the specs of the tensors that the original TensorAdapter would produce. Also see TensorAdapter.OriginalTensorSpec().

May raise an error if the TFMD schema was not provided at construction time.

Args
tensor_names a set of tensor names.

Returns
A TFXIO instance that is the same as self except that:

  • Only columns needed for given tensor_names are guaranteed to be produced by self.BeamSource()
  • self.TensorAdapterConfig() and self.TensorFlowDataset() are trimmed to contain only those tensors.
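A projection could be requested roughly as follows. This is a sketch only: the helper name project_to is hypothetical, and passing the names as a sorted, de-duplicated list is an assumption about what Project() accepts. As noted above, the call may raise if no TFMD schema was given at construction time.

```python
def project_to(example_tfxio, tensor_names):
    """Returns a projected TFXIO limited to the tensors in `tensor_names`."""
    # De-duplicate and sort so the call is deterministic for any input order.
    names = sorted(set(tensor_names))
    return example_tfxio.Project(names)
```

The returned object can be used wherever the original TFXIO was used, for example to call BeamSource() or TensorAdapterConfig() on the reduced set of tensors.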

RawRecordBeamSource

Returns a PTransform that produces a PCollection[bytes].

Used together with RawRecordToRecordBatch(), it allows getting both the PCollection of the raw records and the PCollection of the RecordBatch from the same source. For example:

    record_batch = pipeline | tfxio.BeamSource()
    raw_record = pipeline | tfxio.RawRecordBeamSource()

would result in the files being read twice, while the following would only read once:

    raw_record = pipeline | tfxio.RawRecordBeamSource()
    record_batch = raw_record | tfxio.RawRecordToRecordBatch()

RawRecordToRecordBatch

Returns a PTransform that converts raw records to Arrow RecordBatches.

The input PCollection must be from self.RawRecordBeamSource() (also see the documentation for that method).

Args
batch_size If not None, each pa.RecordBatch produced will be of the specified size. Otherwise the batch size is automatically tuned by Beam.

RecordBatches

Returns an iterable of record batches.

This can be used outside of Apache Beam or TensorFlow to access data.

Args
options An options object for iterating over record batches. Look at dataset_options.RecordBatchesOptions for more details.
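As a small sketch of using this method without Beam or TensorFlow, the helper below counts rows across the yielded batches. The name total_rows is hypothetical. It assumes that options is a dataset_options.RecordBatchesOptions describing a finite pass over the data (otherwise the sum might never finish), and that each yielded item is a pyarrow RecordBatch exposing num_rows.

```python
def total_rows(example_tfxio, options):
    """Counts rows across all record batches yielded for `options`."""
    # Each yielded item is expected to be a pa.RecordBatch.
    return sum(batch.num_rows for batch in example_tfxio.RecordBatches(options))
```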

SupportAttachingRawRecords

TensorAdapter

Returns a TensorAdapter that converts pa.RecordBatch to TF inputs.

May raise an error if the TFMD schema was not provided at construction time.
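One way the adapter might be used is sketched below. The generator name to_tensor_batches is hypothetical, and the call to ToBatchTensors() relies on the TensorAdapter class rather than on anything stated on this page, so treat it as an assumption to check against the TensorAdapter documentation.

```python
def to_tensor_batches(example_tfxio, record_batches):
    """Yields the TF inputs for each pa.RecordBatch in `record_batches`."""
    # Requires that a TFMD schema was provided when the TFXIO was constructed.
    adapter = example_tfxio.TensorAdapter()
    for record_batch in record_batches:
        # Assumed TensorAdapter method that converts one batch to tensors.
        yield adapter.ToBatchTensors(record_batch)
```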

TensorAdapterConfig

Returns the config to initialize a TensorAdapter.

Returns
a TensorAdapterConfig that is the same as what is used to initialize the TensorAdapter returned by self.TensorAdapter().

TensorFlowDataset

Creates a TFRecordDataset that yields Tensors.

The serialized tf.Examples are parsed by tf.io.parse_example to create Tensors.

See base class (tfxio.TFXIO) for more details.

Args
options an options object for the tf.data.Dataset. See dataset_options.TensorFlowDatasetOptions for more details.

Returns
A dataset of dict elements (or of tuples of a dict and a label). Each dict maps feature keys to Tensor, SparseTensor, or RaggedTensor objects.

Raises
ValueError if there is something wrong with the tensor_representation.
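Because an element may be either a dict or a (dict, label) tuple, calling code often has to handle both shapes. The sketch below shows one way to do that; the helper names split_element and first_features_and_label are hypothetical, options is assumed to be a dataset_options.TensorFlowDatasetOptions, and iterating over dataset.take(1) assumes eager execution.

```python
def split_element(element):
    """Splits a dataset element into (features, label); label is None if absent."""
    if isinstance(element, tuple):
        features, label = element
        return features, label
    return element, None


def first_features_and_label(example_tfxio, options):
    """Returns (features, label) for the first element, or None if empty."""
    dataset = example_tfxio.TensorFlowDataset(options)
    for element in dataset.take(1):
        return split_element(element)
    return None
```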

TensorRepresentations

Returns the TensorRepresentations.

These TensorRepresentations describe the tensors or composite tensors produced by the TensorAdapter created from self.TensorAdapter() or the tf.data.Dataset created from self.TensorFlowDataset().

May raise an error if the TFMD schema was not provided at construction time.